Extract Structured Data from Unstructured Text

Structured extraction turns free-form text (emails, invoices, reports, logs) into typed, machine-readable data. LM-Kit.NET uses grammar-constrained sampling to guarantee that the LLM output is valid JSON conforming exactly to your schema, so no post-processing or JSON repair is needed.


Why This Matters

Two high-value enterprise problems that local structured extraction solves:

  1. Invoice and document automation. Accounts payable teams process thousands of invoices with varying layouts. Extraction pulls vendor names, line items, totals, and dates into structured records for ERP ingestion, all without sending sensitive financial data to external APIs.
  2. Compliance-driven PII detection. GDPR, HIPAA, and CCPA require organizations to identify personal data in unstructured content. Local extraction finds names, emails, phone numbers, and addresses on-premises, avoiding the compliance risk of sending PII to cloud services.

Prerequisites

| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 4+ GB |

Recommended model: gemma3:4b or qwen3:4b. Larger models generally extract more accurately on complex schemas.


Step 1: Create the Project

dotnet new console -n ExtractionQuickstart
cd ExtractionQuickstart
dotnet add package LM-Kit.NET

Step 2: Basic Extraction (Flat Schema)

This example extracts product information from a marketing paragraph.

using System.Text;
using LMKit.Model;
using LMKit.Data;
using LMKit.Extraction;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// 1. Load model
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// 2. Define extraction schema
var extractor = new TextExtraction(model);
extractor.Elements = new List<TextExtractionElement>
{
    new("ProductName", ElementType.String, "Name of the product", isRequired: true),
    new("Price", ElementType.Double, "Price in USD"),
    new("ReleaseDate", ElementType.Date, "When the product launches"),
    new("Features", ElementType.StringArray, "List of key features")
};

// 3. Set content and extract
extractor.SetContent(
    "Introducing the AeroBook Pro 2025, our thinnest laptop yet at just 11.9mm. " +
    "Priced at $1,299, it features a 14-inch OLED display, 24-hour battery life, " +
    "and Wi-Fi 7 support. Available starting March 15, 2025."
);

TextExtractionResult result = extractor.Parse();

// 4. Access results
Console.WriteLine("Extracted Data:");
Console.WriteLine($"  Product: {result["ProductName"].Value}");
Console.WriteLine($"  Price:   ${result["Price"].As<double>():F2}");
Console.WriteLine($"  Release: {result["ReleaseDate"].As<DateTime>():yyyy-MM-dd}");
Console.WriteLine($"  Features:");
foreach (string feature in result["Features"].As<string[]>())
    Console.WriteLine($"    - {feature}");

Console.WriteLine($"\nConfidence: {result.Confidence:P0}");
Console.WriteLine($"\nFull JSON:\n{result.Json}");

Expected output:

Extracted Data:
  Product: AeroBook Pro 2025
  Price:   $1299.00
  Release: 2025-03-15
  Features:
    - 14-inch OLED display
    - 24-hour battery life
    - Wi-Fi 7 support

Confidence: 95%

Full JSON:
{"ProductName":"AeroBook Pro 2025","Price":1299.0,"ReleaseDate":"2025-03-15","Features":["14-inch OLED display","24-hour battery life","Wi-Fi 7 support"]}

Step 3: Nested Schema (Invoice Extraction)

Real-world extraction often requires nested objects and arrays. Here is an invoice schema with vendor info and line items:

extractor.Elements = new List<TextExtractionElement>
{
    new("InvoiceNumber", ElementType.String, "Invoice ID", isRequired: true),
    new("InvoiceDate", ElementType.Date, "Date the invoice was issued"),
    new("DueDate", ElementType.Date, "Payment due date"),

    // Nested object: vendor information
    new("Vendor",
        new[]
        {
            new TextExtractionElement("Name", ElementType.String, "Company name"),
            new TextExtractionElement("Email", ElementType.String, "Contact email"),
            new TextExtractionElement("TaxId", ElementType.String, "Tax identification number")
        },
        isArray: false,
        description: "Seller/supplier information"),

    // Array of objects: line items
    new("LineItems",
        new[]
        {
            new TextExtractionElement("Description", ElementType.String, "Item description"),
            new TextExtractionElement("Quantity", ElementType.Integer, "Number of units"),
            new TextExtractionElement("UnitPrice", ElementType.Double, "Price per unit"),
            new TextExtractionElement("Amount", ElementType.Double, "Line total (qty * unit price)")
        },
        isArray: true,
        description: "Itemized charges"),

    new("Subtotal", ElementType.Double, "Sum before tax"),
    new("Tax", ElementType.Double, "Tax amount"),
    new("Total", ElementType.Double, "Total amount due", isRequired: true)
};

Access nested results:

var result = extractor.Parse();

// Nested object access with dot notation (Value can be null when a field was not found)
string? vendorName = result["Vendor.Name"].Value?.ToString();
string? vendorEmail = result["Vendor.Email"].Value?.ToString();

// Enumerate array items via InnerElements
foreach (var item in result.EnumerateAt("LineItems"))
{
    // Each item's InnerElements correspond to the schema fields in order
    var fields = item.InnerElements;
    string desc = fields[0].Value?.ToString();    // Description
    int qty = fields[1].As<int>();                // Quantity
    double price = fields[2].As<double>();        // UnitPrice
    double amount = fields[3].As<double>();       // Amount

    Console.WriteLine($"  {desc}: {qty} x ${price:F2} = ${amount:F2}");
}

// Indexed access using path notation (preferred for direct field access)
double firstItemPrice = result["LineItems[0].UnitPrice"].As<double>();
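
Grammar-constrained output guarantees well-formed JSON, not correct arithmetic, so it can be worth cross-checking the extracted numbers before ERP ingestion. The following is a minimal sketch that assumes the values have already been copied into plain records. The LineItem record and the InvoiceChecks helper are hypothetical names for illustration and are not part of LM-Kit.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical record mirroring the LineItems schema fields.
public sealed record LineItem(string Description, int Quantity, double UnitPrice, double Amount);

// Hypothetical helper: flags arithmetic inconsistencies in extracted invoice values.
public static class InvoiceChecks
{
    // Returns one message per inconsistency; an empty list means the numbers agree.
    public static List<string> Validate(
        IReadOnlyList<LineItem> items, double subtotal, double tax, double total,
        double tolerance = 0.01)
    {
        var problems = new List<string>();

        // Each line total should equal quantity * unit price.
        foreach (LineItem item in items)
        {
            double expected = item.Quantity * item.UnitPrice;
            if (Math.Abs(expected - item.Amount) > tolerance)
                problems.Add(FormattableString.Invariant(
                    $"{item.Description}: expected {expected:F2}, got {item.Amount:F2}"));
        }

        // Line totals should add up to the subtotal.
        double sum = items.Sum(i => i.Amount);
        if (Math.Abs(sum - subtotal) > tolerance)
            problems.Add(FormattableString.Invariant(
                $"Subtotal: line items sum to {sum:F2}, got {subtotal:F2}"));

        // Subtotal plus tax should equal the total.
        if (Math.Abs(subtotal + tax - total) > tolerance)
            problems.Add(FormattableString.Invariant(
                $"Total: expected {subtotal + tax:F2}, got {total:F2}"));

        return problems;
    }
}
```

The tolerance absorbs floating-point rounding. A non-empty result is a signal to route the document to manual review rather than proof that the extraction is wrong, since the source invoice itself may contain the discrepancy.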

Step 4: Constraining Values with Format

Use TextExtractionElementFormat to restrict what the model can output:

// Enum constraint: only allow specific values
var priority = new TextExtractionElement("Priority", ElementType.String, "Task priority level");
priority.Format.AllowedValues = new List<string> { "Low", "Medium", "High", "Critical" };

// Case normalization
var countryCode = new TextExtractionElement("CountryCode", ElementType.String, "ISO country code");
countryCode.Format.CaseMode = TextExtractionElementFormat.TextCaseMode.UpperCase;
countryCode.Format.MaxLength = 2;

// Strip common prefixes
var invoiceNo = new TextExtractionElement("InvoiceNumber", ElementType.String, "Invoice ID");
invoiceNo.Format.TrimStartValues = new List<string> { "INV-", "INV/", "INV#", "#" };

// Format hint for validation
var email = new TextExtractionElement("Email", ElementType.String, "Contact email");
email.Format.FormatHint = TextExtractionElementFormat.PredefinedStringFormat.Email;

These constraints are enforced at the grammar level during sampling: the model physically cannot produce a value outside the allowed set.
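
To make "enforced at the grammar level" concrete, here is a toy sketch of constrained decoding over a closed set of values. It is an illustration only and not LM-Kit.NET's implementation: the ConstrainedChoice class, its Decode method, and the character-level scoring function are all hypothetical. At every step the decoder considers only characters that keep the output on a path to an allowed value, so a disallowed string cannot be produced however strongly the scorer prefers it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy illustration of constrained decoding; hypothetical, not the LM-Kit.NET internals.
public static class ConstrainedChoice
{
    // Sentinel meaning "stop here"; assumes allowed values never contain '\0'.
    private const char End = '\0';

    // score(prefix, next) stands in for the model's preference for emitting
    // `next` after `prefix`. Higher is better.
    public static string Decode(IReadOnlyList<string> allowed, Func<string, char, double> score)
    {
        if (allowed.Count == 0)
            throw new ArgumentException("At least one allowed value is required.", nameof(allowed));

        string prefix = "";
        while (true)
        {
            // Allowed values that are still reachable from the current prefix.
            var live = allowed
                .Where(v => v.StartsWith(prefix, StringComparison.Ordinal))
                .ToList();

            // The only legal next steps: the next character of a live value,
            // or End when the prefix already equals a live value.
            var candidates = live
                .Select(v => v.Length == prefix.Length ? End : v[prefix.Length])
                .Distinct()
                .ToList();

            // Pick the best-scoring legal step; illegal characters are never scored.
            char best = candidates.OrderByDescending(c => score(prefix, c)).First();
            if (best == End)
                return prefix;

            prefix += best;
        }
    }
}
```

For example, with the allowed values Low, Medium, High, and Critical and a scorer that favors the character 'H', Decode returns "High". A scorer that favors a character appearing in none of the values still yields one of the four, because that character is never offered as a candidate.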


Step 5: Extracting from Files (PDFs and Images)

TextExtraction accepts attachments for direct file processing:

// From a PDF file
extractor.SetContent(new Attachment("invoice.pdf"));

// From a specific page range
extractor.SetContent(new Attachment("report.pdf"), "3-5");

// From an image (scanned document)
extractor.SetContent(new Attachment("receipt.jpg"));

var result = extractor.Parse();

For image-based extraction, configure the inference modality:

using LMKit.Inference;

// Use vision capabilities for scanned documents
extractor.PreferredInferenceModality = InferenceModality.Vision;

// Or use Tesseract OCR for text extraction from images
extractor.OcrEngine = new TesseractOcr();

Step 6: JSON Schema Input (Alternative to Programmatic Definition)

If you already have a JSON Schema, load it directly:

string jsonSchema = """
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "ContactInfo",
    "type": "object",
    "properties": {
        "full_name":    { "type": "string" },
        "email":        { "type": "string", "format": "email" },
        "phone":        { "type": "string" },
        "company":      { "type": "string" },
        "role":         { "type": "string", "enum": ["Engineer", "Manager", "Director", "VP", "C-Suite"] }
    },
    "required": ["full_name", "email"]
}
""";

extractor.SetElementsFromJsonSchema(jsonSchema);
extractor.SetContent("John Smith, CTO at Acme Corp. Reach him at john@acme.com or 555-0123.");

var result = extractor.Parse();
Console.WriteLine(result.Json);
// {"full_name":"John Smith","email":"john@acme.com","phone":"555-0123","company":"Acme Corp","role":"C-Suite"}
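
Because the output conforms to the schema, it can be mapped onto a typed object with the standard System.Text.Json serializer. The sketch below assumes a hand-written record that mirrors the ContactInfo schema; the ContactInfo record and ContactInfoParser helper are hypothetical names and are not generated by LM-Kit.NET.

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical record mirroring the ContactInfo schema.
// Fields outside the schema's "required" list are nullable.
public sealed record ContactInfo(
    [property: JsonPropertyName("full_name")] string FullName,
    [property: JsonPropertyName("email")] string Email,
    [property: JsonPropertyName("phone")] string? Phone,
    [property: JsonPropertyName("company")] string? Company,
    [property: JsonPropertyName("role")] string? Role);

public static class ContactInfoParser
{
    // Deserializes extraction JSON (for example, the value of result.Json).
    public static ContactInfo Parse(string json) =>
        JsonSerializer.Deserialize<ContactInfo>(json)
        ?? throw new JsonException("Extraction JSON deserialized to null.");
}
```

Calling ContactInfoParser.Parse(result.Json) then gives compile-time access to contact.FullName, contact.Role, and so on, instead of string-keyed lookups. Optional properties that the model omitted come back as null.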

Schema Auto-Discovery

If you are unsure what fields to extract, let the model suggest a schema:

extractor.SetContent("...");
string suggestedSchema = extractor.SchemaDiscovery();
Console.WriteLine(suggestedSchema);  // prints a JSON Schema

// Review, edit if needed, then apply
extractor.SetElementsFromJsonSchema(suggestedSchema);
var result = extractor.Parse();

This is useful for exploring unfamiliar document types before locking down a production schema.


ElementType Reference

| Type | .NET Type | When to Use |
|---|---|---|
| String | string | Names, descriptions, IDs, free-form text |
| Integer | int | Counts, quantities, whole numbers |
| Double | double | Prices, percentages, measurements |
| Bool | bool | Yes/no, true/false flags |
| Date | DateTime | Dates and timestamps |
| StringArray | string[] | Lists of tags, features, names |
| IntegerArray | int[] | Lists of numeric values |
| DoubleArray | double[] | Lists of prices, scores |
| DateArray | DateTime[] | Lists of dates |

For nested objects and arrays of objects, use the constructor overload with innerElements:

// Single nested object
new TextExtractionElement("Address", innerElements, isArray: false)

// Array of nested objects
new TextExtractionElement("LineItems", innerElements, isArray: true)

Tips for High Accuracy

  1. Write clear Description strings. The description is the main signal the model uses to decide what to extract. Be specific: "Invoice total amount in USD including tax" is better than "Total".

  2. Use isRequired: true for critical fields. Required fields always appear in the output (possibly with null values), while optional fields may be omitted.

  3. Use AllowedValues for known categories. If a field has a finite set of valid values, enumerate them. Grammar sampling will enforce the constraint exactly.

  4. Use the Guidance property for domain context. This adds a semantic hint to guide the model:

    extractor.Guidance = "This is a European invoice. Dates are in DD/MM/YYYY format. " +
                         "Amounts use comma as decimal separator.";
    
  5. Larger models extract better. For complex nested schemas or ambiguous text, use gemma3:12b or qwen3:8b instead of 4B models.
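
Since required fields can come back with null values (tip 2), a downstream consumer may want an explicit check before trusting a record. The following is a minimal sketch using the standard System.Text.Json document API. The RequiredFieldCheck helper is a hypothetical name, and it assumes the JSON root is an object, as it is for the schemas in this guide.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper: lists required fields that are absent or null in extraction JSON.
public static class RequiredFieldCheck
{
    public static List<string> FindMissing(string json, IEnumerable<string> requiredFields)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        var missing = new List<string>();

        foreach (string name in requiredFields)
        {
            // A field counts as missing when the property is absent or explicitly null.
            if (!doc.RootElement.TryGetProperty(name, out JsonElement value)
                || value.ValueKind == JsonValueKind.Null)
            {
                missing.Add(name);
            }
        }

        return missing;
    }
}
```

Passing result.Json together with the names of the fields marked isRequired returns the fields that still need attention, which can then drive a retry with a clearer Description or a manual review step.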


Common Issues

| Problem | Cause | Fix |
|---|---|---|
| Fields returned as null | Model uncertain about the value | Set extractor.NullOnDoubt = false for aggressive extraction, or improve the Description |
| Wrong date format | Ambiguous date in source text | Use the Guidance property to specify expected format |
| Array returns single item | Model didn't detect multiple entries | Ensure each item is clearly delineated in the source text, or increase MaximumContextLength |
| Slow extraction | Schema too complex for small model | Simplify the schema or use a larger model |
| JSON parse error | Should never happen (grammar-constrained) | Report as bug; grammar sampling guarantees valid JSON |

Next Steps