Extract Structured Data from Unstructured Text
Structured extraction turns free-form text (emails, invoices, reports, logs) into typed, machine-readable data. LM-Kit.NET uses grammar-constrained sampling to guarantee that the LLM output is valid JSON conforming exactly to your schema, so no post-processing or JSON repair is needed.
Why This Matters
Two high-value enterprise problems that local structured extraction solves:
- Invoice and document automation. Accounts payable teams process thousands of invoices with varying layouts. Extraction pulls vendor names, line items, totals, and dates into structured records for ERP ingestion, all without sending sensitive financial data to external APIs.
- Compliance-driven PII detection. GDPR, HIPAA, and CCPA require organizations to identify personal data in unstructured content. Local extraction finds names, emails, phone numbers, and addresses on-premises, avoiding the compliance risk of sending PII to cloud services.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 4+ GB |
Recommended model: gemma3:4b or qwen3:4b. Larger models produce higher extraction accuracy on complex schemas.
Step 1: Create the Project
dotnet new console -n ExtractionQuickstart
cd ExtractionQuickstart
dotnet add package LM-Kit.NET
Step 2: Basic Extraction (Flat Schema)
This example extracts product information from a marketing paragraph.
using System.Text;
using LMKit.Model;
using LMKit.Data;
using LMKit.Extraction;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// 1. Load model
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// 2. Define extraction schema
var extractor = new TextExtraction(model);
extractor.Elements = new List<TextExtractionElement>
{
    new("ProductName", ElementType.String, "Name of the product", isRequired: true),
    new("Price", ElementType.Double, "Price in USD"),
    new("ReleaseDate", ElementType.Date, "When the product launches"),
    new("Features", ElementType.StringArray, "List of key features")
};
// 3. Set content and extract
extractor.SetContent(
    "Introducing the AeroBook Pro 2025, our thinnest laptop yet at just 11.9mm. " +
    "Priced at $1,299, it features a 14-inch OLED display, 24-hour battery life, " +
    "and Wi-Fi 7 support. Available starting March 15, 2025."
);
TextExtractionResult result = extractor.Parse();
// 4. Access results
Console.WriteLine("Extracted Data:");
Console.WriteLine($" Product: {result["ProductName"].Value}");
Console.WriteLine($" Price: ${result["Price"].As<double>():F2}");
Console.WriteLine($" Release: {result["ReleaseDate"].As<DateTime>():yyyy-MM-dd}");
Console.WriteLine($" Features:");
foreach (string feature in result["Features"].As<string[]>())
    Console.WriteLine($" - {feature}");
Console.WriteLine($"\nConfidence: {result.Confidence:P0}");
Console.WriteLine($"\nFull JSON:\n{result.Json}");
Expected output:
Extracted Data:
Product: AeroBook Pro 2025
Price: $1299.00
Release: 2025-03-15
Features:
- 14-inch OLED display
- 24-hour battery life
- Wi-Fi 7 support
Confidence: 95%
Full JSON:
{"ProductName":"AeroBook Pro 2025","Price":1299.0,"ReleaseDate":"2025-03-15","Features":["14-inch OLED display","24-hour battery life","Wi-Fi 7 support"]}
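Because the result is exposed as a plain JSON string, downstream code can read it with any JSON library. The following is a minimal sketch using System.Text.Json from the .NET base class library; the `json` literal is copied from the expected output above as a stand-in for `result.Json`, so the snippet runs without a model. Treating `Price` as possibly absent is an assumption based on it being an optional field in the schema.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text.Json;

// Stand-in for result.Json, copied from the expected output above.
string json = """{"ProductName":"AeroBook Pro 2025","Price":1299.0,"ReleaseDate":"2025-03-15","Features":["14-inch OLED display","24-hour battery life","Wi-Fi 7 support"]}""";

using JsonDocument doc = JsonDocument.Parse(json);
JsonElement root = doc.RootElement;

// Required field: read it directly.
string productName = root.GetProperty("ProductName").GetString() ?? "";

// Optional field: tolerate a missing or null value.
double? price =
    root.TryGetProperty("Price", out JsonElement priceElement) &&
    priceElement.ValueKind == JsonValueKind.Number
        ? priceElement.GetDouble()
        : null;

// The date appears as yyyy-MM-dd in the JSON shown above.
DateTime releaseDate = DateTime.ParseExact(
    root.GetProperty("ReleaseDate").GetString() ?? "",
    "yyyy-MM-dd",
    CultureInfo.InvariantCulture);

string[] features = root.GetProperty("Features")
    .EnumerateArray()
    .Select(f => f.GetString() ?? "")
    .ToArray();

Console.WriteLine($"{productName} has {features.Length} listed features.");
```

For day-to-day access the `result[...]` indexer shown in the step above is shorter; reading the JSON directly is mainly useful when the record is handed to another system.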
Step 3: Nested Schema (Invoice Extraction)
Real-world extraction often requires nested objects and arrays. Here is an invoice schema with vendor info and line items:
extractor.Elements = new List<TextExtractionElement>
{
    new("InvoiceNumber", ElementType.String, "Invoice ID", isRequired: true),
    new("InvoiceDate", ElementType.Date, "Date the invoice was issued"),
    new("DueDate", ElementType.Date, "Payment due date"),

    // Nested object: vendor information
    new("Vendor",
        new[]
        {
            new TextExtractionElement("Name", ElementType.String, "Company name"),
            new TextExtractionElement("Email", ElementType.String, "Contact email"),
            new TextExtractionElement("TaxId", ElementType.String, "Tax identification number")
        },
        isArray: false,
        description: "Seller/supplier information"),

    // Array of objects: line items
    new("LineItems",
        new[]
        {
            new TextExtractionElement("Description", ElementType.String, "Item description"),
            new TextExtractionElement("Quantity", ElementType.Integer, "Number of units"),
            new TextExtractionElement("UnitPrice", ElementType.Double, "Price per unit"),
            new TextExtractionElement("Amount", ElementType.Double, "Line total (qty * unit price)")
        },
        isArray: true,
        description: "Itemized charges"),

    new("Subtotal", ElementType.Double, "Sum before tax"),
    new("Tax", ElementType.Double, "Tax amount"),
    new("Total", ElementType.Double, "Total amount due", isRequired: true)
};
Access nested results:
var result = extractor.Parse();
// Nested object access with dot notation
// Value can be null when the model finds nothing, so use ?. before ToString()
string? vendorName = result["Vendor.Name"].Value?.ToString();
string? vendorEmail = result["Vendor.Email"].Value?.ToString();
// Enumerate array items via InnerElements
foreach (var item in result.EnumerateAt("LineItems"))
{
    // Each item's InnerElements correspond to the schema fields in order
    var fields = item.InnerElements;
    string desc = fields[0].Value?.ToString(); // Description
    int qty = fields[1].As<int>(); // Quantity
    double price = fields[2].As<double>(); // UnitPrice
    double amount = fields[3].As<double>(); // Amount
    Console.WriteLine($" {desc}: {qty} x ${price:F2} = ${amount:F2}");
}
// Indexed access using path notation (preferred for direct field access)
double firstItemPrice = result["LineItems[0].UnitPrice"].As<double>();
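Grammar-constrained sampling guarantees the shape of the output, not its arithmetic. Before pushing an invoice into an ERP system it can be worth checking that the numbers agree with each other. The sketch below uses plain C# with hypothetical sample values standing in for the extracted fields; the tuple layout, the `Close` helper, and the half-cent tolerance are illustrative choices, not part of LM-Kit.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical values standing in for what Parse() returned for one invoice.
var lineItems = new List<(string Description, int Quantity, double UnitPrice, double Amount)>
{
    ("Widget", 4, 12.50, 50.00),
    ("Gasket", 10, 1.99, 19.90),
};
double subtotal = 69.90;
double tax = 5.59;
double total = 75.49;

// Amounts are doubles in the schema, so compare within half a cent.
bool Close(double a, double b) => Math.Abs(a - b) < 0.005;

var problems = new List<string>();

foreach (var item in lineItems)
{
    // Each line total should equal quantity times unit price.
    if (!Close(item.Quantity * item.UnitPrice, item.Amount))
        problems.Add($"Line '{item.Description}': quantity x unit price does not match amount");
}

// The subtotal should equal the sum of the line totals.
if (!Close(lineItems.Sum(i => i.Amount), subtotal))
    problems.Add("Subtotal does not match the sum of line amounts");

// The total should equal subtotal plus tax.
if (!Close(subtotal + tax, total))
    problems.Add("Subtotal plus tax does not match total");

Console.WriteLine(problems.Count == 0
    ? "Invoice totals are consistent."
    : string.Join(Environment.NewLine, problems));
```

An invoice that fails these checks is a reasonable candidate for human review or for re-extraction with a larger model.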
Step 4: Constraining Values with Format
Use TextExtractionElementFormat to restrict what the model can output:
// Enum constraint: only allow specific values
var priority = new TextExtractionElement("Priority", ElementType.String, "Task priority level");
priority.Format.AllowedValues = new List<string> { "Low", "Medium", "High", "Critical" };
// Case normalization
var countryCode = new TextExtractionElement("CountryCode", ElementType.String, "ISO country code");
countryCode.Format.CaseMode = TextExtractionElementFormat.TextCaseMode.UpperCase;
countryCode.Format.MaxLength = 2;
// Strip common prefixes
var invoiceNo = new TextExtractionElement("InvoiceNumber", ElementType.String, "Invoice ID");
invoiceNo.Format.TrimStartValues = new List<string> { "INV-", "INV/", "INV#", "#" };
// Format hint for validation
var email = new TextExtractionElement("Email", ElementType.String, "Contact email");
email.Format.FormatHint = TextExtractionElementFormat.PredefinedStringFormat.Email;
These constraints are enforced at the grammar level during sampling: the model physically cannot produce a value outside the allowed set.
Step 5: Extracting from Files (PDFs and Images)
TextExtraction accepts attachments for direct file processing:
// From a PDF file
extractor.SetContent(new Attachment("invoice.pdf"));
// From a specific page range
extractor.SetContent(new Attachment("report.pdf"), "3-5");
// From an image (scanned document)
extractor.SetContent(new Attachment("receipt.jpg"));
var result = extractor.Parse();
For image-based extraction, configure the inference modality:
using LMKit.Inference;
// Use vision capabilities for scanned documents
extractor.PreferredInferenceModality = InferenceModality.Vision;
// Or use Tesseract OCR for text extraction from images
extractor.OcrEngine = new TesseractOcr();
Step 6: JSON Schema Input (Alternative to Programmatic Definition)
If you already have a JSON Schema, load it directly:
string jsonSchema = """
{
    "$schema": "http://json-schema.org/draft-07/schema#",
    "title": "ContactInfo",
    "type": "object",
    "properties": {
        "full_name": { "type": "string" },
        "email": { "type": "string", "format": "email" },
        "phone": { "type": "string" },
        "company": { "type": "string" },
        "role": { "type": "string", "enum": ["Engineer", "Manager", "Director", "VP", "C-Suite"] }
    },
    "required": ["full_name", "email"]
}
""";
extractor.SetElementsFromJsonSchema(jsonSchema);
extractor.SetContent("John Smith, CTO at Acme Corp. Reach him at john@acme.com or 555-0123.");
var result = extractor.Parse();
Console.WriteLine(result.Json);
// {"full_name":"John Smith","email":"john@acme.com","phone":"555-0123","company":"Acme Corp","role":"C-Suite"}
Schema Auto-Discovery
If you are unsure what fields to extract, let the model suggest a schema:
extractor.SetContent("...");
string suggestedSchema = extractor.SchemaDiscovery();
Console.WriteLine(suggestedSchema); // prints a JSON Schema
// Review, edit if needed, then apply
extractor.SetElementsFromJsonSchema(suggestedSchema);
var result = extractor.Parse();
This is useful for exploring unfamiliar document types before locking down a production schema.
ElementType Reference
| Type | .NET Type | When to Use |
|---|---|---|
| String | string | Names, descriptions, IDs, free-form text |
| Integer | int | Counts, quantities, whole numbers |
| Double | double | Prices, percentages, measurements |
| Bool | bool | Yes/no, true/false flags |
| Date | DateTime | Dates and timestamps |
| StringArray | string[] | Lists of tags, features, names |
| IntegerArray | int[] | Lists of numeric values |
| DoubleArray | double[] | Lists of prices, scores |
| DateArray | DateTime[] | Lists of dates |
For nested objects and arrays of objects, use the constructor overload with innerElements:
// Single nested object
new TextExtractionElement("Address", innerElements, isArray: false)
// Array of nested objects
new TextExtractionElement("LineItems", innerElements, isArray: true)
Tips for High Accuracy
- Write clear Description strings. The description is the main signal the model uses to decide what to extract. Be specific: "Invoice total amount in USD including tax" is better than "Total".
- Use isRequired: true for critical fields. Required fields always appear in the output (possibly with null values), while optional fields may be omitted.
- Use AllowedValues for known categories. If a field has a finite set of valid values, enumerate them. Grammar sampling will enforce the constraint exactly.
- Use the Guidance property for domain context. This adds a semantic hint to guide the model:
  extractor.Guidance = "This is a European invoice. Dates are in DD/MM/YYYY format. " +
      "Amounts use comma as decimal separator.";
- Larger models extract better. For complex nested schemas or ambiguous text, use gemma3:12b or qwen3:8b instead of 4B models.
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| Fields returned as null | Model uncertain about the value | Set extractor.NullOnDoubt = false for aggressive extraction, or improve the Description |
| Wrong date format | Ambiguous date in source text | Use the Guidance property to specify expected format |
| Array returns single item | Model didn't detect multiple entries | Ensure each item is clearly delineated in the source text, or increase MaximumContextLength |
| Slow extraction | Schema too complex for small model | Simplify the schema or use a larger model |
| JSON parse error | Should never happen (grammar-constrained) | Report as bug; grammar sampling guarantees valid JSON |
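
The "Wrong date format" row comes down to source text that is valid under more than one convention. The sketch below uses only standard .NET parsing (no LM-Kit.NET calls) to show how the same characters yield different values depending on the assumed format, which is why stating the convention in Guidance helps. The sample strings and the hand-built NumberFormatInfo are illustrative assumptions.

```csharp
using System;
using System.Globalization;

// The same text is a valid date under both conventions.
string rawDate = "03/04/2025";

DateTime monthFirst = DateTime.ParseExact(rawDate, "MM/dd/yyyy", CultureInfo.InvariantCulture);
DateTime dayFirst = DateTime.ParseExact(rawDate, "dd/MM/yyyy", CultureInfo.InvariantCulture);

Console.WriteLine(monthFirst.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture)); // 2025-03-04
Console.WriteLine(dayFirst.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture));   // 2025-04-03

// Amounts have the same problem: here "." groups thousands and "," marks decimals.
var europeanNumbers = new NumberFormatInfo
{
    NumberDecimalSeparator = ",",
    NumberGroupSeparator = "."
};
decimal amount = decimal.Parse("1.299,50", NumberStyles.Number, europeanNumbers);

Console.WriteLine(amount.ToString(CultureInfo.InvariantCulture));
```

Neither reading of the date is wrong on its own; only context such as the document's country of origin resolves it.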
Next Steps
- Load a Model and Generate Your First Response: if you haven't set up model loading yet.
- Build a RAG Pipeline Over Your Own Documents: combine extraction with RAG for document Q&A.
- Samples: Invoice Data Extraction: full invoice extraction demo.
- Samples: Structured Data Extraction: general extraction demo.
- Samples: PII Extraction: privacy-focused extraction.