📋 Understanding Structured Data Extraction in LM-Kit.NET


📄 TL;DR

Structured Data Extraction transforms unstructured content (text, documents, images) into organized, machine-readable data formats like JSON. In LM-Kit.NET, the TextExtraction class combines language model intelligence with symbolic AI layers (including grammar constraints, fuzzy logic, taxonomy matching, and rule-based validation) through the proprietary Dynamic Sampling framework. This neuro-symbolic approach ensures outputs always conform to your schema while reducing extraction errors by up to 75% compared to pure LLM approaches. The result: reliable automation of data entry, document processing, and information retrieval tasks with deterministic precision.


📚 What is Structured Data Extraction?

Definition: Structured Data Extraction is the process of identifying and extracting specific pieces of information from unstructured or semi-structured content and organizing them into a predefined schema. Unlike free-form text generation, extraction produces deterministic, validated outputs that can be directly consumed by databases, APIs, or downstream applications.

The Extraction Pipeline

+-----------------+     +------------------+     +-----------------+
|  Unstructured   |     |   LM-Kit.NET     |     |   Structured    |
|    Content      | --> |  TextExtraction  | --> |   JSON Output   |
|                 |     |                  |     |                 |
| • Documents     |     | • Schema-aware   |     | • Type-safe     |
| • Images        |     | • Grammar-guided |     | • Validated     |
| • Text          |     | • Multi-modal    |     | • Ready to use  |
+-----------------+     +------------------+     +-----------------+

Key Differentiators from Text Generation

Aspect        | Text Generation        | Structured Extraction
--------------+------------------------+-------------------------------
Output Format | Free-form text         | Schema-conformant JSON
Validation    | None built-in          | Type checking, required fields
Determinism   | Variable outputs       | Consistent structure
Use Case      | Creative writing, chat | Data entry, automation

🔍 The Role of Structured Extraction in AI Applications

  1. Automating Data Entry

    • Extract invoice details (vendor, amounts, dates, line items)
    • Parse resumes for candidate information
    • Convert business cards to contact records
  2. Document Understanding

    • Extract key clauses from contracts
    • Parse scientific papers for metadata
    • Process forms and applications
  3. Information Retrieval

    • Extract product specifications from descriptions
    • Parse event details from announcements
    • Identify entities and relationships in reports
  4. Data Migration & Integration

    • Convert legacy documents to structured formats
    • Normalize data from heterogeneous sources
    • Feed extracted data to APIs and databases

⚙️ How LM-Kit.NET Implements Structured Extraction

LM-Kit.NET's TextExtraction class combines language model intelligence with symbolic AI layers through the proprietary Dynamic Sampling framework, a neuro-symbolic approach that ensures outputs always match your schema while dramatically reducing errors.

Neuro-Symbolic Architecture

+--------------------------------------------------------------------------+
|                  TextExtraction Engine (Dynamic Sampling)                |
+--------------------------------------------------------------------------+
|                                                                          |
|  +---------------------------------------------------------------------+ |
|  |                      NEURAL LAYER (LLM)                             | |
|  |  Content Understanding • Semantic Interpretation • Context Parsing  | |
|  +---------------------------------------------------------------------+ |
|                                    |                                     |
|                                    v                                     |
|  +---------------------------------------------------------------------+ |
|  |                    SYMBOLIC AI LAYER                                | |
|  |  +-------------+ +-------------+ +-------------+ +-------------+    | |
|  |  |   Grammar   | |   Fuzzy     | |  Taxonomy   | | Rule-Based  |    | |
|  |  | Constraints | |   Logic     | |  Matching   | | Validation  |    | |
|  |  |   (GBNF)    | |(Perplexity) | |(Ontologies) | |(Expert Sys) |    | |
|  |  +-------------+ +-------------+ +-------------+ +-------------+    | |
|  +---------------------------------------------------------------------+ |
|                                    |                                     |
|                                    v                                     |
|  +---------------------------------------------------------------------+ |
|  |              VALIDATED OUTPUT (Schema-Compliant JSON)               | |
|  +---------------------------------------------------------------------+ |
|                                                                          |
+--------------------------------------------------------------------------+

The Dynamic Sampling Advantage

LM-Kit's Dynamic Sampling integrates multiple symbolic AI techniques that activate dynamically based on content type and extraction context:

Symbolic Component         | Role in Extraction
---------------------------+-----------------------------------------------------
Grammar Constraints (GBNF) | Enforces valid JSON structure at generation time
Fuzzy Logic (Fuzzifiers)   | Assesses token confidence via contextual perplexity
Taxonomy Matching          | Validates values against known categorizations
Ontology Validation        | Ensures semantic consistency across fields
Rule-Based Expert Systems  | Applies domain-specific extraction rules
Auxiliary Content Lookup   | Extends context beyond the attention window

Performance Impact:

  • Up to 75% fewer errors compared to pure LLM approaches
  • Up to 2× faster processing through speculative grammar validation
  • 100% schema compliance via symbolic enforcement
  • Zero hallucinations in structured fields

Supported Element Types

LM-Kit.NET supports rich data types for extraction:

Type         | Description       | Example
-------------+-------------------+------------------------
String       | Text values       | Names, descriptions
Integer      | Whole numbers     | Quantities, IDs
Double       | Decimal numbers   | Prices, percentages
Bool         | True/false        | Flags, checkboxes
Date         | Calendar dates    | Due dates, birth dates
DateTime     | Date with time    | Timestamps
StringArray  | List of strings   | Tags, categories
IntegerArray | List of integers  | Line item quantities
DoubleArray  | List of decimals  | Price lists
Object       | Nested structure  | Address components
ObjectArray  | List of objects   | Line items, entries
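
For example, given a TextExtraction instance like the one created in the implementation section below, a date field and a tag list can be declared side by side. A minimal sketch; ElementType.Date appears in the examples later on this page, while ElementType.StringArray is inferred from the table naming and should be verified against your LM-Kit.NET version:

extractor.Elements.Add(new TextExtractionElement("issue_date", ElementType.Date)
{
    Description = "The date the document was issued"
});

// Assumed enum member name, mirroring the StringArray row in the table above
extractor.Elements.Add(new TextExtractionElement("tags", ElementType.StringArray)
{
    Description = "Keywords or categories mentioned in the document"
});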

Format Constraints

Each element can have formatting rules:

  • Case normalization: Uppercase, Lowercase, TitleCase
  • Length limits: MaxLength, MinLength
  • Date formats: Custom date/time patterns
  • Allowed values: Enum-style constraints
  • Required fields: Mandatory vs optional
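
A minimal sketch of how such constraints might attach to an element. Only IsRequired appears verbatim in the examples on this page; the commented-out property names (CaseNormalization, MaxLength, AllowedValues) are hypothetical placeholders to be checked against the TextExtractionElement API of your LM-Kit.NET version:

var currencyCode = new TextExtractionElement("currency_code", ElementType.String)
{
    Description = "Three-letter ISO 4217 currency code such as USD or EUR",
    IsRequired = true                                  // confirmed property
    // CaseNormalization = Casing.Uppercase,           // hypothetical name
    // MaxLength = 3,                                  // hypothetical name
    // AllowedValues = new[] { "USD", "EUR", "GBP" }   // hypothetical name
};
extractor.Elements.Add(currencyCode);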

🛠️ Practical Implementation in LM-Kit.NET

Basic Extraction Example

using LMKit.Model;
using LMKit.Extraction;

// Load a capable model
var model = LM.LoadFromModelID("gemma3:12b");

// Create extraction instance
var extractor = new TextExtraction(model);

// Define what to extract
extractor.Elements.Add(new TextExtractionElement("company_name", ElementType.String)
{
    Description = "The name of the company or organization"
});
extractor.Elements.Add(new TextExtractionElement("invoice_number", ElementType.String)
{
    Description = "The unique invoice identifier"
});
extractor.Elements.Add(new TextExtractionElement("total_amount", ElementType.Double)
{
    Description = "The total amount due",
    IsRequired = true
});
extractor.Elements.Add(new TextExtractionElement("due_date", ElementType.Date)
{
    Description = "When payment is due"
});

// Set content to extract from
extractor.SetContent("Invoice #INV-2024-0042 from Acme Corp. Total: $1,234.56. Due: March 15, 2024.");

// Extract structured data
var result = extractor.Parse(CancellationToken.None);

Console.WriteLine(result.Json);
// Output:
// {
//   "company_name": "Acme Corp",
//   "invoice_number": "INV-2024-0042",
//   "total_amount": 1234.56,
//   "due_date": "2024-03-15"
// }
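
Because the output is guaranteed to match the schema, it can be deserialized straight into a .NET type with standard tooling. A minimal sketch using System.Text.Json; the Invoice record is illustrative and not part of LM-Kit.NET:

using System.Text.Json;

// Map the schema-conformant JSON onto a plain .NET type
var invoice = JsonSerializer.Deserialize<Invoice>(result.Json)!;
Console.WriteLine($"{invoice.company_name} owes {invoice.total_amount:C}");

// Illustrative record; property names mirror the JSON keys exactly,
// and due_date is kept as a string to avoid date-converter assumptions
public record Invoice(
    string? company_name,
    string? invoice_number,
    double total_amount,
    string? due_date);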

Nested Object Extraction

// Define line items with nested structure
var lineItem = new TextExtractionElement("line_items", ElementType.ObjectArray)
{
    Description = "Individual items on the invoice",
    InnerElements = new List<TextExtractionElement>
    {
        new("description", ElementType.String) { Description = "Item description" },
        new("quantity", ElementType.Integer) { Description = "Number of units" },
        new("unit_price", ElementType.Double) { Description = "Price per unit" },
        new("total", ElementType.Double) { Description = "Line total" }
    }
};

extractor.Elements.Add(lineItem);
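
For an invoice containing two items, the line_items field would then appear in the output JSON along these lines (values illustrative):

// {
//   "line_items": [
//     { "description": "Widget A", "quantity": 3, "unit_price": 9.99, "total": 29.97 },
//     { "description": "Widget B", "quantity": 1, "unit_price": 49.50, "total": 49.50 }
//   ]
// }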

Extraction from Documents

// Extract from PDF invoice
var attachment = new Attachment("invoice.pdf");
extractor.SetContent(attachment);

// Or from image (with optional OCR)
var imageAttachment = new Attachment("scanned_document.png");
extractor.SetContent(imageAttachment);
extractor.PreferredInferenceModality = InferenceModality.Vision;

Schema Discovery

Let LM-Kit.NET automatically suggest an extraction schema:

// Provide sample content
extractor.SetContent(sampleDocument);

// Discover optimal schema
var discoveredElements = await extractor.SchemaDiscoveryAsync(
    "Extract all relevant business information",
    CancellationToken.None
);

// Review and use discovered elements
foreach (var element in discoveredElements)
{
    Console.WriteLine($"Found: {element.Name} ({element.Type})");
    extractor.Elements.Add(element);
}

Using JSON Schema Definition

// Define schema using standard JSON Schema
string jsonSchema = """
{
    "type": "object",
    "properties": {
        "name": { "type": "string" },
        "age": { "type": "integer" },
        "email": { "type": "string", "format": "email" }
    },
    "required": ["name", "email"]
}
""";

extractor.SetElementsFromJsonSchema(jsonSchema);
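
The call above replaces manual element definition. A sketch of the roughly equivalent hand-written schema; how the "format": "email" hint is carried over is an assumption, shown here folded into the element description:

extractor.Elements.Add(new TextExtractionElement("name", ElementType.String)
{
    IsRequired = true
});
extractor.Elements.Add(new TextExtractionElement("age", ElementType.Integer));
extractor.Elements.Add(new TextExtractionElement("email", ElementType.String)
{
    Description = "Email address (format: email)", // format hint folded in (assumption)
    IsRequired = true
});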

🎯 Best Practices for Reliable Extraction

1. Write Clear Element Descriptions

The Description property significantly impacts extraction accuracy:

// ❌ Vague description
new TextExtractionElement("amount", ElementType.Double)
{
    Description = "The amount"
};

// ✅ Specific description
new TextExtractionElement("total_amount", ElementType.Double)
{
    Description = "The final total amount due including tax, in the document's currency"
};

2. Use Guidance for Context

extractor.Guidance = "This is a US tax form. Dates are in MM/DD/YYYY format. " +
                     "Dollar amounts may include commas as thousand separators.";

3. Handle Uncertainty Gracefully

// Return null for uncertain values instead of guessing
extractor.NullOnDoubt = true;

4. Choose the Right Modality

// Text-only content
extractor.PreferredInferenceModality = InferenceModality.Text;

// Scanned documents or images
extractor.PreferredInferenceModality = InferenceModality.Vision;

// Mixed content (let LM-Kit decide)
extractor.PreferredInferenceModality = InferenceModality.Multimodal;

📖 Key Terms

  • Schema: The structure defining what fields to extract and their types
  • Element: A single field to extract (name, type, description, constraints)
  • Dynamic Sampling: LM-Kit's neuro-symbolic inference framework combining LLMs with symbolic AI
  • Grammar-Constrained Generation: Technique ensuring LLM output conforms to a formal grammar (GBNF/JSON schema)
  • Neuro-Symbolic AI: Integration of neural networks (LLMs) with symbolic reasoning (rules, grammars, logic)
  • Speculative Grammar: Fast-path validation that accepts grammar-compliant tokens without full vocabulary analysis
  • Contextual Perplexity: Measure of model uncertainty used to trigger symbolic validation
  • Auxiliary Content: Extended context beyond the attention window for validation lookups
  • Modality: The type of content being processed (text, vision, multimodal)
  • Schema Discovery: Automatic detection of optimal extraction schema from sample content



📝 Summary

Structured Data Extraction in LM-Kit.NET transforms unstructured content into validated, schema-conformant JSON through the TextExtraction class, powered by Dynamic Sampling. This neuro-symbolic approach combines the semantic understanding of language models with symbolic AI layers (grammar constraints, fuzzy logic, taxonomy matching, ontology validation, and rule-based expert systems) to achieve up to 75% fewer errors, up to 2× faster processing, and 100% schema compliance. The result is reliable automation of document processing, data entry, and information retrieval across text, images, PDFs, and Office documents, with zero hallucinations in structured fields, all running locally for maximum privacy and performance.