📋 Understanding Structured Data Extraction in LM-Kit.NET


📄 TL;DR

Structured Data Extraction transforms unstructured content (text, documents, images) into organized, machine-readable data formats like JSON. In LM-Kit.NET, the TextExtraction class combines language model intelligence with symbolic AI layers (including grammar constraints, fuzzy logic, taxonomy matching, and rule-based validation) through the proprietary Dynamic Sampling framework. This neuro-symbolic approach ensures outputs always conform to your schema while reducing extraction errors by up to 75% compared to pure LLM approaches. The result: reliable automation of data entry, document processing, and information retrieval tasks with deterministic precision.


📚 What is Structured Data Extraction?

Definition: Structured Data Extraction is the process of identifying and extracting specific pieces of information from unstructured or semi-structured content and organizing them into a predefined schema. Unlike free-form text generation, extraction produces deterministic, validated outputs that can be directly consumed by databases, APIs, or downstream applications.

The Extraction Pipeline

+-----------------+     +------------------+     +-----------------+
|  Unstructured   |     |   LM-Kit.NET     |     |   Structured    |
|    Content      | --> |  TextExtraction  | --> |   JSON Output   |
|                 |     |                  |     |                 |
| • Documents     |     | • Schema-aware   |     | • Type-safe     |
| • Images        |     | • Grammar-guided |     | • Validated     |
| • Text          |     | • Multi-modal    |     | • Ready to use  |
+-----------------+     +------------------+     +-----------------+

Key Differentiators from Text Generation

Aspect        | Text Generation        | Structured Extraction
--------------+------------------------+-------------------------------
Output Format | Free-form text         | Schema-conformant JSON
Validation    | None built-in          | Type checking, required fields
Determinism   | Variable outputs       | Consistent structure
Use Case      | Creative writing, chat | Data entry, automation

🔍 The Role of Structured Extraction in AI Applications

  1. Automating Data Entry

    • Extract invoice details (vendor, amounts, dates, line items)
    • Parse resumes for candidate information
    • Convert business cards to contact records
  2. Document Understanding

    • Extract key clauses from contracts
    • Parse scientific papers for metadata
    • Process forms and applications
  3. Information Retrieval

    • Extract product specifications from descriptions
    • Parse event details from announcements
    • Identify entities and relationships in reports
  4. Data Migration & Integration

    • Convert legacy documents to structured formats
    • Normalize data from heterogeneous sources
    • Feed extracted data to APIs and databases

⚙️ How LM-Kit.NET Implements Structured Extraction

LM-Kit.NET's TextExtraction class combines language model intelligence with symbolic AI layers through the proprietary Dynamic Sampling framework, a neuro-symbolic approach that ensures outputs always match your schema while dramatically reducing errors.

Neuro-Symbolic Architecture

+--------------------------------------------------------------------------+
|                  TextExtraction Engine (Dynamic Sampling)                |
+--------------------------------------------------------------------------+
|                                                                          |
|  +---------------------------------------------------------------------+ |
|  |                      NEURAL LAYER (LLM)                             | |
|  |  Content Understanding • Semantic Interpretation • Context Parsing  | |
|  +---------------------------------------------------------------------+ |
|                                    |                                     |
|                                    v                                     |
|  +---------------------------------------------------------------------+ |
|  |                    SYMBOLIC AI LAYER                                | |
|  |  +-------------+ +-------------+ +-------------+ +-------------+    | |
|  |  |   Grammar   | |   Fuzzy     | |  Taxonomy   | | Rule-Based  |    | |
|  |  | Constraints | |   Logic     | |  Matching   | | Validation  |    | |
|  |  |   (GBNF)    | |(Perplexity) | |(Ontologies) | |(Expert Sys) |    | |
|  |  +-------------+ +-------------+ +-------------+ +-------------+    | |
|  +---------------------------------------------------------------------+ |
|                                    |                                     |
|                                    v                                     |
|  +---------------------------------------------------------------------+ |
|  |              VALIDATED OUTPUT (Schema-Compliant JSON)               | |
|  +---------------------------------------------------------------------+ |
|                                                                          |
+--------------------------------------------------------------------------+

The Dynamic Sampling Advantage

LM-Kit's Dynamic Sampling integrates multiple symbolic AI techniques that activate dynamically based on content type and extraction context:

Symbolic Component         | Role in Extraction
---------------------------+-----------------------------------------------------
Grammar Constraints (GBNF) | Enforces valid JSON structure at generation time
Fuzzy Logic (Fuzzifiers)   | Assesses token confidence via contextual perplexity
Taxonomy Matching          | Validates values against known categorizations
Ontology Validation        | Ensures semantic consistency across fields
Rule-Based Expert Systems  | Applies domain-specific extraction rules
Auxiliary Content Lookup   | Extends context beyond the attention window

Performance Impact:

  • Up to 75% fewer errors compared to pure LLM approaches
  • Up to 2× faster processing through speculative grammar validation
  • 100% schema compliance via symbolic enforcement
  • Zero hallucinations in structured fields

Supported Element Types

LM-Kit.NET supports rich data types for extraction:

Type         | Description       | Example
-------------+-------------------+------------------------
String       | Text values       | Names, descriptions
Integer      | Whole numbers     | Quantities, IDs
Double       | Decimal numbers   | Prices, percentages
Bool         | True/false        | Flags, checkboxes
Date         | Calendar dates    | Due dates, birth dates
DateTime     | Date with time    | Timestamps
StringArray  | List of strings   | Tags, categories
IntegerArray | List of integers  | Line item quantities
DoubleArray  | List of decimals  | Price lists
Object       | Nested structure  | Address components
ObjectArray  | List of objects   | Line items, entries
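
For example, given a TextExtraction instance like the one created in the implementation section below, a date field and a tag list can be declared side by side. A minimal sketch; ElementType.Date appears in the examples later on this page, while ElementType.StringArray is inferred from the table naming and should be verified against your LM-Kit.NET version:

extractor.Elements.Add(new TextExtractionElement("issue_date", ElementType.Date)
{
    Description = "The date the document was issued"
});

// Assumed enum member name, mirroring the StringArray row in the table above
extractor.Elements.Add(new TextExtractionElement("tags", ElementType.StringArray)
{
    Description = "Keywords or categories mentioned in the document"
});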

Format Constraints

Each element can have formatting rules:

  • Case normalization: Uppercase, Lowercase, TitleCase
  • Length limits: MaxLength, MinLength
  • Date formats: Custom date/time patterns
  • Allowed values: Enum-style constraints
  • Required fields: Mandatory vs optional
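
A minimal sketch of how such constraints might attach to an element. Only IsRequired appears verbatim in the examples on this page; the commented-out property names (CaseNormalization, MaxLength, AllowedValues) are hypothetical placeholders to be checked against the TextExtractionElement API of your LM-Kit.NET version:

var currencyCode = new TextExtractionElement("currency_code", ElementType.String)
{
    Description = "Three-letter ISO 4217 currency code such as USD or EUR",
    IsRequired = true                                  // confirmed property
    // CaseNormalization = Casing.Uppercase,           // hypothetical name
    // MaxLength = 3,                                  // hypothetical name
    // AllowedValues = new[] { "USD", "EUR", "GBP" }   // hypothetical name
};
extractor.Elements.Add(currencyCode);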

🛠️ Practical Implementation in LM-Kit.NET

Basic Extraction Example

using LMKit.Model;
using LMKit.Extraction;

// Load a capable model
var model = LM.LoadFromModelID("gemma3:12b");

// Create extraction instance
var extractor = new TextExtraction(model);

// Define what to extract
extractor.Elements.Add(new TextExtractionElement("company_name", ElementType.String)
{
    Description = "The name of the company or organization"
});
extractor.Elements.Add(new TextExtractionElement("invoice_number", ElementType.String)
{
    Description = "The unique invoice identifier"
});
extractor.Elements.Add(new TextExtractionElement("total_amount", ElementType.Double)
{
    Description = "The total amount due",
    IsRequired = true
});
extractor.Elements.Add(new TextExtractionElement("due_date", ElementType.Date)
{
    Description = "When payment is due"
});

// Set content to extract from
extractor.SetContent("Invoice #INV-2024-0042 from Acme Corp. Total: $1,234.56. Due: March 15, 2024.");

// Extract structured data
var result = extractor.Parse(CancellationToken.None);

Console.WriteLine(result.Json);
// Output:
// {
//   "company_name": "Acme Corp",
//   "invoice_number": "INV-2024-0042",
//   "total_amount": 1234.56,
//   "due_date": "2024-03-15"
// }
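
Because the output is guaranteed to match the schema, it can be deserialized straight into a .NET type with standard tooling. A minimal sketch using System.Text.Json; the Invoice record is illustrative and not part of LM-Kit.NET:

using System.Text.Json;

// Map the schema-conformant JSON onto a plain .NET type
var invoice = JsonSerializer.Deserialize<Invoice>(result.Json)!;
Console.WriteLine($"{invoice.company_name} owes {invoice.total_amount:C}");

// Illustrative record; property names mirror the JSON keys exactly,
// and due_date is kept as a string to avoid date-converter assumptions
public record Invoice(
    string? company_name,
    string? invoice_number,
    double total_amount,
    string? due_date);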

Nested Object Extraction

// Define line items with nested structure
var lineItem = new TextExtractionElement("line_items", ElementType.ObjectArray)
{
    Description = "Individual items on the invoice",
    InnerElements = new List<TextExtractionElement>
    {
        new("description", ElementType.String) { Description = "Item description" },
        new("quantity", ElementType.Integer) { Description = "Number of units" },
        new("unit_price", ElementType.Double) { Description = "Price per unit" },
        new("total", ElementType.Double) { Description = "Line total" }
    }
};

extractor.Elements.Add(lineItem);
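
For an invoice containing two items, the line_items field would then appear in the output JSON along these lines (values illustrative):

// {
//   "line_items": [
//     { "description": "Widget A", "quantity": 3, "unit_price": 9.99, "total": 29.97 },
//     { "description": "Widget B", "quantity": 1, "unit_price": 49.50, "total": 49.50 }
//   ]
// }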

Extraction from Documents

// Extract from PDF invoice
var attachment = new Attachment("invoice.pdf");
extractor.SetContent(attachment);

// Or from image (with optional OCR)
var imageAttachment = new Attachment("scanned_document.png");
extractor.SetContent(imageAttachment);
extractor.PreferredInferenceModality = InferenceModality.Vision;

Schema Discovery

Let LM-Kit.NET automatically suggest an extraction schema:

// Provide sample content
extractor.SetContent(sampleDocument);

// Discover optimal schema
var discoveredElements = await extractor.SchemaDiscoveryAsync(
    "Extract all relevant business information",
    CancellationToken.None
);

// Review and use discovered elements
foreach (var element in discoveredElements)
{
    Console.WriteLine($"Found: {element.Name} ({element.Type})");
    extractor.Elements.Add(element);
}

Using JSON Schema Definition

// Define schema using standard JSON Schema
string jsonSchema = """
{
    "type": "object",
    "properties": {
        "name": { "type": "string" },
        "age": { "type": "integer" },
        "email": { "type": "string", "format": "email" }
    },
    "required": ["name", "email"]
}
""";

extractor.SetElementsFromJsonSchema(jsonSchema);
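
The call above replaces manual element definition. A sketch of the roughly equivalent hand-written schema; how the "format": "email" hint is carried over is an assumption, shown here folded into the element description:

extractor.Elements.Add(new TextExtractionElement("name", ElementType.String)
{
    IsRequired = true
});
extractor.Elements.Add(new TextExtractionElement("age", ElementType.Integer));
extractor.Elements.Add(new TextExtractionElement("email", ElementType.String)
{
    Description = "Email address (format: email)", // format hint folded in (assumption)
    IsRequired = true
});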

🎯 Best Practices for Reliable Extraction

1. Write Clear Element Descriptions

The Description property significantly impacts extraction accuracy:

// ❌ Vague description
new TextExtractionElement("amount", ElementType.Double)
{
    Description = "The amount"
};

// ✅ Specific description
new TextExtractionElement("total_amount", ElementType.Double)
{
    Description = "The final total amount due including tax, in the document's currency"
};

2. Use Guidance for Context

extractor.Guidance = "This is a US tax form. Dates are in MM/DD/YYYY format. " +
                     "Dollar amounts may include commas as thousand separators.";

3. Handle Uncertainty Gracefully

// Return null for uncertain values instead of guessing
extractor.NullOnDoubt = true;

4. Choose the Right Modality

// Text-only content
extractor.PreferredInferenceModality = InferenceModality.Text;

// Scanned documents or images
extractor.PreferredInferenceModality = InferenceModality.Vision;

// Mixed content (let LM-Kit decide)
extractor.PreferredInferenceModality = InferenceModality.Multimodal;

📖 Key Terms

  • Schema: The structure defining what fields to extract and their types
  • Element: A single field to extract (name, type, description, constraints)
  • Dynamic Sampling: LM-Kit's neuro-symbolic inference framework combining LLMs with symbolic AI
  • Grammar-Constrained Generation: Technique ensuring LLM output conforms to a formal grammar (GBNF/JSON schema)
  • Neuro-Symbolic AI: Integration of neural networks (LLMs) with symbolic reasoning (rules, grammars, logic)
  • Speculative Grammar: Fast-path validation that accepts grammar-compliant tokens without full vocabulary analysis
  • Contextual Perplexity: Measure of model uncertainty used to trigger symbolic validation
  • Auxiliary Content: Extended context beyond the attention window for validation lookups
  • Modality: The type of content being processed (text, vision, multimodal)
  • Schema Discovery: Automatic detection of optimal extraction schema from sample content



📝 Summary

Structured Data Extraction in LM-Kit.NET transforms unstructured content into validated, schema-conformant JSON through the TextExtraction class, powered by Dynamic Sampling. This neuro-symbolic approach combines the semantic understanding of language models with symbolic AI layers (grammar constraints, fuzzy logic, taxonomy matching, ontology validation, and rule-based expert systems) to achieve up to 75% fewer errors, up to 2× faster processing, and 100% schema compliance. The result is reliable automation of document processing, data entry, and information retrieval across text, images, PDFs, and Office documents, with zero hallucinations in structured fields, all running locally for maximum privacy and performance.