Understanding Extraction in LM-Kit.NET

TL;DR

Extraction is the process of pulling structured, meaningful information from unstructured sources such as free text, scanned documents, images, and PDFs. In LM-Kit.NET, extraction spans multiple specialized classes: TextExtraction for schema-based structured data extraction, NamedEntityRecognition for entity identification, PiiExtraction for sensitive data detection, and DocumentSplitting for detecting logical document boundaries. All extraction classes leverage LM-Kit's proprietary Dynamic Sampling framework, combining language models with symbolic AI to produce outputs that are schema-compliant, confidence-scored, and hallucination-resistant.

What is Extraction?

Definition: Extraction in AI refers to the automated identification and retrieval of specific pieces of information from content that lacks inherent structure. Rather than generating new text, extraction locates, classifies, and normalizes existing information into a format suitable for downstream processing.

The Extraction Spectrum

+--------------------------------------------------------------------------+
|                    Extraction Task Spectrum                              |
+--------------------------------------------------------------------------+
|                                                                          |
|  Low Structure                                    High Structure         |
|  <----------------------------------------------------------->           |
|                                                                          |
|  +-------------+  +-------------+  +-------------+  +-------------+      |
|  |  Keyword    |  |   Entity    |  |  Field      |  |  Table      |      |
|  |  Extraction |  | Recognition |  |  Extraction |  |  Extraction |      |
|  |             |  |             |  |             |  |             |      |
|  | "important" |  | "John Smith"|  | invoice_num:|  | Row/Column  |      |
|  | "urgent"    |  | "Acme Corp" |  |   "INV-001" |  | alignment   |      |
|  +-------------+  +-------------+  +-------------+  +-------------+      |
|                                                                          |
|  Simpler                                           More Complex          |
|                                                                          |
+--------------------------------------------------------------------------+

Extraction vs Generation

Aspect	Text Generation	Extraction
Goal	Create new content	Retrieve existing information
Output	Open-ended text	Structured, bounded data
Validation	Subjective quality	Objective correctness
Hallucination risk	Inherent	Mitigated through symbolic constraints

Extraction Capabilities in LM-Kit.NET

LM-Kit.NET provides its extraction toolkit across two namespaces, LMKit.Extraction and LMKit.TextAnalysis.

Architecture

+--------------------------------------------------------------------------+
|                  LM-Kit.NET Extraction Architecture                      |
+--------------------------------------------------------------------------+
|                                                                          |
|  +-------------------------------------------------------------------+   |
|  |                       Input Layer                                 |   |
|  |  Text • PDF • Image • Office (Word, Excel, PPT) • HTML            |   |
|  +-------------------------------------------------------------------+   |
|                                  |                                       |
|                                  v                                       |
|  +-------------------------------------------------------------------+   |
|  |                    Extraction Classes                             |   |
|  |                                                                   |   |
|  |  LMKit.Extraction              LMKit.TextAnalysis                 |   |
|  |  +---------------------+       +-------------------------+        |   |
|  |  | TextExtraction      |       | NamedEntityRecognition  |        |   |
|  |  | • Schema-based      |       | • Person, Org, Location |        |   |
|  |  | • JSON output       |       | • Date, Money, Product  |        |   |
|  |  | • Nested objects    |       | • Custom entity types   |        |   |
|  |  +---------------------+       +-------------------------+        |   |
|  |  +---------------------+       +-------------------------+        |   |
|  |  | DocumentSplitting   |       | PiiExtraction           |        |   |
|  |  | • Page boundaries   |       | • SSN, Credit Card      |        |   |
|  |  | • Vision-based      |       | • Email, Phone, IP      |        |   |
|  |  | • Labels per segment|       | • Custom PII types      |        |   |
|  |  +---------------------+       +-------------------------+        |   |
|  |                                                                   |   |
|  +-------------------------------------------------------------------+   |
|                                  |                                       |
|                                  v                                       |
|  +-------------------------------------------------------------------+   |
|  |                    Dynamic Sampling Layer                         |   |
|  |  Grammar Constraints • Perplexity Assessment • Fuzzy Validation   |   |
|  +-------------------------------------------------------------------+   |
|                                  |                                       |
|                                  v                                       |
|  +-------------------------------------------------------------------+   |
|  |                      Output Layer                                 |   |
|  |  JSON • Entity Lists • Confidence Scores • Validation Flags       |   |
|  +-------------------------------------------------------------------+   |
|                                                                          |
+--------------------------------------------------------------------------+

Structured Data Extraction

The TextExtraction class is the primary tool for pulling typed fields from content into a predefined JSON schema.

Basic Usage

using System;
using System.Threading;
using LMKit.Extraction;
using LMKit.Model;

var model = LM.LoadFromModelID("gemma3:12b");
var extractor = new TextExtraction(model);

// Define what to extract
extractor.Elements.Add(new TextExtractionElement("company", ElementType.String)
{
    Description = "Name of the company"
});
extractor.Elements.Add(new TextExtractionElement("revenue", ElementType.Double)
{
    Description = "Annual revenue in USD"
});
extractor.Elements.Add(new TextExtractionElement("founded", ElementType.Date)
{
    Description = "Date the company was founded"
});

// Provide content
extractor.SetContent("Acme Corp was established on March 12, 2005. Last year the company reported $4.2M in revenue.");

// Extract
var result = extractor.Parse(CancellationToken.None);

Console.WriteLine(result.Json);
// Example output (actual values depend on the model):
// {"company": "Acme Corp", "revenue": 4200000.0, "founded": "2005-03-12"}
Console.WriteLine($"Confidence: {result.Confidence:P1}");
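Once the JSON is available, it can be bound to a typed object with the standard System.Text.Json serializer. The sketch below is independent of LM-Kit.NET: the CompanyFacts record is a hypothetical DTO introduced for illustration, and the JSON string is copied from the sample output above rather than produced by a model.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical input: the sample JSON from the example above, hard-coded here.
string json = "{\"company\": \"Acme Corp\", \"revenue\": 4200000.0, \"founded\": \"2005-03-12\"}";

CompanyFacts facts = JsonSerializer.Deserialize<CompanyFacts>(json)
    ?? throw new InvalidOperationException("Extraction JSON was empty.");

Console.WriteLine($"{facts.Company}, founded {facts.Founded:yyyy-MM-dd}");

// Hypothetical DTO mirroring the three elements defined on the extractor.
public sealed record CompanyFacts(
    [property: JsonPropertyName("company")] string Company,
    [property: JsonPropertyName("revenue")] double Revenue,
    [property: JsonPropertyName("founded")] DateTime Founded);
```

Binding to a record keeps downstream code type-safe and makes a schema mismatch surface as a deserialization error rather than as a silent null.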

Nested Object Extraction

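// Requires: using System.Collections.Generic;
// An ObjectArray element describes a repeating group; each inner element is one field of a row.
extractor.Elements.Add(new TextExtractionElement("line_items", ElementType.ObjectArray)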
{
    Description = "Individual items on the invoice",
    InnerElements = new List<TextExtractionElement>
    {
        new("description", ElementType.String),
        new("quantity", ElementType.Integer),
        new("unit_price", ElementType.Double),
        new("amount", ElementType.Double)
    }
});

Extracting from Documents and Images

using System.Threading;
using LMKit.Data;

// From PDF
extractor.SetContent(new Attachment("invoice.pdf"));
var pdfResult = extractor.Parse(CancellationToken.None);

// From image with vision mode
extractor.SetContent(new Attachment("receipt_photo.jpg"));
extractor.PreferredInferenceModality = InferenceModality.Vision;
var imageResult = extractor.Parse(CancellationToken.None);

// From specific pages (here, pages 1 to 3)
extractor.SetContent(new Attachment("long_report.pdf"), "1-3");
var reportResult = extractor.Parse(CancellationToken.None);

Schema Discovery

When you do not know the structure of a document in advance, LM-Kit.NET can infer the schema automatically:

extractor.SetContent(new Attachment("unknown_document.pdf"));

// Let the model discover what fields exist
var discoveredResult = extractor.SchemaDiscovery(CancellationToken.None);
Console.WriteLine(discoveredResult.Json);

Accessing Results

// By name
string company = result.GetValue<string>("company");
double revenue = result.GetValue<double>("revenue");

// By path for nested objects
double firstItemPrice = result.GetValue<double>("line_items[0].unit_price");

// Enumerate arrays
foreach (var item in result.EnumerateAt("line_items"))
{
    Console.WriteLine($"{item["description"].Value}: {item["amount"].Value}");
}

// Check confidence per field
float companyConfidence = result.GetConfidence("company");
bool needsReview = result.HumanVerificationRequired;
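Per-field confidence is what makes a human-in-the-loop workflow practical. As a minimal sketch, assuming the confidences have already been read from the result (for example through result.GetConfidence), the following plain .NET code picks the fields that fall below a review threshold. The ReviewRouting helper, the 0.85 threshold, and the sample values are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-field confidences, standing in for values read via result.GetConfidence(...).
var confidences = new Dictionary<string, float>
{
    ["company"] = 0.98f,
    ["revenue"] = 0.71f,
    ["founded"] = 0.93f
};

IReadOnlyList<string> toReview = ReviewRouting.FieldsBelow(confidences, threshold: 0.85f);
Console.WriteLine(toReview.Count == 0
    ? "All fields accepted automatically."
    : $"Needs review: {string.Join(", ", toReview)}");
// Needs review: revenue

public static class ReviewRouting
{
    // Returns the names of fields whose confidence is under the threshold, least confident first.
    public static IReadOnlyList<string> FieldsBelow(
        IReadOnlyDictionary<string, float> confidences, float threshold) =>
        confidences
            .Where(pair => pair.Value < threshold)
            .OrderBy(pair => pair.Value)
            .Select(pair => pair.Key)
            .ToList();
}
```

The right threshold depends on the model and the cost of an error, so it is usually tuned against a labeled sample rather than fixed up front.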

Named Entity Recognition

The NamedEntityRecognition class identifies and classifies entities within text or documents.

using System;
using System.Threading;
using LMKit.TextAnalysis;
using LMKit.Model;

var model = LM.LoadFromModelID("gemma3:4b");
var ner = new NamedEntityRecognition(model);

var entities = ner.Recognize(
    "John Smith signed the agreement with Acme Corp on January 15, 2024 for $50,000.",
    CancellationToken.None
);

foreach (var entity in entities)
{
    Console.WriteLine($"{entity.EntityDefinition.Type}: {entity.Value} ({entity.Confidence:P1})");
}
// Example output (confidence values are illustrative):
// Person: John Smith (97.2%)
// Organization: Acme Corp (98.5%)
// Date: January 15, 2024 (99.1%)
// Money: $50,000 (96.8%)

Built-In Entity Types

Entity Type	Examples
Person	John Smith, Dr. Martinez
Organization	Acme Corp, United Nations
Location	New York, 123 Main Street
Date	January 15, 2024; next Monday
Money	$50,000; 1,200 EUR
Percent	15%, 0.5 percent
Product	iPhone 15, Model X
Event	CES 2024, World Cup

Custom Entity Types

// Requires: using System.Collections.Generic;
var customDefinitions = new List<EntityDefinition>
{
    // Custom types take a name and a description that guides the model
    new EntityDefinition(NamedEntityType.Custom, "Medical Condition",
        "A disease, disorder, or health condition"),
    new EntityDefinition(NamedEntityType.Custom, "Medication",
        "A drug or pharmaceutical product"),
    // Built-in types can be listed alongside custom ones
    new EntityDefinition(NamedEntityType.Person)
};

var ner = new NamedEntityRecognition(model, customDefinitions);

PII Extraction

The PiiExtraction class detects personally identifiable information for compliance and data protection.

using System;
using System.Threading;
using LMKit.TextAnalysis;
using LMKit.Model;

var model = LM.LoadFromModelID("qwen3.5:4b");
var pii = new PiiExtraction(model);

var entities = pii.Extract(
    "Contact Jane Doe at jane.doe@example.com or (555) 123-4567. Her SSN is 123-45-6789.",
    CancellationToken.None
);

foreach (var entity in entities)
{
    Console.WriteLine($"{entity.EntityDefinition.Type}: {entity.Value}");
}
// Example output:
// Person: Jane Doe
// EmailAddress: jane.doe@example.com
// PhoneNumber: (555) 123-4567
// SSN: 123-45-6789

Built-In PII Types

PII Type	Description
Person	Full names
Organization	Company and institution names
Location	Addresses, cities, countries
EmailAddress	Email addresses
PhoneNumber	Phone and fax numbers
CreditCard	Credit/debit card numbers
SSN	Social Security Numbers
Date	Dates of birth, other dates
IPAddress	IPv4 and IPv6 addresses
URL	Web addresses

Document Splitting

The DocumentSplitting class uses vision models to detect logical document boundaries within multi-page files.

using System;
using System.Threading;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;

var model = LM.LoadFromModelID("qwen2-vl:7b");

var splitter = new DocumentSplitting(model)
{
    Guidance = "The file contains a mix of invoices and contracts."
};

// Detect boundaries
var result = splitter.Split(new Attachment("scanned_batch.pdf"), CancellationToken.None);

Console.WriteLine($"Found {result.DocumentCount} document(s)");
foreach (var segment in result.Segments)
{
    Console.WriteLine($"  Pages {segment.StartPage}-{segment.EndPage}: {segment.Label}");
}

// Optionally split into separate PDF files
var splitResult = splitter.Split(
    new Attachment("scanned_batch.pdf"),
    splitDocument: true,
    outputDirectory: "./output",
    CancellationToken.None
);

foreach (string path in splitResult.Documents)
{
    Console.WriteLine($"Created: {path}");
}

The Dynamic Sampling Advantage

All extraction classes in LM-Kit.NET benefit from the Dynamic Sampling framework, which combines neural language model output with symbolic validation:

+--------------------------------------------------------------------------+
|                  Dynamic Sampling in Extraction                          |
+--------------------------------------------------------------------------+
|                                                                          |
|  LLM generates candidate token                                           |
|           |                                                              |
|           v                                                              |
|  +-------------------+                                                   |
|  | Grammar Check     |  Does the token satisfy the JSON schema?          |
|  +-------------------+                                                   |
|           |                                                              |
|           v                                                              |
|  +-------------------+                                                   |
|  | Perplexity Check  |  Is the model confident in this token?            |
|  +-------------------+                                                   |
|           |                                                              |
|           v                                                              |
|  +-------------------+                                                   |
|  | Auxiliary Lookup  |  Does external knowledge confirm the value?       |
|  +-------------------+                                                   |
|           |                                                              |
|           v                                                              |
|  Accept or explore alternatives                                          |
|                                                                          |
+--------------------------------------------------------------------------+
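The first two checks in the diagram can be illustrated with a toy example. This is not LM-Kit.NET's implementation, whose internals are not described here; it is a simplified sketch of grammar-constrained selection with a confidence flag. The Candidate, Choice, and ToySampler names, the probabilities, and the 0.30 threshold are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy candidates for the value of an integer "quantity" field.
var candidates = new List<Candidate>
{
    new("three", 0.46), // most likely, but not valid for an integer field
    new("3", 0.41),
    new("30", 0.13)
};

Choice choice = ToySampler.Pick(
    candidates,
    isAllowedByGrammar: token => int.TryParse(token, out _),
    minProbability: 0.30);

Console.WriteLine($"{choice.Token} (low confidence: {choice.LowConfidence})");
// 3 (low confidence: False)

public sealed record Candidate(string Token, double Probability);
public sealed record Choice(string Token, bool LowConfidence);

public static class ToySampler
{
    public static Choice Pick(
        IEnumerable<Candidate> candidates,
        Func<string, bool> isAllowedByGrammar,
        double minProbability)
    {
        // Grammar check: discard every candidate the schema does not allow.
        var valid = candidates
            .Where(c => isAllowedByGrammar(c.Token))
            .OrderByDescending(c => c.Probability)
            .ToList();

        if (valid.Count == 0)
            throw new InvalidOperationException("No candidate satisfies the grammar.");

        // Confidence check: flag the pick when even the best valid token is unlikely.
        Candidate best = valid[0];
        return new Choice(best.Token, best.Probability < minProbability);
    }
}
```

The point of the sketch is the ordering: structural validity is enforced first, so the most probable token can lose to a less probable one that fits the schema.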

This neuro-symbolic approach delivers:

100% schema compliance through grammar enforcement
75% fewer errors compared to pure LLM extraction
Per-field confidence scores for human-in-the-loop workflows
Automatic type coercion (e.g., "March 15, 2024" to 2024-03-15)
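To make the type coercion item concrete, here is a small sketch using only the .NET base class library. It shows the kind of normalization meant by coercing "March 15, 2024" to 2024-03-15; it is an illustration, not the library's own coercion logic. The Coercion class and its K/M/B suffix handling are hypothetical.

```csharp
using System;
using System.Globalization;

Console.WriteLine(Coercion.ToIsoDate("March 15, 2024"));               // 2024-03-15
Console.WriteLine(Coercion.ToIsoDate("sometime soon") ?? "(no date)"); // (no date)
Console.WriteLine(Coercion.ToAmount("$4.2M") == 4_200_000m);           // True

public static class Coercion
{
    // Parses a free-form date with the invariant culture and renders it as yyyy-MM-dd.
    public static string? ToIsoDate(string raw) =>
        DateTime.TryParse(raw, CultureInfo.InvariantCulture, DateTimeStyles.None, out var parsed)
            ? parsed.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture)
            : null;

    // Parses amounts such as "$4.2M" or "1,200"; K, M and B suffixes act as multipliers.
    public static decimal? ToAmount(string raw)
    {
        string s = raw.Trim().TrimStart('$');
        decimal multiplier = 1m;
        if (s.Length > 0)
        {
            switch (char.ToUpperInvariant(s[^1]))
            {
                case 'K': multiplier = 1_000m; break;
                case 'M': multiplier = 1_000_000m; break;
                case 'B': multiplier = 1_000_000_000m; break;
            }
            if (multiplier != 1m) s = s[..^1]; // drop the suffix before parsing
        }
        return decimal.TryParse(s, NumberStyles.Number, CultureInfo.InvariantCulture, out var value)
            ? value * multiplier
            : null;
    }
}
```

Returning null for unparseable input, rather than a default value, keeps a failed coercion visible to whatever validation step comes next.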

Extraction Use Cases

1. Invoice Processing

Extract vendor information, line items, totals, and payment terms from invoices in any format. Combine with Classification to route invoices by type.

2. Compliance and Redaction

Detect PII across documents and automatically redact sensitive information to help meet GDPR, HIPAA, or other regulatory requirements.
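The redaction step itself can be simple once detections are available. The sketch below assumes the detected type and value pairs have been copied out of the PII extraction results into a plain record. PiiHit and Redactor are hypothetical names, and the replacement is a basic string substitution rather than a feature of LM-Kit.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical detections, standing in for the entities returned by PII extraction.
var detections = new List<PiiHit>
{
    new("Person", "Jane Doe"),
    new("EmailAddress", "jane.doe@example.com"),
    new("SSN", "123-45-6789")
};

string text = "Contact Jane Doe at jane.doe@example.com. Her SSN is 123-45-6789.";
string redacted = Redactor.Redact(text, detections);
Console.WriteLine(redacted);
// Contact [Person] at [EmailAddress]. Her SSN is [SSN].

public sealed record PiiHit(string Type, string Value);

public static class Redactor
{
    // Replaces each detected value with a [Type] placeholder. Longer values go first
    // so that a value contained in another one is not partially replaced.
    public static string Redact(string text, IEnumerable<PiiHit> hits) =>
        hits.Where(h => !string.IsNullOrEmpty(h.Value))
            .OrderByDescending(h => h.Value.Length)
            .Aggregate(text, (current, hit) =>
                current.Replace(hit.Value, $"[{hit.Type}]", StringComparison.Ordinal));
}
```

Typed placeholders keep the redacted text readable for reviewers. Where the type itself is sensitive, a single generic marker may be the safer choice.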

3. Contract Analysis

Identify parties, dates, obligations, and key clauses from legal agreements using both NER and structured extraction.

4. Resume Parsing

Extract candidate details (name, contact, experience, skills, education) from resumes in PDF, Word, or image format.

5. Mailroom Automation

Combine document splitting with classification and extraction to process batches of scanned mail: detect boundaries, classify each document, then extract relevant fields.
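One way to connect the splitting and extraction steps is to turn each detected segment into a page range and group the ranges by label. The sketch below uses a hypothetical Segment record that mirrors the StartPage, EndPage, and Label properties shown in the splitting example. Whether the resulting "start-end" strings can be passed directly as a page range to SetContent for every case is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical segments, standing in for the splitter's result.Segments.
var segments = new List<Segment>
{
    new(1, 2, "invoice"),
    new(3, 7, "contract"),
    new(8, 8, "invoice")
};

// Group page ranges by label so each document type can go to its own extractor.
var grouped = Mailroom.RangesByLabel(segments).ToList();
foreach (var (label, ranges) in grouped)
{
    Console.WriteLine($"{label}: {string.Join(", ", ranges)}");
}
// invoice: 1-2, 8-8
// contract: 3-7

public sealed record Segment(int StartPage, int EndPage, string Label);

public static class Mailroom
{
    // Formats a segment as a "start-end" page range, for example "3-7".
    public static string ToPageRange(Segment s) => $"{s.StartPage}-{s.EndPage}";

    // Groups segments by label, keeping labels in order of first appearance.
    public static IEnumerable<(string Label, IReadOnlyList<string> Ranges)> RangesByLabel(
        IEnumerable<Segment> segments) =>
        segments
            .GroupBy(s => s.Label)
            .Select(g => (g.Key, (IReadOnlyList<string>)g.Select(ToPageRange).ToList()));
}
```

Each group can then be processed with an extractor configured for that document type, for instance one set of elements for invoices and another for contracts.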

6. Medical Records

Parse patient records, lab results, and clinical notes using custom entity definitions while keeping all processing on-device, which supports HIPAA compliance efforts.

Key Terms

Extraction: Retrieving specific information from unstructured content
Schema: A predefined structure defining what fields to extract and their data types
Named Entity: A real-world object (person, place, organization, date) identified in text
PII (Personally Identifiable Information): Data that can identify an individual
Document Splitting: Detecting logical boundaries between documents within a multi-page file
Confidence Score: A value between 0 and 1 indicating how certain the model is about an extracted value
Human-in-the-Loop (HITL): Routing low-confidence extractions for manual review
Schema Discovery: Automatically inferring the extraction schema from document content
Element Type: The data type of an extraction field (String, Integer, Double, Date, Object, ObjectArray)

Related API Documentation

TextExtraction: Schema-based structured data extraction
TextExtractionElement: Extraction field definition
TextExtractionResult: Extraction output with typed access
DocumentSplitting: Vision-based document boundary detection
DocumentSplittingResult: Splitting result with segments
NamedEntityRecognition: Entity identification and classification
PiiExtraction: Sensitive data detection
Attachment: Universal document input

Related Glossary Topics

Structured Data Extraction: Deep dive into the TextExtraction class and Dynamic Sampling
Named Entity Recognition (NER): Detailed guide to entity identification
Classification: Assigning labels to content before or after extraction
Intelligent Document Processing (IDP): End-to-end document automation pipeline
Dynamic Sampling: The neuro-symbolic framework powering reliable extraction
Grammar Sampling: Grammar constraints ensuring schema-compliant output
Symbolic AI: Rule-based validation in the extraction pipeline
Vision Language Models (VLM): Multimodal models for image-based extraction
RAG (Retrieval-Augmented Generation): Combining extracted data with retrieval pipelines
Embeddings: Vector representations for semantic matching during extraction
LLM: Language models powering extraction intelligence
Inference: Model execution process for extraction tasks
Prompt Engineering: Crafting guidance to improve extraction accuracy

External Resources

LM-Kit Invoice Extraction Demo: Real-world invoice extraction example
LM-Kit Structured Data Extraction Demo: Schema-based extraction example
LM-Kit NER Demo: Named entity recognition example
LM-Kit PII Extraction Demo: PII detection example
LM-Kit Document Splitting Demo: Vision-based document splitting example

Summary

Extraction encompasses the full range of techniques for pulling structured information from unstructured content. In LM-Kit.NET, extraction capabilities span four specialized classes: TextExtraction for schema-based JSON extraction with nested objects, arrays, and typed fields; NamedEntityRecognition for identifying people, organizations, locations, dates, and custom entity types; PiiExtraction for detecting sensitive personal information; and DocumentSplitting for vision-based detection of logical document boundaries within multi-page files. All extraction classes accept multimodal inputs (text, PDFs, images, Office documents), produce results with per-field confidence scores, and leverage the Dynamic Sampling framework to combine LLM intelligence with symbolic validation for schema-compliant, hallucination-resistant outputs. Whether automating invoice processing, enforcing compliance through PII detection, or parsing contracts, LM-Kit.NET's extraction toolkit provides the precision and reliability needed for production workflows.
