Understanding Extraction in LM-Kit.NET
TL;DR
Extraction is the process of pulling structured, meaningful information from unstructured sources such as free text, scanned documents, images, and PDFs. In LM-Kit.NET, extraction spans multiple specialized classes: TextExtraction for schema-based structured data extraction, NamedEntityRecognition for entity identification, PiiExtraction for sensitive data detection, and DocumentSplitting for detecting logical document boundaries. All extraction classes leverage LM-Kit's proprietary Dynamic Sampling framework, combining language models with symbolic AI to produce outputs that are schema-compliant, confidence-scored, and hallucination-resistant.
What is Extraction?
Definition: Extraction in AI refers to the automated identification and retrieval of specific pieces of information from content that lacks inherent structure. Rather than generating new text, extraction locates, classifies, and normalizes existing information into a format suitable for downstream processing.
The Extraction Spectrum
+--------------------------------------------------------------------------+
| Extraction Task Spectrum |
+--------------------------------------------------------------------------+
| |
| Low Structure High Structure |
| <-----------------------------------------------------------> |
| |
| +-------------+ +-------------+ +-------------+ +-------------+ |
| | Keyword | | Entity | | Field | | Table | |
| | Extraction | | Recognition | | Extraction | | Extraction | |
| | | | | | | | | |
| | "important" | | "John Smith"| | invoice_num:| | Row/Column | |
| | "urgent" | | "Acme Corp" | | "INV-001" | | alignment | |
| +-------------+ +-------------+ +-------------+ +-------------+ |
| |
| Simpler More Complex |
| |
+--------------------------------------------------------------------------+
Extraction vs Generation
| Aspect | Text Generation | Extraction |
|---|---|---|
| Goal | Create new content | Retrieve existing information |
| Output | Open-ended text | Structured, bounded data |
| Validation | Subjective quality | Objective correctness |
| Hallucination risk | Inherent | Mitigated through symbolic constraints |
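One practical consequence of this difference: because extraction retrieves existing information, its outputs can be checked mechanically against the source. As an illustrative sketch (in Python, not LM-Kit's internal implementation), a simple grounding check can verify that an extracted value actually occurs in the input rather than being invented:

```python
import re

def grounded(value: str, source: str) -> bool:
    """Return True if the extracted value can be located in the source text
    (whitespace-insensitive), i.e. it was retrieved rather than invented."""
    pattern = r"\s*".join(re.escape(token) for token in value.split())
    return re.search(pattern, source, re.IGNORECASE) is not None

source = "Acme Corp was established on March 12, 2005."
print(grounded("Acme Corp", source))   # True: present in the source
print(grounded("Globex Inc", source))  # False: a hallucinated value
```

Checks of this kind are one example of the symbolic constraints referenced in the table; generated free-form text offers no comparable objective test.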
Extraction Capabilities in LM-Kit.NET
LM-Kit.NET provides a comprehensive extraction toolkit across two namespaces: LMKit.Extraction (TextExtraction, DocumentSplitting) and LMKit.TextAnalysis (NamedEntityRecognition, PiiExtraction).
Architecture
+--------------------------------------------------------------------------+
| LM-Kit.NET Extraction Architecture |
+--------------------------------------------------------------------------+
| |
| +-------------------------------------------------------------------+ |
| | Input Layer | |
| | Text • PDF • Image • Office (Word, Excel, PPT) • HTML | |
| +-------------------------------------------------------------------+ |
| | |
| v |
| +-------------------------------------------------------------------+ |
| | Extraction Classes | |
| | | |
| | LMKit.Extraction LMKit.TextAnalysis | |
| | +---------------------+ +-------------------------+ | |
| | | TextExtraction | | NamedEntityRecognition | | |
| | | • Schema-based | | • Person, Org, Location | | |
| | | • JSON output | | • Date, Money, Product | | |
| | | • Nested objects | | • Custom entity types | | |
| | +---------------------+ +-------------------------+ | |
| | +---------------------+ +-------------------------+ | |
| | | DocumentSplitting | | PiiExtraction | | |
| | | • Page boundaries | | • SSN, Credit Card | | |
| | | • Vision-based | | • Email, Phone, IP | | |
| | | • Labels per segment| | • Custom PII types | | |
| | +---------------------+ +-------------------------+ | |
| | | |
| +-------------------------------------------------------------------+ |
| | |
| v |
| +-------------------------------------------------------------------+ |
| | Dynamic Sampling Layer | |
| | Grammar Constraints • Perplexity Assessment • Fuzzy Validation | |
| +-------------------------------------------------------------------+ |
| | |
| v |
| +-------------------------------------------------------------------+ |
| | Output Layer | |
| | JSON • Entity Lists • Confidence Scores • Validation Flags | |
| +-------------------------------------------------------------------+ |
| |
+--------------------------------------------------------------------------+
Structured Data Extraction
The TextExtraction class is the primary tool for pulling typed fields from content into a predefined JSON schema.
Basic Usage
using LMKit.Extraction;
using LMKit.Model;
var model = LM.LoadFromModelID("gemma3:12b");
var extractor = new TextExtraction(model);
// Define what to extract
extractor.Elements.Add(new TextExtractionElement("company", ElementType.String)
{
Description = "Name of the company"
});
extractor.Elements.Add(new TextExtractionElement("revenue", ElementType.Double)
{
Description = "Annual revenue in USD"
});
extractor.Elements.Add(new TextExtractionElement("founded", ElementType.Date)
{
Description = "Date the company was founded"
});
// Provide content
extractor.SetContent("Acme Corp was established on March 12, 2005. Last year the company reported $4.2M in revenue.");
// Extract
var result = extractor.Parse(CancellationToken.None);
Console.WriteLine(result.Json);
// {"company": "Acme Corp", "revenue": 4200000.0, "founded": "2005-03-12"}
Console.WriteLine($"Confidence: {result.Confidence:P1}");
Nested Object Extraction
extractor.Elements.Add(new TextExtractionElement("line_items", ElementType.ObjectArray)
{
Description = "Individual items on the invoice",
InnerElements = new List<TextExtractionElement>
{
new("description", ElementType.String),
new("quantity", ElementType.Integer),
new("unit_price", ElementType.Double),
new("amount", ElementType.Double)
}
});
Extracting from Documents and Images
using LMKit.Data;
// From PDF
extractor.SetContent(new Attachment("invoice.pdf"));
var result = extractor.Parse(CancellationToken.None);
// From image with vision mode
extractor.SetContent(new Attachment("receipt_photo.jpg"));
extractor.PreferredInferenceModality = InferenceModality.Vision;
result = extractor.Parse(CancellationToken.None);
// From specific pages
extractor.SetContent(new Attachment("long_report.pdf"), "1-3");
Schema Discovery
When you do not know the structure of a document in advance, LM-Kit.NET can infer the schema automatically:
extractor.SetContent(new Attachment("unknown_document.pdf"));
// Let the model discover what fields exist
var discoveredResult = extractor.SchemaDiscovery(CancellationToken.None);
Console.WriteLine(discoveredResult.Json);
Accessing Results
// By name
string company = result.GetValue<string>("company");
double revenue = result.GetValue<double>("revenue");
// By path for nested objects
double firstItemPrice = result.GetValue<double>("line_items[0].unit_price");
// Enumerate arrays
foreach (var item in result.EnumerateAt("line_items"))
{
Console.WriteLine($"{item["description"].Value}: {item["amount"].Value}");
}
// Check confidence per field
float companyConfidence = result.GetConfidence("company");
bool needsReview = result.HumanVerificationRequired;
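Per-field confidence makes human-in-the-loop routing straightforward. The sketch below (in Python, with hypothetical field names and a made-up threshold; not the LM-Kit API) shows the typical pattern of auto-accepting high-confidence fields and flagging the rest for manual review:

```python
# Assumed threshold for illustration; tune per workflow.
REVIEW_THRESHOLD = 0.85

def route(fields: dict) -> tuple:
    """Split extracted (value, confidence) pairs into auto-accepted values
    and field names flagged for human review."""
    accepted, flagged = {}, []
    for name, (value, confidence) in fields.items():
        if confidence >= REVIEW_THRESHOLD:
            accepted[name] = value
        else:
            flagged.append(name)
    return accepted, flagged

accepted, flagged = route({
    "company": ("Acme Corp", 0.97),
    "revenue": (4200000.0, 0.62),  # low confidence: send to a human
})
print(accepted)  # {'company': 'Acme Corp'}
print(flagged)   # ['revenue']
```

In LM-Kit.NET the same decision can be driven directly by GetConfidence per field or the aggregate HumanVerificationRequired flag shown above.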
Named Entity Recognition
The NamedEntityRecognition class identifies and classifies entities within text or documents.
using LMKit.TextAnalysis;
using LMKit.Model;
var model = LM.LoadFromModelID("gemma3:4b");
var ner = new NamedEntityRecognition(model);
var entities = ner.Recognize(
"John Smith signed the agreement with Acme Corp on January 15, 2024 for $50,000.",
CancellationToken.None
);
foreach (var entity in entities)
{
Console.WriteLine($"{entity.EntityDefinition.Type}: {entity.Value} ({entity.Confidence:P1})");
}
// Person: John Smith (97.2%)
// Organization: Acme Corp (98.5%)
// Date: January 15, 2024 (99.1%)
// Money: $50,000 (96.8%)
Built-In Entity Types
| Entity Type | Examples |
|---|---|
| Person | John Smith, Dr. Martinez |
| Organization | Acme Corp, United Nations |
| Location | New York, 123 Main Street |
| Date | January 15, 2024; next Monday |
| Money | $50,000; 1,200 EUR |
| Percent | 15%, 0.5 percent |
| Product | iPhone 15, Model X |
| Event | CES 2024, World Cup |
Custom Entity Types
var customDefinitions = new List<EntityDefinition>
{
new EntityDefinition(NamedEntityType.Custom, "Medical Condition",
"A disease, disorder, or health condition"),
new EntityDefinition(NamedEntityType.Custom, "Medication",
"A drug or pharmaceutical product"),
new EntityDefinition(NamedEntityType.Person)
};
var ner = new NamedEntityRecognition(model, customDefinitions);
PII Extraction
The PiiExtraction class detects personally identifiable information for compliance and data protection.
using LMKit.TextAnalysis;
using LMKit.Model;
var model = LM.LoadFromModelID("qwen3:4b");
var pii = new PiiExtraction(model);
var entities = pii.Extract(
"Contact Jane Doe at jane.doe@example.com or (555) 123-4567. Her SSN is 123-45-6789.",
CancellationToken.None
);
foreach (var entity in entities)
{
Console.WriteLine($"{entity.EntityDefinition.Type}: {entity.Value}");
}
// Person: Jane Doe
// EmailAddress: jane.doe@example.com
// PhoneNumber: (555) 123-4567
// SSN: 123-45-6789
Built-In PII Types
| PII Type | Description |
|---|---|
| Person | Full names |
| Organization | Company and institution names |
| Location | Addresses, cities, countries |
| EmailAddress | Email addresses |
| PhoneNumber | Phone and fax numbers |
| CreditCard | Credit/debit card numbers |
| SSN | Social Security Numbers |
| Date | Dates of birth, other dates |
| IPAddress | IPv4 and IPv6 addresses |
| URL | Web addresses |
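A common follow-up to detection is redaction. As a minimal sketch (in Python, not the LM-Kit API), the (type, value) pairs returned by PII detection can drive a simple placeholder substitution:

```python
def redact(text: str, entities: list) -> str:
    """Replace each detected PII value with a [TYPE] placeholder.
    `entities` mimics the (type, value) pairs a PII detector returns."""
    for pii_type, value in entities:
        text = text.replace(value, f"[{pii_type.upper()}]")
    return text

text = "Contact Jane Doe at jane.doe@example.com. Her SSN is 123-45-6789."
entities = [("Person", "Jane Doe"),
            ("EmailAddress", "jane.doe@example.com"),
            ("SSN", "123-45-6789")]
print(redact(text, entities))
# Contact [PERSON] at [EMAILADDRESS]. Her SSN is [SSN].
```

Production redaction would operate on character offsets rather than string replacement to avoid masking coincidental matches, but the routing logic is the same.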
Document Splitting
The DocumentSplitting class uses vision models to detect logical document boundaries within multi-page files.
using LMKit.Extraction;
using LMKit.Model;
var model = LM.LoadFromModelID("qwen2-vl:7b");
var splitter = new DocumentSplitting(model)
{
Guidance = "The file contains a mix of invoices and contracts."
};
// Detect boundaries
var result = splitter.Split(new Attachment("scanned_batch.pdf"), CancellationToken.None);
Console.WriteLine($"Found {result.DocumentCount} document(s)");
foreach (var segment in result.Segments)
{
Console.WriteLine($" Pages {segment.StartPage}-{segment.EndPage}: {segment.Label}");
}
// Optionally split into separate PDF files
result = splitter.Split(
new Attachment("scanned_batch.pdf"),
splitDocument: true,
outputDirectory: "./output",
CancellationToken.None
);
foreach (string path in result.Documents)
{
Console.WriteLine($"Created: {path}");
}
The Dynamic Sampling Advantage
All extraction classes in LM-Kit.NET benefit from the Dynamic Sampling framework, which combines neural language model output with symbolic validation:
+--------------------------------------------------------------------------+
| Dynamic Sampling in Extraction |
+--------------------------------------------------------------------------+
| |
| LLM generates candidate token |
| | |
| v |
| +-------------------+ |
| | Grammar Check | Does the token satisfy the JSON schema? |
| +-------------------+ |
| | |
| v |
| +-------------------+ |
| | Perplexity Check | Is the model confident in this token? |
| +-------------------+ |
| | |
| v |
| +-------------------+ |
| | Auxiliary Lookup | Does external knowledge confirm the value? |
| +-------------------+ |
| | |
| v |
| Accept or explore alternatives |
| |
+--------------------------------------------------------------------------+
This neuro-symbolic approach delivers:
- 100% schema compliance through grammar enforcement
- 75% fewer errors compared to pure LLM extraction
- Per-field confidence scores for human-in-the-loop workflows
- Automatic type coercion (e.g., "March 15, 2024" to 2024-03-15)
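The gating loop in the diagram can be approximated as follows. This is a conceptual sketch in Python with toy probabilities, not LM-Kit's actual sampler: candidates are walked in order of model probability, and a token is accepted only if it both satisfies the schema grammar and clears a confidence bar.

```python
def choose(candidates: list, grammar_ok, min_prob: float = 0.2):
    """Accept the first candidate token (by model probability) that passes
    the grammar check and the confidence threshold; None means backtrack."""
    for token, prob in sorted(candidates, key=lambda c: -c[1]):
        if grammar_ok(token) and prob >= min_prob:
            return token
    return None  # no candidate survives: explore alternatives

# After `"revenue": ` in a JSON schema, only a number is grammatical, so
# prose tokens are pruned even when the raw model prefers them.
is_number = lambda t: t.replace(".", "", 1).isdigit()
print(choose([("approximately", 0.5), ("4200000.0", 0.3)], is_number))
# 4200000.0
```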
Extraction Use Cases
1. Invoice Processing
Extract vendor information, line items, totals, and payment terms from invoices in any format. Combine with Classification to route invoices by type.
2. Compliance and Redaction
Detect PII across documents and automatically redact sensitive information to meet GDPR, HIPAA, or other regulatory requirements.
3. Contract Analysis
Identify parties, dates, obligations, and key clauses from legal agreements using both NER and structured extraction.
4. Resume Parsing
Extract candidate details (name, contact, experience, skills, education) from resumes in PDF, Word, or image format.
5. Mailroom Automation
Combine document splitting with classification and extraction to process batches of scanned mail: detect boundaries, classify each document, then extract relevant fields.
6. Medical Records
Parse patient records, lab results, and clinical notes using custom entity definitions while keeping all processing on-device for HIPAA compliance.
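The mailroom flow from use case 5 (split, then classify, then extract) reduces to a staged pipeline. Below is a toy sketch in Python with hypothetical string heuristics standing in for the vision and language models; in practice each stage would call DocumentSplitting, Classification, and TextExtraction, and extraction would apply a type-specific schema to each segment.

```python
def split(batch: list) -> list:
    """Group pages into logical documents (toy rule: a new document starts
    when a page opens with a known title line)."""
    docs = []
    for page in batch:
        if page.startswith(("INVOICE", "CONTRACT")) or not docs:
            docs.append([page])
        else:
            docs[-1].append(page)
    return docs

def classify(doc: list) -> str:
    """Toy classifier: route by the first page's title line."""
    return "invoice" if doc[0].startswith("INVOICE") else "contract"

batch = ["INVOICE #1 page 1", "page 2", "CONTRACT A page 1"]
for doc in split(batch):
    print(classify(doc), len(doc))
# invoice 2
# contract 1
```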
Key Terms
- Extraction: Retrieving specific information from unstructured content
- Schema: A predefined structure defining what fields to extract and their data types
- Named Entity: A real-world object (person, place, organization, date) identified in text
- PII (Personally Identifiable Information): Data that can identify an individual
- Document Splitting: Detecting logical boundaries between documents within a multi-page file
- Confidence Score: A value between 0 and 1 indicating how certain the model is about an extracted value
- Human-in-the-Loop (HITL): Routing low-confidence extractions for manual review
- Schema Discovery: Automatically inferring the extraction schema from document content
- Element Type: The data type of an extraction field (String, Integer, Double, Date, Object, ObjectArray)
Related API Documentation
- TextExtraction: Schema-based structured data extraction
- TextExtractionElement: Extraction field definition
- TextExtractionResult: Extraction output with typed access
- DocumentSplitting: Vision-based document boundary detection
- DocumentSplittingResult: Splitting result with segments
- NamedEntityRecognition: Entity identification and classification
- PiiExtraction: Sensitive data detection
- Attachment: Universal document input
Related Glossary Topics
- Structured Data Extraction: Deep dive into the TextExtraction class and Dynamic Sampling
- Named Entity Recognition (NER): Detailed guide to entity identification
- Classification: Assigning labels to content before or after extraction
- Intelligent Document Processing (IDP): End-to-end document automation pipeline
- Dynamic Sampling: The neuro-symbolic framework powering reliable extraction
- Grammar Sampling: Grammar constraints ensuring schema-compliant output
- Symbolic AI: Rule-based validation in the extraction pipeline
- Vision Language Models (VLM): Multimodal models for image-based extraction
- RAG (Retrieval-Augmented Generation): Combining extracted data with retrieval pipelines
- Embeddings: Vector representations for semantic matching during extraction
- LLM: Language models powering extraction intelligence
- Inference: Model execution process for extraction tasks
- Prompt Engineering: Crafting guidance to improve extraction accuracy
External Resources
- LM-Kit Invoice Extraction Demo: Real-world invoice extraction example
- LM-Kit Structured Data Extraction Demo: Schema-based extraction example
- LM-Kit NER Demo: Named entity recognition example
- LM-Kit PII Extraction Demo: PII detection example
- LM-Kit Document Splitting Demo: Vision-based document splitting example
Summary
Extraction encompasses the full range of techniques for pulling structured information from unstructured content. In LM-Kit.NET, extraction capabilities span four specialized classes: TextExtraction for schema-based JSON extraction with nested objects, arrays, and typed fields; NamedEntityRecognition for identifying people, organizations, locations, dates, and custom entity types; PiiExtraction for detecting sensitive personal information; and DocumentSplitting for vision-based detection of logical document boundaries within multi-page files. All extraction classes accept multimodal inputs (text, PDFs, images, Office documents), produce results with per-field confidence scores, and leverage the Dynamic Sampling framework to combine LLM intelligence with symbolic validation for schema-compliant, hallucination-resistant outputs. Whether automating invoice processing, enforcing compliance through PII detection, or parsing contracts, LM-Kit.NET's extraction toolkit provides the precision and reliability needed for production workflows.