Understanding Extraction in LM-Kit.NET


TL;DR

Extraction is the process of pulling structured, meaningful information from unstructured sources such as free text, scanned documents, images, and PDFs. In LM-Kit.NET, extraction spans multiple specialized classes: TextExtraction for schema-based structured data extraction, NamedEntityRecognition for entity identification, PiiExtraction for sensitive data detection, and DocumentSplitting for detecting logical document boundaries. All extraction classes leverage LM-Kit's proprietary Dynamic Sampling framework, combining language models with symbolic AI to produce outputs that are schema-compliant, confidence-scored, and hallucination-resistant.


What is Extraction?

Definition: Extraction in AI refers to the automated identification and retrieval of specific pieces of information from content that lacks inherent structure. Rather than generating new text, extraction locates, classifies, and normalizes existing information into a format suitable for downstream processing.

The Extraction Spectrum

+--------------------------------------------------------------------------+
|                    Extraction Task Spectrum                              |
+--------------------------------------------------------------------------+
|                                                                          |
|  Low Structure                                    High Structure         |
|  <----------------------------------------------------------->           |
|                                                                          |
|  +-------------+  +-------------+  +-------------+  +-------------+      |
|  |  Keyword    |  |   Entity    |  |  Field      |  |  Table      |      |
|  |  Extraction |  | Recognition |  |  Extraction |  |  Extraction |      |
|  |             |  |             |  |             |  |             |      |
|  | "important" |  | "John Smith"|  | invoice_num:|  | Row/Column  |      |
|  | "urgent"    |  | "Acme Corp" |  |   "INV-001" |  | alignment   |      |
|  +-------------+  +-------------+  +-------------+  +-------------+      |
|                                                                          |
|  Simpler                                           More Complex          |
|                                                                          |
+--------------------------------------------------------------------------+

Extraction vs Generation

Aspect              | Text Generation     | Extraction
--------------------+---------------------+----------------------------------------
Goal                | Create new content  | Retrieve existing information
Output              | Open-ended text     | Structured, bounded data
Validation          | Subjective quality  | Objective correctness
Hallucination risk  | Inherent            | Mitigated through symbolic constraints

Extraction Capabilities in LM-Kit.NET

LM-Kit.NET provides a comprehensive extraction toolkit across two namespaces, LMKit.Extraction and LMKit.TextAnalysis:

Architecture

+--------------------------------------------------------------------------+
|                  LM-Kit.NET Extraction Architecture                      |
+--------------------------------------------------------------------------+
|                                                                          |
|  +-------------------------------------------------------------------+   |
|  |                       Input Layer                                 |   |
|  |  Text • PDF • Image • Office (Word, Excel, PPT) • HTML            |   |
|  +-------------------------------------------------------------------+   |
|                                  |                                       |
|                                  v                                       |
|  +-------------------------------------------------------------------+   |
|  |                    Extraction Classes                             |   |
|  |                                                                   |   |
|  |  LMKit.Extraction              LMKit.TextAnalysis                 |   |
|  |  +---------------------+       +-------------------------+        |   |
|  |  | TextExtraction      |       | NamedEntityRecognition  |        |   |
|  |  | • Schema-based      |       | • Person, Org, Location |        |   |
|  |  | • JSON output       |       | • Date, Money, Product  |        |   |
|  |  | • Nested objects    |       | • Custom entity types   |        |   |
|  |  +---------------------+       +-------------------------+        |   |
|  |  +---------------------+       +-------------------------+        |   |
|  |  | DocumentSplitting   |       | PiiExtraction           |        |   |
|  |  | • Page boundaries   |       | • SSN, Credit Card      |        |   |
|  |  | • Vision-based      |       | • Email, Phone, IP      |        |   |
|  |  | • Labels per segment|       | • Custom PII types      |        |   |
|  |  +---------------------+       +-------------------------+        |   |
|  |                                                                   |   |
|  +-------------------------------------------------------------------+   |
|                                  |                                       |
|                                  v                                       |
|  +-------------------------------------------------------------------+   |
|  |                    Dynamic Sampling Layer                         |   |
|  |  Grammar Constraints • Perplexity Assessment • Fuzzy Validation   |   |
|  +-------------------------------------------------------------------+   |
|                                  |                                       |
|                                  v                                       |
|  +-------------------------------------------------------------------+   |
|  |                      Output Layer                                 |   |
|  |  JSON • Entity Lists • Confidence Scores • Validation Flags       |   |
|  +-------------------------------------------------------------------+   |
|                                                                          |
+--------------------------------------------------------------------------+

Structured Data Extraction

The TextExtraction class is the primary tool for pulling typed fields from content into a predefined JSON schema.

Basic Usage

using System;
using System.Threading;
using LMKit.Extraction;
using LMKit.Model;

var model = LM.LoadFromModelID("gemma3:12b");
var extractor = new TextExtraction(model);

// Define what to extract
extractor.Elements.Add(new TextExtractionElement("company", ElementType.String)
{
    Description = "Name of the company"
});
extractor.Elements.Add(new TextExtractionElement("revenue", ElementType.Double)
{
    Description = "Annual revenue in USD"
});
extractor.Elements.Add(new TextExtractionElement("founded", ElementType.Date)
{
    Description = "Date the company was founded"
});

// Provide content
extractor.SetContent("Acme Corp was established on March 12, 2005. Last year the company reported $4.2M in revenue.");

// Extract
var result = extractor.Parse(CancellationToken.None);

Console.WriteLine(result.Json);
// {"company": "Acme Corp", "revenue": 4200000.0, "founded": "2005-03-12"}
Console.WriteLine($"Confidence: {result.Confidence:P1}");
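The expected output above shows the founding date normalized from "March 12, 2005" to "2005-03-12". The following is a minimal sketch of that kind of normalization using only the .NET base class library. The helper name TryNormalizeDate and the list of accepted formats are assumptions made for illustration; this is not how LM-Kit.NET performs coercion internally.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: normalize a human-readable date to ISO 8601 (yyyy-MM-dd).
// The accepted input formats are an assumption chosen for this illustration.
static bool TryNormalizeDate(string text, out string iso)
{
    string[] formats = { "MMMM d, yyyy", "MMM d, yyyy", "yyyy-MM-dd" };
    bool ok = DateTime.TryParseExact(text.Trim(), formats, CultureInfo.InvariantCulture,
                                     DateTimeStyles.None, out DateTime parsed);
    // On failure return an empty string so the caller can route the field to review.
    iso = ok ? parsed.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture) : string.Empty;
    return ok;
}

if (TryNormalizeDate("March 12, 2005", out string founded))
{
    Console.WriteLine(founded); // 2005-03-12
}
```

Returning false instead of guessing keeps unparseable values visible, which fits the confidence-driven review workflow described later in this article.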

Nested Object Extraction

extractor.Elements.Add(new TextExtractionElement("line_items", ElementType.ObjectArray)
{
    Description = "Individual items on the invoice",
    InnerElements = new List<TextExtractionElement>
    {
        new("description", ElementType.String),
        new("quantity", ElementType.Integer),
        new("unit_price", ElementType.Double),
        new("amount", ElementType.Double)
    }
});

Extracting from Documents and Images

using LMKit.Data;

// From PDF
extractor.SetContent(new Attachment("invoice.pdf"));
var pdfResult = extractor.Parse(CancellationToken.None);

// From image with vision mode
extractor.SetContent(new Attachment("receipt_photo.jpg"));
extractor.PreferredInferenceModality = InferenceModality.Vision;
var imageResult = extractor.Parse(CancellationToken.None);

// From specific pages
extractor.SetContent(new Attachment("long_report.pdf"), "1-3");

Schema Discovery

When you do not know the structure of a document in advance, LM-Kit.NET can infer the schema automatically:

extractor.SetContent(new Attachment("unknown_document.pdf"));

// Let the model discover what fields exist
var discoveredResult = extractor.SchemaDiscovery(CancellationToken.None);
Console.WriteLine(discoveredResult.Json);

Accessing Results

// By name
string company = result.GetValue<string>("company");
double revenue = result.GetValue<double>("revenue");

// By path for nested objects
double firstItemPrice = result.GetValue<double>("line_items[0].unit_price");

// Enumerate arrays
foreach (var item in result.EnumerateAt("line_items"))
{
    Console.WriteLine($"{item["description"].Value}: {item["amount"].Value}");
}

// Check confidence per field
float companyConfidence = result.GetConfidence("company");
bool needsReview = result.HumanVerificationRequired;
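Per-field confidence is what makes human-in-the-loop routing practical. The sketch below shows one way to split fields into accepted and needs-review groups. It uses no LM-Kit calls: the dictionary stands in for values and confidences read from an extraction result, and both the sample numbers and the 0.80 threshold are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-field results, standing in for values and confidences
// read from an extraction result.
var fields = new Dictionary<string, (string Value, float Confidence)>
{
    ["company"] = ("Acme Corp", 0.97f),
    ["revenue"] = ("4200000", 0.62f),
    ["founded"] = ("2005-03-12", 0.91f),
};

const float ReviewThreshold = 0.80f; // assumed cutoff; tune per workflow

// Fields below the threshold go to a human reviewer; the rest are accepted.
List<string> fieldsToReview = fields
    .Where(f => f.Value.Confidence < ReviewThreshold)
    .Select(f => f.Key)
    .OrderBy(name => name)
    .ToList();

Console.WriteLine(fieldsToReview.Count == 0
    ? "All fields accepted"
    : $"Review: {string.Join(", ", fieldsToReview)}");
// Review: revenue
```

A single threshold is the simplest policy. In practice, fields with higher business impact (totals, account numbers) may warrant a stricter cutoff than descriptive fields.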

Named Entity Recognition

The NamedEntityRecognition class identifies and classifies entities within text or documents.

using System;
using System.Threading;
using LMKit.TextAnalysis;
using LMKit.Model;

var model = LM.LoadFromModelID("gemma3:4b");
var ner = new NamedEntityRecognition(model);

var entities = ner.Recognize(
    "John Smith signed the agreement with Acme Corp on January 15, 2024 for $50,000.",
    CancellationToken.None
);

foreach (var entity in entities)
{
    Console.WriteLine($"{entity.EntityDefinition.Type}: {entity.Value} ({entity.Confidence:P1})");
}
// Person: John Smith (97.2%)
// Organization: Acme Corp (98.5%)
// Date: January 15, 2024 (99.1%)
// Money: $50,000 (96.8%)

Built-In Entity Types

Entity Type   | Examples
--------------+--------------------------------
Person        | John Smith, Dr. Martinez
Organization  | Acme Corp, United Nations
Location      | New York, 123 Main Street
Date          | January 15, 2024; next Monday
Money         | $50,000; 1,200 EUR
Percent       | 15%, 0.5 percent
Product       | iPhone 15, Model X
Event         | CES 2024, World Cup

Custom Entity Types

var customDefinitions = new List<EntityDefinition>
{
    new EntityDefinition(NamedEntityType.Custom, "Medical Condition",
        "A disease, disorder, or health condition"),
    new EntityDefinition(NamedEntityType.Custom, "Medication",
        "A drug or pharmaceutical product"),
    new EntityDefinition(NamedEntityType.Person)
};

var ner = new NamedEntityRecognition(model, customDefinitions);

PII Extraction

The PiiExtraction class detects personally identifiable information for compliance and data protection.

using System;
using System.Threading;
using LMKit.TextAnalysis;
using LMKit.Model;

var model = LM.LoadFromModelID("qwen3:4b");
var pii = new PiiExtraction(model);

var entities = pii.Extract(
    "Contact Jane Doe at jane.doe@example.com or (555) 123-4567. Her SSN is 123-45-6789.",
    CancellationToken.None
);

foreach (var entity in entities)
{
    Console.WriteLine($"{entity.EntityDefinition.Type}: {entity.Value}");
}
// Person: Jane Doe
// EmailAddress: jane.doe@example.com
// PhoneNumber: (555) 123-4567
// SSN: 123-45-6789

Built-In PII Types

PII Type      | Description
--------------+--------------------------------
Person        | Full names
Organization  | Company and institution names
Location      | Addresses, cities, countries
EmailAddress  | Email addresses
PhoneNumber   | Phone and fax numbers
CreditCard    | Credit/debit card numbers
SSN           | Social Security Numbers
Date          | Dates of birth, other dates
IPAddress     | IPv4 and IPv6 addresses
URL           | Web addresses
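Once PII has been detected, a common next step is masking it before the text is stored or shared. The following sketch assumes the detections are available as plain (Type, Value) pairs like those printed in the example above. The Redact helper and the [TYPE] placeholder format are hypothetical, and no LM-Kit API is used.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Replace each detected value with a [TYPE] placeholder. Longer values are
// replaced first so a short value never clobbers part of a longer one.
static string Redact(string text, IEnumerable<(string Type, string Value)> detections)
{
    foreach (var (type, value) in detections
                 .Where(d => !string.IsNullOrEmpty(d.Value))
                 .OrderByDescending(d => d.Value.Length))
    {
        text = text.Replace(value, $"[{type.ToUpperInvariant()}]");
    }
    return text;
}

// Hypothetical detections, shaped like the type/value pairs printed above.
var detections = new List<(string Type, string Value)>
{
    ("Person", "Jane Doe"),
    ("EmailAddress", "jane.doe@example.com"),
    ("PhoneNumber", "(555) 123-4567"),
    ("SSN", "123-45-6789"),
};

string input = "Contact Jane Doe at jane.doe@example.com or (555) 123-4567. Her SSN is 123-45-6789.";
string redacted = Redact(input, detections);
Console.WriteLine(redacted);
// Contact [PERSON] at [EMAILADDRESS] or [PHONENUMBER]. Her SSN is [SSN].
```

Value-based replacement masks every occurrence of a detected string, which is usually what redaction needs. If only specific occurrences should be masked, character offsets would be required instead.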

Document Splitting

The DocumentSplitting class uses vision models to detect logical document boundaries within multi-page files.

using System;
using System.Threading;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;

var model = LM.LoadFromModelID("qwen2-vl:7b");

var splitter = new DocumentSplitting(model)
{
    Guidance = "The file contains a mix of invoices and contracts."
};

// Detect boundaries
var result = splitter.Split(new Attachment("scanned_batch.pdf"), CancellationToken.None);

Console.WriteLine($"Found {result.DocumentCount} document(s)");
foreach (var segment in result.Segments)
{
    Console.WriteLine($"  Pages {segment.StartPage}-{segment.EndPage}: {segment.Label}");
}

// Optionally split into separate PDF files
var splitResult = splitter.Split(
    new Attachment("scanned_batch.pdf"),
    splitDocument: true,
    outputDirectory: "./output",
    CancellationToken.None
);

foreach (string path in splitResult.Documents)
{
    Console.WriteLine($"Created: {path}");
}

The Dynamic Sampling Advantage

All extraction classes in LM-Kit.NET benefit from the Dynamic Sampling framework, which combines neural language model output with symbolic validation:

+--------------------------------------------------------------------------+
|                  Dynamic Sampling in Extraction                          |
+--------------------------------------------------------------------------+
|                                                                          |
|  LLM generates candidate token                                           |
|           |                                                              |
|           v                                                              |
|  +-------------------+                                                   |
|  | Grammar Check     |  Does the token satisfy the JSON schema?          |
|  +-------------------+                                                   |
|           |                                                              |
|           v                                                              |
|  +-------------------+                                                   |
|  | Perplexity Check  |  Is the model confident in this token?            |
|  +-------------------+                                                   |
|           |                                                              |
|           v                                                              |
|  +-------------------+                                                   |
|  | Auxiliary Lookup  |  Does external knowledge confirm the value?       |
|  +-------------------+                                                   |
|           |                                                              |
|           v                                                              |
|  Accept or explore alternatives                                          |
|                                                                          |
+--------------------------------------------------------------------------+

This neuro-symbolic approach delivers:

  • 100% schema compliance through grammar enforcement
  • 75% fewer errors compared to pure LLM extraction
  • Per-field confidence scores for human-in-the-loop workflows
  • Automatic type coercion (e.g., "March 15, 2024" to 2024-03-15)
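To make the accept-or-explore idea concrete, here is a deliberately simplified sketch. Dynamic Sampling is proprietary, so this is not LM-Kit's implementation: the candidate list, the SelectToken helper, and the probability floor (a crude stand-in for the perplexity check) are all assumptions, and the auxiliary lookup step is omitted.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Grammar check for a field the schema declares as Integer: digits only.
static bool IsInteger(string token) =>
    int.TryParse(token, NumberStyles.None, CultureInfo.InvariantCulture, out _);

// Walk candidates from most to least probable; accept the first one that
// passes the grammar check and clears the confidence floor.
static (string Token, double Probability)? SelectToken(
    IEnumerable<(string Token, double Probability)> candidates,
    Func<string, bool> satisfiesGrammar,
    double minProbability)
{
    foreach (var candidate in candidates.OrderByDescending(c => c.Probability))
    {
        if (!satisfiesGrammar(candidate.Token)) continue;   // schema violation: explore next
        if (candidate.Probability < minProbability) break;  // remaining ones are less likely
        return candidate;
    }
    return null; // nothing acceptable: flag the field for review
}

// Hypothetical candidates for a "quantity" field.
var candidates = new List<(string Token, double Probability)>
{
    ("twelve", 0.46), ("12", 0.41), ("12.5", 0.13),
};

var chosen = SelectToken(candidates, IsInteger, minProbability: 0.20);
Console.WriteLine(chosen?.Token ?? "(needs review)"); // 12
```

The most probable candidate ("twelve") is rejected because it violates the declared type, so the next candidate is explored. This is the sense in which grammar enforcement yields schema-compliant output even when the model's first choice would not be.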

Extraction Use Cases

1. Invoice Processing

Extract vendor information, line items, totals, and payment terms from invoices in any format. Combine with Classification to route invoices by type.

2. Compliance and Redaction

Detect PII across documents and automatically redact sensitive information to help meet GDPR, HIPAA, or other regulatory requirements.

3. Contract Analysis

Identify parties, dates, obligations, and key clauses from legal agreements using both NER and structured extraction.

4. Resume Parsing

Extract candidate details (name, contact, experience, skills, education) from resumes in PDF, Word, or image format.

5. Mailroom Automation

Combine document splitting with classification and extraction to process batches of scanned mail: detect boundaries, classify each document, then extract relevant fields.

6. Medical Records

Parse patient records, lab results, and clinical notes using custom entity definitions while keeping all processing on-device, which supports HIPAA compliance efforts.


Key Terms

  • Extraction: Retrieving specific information from unstructured content
  • Schema: A predefined structure defining what fields to extract and their data types
  • Named Entity: A real-world object (person, place, organization, date) identified in text
  • PII (Personally Identifiable Information): Data that can identify an individual
  • Document Splitting: Detecting logical boundaries between documents within a multi-page file
  • Confidence Score: A value between 0 and 1 indicating how certain the model is about an extracted value
  • Human-in-the-Loop (HITL): Routing low-confidence extractions for manual review
  • Schema Discovery: Automatically inferring the extraction schema from document content
  • Element Type: The data type of an extraction field (String, Integer, Double, Date, Object, ObjectArray)



Summary

Extraction encompasses the full range of techniques for pulling structured information from unstructured content. In LM-Kit.NET, extraction capabilities span four specialized classes: TextExtraction for schema-based JSON extraction with nested objects, arrays, and typed fields; NamedEntityRecognition for identifying people, organizations, locations, dates, and custom entity types; PiiExtraction for detecting sensitive personal information; and DocumentSplitting for vision-based detection of logical document boundaries within multi-page files. All extraction classes accept multimodal inputs (text, PDFs, images, Office documents), produce results with per-field confidence scores, and leverage the Dynamic Sampling framework to combine LLM intelligence with symbolic validation for schema-compliant, hallucination-resistant outputs. Whether automating invoice processing, enforcing compliance through PII detection, or parsing contracts, LM-Kit.NET's extraction toolkit provides the precision and reliability needed for production workflows.