Validate Extracted Entities with Built-In Format Validators
When extracting structured data from documents, the LLM might produce values that look correct but fail format validation (e.g., an IBAN with a wrong check digit, an email missing the domain). LM-Kit.NET includes an automatic entity validation pipeline: it detects the semantic type of each extracted field, applies format-specific validators, attempts repairs when possible, and flags low-confidence or invalid values for human review.
For background on structured extraction, see the Structured Data Extraction glossary entry.
Why This Matters
Two production problems that entity validation solves:
- Silent data corruption. An LLM extracts an IBAN as `DE89370400440532013001`, but the original document says `DE89370400440532013000`. Without validation, the wrong number flows into your ERP. The built-in IBAN validator catches the checksum mismatch and flags the value for human review.
- Confidence-driven review queues. When you process thousands of invoices, you cannot manually review every extraction. The `HumanVerificationRequired` flag and per-field `ConfidenceScore` let you route only the uncertain results to a human reviewer, saving hours of manual work.
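The IBAN case works because IBAN validity is defined by the ISO 13616 mod-97 check, so a single wrong digit is detectable without ever seeing the source document. The following standalone sketch shows that underlying rule; it is plain .NET, not LM-Kit API (the library runs an equivalent check for you during validation), and `PassesIbanChecksum` is an illustrative helper name:

```csharp
using System;
using System.Linq;
using System.Numerics;

// Illustrative only: the ISO 13616 mod-97 rule an IBAN validator enforces.
Console.WriteLine(PassesIbanChecksum("DE89370400440532013000")); // True  (valid check digits)
Console.WriteLine(PassesIbanChecksum("DE89370400440532013001")); // False (checksum mismatch)

static bool PassesIbanChecksum(string iban)
{
    // Strip spaces; IBANs are 15-34 alphanumeric characters.
    string s = iban.Replace(" ", "").ToUpperInvariant();
    if (s.Length < 15 || s.Length > 34 || !s.All(char.IsLetterOrDigit)) return false;

    // Move the first four characters (country code + check digits) to the
    // end, then expand each letter to its numeric value (A=10 ... Z=35).
    string rearranged = s[4..] + s[..4];
    string digits = string.Concat(rearranged.Select(c =>
        char.IsLetter(c) ? (c - 'A' + 10).ToString() : c.ToString()));

    // A well-formed IBAN leaves remainder 1 when taken modulo 97.
    return BigInteger.Parse(digits) % 97 == 1;
}
```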
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 4+ GB |
Recommended model: `gemma3:4b` or `qwen3:4b`.
Step 1: Create the Project
dotnet new console -n EntityValidation
cd EntityValidation
dotnet add package LM-Kit.NET
Step 2: Understand the Validation Pipeline
Validation happens automatically after extraction. For every result element, the pipeline:
- Detects the entity kind. Based on the field name, the extraction engine maps it to one of the `EntityKind` values (e.g., a field named `"CustomerEmail"` maps to `EntityKind.EmailAddress`). You can read the detected kind from `TextExtractionElement.DetectedEntityKind`. Detection works across 23 languages.
- Runs the format validator. For supported entity kinds, a format-specific validator checks the value (regex, checksum, or structural rules).
- Attempts repair. If the value is close to valid (e.g., an email with an extra space), the validator tries to fix it. The original value is preserved in `EntityValidationResult.OriginalValue`.
- Reports the status. Each element gets an `EntityValidationResult` with one of four statuses.
| Status | Meaning |
|---|---|
| `NotApplicable` | No validator exists for this entity kind |
| `Valid` | The value passes format validation |
| `Repaired` | The original was invalid but was auto-corrected |
| `Invalid` | The value does not conform and could not be repaired |
Validated entity kinds include: `EmailAddress`, `PhoneNumber`, `FaxNumber`, `Iban`, `SwiftBic`, `IpV4Address`, `IpV6Address`, `MacAddress`, `PostalCode`, `Uri`, `WebsiteUri`, `Uuid`, `Guid`, `CurrencyCode`, `Isbn10`, `Isbn13`, `Issn`.
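To make the validate-then-repair flow concrete, here is a standalone sketch of a validator that checks the raw value, attempts a whitespace repair, and reports a status mirroring the table above. This is plain .NET, not LM-Kit's internal implementation (the library does this automatically and exposes the outcome through `EntityValidationResult`); `ValidateEmail` and `LooksLikeEmail` are hypothetical helper names:

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative sketch of the validate -> repair -> report pipeline stage.
Console.WriteLine(ValidateEmail("sales@acme-corp.com").Status);   // Valid
Console.WriteLine(ValidateEmail(" sales@acme-corp.com ").Status); // Repaired
Console.WriteLine(ValidateEmail("sales@").Status);                // Invalid

static (string Status, string Value) ValidateEmail(string raw)
{
    if (LooksLikeEmail(raw)) return ("Valid", raw);

    // Repair attempt: strip stray whitespace, then re-check.
    string repaired = raw.Trim();
    if (LooksLikeEmail(repaired)) return ("Repaired", repaired);

    return ("Invalid", raw); // could not be repaired
}

// Anchored structural check: local-part, '@', domain containing a dot.
static bool LooksLikeEmail(string s) =>
    Regex.IsMatch(s, @"^[^@\s]+@[^@\s]+\.[^@\s]+$");
```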
Step 3: Extract and Validate Contact Information
using System.Text;
using LMKit.Model;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Extraction.Taxonomy;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// Define schema with fields that trigger entity kind detection
var extractor = new TextExtraction(model)
{
HumanVerificationThreshold = 0.8f, // Flag elements below 80% confidence
NullOnDoubt = true // Return null for uncertain extractions
};
extractor.Elements = new List<TextExtractionElement>
{
new("CompanyName", ElementType.String, "Name of the company"),
new("ContactEmail", ElementType.String, "Primary contact email address"),
new("PhoneNumber", ElementType.String, "Phone number with country code"),
new("WebsiteUrl", ElementType.String, "Company website URL"),
new("TaxId", ElementType.String, "Tax identification number"),
};
extractor.SetContent(
"Acme Corporation\n" +
"Contact: sales@acme-corp.com\n" +
"Phone: +1 (555) 012-3456\n" +
"Web: https://www.acme-corp.com\n" +
"Tax ID: US12-3456789"
);
TextExtractionResult result = extractor.Parse();
Step 4: Inspect Validation Results
Console.WriteLine($"Overall confidence: {result.Confidence:P0}");
Console.WriteLine($"Human review needed: {result.HumanVerificationRequired}\n");
foreach (var element in result.Elements)
{
string fieldName = element.TextExtractionElement.Name;
object value = element.Value;
EntityValidationResult validation = element.Validation;
Console.WriteLine($"Field: {fieldName}");
Console.WriteLine($" Value: {value ?? "(null)"}");
Console.WriteLine($" Confidence: {(element.ConfidenceScore >= 0 ? $"{element.ConfidenceScore:P0}" : "N/A")}");
Console.WriteLine($" Entity kind: {validation.EntityKind}");
Console.WriteLine($" Validation: {validation.Status}");
if (validation.Status == EntityValidationStatus.Repaired)
{
Console.WriteLine($" Original value: {validation.OriginalValue}");
}
if (validation.Status == EntityValidationStatus.Invalid)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine(" ** INVALID: needs human review **");
Console.ResetColor();
}
Console.WriteLine();
}
Example output:
Field: ContactEmail
Value: sales@acme-corp.com
Confidence: 95%
Entity kind: EmailAddress
Validation: Valid
Field: PhoneNumber
Value: +15550123456
Confidence: 88%
Entity kind: PhoneNumber
Validation: Repaired
Original value: +1 (555) 012-3456
Step 5: Access Values by Path
`TextExtractionResult` supports path-based access for quick value retrieval.
// Get typed values
string email = result.GetValue<string>("ContactEmail");
string phone = result["PhoneNumber"].As<string>();
// Check confidence for a specific field
float? emailConfidence = result.GetConfidence("ContactEmail");
// Try-get pattern for optional fields
if (result.TryGetValue("TaxId", out object taxId) && taxId != null)
{
Console.WriteLine($"Tax ID: {taxId}");
}
Step 6: Build a Human Review Queue
Use `HumanVerificationRequired` and per-field validation to route uncertain results.
// After extraction
if (result.HumanVerificationRequired)
{
Console.WriteLine("=== Items requiring human review ===\n");
foreach (var element in result.Elements)
{
bool needsReview = false;
string reason = "";
// Low confidence
if (element.ConfidenceScore >= 0 && element.ConfidenceScore < extractor.HumanVerificationThreshold)
{
needsReview = true;
reason = $"Low confidence ({element.ConfidenceScore:P0})";
}
// Validation failure
if (element.Validation.Status == EntityValidationStatus.Invalid)
{
needsReview = true;
reason = $"Invalid {element.Validation.EntityKind}";
}
if (needsReview)
{
Console.WriteLine($" [{element.TextExtractionElement.Name}] {reason}");
Console.WriteLine($" Extracted value: {element.Value}");
}
}
}
else
{
Console.WriteLine("All extractions passed validation. No human review needed.");
}
Step 7: Use Pattern Constraints for Custom Formats
For fields that follow a known pattern (e.g., invoice numbers, part codes), set a `Pattern` on the element's `Format`. The extraction engine uses this pattern during grammar-constrained generation and validates against it post-extraction.
var invoiceElement = new TextExtractionElement(
"InvoiceNumber", ElementType.String, "Invoice reference number");
invoiceElement.Format.Pattern = @"INV-\d{4}-\d{6}"; // e.g., INV-2025-000142
var dateElement = new TextExtractionElement(
"InvoiceDate", ElementType.Date, "Date the invoice was issued");
var ibanElement = new TextExtractionElement(
"BankIban", ElementType.String, "Beneficiary IBAN");
// No pattern needed: EntityKindDetector maps "BankIban" to EntityKind.Iban,
// and the built-in IBAN validator handles checksum verification.
extractor.Elements = new List<TextExtractionElement>
{
invoiceElement,
dateElement,
ibanElement
};
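Post-extraction, checking a value against a pattern amounts to an anchored regex match: the whole value must conform, not just a substring. A standalone sketch of that check using plain `System.Text.RegularExpressions` (`MatchesPattern` is an illustrative helper, not an LM-Kit API):

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative: wrap the pattern in ^(?: ... )$ so the ENTIRE value
// must match, preventing "XINV-2025-000142Y" from slipping through.
Console.WriteLine(MatchesPattern("INV-2025-000142", @"INV-\d{4}-\d{6}")); // True
Console.WriteLine(MatchesPattern("INV-25-142", @"INV-\d{4}-\d{6}"));      // False

static bool MatchesPattern(string value, string pattern) =>
    Regex.IsMatch(value, $"^(?:{pattern})$");
```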
Step 8: Track Extraction Progress
For large documents processed in multiple passes, subscribe to the `Progress` event.
extractor.Progress += (sender, args) =>
{
Console.WriteLine($"[{args.Phase}] Pass {args.PassIndex + 1}/{args.TotalPasses}");
};
What to Read Next
- Extract Structured Data from Unstructured Text: basic extraction patterns
- Extract Invoice Data from PDFs and Images: real-world invoice extraction
- Auto-Discover Extraction Schemas from Unknown Documents: schema inference
- Build a Self-Healing Extraction Pipeline with Fallbacks: error recovery
- Structured Data Extraction: concept overview