Validate Extracted Entities with Built-In Format Validators

When extracting structured data from documents, the LLM might produce values that look correct but fail format validation (e.g., an IBAN with a wrong check digit, an email missing the domain). LM-Kit.NET includes an automatic entity validation pipeline: it detects the semantic type of each extracted field, applies format-specific validators, attempts repairs when possible, and flags low-confidence or invalid values for human review.

For background on structured extraction, see the Structured Data Extraction glossary entry.


Why This Matters

Two production problems that entity validation solves:

  1. Silent data corruption. An LLM extracts an IBAN as DE89370400440532013001 but the original document says DE89370400440532013000. Without validation, the wrong number flows into your ERP. The built-in IBAN validator catches the checksum mismatch and flags it for human review.
  2. Confidence-driven review queues. When you process thousands of invoices, you cannot manually review every extraction. The HumanVerificationRequired flag and per-field ConfidenceScore let you route only the uncertain results to a human reviewer, so review time goes where it is needed.

Prerequisites

Requirement   Minimum
.NET SDK      8.0+
VRAM          4+ GB

Recommended model: gemma3:4b or qwen3:4b.


Step 1: Create the Project

dotnet new console -n EntityValidation
cd EntityValidation
dotnet add package LM-Kit.NET

Step 2: Understand the Validation Pipeline

Validation happens automatically after extraction. For every result element, the pipeline:

  1. Detects the entity kind. Based on the field name, the extraction engine maps each field to one of the EntityKind values (e.g., a field named "CustomerEmail" maps to EntityKind.EmailAddress). You can read the detected kind from TextExtractionElement.DetectedEntityKind. Detection works across 23 languages.
  2. Runs the format validator. For supported entity kinds, a format-specific validator checks the value (regex, checksum, or structural rules).
  3. Attempts repair. If the value is close to valid (e.g., an email with an extra space), the validator tries to fix it. The original value is preserved in EntityValidationResult.OriginalValue.
  4. Reports the status. Each element gets an EntityValidationResult with one of four statuses.

Status          Meaning
NotApplicable   No validator exists for this entity kind
Valid           The value passes format validation
Repaired        The original was invalid but was auto-corrected
Invalid         The value does not conform and could not be repaired

Validated entity kinds include: EmailAddress, PhoneNumber, FaxNumber, Iban, SwiftBic, IpV4Address, IpV6Address, MacAddress, PostalCode, Uri, WebsiteUri, Uuid, Guid, CurrencyCode, Isbn10, Isbn13, Issn.
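
To make the "checksum" part of step 2 concrete, here is a minimal sketch of the standard IBAN mod-97 check, applied to the two IBANs from the "Why This Matters" section. HasValidIbanChecksum is a hypothetical helper written for illustration. It is not LM-Kit.NET's validator, which may apply additional rules (for example, per-country length checks).

```csharp
using System;
using System.Linq;

// Hypothetical helper (not part of LM-Kit.NET): standard IBAN mod-97 check.
static bool HasValidIbanChecksum(string iban)
{
    // Drop whitespace and normalize to upper case.
    string s = new string(iban.Where(c => !char.IsWhiteSpace(c)).ToArray()).ToUpperInvariant();

    // IBANs are 15 to 34 characters, ASCII letters and digits only.
    if (s.Length < 15 || s.Length > 34) return false;
    if (!s.All(c => (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z'))) return false;

    // Move the country code and check digits (first four characters) to the end.
    string rearranged = s.Substring(4) + s.Substring(0, 4);

    // Compute the value mod 97 incrementally; letters map to 10..35 (A=10, ..., Z=35).
    int remainder = 0;
    foreach (char c in rearranged)
    {
        if (c >= '0' && c <= '9')
            remainder = (remainder * 10 + (c - '0')) % 97;
        else
            remainder = (remainder * 100 + (c - 'A' + 10)) % 97;
    }

    // A well-formed IBAN leaves a remainder of exactly 1.
    return remainder == 1;
}

Console.WriteLine(HasValidIbanChecksum("DE89370400440532013000")); // True
Console.WriteLine(HasValidIbanChecksum("DE89370400440532013001")); // False
```

Because 97 is prime, changing any single digit always changes the remainder, which is why the one-digit extraction error from the example cannot slip through.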


Step 3: Extract and Validate Contact Information

using System.Text;
using LMKit.Model;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Extraction.Taxonomy;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// Define schema with fields that trigger entity kind detection
var extractor = new TextExtraction(model)
{
    HumanVerificationThreshold = 0.8f,  // Flag elements below 80% confidence
    NullOnDoubt = true                   // Return null for uncertain extractions
};

extractor.Elements = new List<TextExtractionElement>
{
    new("CompanyName", ElementType.String, "Name of the company"),
    new("ContactEmail", ElementType.String, "Primary contact email address"),
    new("PhoneNumber", ElementType.String, "Phone number with country code"),
    new("WebsiteUrl", ElementType.String, "Company website URL"),
    new("TaxId", ElementType.String, "Tax identification number"),
};

extractor.SetContent(
    "Acme Corporation\n" +
    "Contact: sales@acme-corp.com\n" +
    "Phone: +1 (555) 012-3456\n" +
    "Web: https://www.acme-corp.com\n" +
    "Tax ID: US12-3456789"
);

TextExtractionResult result = extractor.Parse();

Step 4: Inspect Validation Results

Console.WriteLine($"Overall confidence: {result.Confidence:P0}");
Console.WriteLine($"Human review needed: {result.HumanVerificationRequired}\n");

foreach (var element in result.Elements)
{
    string fieldName = element.TextExtractionElement.Name;
    object value = element.Value;
    EntityValidationResult validation = element.Validation;

    Console.WriteLine($"Field: {fieldName}");
    Console.WriteLine($"  Value: {value ?? "(null)"}");
    Console.WriteLine($"  Confidence: {(element.ConfidenceScore >= 0 ? $"{element.ConfidenceScore:P0}" : "N/A")}");
    Console.WriteLine($"  Entity kind: {validation.EntityKind}");
    Console.WriteLine($"  Validation: {validation.Status}");

    if (validation.Status == EntityValidationStatus.Repaired)
    {
        Console.WriteLine($"  Original value: {validation.OriginalValue}");
    }

    if (validation.Status == EntityValidationStatus.Invalid)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine("  ** INVALID: needs human review **");
        Console.ResetColor();
    }

    Console.WriteLine();
}

Example output (excerpt; exact values and confidence scores vary by model and run):

Field: ContactEmail
  Value: sales@acme-corp.com
  Confidence: 95%
  Entity kind: EmailAddress
  Validation: Valid

Field: PhoneNumber
  Value: +15550123456
  Confidence: 88%
  Entity kind: PhoneNumber
  Validation: Repaired
  Original value: +1 (555) 012-3456
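
The Repaired status for PhoneNumber above reflects a normalization of the extracted text. As a rough illustration of what such a repair can look like, the sketch below strips formatting characters while keeping a leading "+". NormalizePhone is a hypothetical helper, and this is an assumption about the general technique, not LM-Kit.NET's actual repair logic.

```csharp
using System;
using System.Linq;

// Hypothetical helper (not part of LM-Kit.NET): keep digits and a leading '+'.
static string NormalizePhone(string raw)
{
    string trimmed = raw.Trim();
    bool hasPlus = trimmed.StartsWith("+", StringComparison.Ordinal);

    // Discard spaces, parentheses, dashes, and any other non-digit characters.
    string digits = new string(trimmed.Where(c => c >= '0' && c <= '9').ToArray());

    return hasPlus ? "+" + digits : digits;
}

Console.WriteLine(NormalizePhone("+1 (555) 012-3456")); // +15550123456
```

Keeping the original string alongside the normalized one, as EntityValidationResult.OriginalValue does, lets a reviewer confirm that the repair did not change the meaning of the value.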

Step 5: Access Values by Path

TextExtractionResult supports path-based access for quick value retrieval.

// Get typed values
string email = result.GetValue<string>("ContactEmail");
string phone = result["PhoneNumber"].As<string>();

// Check confidence for a specific field
float? emailConfidence = result.GetConfidence("ContactEmail");

// Try-get pattern for optional fields
if (result.TryGetValue("TaxId", out object taxId) && taxId != null)
{
    Console.WriteLine($"Tax ID: {taxId}");
}

Step 6: Build a Human Review Queue

Use HumanVerificationRequired and per-field validation to route uncertain results.

// After extraction
if (result.HumanVerificationRequired)
{
    Console.WriteLine("=== Items requiring human review ===\n");

    foreach (var element in result.Elements)
    {
        // Collect every reason so a low-confidence, invalid field reports both
        var reasons = new List<string>();

        // Low confidence (negative scores are treated as unavailable, as in Step 4)
        if (element.ConfidenceScore >= 0 && element.ConfidenceScore < extractor.HumanVerificationThreshold)
        {
            reasons.Add($"Low confidence ({element.ConfidenceScore:P0})");
        }

        // Validation failure
        if (element.Validation.Status == EntityValidationStatus.Invalid)
        {
            reasons.Add($"Invalid {element.Validation.EntityKind}");
        }

        if (reasons.Count > 0)
        {
            Console.WriteLine($"  [{element.TextExtractionElement.Name}] {string.Join("; ", reasons)}");
            Console.WriteLine($"    Extracted value: {element.Value}");
        }
    }
}
else
{
    Console.WriteLine("All extractions passed validation. No human review needed.");
}

Step 7: Use Pattern Constraints for Custom Formats

For fields that follow a known pattern (e.g., invoice numbers, part codes), set a Pattern on the element format. The extraction engine uses this pattern during grammar-constrained generation and validates against it post-extraction.

var invoiceElement = new TextExtractionElement(
    "InvoiceNumber", ElementType.String, "Invoice reference number");
invoiceElement.Format.Pattern = @"INV-\d{4}-\d{6}";  // e.g., INV-2025-000142

var dateElement = new TextExtractionElement(
    "InvoiceDate", ElementType.Date, "Date the invoice was issued");

var ibanElement = new TextExtractionElement(
    "BankIban", ElementType.String, "Beneficiary IBAN");
// No pattern needed: EntityKindDetector maps "BankIban" to EntityKind.Iban,
// and the built-in IBAN validator handles checksum verification.

extractor.Elements = new List<TextExtractionElement>
{
    invoiceElement,
    dateElement,
    ibanElement
};
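
Before wiring a pattern into an extractor, it can help to test it against sample values with .NET's own Regex class. The sketch below does that for the invoice pattern. MatchesWholeValue is a hypothetical helper, and whether LM-Kit.NET anchors the pattern to the whole value is not stated here, so the explicit anchoring is an assumption made for this standalone test only.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of LM-Kit.NET): require the pattern to match the entire value.
static bool MatchesWholeValue(string value, string pattern) =>
    Regex.IsMatch(value, @"\A(?:" + pattern + @")\z");

string invoicePattern = @"INV-\d{4}-\d{6}";

Console.WriteLine(MatchesWholeValue("INV-2025-000142", invoicePattern));      // True
Console.WriteLine(MatchesWholeValue("INV-25-000142", invoicePattern));        // False

// Unanchored matching accepts any string that merely contains a match.
Console.WriteLine(Regex.IsMatch("REF INV-2025-0001429", invoicePattern));     // True
Console.WriteLine(MatchesWholeValue("REF INV-2025-0001429", invoicePattern)); // False
```

The last two lines show why anchoring matters when you validate a value yourself: an unanchored pattern accepts surrounding text and trailing digits.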

Step 8: Track Extraction Progress

For large documents processed in multiple passes, subscribe to the Progress event.

extractor.Progress += (sender, args) =>
{
    Console.WriteLine($"[{args.Phase}] Pass {args.PassIndex + 1}/{args.TotalPasses}");
};
