Auto-Discover Extraction Schemas from Unknown Documents

When building document processing pipelines, you usually know the document structure upfront and define an extraction schema manually. But some workflows involve documents with unpredictable or evolving structures: vendor invoices in dozens of formats, regulatory forms that change quarterly, or legacy archives with inconsistent layouts. LM-Kit.NET's SchemaDiscovery feature analyzes a document and generates a JSON schema automatically, identifying the fields, types, and structure present in the content. This tutorial builds a schema discovery pipeline that handles unknown document types and generates extraction schemas on the fly.


Why Schema Discovery Matters

Automatic schema discovery addresses two common enterprise problems:

  1. Heterogeneous vendor documents. A procurement department receives invoices, packing slips, and certificates from hundreds of vendors, each with unique layouts. Defining a schema for every vendor is impractical. Schema discovery analyzes a sample document and proposes the extraction schema, which can then be reviewed, refined, and reused for that vendor's documents.
  2. Legacy document digitization. Organizations migrating paper archives to digital systems encounter decades of forms, reports, and correspondence in varying formats. Schema discovery automates the first step of digitization: understanding what data each document type contains, without manual analysis of every format.

Prerequisites

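| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 4+ GB (the qwen3:8b model used in this tutorial needs ~6 GB; see Model Selection) |
| Disk | ~3 GB free for model download |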

Step 1: Create the Project

dotnet new console -n SchemaDiscovery
cd SchemaDiscovery
dotnet add package LM-Kit.NET

Step 2: Understand the Discovery Process

  Unknown        ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
  document ───►  │ SchemaDiscovery │ ──► │ Review/Refine   │ ──► │ Parse with      │
                 │ (analyze doc)   │     │ (optional)      │     │ discovered      │
                 │                 │     │                 │     │ schema          │
                 └─────────────────┘     └─────────────────┘     └─────────────────┘
                        │                                               │
                        ▼                                               ▼
                 JSON Schema                                    Structured JSON
                 (auto-generated)                               (extracted data)

SchemaDiscovery uses the LLM to analyze the document content and propose a JSON schema describing the data fields present. The discovered schema can be used directly or refined before extraction.
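
To make the idea of a discovered schema concrete, the sketch below walks a hand-written JSON schema of the kind discovery might return for a purchase order and lists its fields. The schema text is illustrative only (real output depends on the model and the document), and `ListFields` and `Walk` are hypothetical helpers built on System.Text.Json, not part of LM-Kit.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hand-written example schema; real discovery output will differ.
string sampleSchema = """
{
  "type": "object",
  "properties": {
    "poNumber": { "type": "string" },
    "orderDate": { "type": "string" },
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "sku": { "type": "string" },
          "quantity": { "type": "integer" }
        }
      }
    },
    "total": { "type": "number" }
  }
}
""";

foreach (string field in ListFields(sampleSchema))
    Console.WriteLine(field);

// Hypothetical helper: returns "path: type" for every property in the schema,
// descending into nested objects and array items.
static List<string> ListFields(string schemaJson)
{
    var fields = new List<string>();
    using JsonDocument doc = JsonDocument.Parse(schemaJson);
    Walk(doc.RootElement, "", fields);
    return fields;
}

static void Walk(JsonElement node, string path, List<string> fields)
{
    if (node.ValueKind != JsonValueKind.Object) return;

    // Array schemas describe their elements under "items".
    if (node.TryGetProperty("items", out JsonElement items))
        Walk(items, path + "[]", fields);

    if (!node.TryGetProperty("properties", out JsonElement props) ||
        props.ValueKind != JsonValueKind.Object)
        return;

    foreach (JsonProperty prop in props.EnumerateObject())
    {
        string childPath = path.Length == 0 ? prop.Name : $"{path}.{prop.Name}";
        string type = prop.Value.ValueKind == JsonValueKind.Object &&
                      prop.Value.TryGetProperty("type", out JsonElement t) &&
                      t.ValueKind == JsonValueKind.String
            ? t.GetString()!
            : "unknown";
        fields.Add($"{childPath}: {type}");
        Walk(prop.Value, childPath, fields);
    }
}
```

A flat listing like this is a quick way to review a schema before deciding whether to use it directly or refine it.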


Step 3: Discover Schema from Text

using System.Text;
using System.Text.Json;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3:8b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Analyze an unknown document
// ──────────────────────────────────────
string unknownDocument =
    "PURCHASE ORDER #PO-2025-0847\n\n" +
    "Date: February 3, 2025\n" +
    "Ship To: Warehouse B, 1200 Industrial Parkway, Austin, TX 78701\n" +
    "Bill To: Meridian Manufacturing Corp., 500 Commerce Drive, Suite 400, Dallas, TX 75201\n\n" +
    "Vendor: Precision Parts Supply Co.\n" +
    "Vendor Contact: Robert Chen, robert.chen@precisionparts.com\n\n" +
    "Items:\n" +
    "  1. Stainless Steel Bolt M10x50 (SKU: SSB-M10-50)  Qty: 5000  Unit: $0.12  Total: $600.00\n" +
    "  2. Hex Nut M10 Grade 8 (SKU: HN-M10-G8)           Qty: 5000  Unit: $0.08  Total: $400.00\n" +
    "  3. Flat Washer M10 (SKU: FW-M10)                   Qty: 10000 Unit: $0.03  Total: $300.00\n" +
    "  4. Spring Lock Washer M10 (SKU: SLW-M10)           Qty: 5000  Unit: $0.05  Total: $250.00\n\n" +
    "Subtotal: $1,550.00\n" +
    "Shipping: $85.00\n" +
    "Tax (8.25%): $127.88\n" +
    "Total: $1,762.88\n\n" +
    "Payment Terms: Net 30\n" +
    "Required Delivery Date: February 20, 2025\n" +
    "Special Instructions: Deliver to loading dock C. Require signed proof of delivery.";

var extractor = new TextExtraction(model);
extractor.SetContent(unknownDocument);

Console.WriteLine("=== Schema Discovery ===\n");
Console.WriteLine("Analyzing document to discover extraction schema...\n");

string discoveredSchema = extractor.SchemaDiscovery();

Console.ForegroundColor = ConsoleColor.Cyan;
Console.WriteLine("Discovered JSON Schema:");
Console.ResetColor();
Console.WriteLine(discoveredSchema);
Console.WriteLine();

Step 4: Extract Data Using the Discovered Schema

Use the discovered schema directly for extraction:

using System.Text;
using System.Text.Json;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3:8b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Analyze an unknown document
// ──────────────────────────────────────
string unknownDocument =
    "PURCHASE ORDER #PO-2025-0847\n\n" +
    "Date: February 3, 2025\n" +
    "Ship To: Warehouse B, 1200 Industrial Parkway, Austin, TX 78701\n" +
    "Bill To: Meridian Manufacturing Corp., 500 Commerce Drive, Suite 400, Dallas, TX 75201\n\n" +
    "Vendor: Precision Parts Supply Co.\n" +
    "Vendor Contact: Robert Chen, robert.chen@precisionparts.com\n\n" +
    "Items:\n" +
    "  1. Stainless Steel Bolt M10x50 (SKU: SSB-M10-50)  Qty: 5000  Unit: $0.12  Total: $600.00\n" +
    "  2. Hex Nut M10 Grade 8 (SKU: HN-M10-G8)           Qty: 5000  Unit: $0.08  Total: $400.00\n" +
    "  3. Flat Washer M10 (SKU: FW-M10)                   Qty: 10000 Unit: $0.03  Total: $300.00\n" +
    "  4. Spring Lock Washer M10 (SKU: SLW-M10)           Qty: 5000  Unit: $0.05  Total: $250.00\n\n" +
    "Subtotal: $1,550.00\n" +
    "Shipping: $85.00\n" +
    "Tax (8.25%): $127.88\n" +
    "Total: $1,762.88\n\n" +
    "Payment Terms: Net 30\n" +
    "Required Delivery Date: February 20, 2025\n" +
    "Special Instructions: Deliver to loading dock C. Require signed proof of delivery.";

var extractor = new TextExtraction(model);
extractor.SetContent(unknownDocument);

Console.WriteLine("=== Schema Discovery ===\n");
Console.WriteLine("Analyzing document to discover extraction schema...\n");

string discoveredSchema = extractor.SchemaDiscovery();

// ──────────────────────────────────────
// 3. Apply the discovered schema
// ──────────────────────────────────────
Console.WriteLine("=== Extraction with Discovered Schema ===\n");

extractor.SetElementsFromJsonSchema(discoveredSchema);
extractor.SetContent(unknownDocument);

TextExtractionResult result = extractor.Parse();

Console.ForegroundColor = ConsoleColor.Cyan;
Console.WriteLine("Extracted Data:");
Console.ResetColor();

// Pretty-print the JSON
using JsonDocument doc = JsonDocument.Parse(result.Json);
string prettyJson = JsonSerializer.Serialize(doc, new JsonSerializerOptions { WriteIndented = true });
Console.WriteLine(prettyJson);

Console.ForegroundColor = ConsoleColor.DarkGray;
Console.WriteLine($"\nConfidence: {result.Confidence:P0}");
Console.ResetColor();
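
A confidence score alone does not prove the extracted numbers are right. For documents with totals, such as the purchase order above, a cheap arithmetic cross-check can catch misreads. The sketch below is one possible check; the JSON is a hand-written stand-in for `result.Json`, and the field names (`items`, `total`, `subtotal`, `shipping`, `tax`) are assumptions, since the real names come from whatever schema discovery proposes.

```csharp
using System;
using System.Linq;
using System.Text.Json;

// Hand-written stand-in for result.Json; the field names are assumptions.
string extractedJson = """
{
  "items": [
    { "sku": "SSB-M10-50", "total": 600.00 },
    { "sku": "HN-M10-G8", "total": 400.00 },
    { "sku": "FW-M10", "total": 300.00 },
    { "sku": "SLW-M10", "total": 250.00 }
  ],
  "subtotal": 1550.00,
  "shipping": 85.00,
  "tax": 127.88,
  "total": 1762.88
}
""";

Console.WriteLine(TotalsAreConsistent(extractedJson)
    ? "Totals are consistent."
    : "Totals do not add up; flag for review.");

// Hypothetical helper: true when the line totals sum to the subtotal and
// subtotal + shipping + tax equals the grand total.
// Throws if an expected property is missing or is not a number.
static bool TotalsAreConsistent(string json)
{
    using JsonDocument doc = JsonDocument.Parse(json);
    JsonElement root = doc.RootElement;

    decimal lineSum = root.GetProperty("items")
        .EnumerateArray()
        .Sum(item => item.GetProperty("total").GetDecimal());

    decimal subtotal = root.GetProperty("subtotal").GetDecimal();
    decimal expectedTotal = subtotal
        + root.GetProperty("shipping").GetDecimal()
        + root.GetProperty("tax").GetDecimal();

    return lineSum == subtotal
        && expectedTotal == root.GetProperty("total").GetDecimal();
}
```

Using `decimal` rather than `double` keeps currency comparisons exact. If the discovered schema types amounts as strings (for example "$600.00"), they would need to be parsed before a check like this can run.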

Step 5: Schema Discovery from PDF Documents

Discover schemas from PDF files and images:

using System.Text;
using System.Text.Json;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3:8b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Analyze an unknown document
// ──────────────────────────────────────
string unknownDocument =
    "PURCHASE ORDER #PO-2025-0847\n\n" +
    "Date: February 3, 2025\n" +
    "Ship To: Warehouse B, 1200 Industrial Parkway, Austin, TX 78701\n" +
    "Bill To: Meridian Manufacturing Corp., 500 Commerce Drive, Suite 400, Dallas, TX 75201\n\n" +
    "Vendor: Precision Parts Supply Co.\n" +
    "Vendor Contact: Robert Chen, robert.chen@precisionparts.com\n\n" +
    "Items:\n" +
    "  1. Stainless Steel Bolt M10x50 (SKU: SSB-M10-50)  Qty: 5000  Unit: $0.12  Total: $600.00\n" +
    "  2. Hex Nut M10 Grade 8 (SKU: HN-M10-G8)           Qty: 5000  Unit: $0.08  Total: $400.00\n" +
    "  3. Flat Washer M10 (SKU: FW-M10)                   Qty: 10000 Unit: $0.03  Total: $300.00\n" +
    "  4. Spring Lock Washer M10 (SKU: SLW-M10)           Qty: 5000  Unit: $0.05  Total: $250.00\n\n" +
    "Subtotal: $1,550.00\n" +
    "Shipping: $85.00\n" +
    "Tax (8.25%): $127.88\n" +
    "Total: $1,762.88\n\n" +
    "Payment Terms: Net 30\n" +
    "Required Delivery Date: February 20, 2025\n" +
    "Special Instructions: Deliver to loading dock C. Require signed proof of delivery.";

var extractor = new TextExtraction(model);

Console.WriteLine("\n=== PDF Schema Discovery ===\n");

string pdfPath = "unknown_form.pdf";

if (File.Exists(pdfPath))
{
    var attachment = new Attachment(pdfPath);

    extractor.SetContent(attachment);

    Console.Write("Analyzing PDF structure... ");
    string pdfSchema = extractor.SchemaDiscovery();
    Console.WriteLine("done.\n");

    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.WriteLine("Discovered Schema from PDF:");
    Console.ResetColor();
    Console.WriteLine(pdfSchema);

    // Save the schema for reuse
    string schemaFile = Path.ChangeExtension(pdfPath, ".schema.json");
    File.WriteAllText(schemaFile, pdfSchema);
    Console.WriteLine($"\nSchema saved to: {schemaFile}");
}
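
Once a schema has been saved, later runs can reuse it instead of repeating discovery. The sketch below shows one way to load a saved schema defensively. `TryLoadSchema` is a hypothetical helper, and the check for a top-level "properties" member is an assumption about the schema's shape. On success, the returned string could be passed to `SetElementsFromJsonSchema` as in Step 4.

```csharp
using System;
using System.IO;
using System.Text.Json;

// Demo setup: write a small schema file to the temp folder.
string demoPath = Path.Combine(Path.GetTempPath(), "unknown_form.schema.json");
File.WriteAllText(demoPath,
    """{ "type": "object", "properties": { "formId": { "type": "string" } } }""");

if (TryLoadSchema(demoPath, out string savedSchema))
    Console.WriteLine($"Reusing saved schema ({savedSchema.Length} chars); discovery can be skipped.");
else
    Console.WriteLine("No usable schema found; run discovery.");

// Hypothetical helper: returns true only if the file exists, parses as JSON,
// and is an object with a "properties" member (an assumed schema shape).
static bool TryLoadSchema(string path, out string schemaJson)
{
    schemaJson = "";
    if (!File.Exists(path)) return false;

    string text = File.ReadAllText(path);
    try
    {
        using JsonDocument doc = JsonDocument.Parse(text);
        if (doc.RootElement.ValueKind != JsonValueKind.Object ||
            !doc.RootElement.TryGetProperty("properties", out _))
            return false;
    }
    catch (JsonException)
    {
        // Corrupt or hand-edited file that is no longer valid JSON.
        return false;
    }

    schemaJson = text;
    return true;
}
```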

Step 6: Async Schema Discovery

For UI applications or batch processing, use the async variant:

using System.Text;
using System.Text.Json;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3:8b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Analyze an unknown document
// ──────────────────────────────────────
string unknownDocument =
    "PURCHASE ORDER #PO-2025-0847\n\n" +
    "Date: February 3, 2025\n" +
    "Ship To: Warehouse B, 1200 Industrial Parkway, Austin, TX 78701\n" +
    "Bill To: Meridian Manufacturing Corp., 500 Commerce Drive, Suite 400, Dallas, TX 75201\n\n" +
    "Vendor: Precision Parts Supply Co.\n" +
    "Vendor Contact: Robert Chen, robert.chen@precisionparts.com\n\n" +
    "Items:\n" +
    "  1. Stainless Steel Bolt M10x50 (SKU: SSB-M10-50)  Qty: 5000  Unit: $0.12  Total: $600.00\n" +
    "  2. Hex Nut M10 Grade 8 (SKU: HN-M10-G8)           Qty: 5000  Unit: $0.08  Total: $400.00\n" +
    "  3. Flat Washer M10 (SKU: FW-M10)                   Qty: 10000 Unit: $0.03  Total: $300.00\n" +
    "  4. Spring Lock Washer M10 (SKU: SLW-M10)           Qty: 5000  Unit: $0.05  Total: $250.00\n\n" +
    "Subtotal: $1,550.00\n" +
    "Shipping: $85.00\n" +
    "Tax (8.25%): $127.88\n" +
    "Total: $1,762.88\n\n" +
    "Payment Terms: Net 30\n" +
    "Required Delivery Date: February 20, 2025\n" +
    "Special Instructions: Deliver to loading dock C. Require signed proof of delivery.";

var extractor = new TextExtraction(model);

Console.WriteLine("\n=== Async Schema Discovery ===\n");

extractor.SetContent(unknownDocument);

string asyncSchema = await extractor.SchemaDiscoveryAsync();

Console.ForegroundColor = ConsoleColor.Cyan;
Console.WriteLine("Discovered Schema (async):");
Console.ResetColor();
Console.WriteLine(asyncSchema);
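
For batch processing, one conservative pattern is to await each discovery in turn and record failures per document rather than aborting the whole run. The sketch below illustrates that pattern only. `fakeDiscover` is a stand-in delegate so the example runs without a model; in a real pipeline it would wrap `SetContent` and `SchemaDiscoveryAsync`. `DiscoverAllAsync` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Stand-in for the real discovery call; hypothetical, returns a fixed schema.
Func<string, Task<string>> fakeDiscover = async text =>
{
    await Task.Delay(10);
    if (text.Length == 0) throw new InvalidOperationException("empty document");
    return "{ \"type\": \"object\" }";
};

var batch = await DiscoverAllAsync(
    new[] { "PURCHASE ORDER #PO-2025-0847", "" }, fakeDiscover);

foreach (var (index, schema, error) in batch)
    Console.WriteLine(error is null ? $"doc {index}: ok" : $"doc {index}: failed ({error})");

// Hypothetical helper: runs discovery one document at a time and
// captures either the schema or the error message for each document.
static async Task<List<(int Index, string? Schema, string? Error)>> DiscoverAllAsync(
    IReadOnlyList<string> documents, Func<string, Task<string>> discover)
{
    var results = new List<(int Index, string? Schema, string? Error)>();
    for (int i = 0; i < documents.Count; i++)
    {
        try
        {
            string discovered = await discover(documents[i]);
            results.Add((i, discovered, null));
        }
        catch (Exception ex)
        {
            results.Add((i, null, ex.Message));
        }
    }
    return results;
}
```

The loop is deliberately sequential: the tutorial reuses a single extractor instance, and whether that instance can serve concurrent calls is not something this sketch assumes.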

Step 7: Build a Schema Library from Document Samples

Process a folder of sample documents to build a reusable schema library:

using System.Text;
using System.Text.Json;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3:8b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Analyze an unknown document
// ──────────────────────────────────────
string unknownDocument =
    "PURCHASE ORDER #PO-2025-0847\n\n" +
    "Date: February 3, 2025\n" +
    "Ship To: Warehouse B, 1200 Industrial Parkway, Austin, TX 78701\n" +
    "Bill To: Meridian Manufacturing Corp., 500 Commerce Drive, Suite 400, Dallas, TX 75201\n\n" +
    "Vendor: Precision Parts Supply Co.\n" +
    "Vendor Contact: Robert Chen, robert.chen@precisionparts.com\n\n" +
    "Items:\n" +
    "  1. Stainless Steel Bolt M10x50 (SKU: SSB-M10-50)  Qty: 5000  Unit: $0.12  Total: $600.00\n" +
    "  2. Hex Nut M10 Grade 8 (SKU: HN-M10-G8)           Qty: 5000  Unit: $0.08  Total: $400.00\n" +
    "  3. Flat Washer M10 (SKU: FW-M10)                   Qty: 10000 Unit: $0.03  Total: $300.00\n" +
    "  4. Spring Lock Washer M10 (SKU: SLW-M10)           Qty: 5000  Unit: $0.05  Total: $250.00\n\n" +
    "Subtotal: $1,550.00\n" +
    "Shipping: $85.00\n" +
    "Tax (8.25%): $127.88\n" +
    "Total: $1,762.88\n\n" +
    "Payment Terms: Net 30\n" +
    "Required Delivery Date: February 20, 2025\n" +
    "Special Instructions: Deliver to loading dock C. Require signed proof of delivery.";

var extractor = new TextExtraction(model);

Console.WriteLine("\n=== Schema Library Builder ===\n");

string samplesFolder = "document_samples";
string schemaLibraryFolder = "schemas";

if (!Directory.Exists(samplesFolder))
{
    Console.WriteLine($"Create a '{samplesFolder}' folder with sample documents, then run again.");
    return;
}

Directory.CreateDirectory(schemaLibraryFolder);

string[] sampleFiles = Directory.GetFiles(samplesFolder)
    .Where(f => new[] { ".pdf", ".txt", ".docx", ".png", ".jpg" }
        .Contains(Path.GetExtension(f).ToLowerInvariant()))
    .ToArray();

Console.WriteLine($"Analyzing {sampleFiles.Length} sample document(s)...\n");

foreach (string sampleFile in sampleFiles)
{
    string fileName = Path.GetFileName(sampleFile);
    Console.Write($"  {fileName}... ");

    try
    {
        var attachment = new Attachment(sampleFile);
        extractor.SetContent(attachment);

        string schema = extractor.SchemaDiscovery();

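        // Append to the full file name (rather than replacing the extension) so that
        // invoice.pdf and invoice.png do not overwrite each other's schema.
        string schemaPath = Path.Combine(
            schemaLibraryFolder,
            fileName + ".schema.json");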

        File.WriteAllText(schemaPath, schema);

        Console.ForegroundColor = ConsoleColor.Green;
        Console.WriteLine("schema discovered");
        Console.ResetColor();
    }
    catch (Exception ex)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine($"failed: {ex.Message}");
        Console.ResetColor();
    }
}

Console.WriteLine($"\nSchemas saved to: {Path.GetFullPath(schemaLibraryFolder)}");
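
A schema library built this way often contains near-duplicates, because several samples share the same layout. One way to spot them is to reduce each schema to a signature of its top-level field names and group files by that signature. The sketch below assumes the schema lists its fields under a top-level "properties" object. `SchemaSignature` is a hypothetical helper, and the two sample schemas are hand-written.

```csharp
using System;
using System.Linq;
using System.Text.Json;

// Two hand-written schemas with the same fields in different order and casing.
string schemaA = """{ "properties": { "invoiceNumber": {}, "total": {} } }""";
string schemaB = """{ "properties": { "Total": {}, "InvoiceNumber": {} } }""";

Console.WriteLine(SchemaSignature(schemaA));                             // invoicenumber|total
Console.WriteLine(SchemaSignature(schemaA) == SchemaSignature(schemaB)); // True

// Hypothetical helper: lower-cased, sorted, pipe-joined top-level property names.
// Returns an empty string when the schema has no "properties" object.
static string SchemaSignature(string schemaJson)
{
    using JsonDocument doc = JsonDocument.Parse(schemaJson);
    if (doc.RootElement.ValueKind != JsonValueKind.Object ||
        !doc.RootElement.TryGetProperty("properties", out JsonElement props) ||
        props.ValueKind != JsonValueKind.Object)
        return "";

    var names = props.EnumerateObject()
        .Select(p => p.Name.ToLowerInvariant())
        .OrderBy(n => n, StringComparer.Ordinal);

    return string.Join("|", names);
}
```

The signature ignores types and nesting, so it only suggests candidates for merging; a reviewer still decides whether two schemas really describe the same document type.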

Step 8: Discover, Extract, and Validate Pipeline

Combine schema discovery and extraction in one pipeline, and report the extraction confidence as a basic validation signal:

using System.Text;
using System.Text.Json;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3:8b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Analyze an unknown document
// ──────────────────────────────────────
string unknownDocument =
    "PURCHASE ORDER #PO-2025-0847\n\n" +
    "Date: February 3, 2025\n" +
    "Ship To: Warehouse B, 1200 Industrial Parkway, Austin, TX 78701\n" +
    "Bill To: Meridian Manufacturing Corp., 500 Commerce Drive, Suite 400, Dallas, TX 75201\n\n" +
    "Vendor: Precision Parts Supply Co.\n" +
    "Vendor Contact: Robert Chen, robert.chen@precisionparts.com\n\n" +
    "Items:\n" +
    "  1. Stainless Steel Bolt M10x50 (SKU: SSB-M10-50)  Qty: 5000  Unit: $0.12  Total: $600.00\n" +
    "  2. Hex Nut M10 Grade 8 (SKU: HN-M10-G8)           Qty: 5000  Unit: $0.08  Total: $400.00\n" +
    "  3. Flat Washer M10 (SKU: FW-M10)                   Qty: 10000 Unit: $0.03  Total: $300.00\n" +
    "  4. Spring Lock Washer M10 (SKU: SLW-M10)           Qty: 5000  Unit: $0.05  Total: $250.00\n\n" +
    "Subtotal: $1,550.00\n" +
    "Shipping: $85.00\n" +
    "Tax (8.25%): $127.88\n" +
    "Total: $1,762.88\n\n" +
    "Payment Terms: Net 30\n" +
    "Required Delivery Date: February 20, 2025\n" +
    "Special Instructions: Deliver to loading dock C. Require signed proof of delivery.";

var extractor = new TextExtraction(model);

Console.WriteLine("\n=== Full Discovery-to-Extraction Pipeline ===\n");

string[] documents =
{
    // A shipping manifest
    "SHIPPING MANIFEST #SM-20250205\n" +
    "Origin: Shanghai Port, China\n" +
    "Destination: Los Angeles Port, USA\n" +
    "Vessel: MV Pacific Star\n" +
    "ETD: 2025-02-10  ETA: 2025-03-05\n" +
    "Container: MSCU-4521897\n" +
    "Weight: 18,500 kg  Volume: 33.2 CBM\n" +
    "Contents: Industrial machinery parts (HS Code: 8483.40)\n" +
    "Shipper: Donghua Manufacturing Ltd.\n" +
    "Consignee: Pacific Industrial Corp.\n" +
    "Insurance Value: $245,000 USD",

    // A lab test report
    "LABORATORY TEST REPORT\n" +
    "Report #: LTR-2025-0392\n" +
    "Test Date: February 4, 2025\n" +
    "Sample ID: WQ-2025-0084\n" +
    "Sample Type: Municipal Water Supply\n" +
    "Collection Point: Main Distribution Line, Station 14\n" +
    "pH Level: 7.2 (Pass, range: 6.5-8.5)\n" +
    "Turbidity: 0.8 NTU (Pass, max: 1.0 NTU)\n" +
    "Total Coliform: 0 CFU/100mL (Pass, max: 0)\n" +
    "Lead: <0.001 mg/L (Pass, max: 0.015 mg/L)\n" +
    "Chlorine Residual: 0.6 mg/L (Pass, range: 0.2-4.0 mg/L)\n" +
    "Overall Result: PASS\n" +
    "Analyst: Dr. Sarah Martinez\n" +
    "Approved By: James Wilson, Lab Director"
};

foreach (string doc in documents)
{
    // Print first line as document identifier
    string firstLine = doc.Split('\n')[0];
    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine($"Document: {firstLine}");
    Console.ResetColor();

    // Step 1: Discover schema
    extractor.SetContent(doc);
    string schema = extractor.SchemaDiscovery();
    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.WriteLine($"  Schema fields discovered");
    Console.ResetColor();

    // Step 2: Extract with discovered schema
    extractor.SetElementsFromJsonSchema(schema);
    extractor.SetContent(doc);
    TextExtractionResult result = extractor.Parse();

    // Step 3: Print results
    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.WriteLine($"  Confidence: {result.Confidence:P0}");
    Console.ResetColor();
    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.WriteLine($"  {result.Json}");
    Console.ResetColor();
    Console.WriteLine();
}
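
Beyond the confidence score, a simple structural validation is to check that every field in the discovered schema actually received a value. The sketch below compares a schema against extracted JSON and lists fields that are absent, null, or empty. It assumes the extracted JSON uses the schema's top-level property names as keys. `MissingFields` is a hypothetical helper, and both inputs are hand-written stand-ins.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hand-written stand-ins for a discovered schema and for result.Json.
string labSchema = """{ "properties": { "reportNumber": {}, "analyst": {}, "overallResult": {} } }""";
string labData = """{ "reportNumber": "LTR-2025-0392", "analyst": null, "overallResult": "PASS" }""";

Console.WriteLine(string.Join(", ", MissingFields(labSchema, labData))); // analyst

// Hypothetical helper: returns the top-level schema properties that are
// absent, null, or an empty string in the extracted data.
static List<string> MissingFields(string schemaJson, string dataJson)
{
    var missing = new List<string>();
    using JsonDocument schemaDoc = JsonDocument.Parse(schemaJson);
    using JsonDocument dataDoc = JsonDocument.Parse(dataJson);

    if (schemaDoc.RootElement.ValueKind != JsonValueKind.Object ||
        !schemaDoc.RootElement.TryGetProperty("properties", out JsonElement props) ||
        props.ValueKind != JsonValueKind.Object)
        return missing;

    JsonElement data = dataDoc.RootElement;
    foreach (JsonProperty prop in props.EnumerateObject())
    {
        bool present = data.ValueKind == JsonValueKind.Object &&
                       data.TryGetProperty(prop.Name, out JsonElement value) &&
                       value.ValueKind != JsonValueKind.Null &&
                       !(value.ValueKind == JsonValueKind.String && value.GetString() == "");
        if (!present) missing.Add(prop.Name);
    }
    return missing;
}
```

Documents with several missing fields are good candidates for manual review, or for re-running discovery with a larger model.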

When to Use Schema Discovery vs. Manual Schemas

| Scenario | Approach |
|---|---|
| Known, stable document format | Define schema manually for best accuracy |
| Many vendor formats, low volume per vendor | Discover schema from first document, save for reuse |
| Legacy archive with unknown formats | Discover and review before batch processing |
| Evolving forms (regulatory, government) | Periodically re-discover and compare to existing schema |
| Prototyping a new extraction pipeline | Discover first, then refine the schema manually |
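
For evolving forms, the "re-discover and compare" approach can be as simple as diffing field names between the saved schema and a freshly discovered one. The sketch below is one way to do that. It assumes fields are listed under a top-level "properties" object, `CompareSchemas` and `PropertyNames` are hypothetical helpers, and the two schemas are hand-written examples.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Hand-written examples: the re-discovered form has gained a "taxYear" field.
string previousSchema = """{ "properties": { "formId": {}, "filingDate": {}, "amount": {} } }""";
string rediscovered = """{ "properties": { "formId": {}, "filingDate": {}, "amount": {}, "taxYear": {} } }""";

var (added, removed) = CompareSchemas(previousSchema, rediscovered);
string addedText = string.Join(", ", added);
string removedText = string.Join(", ", removed);
Console.WriteLine($"Added: {addedText}");     // Added: taxYear
Console.WriteLine($"Removed: {removedText}"); // Removed:

// Hypothetical helper: top-level fields present only in the new or only in the old schema.
static (List<string> Added, List<string> Removed) CompareSchemas(string oldSchema, string newSchema)
{
    HashSet<string> oldNames = PropertyNames(oldSchema);
    HashSet<string> newNames = PropertyNames(newSchema);

    return (newNames.Except(oldNames).OrderBy(n => n, StringComparer.Ordinal).ToList(),
            oldNames.Except(newNames).OrderBy(n => n, StringComparer.Ordinal).ToList());
}

static HashSet<string> PropertyNames(string schemaJson)
{
    var names = new HashSet<string>(StringComparer.Ordinal);
    using JsonDocument doc = JsonDocument.Parse(schemaJson);
    if (doc.RootElement.ValueKind == JsonValueKind.Object &&
        doc.RootElement.TryGetProperty("properties", out JsonElement props) &&
        props.ValueKind == JsonValueKind.Object)
    {
        foreach (JsonProperty p in props.EnumerateObject())
            names.Add(p.Name);
    }
    return names;
}
```

Because discovery is model-driven, the same form may yield slightly different field names between runs, so a reported difference is a prompt for review rather than proof that the form changed.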

Schema discovery is a starting point. For production pipelines, review the discovered schema and adjust field names, types, and descriptions to match your data model. The discovered schema captures what the LLM finds in the document; your domain expertise ensures the output matches business requirements.
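
Refinement can be done by hand in a text editor, or programmatically when the same corrections apply to many schemas. The sketch below uses System.Text.Json.Nodes to rename a field, add a description, and correct a type. The input schema, the field names, and the use of a "description" keyword are illustrative assumptions, and `RefineSchema` is a hypothetical helper.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Nodes;

// Hand-written stand-in for a discovered schema.
string discovered = """
{
  "type": "object",
  "properties": {
    "date": { "type": "string" },
    "qty": { "type": "string" }
  }
}
""";

string refined = RefineSchema(discovered);
Console.WriteLine(refined);

// Hypothetical helper: applies a few manual corrections to a schema.
static string RefineSchema(string schemaJson)
{
    JsonObject root = JsonNode.Parse(schemaJson)!.AsObject();
    JsonObject props = root["properties"]!.AsObject();

    // Rename "date" to "orderDate": detach the node, then re-add it under the new key.
    // The renamed property moves to the end of the object.
    if (props["date"] is JsonObject dateNode)
    {
        props.Remove("date");
        dateNode["description"] = "Date the purchase order was issued";
        props["orderDate"] = dateNode;
    }

    // Correct a field the model typed as a string.
    if (props["qty"] is JsonObject qtyNode)
        qtyNode["type"] = "integer";

    return root.ToJsonString(new JsonSerializerOptions { WriteIndented = true });
}
```

Keeping corrections in code like this makes them repeatable when the schema is re-discovered later.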


Model Selection

| Model ID | VRAM | Discovery Quality | Best For |
|---|---|---|---|
| gemma3:4b | ~3.5 GB | Good | Simple documents with clear structure |
| qwen3:8b | ~6 GB | Very good | Complex documents with nested data |
| gemma3:12b | ~8 GB | Excellent | Dense documents with subtle field boundaries |
| qwen3:14b | ~10 GB | Excellent | Highest accuracy for critical schema discovery |

Schema discovery benefits from larger models because the task requires understanding both the document content and its organizational structure. Use qwen3:8b or larger for production schema discovery.
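
If the pipeline runs on machines with different GPUs, the table can be turned into a small selection rule. The sketch below picks the largest listed model whose approximate VRAM figure fits the available memory. `PickModel` is a hypothetical helper, and it treats the table's approximate figures as hard limits, which is a simplification.

```csharp
using System;

Console.WriteLine(PickModel(8.0) ?? "no listed model fits");  // gemma3:12b
Console.WriteLine(PickModel(3.0) ?? "no listed model fits");  // no listed model fits

// Hypothetical helper: returns the largest model from the table that fits,
// or null when even the smallest one does not.
static string? PickModel(double availableVramGb)
{
    // (model ID, approximate VRAM in GB) from the table above, smallest first.
    (string Id, double VramGb)[] models =
    {
        ("gemma3:4b", 3.5),
        ("qwen3:8b", 6.0),
        ("gemma3:12b", 8.0),
        ("qwen3:14b", 10.0),
    };

    string? best = null;
    foreach (var (id, vramGb) in models)
    {
        if (vramGb <= availableVramGb) best = id;
    }
    return best;
}
```

The returned ID could then be passed to `LM.LoadFromModelID` in place of the hard-coded "qwen3:8b".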


Common Issues

| Problem | Cause | Fix |
|---|---|---|
| Schema too granular (too many fields) | Model splits data into very fine fields | Merge related fields manually; use Guidance to suggest grouping |
| Schema too broad (few generic fields) | Document lacks clear structure | Provide Guidance describing expected field types |
| Nested arrays not detected | Model flattens tabular data | Use a larger model; provide Guidance like "Line items form an array" |
| Wrong field types | Model guesses incorrect types | Review and correct the schema; dates as strings are common |
| Discovery takes too long | Large document with many pages | Use SetContent(attachment, pageRange: "1-3") for schema discovery on a sample |
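
When dates come back as strings, one option is to leave the schema alone and normalize the values after extraction. The sketch below converts the two date styles that appear in this tutorial's sample documents ("February 3, 2025" and "2025-02-10") to ISO 8601. `ToIsoDate` is a hypothetical helper, and the format list is an assumption that would need extending for other layouts.

```csharp
using System;
using System.Globalization;

Console.WriteLine(ToIsoDate("February 3, 2025") ?? "unparsed");  // 2025-02-03
Console.WriteLine(ToIsoDate("2025-02-10") ?? "unparsed");        // 2025-02-10
Console.WriteLine(ToIsoDate("next Tuesday") ?? "unparsed");      // unparsed

// Hypothetical helper: returns the date as yyyy-MM-dd, or null when the
// text matches none of the known formats.
static string? ToIsoDate(string raw)
{
    string[] formats = { "MMMM d, yyyy", "yyyy-MM-dd" };

    return DateTime.TryParseExact(raw.Trim(), formats, CultureInfo.InvariantCulture,
                                  DateTimeStyles.None, out DateTime parsed)
        ? parsed.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture)
        : null;
}
```

Using exact formats with the invariant culture avoids silent misreads of ambiguous dates such as 03/02/2025; values that return null can be routed to manual review.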
