Auto-Discover Extraction Schemas from Unknown Documents
When building document processing pipelines, you usually know the document structure upfront and define an extraction schema manually. But some workflows involve documents with unpredictable or evolving structures: vendor invoices in dozens of formats, regulatory forms that change quarterly, or legacy archives with inconsistent layouts. LM-Kit.NET's SchemaDiscovery feature analyzes a document and proposes a JSON schema automatically, identifying the fields, types, and structure present in the content. This tutorial builds a schema discovery pipeline that handles unknown document types and generates extraction schemas on the fly.
Why Schema Discovery Matters
Two enterprise problems that automatic schema discovery solves:
- Heterogeneous vendor documents. A procurement department receives invoices, packing slips, and certificates from hundreds of vendors, each with unique layouts. Defining a schema for every vendor is impractical. Schema discovery analyzes a sample document and proposes the extraction schema, which can then be reviewed, refined, and reused for that vendor's documents.
- Legacy document digitization. Organizations migrating paper archives to digital systems encounter decades of forms, reports, and correspondence in varying formats. Schema discovery automates the first step of digitization: understanding what data each document type contains, without manual analysis of every format.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 4+ GB (about 6 GB for the qwen3:8b model used in the code samples) |
| Disk | ~3 GB free for model download |
Step 1: Create the Project
dotnet new console -n SchemaDiscovery
cd SchemaDiscovery
dotnet add package LM-Kit.NET
Step 2: Understand the Discovery Process
Unknown         ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
document ───►   │ SchemaDiscovery │ ──► │  Review/Refine  │ ──► │   Parse with    │
                │  (analyze doc)  │     │   (optional)    │     │   discovered    │
                │                 │     │                 │     │     schema      │
                └─────────────────┘     └─────────────────┘     └─────────────────┘
                         │                                               │
                         ▼                                               ▼
                    JSON Schema                                   Structured JSON
                 (auto-generated)                                (extracted data)
SchemaDiscovery uses the LLM to analyze the document content and propose a JSON schema describing the data fields present. The discovered schema can be used directly or refined before extraction.
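The optional review step can also be done in code. The following is a minimal sketch, assuming the discovered schema is a JSON Schema object with a top-level `properties` map. The sample schema string and the field names (`date`, `total`, `orderDate`) are hypothetical and are not actual SchemaDiscovery output. It uses only `System.Text.Json.Nodes` to rename a field and correct a type before the schema is handed to extraction.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Nodes;

// Hypothetical discovered schema; real SchemaDiscovery output may be shaped differently.
string discovered = """
{
  "type": "object",
  "properties": {
    "date":  { "type": "string" },
    "total": { "type": "string" }
  }
}
""";

JsonObject schema = JsonNode.Parse(discovered)!.AsObject();
JsonObject props = schema["properties"]!.AsObject();

// Rename "date" to "orderDate" to match the target data model.
// DeepClone (.NET 8+) avoids re-parenting a node that is still attached.
props["orderDate"] = props["date"]!.DeepClone();
props.Remove("date");

// Monetary totals should be numbers, not strings.
props["total"]!["type"] = "number";

string refined = schema.ToJsonString(new JsonSerializerOptions { WriteIndented = true });
Console.WriteLine(refined);
```

The `refined` string can then be passed wherever the tutorial passes the discovered schema, for example to `SetElementsFromJsonSchema`.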
Step 3: Discover Schema from Text
using System.Text;
using System.Text.Json;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3:8b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Analyze an unknown document
// ──────────────────────────────────────
string unknownDocument =
"PURCHASE ORDER #PO-2025-0847\n\n" +
"Date: February 3, 2025\n" +
"Ship To: Warehouse B, 1200 Industrial Parkway, Austin, TX 78701\n" +
"Bill To: Meridian Manufacturing Corp., 500 Commerce Drive, Suite 400, Dallas, TX 75201\n\n" +
"Vendor: Precision Parts Supply Co.\n" +
"Vendor Contact: Robert Chen, robert.chen@precisionparts.com\n\n" +
"Items:\n" +
" 1. Stainless Steel Bolt M10x50 (SKU: SSB-M10-50) Qty: 5000 Unit: $0.12 Total: $600.00\n" +
" 2. Hex Nut M10 Grade 8 (SKU: HN-M10-G8) Qty: 5000 Unit: $0.08 Total: $400.00\n" +
" 3. Flat Washer M10 (SKU: FW-M10) Qty: 10000 Unit: $0.03 Total: $300.00\n" +
" 4. Spring Lock Washer M10 (SKU: SLW-M10) Qty: 5000 Unit: $0.05 Total: $250.00\n\n" +
"Subtotal: $1,550.00\n" +
"Shipping: $85.00\n" +
"Tax (8.25%): $127.88\n" +
"Total: $1,762.88\n\n" +
"Payment Terms: Net 30\n" +
"Required Delivery Date: February 20, 2025\n" +
"Special Instructions: Deliver to loading dock C. Require signed proof of delivery.";
var extractor = new TextExtraction(model);
extractor.SetContent(unknownDocument);
Console.WriteLine("=== Schema Discovery ===\n");
Console.WriteLine("Analyzing document to discover extraction schema...\n");
string discoveredSchema = extractor.SchemaDiscovery();
Console.ForegroundColor = ConsoleColor.Cyan;
Console.WriteLine("Discovered JSON Schema:");
Console.ResetColor();
Console.WriteLine(discoveredSchema);
Console.WriteLine();
Step 4: Extract Data Using the Discovered Schema
Use the discovered schema directly for extraction:
using System.Text;
using System.Text.Json;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3:8b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Analyze an unknown document
// ──────────────────────────────────────
string unknownDocument =
"PURCHASE ORDER #PO-2025-0847\n\n" +
"Date: February 3, 2025\n" +
"Ship To: Warehouse B, 1200 Industrial Parkway, Austin, TX 78701\n" +
"Bill To: Meridian Manufacturing Corp., 500 Commerce Drive, Suite 400, Dallas, TX 75201\n\n" +
"Vendor: Precision Parts Supply Co.\n" +
"Vendor Contact: Robert Chen, robert.chen@precisionparts.com\n\n" +
"Items:\n" +
" 1. Stainless Steel Bolt M10x50 (SKU: SSB-M10-50) Qty: 5000 Unit: $0.12 Total: $600.00\n" +
" 2. Hex Nut M10 Grade 8 (SKU: HN-M10-G8) Qty: 5000 Unit: $0.08 Total: $400.00\n" +
" 3. Flat Washer M10 (SKU: FW-M10) Qty: 10000 Unit: $0.03 Total: $300.00\n" +
" 4. Spring Lock Washer M10 (SKU: SLW-M10) Qty: 5000 Unit: $0.05 Total: $250.00\n\n" +
"Subtotal: $1,550.00\n" +
"Shipping: $85.00\n" +
"Tax (8.25%): $127.88\n" +
"Total: $1,762.88\n\n" +
"Payment Terms: Net 30\n" +
"Required Delivery Date: February 20, 2025\n" +
"Special Instructions: Deliver to loading dock C. Require signed proof of delivery.";
var extractor = new TextExtraction(model);
extractor.SetContent(unknownDocument);
Console.WriteLine("=== Schema Discovery ===\n");
Console.WriteLine("Analyzing document to discover extraction schema...\n");
string discoveredSchema = extractor.SchemaDiscovery();
// ──────────────────────────────────────
// 3. Apply the discovered schema
// ──────────────────────────────────────
Console.WriteLine("=== Extraction with Discovered Schema ===\n");
extractor.SetElementsFromJsonSchema(discoveredSchema);
extractor.SetContent(unknownDocument);
TextExtractionResult result = extractor.Parse();
Console.ForegroundColor = ConsoleColor.Cyan;
Console.WriteLine("Extracted Data:");
Console.ResetColor();
// Pretty-print the JSON
using JsonDocument doc = JsonDocument.Parse(result.Json);
string prettyJson = JsonSerializer.Serialize(doc, new JsonSerializerOptions { WriteIndented = true });
Console.WriteLine(prettyJson);
Console.ForegroundColor = ConsoleColor.DarkGray;
Console.WriteLine($"\nConfidence: {result.Confidence:P0}");
Console.ResetColor();
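A confidence score alone does not show whether the extracted numbers are consistent. As a rough sketch, the arithmetic in the purchase order can be cross-checked after extraction. The JSON below is a hypothetical extraction result: the field names (`items`, `subtotal`, `shipping`, `tax`, `total`) are assumptions and depend on whatever schema was discovered. The values are taken from the sample document.

```csharp
using System;
using System.Linq;
using System.Text.Json;

// Hypothetical extraction output; actual field names depend on the discovered schema.
string extractedJson = """
{
  "items": [
    { "sku": "SSB-M10-50", "total": 600.00 },
    { "sku": "HN-M10-G8",  "total": 400.00 },
    { "sku": "FW-M10",     "total": 300.00 },
    { "sku": "SLW-M10",    "total": 250.00 }
  ],
  "subtotal": 1550.00,
  "shipping": 85.00,
  "tax": 127.88,
  "total": 1762.88
}
""";

using JsonDocument po = JsonDocument.Parse(extractedJson);
JsonElement root = po.RootElement;

// Use decimal so currency comparisons are exact.
decimal itemSum = root.GetProperty("items").EnumerateArray()
    .Sum(item => item.GetProperty("total").GetDecimal());
decimal subtotal = root.GetProperty("subtotal").GetDecimal();
decimal computedTotal = subtotal
    + root.GetProperty("shipping").GetDecimal()
    + root.GetProperty("tax").GetDecimal();

bool subtotalOk = itemSum == subtotal;
bool totalOk = computedTotal == root.GetProperty("total").GetDecimal();

Console.WriteLine($"Line items sum to subtotal: {subtotalOk}");
Console.WriteLine($"Subtotal + shipping + tax = total: {totalOk}");
```

Both checks pass for the sample purchase order (600 + 400 + 300 + 250 = 1,550 and 1,550 + 85 + 127.88 = 1,762.88). A failed check is a useful signal that a field was misread or mapped to the wrong schema element.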
Step 5: Schema Discovery from PDF Documents
Discover schemas from PDF files and images:
using System.Text;
using System.Text.Json;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3:8b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
var extractor = new TextExtraction(model);
Console.WriteLine("\n=== PDF Schema Discovery ===\n");
string pdfPath = "unknown_form.pdf";
if (File.Exists(pdfPath))
{
var attachment = new Attachment(pdfPath);
extractor.SetContent(attachment);
Console.Write("Analyzing PDF structure... ");
string pdfSchema = extractor.SchemaDiscovery();
Console.WriteLine("done.\n");
Console.ForegroundColor = ConsoleColor.Cyan;
Console.WriteLine("Discovered Schema from PDF:");
Console.ResetColor();
Console.WriteLine(pdfSchema);
// Save the schema for reuse
string schemaFile = Path.ChangeExtension(pdfPath, ".schema.json");
File.WriteAllText(schemaFile, pdfSchema);
Console.WriteLine($"\nSchema saved to: {schemaFile}");
}
Step 6: Async Schema Discovery
For UI applications or batch processing, use the async variant:
using System.Text;
using System.Text.Json;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3:8b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Analyze an unknown document
// ──────────────────────────────────────
string unknownDocument =
"PURCHASE ORDER #PO-2025-0847\n\n" +
"Date: February 3, 2025\n" +
"Ship To: Warehouse B, 1200 Industrial Parkway, Austin, TX 78701\n" +
"Bill To: Meridian Manufacturing Corp., 500 Commerce Drive, Suite 400, Dallas, TX 75201\n\n" +
"Vendor: Precision Parts Supply Co.\n" +
"Vendor Contact: Robert Chen, robert.chen@precisionparts.com\n\n" +
"Items:\n" +
" 1. Stainless Steel Bolt M10x50 (SKU: SSB-M10-50) Qty: 5000 Unit: $0.12 Total: $600.00\n" +
" 2. Hex Nut M10 Grade 8 (SKU: HN-M10-G8) Qty: 5000 Unit: $0.08 Total: $400.00\n" +
" 3. Flat Washer M10 (SKU: FW-M10) Qty: 10000 Unit: $0.03 Total: $300.00\n" +
" 4. Spring Lock Washer M10 (SKU: SLW-M10) Qty: 5000 Unit: $0.05 Total: $250.00\n\n" +
"Subtotal: $1,550.00\n" +
"Shipping: $85.00\n" +
"Tax (8.25%): $127.88\n" +
"Total: $1,762.88\n\n" +
"Payment Terms: Net 30\n" +
"Required Delivery Date: February 20, 2025\n" +
"Special Instructions: Deliver to loading dock C. Require signed proof of delivery.";
var extractor = new TextExtraction(model);
Console.WriteLine("\n=== Async Schema Discovery ===\n");
extractor.SetContent(unknownDocument);
string asyncSchema = await extractor.SchemaDiscoveryAsync();
Console.ForegroundColor = ConsoleColor.Cyan;
Console.WriteLine("Discovered Schema (async):");
Console.ResetColor();
Console.WriteLine(asyncSchema);
Step 7: Build a Schema Library from Document Samples
Process a folder of sample documents to build a reusable schema library:
using System.Text;
using System.Text.Json;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3:8b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
var extractor = new TextExtraction(model);
Console.WriteLine("\n=== Schema Library Builder ===\n");
string samplesFolder = "document_samples";
string schemaLibraryFolder = "schemas";
if (!Directory.Exists(samplesFolder))
{
Console.WriteLine($"Create a '{samplesFolder}' folder with sample documents, then run again.");
return;
}
Directory.CreateDirectory(schemaLibraryFolder);
string[] sampleFiles = Directory.GetFiles(samplesFolder)
.Where(f => new[] { ".pdf", ".txt", ".docx", ".png", ".jpg" }
.Contains(Path.GetExtension(f).ToLowerInvariant()))
.ToArray();
Console.WriteLine($"Analyzing {sampleFiles.Length} sample document(s)...\n");
foreach (string sampleFile in sampleFiles)
{
string fileName = Path.GetFileName(sampleFile);
Console.Write($" {fileName}... ");
try
{
var attachment = new Attachment(sampleFile);
extractor.SetContent(attachment);
string schema = extractor.SchemaDiscovery();
string schemaPath = Path.Combine(
schemaLibraryFolder,
Path.ChangeExtension(fileName, ".schema.json"));
File.WriteAllText(schemaPath, schema);
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine("schema discovered");
Console.ResetColor();
}
catch (Exception ex)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine($"failed: {ex.Message}");
Console.ResetColor();
}
}
Console.WriteLine($"\nSchemas saved to: {Path.GetFullPath(schemaLibraryFolder)}");
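Once schemas are saved, a later run needs to load them again. The sketch below is one possible way to do that, assuming the `*.schema.json` naming used above. The helper name `LoadSchemaLibrary` and the demo file names are hypothetical. It skips files that do not parse as JSON, so a truncated or hand-edited schema does not reach the extractor.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.Json;

// Load every *.schema.json file in a folder, keyed by base name.
// Files that are not valid JSON are reported and skipped.
static Dictionary<string, string> LoadSchemaLibrary(string folder)
{
    var library = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
    foreach (string path in Directory.GetFiles(folder, "*.schema.json"))
    {
        string json = File.ReadAllText(path);
        try
        {
            using JsonDocument parsed = JsonDocument.Parse(json); // validity check only
        }
        catch (JsonException)
        {
            Console.WriteLine($"Skipping invalid schema: {Path.GetFileName(path)}");
            continue;
        }
        // "purchase_order.schema.json" -> key "purchase_order"
        string key = Path.GetFileName(path)[..^".schema.json".Length];
        library[key] = json;
    }
    return library;
}

// Demo with a throwaway folder so the sketch runs on its own.
string demoFolder = Path.Combine(Path.GetTempPath(), "schemas-demo-" + Guid.NewGuid().ToString("N"));
Directory.CreateDirectory(demoFolder);
File.WriteAllText(Path.Combine(demoFolder, "purchase_order.schema.json"), "{\"type\":\"object\"}");
File.WriteAllText(Path.Combine(demoFolder, "broken.schema.json"), "{ not json");

Dictionary<string, string> library = LoadSchemaLibrary(demoFolder);
Console.WriteLine($"Loaded {library.Count} schema(s)");
Directory.Delete(demoFolder, recursive: true);
```

Note that keying by base name means two samples such as `form.pdf` and `form.png` map to the same schema file; use distinct sample names if both need their own schema.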
Step 8: Discover, Extract, and Validate Pipeline
Combine schema discovery with extraction and validation in one pipeline:
using System.Text;
using System.Text.Json;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3:8b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
var extractor = new TextExtraction(model);
Console.WriteLine("\n=== Full Discovery-to-Extraction Pipeline ===\n");
string[] documents =
{
// A shipping manifest
"SHIPPING MANIFEST #SM-20250205\n" +
"Origin: Shanghai Port, China\n" +
"Destination: Los Angeles Port, USA\n" +
"Vessel: MV Pacific Star\n" +
"ETD: 2025-02-10 ETA: 2025-03-05\n" +
"Container: MSCU-4521897\n" +
"Weight: 18,500 kg Volume: 33.2 CBM\n" +
"Contents: Industrial machinery parts (HS Code: 8483.40)\n" +
"Shipper: Donghua Manufacturing Ltd.\n" +
"Consignee: Pacific Industrial Corp.\n" +
"Insurance Value: $245,000 USD",
// A lab test report
"LABORATORY TEST REPORT\n" +
"Report #: LTR-2025-0392\n" +
"Test Date: February 4, 2025\n" +
"Sample ID: WQ-2025-0084\n" +
"Sample Type: Municipal Water Supply\n" +
"Collection Point: Main Distribution Line, Station 14\n" +
"pH Level: 7.2 (Pass, range: 6.5-8.5)\n" +
"Turbidity: 0.8 NTU (Pass, max: 1.0 NTU)\n" +
"Total Coliform: 0 CFU/100mL (Pass, max: 0)\n" +
"Lead: <0.001 mg/L (Pass, max: 0.015 mg/L)\n" +
"Chlorine Residual: 0.6 mg/L (Pass, range: 0.2-4.0 mg/L)\n" +
"Overall Result: PASS\n" +
"Analyst: Dr. Sarah Martinez\n" +
"Approved By: James Wilson, Lab Director"
};
foreach (string doc in documents)
{
// Print first line as document identifier
string firstLine = doc.Split('\n')[0];
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine($"Document: {firstLine}");
Console.ResetColor();
// Step 1: Discover schema
extractor.SetContent(doc);
string schema = extractor.SchemaDiscovery();
Console.ForegroundColor = ConsoleColor.DarkGray;
Console.WriteLine(" Schema fields discovered");
Console.ResetColor();
// Step 2: Extract with discovered schema
extractor.SetElementsFromJsonSchema(schema);
extractor.SetContent(doc);
TextExtractionResult result = extractor.Parse();
// Step 3: Print results
Console.ForegroundColor = ConsoleColor.Cyan;
Console.WriteLine($" Confidence: {result.Confidence:P0}");
Console.ResetColor();
Console.ForegroundColor = ConsoleColor.DarkGray;
Console.WriteLine($" {result.Json}");
Console.ResetColor();
Console.WriteLine();
}
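The loop above prints the confidence but does not act on it. A validation gate could look like the following sketch, which assumes the extraction result is a flat JSON object and that the confidence is a value between 0 and 1. The helper name `Validate`, the 0.80 threshold, and the field names are illustrative assumptions, not part of LM-Kit.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Returns the list of problems found; an empty list means the result passes.
static List<string> Validate(string json, double confidence, double minConfidence, string[] requiredFields)
{
    var problems = new List<string>();
    if (confidence < minConfidence)
        problems.Add($"confidence {confidence:F2} below threshold {minConfidence:F2}");

    using JsonDocument parsed = JsonDocument.Parse(json);
    foreach (string field in requiredFields)
    {
        // A field counts as present only if it exists and is not null or blank.
        bool present = parsed.RootElement.TryGetProperty(field, out JsonElement value)
            && value.ValueKind != JsonValueKind.Null
            && !(value.ValueKind == JsonValueKind.String && string.IsNullOrWhiteSpace(value.GetString()));
        if (!present)
            problems.Add($"missing required field '{field}'");
    }
    return problems;
}

// Hypothetical extraction output for the lab report; "analyst" came back empty.
string labJson = """
{ "reportNumber": "LTR-2025-0392", "overallResult": "PASS", "analyst": "" }
""";

List<string> issues = Validate(labJson, confidence: 0.91, minConfidence: 0.80,
    requiredFields: new[] { "reportNumber", "overallResult", "analyst" });

foreach (string issue in issues)
    Console.WriteLine(issue);
```

In the pipeline, `result.Json` and `result.Confidence` would take the place of the hard-coded values, and documents with a non-empty problem list could be routed to manual review.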
When to Use Schema Discovery vs. Manual Schemas
| Scenario | Approach |
|---|---|
| Known, stable document format | Define schema manually for best accuracy |
| Many vendor formats, low volume per vendor | Discover schema from first document, save for reuse |
| Legacy archive with unknown formats | Discover and review before batch processing |
| Evolving forms (regulatory, government) | Periodically re-discover and compare to existing schema |
| Prototyping a new extraction pipeline | Discover first, then refine the schema manually |
Schema discovery is a starting point. For production pipelines, review the discovered schema and adjust field names, types, and descriptions to match your data model. The discovered schema captures what the LLM finds in the document; your domain expertise ensures the output matches business requirements.
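For the "re-discover and compare" row, the comparison can be automated. This is a small sketch under the assumption that both schemas are JSON Schema objects with a top-level `properties` map; it compares top-level field names only and ignores nested changes. The two sample schemas and their field names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

// Collect the top-level property names of a JSON Schema object.
static HashSet<string> PropertyNames(string schemaJson)
{
    using JsonDocument parsed = JsonDocument.Parse(schemaJson);
    var names = new HashSet<string>(StringComparer.Ordinal);
    if (parsed.RootElement.ValueKind == JsonValueKind.Object
        && parsed.RootElement.TryGetProperty("properties", out JsonElement props)
        && props.ValueKind == JsonValueKind.Object)
    {
        foreach (JsonProperty p in props.EnumerateObject())
            names.Add(p.Name);
    }
    return names;
}

// Hypothetical schema saved last quarter.
string savedSchema = """
{ "type": "object", "properties": {
    "formId": { "type": "string" },
    "filingDate": { "type": "string" },
    "faxNumber": { "type": "string" } } }
""";

// Hypothetical schema discovered from this quarter's form.
string rediscovered = """
{ "type": "object", "properties": {
    "formId": { "type": "string" },
    "filingDate": { "type": "string" },
    "taxYear": { "type": "integer" } } }
""";

HashSet<string> before = PropertyNames(savedSchema);
HashSet<string> after = PropertyNames(rediscovered);

string[] added = after.Except(before).OrderBy(n => n, StringComparer.Ordinal).ToArray();
string[] removed = before.Except(after).OrderBy(n => n, StringComparer.Ordinal).ToArray();

Console.WriteLine($"Added:   {string.Join(", ", added)}");
Console.WriteLine($"Removed: {string.Join(", ", removed)}");
```

Because discovery is LLM-driven, two runs on the same form may name a field differently, so a reported difference is a prompt for review rather than proof that the form changed.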
Model Selection
| Model ID | VRAM | Discovery Quality | Best For |
|---|---|---|---|
| gemma3:4b | ~3.5 GB | Good | Simple documents with clear structure |
| qwen3:8b | ~6 GB | Very good | Complex documents with nested data |
| gemma3:12b | ~8 GB | Excellent | Dense documents with subtle field boundaries |
| qwen3:14b | ~10 GB | Excellent | Highest accuracy for critical schema discovery |
Schema discovery benefits from larger models because the task requires understanding both the document content and its organizational structure. Use qwen3:8b or larger for production schema discovery.
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| Schema too granular (too many fields) | Model splits data into very fine fields | Merge related fields manually; use Guidance to suggest grouping |
| Schema too broad (few generic fields) | Document lacks clear structure | Provide Guidance describing expected field types |
| Nested arrays not detected | Model flattens tabular data | Use a larger model; provide Guidance like "Line items form an array" |
| Wrong field types | Model guesses incorrect types | Review and correct the schema; dates as strings are common |
| Discovery takes too long | Large document with many pages | Use SetContent(attachment, pageRange: "1-3") for schema discovery on a sample |
Next Steps
- Extract Structured Data from Unstructured Text: manual schema definition for known document types.
- Build a Classification and Extraction Pipeline: classify first, then extract with type-specific schemas.
- Extract Invoice Data from PDFs and Images: specialized invoice extraction with predefined schemas.
- Build a Self-Healing Extraction Pipeline with Fallbacks: add retry and fallback strategies to extraction.