Extract Invoice Data from PDFs and Images
Accounts payable teams receive invoices in every format: scanned PDFs, email attachments, photos of paper invoices. Extracting vendor names, amounts, dates, and line items by hand is slow and error-prone. LM-Kit.NET's TextExtraction class, driven by a schema you define, pulls structured invoice data from PDFs and images. This tutorial builds an invoice extractor that handles PDFs, images, and batch processing.
Why Local Invoice Extraction Matters
On-device extraction addresses two enterprise concerns:
- Financial documents stay private. Invoices contain bank details, payment terms, and vendor relationships. Processing them through a cloud extraction API means a third party sees your financial data. Local extraction keeps every invoice on your infrastructure.
- Extraction fits into existing AP workflows. Extracted data feeds directly into your ERP, accounting system, or approval workflow. With no external API dependency, there are no third-party outages or rate limits to plan around, and throughput stays consistent during end-of-month processing spikes.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 4+ GB |
| Disk | ~3 GB free for model download |
Step 1: Create the Project
dotnet new console -n InvoiceQuickstart
cd InvoiceQuickstart
dotnet add package LM-Kit.NET
Step 2: Define the Invoice Schema
TextExtraction uses a schema to decide which fields to pull. The code below goes in Program.cs: it loads the model, then defines the schema in code:
using System.Text;
using System.Text.Json;
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Define invoice extraction schema
// ──────────────────────────────────────
var extractor = new TextExtraction(model)
{
Title = "Invoice",
Description = "Extract structured data from an invoice document.",
NullOnDoubt = true,
Elements = new List<TextExtractionElement>
{
new("invoice_number", TextExtractionElement.ElementType.String,
"The unique invoice identifier", isRequired: true),
new("invoice_date", TextExtractionElement.ElementType.String,
"The date the invoice was issued (YYYY-MM-DD format)", isRequired: true),
new("due_date", TextExtractionElement.ElementType.String,
"The payment due date (YYYY-MM-DD format)"),
new("vendor_name", TextExtractionElement.ElementType.String,
"The name of the company or person issuing the invoice", isRequired: true),
new("vendor_address", TextExtractionElement.ElementType.String,
"The full postal address of the vendor"),
new("customer_name", TextExtractionElement.ElementType.String,
"The name of the customer or recipient"),
new("subtotal", TextExtractionElement.ElementType.Number,
"The total amount before tax"),
new("tax_amount", TextExtractionElement.ElementType.Number,
"The tax amount"),
new("total_amount", TextExtractionElement.ElementType.Number,
"The total amount due including tax", isRequired: true),
new("currency", TextExtractionElement.ElementType.String,
"The currency code (e.g., USD, EUR, GBP)"),
new("line_items", new List<TextExtractionElement>
{
new("description", TextExtractionElement.ElementType.String,
"Description of the item or service"),
new("quantity", TextExtractionElement.ElementType.Number,
"Quantity of the item"),
new("unit_price", TextExtractionElement.ElementType.Number,
"Price per unit"),
new("amount", TextExtractionElement.ElementType.Number,
"Total amount for this line item")
}, isArray: true, description: "Individual line items on the invoice")
}
};
Step 3: Extract from a PDF Invoice
string invoicePath = "invoice_sample.pdf";
var attachment = new Attachment(invoicePath);
extractor.SetContent(attachment);
Console.WriteLine($"Extracting data from {Path.GetFileName(invoicePath)}...\n");
TextExtractionResult result = extractor.Parse();
// Access individual fields
Console.WriteLine($"Invoice #: {result.GetValue<string>("invoice_number")}");
Console.WriteLine($"Date: {result.GetValue<string>("invoice_date")}");
Console.WriteLine($"Vendor: {result.GetValue<string>("vendor_name")}");
Console.WriteLine($"Total: {result.GetValue<double>("total_amount")}");
Console.WriteLine($"Currency: {result.GetValue<string>("currency")}");
Console.WriteLine($"Confidence: {result.Confidence:P0}\n");
// Access line items
Console.WriteLine("Line items:");
foreach (var item in result.EnumerateAt("line_items"))
{
string desc = item["description"]?.Value?.ToString() ?? "N/A";
object qty = item["quantity"]?.Value ?? "N/A";
object amount = item["amount"]?.Value ?? "N/A";
Console.WriteLine($" {desc} (qty: {qty}, amount: {amount})");
}
Step 4: Extract from an Invoice Image
Process photos and scanned images of invoices:
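// Place this directive at the top of Program.cs with the other using directives;
// C# does not allow using directives after executable statements.
using LMKit.Graphics;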
string imagePath = "invoice_photo.jpg";
var image = new ImageBuffer(imagePath);
extractor.SetContent(image);
TextExtractionResult imageResult = extractor.Parse();
Console.WriteLine("Invoice from image:");
Console.WriteLine($" Invoice #: {imageResult.GetValue<string>("invoice_number")}");
Console.WriteLine($" Vendor: {imageResult.GetValue<string>("vendor_name")}");
Console.WriteLine($" Total: {imageResult.GetValue<double>("total_amount")}");
Console.WriteLine($" Confidence: {imageResult.Confidence:P0}");
Step 5: Get the Full JSON Output
TextExtraction produces grammar-constrained JSON that matches your schema exactly:
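extractor.SetContent(new Attachment("invoice_sample.pdf"));
// Named jsonResult because Step 3 already declares `result` in the same file
TextExtractionResult jsonResult = extractor.Parse();
// Get raw JSON
string json = jsonResult.Json;
Console.WriteLine("Raw JSON output:\n");
Console.WriteLine(json);
// Parse with System.Text.Json for further processing
using JsonDocument doc = jsonResult.JsonDocument;
JsonElement root = doc.RootElement;
// With NullOnDoubt enabled the field can be JSON null, so check its kind before reading it
if (root.TryGetProperty("total_amount", out JsonElement totalElement)
&& totalElement.ValueKind == JsonValueKind.Number)
{
double total = totalElement.GetDouble();
// N2 avoids printing the machine's local currency symbol next to a foreign-currency total
Console.WriteLine($"\nTotal for approval: {total:N2}");
}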
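Nested arrays such as line_items can be read from the raw JSON in the same way. The following is a minimal sketch that uses only System.Text.Json and assumes the output follows the Step 2 schema. The sample JSON is hypothetical and stands in for the extractor's output, and the helper names ReadNumber and SumLineItems are illustrative, not part of LM-Kit.NET.

```csharp
using System;
using System.Text.Json;

// Hypothetical sample standing in for the JSON produced by the extractor.
string sampleJson = """
{
  "subtotal": 150.00,
  "tax_amount": 30.00,
  "total_amount": 180.00,
  "line_items": [
    { "description": "Consulting", "quantity": 2, "unit_price": 50.00, "amount": 100.00 },
    { "description": "Hosting", "quantity": 1, "unit_price": 50.00, "amount": 50.00 }
  ]
}
""";

// Reads a numeric property; returns null when it is absent, JSON null, or not a number.
static decimal? ReadNumber(JsonElement obj, string name) =>
    obj.ValueKind == JsonValueKind.Object
    && obj.TryGetProperty(name, out JsonElement v)
    && v.ValueKind == JsonValueKind.Number
        ? v.GetDecimal()
        : (decimal?)null;

// Sums the "amount" of every line item; items without a numeric amount count as zero.
static decimal SumLineItems(JsonElement root)
{
    decimal sum = 0m;
    if (root.ValueKind == JsonValueKind.Object
        && root.TryGetProperty("line_items", out JsonElement items)
        && items.ValueKind == JsonValueKind.Array)
    {
        foreach (JsonElement item in items.EnumerateArray())
            sum += ReadNumber(item, "amount") ?? 0m;
    }
    return sum;
}

using JsonDocument sampleDoc = JsonDocument.Parse(sampleJson);
decimal lineTotal = SumLineItems(sampleDoc.RootElement);
decimal? subtotal = ReadNumber(sampleDoc.RootElement, "subtotal");

// Allow one cent of rounding slack when comparing the two figures.
bool consistent = subtotal.HasValue && Math.Abs(subtotal.Value - lineTotal) <= 0.01m;
Console.WriteLine($"Line items sum to {lineTotal}; matches subtotal: {consistent}");
```

A mismatch between the line-item sum and the subtotal is a cheap signal that an invoice may need manual review before approval.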
Step 6: Batch Invoice Processing
Process a folder of invoices and export structured data:
string[] invoiceFiles = Directory.GetFiles("invoices")
.Where(f => new[] { ".pdf", ".png", ".jpg", ".jpeg", ".tiff" }
.Contains(Path.GetExtension(f).ToLowerInvariant()))
.ToArray();
var output = new List<string>();
output.Add("file,invoice_number,vendor,date,total,currency,confidence");
Console.WriteLine($"Processing {invoiceFiles.Length} invoices...\n");
foreach (string file in invoiceFiles)
{
string fileName = Path.GetFileName(file);
Console.Write($" {fileName}... ");
extractor.SetContent(new Attachment(file));
TextExtractionResult r = extractor.Parse();
string invoiceNum = r.GetValue<string>("invoice_number") ?? "N/A";
string vendor = r.GetValue<string>("vendor_name") ?? "N/A";
string date = r.GetValue<string>("invoice_date") ?? "N/A";
double total = r.GetValue<double>("total_amount");
string currency = r.GetValue<string>("currency") ?? "N/A";
Console.WriteLine($"#{invoiceNum} from {vendor}: {total} {currency}");
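// Invariant formatting keeps '.' as the decimal separator, so numeric columns never contain a comma
output.Add(FormattableString.Invariant($"\"{fileName}\",\"{invoiceNum}\",\"{vendor}\",\"{date}\",{total},\"{currency}\",{r.Confidence:F2}"));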
}
File.WriteAllLines("invoice_data.csv", output);
Console.WriteLine($"\nExported {invoiceFiles.Length} invoices to invoice_data.csv");
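Vendor names and invoice numbers can contain double quotes, which would break a hand-built CSV row. The sketch below shows one way to quote text fields following the RFC 4180 convention of doubling embedded quotes. CsvText and CsvNumber are hypothetical helpers, and the values are made-up stand-ins for one extracted invoice.

```csharp
using System;
using System.Globalization;

// Quotes a text field: wrap it in double quotes and double any embedded quote.
static string CsvText(string? value) =>
    "\"" + (value ?? string.Empty).Replace("\"", "\"\"") + "\"";

// Formats a number with the invariant culture so the decimal separator is always '.'.
static string CsvNumber(double value, string format = "0.##") =>
    value.ToString(format, CultureInfo.InvariantCulture);

// Hypothetical values standing in for one extracted invoice.
string row = string.Join(",",
    CsvText("inv-001.pdf"),
    CsvText("INV-001"),
    CsvText("Acme \"Tools\", Inc."),
    CsvText("2024-03-15"),
    CsvNumber(1234.5),
    CsvText("USD"),
    CsvNumber(0.873, "F2"));

Console.WriteLine(row);
```

In the batch loop, each interpolated field could be passed through helpers like these before the row is added to the output list.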
Step 7: Schema from JSON
Instead of defining elements in code, describe the schema in a JSON file (invoice_schema.json in this example) and load it at runtime:
{
"title": "Invoice",
"description": "Extract invoice data",
"type": "object",
"properties": {
"invoice_number": { "type": "string", "description": "Invoice ID" },
"vendor_name": { "type": "string", "description": "Issuing company" },
"total_amount": { "type": "number", "description": "Total due" },
"line_items": {
"type": "array",
"items": {
"type": "object",
"properties": {
"description": { "type": "string" },
"amount": { "type": "number" }
}
}
}
},
"required": ["invoice_number", "vendor_name", "total_amount"]
}
Load in code:
string schemaJson = File.ReadAllText("invoice_schema.json");
var extractor = new TextExtraction(model);
extractor.SetElementsFromJsonSchema(schemaJson);
extractor.SetContent(new Attachment("invoice.pdf"));
TextExtractionResult result = extractor.Parse();
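Because the schema now lives in a hand-edited file, a typo in the "required" list is easy to introduce. The following sketch, using only System.Text.Json, checks that every required name is also defined under "properties" before the schema is handed to the extractor. FindUndefinedRequired is a hypothetical helper, and the inline schema is a deliberately broken example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Returns the names listed under "required" that have no entry under "properties".
static List<string> FindUndefinedRequired(string schemaJson)
{
    var missing = new List<string>();
    using JsonDocument parsed = JsonDocument.Parse(schemaJson);
    JsonElement schemaRoot = parsed.RootElement;

    // A schema without a "required" array has nothing to check.
    if (schemaRoot.ValueKind != JsonValueKind.Object
        || !schemaRoot.TryGetProperty("required", out JsonElement required)
        || required.ValueKind != JsonValueKind.Array)
        return missing;

    bool hasProps = schemaRoot.TryGetProperty("properties", out JsonElement props)
                    && props.ValueKind == JsonValueKind.Object;

    foreach (JsonElement entry in required.EnumerateArray())
    {
        if (entry.ValueKind != JsonValueKind.String) continue;
        string key = entry.GetString()!;
        if (!hasProps || !props.TryGetProperty(key, out _))
            missing.Add(key);
    }
    return missing;
}

// Deliberately broken example: "total_amount" is required but never defined.
string brokenSchema = """
{
  "type": "object",
  "properties": { "invoice_number": { "type": "string" } },
  "required": ["invoice_number", "total_amount"]
}
""";

List<string> undefinedNames = FindUndefinedRequired(brokenSchema);
Console.WriteLine(undefinedNames.Count == 0
    ? "Schema OK"
    : "Undefined required fields: " + string.Join(", ", undefinedNames));
```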
Common Issues
| Problem | Cause | Fix |
|---|---|---|
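| Missing field values (null) | Field absent from the document, or the model was unsure of the value | Check NullOnDoubt: when true, uncertain fields return null. Set it to false to force a value, and review the result because a forced value may be a guess |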
| Wrong date format | Model uses local format | Add format hint in element description: "YYYY-MM-DD format" |
| Line items not extracted | Schema mismatch or complex table layout | Use a larger model; add Guidance describing the table structure |
| Low confidence on scanned PDFs | Poor image quality | Set extractor.OcrEngine to a configured OCR engine for pre-processing |
| Slow on multi-page invoices | Processing all pages | Use SetContent(attachment, pageRange: "1-2") to limit to relevant pages |
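The date-format hint in the table is a request to the model, not a guarantee, so it can be worth checking extracted dates before they reach an accounting system. A minimal sketch, assuming dates are extracted as strings as in the Step 2 schema. TryParseIsoDate is a hypothetical helper and the sample values are made up.

```csharp
using System;
using System.Globalization;

// Returns true only when the value is a real calendar date written as YYYY-MM-DD.
static bool TryParseIsoDate(string? value, out DateOnly date)
{
    date = default;
    return !string.IsNullOrWhiteSpace(value)
           && DateOnly.TryParseExact(value.Trim(), "yyyy-MM-dd",
               CultureInfo.InvariantCulture, DateTimeStyles.None, out date);
}

// Hypothetical extracted values: one conforming, one in a local format, one missing.
string?[] extractedDates = { "2024-03-15", "15/03/2024", null };
foreach (string? raw in extractedDates)
{
    string verdict = TryParseIsoDate(raw, out _) ? "ok" : "needs review";
    Console.WriteLine($"{raw ?? "(null)"}: {verdict}");
}
```

Values that fail the check can be routed to manual review instead of being written to the export.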
Next Steps
- Extract Structured Data from Unstructured Text: general-purpose schema-driven extraction.
- Convert Documents to Markdown with VLM OCR: convert documents before extraction.
- Automatically Split Multi-Document PDFs with AI Vision: isolate individual invoices from bulk-scanned PDFs before extraction.
- Process PDFs and Images with Built-In Document Tools: use PdfSplit, DocumentText, and OCR tools in agent workflows.
- Samples: Invoice Data Extraction: invoice extraction demo.
- Samples: Structured Data Extraction: structured extraction demo.