Extract Invoice Data from PDFs and Images

Accounts payable teams process invoices that arrive in many formats: scanned PDFs, email attachments, photos of paper invoices. Extracting vendor names, amounts, dates, and line items by hand is slow and error-prone. LM-Kit.NET's TextExtraction class uses a schema to pull structured invoice data from PDFs and images. This tutorial builds an invoice extractor that handles PDFs, images, and batch processing.

Why Local Invoice Extraction Matters

Two enterprise problems that on-device extraction solves:

Financial documents stay private. Invoices contain bank details, payment terms, and vendor relationships. Processing them through a cloud extraction API means a third party sees your financial data. Local extraction keeps every invoice on your infrastructure.
Integrate into existing AP workflows. Extracted data feeds directly into your ERP, accounting system, or approval workflow. With no external API dependency, there are no third-party outages or rate limits to plan around, and throughput depends only on your own hardware during end-of-month processing spikes.

Prerequisites

Requirement	Minimum
.NET SDK	8.0+
VRAM	4+ GB
Disk	~3 GB free for model download

Step 1: Create the Project

dotnet new console -n InvoiceQuickstart
cd InvoiceQuickstart
dotnet add package LM-Kit.NET

Step 2: Define the Invoice Schema

TextExtraction uses a schema to know what fields to pull. Define the schema in code:

using System.Text;
using System.Text.Json;
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Define invoice extraction schema
// ──────────────────────────────────────
var extractor = new TextExtraction(model)
{
    Title = "Invoice",
    Description = "Extract structured data from an invoice document.",
    NullOnDoubt = true,
    Elements = new List<TextExtractionElement>
    {
        new("invoice_number", ElementType.String,
            "The unique invoice identifier", isRequired: true),
        new("invoice_date", ElementType.String,
            "The date the invoice was issued (YYYY-MM-DD format)", isRequired: true),
        new("due_date", ElementType.String,
            "The payment due date (YYYY-MM-DD format)"),
        new("vendor_name", ElementType.String,
            "The name of the company or person issuing the invoice", isRequired: true),
        new("vendor_address", ElementType.String,
            "The full postal address of the vendor"),
        new("customer_name", ElementType.String,
            "The name of the customer or recipient"),
        new("subtotal", ElementType.Double,
            "The total amount before tax"),
        new("tax_amount", ElementType.Double,
            "The tax amount"),
        new("total_amount", ElementType.Double,
            "The total amount due including tax", isRequired: true),
        new("currency", ElementType.String,
            "The currency code (e.g., USD, EUR, GBP)"),
        new("line_items", new List<TextExtractionElement>
        {
            new("description", ElementType.String,
                "Description of the item or service"),
            new("quantity", ElementType.Double,
                "Quantity of the item"),
            new("unit_price", ElementType.Double,
                "Price per unit"),
            new("amount", ElementType.Double,
                "Total amount for this line item")
        }, isArray: true, description: "Individual line items on the invoice")
    }
};
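
The schema asks for dates in YYYY-MM-DD form, but the fields are plain strings, so a model can still return another layout. The sketch below shows one way to validate and normalize such a value with the .NET base library only. `NormalizeDate` is a hypothetical helper, not part of LM-Kit.NET, and the list of fallback formats is an assumption to adapt to the invoices you actually receive.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not an LM-Kit.NET API): returns an ISO date string,
// or null when the input cannot be parsed with any of the listed formats.
static string? NormalizeDate(string? raw)
{
    if (string.IsNullOrWhiteSpace(raw)) return null;

    // Assumed fallback formats. Ambiguous day/month orders such as 03/04/2024
    // are deliberately left out so they end up in manual review instead.
    string[] formats = { "yyyy-MM-dd", "yyyy/MM/dd", "d MMM yyyy", "d MMMM yyyy", "MMMM d, yyyy" };

    return DateOnly.TryParseExact(raw.Trim(), formats, CultureInfo.InvariantCulture,
                                  DateTimeStyles.None, out DateOnly date)
        ? date.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture)
        : null;
}

Console.WriteLine(NormalizeDate("2024-03-15") ?? "(unparseable)");
Console.WriteLine(NormalizeDate("March 5, 2024") ?? "(unparseable)");
Console.WriteLine(NormalizeDate("next Tuesday") ?? "(unparseable)");
```

Returning null rather than guessing keeps the behavior in line with NullOnDoubt: a date that cannot be read with certainty is flagged, not invented.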

Step 3: Extract from a PDF Invoice

string invoicePath = "invoice_sample.pdf";
var attachment = new Attachment(invoicePath);

extractor.SetContent(attachment);

Console.WriteLine($"Extracting data from {Path.GetFileName(invoicePath)}...\n");

TextExtractionResult result = extractor.Parse();

// Access individual fields
Console.WriteLine($"Invoice #:   {result.GetValue<string>("invoice_number")}");
Console.WriteLine($"Date:        {result.GetValue<string>("invoice_date")}");
Console.WriteLine($"Vendor:      {result.GetValue<string>("vendor_name")}");
Console.WriteLine($"Total:       {result.GetValue<double>("total_amount")}");
Console.WriteLine($"Currency:    {result.GetValue<string>("currency")}");
Console.WriteLine($"Confidence:  {result.Confidence:P0}\n");

// Access line items
Console.WriteLine("Line items:");
foreach (var item in result.EnumerateAt("line_items"))
{
    string desc = item["description"]?.Value?.ToString() ?? "N/A";
    object qty = item["quantity"]?.Value ?? "N/A";
    object amount = item["amount"]?.Value ?? "N/A";
    Console.WriteLine($"  {desc} (qty: {qty}, amount: {amount})");
}

Step 4: Extract from an Invoice Image

Process photos and scanned images of invoices:

using LMKit.Graphics;
using LMKit.Media.Image;

string imagePath = "invoice_photo.jpg";
using var image = ImageBuffer.LoadAsRGB(imagePath);

extractor.SetContent(image);

TextExtractionResult imageResult = extractor.Parse();

Console.WriteLine("Invoice from image:");
Console.WriteLine($"  Invoice #: {imageResult.GetValue<string>("invoice_number")}");
Console.WriteLine($"  Vendor:    {imageResult.GetValue<string>("vendor_name")}");
Console.WriteLine($"  Total:     {imageResult.GetValue<double>("total_amount")}");
Console.WriteLine($"  Confidence: {imageResult.Confidence:P0}");

Step 5: Get the Full JSON Output

TextExtraction produces grammar-constrained JSON that matches your schema exactly:

extractor.SetContent(new Attachment("invoice_sample.pdf"));
TextExtractionResult result = extractor.Parse();

// Get raw JSON
string json = result.Json;
Console.WriteLine("Raw JSON output:\n");
Console.WriteLine(json);

// Parse with System.Text.Json for further processing
using JsonDocument doc = result.JsonDocument;
JsonElement root = doc.RootElement;

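// total_amount can be null when NullOnDoubt is true, so check the kind before reading it
if (root.TryGetProperty("total_amount", out JsonElement totalElement) &&
    totalElement.ValueKind == JsonValueKind.Number)
{
    double total = totalElement.GetDouble();
    // Print a plain number: the C format would apply the machine's currency symbol, not the invoice's
    Console.WriteLine($"\nTotal for approval: {total:N2}");
}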
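
Because the output is JSON that follows the schema, simple arithmetic cross-checks can catch a misread digit before the invoice reaches approval. The following is a minimal sketch using System.Text.Json only. `ValidateInvoice` is a hypothetical helper, the 0.01 tolerance is an assumption, and the sample JSON is hand-written for illustration, not real extractor output.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper: returns human-readable problems found in an extracted invoice.
static List<string> ValidateInvoice(string json, double tolerance = 0.01)
{
    var problems = new List<string>();
    using JsonDocument doc = JsonDocument.Parse(json);
    JsonElement root = doc.RootElement;

    // Reads a numeric property, treating missing and null values as absent.
    static double? Num(JsonElement obj, string name) =>
        obj.ValueKind == JsonValueKind.Object &&
        obj.TryGetProperty(name, out JsonElement e) &&
        e.ValueKind == JsonValueKind.Number
            ? e.GetDouble()
            : (double?)null;

    double? subtotal = Num(root, "subtotal");
    double? tax = Num(root, "tax_amount");
    double? total = Num(root, "total_amount");

    if (total is null) problems.Add("total_amount is missing");

    // Check 1: subtotal plus tax should equal the total.
    if (subtotal is not null && tax is not null && total is not null &&
        Math.Abs(subtotal.Value + tax.Value - total.Value) > tolerance)
        problems.Add("subtotal + tax_amount does not equal total_amount");

    // Check 2: line item amounts should add up to the subtotal.
    if (subtotal is not null &&
        root.TryGetProperty("line_items", out JsonElement items) &&
        items.ValueKind == JsonValueKind.Array)
    {
        double sum = 0;
        foreach (JsonElement item in items.EnumerateArray())
            sum += Num(item, "amount") ?? 0;
        if (Math.Abs(sum - subtotal.Value) > tolerance)
            problems.Add("line item amounts do not add up to subtotal");
    }

    return problems;
}

// Hand-written sample with a deliberately wrong total (100 + 20 is not 125).
string sample =
    "{ \"subtotal\": 100.0, \"tax_amount\": 20.0, \"total_amount\": 125.0, " +
    "\"line_items\": [ { \"amount\": 60.0 }, { \"amount\": 40.0 } ] }";

foreach (string problem in ValidateInvoice(sample))
    Console.WriteLine(problem);
```

In a real pipeline the string passed to `ValidateInvoice` would be the raw JSON from the extraction result. Invoices with discounts or shipping lines need additional rules, so treat a reported problem as a reason for review, not as proof of an extraction error.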

Step 6: Batch Invoice Processing

Process a folder of invoices and export structured data:

string[] invoiceFiles = Directory.GetFiles("invoices")
    .Where(f => new[] { ".pdf", ".png", ".jpg", ".jpeg", ".tiff" }
        .Contains(Path.GetExtension(f).ToLowerInvariant()))
    .ToArray();

var output = new List<string>();
output.Add("file,invoice_number,vendor,date,total,currency,confidence");

Console.WriteLine($"Processing {invoiceFiles.Length} invoices...\n");

foreach (string file in invoiceFiles)
{
    string fileName = Path.GetFileName(file);
    Console.Write($"  {fileName}... ");

    extractor.SetContent(new Attachment(file));
    TextExtractionResult r = extractor.Parse();

    string invoiceNum = r.GetValue<string>("invoice_number") ?? "N/A";
    string vendor = r.GetValue<string>("vendor_name") ?? "N/A";
    string date = r.GetValue<string>("invoice_date") ?? "N/A";
    double total = r.GetValue<double>("total_amount");
    string currency = r.GetValue<string>("currency") ?? "N/A";

    Console.WriteLine($"#{invoiceNum} from {vendor}: {total} {currency}");

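    // Escape embedded quotes and format numbers with the invariant culture,
    // so the CSV stays parseable on machines that use a comma as decimal separator
    static string Csv(string s) => "\"" + s.Replace("\"", "\"\"") + "\"";
    output.Add(FormattableString.Invariant(
        $"{Csv(fileName)},{Csv(invoiceNum)},{Csv(vendor)},{Csv(date)},{total},{Csv(currency)},{r.Confidence:F2}"));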
}

File.WriteAllLines("invoice_data.csv", output);
Console.WriteLine($"\nExported {invoiceFiles.Length} invoices to invoice_data.csv");
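
The batch loop records a confidence value for every invoice, which makes it straightforward to send uncertain results to a person instead of straight into the ERP. The sketch below shows one possible routing rule. `Route`, the queue names, and the 0.80 threshold are all assumptions for illustration; tune the threshold against invoices you have checked by hand.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical routing rule (not an LM-Kit.NET API).
// Confidence is expected on a 0..1 scale; the 0.80 default is an assumed starting point.
static string Route(string? invoiceNumber, double? total, double confidence, double threshold = 0.80)
{
    // Missing required fields always go to a person, whatever the confidence.
    if (string.IsNullOrWhiteSpace(invoiceNumber)) return "review";

    // A missing or non-positive total is treated as a failed extraction.
    if (total is null || total <= 0) return "review";

    return confidence >= threshold ? "auto" : "review";
}

// Hand-written sample rows standing in for batch results.
var extracted = new List<(string File, string? Number, double? Total, double Confidence)>
{
    ("a.pdf", "INV-001", 1250.00, 0.93),
    ("b.jpg", "INV-002", 310.40, 0.61),
    ("c.pdf", null, 99.00, 0.97),
};

foreach (var row in extracted)
    Console.WriteLine($"{row.File}: {Route(row.Number, row.Total, row.Confidence)}");
```

Checking required fields before confidence matters because a high overall score does not guarantee that every individual field was found.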

Step 7: Schema from JSON

Instead of defining elements in code, load the schema from a JSON file (saved here as invoice_schema.json):

{
    "title": "Invoice",
    "description": "Extract invoice data",
    "type": "object",
    "properties": {
        "invoice_number": { "type": "string", "description": "Invoice ID" },
        "vendor_name": { "type": "string", "description": "Issuing company" },
        "total_amount": { "type": "number", "description": "Total due" },
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": { "type": "string" },
                    "amount": { "type": "number" }
                }
            }
        }
    },
    "required": ["invoice_number", "vendor_name", "total_amount"]
}

Load in code:

string schemaJson = File.ReadAllText("invoice_schema.json");

var extractor = new TextExtraction(model);
extractor.SetElementsFromJsonSchema(schemaJson);

extractor.SetContent(new Attachment("invoice.pdf"));
TextExtractionResult result = extractor.Parse();
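
A schema file edited by hand can drift: a name listed under "required" may be misspelled or no longer exist under "properties". The sketch below is a small pre-flight check that could run before the schema is passed to SetElementsFromJsonSchema. `MissingRequired` is a hypothetical helper built on System.Text.Json only, and it inspects just the top level of the schema.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical pre-flight check (not an LM-Kit.NET API): returns the names listed
// under "required" that have no matching entry under "properties".
static List<string> MissingRequired(string schemaJson)
{
    var missing = new List<string>();
    using JsonDocument doc = JsonDocument.Parse(schemaJson);
    JsonElement root = doc.RootElement;

    if (root.ValueKind != JsonValueKind.Object)
        throw new ArgumentException("Schema root must be a JSON object.", nameof(schemaJson));

    // No "required" array means nothing can be missing.
    if (!root.TryGetProperty("required", out JsonElement required) ||
        required.ValueKind != JsonValueKind.Array)
        return missing;

    bool hasProps = root.TryGetProperty("properties", out JsonElement props) &&
                    props.ValueKind == JsonValueKind.Object;

    foreach (JsonElement name in required.EnumerateArray())
    {
        if (name.ValueKind != JsonValueKind.String) continue; // ignore malformed entries
        string key = name.GetString()!;
        if (!hasProps || !props.TryGetProperty(key, out _))
            missing.Add(key);
    }

    return missing;
}

// Hand-written schema with one required field that is not defined.
string schema =
    "{ \"properties\": { \"invoice_number\": {}, \"total_amount\": {} }, " +
    "\"required\": [\"invoice_number\", \"vendor_name\"] }";

Console.WriteLine("Missing: " + string.Join(", ", MissingRequired(schema)));
```

In the tutorial's flow, the argument would be the schemaJson string read from invoice_schema.json, and a non-empty result is a reason to stop before loading the model.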

Common Issues

Problem	Cause	Fix
Missing field values (null)	Field not found in document	Check `NullOnDoubt`; if true, uncertain fields return null. Set to false to force extraction
Wrong date format	Model uses local format	Add format hint in element description: "YYYY-MM-DD format"
Line items not extracted	Schema mismatch or complex table layout	Use a larger model; add `Guidance` describing the table structure
Low confidence on scanned PDFs	Poor image quality	Set `extractor.OcrEngine` to a configured OCR engine for pre-processing
Slow on multi-page invoices	Processing all pages	Use `SetContent(attachment, pageRange: "1-2")` to limit to relevant pages

Next Steps

Extract Structured Data from Unstructured Text: general-purpose schema-driven extraction.
Convert Documents to Markdown with VLM OCR: convert documents before extraction.
Automatically Split Multi-Document PDFs with AI Vision: isolate individual invoices from bulk-scanned PDFs before extraction.
Process PDFs and Images with Built-In Document Tools: use PdfSplit, DocumentText, and OCR tools in agent workflows.
Samples: Invoice Data Extraction: invoice extraction demo.
Samples: Structured Data Extraction: structured extraction demo.
