
Extract Invoice Data from PDFs and Images

Accounts payable teams process invoices that arrive in every format: scanned PDFs, email attachments, photos of paper invoices. Extracting vendor names, amounts, dates, and line items manually is slow and error-prone. LM-Kit.NET's TextExtraction class, driven by a schema you define, pulls structured invoice data from PDFs and images. This tutorial builds an invoice extractor that handles PDFs, images, and batch processing.


Why Local Invoice Extraction Matters

On-device extraction solves two enterprise problems:

  1. Financial documents stay private. Invoices contain bank details, payment terms, and vendor relationships. Processing them through a cloud extraction API means a third party sees your financial data. Local extraction keeps every invoice on your infrastructure.
  2. Extraction integrates into existing AP workflows. Extracted data feeds directly into your ERP, accounting system, or approval workflow. With no external API dependency, there are no third-party outages or rate limits to plan around, and throughput stays consistent during end-of-month processing spikes.

Prerequisites

Requirement   Minimum
.NET SDK      8.0+
VRAM          4+ GB
Disk          ~3 GB free for model download

Step 1: Create the Project

dotnet new console -n InvoiceQuickstart
cd InvoiceQuickstart
dotnet add package LM-Kit.NET

Step 2: Define the Invoice Schema

TextExtraction uses a schema to know which fields to pull. The program below loads the model first, then defines the schema in code:

using System.Text;
using System.Text.Json;
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Define invoice extraction schema
// ──────────────────────────────────────
var extractor = new TextExtraction(model)
{
    Title = "Invoice",
    Description = "Extract structured data from an invoice document.",
    NullOnDoubt = true,
    Elements = new List<TextExtractionElement>
    {
        new("invoice_number", TextExtractionElement.ElementType.String,
            "The unique invoice identifier", isRequired: true),
        new("invoice_date", TextExtractionElement.ElementType.String,
            "The date the invoice was issued (YYYY-MM-DD format)", isRequired: true),
        new("due_date", TextExtractionElement.ElementType.String,
            "The payment due date (YYYY-MM-DD format)"),
        new("vendor_name", TextExtractionElement.ElementType.String,
            "The name of the company or person issuing the invoice", isRequired: true),
        new("vendor_address", TextExtractionElement.ElementType.String,
            "The full postal address of the vendor"),
        new("customer_name", TextExtractionElement.ElementType.String,
            "The name of the customer or recipient"),
        new("subtotal", TextExtractionElement.ElementType.Number,
            "The total amount before tax"),
        new("tax_amount", TextExtractionElement.ElementType.Number,
            "The tax amount"),
        new("total_amount", TextExtractionElement.ElementType.Number,
            "The total amount due including tax", isRequired: true),
        new("currency", TextExtractionElement.ElementType.String,
            "The currency code (e.g., USD, EUR, GBP)"),
        new("line_items", new List<TextExtractionElement>
        {
            new("description", TextExtractionElement.ElementType.String,
                "Description of the item or service"),
            new("quantity", TextExtractionElement.ElementType.Number,
                "Quantity of the item"),
            new("unit_price", TextExtractionElement.ElementType.Number,
                "Price per unit"),
            new("amount", TextExtractionElement.ElementType.Number,
                "Total amount for this line item")
        }, isArray: true, description: "Individual line items on the invoice")
    }
};

Step 3: Extract from a PDF Invoice

string invoicePath = "invoice_sample.pdf";
var attachment = new Attachment(invoicePath);

extractor.SetContent(attachment);

Console.WriteLine($"Extracting data from {Path.GetFileName(invoicePath)}...\n");

TextExtractionResult result = extractor.Parse();

// Access individual fields
Console.WriteLine($"Invoice #:   {result.GetValue<string>("invoice_number")}");
Console.WriteLine($"Date:        {result.GetValue<string>("invoice_date")}");
Console.WriteLine($"Vendor:      {result.GetValue<string>("vendor_name")}");
Console.WriteLine($"Total:       {result.GetValue<double>("total_amount")}");
Console.WriteLine($"Currency:    {result.GetValue<string>("currency")}");
Console.WriteLine($"Confidence:  {result.Confidence:P0}\n");

// Access line items
Console.WriteLine("Line items:");
foreach (var item in result.EnumerateAt("line_items"))
{
    string desc = item["description"]?.Value?.ToString() ?? "N/A";
    object qty = item["quantity"]?.Value ?? "N/A";
    object amount = item["amount"]?.Value ?? "N/A";
    Console.WriteLine($"  {desc} (qty: {qty}, amount: {amount})");
}

Step 4: Extract from an Invoice Image

Process photos and scanned images of invoices:

// C# requires using directives before any statements: add this one at the top of Program.cs.
using LMKit.Graphics;

string imagePath = "invoice_photo.jpg";
var image = new ImageBuffer(imagePath);

extractor.SetContent(image);

TextExtractionResult imageResult = extractor.Parse();

Console.WriteLine($"Invoice from image:");
Console.WriteLine($"  Invoice #: {imageResult.GetValue<string>("invoice_number")}");
Console.WriteLine($"  Vendor:    {imageResult.GetValue<string>("vendor_name")}");
Console.WriteLine($"  Total:     {imageResult.GetValue<double>("total_amount")}");
Console.WriteLine($"  Confidence: {imageResult.Confidence:P0}");

Step 5: Get the Full JSON Output

TextExtraction produces grammar-constrained JSON that matches your schema exactly:

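// jsonResult avoids redeclaring `result` when this snippet shares a file with Step 3.
extractor.SetContent(new Attachment("invoice_sample.pdf"));
TextExtractionResult jsonResult = extractor.Parse();

// Get raw JSON
string json = jsonResult.Json;
Console.WriteLine("Raw JSON output:\n");
Console.WriteLine(json);

// Parse with System.Text.Json for further processing
using JsonDocument doc = jsonResult.JsonDocument;
JsonElement root = doc.RootElement;

// With NullOnDoubt = true the field can be null, so check its kind before reading a number.
if (root.TryGetProperty("total_amount", out JsonElement totalElement)
    && totalElement.ValueKind == JsonValueKind.Number)
{
    double total = totalElement.GetDouble();
    // F2 rather than C: the C specifier prints the machine's local currency symbol,
    // which may not match the invoice currency.
    Console.WriteLine($"\nTotal for approval: {total:F2}");
}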
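
Grammar-constrained output guarantees the shape of the JSON, not the arithmetic inside it. Before a total goes to approval, it can help to confirm that the extracted numbers agree with each other. The following is a minimal sketch, assuming the field names from the Step 2 schema; CheckTotals, ReadNumber, and the 0.01 tolerance are hypothetical helpers for illustration and are not part of LM-Kit.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper: reads a numeric property, or null when it is absent or not a number.
static decimal? ReadNumber(JsonElement obj, string name) =>
    obj.ValueKind == JsonValueKind.Object
    && obj.TryGetProperty(name, out JsonElement v)
    && v.ValueKind == JsonValueKind.Number
        ? v.GetDecimal()
        : (decimal?)null;

// Hypothetical helper: returns a list of problems; an empty list means the totals reconcile.
static List<string> CheckTotals(string json, decimal tolerance = 0.01m)
{
    var problems = new List<string>();
    using JsonDocument doc = JsonDocument.Parse(json);
    JsonElement root = doc.RootElement;

    decimal? subtotal = ReadNumber(root, "subtotal");
    decimal? tax = ReadNumber(root, "tax_amount");
    decimal? total = ReadNumber(root, "total_amount");

    // Check 1: subtotal + tax should equal the total (only when all three were extracted).
    if (subtotal is not null && tax is not null && total is not null
        && Math.Abs(subtotal.Value + tax.Value - total.Value) > tolerance)
    {
        problems.Add("subtotal + tax_amount does not match total_amount");
    }

    // Check 2: line item amounts should add up to the subtotal.
    if (subtotal is not null
        && root.TryGetProperty("line_items", out JsonElement items)
        && items.ValueKind == JsonValueKind.Array
        && items.GetArrayLength() > 0)
    {
        decimal sum = 0m;
        bool complete = true;
        foreach (JsonElement item in items.EnumerateArray())
        {
            decimal? amount = ReadNumber(item, "amount");
            if (amount is null) { complete = false; break; }
            sum += amount.Value;
        }

        // Skip the comparison when any line item is missing its amount.
        if (complete && Math.Abs(sum - subtotal.Value) > tolerance)
        {
            problems.Add("line item amounts do not add up to subtotal");
        }
    }

    return problems;
}

// Example with made-up values: 100 + 20 is not 125, so one problem is reported.
string sampleJson = """{"subtotal": 100.0, "tax_amount": 20.0, "total_amount": 125.0, "line_items": [{"amount": 60.0}, {"amount": 40.0}]}""";
foreach (string problem in CheckTotals(sampleJson))
{
    Console.WriteLine(problem);
}
```

In the tutorial program you would pass the json string from this step to CheckTotals. Invoices with discounts, shipping, or multiple tax lines will need a looser rule than this one.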

Step 6: Batch Invoice Processing

Process a folder of invoices and export structured data:

// Quote a CSV field and double any embedded quotes so a vendor like Acme "West" stays in one column.
static string Csv(string value) => "\"" + value.Replace("\"", "\"\"") + "\"";

// Invariant culture keeps "." as the decimal separator, so numbers never split a CSV column.
var invariant = System.Globalization.CultureInfo.InvariantCulture;
string[] extensions = { ".pdf", ".png", ".jpg", ".jpeg", ".tiff" };

string[] invoiceFiles = Directory.GetFiles("invoices")
    .Where(f => extensions.Contains(Path.GetExtension(f).ToLowerInvariant()))
    .ToArray();

var output = new List<string>();
output.Add("file,invoice_number,vendor,date,total,currency,confidence");

Console.WriteLine($"Processing {invoiceFiles.Length} invoices...\n");

foreach (string file in invoiceFiles)
{
    string fileName = Path.GetFileName(file);
    Console.Write($"  {fileName}... ");

    extractor.SetContent(new Attachment(file));
    TextExtractionResult r = extractor.Parse();

    string invoiceNum = r.GetValue<string>("invoice_number") ?? "N/A";
    string vendor = r.GetValue<string>("vendor_name") ?? "N/A";
    string date = r.GetValue<string>("invoice_date") ?? "N/A";
    double total = r.GetValue<double>("total_amount");
    string currency = r.GetValue<string>("currency") ?? "N/A";

    Console.WriteLine($"#{invoiceNum} from {vendor}: {total} {currency}");

    string totalText = total.ToString(invariant);
    string confidenceText = string.Format(invariant, "{0:F2}", r.Confidence);

    output.Add(string.Join(",",
        Csv(fileName), Csv(invoiceNum), Csv(vendor), Csv(date),
        totalText, Csv(currency), confidenceText));
}

File.WriteAllLines("invoice_data.csv", output);
Console.WriteLine($"\nExported {invoiceFiles.Length} invoices to invoice_data.csv");
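
The batch loop records a confidence score for every invoice, which an approval workflow can use to decide which results need a human check. The sketch below shows one possible rule. It assumes confidence is a value between 0 and 1 (as the P0 formatting in Step 3 suggests) and that missing strings arrive as null or "N/A" as in the loop above; RouteInvoice, the 0.80 threshold, and the queue names are hypothetical and should be tuned against your own invoices.

```csharp
using System;

// Hypothetical routing rule: the threshold and queue names are illustrative,
// not LM-Kit.NET features.
static string RouteInvoice(double confidence, string? invoiceNumber, string? vendor,
                           double total, double minConfidence = 0.80)
{
    // Treat null, blank, and the "N/A" placeholder from the batch loop as missing.
    static bool IsMissing(string? value) =>
        string.IsNullOrWhiteSpace(value) || value == "N/A";

    // A required field that is missing, or a non-positive total, always goes to a person.
    if (IsMissing(invoiceNumber) || IsMissing(vendor) || total <= 0)
    {
        return "manual-review";
    }

    // Otherwise the confidence score decides.
    return confidence >= minConfidence ? "auto-approve-queue" : "manual-review";
}

// Example calls with made-up values.
Console.WriteLine(RouteInvoice(0.93, "INV-1001", "Acme Ltd", 1250.00));
Console.WriteLine(RouteInvoice(0.55, "INV-1002", "Acme Ltd", 310.40));
Console.WriteLine(RouteInvoice(0.97, "N/A", "Acme Ltd", 99.00));
```

Inside the batch loop this would be called as RouteInvoice(r.Confidence, invoiceNum, vendor, total), with the returned queue name written as an extra CSV column.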

Step 7: Schema from JSON

Instead of defining elements in code, load the schema from a JSON file:

{
    "title": "Invoice",
    "description": "Extract invoice data",
    "type": "object",
    "properties": {
        "invoice_number": { "type": "string", "description": "Invoice ID" },
        "vendor_name": { "type": "string", "description": "Issuing company" },
        "total_amount": { "type": "number", "description": "Total due" },
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": { "type": "string" },
                    "amount": { "type": "number" }
                }
            }
        }
    },
    "required": ["invoice_number", "vendor_name", "total_amount"]
}

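Save the schema as invoice_schema.json, then load it in code:

string schemaJson = File.ReadAllText("invoice_schema.json");

// Distinct names avoid clashing with `extractor` and `result` from the earlier steps.
var schemaExtractor = new TextExtraction(model);
schemaExtractor.SetElementsFromJsonSchema(schemaJson);

schemaExtractor.SetContent(new Attachment("invoice.pdf"));
TextExtractionResult schemaResult = schemaExtractor.Parse();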
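
A hand-edited schema file can drift out of sync, for example when a field is renamed under "properties" but not under "required". A quick check before loading catches this early. The following is a minimal sketch using only System.Text.Json; FindUndefinedRequired is a hypothetical helper, and how LM-Kit.NET itself reacts to such a mismatch is not covered here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper: returns every name listed under "required"
// that has no matching entry under "properties".
static List<string> FindUndefinedRequired(string schemaJson)
{
    var missing = new List<string>();
    using JsonDocument doc = JsonDocument.Parse(schemaJson);
    JsonElement root = doc.RootElement;

    // Nothing to check unless the schema is an object with a "required" array.
    if (root.ValueKind != JsonValueKind.Object
        || !root.TryGetProperty("required", out JsonElement required)
        || required.ValueKind != JsonValueKind.Array)
    {
        return missing;
    }

    bool hasProperties = root.TryGetProperty("properties", out JsonElement properties)
                         && properties.ValueKind == JsonValueKind.Object;

    foreach (JsonElement entry in required.EnumerateArray())
    {
        string? field = entry.ValueKind == JsonValueKind.String ? entry.GetString() : null;
        if (field is null) continue;

        if (!hasProperties || !properties.TryGetProperty(field, out _))
        {
            missing.Add(field);
        }
    }

    return missing;
}

// Example with a made-up schema: vendor_name is required but never defined.
string sampleSchema = """{"properties": {"invoice_number": {"type": "string"}}, "required": ["invoice_number", "vendor_name"]}""";
Console.WriteLine(string.Join(", ", FindUndefinedRequired(sampleSchema)));
```

In the tutorial program you would call this on the schemaJson string right after File.ReadAllText and stop if the returned list is not empty.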

Common Issues

Problem | Cause | Fix
Missing field values (null) | Field not found in document | Check NullOnDoubt; if true, uncertain fields return null. Set it to false to force extraction
Wrong date format | Model uses local format | Add a format hint in the element description: "YYYY-MM-DD format"
Line items not extracted | Schema mismatch or complex table layout | Use a larger model; add Guidance describing the table structure
Low confidence on scanned PDFs | Poor image quality | Set extractor.OcrEngine to a configured OCR engine for pre-processing
Slow on multi-page invoices | Processing all pages | Use SetContent(attachment, pageRange: "1-2") to limit to relevant pages
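
For the date format problem, a format hint in the description reduces the issue but does not guarantee the result, so a post-processing step can normalize whatever the model returns. The sketch below is one way to do that with the standard DateOnly type. NormalizeInvoiceDate and its list of accepted formats are hypothetical choices; ambiguous numeric forms such as 03/05/2024 are deliberately rejected and return null so they can be reviewed by a person.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: returns the date as yyyy-MM-dd, or null when the input
// is missing or not in one of the unambiguous formats listed below.
static string? NormalizeInvoiceDate(string? raw)
{
    if (string.IsNullOrWhiteSpace(raw)) return null;

    // Only formats where day and month cannot be confused.
    string[] formats =
    {
        "yyyy-MM-dd",
        "yyyy/MM/dd",
        "d MMMM yyyy",
        "MMMM d, yyyy",
        "d MMM yyyy"
    };

    return DateOnly.TryParseExact(raw.Trim(), formats, CultureInfo.InvariantCulture,
                                  DateTimeStyles.None, out DateOnly date)
        ? date.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture)
        : null;
}

// Example calls with made-up values.
Console.WriteLine(NormalizeInvoiceDate("March 5, 2024") ?? "(needs review)");
Console.WriteLine(NormalizeInvoiceDate("03/05/2024") ?? "(needs review)");
```

Month names are parsed with the invariant (English) culture here; invoices in other languages would need the matching CultureInfo.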

Next Steps