👉 Try the demo:
https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/invoice_data_extraction

Invoice Data Extraction for C# .NET Applications (Images and PDFs)

🎯 Purpose of the Sample

The Invoice Data Extraction demo shows how to use the LM-Kit.NET SDK to extract structured invoice fields from invoice documents, including images and PDF files.

It combines a vision-language model with an OCR engine to improve reliability, then applies a JSON schema to extract consistent structured outputs that are easy to integrate into accounting, ERP, and automation pipelines.

👥 Industry Target Audience

This demo is useful for developers and businesses involved in:

Financial services and accounting: automate invoice capture to reduce manual entry and errors.
Enterprise software development: integrate structured invoice extraction into internal tools and platforms.
Supply chain and logistics: process invoices quickly across languages and formats.
Intelligent automation solutions: enable autonomous invoice processing with structured outputs.

🚀 Problem Solved

Manual invoice data extraction is slow, costly, and error-prone, especially when invoices arrive as scans, photos, or PDFs and may contain multiple languages.

This demo automates the workflow:

An OCR engine and a vision-language model read the document content
A JSON schema defines which fields to extract
The output is returned both as individual fields and as a JSON payload
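
To make the schema step concrete, here is a small illustrative schema. It is not the contents of the sample's schema.json: the field names (invoice_number, invoice_date, vendor_name, total_amount, currency) are assumptions chosen for illustration, and which JSON Schema keywords the SDK honors should be checked against the LM-Kit.NET documentation.

```json
{
  "type": "object",
  "properties": {
    "invoice_number": {
      "type": "string",
      "description": "Invoice identifier as printed on the document"
    },
    "invoice_date": {
      "type": "string",
      "description": "Date the invoice was issued"
    },
    "vendor_name": { "type": "string" },
    "total_amount": { "type": "number" },
    "currency": { "type": "string" }
  },
  "required": ["invoice_number", "total_amount"]
}
```

Each property becomes one field to extract, so adding or removing a property changes what the demo returns without touching the application code.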

💻 Sample Application Description

The Invoice Data Extraction demo is a console app that:

Lets you select a vision-language model (or enter a custom model URI)
Downloads and loads the model with progress feedback
Loads extraction fields from schema.json
Attaches an OCR engine (Tesseract) with:
- language detection
- orientation detection
- automatic model download
Prompts you to select a sample invoice (or enter a custom document path)
Extracts structured fields and prints:
- extracted values per field
- a full JSON output
- total processing time

✨ Key Features

Flexible model selection: choose from predefined vision-language models, or provide a custom URI.
Real-time progress indicators: shows model download and loading progress in the console.
Schema-driven extraction: extraction fields are defined in schema.json and loaded with SetElementsFromJsonSchema.
OCR boost with Tesseract: improves extraction accuracy with language and orientation detection.
Structured outputs: prints each extracted field plus a JSON payload suitable for storage or APIs.
Document preview: opens the selected invoice file using the OS default viewer.

🤖 Benefits for Agentic Solutions

Adding invoice extraction to autonomous workflows provides:

Faster automation: agents can read invoices and produce structured data immediately.
Reduced processing time: less manual validation and re-keying.
More robust inputs: works with images and PDFs, including rotated scans.
Better accuracy: OCR language and orientation detection improves extraction quality.
Easier integration: JSON output plugs directly into downstream systems.

🛠️ Getting Started

📋 Prerequisites

.NET 8.0 or later

📥 Download the Project

.NET Console Demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/invoice_data_extraction

▶️ Running the Application

Clone the repository:

git clone https://github.com/LM-Kit/lm-kit-net-samples

Navigate to the project directory:

cd lm-kit-net-samples/console_net/invoice_data_extraction

Build and run the application:

dotnet build
dotnet run

Follow the on-screen prompts to select a model and provide an invoice document path (image or PDF).

💡 Example Usage

Select a vision-language model: choose a model from the list, or enter a custom model URI.
Choose an invoice document: pick one of the included examples or provide your own file path.
Extraction:
- the OCR engine detects language and orientation
- the extraction engine loads the schema from schema.json
- LM-Kit extracts structured fields from the document
Review extracted data:
- field-by-field values printed to the console
- full JSON payload printed for integration
Process more invoices: press any key to continue and process another invoice.
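
The JSON payload from the review step can be consumed with the standard System.Text.Json library. The sketch below is a minimal illustration and not part of the demo: the payload and its field names (invoice_number, total_amount, and so on) are hypothetical, since the real names come from schema.json.

```csharp
using System;
using System.Text.Json;

// Hypothetical payload: the real field names come from the sample's schema.json.
const string payload = """
{
  "invoice_number": "INV-1001",
  "vendor_name": "Example Supplies Ltd",
  "total_amount": 1250.50,
  "currency": "EUR"
}
""";

using JsonDocument doc = JsonDocument.Parse(payload);
JsonElement root = doc.RootElement;

// TryGetProperty avoids an exception when a field is missing from the payload.
string invoiceNumber =
    root.TryGetProperty("invoice_number", out JsonElement numberElement)
        ? numberElement.GetString() ?? string.Empty
        : string.Empty;

// Check the value kind before reading a number, in case the field came back
// as null or as text.
decimal totalAmount =
    root.TryGetProperty("total_amount", out JsonElement totalElement)
    && totalElement.ValueKind == JsonValueKind.Number
        ? totalElement.GetDecimal()
        : 0m;

Console.WriteLine($"Invoice {invoiceNumber}, total {totalAmount}");
```

Reading fields defensively, rather than deserializing into a strict type, is one reasonable choice here because an extraction result may omit a field the document does not contain.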

This demo helps you build invoice automation pipelines that convert invoice documents into structured data reliably, with outputs that are ready for databases, APIs, and accounting systems.

How-To: Extract Invoice Data from Documents: Step-by-step guide to extracting structured invoice fields using schema-driven extraction.
How-To: Extract Structured Data: Learn the general approach to defining schemas and extracting structured data from documents.
Glossary: Structured Data Extraction: Explains schema-driven extraction concepts and JSON output generation.
Glossary: Intelligent Document Processing: Covers the broader field of automated document understanding.
Receipt Expense Scanner Demo: Similar extraction demo focused on parsing receipts with line items and tax breakdowns.
