👉 Try the demo:
https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/invoice_data_extraction
Invoice Data Extraction for C# .NET Applications (Images and PDFs)
🎯 Purpose of the Sample
The Invoice Data Extraction demo shows how to use the LM-Kit.NET SDK to extract structured invoice fields from invoice documents, including images and PDF files.
It combines a vision-language model with an OCR engine to improve reliability, then applies a JSON schema to extract consistent structured outputs that are easy to integrate into accounting, ERP, and automation pipelines.
👥 Industry Target Audience
This demo is useful for developers and businesses involved in:
- Financial services and accounting: automate invoice capture to reduce manual entry and errors.
- Enterprise software development: integrate structured invoice extraction into internal tools and platforms.
- Supply chain and logistics: process invoices quickly across languages and formats.
- Intelligent automation solutions: enable autonomous invoice processing with structured outputs.
🚀 Problem Solved
Manual invoice data extraction is slow, costly, and error-prone, especially when invoices arrive as scans, photos, or PDFs and may contain multiple languages.
This demo automates the workflow:
- OCR and a vision-language model read the document content
- A JSON schema defines what fields to extract
- The output is returned as both individual fields and a JSON payload
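To make the schema step concrete, here is a minimal sketch of what a schema.json could contain and how to list the fields it declares. The field names (invoice_number, invoice_date, vendor_name, total_amount) and the SchemaInspector helper are hypothetical; the schema shipped with the sample may differ. The sketch uses only System.Text.Json, makes no LM-Kit calls, and assumes .NET 8 (raw string literals need C# 11).

```csharp
using System.Collections.Generic;
using System.Text.Json;

public static class SchemaInspector
{
    // Hypothetical schema content; the sample's real schema.json may differ.
    public const string ExampleSchema = """
    {
      "type": "object",
      "properties": {
        "invoice_number": { "type": "string" },
        "invoice_date":   { "type": "string" },
        "vendor_name":    { "type": "string" },
        "total_amount":   { "type": "number" }
      },
      "required": ["invoice_number", "total_amount"]
    }
    """;

    // Returns (field name, declared type) pairs from the schema's
    // top-level "properties" object, in document order.
    public static List<(string Name, string Type)> ListFields(string schemaJson)
    {
        var fields = new List<(string Name, string Type)>();
        using JsonDocument doc = JsonDocument.Parse(schemaJson);

        if (doc.RootElement.ValueKind == JsonValueKind.Object &&
            doc.RootElement.TryGetProperty("properties", out JsonElement props) &&
            props.ValueKind == JsonValueKind.Object)
        {
            foreach (JsonProperty p in props.EnumerateObject())
            {
                // Fall back to "unknown" when "type" is absent or not a plain string.
                string type = "unknown";
                if (p.Value.ValueKind == JsonValueKind.Object &&
                    p.Value.TryGetProperty("type", out JsonElement t) &&
                    t.ValueKind == JsonValueKind.String)
                {
                    type = t.GetString() ?? "unknown";
                }
                fields.Add((p.Name, type));
            }
        }
        return fields;
    }
}
```

Calling SchemaInspector.ListFields(SchemaInspector.ExampleSchema) and printing each pair is a quick way to check which fields a schema asks for before running an extraction.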
💻 Sample Application Description
The Invoice Data Extraction demo is a console app that:
Lets you select a vision-language model (or enter a custom model URI)
Downloads and loads the model with progress feedback
Loads extraction fields from schema.json
Attaches an OCR engine (Tesseract) with:
- language detection
- orientation detection
- automatic model download
Prompts you to select a sample invoice (or enter a custom document path)
Extracts structured fields and prints:
- extracted values per field
- a full JSON output
- total processing time
✨ Key Features
- Flexible model selection: choose from predefined vision-language models, or provide a custom URI.
- Real-time progress indicators: model downloading and loading progress in the console.
- Schema-driven extraction: extraction fields defined via schema.json and loaded using SetElementsFromJsonSchema.
- OCR boost with Tesseract: improves extraction accuracy with language and orientation detection.
- Structured outputs: prints each extracted field plus a JSON payload suitable for storage or APIs.
- Document preview: opens the selected invoice file using the OS default viewer.
🤖 Benefits for Agentic Solutions
Adding invoice extraction to autonomous workflows provides:
- Faster automation: agents can read invoices and produce structured data immediately.
- Reduced processing time: less manual validation and re-keying.
- More robust inputs: works with images and PDFs, including rotated scans.
- Better accuracy: OCR language and orientation detection improve extraction quality.
- Easier integration: JSON output plugs directly into downstream systems.
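As an illustration of the integration point, the sketch below maps a JSON payload onto a typed record with System.Text.Json before handing it to a downstream system. The InvoiceRecord type, the InvoicePayload helper, and the property names are hypothetical and assume a schema with invoice_number, vendor_name, and total_amount fields; adjust them to match the JSON the demo actually prints.

```csharp
#nullable enable
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical target type; the property names mirror an assumed schema,
// not necessarily the one shipped with the sample.
public sealed class InvoiceRecord
{
    [JsonPropertyName("invoice_number")] public string? InvoiceNumber { get; set; }
    [JsonPropertyName("vendor_name")] public string? VendorName { get; set; }
    [JsonPropertyName("total_amount")] public decimal? TotalAmount { get; set; }
}

public static class InvoicePayload
{
    // Deserializes the payload and rejects records without an invoice number.
    // JSON properties that are not declared on InvoiceRecord are ignored.
    public static InvoiceRecord Parse(string json)
    {
        InvoiceRecord? record = JsonSerializer.Deserialize<InvoiceRecord>(json);
        if (record is null)
        {
            throw new JsonException("Payload was JSON null.");
        }
        if (string.IsNullOrWhiteSpace(record.InvoiceNumber))
        {
            throw new InvalidOperationException("invoice_number is missing or empty.");
        }
        return record;
    }
}
```

Validating required fields at this boundary keeps incomplete extractions out of a database or ERP import; which fields count as required is a design choice for the consuming system.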
🛠️ Getting Started
📋 Prerequisites
- .NET Framework 4.6.2 or .NET 8.0+
📥 Download the Project
▶️ Running the Application
- Clone the repository:
git clone https://github.com/LM-Kit/lm-kit-net-samples
- Navigate to the project directory:
cd lm-kit-net-samples/console_net/invoice_data_extraction
- Build and run the application:
dotnet build
dotnet run
- Follow the on-screen prompts to select a model and provide an invoice document path (image or PDF).
💡 Example Usage
Select a vision-language model: choose a model from the list, or enter a custom model URI.
Choose an invoice document: pick one of the included examples or provide your own file path.
Extraction:
- the OCR engine detects language and orientation
- the extraction engine loads the schema from schema.json
- LM-Kit extracts structured fields from the document
Review extracted data:
- field-by-field values printed to the console
- full JSON payload printed for integration
Process more invoices: press any key to continue and run another invoice.
This demo helps you build invoice automation pipelines that convert invoice documents into structured data reliably, with outputs that are ready for databases, APIs, and accounting systems.