👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/vlm_ocr
VLM OCR for C# .NET Applications
🎯 Purpose of the Demo
VLM OCR demonstrates how to use LM-Kit.NET with vision-language models to extract plain text from images, PDFs, and scanned documents using on-device OCR inference.
The sample shows how to:
- Download and load a vision model with progress callbacks.
- Wrap it with LM-Kit's `VlmOcr` engine.
- Feed images or PDFs as `Attachment` objects.
- Process multi-page inputs using `Attachment.PageCount`.
- Select an OCR intent (`VlmOcrIntent`) to control the desired output: plain text, Markdown, table recognition, formula recognition, chart recognition, OCR with coordinates, or seal recognition.
- Retrieve recognized text plus generation statistics (tokens, speed, quality, context usage).
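As a quick orientation, here is a minimal sketch of that flow for a single image with the plain-text intent (the file path is a placeholder; the fuller, per-page snippet appears later in this README):

```csharp
using System;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

// Resolve and load the recommended vision model (downloaded on first use).
var model = LM.LoadFromModelID("paddleocr-vl:0.9b");

// Wrap it with the OCR engine and ask for plain text output.
var ocr = new VlmOcr(model, VlmOcrIntent.PlainText);

// Feed a single image as an Attachment and print the text of page 0.
var attachment = new Attachment("receipt.png"); // placeholder path
Console.WriteLine(ocr.Run(attachment, 0).PageElement.Text);
```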
Why VLM OCR with LM-Kit.NET?
- Local-first: run OCR on your own hardware for privacy-sensitive workloads.
- Unified API: same model abstraction (`LM`) for text and vision pipelines.
- Intent-driven: select a `VlmOcrIntent` and the engine maps it to the best instruction and post-processing for the loaded model.
- Rich telemetry: quality score, token usage, and performance metrics per page.
- Ultra-compact: PaddleOCR VL 1.5 requires only ~1 GB VRAM for accurate document OCR.
👥 Target Audience
- Product and Platform: add OCR to existing .NET backends or pipelines.
- Data and Document Processing: bulk ingest of PDFs, scans, screenshots, invoices, and receipts.
- RPA and Back-office: extract text from forms, tables, formulas, charts, and stamps.
- Demo and Education: minimal, readable example of VLM-based OCR in C#.
🚀 Problem Solved
- Turn images and PDFs into text: extract readable text from photos, screenshots, scans, and PDF pages.
- Specialized recognition: use dedicated intents for tables, formulas, charts, and seals.
- Model flexibility: select a model based on your available VRAM and accuracy needs.
- Operational visibility: built-in stats on speed, context usage, and quality.
- Multi-page handling: iterate through PDF pages automatically with `PageCount`.
💻 Sample Application Description
Console app that:
1. Lets you choose a vision model (PaddleOCR VL is the recommended default) or paste a custom model URI.
2. Downloads the model if needed, with live progress updates.
3. Repeatedly asks you for a file path (image or PDF), then:
   - Prompts you to select an OCR intent (plain text, Markdown, table, formula, chart, coordinates, seal).
   - Creates a `VlmOcr` instance with the selected intent.
   - Loads the file as an `Attachment`.
   - Runs OCR page-by-page via `ocr.Run(attachment, pageIndex)`.
   - Prints the extracted text to the console.
4. Displays a stats block (intent, elapsed time, tokens, quality, speed, context usage).
5. Loops until you type `q` to quit.
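A rough sketch of that interactive loop, under the assumption that the menu numbers map directly onto the `VlmOcrIntent` enum values (the actual sample's prompt wording and error handling differ):

```csharp
using System;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

// Simplified console loop; prompt wording and error handling differ from the sample.
var model = LM.LoadFromModelID("paddleocr-vl:0.9b");

while (true)
{
    Console.Write("enter image or document path (or 'q' to quit): ");
    var path = Console.ReadLine()?.Trim();
    if (string.IsNullOrEmpty(path) || path == "q")
        break;

    Console.Write("select an OCR intent (0-7): ");
    // Assumption: the menu numbers map directly onto the VlmOcrIntent enum values.
    var intent = (VlmOcrIntent)int.Parse(Console.ReadLine() ?? "0");

    var ocr = new VlmOcr(model, intent);
    var attachment = new Attachment(path);

    for (int page = 0; page < attachment.PageCount; page++)
    {
        Console.WriteLine(ocr.Run(attachment, page).PageElement.Text);
    }
}
```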
✨ Key Features
🧠 Vision-based OCR: uses a multimodal model behind `VlmOcr`.
🔧 Intent-driven modes: seven intents that work across all supported vision models.
📄 Image + PDF support: the same code path handles both formats.
📥 Interactive loop: enter file path -> select intent -> get text -> see metrics -> repeat.
📑 Multi-page aware: prints results per page using `attachment.PageCount`.
📊 Telemetry (see the sketch after this feature list):
- Elapsed time (seconds)
- Generated token count
- Stop reason
- Quality score
- Token generation rate
- Context tokens vs context size
📦 Model lifecycle:
- Automatic download on first use.
- Loading progress shown in the console.
❌ Friendly errors: a clear message is shown when a file path is invalid or inaccessible.
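The telemetry above can be read from the OCR result; here is an illustrative helper using the member names listed under Notes on Key Types below (the exact formatting in the sample differs, and elapsed time here is simply measured with a `Stopwatch`):

```csharp
using System;
using System.Diagnostics;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

var model = LM.LoadFromModelID("paddleocr-vl:0.9b");
var ocr = new VlmOcr(model, VlmOcrIntent.PlainText);
var attachment = new Attachment("invoice.pdf"); // placeholder path

for (int page = 0; page < attachment.PageCount; page++)
    OcrPageWithStats(ocr, attachment, page);

// Runs one page and prints a stats block similar to the sample's.
// Member names follow the "Notes on Key Types" section; formatting is illustrative.
static void OcrPageWithStats(VlmOcr ocr, Attachment attachment, int pageIndex)
{
    var sw = Stopwatch.StartNew();
    var result = ocr.Run(attachment, pageIndex);
    sw.Stop();

    Console.WriteLine(result.PageElement.Text);

    var gen = result.TextGeneration;
    Console.WriteLine($"Intent     : {ocr.Intent}");
    Console.WriteLine($"Elapsed    : {sw.Elapsed.TotalSeconds:F1} s");
    Console.WriteLine($"Tokens     : {gen.GeneratedTokens.Count}");
    Console.WriteLine($"Stop reason: {gen.TerminationReason}");
    Console.WriteLine($"Quality    : {gen.QualityScore}");
    Console.WriteLine($"Speed      : {gen.TokenGenerationRate} tok/s");
    Console.WriteLine($"Context    : {gen.ContextTokens} / {gen.ContextSize}");
}
```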
🧰 Built-In Models (menu)
On startup, the sample shows a model selection menu:
| Option | Model | Approx. VRAM Needed |
|---|---|---|
| 0 | PaddlePaddle PaddleOCR VL 1.5 0.9B | ~1 GB VRAM |
| 1 | LightOn LightOnOCR 2 1B | ~2 GB VRAM |
| 2 | MiniCPM o 4.5 9B | ~5.9 GB VRAM |
| 3 | Alibaba Qwen 3 VL 2B | ~2.5 GB VRAM |
| 4 | Alibaba Qwen 3 VL 4B | ~4.5 GB VRAM |
| 5 | Alibaba Qwen 3 VL 8B | ~6.5 GB VRAM |
| 6 | Google Gemma 3 4B | ~5.7 GB VRAM |
| other | Custom model URI (GGUF / LMK, etc.) | depends on model |
Any input other than `0`-`6` is treated as a custom model URI and passed directly to the `LM` constructor.
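One way this menu logic can be expressed, using the predefined model IDs from the Supported Models section; the `Uri`-based `LM` constructor overload for custom URIs is an assumption based on the Notes on Key Types below:

```csharp
using System;
using LMKit.Model;

Console.Write("select a model (0-6) or paste a custom model URI: ");
var model = LoadSelectedModel(Console.ReadLine()?.Trim() ?? "0");

// Maps a menu entry to a predefined model card, or treats anything else as a
// custom model URI (the Uri-based LM constructor overload is assumed here).
static LM LoadSelectedModel(string input) => input switch
{
    "0" => LM.LoadFromModelID("paddleocr-vl:0.9b"),
    "1" => LM.LoadFromModelID("lightonocr-2:1b"),
    "2" => LM.LoadFromModelID("minicpm-o-45"),
    "3" => LM.LoadFromModelID("qwen3-vl:2b"),
    "4" => LM.LoadFromModelID("qwen3-vl:4b"),
    "5" => LM.LoadFromModelID("qwen3-vl:8b"),
    "6" => LM.LoadFromModelID("gemma3:4b"),
    _   => new LM(new Uri(input)) // custom GGUF / LMK model URI
};
```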
🔧 OCR Intents
Before processing each document, you select an intent that describes the desired output:
| # | Intent | Description |
|---|---|---|
| 0 | `Undefined` | Auto: engine picks the best default for the model |
| 1 | `PlainText` | Plain text OCR |
| 2 | `Markdown` | Markdown conversion with structural elements |
| 3 | `TableRecognition` | Structured table extraction |
| 4 | `FormulaRecognition` | Mathematical formula recognition |
| 5 | `ChartRecognition` | Chart and graph data extraction |
| 6 | `OcrWithCoordinates` | Text detection with bounding-box coordinates |
| 7 | `SealRecognition` | Official seal and stamp recognition |
The engine maps each intent to the best available instruction for the loaded model. Not every model natively supports every intent; in those cases the engine applies its internal prompting and post-processing logic to get as close as possible to the requested result.
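For example, passing `Undefined` and reading back the resolved intent (behavior as described under Behavior and Policies below):

```csharp
using System;
using LMKit.Extraction.Ocr;
using LMKit.Model;

var model = LM.LoadFromModelID("paddleocr-vl:0.9b");

// Passing Undefined lets the engine pick the best default for this model;
// the Intent property then reports the concrete intent it resolved to.
var ocr = new VlmOcr(model, VlmOcrIntent.Undefined);
Console.WriteLine($"Resolved intent: {ocr.Intent}");
```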
🧠 Supported Models
The sample is pre-wired to LM-Kit's predefined model cards:
- `paddleocr-vl:0.9b` (recommended)
- `lightonocr-2:1b`
- `minicpm-o-45`
- `qwen3-vl:2b`
- `qwen3-vl:4b`
- `qwen3-vl:8b`
- `gemma3:4b`
You can also provide any valid model URI manually (including local paths or custom model servers) by typing or pasting it when prompted.
🛠️ Commands and Flow
Inside the console loop:
On startup
- Select a model (0-6) or paste a custom model URI.
- The model is downloaded (if needed) and loaded with progress reporting.
Per document (image or PDF)
The app prompts: `enter image or document path (or 'q' to quit):`

1. Type a file path and press Enter.
2. Select an OCR intent (0-7).
3. The app loads the file into an `Attachment`.
4. The app iterates pages:
   - For images, this is typically 1 page.
   - For PDFs, this can be N pages.
5. For each page, OCR runs and prints:
   - The recognized text
   - A Stats section
Quit
- At any prompt, typing `q` exits the app cleanly.
🗣️ Example Use Cases
Try the sample with:
- A scanned invoice image -> use `PlainText` intent to extract all text.
- A multi-page PDF report -> use `Markdown` intent for structured output.
- A table screenshot -> use `TableRecognition` intent for row/column extraction.
- A math problem photo -> use `FormulaRecognition` intent to get LaTeX notation.
- A chart or graph -> use `ChartRecognition` to extract data points.
- A document with a stamp -> use `SealRecognition` to read the seal text.
After each run, compare:
- Quality score: does the text look correct vs. the page?
- Token usage and speed: does a bigger model give better quality at acceptable latency?
⚙️ Behavior and Policies (quick reference)
Model selection: exactly one model per process. To change models, restart the app.
Download and load:
- `ModelDownloadingProgress` prints `Downloading XX.XX%` or byte counts.
- `ModelLoadingProgress` prints `Loading XX%` and clears the console once done.
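A sketch of how those callbacks can be wired; the handler names come from this README, but their signatures and the `LM` constructor overload shown here are assumptions modeled on other LM-Kit samples, so check the demo source for the exact wiring:

```csharp
using System;
using LMKit.Model;

// Handler names come from this README; signatures and the constructor overload
// below are assumptions modeled on other LM-Kit samples.
static bool ModelDownloadingProgress(string path, long? contentLength, long bytesRead)
{
    Console.Write(contentLength.HasValue
        ? $"\rDownloading {100.0 * bytesRead / contentLength.Value:F2}%"
        : $"\rDownloading {bytesRead:N0} bytes");
    return true; // returning false would cancel the download
}

static bool ModelLoadingProgress(float progress)
{
    Console.Write($"\rLoading {(int)(progress * 100)}%");
    return true;
}

// Hypothetical wiring with a placeholder model URI:
var model = new LM(new Uri("https://huggingface.co/.../model.gguf"),
                   downloadingProgress: ModelDownloadingProgress,
                   loadingProgress: ModelLoadingProgress);
```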
OCR engine:
- `VlmOcr` runs OCR with the selected vision model and intent.
- The `Intent` property reflects the resolved intent (never `Undefined`).
- `result.PageElement.Text` is the recognized text for the page.
Multi-page processing:
- `Attachment.PageCount` is used to iterate over pages.
- OCR is executed per page using `ocr.Run(attachment, pageIndex)`.
Licensing:
- You can set an optional license key via `LicenseManager.SetLicenseKey("")`.
- A free community license is available from the LM-Kit website.
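For example (the `LMKit.Licensing` namespace is assumed; replace the empty string with your key):

```csharp
// Namespace assumed; replace the empty string with your community or commercial key.
LMKit.Licensing.LicenseManager.SetLicenseKey("");
```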
💻 Minimal Integration Snippet
```csharp
using System;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

public class VlmOcrSample
{
    public void RunOcr(string filePath)
    {
        // Load the PaddleOCR VL model
        var lm = LM.LoadFromModelID("paddleocr-vl:0.9b");

        // Create the OCR engine with the table recognition intent
        var ocr = new VlmOcr(lm, VlmOcrIntent.TableRecognition);

        // Wrap the file (image or PDF) as an Attachment
        var attachment = new Attachment(filePath);

        // Run OCR page by page
        for (int pageIndex = 0; pageIndex < attachment.PageCount; pageIndex++)
        {
            var result = ocr.Run(attachment, pageIndex);

            // Extracted text
            Console.WriteLine(result.PageElement.Text);

            // Optional: generation stats
            Console.WriteLine($"Tokens  : {result.TextGeneration.GeneratedTokens.Count}");
            Console.WriteLine($"Quality : {result.TextGeneration.QualityScore}");
            Console.WriteLine($"Speed   : {result.TextGeneration.TokenGenerationRate} tok/s");
        }
    }
}
```
Use this pattern to integrate OCR into web APIs, background workers, or desktop apps.
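As one illustration, a minimal ASP.NET Core endpoint could reuse the same pattern; this is only a sketch (the endpoint shape, query-string input, and the single shared engine are assumptions, and concurrent requests would need to be serialized in a real service):

```csharp
using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Load the model once per process and reuse a single OCR engine
// (assumes requests are handled one at a time; serialize access in production).
var lm = LM.LoadFromModelID("paddleocr-vl:0.9b");
var ocr = new VlmOcr(lm, VlmOcrIntent.PlainText);

// GET /ocr?path=C:\scans\invoice.pdf  ->  plain text of every page
app.MapGet("/ocr", (string path) =>
{
    var attachment = new Attachment(path);
    var text = new StringBuilder();

    for (int page = 0; page < attachment.PageCount; page++)
        text.AppendLine(ocr.Run(attachment, page).PageElement.Text);

    return Results.Text(text.ToString());
});

app.Run();
```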
🛠️ Getting Started
📋 Prerequisites
- .NET 8.0 or later
📥 Download
```bash
git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/vlm_ocr
```
Project Link: vlm_ocr (same path as above)
▶️ Run
```bash
dotnet build
dotnet run
```
Then:
- Select a vision model by typing 0-6, or paste a custom model URI.
- Wait for the model to download (first run) and load.
- When prompted, type the path to an image or document file (or `q` to quit).
- Select an OCR intent (0-7).
- Inspect the recognized text and Stats block (per page).
- Press Enter to process another file, or type `q` to exit.
🔍 Notes on Key Types
`LM` (LMKit.Model): generic model wrapper used by LM-Kit.NET.
- Accepts a `Uri` pointing to the model.
- Uses callbacks for download and load progress.

`VlmOcr` (LMKit.Extraction.Ocr): OCR engine built on top of a vision model.
- Construct with `new VlmOcr(model, VlmOcrIntent.PlainText)` to set the desired intent.
- The `Intent` property returns the resolved intent governing instruction and post-processing.
- `Run(Attachment, pageIndex)` returns an OCR result with `PageElement` and `TextGeneration`.

`VlmOcrIntent` (LMKit.Extraction.Ocr): enum specifying the desired OCR outcome.
- The engine maps each intent to the best available instruction for the loaded model.

`Attachment` (LMKit.Data): wraps external data (here: image files and PDFs).
- `new Attachment(string path)` loads a file from disk.
- `PageCount` exposes the number of pages (images are typically 1; PDFs can be many).
- Exceptions are raised when the path is invalid or inaccessible.

`TextGeneration`: metadata about the underlying generative pass.
- `GeneratedTokens`, `TerminationReason`, `QualityScore`, `TokenGenerationRate`, `ContextTokens`, `ContextSize`.
🔧 Extend the Demo
- Write output to disk (`--out output.txt`) instead of only printing to the console.
- Add page selection for PDFs (`--pages 1,3-5`).
- Add batch mode: process a directory of files.
- Combine multiple intents on the same document (e.g., `PlainText` + `TableRecognition`).
- Post-process `PageElement.Text` to normalize whitespace or feed it into downstream extraction pipelines.
- Combine with LM-Kit's Structured Extraction to go from document -> text -> structured data in one flow.
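As a starting point for the batch-mode idea above, something like this could process a whole folder (directory path, file filter, and output naming are placeholders):

```csharp
using System;
using System.IO;
using System.Linq;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

var lm = LM.LoadFromModelID("paddleocr-vl:0.9b");
var ocr = new VlmOcr(lm, VlmOcrIntent.PlainText);

// Process every PDF and PNG in a folder and write one .txt per input file.
foreach (var path in Directory.EnumerateFiles(@"C:\scans")
                              .Where(p => p.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase)
                                       || p.EndsWith(".png", StringComparison.OrdinalIgnoreCase)))
{
    var attachment = new Attachment(path);
    using var writer = new StreamWriter(Path.ChangeExtension(path, ".txt"));

    for (int page = 0; page < attachment.PageCount; page++)
        writer.WriteLine(ocr.Run(attachment, page).PageElement.Text);

    Console.WriteLine($"done: {path}");
}
```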
📚 Additional Resources
- VlmOcr API Reference
- Attachment API Reference
- PaddleOCR VL on HuggingFace
- Document to Markdown Demo: similar demo focused on Markdown output