👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/vlm_ocr

VLM OCR for C# .NET Applications


🎯 Purpose of the Demo

VLM OCR demonstrates how to use LM-Kit.NET with vision-language models to extract plain text from images, PDFs, and scanned documents using on-device OCR inference.

The sample shows how to:

  • Download and load a vision model with progress callbacks.
  • Wrap it with LM-Kit's VlmOcr engine.
  • Feed images or PDFs as Attachment objects.
  • Process multi-page inputs using Attachment.PageCount.
  • Select an OCR intent (VlmOcrIntent) to control the desired output: plain text, Markdown, table recognition, formula recognition, chart recognition, OCR with coordinates, or seal recognition.
  • Retrieve recognized text plus generation statistics (tokens, speed, quality, context usage).

Why VLM OCR with LM-Kit.NET?

  • Local-first: run OCR on your own hardware for privacy-sensitive workloads.
  • Unified API: same model abstraction (LM) for text and vision pipelines.
  • Intent-driven: select a VlmOcrIntent and the engine maps it to the best instruction and post-processing for the loaded model.
  • Rich telemetry: quality score, token usage, and performance metrics per page.
  • Ultra-compact: PaddleOCR VL 1.5 requires only ~1 GB VRAM for accurate document OCR.

👥 Target Audience

  • Product and Platform: add OCR to existing .NET backends or pipelines.
  • Data and Document Processing: bulk ingest of PDFs, scans, screenshots, invoices, and receipts.
  • RPA and Back-office: extract text from forms, tables, formulas, charts, and stamps.
  • Demo and Education: minimal, readable example of VLM-based OCR in C#.

🚀 Problem Solved

  • Turn images and PDFs into text: extract readable text from photos, screenshots, scans, and PDF pages.
  • Specialized recognition: use dedicated intents for tables, formulas, charts, and seals.
  • Model flexibility: select a model based on your available VRAM and accuracy needs.
  • Operational visibility: built-in stats on speed, context usage, and quality.
  • Multi-page handling: iterate through PDF pages automatically with PageCount.

💻 Sample Application Description

Console app that:

  • Lets you choose a vision model (PaddleOCR VL is the recommended default) or paste a custom model URI.

  • Downloads the model if needed, with live progress updates.

  • Repeatedly asks you for a file path (image or PDF), then:

    • Prompts you to select an OCR intent (plain text, Markdown, table, formula, chart, coordinates, seal).
    • Creates a VlmOcr instance with the selected intent.
    • Loads the file as an Attachment.
    • Runs OCR page-by-page via ocr.Run(attachment, pageIndex).
    • Prints the extracted text to the console.
  • Displays a stats block (intent, elapsed time, tokens, quality, speed, context usage).

  • Loops until you type q to quit.

✨ Key Features

  • 🧠 Vision-based OCR: uses a multimodal model behind VlmOcr.

  • 🔧 Intent-driven modes: seven intents that can be requested with any supported vision model.

  • 📄 Image + PDF support: the same code path handles both formats.

  • 📥 Interactive loop: enter file path -> select intent -> get text -> see metrics -> repeat.

  • 📑 Multi-page aware: prints results per page using attachment.PageCount.

  • 📊 Telemetry:

    • Elapsed time (seconds)
    • Generated tokens count
    • Stop reason
    • Quality score
    • Token generation rate
    • Context tokens vs context size
  • 📦 Model lifecycle:

    • Automatic download on first use.
    • Loading progress shown in the console.
  • ❌ Graceful errors: friendly message when a file path is invalid or inaccessible.


🧰 Built-In Models (menu)

On startup, the sample shows a model selection menu:

Option  Model                                 Approx. VRAM Needed
0       PaddlePaddle PaddleOCR VL 1.5 0.9B    ~1 GB
1       LightOn LightOnOCR 2 1B               ~2 GB
2       MiniCPM o 4.5 9B                      ~5.9 GB
3       Alibaba Qwen 3 VL 2B                  ~2.5 GB
4       Alibaba Qwen 3 VL 4B                  ~4.5 GB
5       Alibaba Qwen 3 VL 8B                  ~6.5 GB
6       Google Gemma 3 4B                     ~5.7 GB
other   Custom model URI (GGUF / LMK, etc.)   depends on model

Any input other than 0-6 is treated as a custom model URI and passed directly to the LM constructor.
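
As an illustrative sketch, the menu choice could be resolved along these lines. LM.LoadFromModelID matches the Minimal Integration Snippet later in this README and the Uri-based constructor follows the Notes on Key Types section; the helper name and the parsing logic are hypothetical, and progress callbacks are omitted for brevity.

using System;
using LMKit.Model;

public static class ModelSelection
{
    // Menu options 0-6 map to LM-Kit's predefined model IDs (see Supported Models below);
    // anything else is treated as a custom model URI. Helper name and parsing are illustrative.
    public static LM LoadSelectedModel(string input)
    {
        string[] modelIds =
        {
            "paddleocr-vl:0.9b", // 0 - recommended default
            "lightonocr-2:1b",   // 1
            "minicpm-o-45",      // 2
            "qwen3-vl:2b",       // 3
            "qwen3-vl:4b",       // 4
            "qwen3-vl:8b",       // 5
            "gemma3:4b"          // 6
        };

        if (int.TryParse(input, out int choice) && choice >= 0 && choice < modelIds.Length)
        {
            return LM.LoadFromModelID(modelIds[choice]);
        }

        // Custom model URI: local path, GGUF/LMK file, or model server endpoint.
        return new LM(new Uri(input));
    }
}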


🔧 OCR Intents

Before processing each document, you select an intent that describes the desired output:

#  Intent              Description
0  Undefined           Auto: engine picks the best default for the model
1  PlainText           Plain text OCR
2  Markdown            Markdown conversion with structural elements
3  TableRecognition    Structured table extraction
4  FormulaRecognition  Mathematical formula recognition
5  ChartRecognition    Chart and graph data extraction
6  OcrWithCoordinates  Text detection with bounding-box coordinates
7  SealRecognition     Official seal and stamp recognition

The engine maps each intent to the best available instruction for the loaded model. Not every model natively supports every intent; when an intent is unsupported, the engine falls back to the closest instruction and post-processing it can apply to approximate the requested output.
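
For illustration, the 0-7 console choice can be mapped directly onto these enum values. The prompt text and helper name below are illustrative, not the sample's verbatim code; unrecognized input falls back to Undefined so the engine picks its own default.

using System;
using LMKit.Extraction.Ocr;

public static class IntentSelection
{
    // Maps the console choice (0-7) onto VlmOcrIntent; anything else falls back to Undefined.
    public static VlmOcrIntent ReadIntent()
    {
        Console.Write("Select an OCR intent (0-7): ");
        string input = Console.ReadLine();

        return input switch
        {
            "1" => VlmOcrIntent.PlainText,
            "2" => VlmOcrIntent.Markdown,
            "3" => VlmOcrIntent.TableRecognition,
            "4" => VlmOcrIntent.FormulaRecognition,
            "5" => VlmOcrIntent.ChartRecognition,
            "6" => VlmOcrIntent.OcrWithCoordinates,
            "7" => VlmOcrIntent.SealRecognition,
            _ => VlmOcrIntent.Undefined
        };
    }
}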


🧠 Supported Models

The sample is pre-wired to LM-Kit's predefined model cards:

  • paddleocr-vl:0.9b (recommended)
  • lightonocr-2:1b
  • minicpm-o-45
  • qwen3-vl:2b
  • qwen3-vl:4b
  • qwen3-vl:8b
  • gemma3:4b

You can also provide any valid model URI manually (including local paths or custom model servers) by typing or pasting it when prompted.


🛠️ Commands and Flow

Inside the console loop:

  • On startup

    • Select a model (0-6) or paste a custom model URI.
    • The model is downloaded (if needed) and loaded with progress reporting.
  • Per document (image or PDF)

    • The app prompts: enter image or document path (or 'q' to quit):

    • Type a file path and press Enter.

    • Select an OCR intent (0-7).

    • The app loads the file into an Attachment.

    • The app iterates pages:

      • For images, this is typically 1 page.
      • For PDFs, this can be N pages.
    • For each page, OCR runs and prints:

      • The recognized text
      • A Stats section
  • Quit

    • At any prompt, typing q exits the app cleanly.
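
Put together, the loop above can be sketched roughly as follows. The prompt wording and the reuse of the ReadIntent helper sketched in the OCR Intents section are illustrative; the per-page OCR call mirrors the Minimal Integration Snippet further down.

using System;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

public static class OcrLoop
{
    // Interactive loop: prompt for a path, pick an intent, OCR every page, repeat until 'q'.
    public static void Run(LM lm)
    {
        while (true)
        {
            Console.Write("enter image or document path (or 'q' to quit): ");
            string path = Console.ReadLine();

            if (string.Equals(path, "q", StringComparison.OrdinalIgnoreCase))
            {
                break; // clean exit
            }

            VlmOcrIntent intent = IntentSelection.ReadIntent(); // helper sketched in the OCR Intents section

            try
            {
                var ocr = new VlmOcr(lm, intent);
                var attachment = new Attachment(path);

                // Images are typically 1 page; PDFs can be N pages.
                for (int page = 0; page < attachment.PageCount; page++)
                {
                    Console.WriteLine(ocr.Run(attachment, page).PageElement.Text);
                }
            }
            catch (Exception ex)
            {
                // Friendly message when the file path is invalid or inaccessible.
                Console.WriteLine($"Could not process '{path}': {ex.Message}");
            }
        }
    }
}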

🗣️ Example Use Cases

Try the sample with:

  • A scanned invoice image -> use PlainText intent to extract all text.
  • A multi-page PDF report -> use Markdown intent for structured output.
  • A table screenshot -> use TableRecognition intent for row/column extraction.
  • A math problem photo -> use FormulaRecognition intent to get LaTeX notation.
  • A chart or graph -> use ChartRecognition to extract data points.
  • A document with a stamp -> use SealRecognition to read the seal text.

After each run, compare:

  • Quality score: does the text look correct vs. the page?
  • Token usage and speed: does a bigger model give better quality at acceptable latency?

⚙️ Behavior and Policies (quick reference)

  • Model selection: exactly one model per process. To change models, restart the app.

  • Download and load:

    • ModelDownloadingProgress prints Downloading XX.XX% or byte counts.
    • ModelLoadingProgress prints Loading XX% and clears the console once done.
  • OCR engine:

    • VlmOcr runs OCR with the selected vision model and intent.
    • The Intent property reflects the resolved intent (never Undefined).
    • result.PageElement.Text is the recognized text for the page.
  • Multi-page processing:

    • Attachment.PageCount is used to iterate over pages.
    • OCR is executed per page using ocr.Run(attachment, pageIndex).
  • Licensing:

    • You can set an optional license key via LicenseManager.SetLicenseKey("").
    • A free community license is available from the LM-Kit website.
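
A startup sketch tying these policies together. The LM constructor parameter names and the callback signatures below follow the pattern used in other LM-Kit console samples and should be treated as assumptions rather than this sample's exact code, as should the LMKit.Licensing namespace for LicenseManager.

using System;
using LMKit.Model;

public static class Startup
{
    public static LM LoadModel(Uri modelUri)
    {
        // Optional: a free community license key can be set here (namespace assumed).
        LMKit.Licensing.LicenseManager.SetLicenseKey("");

        // Downloads the model on first use, then loads it with progress reporting.
        // Parameter names are assumptions based on other LM-Kit console samples.
        return new LM(modelUri,
                      downloadingProgress: ModelDownloadingProgress,
                      loadingProgress: ModelLoadingProgress);
    }

    // Prints "Downloading XX.XX%" when the total size is known, byte counts otherwise.
    private static bool ModelDownloadingProgress(string path, long? contentLength, long bytesRead)
    {
        if (contentLength.HasValue)
        {
            Console.Write($"\rDownloading {100d * bytesRead / contentLength.Value:0.00}%");
        }
        else
        {
            Console.Write($"\rDownloading {bytesRead:N0} bytes");
        }

        return true; // continue the download
    }

    // Prints "Loading XX%"; the sample clears the console once loading completes.
    private static bool ModelLoadingProgress(float progress)
    {
        Console.Write($"\rLoading {(int)(progress * 100)}%");
        return true; // continue loading
    }
}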

💻 Minimal Integration Snippet

using System;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

public class VlmOcrSample
{
    public void RunOcr(string filePath)
    {
        // Load PaddleOCR VL model
        var lm = LM.LoadFromModelID("paddleocr-vl:0.9b");

        // Create OCR engine with table recognition intent
        var ocr = new VlmOcr(lm, VlmOcrIntent.TableRecognition);

        // Wrap the file (image or PDF) as an Attachment
        var attachment = new Attachment(filePath);

        // Run OCR page-by-page
        for (int pageIndex = 0; pageIndex < attachment.PageCount; pageIndex++)
        {
            var result = ocr.Run(attachment, pageIndex);

            // Extracted text
            Console.WriteLine(result.PageElement.Text);

            // Optional: generation stats
            Console.WriteLine($"Tokens  : {result.TextGeneration.GeneratedTokens.Count}");
            Console.WriteLine($"Quality : {result.TextGeneration.QualityScore}");
            Console.WriteLine($"Speed   : {result.TextGeneration.TokenGenerationRate} tok/s");
        }
    }
}

Use this pattern to integrate OCR into web APIs, background workers, or desktop apps.


🛠️ Getting Started

📋 Prerequisites

  • .NET 8.0 or later

📥 Download

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/vlm_ocr

Project Link: vlm_ocr (same path as above)

▶️ Run

dotnet build
dotnet run

Then:

  1. Select a vision model by typing 0-6, or paste a custom model URI.
  2. Wait for the model to download (first run) and load.
  3. When prompted, type the path to an image or document file (or q to quit).
  4. Select an OCR intent (0-7).
  5. Inspect the recognized text and Stats block (per page).
  6. Press Enter to process another file, or q to exit.

🔍 Notes on Key Types

  • LM (LMKit.Model): generic model wrapper used by LM-Kit.NET.

    • Accepts a Uri pointing to the model.
    • Uses callbacks for download and load progress.
  • VlmOcr (LMKit.Extraction.Ocr): OCR engine built on top of a vision model.

    • Construct with new VlmOcr(model, VlmOcrIntent.PlainText) to set the desired intent.
    • Intent property returns the resolved intent governing instruction and post-processing.
    • Run(Attachment, pageIndex) returns an OCR result with PageElement and TextGeneration.
  • VlmOcrIntent (LMKit.Extraction.Ocr): enum specifying the desired OCR outcome.

    • The engine maps each intent to the best available instruction for the loaded model.
  • Attachment (LMKit.Data): wraps external data (here: image files and PDFs).

    • new Attachment(string path) loads a file from disk.
    • PageCount exposes the number of pages (images are typically 1; PDFs can be many).
    • Exceptions are raised when the path is invalid or inaccessible.
  • TextGeneration: metadata about the underlying generative pass.

    • GeneratedTokens, TerminationReason, QualityScore, TokenGenerationRate, ContextTokens, ContextSize.
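
Combining those members, a per-page Stats block similar to the sample's could be printed like this. The property names come from the list above; the layout and the Stopwatch-based elapsed time are illustrative.

using System;
using System.Diagnostics;
using LMKit.Data;
using LMKit.Extraction.Ocr;

public static class OcrStats
{
    // Runs OCR on one page and prints the recognized text followed by a Stats block
    // built from the TextGeneration metadata. Formatting is illustrative only.
    public static void RunWithStats(VlmOcr ocr, Attachment attachment, int pageIndex)
    {
        var watch = Stopwatch.StartNew();
        var result = ocr.Run(attachment, pageIndex);
        watch.Stop();

        Console.WriteLine(result.PageElement.Text);

        var stats = result.TextGeneration;
        Console.WriteLine("--- Stats ---");
        Console.WriteLine($"Intent        : {ocr.Intent}");
        Console.WriteLine($"Elapsed       : {watch.Elapsed.TotalSeconds:F2} s");
        Console.WriteLine($"Stop reason   : {stats.TerminationReason}");
        Console.WriteLine($"Tokens        : {stats.GeneratedTokens.Count}");
        Console.WriteLine($"Quality score : {stats.QualityScore}");
        Console.WriteLine($"Speed         : {stats.TokenGenerationRate} tok/s");
        Console.WriteLine($"Context usage : {stats.ContextTokens} / {stats.ContextSize}");
    }
}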

🔧 Extend the Demo

  • Write output to disk (--out output.txt) instead of only printing to console.
  • Add page selection for PDFs (--pages 1,3-5).
  • Add batch mode: process a directory of files.
  • Combine multiple intents on the same document (e.g., PlainText + TableRecognition).
  • Post-process PageElement.Text to normalize whitespace or feed into downstream extraction pipelines.
  • Combine with LM-Kit's Structured Extraction to go from document -> text -> structured data in one flow.
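
As one example, a batch-mode extension could walk a directory with the same per-page pattern and write one text file per input. The extension filter, helper name, and output layout here are hypothetical.

using System;
using System.IO;
using System.Linq;
using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

public static class BatchOcr
{
    // Hypothetical filter; extend as needed for other formats the sample accepts.
    static readonly string[] Extensions = { ".pdf", ".png", ".jpg", ".jpeg" };

    // Processes every supported file in a directory and writes one .txt per input.
    public static void ProcessDirectory(LM lm, string inputDir, string outputDir)
    {
        var ocr = new VlmOcr(lm, VlmOcrIntent.PlainText);
        Directory.CreateDirectory(outputDir);

        foreach (string file in Directory.EnumerateFiles(inputDir)
                     .Where(f => Extensions.Contains(Path.GetExtension(f).ToLowerInvariant())))
        {
            var attachment = new Attachment(file);
            var text = new StringBuilder();

            for (int page = 0; page < attachment.PageCount; page++)
            {
                text.AppendLine(ocr.Run(attachment, page).PageElement.Text);
            }

            string outPath = Path.Combine(outputDir, Path.GetFileNameWithoutExtension(file) + ".txt");
            File.WriteAllText(outPath, text.ToString());
            Console.WriteLine($"Wrote {outPath}");
        }
    }
}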

📚 Additional Resources