👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/vlm_ocr

VLM OCR for C# .NET Applications


🎯 Purpose of the Demo

VLM OCR demonstrates how to use LM-Kit.NET with vision-language models to extract plain text from images, PDFs, and scanned documents using on-device OCR inference.

The sample shows how to:

  • Download and load a vision model with progress callbacks.
  • Wrap it with LM-Kit's VlmOcr engine.
  • Feed images or PDFs as Attachment objects.
  • Process multi-page inputs using Attachment.PageCount.
  • Select an OCR intent (VlmOcrIntent) to control the desired output: plain text, Markdown, table recognition, formula recognition, chart recognition, OCR with coordinates, or seal recognition.
  • Retrieve recognized text plus generation statistics (tokens, speed, quality, context usage).

Why VLM OCR with LM-Kit.NET?

  • Local-first: run OCR on your own hardware for privacy-sensitive workloads.
  • Unified API: same model abstraction (LM) for text and vision pipelines.
  • Intent-driven: select a VlmOcrIntent and the engine maps it to the best instruction and post-processing for the loaded model.
  • Rich telemetry: quality score, token usage, and performance metrics per page.
  • Ultra-compact: PaddleOCR VL 1.5 requires only ~1 GB VRAM for accurate document OCR.

👥 Target Audience

  • Product and Platform: add OCR to existing .NET backends or pipelines.
  • Data and Document Processing: bulk ingest of PDFs, scans, screenshots, invoices, and receipts.
  • RPA and Back-office: extract text from forms, tables, formulas, charts, and stamps.
  • Demo and Education: minimal, readable example of VLM-based OCR in C#.

🚀 Problem Solved

  • Turn images and PDFs into text: extract readable text from photos, screenshots, scans, and PDF pages.
  • Specialized recognition: use dedicated intents for tables, formulas, charts, and seals.
  • Model flexibility: select a model based on your available VRAM and accuracy needs.
  • Operational visibility: built-in stats on speed, context usage, and quality.
  • Multi-page handling: iterate through PDF pages automatically with PageCount.

💻 Sample Application Description

Console app that:

  • Lets you choose a vision model (PaddleOCR VL is the recommended default) or paste a custom model URI.

  • Downloads the model if needed, with live progress updates.

  • Repeatedly asks you for a file path (image or PDF), then:

    • Prompts you to select an OCR intent (plain text, Markdown, table, formula, chart, coordinates, seal).
    • Creates a VlmOcr instance with the selected intent.
    • Loads the file as an Attachment.
    • Runs OCR page-by-page via ocr.Run(attachment, pageIndex).
    • Prints the extracted text to the console.
  • Displays a stats block (intent, elapsed time, tokens, quality, speed, context usage).

  • Loops until you type q to quit.

✨ Key Features

  • 🧠 Vision-based OCR: uses a multimodal model behind VlmOcr.

  • 🔧 Intent-driven modes: seven intents that can be requested with any supported vision model.

  • 📄 Image + PDF support: the same code path handles both formats.

  • 📥 Interactive loop: enter file path -> select intent -> get text -> see metrics -> repeat.

  • 📑 Multi-page aware: prints results per page using attachment.PageCount.

  • 📊 Telemetry:

    • Elapsed time (seconds)
    • Generated tokens count
    • Stop reason
    • Quality score
    • Token generation rate
    • Context tokens vs context size
  • 📦 Model lifecycle:

    • Automatic download on first use.
    • Loading progress shown in the console.
  • ❌ Graceful errors: friendly message when a file path is invalid or inaccessible.


🧰 Built-In Models (menu)

On startup, the sample shows a model selection menu:

Option  Model                                 Approx. VRAM Needed
0       PaddlePaddle PaddleOCR VL 1.5 0.9B    ~1 GB
1       LightOn LightOnOCR 2 1B               ~2 GB
2       MiniCPM o 4.5 9B                      ~5.9 GB
3       Alibaba Qwen 3 VL 2B                  ~2.5 GB
4       Alibaba Qwen 3 VL 4B                  ~4.5 GB
5       Alibaba Qwen 3 VL 8B                  ~6.5 GB
6       Google Gemma 3 4B                     ~5.7 GB
other   Custom model URI (GGUF / LMK, etc.)   depends on model

Any input other than 0-6 is treated as a custom model URI and passed directly to the LM constructor.
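
As an illustrative sketch, the menu choice could be resolved along these lines. LM.LoadFromModelID matches the Minimal Integration Snippet later in this README and the Uri-based constructor follows the Notes on Key Types section; the helper name and the parsing logic are hypothetical, and progress callbacks are omitted for brevity.

using System;
using LMKit.Model;

public static class ModelSelection
{
    // Menu options 0-6 map to LM-Kit's predefined model IDs (see Supported Models below);
    // anything else is treated as a custom model URI. Helper name and parsing are illustrative.
    public static LM LoadSelectedModel(string input)
    {
        string[] modelIds =
        {
            "paddleocr-vl:0.9b", // 0 - recommended default
            "lightonocr-2:1b",   // 1
            "minicpm-o-45",      // 2
            "qwen3-vl:2b",       // 3
            "qwen3-vl:4b",       // 4
            "qwen3-vl:8b",       // 5
            "gemma3:4b"          // 6
        };

        if (int.TryParse(input, out int choice) && choice >= 0 && choice < modelIds.Length)
        {
            return LM.LoadFromModelID(modelIds[choice]);
        }

        // Custom model URI: local path, GGUF/LMK file, or model server endpoint.
        return new LM(new Uri(input));
    }
}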


🔧 OCR Intents

Before processing each document, you select an intent that describes the desired output:

#  Intent              Description
0  Undefined           Auto: engine picks the best default for the model
1  PlainText           Plain text OCR
2  Markdown            Markdown conversion with structural elements
3  TableRecognition    Structured table extraction
4  FormulaRecognition  Mathematical formula recognition
5  ChartRecognition    Chart and graph data extraction
6  OcrWithCoordinates  Text detection with bounding-box coordinates
7  SealRecognition     Official seal and stamp recognition

The engine maps each intent to the best available instruction for the loaded model. Not every model natively supports every intent; when an intent is unsupported, the engine falls back to the closest instruction and post-processing it can apply to approximate the requested output.
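
For illustration, the 0-7 console choice can be mapped directly onto these enum values. The prompt text and helper name below are illustrative, not the sample's verbatim code; unrecognized input falls back to Undefined so the engine picks its own default.

using System;
using LMKit.Extraction.Ocr;

public static class IntentSelection
{
    // Maps the console choice (0-7) onto VlmOcrIntent; anything else falls back to Undefined.
    public static VlmOcrIntent ReadIntent()
    {
        Console.Write("Select an OCR intent (0-7): ");
        string input = Console.ReadLine();

        return input switch
        {
            "1" => VlmOcrIntent.PlainText,
            "2" => VlmOcrIntent.Markdown,
            "3" => VlmOcrIntent.TableRecognition,
            "4" => VlmOcrIntent.FormulaRecognition,
            "5" => VlmOcrIntent.ChartRecognition,
            "6" => VlmOcrIntent.OcrWithCoordinates,
            "7" => VlmOcrIntent.SealRecognition,
            _ => VlmOcrIntent.Undefined
        };
    }
}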


🧠 Supported Models

The sample is pre-wired to LM-Kit's predefined model cards:

  • paddleocr-vl:0.9b (recommended)
  • lightonocr-2:1b
  • minicpm-o-45
  • qwen3-vl:2b
  • qwen3-vl:4b
  • qwen3-vl:8b
  • gemma3:4b

You can also provide any valid model URI manually (including local paths or custom model servers) by typing or pasting it when prompted.


🛠️ Commands and Flow

Inside the console loop:

  • On startup

    • Select a model (0-6) or paste a custom model URI.
    • The model is downloaded (if needed) and loaded with progress reporting.
  • Per document (image or PDF)

    • The app prompts: enter image or document path (or 'q' to quit):

    • Type a file path and press Enter.

    • Select an OCR intent (0-7).

    • The app loads the file into an Attachment.

    • The app iterates pages:

      • For images, this is typically 1 page.
      • For PDFs, this can be N pages.
    • For each page, OCR runs and prints:

      • The recognized text
      • A Stats section
  • Quit

    • At any prompt, typing q exits the app cleanly.
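
Put together, the loop above can be sketched roughly as follows. The prompt wording and the reuse of the ReadIntent helper sketched in the OCR Intents section are illustrative; the per-page OCR call mirrors the Minimal Integration Snippet further down.

using System;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

public static class OcrLoop
{
    // Interactive loop: prompt for a path, pick an intent, OCR every page, repeat until 'q'.
    public static void Run(LM lm)
    {
        while (true)
        {
            Console.Write("enter image or document path (or 'q' to quit): ");
            string path = Console.ReadLine();

            if (string.Equals(path, "q", StringComparison.OrdinalIgnoreCase))
            {
                break; // clean exit
            }

            VlmOcrIntent intent = IntentSelection.ReadIntent(); // helper sketched in the OCR Intents section

            try
            {
                var ocr = new VlmOcr(lm, intent);
                var attachment = new Attachment(path);

                // Images are typically 1 page; PDFs can be N pages.
                for (int page = 0; page < attachment.PageCount; page++)
                {
                    Console.WriteLine(ocr.Run(attachment, page).PageElement.Text);
                }
            }
            catch (Exception ex)
            {
                // Friendly message when the file path is invalid or inaccessible.
                Console.WriteLine($"Could not process '{path}': {ex.Message}");
            }
        }
    }
}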

🗣️ Example Use Cases

Try the sample with:

  • A scanned invoice image -> use PlainText intent to extract all text.
  • A multi-page PDF report -> use Markdown intent for structured output.
  • A table screenshot -> use TableRecognition intent for row/column extraction.
  • A math problem photo -> use FormulaRecognition intent to get LaTeX notation.
  • A chart or graph -> use ChartRecognition to extract data points.
  • A document with a stamp -> use SealRecognition to read the seal text.

After each run, compare:

  • Quality score: does the text look correct vs. the page?
  • Token usage and speed: does a bigger model give better quality at acceptable latency?

⚙️ Behavior and Policies (quick reference)

  • Model selection: exactly one model per process. To change models, restart the app.

  • Download and load:

    • ModelDownloadingProgress prints Downloading XX.XX% or byte counts.
    • ModelLoadingProgress prints Loading XX% and clears the console once done.
  • OCR engine:

    • VlmOcr runs OCR with the selected vision model and intent.
    • The Intent property reflects the resolved intent (never Undefined).
    • result.PageElement.Text is the recognized text for the page.
  • Multi-page processing:

    • Attachment.PageCount is used to iterate over pages.
    • OCR is executed per page using ocr.Run(attachment, pageIndex).
  • Licensing:

    • You can set an optional license key via LicenseManager.SetLicenseKey("").
    • A free community license is available from the LM-Kit website.
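
A startup sketch tying these policies together. The LM constructor parameter names and the callback signatures below follow the pattern used in other LM-Kit console samples and should be treated as assumptions rather than this sample's exact code, as should the LMKit.Licensing namespace for LicenseManager.

using System;
using LMKit.Model;

public static class Startup
{
    public static LM LoadModel(Uri modelUri)
    {
        // Optional: a free community license key can be set here (namespace assumed).
        LMKit.Licensing.LicenseManager.SetLicenseKey("");

        // Downloads the model on first use, then loads it with progress reporting.
        // Parameter names are assumptions based on other LM-Kit console samples.
        return new LM(modelUri,
                      downloadingProgress: ModelDownloadingProgress,
                      loadingProgress: ModelLoadingProgress);
    }

    // Prints "Downloading XX.XX%" when the total size is known, byte counts otherwise.
    private static bool ModelDownloadingProgress(string path, long? contentLength, long bytesRead)
    {
        if (contentLength.HasValue)
        {
            Console.Write($"\rDownloading {100d * bytesRead / contentLength.Value:0.00}%");
        }
        else
        {
            Console.Write($"\rDownloading {bytesRead:N0} bytes");
        }

        return true; // continue the download
    }

    // Prints "Loading XX%"; the sample clears the console once loading completes.
    private static bool ModelLoadingProgress(float progress)
    {
        Console.Write($"\rLoading {(int)(progress * 100)}%");
        return true; // continue loading
    }
}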

💻 Minimal Integration Snippet

using System;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

public class VlmOcrSample
{
    public void RunOcr(string filePath)
    {
        // Load PaddleOCR VL model
        var lm = LM.LoadFromModelID("paddleocr-vl:0.9b");

        // Create OCR engine with table recognition intent
        var ocr = new VlmOcr(lm, VlmOcrIntent.TableRecognition);

        // Wrap the file (image or PDF) as an Attachment
        var attachment = new Attachment(filePath);

        // Run OCR page-by-page
        for (int pageIndex = 0; pageIndex < attachment.PageCount; pageIndex++)
        {
            var result = ocr.Run(attachment, pageIndex);

            // Extracted text
            Console.WriteLine(result.PageElement.Text);

            // Optional: generation stats
            Console.WriteLine($"Tokens  : {result.TextGeneration.GeneratedTokens.Count}");
            Console.WriteLine($"Quality : {result.TextGeneration.QualityScore}");
            Console.WriteLine($"Speed   : {result.TextGeneration.TokenGenerationRate} tok/s");
        }
    }
}

Use this pattern to integrate OCR into web APIs, background workers, or desktop apps.


🛠️ Getting Started

📋 Prerequisites

  • .NET 8.0 or later

📥 Download

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/vlm_ocr

Project Link: vlm_ocr (same path as above)

▶️ Run

dotnet build
dotnet run

Then:

  1. Select a vision model by typing 0-6, or paste a custom model URI.
  2. Wait for the model to download (first run) and load.
  3. When prompted, type the path to an image or document file (or q to quit).
  4. Select an OCR intent (0-7).
  5. Inspect the recognized text and Stats block (per page).
  6. Press Enter to process another file, or q to exit.

🔍 Notes on Key Types

  • LM (LMKit.Model): generic model wrapper used by LM-Kit.NET.

    • Accepts a Uri pointing to the model.
    • Uses callbacks for download and load progress.
  • VlmOcr (LMKit.Extraction.Ocr): OCR engine built on top of a vision model.

    • Construct with new VlmOcr(model, VlmOcrIntent.PlainText) to set the desired intent.
    • Intent property returns the resolved intent governing instruction and post-processing.
    • Run(Attachment, pageIndex) returns an OCR result with PageElement and TextGeneration.
  • VlmOcrIntent (LMKit.Extraction.Ocr): enum specifying the desired OCR outcome.

    • The engine maps each intent to the best available instruction for the loaded model.
  • Attachment (LMKit.Data): wraps external data (here: image files and PDFs).

    • new Attachment(string path) loads a file from disk.
    • PageCount exposes the number of pages (images are typically 1; PDFs can be many).
    • Exceptions are raised when the path is invalid or inaccessible.
  • TextGeneration: metadata about the underlying generative pass.

    • GeneratedTokens, TerminationReason, QualityScore, TokenGenerationRate, ContextTokens, ContextSize.
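
Combining those members, a per-page Stats block similar to the sample's could be printed like this. The property names come from the list above; the layout and the Stopwatch-based elapsed time are illustrative.

using System;
using System.Diagnostics;
using LMKit.Data;
using LMKit.Extraction.Ocr;

public static class OcrStats
{
    // Runs OCR on one page and prints the recognized text followed by a Stats block
    // built from the TextGeneration metadata. Formatting is illustrative only.
    public static void RunWithStats(VlmOcr ocr, Attachment attachment, int pageIndex)
    {
        var watch = Stopwatch.StartNew();
        var result = ocr.Run(attachment, pageIndex);
        watch.Stop();

        Console.WriteLine(result.PageElement.Text);

        var stats = result.TextGeneration;
        Console.WriteLine("--- Stats ---");
        Console.WriteLine($"Intent        : {ocr.Intent}");
        Console.WriteLine($"Elapsed       : {watch.Elapsed.TotalSeconds:F2} s");
        Console.WriteLine($"Stop reason   : {stats.TerminationReason}");
        Console.WriteLine($"Tokens        : {stats.GeneratedTokens.Count}");
        Console.WriteLine($"Quality score : {stats.QualityScore}");
        Console.WriteLine($"Speed         : {stats.TokenGenerationRate} tok/s");
        Console.WriteLine($"Context usage : {stats.ContextTokens} / {stats.ContextSize}");
    }
}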

🔧 Extend the Demo

  • Write output to disk (--out output.txt) instead of only printing to console.
  • Add page selection for PDFs (--pages 1,3-5).
  • Add batch mode: process a directory of files.
  • Combine multiple intents on the same document (e.g., PlainText + TableRecognition).
  • Post-process PageElement.Text to normalize whitespace or feed into downstream extraction pipelines.
  • Combine with LM-Kit's Structured Extraction to go from document -> text -> structured data in one flow.
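
As one example, a batch-mode extension could walk a directory with the same per-page pattern and write one text file per input. The extension filter, helper name, and output layout here are hypothetical.

using System;
using System.IO;
using System.Linq;
using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

public static class BatchOcr
{
    // Hypothetical filter; extend as needed for other formats the sample accepts.
    static readonly string[] Extensions = { ".pdf", ".png", ".jpg", ".jpeg" };

    // Processes every supported file in a directory and writes one .txt per input.
    public static void ProcessDirectory(LM lm, string inputDir, string outputDir)
    {
        var ocr = new VlmOcr(lm, VlmOcrIntent.PlainText);
        Directory.CreateDirectory(outputDir);

        foreach (string file in Directory.EnumerateFiles(inputDir)
                     .Where(f => Extensions.Contains(Path.GetExtension(f).ToLowerInvariant())))
        {
            var attachment = new Attachment(file);
            var text = new StringBuilder();

            for (int page = 0; page < attachment.PageCount; page++)
            {
                text.AppendLine(ocr.Run(attachment, page).PageElement.Text);
            }

            string outPath = Path.Combine(outputDir, Path.GetFileNameWithoutExtension(file) + ".txt");
            File.WriteAllText(outPath, text.ToString());
            Console.WriteLine($"Wrote {outPath}");
        }
    }
}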

📚 Additional Resources