Document-to-Markdown Vision OCR in .NET Applications

πŸ‘‰ Try the demo:
https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/document_to_markdown

🎯 Purpose of the Sample

Document-to-Markdown Vision OCR demonstrates how to use LM-Kit.NET with vision-capable models to run on-device OCR on images and PDF documents (scans, screenshots, receipts, reports, etc.) and convert them into clean plain text or Markdown-style output in an interactive loop.

The sample shows how to:

  • Download and load a vision model with progress callbacks (sketched after this list).
  • Wrap it with LM-Kit’s VlmOcr engine.
  • Feed images or PDFs as Attachment objects.
  • Process multi-page inputs using Attachment.PageCount.
  • Retrieve recognized text plus generation statistics (tokens, speed, quality, context usage).
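
For the progress-callback item above, here is a minimal sketch of loading a model with console reporting. It assumes the callback shapes used in the minimal integration snippet later in this README (path, contentLength, bytesRead for downloading; a 0..1 progress value for loading) and a modelUri string already in scope; check the exact signatures in your LM-Kit.NET version.

// Sketch: progress callbacks that print status to the console
// (assumes contentLength is nullable, since a server may not report a size)
var lm = new LM(
    new Uri(modelUri),
    downloadingProgress: (path, contentLength, bytesRead) =>
    {
        if (contentLength.HasValue)
        {
            Console.Write($"\rDownloading model {(double)bytesRead / contentLength.Value * 100:0.00}%");
        }
        else
        {
            Console.Write($"\rDownloading model {bytesRead:N0} bytes");
        }

        return true; // return false to cancel the download
    },
    loadingProgress: progress =>
    {
        Console.Write($"\rLoading model {progress * 100:0}%");
        return true; // return false to abort loading
    });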

Why Vision OCR with LM-Kit.NET?

  • Local-first: run OCR on your own hardware for privacy-sensitive workloads.
  • Unified API: same model abstraction (LM) for text and vision pipelines.
  • Rich telemetry: quality score, token usage, and performance metrics per page.
  • Drop-in: replace existing OCR engines with minimal changes to your data flow.

πŸ‘₯ Target Audience

  • Product and Platform - add OCR to existing .NET backends or pipelines.
  • Data and Document Processing - bulk ingest of PDFs, scans, screenshots, etc.
  • RPA and Back-office - extract text from forms, invoices, tickets, and reports.
  • Demo and Education - minimal, readable example of vision + OCR in C#.

πŸš€ Problem Solved

  • Turn images and PDFs into text: extract readable text from photos, screenshots, scans, and PDF pages.
  • Model flexibility: select a model based on your available VRAM and latency needs.
  • Operational visibility: built-in stats on speed, context usage, and quality.
  • Repeatable loop: process one file after another in a single console session.
  • Multi-page handling: iterate through PDF pages automatically with PageCount.

πŸ’» Sample Application Description

Console app that:

  • Lets you choose a vision model (or paste a custom model URI).

  • Downloads the model if needed, with live progress updates.

  • Wraps it in a VlmOcr instance.

  • Repeatedly asks you for a file path (image or PDF), then:

    • Loads the file as an Attachment.
    • Runs OCR page-by-page via ocr.Run(attachment, pageIndex).
    • Prints the extracted text to the console.
  • Displays a stats block (elapsed time, tokens, quality, speed, context usage).

  • Loops until you type q to quit (the loop is sketched below).
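
A condensed sketch of that loop, assuming lm and ocr were already created as in the minimal integration snippet further down:

while (true)
{
    Console.Write("enter file path (image or PDF) (or 'q' to quit): ");
    string input = Console.ReadLine()?.Trim() ?? "";

    if (input.Equals("q", StringComparison.OrdinalIgnoreCase))
    {
        break;
    }

    try
    {
        // Throws if the path is invalid or inaccessible
        var attachment = new Attachment(input);

        for (int pageIndex = 0; pageIndex < attachment.PageCount; pageIndex++)
        {
            var result = ocr.Run(attachment, pageIndex);
            Console.WriteLine(result.PageElement.Text);
        }
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Could not process '{input}': {ex.Message}");
    }
}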

✨ Key Features

  • 🧠 Vision-based OCR: uses a multimodal model behind VlmOcr.

  • πŸ“„ Image + PDF support: the same code path handles both formats.

  • πŸ“₯ Interactive loop: enter file path -> get text -> see metrics -> repeat.

  • πŸ“‘ Multi-page aware: prints results per page using attachment.PageCount.

  • πŸ“Š Telemetry:

    • Elapsed time (seconds)
    • Generated tokens count
    • Stop reason
    • Quality score
    • Token generation rate
    • Context tokens vs context size
  • πŸ“¦ Model lifecycle:

    • Automatic download on first use.
    • Loading progress shown in the console.
  • ❌ Graceful errors: a friendly message when a file path is invalid or inaccessible.


🧰 Built-In Models (menu)

On startup, the sample shows a model selection menu:

Option  Model                                Approx. VRAM
0       LightOn LightOnOCR 1025 1B           ~2 GB
1       MiniCPM-o 2.6 8.1B                   ~5.9 GB
2       Alibaba Qwen 3 2B (vision)           ~2.5 GB
3       Alibaba Qwen 3 4B (vision)           ~4 GB
4       Alibaba Qwen 3 8B (vision)           ~6.5 GB
5       Google Gemma 3 4B (vision)           ~5.7 GB
6       Google Gemma 3 12B (vision)          ~11 GB
7       Mistral Ministral 3 3B (vision)      ~3.5 GB
8       Mistral Ministral 3 8B (vision)      ~6.5 GB
9       Mistral Ministral 3 14B (vision)     ~12 GB
other   Custom model URI (GGUF / LMK, etc.)  depends on model

Any input other than 0-9 is treated as a custom model URI and passed directly to the LM constructor.
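
One way to express that resolution (a sketch; the array order mirrors the menu, using the predefined model IDs from the next section):

string[] modelIds =
{
    "lightonocr1025:1b", "minicpm-o",
    "qwen3-vl:2b", "qwen3-vl:4b", "qwen3-vl:8b",
    "gemma3:4b", "gemma3:12b",
    "ministral3:3b", "ministral3:8b", "ministral3:14b"
};

string input = Console.ReadLine()?.Trim() ?? "";

// 0-9 selects a predefined model card; anything else is treated as a custom URI
string modelLink = int.TryParse(input, out int choice) && choice >= 0 && choice < modelIds.Length
    ? ModelCard.GetPredefinedModelCardByModelID(modelIds[choice]).ModelUri.ToString()
    : input;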


🧠 Supported Models

The sample is pre-wired to LM-Kit’s predefined model cards:

  • lightonocr1025:1b
  • minicpm-o
  • qwen3-vl:2b
  • qwen3-vl:4b
  • qwen3-vl:8b
  • gemma3:4b
  • gemma3:12b
  • ministral3:3b
  • ministral3:8b
  • ministral3:14b

Internally:

modelLink = ModelCard
    .GetPredefinedModelCardByModelID("qwen3-vl:4b")
    .ModelUri
    .ToString();

You can also provide any valid model URI manually (including local paths or custom model servers) by typing or pasting it when prompted.


πŸ› οΈ Commands and Flow

Inside the console loop:

  • On startup

    • Select a model (0-9) or paste a custom model URI.
    • The model is downloaded (if needed) and loaded with progress reporting.
  • Per document (image or PDF)

    • The app prompts: enter file path (image or PDF) (or 'q' to quit):

    • Type a file path and press Enter.

    • The app loads it into an Attachment.

    • The app iterates pages:

      • For images, this is typically 1 page.
      • For PDFs, this can be N pages.
    • For each page, OCR runs and prints:

      • The recognized text or Markdown
      • A Stats section
  • Quit

    • At any prompt, typing q exits the app cleanly.

πŸ—£οΈ Example Use Cases

Try the sample with:

  • A scanned invoice image -> extract all text before sending it to your backend.
  • A PDF report (multi-page) -> convert page-by-page to Markdown.
  • A screenshot of a web page -> capture titles and paragraph content.
  • A photo of a document from a phone -> sanity-check OCR quality and speed.
  • A code screenshot -> pull code into a text editor for quick edits.
  • A multi-language flyer -> see how the model handles different languages.

After each run, compare:

  • Quality score - does the text look correct vs. the page?
  • Token usage and speed - does a bigger model give better quality at acceptable latency?

βš™οΈ Behavior and Policies (quick reference)

  • Model selection: exactly one model per process. To change models, restart the app.

  • Download and load:

    • ModelDownloadingProgress prints Downloading model XX.XX% or byte counts.
    • ModelLoadingProgress prints Loading model XX% and clears the console once done.
  • OCR engine:

    • VlmOcr runs OCR with the selected vision model.
    • result.PageElement.Text is the recognized text for the page.
  • Multi-page processing:

    • Attachment.PageCount is used to iterate over pages.
    • OCR is executed per page using ocr.Run(attachment, pageIndex).
  • Generation stats (a printing sketch follows this list):

    • result.TextGeneration.GeneratedTokens.Count
    • result.TextGeneration.TerminationReason
    • result.TextGeneration.QualityScore
    • result.TextGeneration.TokenGenerationRate
    • result.TextGeneration.ContextTokens.Count / result.TextGeneration.ContextSize
  • Licensing:

    • You can set an optional license key via LicenseManager.SetLicenseKey("").
    • A free community license is available from the LM-Kit website.
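
A sketch of printing that stats block from an OCR result, using the TextGeneration properties listed above:

var gen = result.TextGeneration;

Console.WriteLine($"Tokens   : {gen.GeneratedTokens.Count}");
Console.WriteLine($"Stop     : {gen.TerminationReason}");
Console.WriteLine($"Quality  : {gen.QualityScore}");
Console.WriteLine($"Speed    : {gen.TokenGenerationRate} tok/s");
Console.WriteLine($"Context  : {gen.ContextTokens.Count} / {gen.ContextSize}");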

πŸ’» Minimal Integration Snippet

using System;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

public class VisionOcrSample
{
    public void RunOcr(string modelUri, string filePath)
    {
        // Load the vision model (returning true from the progress
        // callbacks continues the download/load; returning false cancels)
        var lm = new LM(
            new Uri(modelUri),
            downloadingProgress: (path, contentLength, bytesRead) => true,
            loadingProgress: progress => true);

        // Create OCR engine
        var ocr = new VlmOcr(lm);

        // Wrap the file (image or PDF) as an Attachment
        var attachment = new Attachment(filePath);

        // Run OCR page-by-page (PDFs can be multi-page; images are usually 1 page)
        for (int pageIndex = 0; pageIndex < attachment.PageCount; pageIndex++)
        {
            var result = ocr.Run(attachment, pageIndex);

            // Extracted text / Markdown
            Console.WriteLine(result.PageElement.Text);

            // Optional: generation stats
            Console.WriteLine($"Tokens   : {result.TextGeneration.GeneratedTokens.Count}");
            Console.WriteLine($"Quality  : {result.TextGeneration.QualityScore}");
            Console.WriteLine($"Speed    : {result.TextGeneration.TokenGenerationRate} tok/s");
        }
    }
}

Use this pattern to integrate OCR into web APIs, background workers, or desktop apps.
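
For example, a minimal ASP.NET Core endpoint can reuse the same pattern. This is a sketch with a hypothetical /ocr route and a placeholder model URI; it assumes the progress callbacks are optional and creates the model and VlmOcr once per process, in line with the one-model-per-process policy above.

using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Load once per process (serialize access if the engine is not thread-safe)
var lm = new LM(new Uri("https://example.org/path/to/model.gguf"));
var ocr = new VlmOcr(lm);

// Accepts a server-local file path and returns the text of all pages
app.MapPost("/ocr", (string path) =>
{
    var attachment = new Attachment(path);
    var pages = new List<string>();

    for (int pageIndex = 0; pageIndex < attachment.PageCount; pageIndex++)
    {
        pages.Add(ocr.Run(attachment, pageIndex).PageElement.Text);
    }

    return Results.Text(string.Join("\n\n", pages), "text/markdown");
});

app.Run();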


πŸ› οΈ Getting Started

πŸ“‹ Prerequisites

  • .NET Framework 4.6.2 or .NET 8.0+

πŸ“₯ Download

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/document_to_markdown

Project Link: document_to_markdown (same path as above)

▢️ Run

dotnet build
dotnet run

Then:

  1. Select a vision model by typing 0-9, or paste a custom model URI.
  2. Wait for the model to download (first run) and load.
  3. When prompted, type the path to an image or PDF file (or q to quit).
  4. Inspect the recognized text and Stats block (per page).
  5. Press Enter to process another file, or q to exit.

πŸ” Notes on Key Types

  • LM (LMKit.Model) - generic model wrapper used by LM-Kit.NET:

    • Accepts a Uri pointing to the model.
    • Uses callbacks for download and load progress.
  • VlmOcr (LMKit.Extraction.Ocr) - OCR engine built on top of a vision model:

    • Run(Attachment, pageIndex) -> returns an OCR result with PageElement and TextGeneration.
  • Attachment (LMKit.Data) - wraps external data (here: image files and PDFs):

    • new Attachment(string path) loads a file from disk.
    • PageCount exposes the number of pages (images are typically 1; PDFs can be many).
    • Exceptions are thrown when the path is invalid or inaccessible.
  • TextGeneration - metadata about the underlying generative pass:

    • GeneratedTokens, TerminationReason, QualityScore, TokenGenerationRate, ContextTokens, ContextSize.

πŸ”§ Extend the Demo

  • Write output to disk (--out output.md) instead of only printing to console.

  • Add page selection for PDFs (--pages 1,3-5).

  • Add batch mode: process a directory of files.

  • Post-process PageElement.Text (a starter sketch follows this list) to:

    • normalize whitespace,
    • detect sections (headers, paragraphs),
    • or convert into your own document format.
  • Combine with LM-Kit’s Structured Extraction to go from document -> markdown -> structured data in one flow.
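
As a starting point for the post-processing item above, here is a small whitespace normalizer (a sketch using only standard .NET; NormalizeWhitespace is an illustrative helper, not part of LM-Kit):

using System.Text.RegularExpressions;

static string NormalizeWhitespace(string text)
{
    // Unify line endings, collapse runs of spaces/tabs,
    // strip trailing spaces, and squeeze 3+ blank lines down to one
    string s = text.Replace("\r\n", "\n");
    s = Regex.Replace(s, @"[ \t]+", " ");
    s = Regex.Replace(s, @" +\n", "\n");
    return Regex.Replace(s, @"\n{3,}", "\n\n").Trim();
}

// Usage, e.g. combined with the --out idea above:
// File.WriteAllText("output.md", NormalizeWhitespace(result.PageElement.Text));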