Image-to-Markdown Vision OCR in .NET Applications


🎯 Purpose of the Sample

Image-to-Markdown Vision OCR demonstrates how to use LM-Kit.NET with vision-capable models to run on-device OCR on images (documents, screenshots, receipts, etc.) and convert them into clean Markdown-style text in an interactive loop.

The sample shows how to:

  • Download and load a vision model with progress callbacks.
  • Wrap it with LM-Kit's VlmOcr engine.
  • Feed images as Attachment objects.
  • Retrieve recognized text plus generation statistics (tokens, speed, quality, context usage).

Why Vision OCR with LM-Kit.NET?

  • Local-first: run OCR on your own hardware for privacy-sensitive workloads.
  • Unified API: same model abstraction (LM) for text and vision pipelines.
  • Rich telemetry: quality score, token usage, and performance metrics per image.
  • Drop-in: replace existing OCR engines with minimal changes to your data flow.

👥 Target Audience

  • Product & Platform – add OCR to existing .NET backends or pipelines.
  • Data & Document Processing – bulk ingest of PDFs, scans, screenshots, etc.
  • RPA / Back-office – extract text from forms, invoices, tickets, and reports.
  • Demo & Education – minimal, readable example of vision + OCR in C#.

🚀 Problem Solved

  • Turn images into text: extract readable text from screenshots, scans, or photos.
  • Model flexibility: select a model based on your available VRAM and latency needs.
  • Operational visibility: built-in stats on speed, context usage, and quality.
  • Repeatable loop: process one image after another in a single console session.

💻 Sample Application Description

Console app that:

  • Lets you choose a vision model (or paste a custom model URI).
  • Downloads the model if needed, with live progress updates.
  • Wraps it in a VlmOcr instance.
  • Repeatedly asks you for an image path, then:
    • Loads the file as an Attachment.
    • Runs OCR via ocr.Run(attachment).
    • Prints the extracted text to the console.
  • Displays a stats block (elapsed time, tokens, quality, speed, context usage).
  • Loops until you type q to quit.

✨ Key Features

  • 🧠 Vision-based OCR: uses a multimodal model behind VlmOcr.
  • 📥 Interactive loop: enter image path → get text → see metrics → repeat.
  • 📊 Telemetry:
    • Elapsed time (seconds)
    • Generated tokens count
    • Stop reason
    • Quality score
    • Token generation rate
    • Context tokens vs context size
  • 📦 Model lifecycle:
    • Automatic download on first use.
    • Loading progress shown in the console.
  • ❌ Friendly errors: clear message when an image path is invalid or inaccessible.

🧰 Built-In Models (menu)

On startup, the sample shows a model selection menu:

  Option   Model                             Approx. VRAM Needed
  0        LightOn LightOnOCR 1025 1B        ~2 GB
  1        MiniCPM 2.6 o 8.1B                ~5.9 GB
  2        Alibaba Qwen 3 2B (vision)        ~2.5 GB
  3        Alibaba Qwen 3 4B (vision)        ~4 GB
  4        Alibaba Qwen 3 8B (vision)        ~6.5 GB
  5        Google Gemma 3 4B (vision)        ~5.7 GB
  6        Google Gemma 3 12B (vision)       ~11 GB
  7        Mistral Ministral 3 3B (vision)   ~3.5 GB
  8        Mistral Ministral 3 8B (vision)   ~6.5 GB
  9        Mistral Ministral 3 14B (vision)  ~12 GB
  other    Custom model URI (GGUF / LMK)     depends on model

Any input other than 0–9 is treated as a custom model URI and passed directly to the LM constructor.
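As a sketch, the menu-to-URI mapping might look like this (the model IDs come from the Supported Models list below; the exact menu code in the sample may differ, and `ResolveModelUri` is a hypothetical helper name):

```csharp
using LMKit.Model;

// Hypothetical helper: map a menu choice to a model URI.
// Any input outside 0-9 is passed through as a custom model URI,
// exactly as the sample's behavior is described above.
static string ResolveModelUri(string input)
{
    string[] ids =
    {
        "lightonocr1025:1b", "minicpm-o",
        "qwen3-vl:2b", "qwen3-vl:4b", "qwen3-vl:8b",
        "gemma3:4b", "gemma3:12b",
        "ministral3:3b", "ministral3:8b", "ministral3:14b"
    };

    if (int.TryParse(input, out int choice) && choice >= 0 && choice < ids.Length)
    {
        // Resolve a predefined model card to its download URI.
        return ModelCard.GetPredefinedModelCardByModelID(ids[choice]).ModelUri.ToString();
    }

    // Anything else is treated as a custom model URI.
    return input;
}
```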


🧠 Supported Models

The sample is pre-wired to LM-Kit's predefined model cards:

  • lightonocr1025:1b
  • minicpm-o
  • qwen3-vl:2b
  • qwen3-vl:4b
  • qwen3-vl:8b
  • gemma3:4b
  • gemma3:12b
  • ministral3:3b
  • ministral3:8b
  • ministral3:14b

Internally:

modelLink = ModelCard
    .GetPredefinedModelCardByModelID("qwen3-vl:4b")
    .ModelUri
    .ToString();

You can also provide any valid model URI manually (including local paths or custom model servers) by typing/pasting it when prompted.


πŸ› οΈ Commands & Flow

Inside the console loop:

  • On startup

    • Select a model (0–9) or paste a custom model URI.
    • The model is downloaded (if needed) and loaded with progress reporting.
  • Per image

    • The app prompts: enter image path (or 'q' to quit):
    • Type a file path and press Enter.
    • The app loads it into an Attachment and runs OCR.
    • Text is printed, followed by a Stats section.
    • Then:
      • Press Enter to process another image, or
      • Type q to exit.
  • Quit

    • At any image prompt or "process another image" prompt, q exits the app cleanly.
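The per-image loop above can be sketched as follows (a minimal outline; the prompt wording matches the description, but error handling in the actual sample may differ):

```csharp
using System;
using LMKit.Data;
using LMKit.Extraction.Ocr;

// Assumes 'ocr' is an already-initialized VlmOcr instance.
static void OcrLoop(VlmOcr ocr)
{
    while (true)
    {
        Console.Write("enter image path (or 'q' to quit): ");
        string input = Console.ReadLine()?.Trim() ?? "";

        if (input.Length == 0 || input.Equals("q", StringComparison.OrdinalIgnoreCase))
        {
            break; // clean exit, as described above
        }

        try
        {
            // Load the file and run OCR on it.
            var attachment = new Attachment(input);
            var result = ocr.Run(attachment);
            Console.WriteLine(result.PageElement.Text);
        }
        catch (Exception ex)
        {
            // Invalid or inaccessible paths surface here.
            Console.WriteLine($"Error: {ex.Message}");
        }
    }
}
```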

πŸ—£οΈ Example Use Cases

Try the sample with:

  • A scanned invoice → extract all text before sending it to your backend.
  • A screenshot of a web page → capture titles and paragraph content.
  • A photo of a document from a phone → sanity-check OCR quality & speed.
  • A code screenshot → pull code into a text editor for quick edits.
  • A multi-language flyer β†’ see how the model handles different languages.

After each run, compare:

  • Quality score – does the text look correct vs. the image?
  • Token usage & speed – does a bigger model give better quality at acceptable latency?

βš™οΈ Behavior & Policies (quick reference)

  • Model selection: exactly one model per process. To change models, restart the app.
  • Download & load:
    • ModelDownloadingProgress prints Downloading model XX.XX% or byte counts.
    • ModelLoadingProgress prints Loading model XX% and clears the console once done.
  • OCR engine:
    • VlmOcr runs OCR with the selected vision model.
    • result.PageElement.Text is the recognized text for the page.
  • Generation stats:
    • result.TextGeneration.GeneratedTokens.Count
    • result.TextGeneration.TerminationReason
    • result.TextGeneration.QualityScore
    • result.TextGeneration.TokenGenerationRate
    • result.TextGeneration.ContextTokens.Count / result.TextGeneration.ContextSize
  • Licensing:
    • You can set an optional license key via LicenseManager.SetLicenseKey("").
    • A free community license is available from the LM-Kit website.
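The download/load progress behavior above can be wired up roughly like this (callback shapes follow the minimal snippet later in this README; exact parameter types may vary between LM-Kit versions, so treat this as a sketch):

```csharp
using System;
using LMKit.Model;

// Hedged sketch: progress callbacks that mirror the console output
// described above ("Downloading model XX.XX%" / "Loading model XX%").
var lm = new LM(
    new Uri(modelUri),
    downloadingProgress: (path, contentLength, bytesRead) =>
    {
        // Assumption: contentLength may be unknown for some servers.
        if (contentLength > 0)
        {
            Console.Write($"\rDownloading model {100.0 * bytesRead / contentLength:0.00}%");
        }
        else
        {
            Console.Write($"\rDownloading model {bytesRead} bytes");
        }
        return true; // return false to cancel the download
    },
    loadingProgress: progress =>
    {
        Console.Write($"\rLoading model {(int)(progress * 100)}%");
        return true;
    });
```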

💻 Minimal Integration Snippet

using System;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

public class VisionOcrSample
{
    public void RunOcr(string modelUri, string imagePath)
    {
        // Load the vision model
        var lm = new LM(
            new Uri(modelUri),
            downloadingProgress: (path, contentLength, bytesRead) => true,
            loadingProgress: progress => true);

        // Create OCR engine
        var ocr = new VlmOcr(lm);

        // Wrap the image as an Attachment
        var attachment = new Attachment(imagePath);

        // Run OCR
        var result = ocr.Run(attachment);

        // Extracted text
        Console.WriteLine(result.PageElement.Text);

        // Optional: generation stats
        Console.WriteLine($"Tokens   : {result.TextGeneration.GeneratedTokens.Count}");
        Console.WriteLine($"Quality  : {result.TextGeneration.QualityScore}");
        Console.WriteLine($"Speed    : {result.TextGeneration.TokenGenerationRate} tok/s");
    }
}

Use this pattern to integrate OCR into web APIs, background workers, or desktop apps.


πŸ› οΈ Getting Started

📋 Prerequisites

  • .NET Framework 4.6.2 or .NET 6.0

📥 Download

git clone https://github.com/LM-Kit/lm-kit-net-samples.git
cd lm-kit-net-samples/console_net/image_to_markdown

Project Link: image_to_markdown (same path as above)

▶️ Run

dotnet build
dotnet run

Then:

  1. Select a vision model by typing 0–9, or paste a custom model URI.
  2. Wait for the model to download (first run) and load.
  3. When prompted, type the path to an image file (or q to quit).
  4. Inspect the recognized text and Stats block.
  5. Press Enter to process another image, or q to exit.

πŸ” Notes on Key Types

  • LM (LMKit.Model) – generic model wrapper used by LM-Kit.NET:

    • Accepts a Uri pointing to the model.
    • Uses callbacks for download and load progress.
  • VlmOcr (LMKit.Extraction.Ocr) – OCR engine built on top of a vision model:

    • Run(Attachment) → returns an OCR result with PageElement and TextGeneration.
  • Attachment (LMKit.Data) – wraps external data (here: image files):

    • new Attachment(string path) loads an image from disk.
    • Exceptions are raised when the path is invalid or inaccessible.
  • TextGeneration – metadata about the underlying generative pass:

    • GeneratedTokens, TerminationReason, QualityScore, TokenGenerationRate, ContextTokens, ContextSize.
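Putting the TextGeneration fields together, the stats block printed after each image might look like this (property names are taken from the list above; the Stopwatch-based elapsed-time measurement is our own addition, not necessarily how the sample measures it):

```csharp
using System;
using System.Diagnostics;

// 'ocr' is a VlmOcr instance; 'attachment' wraps the image file.
var stopwatch = Stopwatch.StartNew();
var result = ocr.Run(attachment);
stopwatch.Stop();

// Print a stats block from the generation metadata.
var gen = result.TextGeneration;
Console.WriteLine("-- Stats --");
Console.WriteLine($"Elapsed  : {stopwatch.Elapsed.TotalSeconds:0.00} s");
Console.WriteLine($"Tokens   : {gen.GeneratedTokens.Count}");
Console.WriteLine($"Stop     : {gen.TerminationReason}");
Console.WriteLine($"Quality  : {gen.QualityScore}");
Console.WriteLine($"Speed    : {gen.TokenGenerationRate} tok/s");
Console.WriteLine($"Context  : {gen.ContextTokens.Count}/{gen.ContextSize}");
```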

⚠️ Troubleshooting

  • “Error: Unable to open '…'.”

    • The path is wrong, the file doesn't exist, or permissions are missing.
    • Check the path, fix permissions, then try again.
  • Slow or failing model load

    • Insufficient VRAM/CPU or slow storage/network.
    • Try a smaller model (e.g., LightOnOCR 1B, Qwen 3 2B, Ministral 3B).
  • Out-of-memory or driver errors

    • VRAM not sufficient for the selected model.
    • Pick a model with lower VRAM requirements or upgrade hardware.
  • Poor OCR quality

    • Try a larger or OCR-focused model (e.g., LightOnOCR 1B or higher-capacity vision models).
    • Ensure the image is sharp, not heavily compressed, and roughly upright.

🔧 Extend the Demo

  • Use VlmOcr in a web API to provide OCR as a service.
  • Pipe the extracted text into:
    • RAG pipelines,
    • downstream NLP (classification, sentiment, extraction),
    • or your own business logic.
  • Add batch processing (multiple images per run) or directory watchers.
  • Post-process PageElement.Text to:
    • normalize whitespace,
    • detect sections (headers, paragraphs),
    • or convert into your own document format.
  • Combine with LM-Kit's Text Analysis or Structured Extraction to go from
    image β†’ text β†’ structured data in one flow.
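As a starting point for the whitespace-normalization idea above, a small sketch using only plain .NET (no LM-Kit types involved; `NormalizeOcrText` is a hypothetical helper name):

```csharp
using System;
using System.Text.RegularExpressions;

// Collapse runs of spaces/tabs within lines and trim each line,
// while keeping the line/paragraph structure of the OCR output.
static string NormalizeOcrText(string text)
{
    // Collapse horizontal whitespace inside each line.
    string collapsed = Regex.Replace(text, @"[ \t]+", " ");

    // Trim each line, then drop leading/trailing blank lines.
    var lines = collapsed.Replace("\r\n", "\n").Split('\n');
    for (int i = 0; i < lines.Length; i++)
    {
        lines[i] = lines[i].Trim();
    }
    return string.Join("\n", lines).Trim();
}
```

Apply it to `result.PageElement.Text` before feeding the output into downstream NLP or RAG steps.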