# Image-to-Markdown Vision OCR in .NET Applications
## Purpose of the Sample

Image-to-Markdown Vision OCR demonstrates how to use LM-Kit.NET with vision-capable models to run on-device OCR on images (documents, screenshots, receipts, etc.) and convert them into clean text or Markdown-style text in an interactive loop.
The sample shows how to:

- Download and load a vision model with progress callbacks.
- Wrap it with LM-Kit's `VlmOcr` engine.
- Feed images as `Attachment` objects.
- Retrieve recognized text plus generation statistics (tokens, speed, quality, context usage).
### Why Vision OCR with LM-Kit.NET?

- Local-first: run OCR on your own hardware for privacy-sensitive workloads.
- Unified API: the same model abstraction (`LM`) serves both text and vision pipelines.
- Rich telemetry: quality score, token usage, and performance metrics per image.
- Drop-in: replace an existing OCR engine with minimal changes to your data flow.
## Target Audience

- Product & Platform – add OCR to existing .NET backends or pipelines.
- Data & Document Processing – bulk ingestion of PDFs, scans, screenshots, etc.
- RPA / Back-office – extract text from forms, invoices, tickets, and reports.
- Demo & Education – a minimal, readable example of vision + OCR in C#.
## Problem Solved
- Turn images into text: extract readable text from screenshots, scans, or photos.
- Model flexibility: select a model based on your available VRAM and latency needs.
- Operational visibility: built-in stats on speed, context usage, and quality.
- Repeatable loop: process one image after another in a single console session.
## Sample Application Description

A console app that:

- Lets you choose a vision model (or paste a custom model URI).
- Downloads the model if needed, with live progress updates.
- Wraps it in a `VlmOcr` instance.
- Repeatedly asks you for an image path, then:
  - Loads the file as an `Attachment`.
  - Runs OCR via `ocr.Run(attachment)`.
  - Prints the extracted text to the console.
  - Displays a stats block (elapsed time, tokens, quality, speed, context usage).
- Loops until you type `q` to quit.
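The loop described above can be sketched roughly as follows. This is a simplified sketch, not the sample's exact source: it assumes the `LM` progress callbacks are optional, and the model URI is an illustrative placeholder.

```csharp
using System;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

class OcrLoop
{
    static void Main()
    {
        // Load a vision model (URI is illustrative; pick one from the menu below).
        var lm = new LM(new Uri("https://example.com/your-vision-model.gguf"));
        var ocr = new VlmOcr(lm);

        while (true)
        {
            Console.Write("enter image path (or 'q' to quit): ");
            string input = Console.ReadLine();

            if (string.Equals(input, "q", StringComparison.OrdinalIgnoreCase))
                break;

            try
            {
                // Wrap the file as an Attachment and run OCR on it.
                var result = ocr.Run(new Attachment(input));
                Console.WriteLine(result.PageElement.Text);
            }
            catch (Exception ex)
            {
                // Invalid or inaccessible paths surface here.
                Console.WriteLine($"Error: {ex.Message}");
            }
        }
    }
}
```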
## Key Features

- Vision-based OCR: uses a multimodal model behind `VlmOcr`.
- Interactive loop: enter image path → get text → see metrics → repeat.
- Telemetry:
  - Elapsed time (seconds)
  - Generated token count
  - Stop reason
  - Quality score
  - Token generation rate
  - Context tokens vs. context size
- Model lifecycle:
  - Automatic download on first use.
  - Loading progress shown in the console.
- Friendly errors: a clear message when an image path is invalid or inaccessible.
## Built-In Models (menu)

On startup, the sample shows a model selection menu:
| Option | Model | Approx. VRAM Needed |
|---|---|---|
| 0 | LightOn LightOnOCR 1025 1B | ~2 GB VRAM |
| 1 | MiniCPM-o 2.6 8.1B | ~5.9 GB VRAM |
| 2 | Alibaba Qwen 3 2B (vision) | ~2.5 GB VRAM |
| 3 | Alibaba Qwen 3 4B (vision) | ~4 GB VRAM |
| 4 | Alibaba Qwen 3 8B (vision) | ~6.5 GB VRAM |
| 5 | Google Gemma 3 4B (vision) | ~5.7 GB VRAM |
| 6 | Google Gemma 3 12B (vision) | ~11 GB VRAM |
| 7 | Mistral Ministral 3 3B (vision) | ~3.5 GB VRAM |
| 8 | Mistral Ministral 3 8B (vision) | ~6.5 GB VRAM |
| 9 | Mistral Ministral 3 14B (vision) | ~12 GB VRAM |
| other | Custom model URI (GGUF / LMK, etc.) | depends on model |
Any input other than `0`–`9` is treated as a custom model URI and passed directly to the `LM` constructor.
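The menu handling can be approximated like this. This is a sketch: the helper name `ResolveModelUri` is our own, and the array of IDs mirrors the predefined model cards listed in the next section; the sample's actual code may differ.

```csharp
using System;
using LMKit.Model;

static class ModelMenu
{
    // Maps a menu choice ("0"–"9") to a predefined model card URI;
    // anything else is treated as a custom model URI.
    public static string ResolveModelUri(string input)
    {
        string[] modelIds =
        {
            "lightonocr1025:1b", "minicpm-o",
            "qwen3-vl:2b", "qwen3-vl:4b", "qwen3-vl:8b",
            "gemma3:4b", "gemma3:12b",
            "ministral3:3b", "ministral3:8b", "ministral3:14b"
        };

        if (int.TryParse(input, out int choice) && choice >= 0 && choice < modelIds.Length)
        {
            // Look up the predefined model card and return its URI.
            return ModelCard
                .GetPredefinedModelCardByModelID(modelIds[choice])
                .ModelUri
                .ToString();
        }

        // Custom URI: passed directly to the LM constructor.
        return input;
    }
}
```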
## Supported Models

The sample is pre-wired to LM-Kit's predefined model cards:

- `lightonocr1025:1b`
- `minicpm-o`
- `qwen3-vl:2b`
- `qwen3-vl:4b`
- `qwen3-vl:8b`
- `gemma3:4b`
- `gemma3:12b`
- `ministral3:3b`
- `ministral3:8b`
- `ministral3:14b`
Internally:

```csharp
modelLink = ModelCard
    .GetPredefinedModelCardByModelID("qwen3-vl:4b")
    .ModelUri
    .ToString();
```
You can also provide any valid model URI manually (including local paths or custom model servers) by typing/pasting it when prompted.
## Commands & Flow

Inside the console loop:

**On startup**

- Select a model (`0`–`9`) or paste a custom model URI.
- The model is downloaded (if needed) and loaded with progress reporting.

**Per image**

- The app prompts: `enter image path (or 'q' to quit):`
- Type a file path and press Enter.
- The app loads it into an `Attachment` and runs OCR.
- Text is printed, followed by a Stats section.
- Then:
  - Press Enter to process another image, or
  - Type `q` to exit.

**Quit**

- At any image prompt or "process another image" prompt, `q` exits the app cleanly.
## Example Use Cases

Try the sample with:

- A scanned invoice → extract all text before sending it to your backend.
- A screenshot of a web page → capture titles and paragraph content.
- A photo of a document from a phone → sanity-check OCR quality and speed.
- A code screenshot → pull code into a text editor for quick edits.
- A multi-language flyer → see how the model handles different languages.

After each run, compare:

- Quality score – does the text look correct compared to the image?
- Token usage & speed – does a bigger model give better quality at acceptable latency?
## Behavior & Policies (quick reference)

- Model selection: exactly one model per process. To change models, restart the app.
- Download & load: `ModelDownloadingProgress` prints `Downloading model XX.XX%` (or byte counts); `ModelLoadingProgress` prints `Loading model XX%` and clears the console once done.
- OCR engine: `VlmOcr` runs OCR with the selected vision model; `result.PageElement.Text` is the recognized text for the page.
- Generation stats:
  - `result.TextGeneration.GeneratedTokens.Count`
  - `result.TextGeneration.TerminationReason`
  - `result.TextGeneration.QualityScore`
  - `result.TextGeneration.TokenGenerationRate`
  - `result.TextGeneration.ContextTokens.Count` / `result.TextGeneration.ContextSize`
- Licensing:
  - An optional license key can be set via `LicenseManager.SetLicenseKey("")`.
  - A free community license is available from the LM-Kit website.
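Progress callbacks that produce console output in this style can be sketched as below. The callback parameter names match the integration snippet in this document, but the parameter types and the output formatting are assumptions, not the sample's exact code.

```csharp
using System;
using LMKit.Model;

static class ModelLoader
{
    // Loads a model with console progress reporting; returning true from
    // each callback continues the operation, false cancels it.
    public static LM LoadWithProgress(string modelUri)
    {
        return new LM(
            new Uri(modelUri),
            downloadingProgress: (path, contentLength, bytesRead) =>
            {
                // Percentage when the total size is known, raw bytes otherwise.
                if (contentLength is long total && total > 0)
                    Console.Write($"\rDownloading model {100.0 * bytesRead / total:0.00}%");
                else
                    Console.Write($"\rDownloading model {bytesRead:N0} bytes");
                return true;
            },
            loadingProgress: progress =>
            {
                Console.Write($"\rLoading model {progress:0}%");
                return true;
            });
    }
}
```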
## Minimal Integration Snippet

```csharp
using System;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Model;

public class VisionOcrSample
{
    public void RunOcr(string modelUri, string imagePath)
    {
        // Load the vision model
        var lm = new LM(
            new Uri(modelUri),
            downloadingProgress: (path, contentLength, bytesRead) => true,
            loadingProgress: progress => true);

        // Create OCR engine
        var ocr = new VlmOcr(lm);

        // Wrap the image as an Attachment
        var attachment = new Attachment(imagePath);

        // Run OCR
        var result = ocr.Run(attachment);

        // Extracted text
        Console.WriteLine(result.PageElement.Text);

        // Optional: generation stats
        Console.WriteLine($"Tokens  : {result.TextGeneration.GeneratedTokens.Count}");
        Console.WriteLine($"Quality : {result.TextGeneration.QualityScore}");
        Console.WriteLine($"Speed   : {result.TextGeneration.TokenGenerationRate} tok/s");
    }
}
```
Use this pattern to integrate OCR into web APIs, background workers, or desktop apps.
## Getting Started

### Prerequisites

- .NET Framework 4.6.2 or .NET 6.0

### Download

```bash
git clone https://github.com/LM-Kit/lm-kit-net-samples.git
cd lm-kit-net-samples/console_net/image_to_markdown
```

Project link: `console_net/image_to_markdown` (same path as above)

### Run

```bash
dotnet build
dotnet run
```
Then:

- Select a vision model by typing `0`–`9`, or paste a custom model URI.
- Wait for the model to download (first run) and load.
- When prompted, type the path to an image file (or `q` to quit).
- Inspect the recognized text and Stats block.
- Press Enter to process another image, or type `q` to exit.
## Notes on Key Types

- `LM` (`LMKit.Model`) – generic model wrapper used by LM-Kit.NET:
  - Accepts a `Uri` pointing to the model.
  - Uses callbacks for download and load progress.
- `VlmOcr` (`LMKit.Extraction.Ocr`) – OCR engine built on top of a vision model:
  - `Run(Attachment)` returns an OCR result with `PageElement` and `TextGeneration`.
- `Attachment` (`LMKit.Data`) – wraps external data (here: image files):
  - `new Attachment(string path)` loads an image from disk.
  - Exceptions are raised when the path is invalid or inaccessible.
- `TextGeneration` – metadata about the underlying generative pass:
  - `GeneratedTokens`, `TerminationReason`, `QualityScore`, `TokenGenerationRate`, `ContextTokens`, `ContextSize`.
## Troubleshooting

**"Error: Unable to open '…'."**

- The path is wrong, the file doesn't exist, or permissions are missing.
- Check the path, fix permissions, then try again.
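To surface this error gracefully in your own integration, wrap attachment creation in a try/catch. A sketch; the helper name `TryLoad` is our own, and the exact exception types thrown by `Attachment` are not specified here:

```csharp
using System;
using LMKit.Data;

static class SafeAttachment
{
    // Returns null instead of throwing when the image path is bad,
    // printing a friendly message like the sample does.
    public static Attachment TryLoad(string path)
    {
        try
        {
            return new Attachment(path);
        }
        catch (Exception ex)
        {
            Console.WriteLine($"Error: Unable to open '{path}'. ({ex.Message})");
            return null;
        }
    }
}
```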
**Slow or failing model load**

- Insufficient VRAM/CPU or slow storage/network.
- Try a smaller model (e.g., LightOnOCR 1B, Qwen 3 2B, Ministral 3B).

**Out-of-memory or driver errors**

- VRAM is insufficient for the selected model.
- Pick a model with lower VRAM requirements or upgrade your hardware.

**Poor OCR quality**

- Try a larger or OCR-focused model (e.g., LightOnOCR 1B or a higher-capacity vision model).
- Ensure the image is sharp, not heavily compressed, and roughly upright.
## Extend the Demo

- Use `VlmOcr` in a web API to provide OCR as a service.
- Pipe the extracted text into:
  - RAG pipelines,
  - downstream NLP (classification, sentiment, extraction),
  - or your own business logic.
- Add batch processing (multiple images per run) or directory watchers.
- Post-process `PageElement.Text` to:
  - normalize whitespace,
  - detect sections (headers, paragraphs),
  - or convert it into your own document format.
- Combine with LM-Kit's Text Analysis or Structured Extraction to go from image → text → structured data in one flow.
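As a starting point for the post-processing ideas above, here is a minimal, self-contained whitespace normalizer (plain .NET, no LM-Kit types; the method name `NormalizeOcrText` is our own):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class OcrPostProcessing
{
    // Collapses runs of spaces/tabs, trims each line, and squeezes
    // three or more consecutive newlines down to a single blank line.
    public static string NormalizeOcrText(string text)
    {
        var lines = text
            .Replace("\r\n", "\n")
            .Split('\n')
            .Select(line => Regex.Replace(line, @"[ \t]+", " ").Trim());

        var joined = string.Join("\n", lines);

        // Keep at most one blank line between paragraphs.
        return Regex.Replace(joined, @"\n{3,}", "\n\n").Trim();
    }
}
```

Usage: `var clean = OcrPostProcessing.NormalizeOcrText(result.PageElement.Text);` before feeding the text to downstream pipelines.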