👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/vlm_ocr_with_coordinates
VLM OCR with Text Coordinates for C# .NET Applications
🎯 Purpose of the Demo
VLM OCR with Coordinates demonstrates how to use LM-Kit.NET with vision-language models to detect text regions with bounding boxes in images and documents, then draw the detected regions onto annotated output images.
The sample shows how to:
- Download and load a vision model that supports coordinate output (PaddleOCR VL).
- Wrap it with LM-Kit's `VlmOcr` engine using `VlmOcrIntent.OcrWithCoordinates`.
- Feed images or multi-page PDFs as `Attachment` objects.
- Iterate over the returned `TextElement` instances with pixel-accurate bounding boxes.
- Draw red bounding boxes onto the original image using the `Canvas` drawing API.
- Save the annotated image as a PNG file.
Why VLM OCR with Coordinates?
- Location-aware extraction: know where each text region sits on the page, not just what it says.
- Automatic coordinate translation: LM-Kit.NET maps the model's normalized grid tokens back to source image pixels through the preprocessing transform chain.
- Visual verification: draw bounding boxes to validate detection quality before integrating into production pipelines.
- Document format support: the same code path handles single images and multi-page PDFs.
👥 Target Audience
- Document Processing: workflows that need spatial information (form field extraction, zone-based reading order, region classification).
- Quality Assurance: visually verify OCR accuracy by overlaying bounding boxes on the source image.
- Data Labeling: generate bounding-box annotations for training or review without manual annotation tools.
- RPA and Back-office: locate specific fields on invoices, receipts, or forms by position.
🚀 Problem Solved
- Know where text lives on the page: plain text OCR loses spatial information. This demo preserves region positions.
- Visual debugging: immediately see which regions the model detected by looking at the annotated image.
- Multi-page documents: process each page of a PDF individually, with per-page annotated output.
- Model flexibility: the model selection menu is structured for future expansion as more engines add coordinate support.
💻 Sample Application Description
Console app that:
Lets you choose a vision model that supports coordinate output (PaddleOCR VL is the current default) or paste a custom model URI.
Downloads the model if needed, with live progress updates.
Repeatedly asks you for a file path (image or PDF), then:
- Creates a `VlmOcr` instance with `VlmOcrIntent.OcrWithCoordinates`.
- Loads the file as an `Attachment`.
- Runs OCR page-by-page via `ocr.Run(attachment, pageIndex)` (sketched just after this list).
- For each page, prints every detected text region with its text, position, and size.
- Draws red bounding boxes on the image using `Canvas` and `Pen`.
- Saves the annotated image next to the original file (e.g. `document_annotated.png`).
Displays a stats block (elapsed time, tokens, quality, speed, context usage).
Loops until you type `q` to quit.
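A minimal sketch of that per-page loop follows, with model download, PDF page rendering, annotation, and error handling omitted; the zero-based page indexing and console format are illustrative rather than the sample's exact output:

```csharp
// Minimal sketch of the per-page OCR loop; page indexing shown zero-based for illustration.
using LM model = LM.LoadFromModelID("paddleocr-vl:0.9b");
var ocr = new VlmOcr(model, VlmOcrIntent.OcrWithCoordinates);
var attachment = new Attachment("contract.pdf");

for (int pageIndex = 0; pageIndex < attachment.PageCount; pageIndex++)
{
    VlmOcr.VlmOcrResult result = ocr.Run(attachment, pageIndex);

    int regionCount = 0;
    foreach (TextElement element in result.PageElement.TextElements)
    {
        Console.WriteLine($"[{regionCount}] \"{element.Text}\"  " +
                          $"Position: ({element.Left:F0}, {element.Top:F0})  " +
                          $"Size: {element.Width:F0} x {element.Height:F0}");
        regionCount++;
    }

    Console.WriteLine($"page {pageIndex + 1}: {regionCount} region(s) detected");
}
```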
✨ Key Features
📍 Coordinate extraction: each text region includes `Left`, `Top`, `Width`, `Height` in source image pixels.
🖼️ Annotated output: bounding boxes drawn directly on the image with the `Canvas` API.
📄 Image + PDF support: images are loaded directly; PDF pages are rendered before annotation.
📑 Multi-page aware: each page of a multi-page document gets its own annotated image (e.g. `contract_page1_annotated.png`).
📊 Telemetry:
- Elapsed time (seconds)
- Generated tokens count
- Stop reason
- Quality score
- Token generation rate
- Context tokens vs context size
📦 Model lifecycle:
- Automatic download on first use.
- Loading progress shown in the console.
❌ Nice errors: friendly message when a file path is invalid or the annotated image cannot be saved.
🧰 Built-In Models (menu)
On startup, the sample shows a model selection menu:
| Option | Model | Approx. VRAM Needed |
|---|---|---|
| 0 | PaddlePaddle PaddleOCR VL 1.5 0.9B | ~1 GB VRAM |
| other | Custom model URI (GGUF / LMK, etc.) | depends on model |
Only models that support bounding-box coordinate output are listed. The menu will grow as more engines add this capability. Any input other than `0` is treated as a custom model URI and passed directly to the `LM` constructor.
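For illustration, loading option 0 versus a custom URI looks roughly like this; the custom URI is a placeholder, and the exact `LM` constructor overload may differ from what the sample uses:

```csharp
// Option 0: built-in PaddleOCR VL resolved by model ID.
using LM builtIn = LM.LoadFromModelID("paddleocr-vl:0.9b");

// Any other input: treated as a model URI and handed to the LM constructor.
// Placeholder URI for illustration only.
using LM custom = new LM(new Uri("https://example.com/models/custom-vision-model.gguf"));
```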
🧠 How Coordinate Translation Works
PaddleOCR VL emits eight <|LOC_nnn|> tokens per text region (four corners, each with an X and Y coordinate on a normalized 0..999 grid). LM-Kit.NET translates these through two steps:
- LOC grid to processed image pixels. The LOC values are denormalized against the content dimensions of the image that was actually fed to the model.
- Processed image pixels to source image pixels. The preprocessing transform (crop, scale) is reversed so the final coordinates match the user's original image.
This happens automatically inside VlmOcr when the intent is OcrWithCoordinates. The result is a PageElement populated with TextElement instances whose Left, Top, Width, and Height are expressed in source image pixels.
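The two steps can be pictured with a small conceptual helper. The scale and crop parameters below are illustrative stand-ins for LM-Kit.NET's internal preprocessing transform chain, not a public API:

```csharp
// Conceptual illustration only; LM-Kit.NET performs this translation internally.
static (double X, double Y) LocToSourcePixels(
    int locX, int locY,                            // <|LOC_nnn|> values on the 0..999 grid
    double contentWidth, double contentHeight,     // content dimensions of the image fed to the model
    double scale, double cropLeft, double cropTop) // preprocessing transform applied before inference
{
    // Step 1: LOC grid -> processed image pixels.
    double processedX = locX / 999.0 * contentWidth;
    double processedY = locY / 999.0 * contentHeight;

    // Step 2: processed image pixels -> source image pixels (undo scale, then crop offset).
    return (processedX / scale + cropLeft, processedY / scale + cropTop);
}
```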
🛠️ Commands and Flow
Inside the console loop:
On startup
- Select a model (0) or paste a custom model URI.
- The model is downloaded (if needed) and loaded with progress reporting.
Per document (image or PDF)
The app prompts:
`enter image or document path (or 'q' to quit):`
Type a file path and press Enter.
The app loads the file into an `Attachment`.
The app iterates pages:
- For images, this is typically 1 page.
- For PDFs, this can be N pages.
For each page, OCR runs and prints:
- Each detected text region with `[index] "text"` and `Position: (x, y) Size: w x h`
- The total number of detected regions
- The path to the annotated output image
- A Stats section
Quit
- At any prompt, typing `q` exits the app cleanly.
🗣️ Example Use Cases
Try the sample with:
- A scanned invoice to locate line items, totals, and dates by position.
- A multi-page PDF contract to find where signature blocks and clause headings appear.
- A phone-captured photo of a receipt to verify which text regions the model detects.
- A form or ID card to identify field positions for downstream extraction.
After each run, inspect:
- The annotated image to verify bounding-box accuracy.
- The console output for exact pixel coordinates of each region.
- The quality score and token count to assess detection completeness.
💻 Minimal Integration Snippet
```csharp
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Extraction.Ocr;
using LMKit.Graphics.Drawing;
using LMKit.Graphics.Geometry;
using LMKit.Graphics.Primitives;
using LMKit.Media.Image;
using LMKit.Model;

// Load PaddleOCR VL model
using LM model = LM.LoadFromModelID("paddleocr-vl:0.9b");

// Create OCR engine with coordinate extraction
var ocr = new VlmOcr(model, VlmOcrIntent.OcrWithCoordinates)
{
    MaximumCompletionTokens = 4096
};

// Run OCR
var attachment = new Attachment("document.png");
VlmOcr.VlmOcrResult result = ocr.Run(attachment);

// Iterate text regions
foreach (TextElement element in result.PageElement.TextElements)
{
    Console.WriteLine($"\"{element.Text}\" at ({element.Left:F1}, {element.Top:F1}) " +
                      $"size {element.Width:F1} x {element.Height:F1}");
}

// Draw bounding boxes on the image
using ImageBuffer image = ImageBuffer.LoadAsRGB("document.png");
var canvas = new Canvas(image) { Antialiasing = true };
var pen = new Pen(new Color32(255, 0, 0), 2) { LineJoin = LineJoin.Miter };

foreach (TextElement element in result.PageElement.TextElements)
{
    canvas.DrawRectangle(
        Rectangle.FromSize(element.Left, element.Top, element.Width, element.Height),
        pen);
}

image.SaveAsPng("document_annotated.png");
```
🛠️ Getting Started
📋 Prerequisites
- .NET 8.0 or later
📥 Download
```bash
git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/vlm_ocr_with_coordinates
```
Project Link: vlm_ocr_with_coordinates (same path as above)
▶️ Run
```bash
dotnet build
dotnet run
```
Then:
- Select a vision model by typing 0, or paste a custom model URI.
- Wait for the model to download (first run) and load.
- When prompted, type the path to an image or document file (or `q` to quit).
- Inspect the detected text regions with coordinates in the console.
- Open the annotated image saved next to the original file.
- Press Enter to process another file, or `q` to exit.
🔍 Notes on Key Types
`VlmOcr` (LMKit.Extraction.Ocr): OCR engine built on top of a vision model.
- Construct with `new VlmOcr(model, VlmOcrIntent.OcrWithCoordinates)` to enable coordinate extraction.
- `Run(Attachment, pageIndex)` returns a result with `PageElement` containing `TextElement` instances.

`TextElement` (LMKit.Document.Layout): a recognized text region with spatial information.
- `Text`: the recognized string content.
- `Left`, `Top`: top-left corner position in source image pixels.
- `Width`, `Height`: bounding box dimensions in source image pixels.

`Canvas` (LMKit.Graphics.Drawing): a fluent drawing surface that wraps an `ImageBuffer`.
- `DrawRectangle(Rectangle, Pen)` renders an outline on the underlying image.
- All drawing is immediate and modifies the image in place.

`Pen` (LMKit.Graphics.Drawing): defines stroke color, width, and line join style.

`Attachment` (LMKit.Data): wraps external data (images or documents).
- `PageCount` exposes the number of pages (images are typically 1; PDFs can be many).
- `RenderPage(pageIndex, format)` renders a specific page of a multi-page document to an `ImageBuffer`.
🔧 Extend the Demo
- Change the box color or thickness by modifying the `Pen` constructor (e.g. green boxes with 4 px stroke).
- Add a semi-transparent fill behind each text region for better visibility.
- Write detected regions to a JSON file for downstream processing (see the sketch after this list).
- Filter regions by position or size to focus on specific areas of the document.
- Combine with LM-Kit's Structured Extraction to extract field values from specific regions.
- Add page selection for PDFs (`--pages 1,3-5`) to process only specific pages.
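As a starting point for the JSON export idea, here is a minimal sketch using System.Text.Json; it assumes the `result` variable from the integration snippet above, and the anonymous shape and output file name are illustrative:

```csharp
using System.Text.Json;

// `result` is the VlmOcr.VlmOcrResult produced in the integration snippet above.
var regions = new List<object>();

foreach (TextElement element in result.PageElement.TextElements)
{
    // Anonymous shape for brevity; swap in a proper DTO for production use.
    regions.Add(new { element.Text, element.Left, element.Top, element.Width, element.Height });
}

// Write the detected regions next to the original file for downstream processing.
File.WriteAllText("document_regions.json",
    JsonSerializer.Serialize(regions, new JsonSerializerOptions { WriteIndented = true }));
```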
📚 Additional Resources
- VlmOcr API Reference
- TextElement API Reference
- Canvas API Reference
- How-To: Locate Text Regions with Bounding Boxes Using VLM OCR
- VLM OCR Demo: general-purpose OCR demo with all intents