👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/vlm_ocr_with_coordinates
VLM OCR with Text Coordinates for C# .NET Applications
🎯 Purpose of the Demo
VLM OCR with Coordinates demonstrates how to use LM-Kit.NET with vision-language models to detect text regions with bounding boxes in images and documents, then draw the detected regions onto annotated output images.
The sample shows how to:
- Download and load a vision model that supports coordinate output (PaddleOCR VL).
- Wrap it with LM-Kit's `VlmOcr` engine using `VlmOcrIntent.OcrWithCoordinates`.
- Feed images or multi-page PDFs as `Attachment` objects.
- Iterate over the returned `TextElement` instances with pixel-accurate bounding boxes.
- Draw red bounding boxes onto the original image using the `Canvas` drawing API.
- Save the annotated image as a PNG file.
Why VLM OCR with Coordinates?
- Location-aware extraction: know where each text region sits on the page, not just what it says.
- Automatic coordinate translation: LM-Kit.NET maps the model's normalized grid tokens back to source image pixels through the preprocessing transform chain.
- Visual verification: draw bounding boxes to validate detection quality before integrating into production pipelines.
- Document format support: the same code path handles single images and multi-page PDFs.
👥 Target Audience
- Document Processing: workflows that need spatial information (form field extraction, zone-based reading order, region classification).
- Quality Assurance: visually verify OCR accuracy by overlaying bounding boxes on the source image.
- Data Labeling: generate bounding-box annotations for training or review without manual annotation tools.
- RPA and Back-office: locate specific fields on invoices, receipts, or forms by position.
🚀 Problem Solved
- Know where text lives on the page: plain text OCR loses spatial information. This demo preserves region positions.
- Visual debugging: immediately see which regions the model detected by looking at the annotated image.
- Multi-page documents: process each page of a PDF individually, with per-page annotated output.
- Model flexibility: the model selection menu is structured for future expansion as more engines add coordinate support.
💻 Sample Application Description
Console app that:
Lets you choose a vision model that supports coordinate output (PaddleOCR VL is the current default) or paste a custom model URI.
Downloads the model if needed, with live progress updates.
Repeatedly asks you for a file path (image or PDF), then:
- Creates a `VlmOcr` instance with `VlmOcrIntent.OcrWithCoordinates`.
- Loads the file as an `Attachment`.
- Runs OCR page-by-page via `ocr.Run(attachment, pageIndex)` (sketched just after this list).
- For each page, prints every detected text region with its text, position, and size.
- Draws red bounding boxes on the image using `Canvas` and `Pen`.
- Saves the annotated image next to the original file (e.g. `document_annotated.png`).
Displays a stats block (elapsed time, tokens, quality, speed, context usage).
Loops until you type `q` to quit.
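A minimal sketch of that per-page loop follows, with model download, PDF page rendering, annotation, and error handling omitted; the zero-based page indexing and console format are illustrative rather than the sample's exact output:

```csharp
// Minimal sketch of the per-page OCR loop; page indexing shown zero-based for illustration.
using LM model = LM.LoadFromModelID("paddleocr-vl:0.9b");
var ocr = new VlmOcr(model, VlmOcrIntent.OcrWithCoordinates);
var attachment = new Attachment("contract.pdf");

for (int pageIndex = 0; pageIndex < attachment.PageCount; pageIndex++)
{
    VlmOcr.VlmOcrResult result = ocr.Run(attachment, pageIndex);

    int regionCount = 0;
    foreach (TextElement element in result.PageElement.TextElements)
    {
        Console.WriteLine($"[{regionCount}] \"{element.Text}\"  " +
                          $"Position: ({element.Left:F0}, {element.Top:F0})  " +
                          $"Size: {element.Width:F0} x {element.Height:F0}");
        regionCount++;
    }

    Console.WriteLine($"page {pageIndex + 1}: {regionCount} region(s) detected");
}
```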
✨ Key Features
📍 Coordinate extraction: each text region includes `Left`, `Top`, `Width`, `Height` in source image pixels.
🖼️ Annotated output: bounding boxes drawn directly on the image with the `Canvas` API.
📄 Image + PDF support: images are loaded directly; PDF pages are rendered before annotation.
📑 Multi-page aware: each page of a multi-page document gets its own annotated image (e.g. `contract_page1_annotated.png`).
📊 Telemetry:
- Elapsed time (seconds)
- Generated tokens count
- Stop reason
- Quality score
- Token generation rate
- Context tokens vs context size
📦 Model lifecycle:
- Automatic download on first use.
- Loading progress shown in the console.
❌ Nice errors: friendly message when a file path is invalid or the annotated image cannot be saved.
🧰 Built-In Models (menu)
On startup, the sample shows a model selection menu:
| Option | Model | Approx. VRAM Needed |
|---|---|---|
| 0 | PaddlePaddle PaddleOCR VL 1.5 0.9B | ~1 GB VRAM |
| other | Custom model URI (GGUF / LMK, etc.) | depends on model |
Only models that support bounding-box coordinate output are listed. The menu will grow as more engines add this capability. Any input other than `0` is treated as a custom model URI and passed directly to the `LM` constructor.
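For illustration, loading option 0 versus a custom URI looks roughly like this; the custom URI is a placeholder, and the exact `LM` constructor overload may differ from what the sample uses:

```csharp
// Option 0: built-in PaddleOCR VL resolved by model ID.
using LM builtIn = LM.LoadFromModelID("paddleocr-vl:0.9b");

// Any other input: treated as a model URI and handed to the LM constructor.
// Placeholder URI for illustration only.
using LM custom = new LM(new Uri("https://example.com/models/custom-vision-model.gguf"));
```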
🧠 How Coordinate Translation Works
PaddleOCR VL emits eight <|LOC_nnn|> tokens per text region (four corners, each with an X and Y coordinate on a normalized 0..999 grid). LM-Kit.NET translates these through two steps:
- LOC grid to processed image pixels. The LOC values are denormalized against the content dimensions of the image that was actually fed to the model.
- Processed image pixels to source image pixels. The preprocessing transform (crop, scale) is reversed so the final coordinates match the user's original image.
This happens automatically inside VlmOcr when the intent is OcrWithCoordinates. The result is a PageElement populated with TextElement instances whose Left, Top, Width, and Height are expressed in source image pixels.
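The two steps can be pictured with a small conceptual helper. The scale and crop parameters below are illustrative stand-ins for LM-Kit.NET's internal preprocessing transform chain, not a public API:

```csharp
// Conceptual illustration only; LM-Kit.NET performs this translation internally.
static (double X, double Y) LocToSourcePixels(
    int locX, int locY,                            // <|LOC_nnn|> values on the 0..999 grid
    double contentWidth, double contentHeight,     // content dimensions of the image fed to the model
    double scale, double cropLeft, double cropTop) // preprocessing transform applied before inference
{
    // Step 1: LOC grid -> processed image pixels.
    double processedX = locX / 999.0 * contentWidth;
    double processedY = locY / 999.0 * contentHeight;

    // Step 2: processed image pixels -> source image pixels (undo scale, then crop offset).
    return (processedX / scale + cropLeft, processedY / scale + cropTop);
}
```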
🛠️ Commands and Flow
Inside the console loop:
On startup
- Select a model (0) or paste a custom model URI.
- The model is downloaded (if needed) and loaded with progress reporting.
Per document (image or PDF)
The app prompts:
`enter image or document path (or 'q' to quit):`
Type a file path and press Enter.
The app loads the file into an `Attachment`.
The app iterates pages:
- For images, this is typically 1 page.
- For PDFs, this can be N pages.
For each page, OCR runs and prints:
- Each detected text region with `[index] "text"` and `Position: (x, y) Size: w x h`
- The total number of detected regions
- The path to the annotated output image
- A Stats section
Quit
- At any prompt, typing `q` exits the app cleanly.
🗣️ Example Use Cases
Try the sample with:
- A scanned invoice to locate line items, totals, and dates by position.
- A multi-page PDF contract to find where signature blocks and clause headings appear.
- A phone-captured photo of a receipt to verify which text regions the model detects.
- A form or ID card to identify field positions for downstream extraction.
After each run, inspect:
- The annotated image to verify bounding-box accuracy.
- The console output for exact pixel coordinates of each region.
- The quality score and token count to assess detection completeness.
💻 Minimal Integration Snippet
```csharp
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Extraction.Ocr;
using LMKit.Graphics.Drawing;
using LMKit.Graphics.Geometry;
using LMKit.Graphics.Primitives;
using LMKit.Media.Image;
using LMKit.Model;

// Load PaddleOCR VL model
using LM model = LM.LoadFromModelID("paddleocr-vl:0.9b");

// Create OCR engine with coordinate extraction
var ocr = new VlmOcr(model, VlmOcrIntent.OcrWithCoordinates)
{
    MaximumCompletionTokens = 4096
};

// Run OCR
var attachment = new Attachment("document.png");
VlmOcr.VlmOcrResult result = ocr.Run(attachment);

// Iterate text regions
foreach (TextElement element in result.PageElement.TextElements)
{
    Console.WriteLine($"\"{element.Text}\" at ({element.Left:F1}, {element.Top:F1}) " +
                      $"size {element.Width:F1} x {element.Height:F1}");
}

// Draw bounding boxes on the image
using ImageBuffer image = ImageBuffer.LoadAsRGB("document.png");
var canvas = new Canvas(image) { Antialiasing = true };
var pen = new Pen(new Color32(255, 0, 0), 2) { LineJoin = LineJoin.Miter };

foreach (TextElement element in result.PageElement.TextElements)
{
    canvas.DrawRectangle(
        Rectangle.FromSize(element.Left, element.Top, element.Width, element.Height),
        pen);
}

image.SaveAsPng("document_annotated.png");
```
🛠️ Getting Started
📋 Prerequisites
- .NET 8.0 or later
📥 Download
```bash
git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/vlm_ocr_with_coordinates
```
Project Link: vlm_ocr_with_coordinates (same path as above)
▶️ Run
```bash
dotnet build
dotnet run
```
Then:
- Select a vision model by typing 0, or paste a custom model URI.
- Wait for the model to download (first run) and load.
- When prompted, type the path to an image or document file (or `q` to quit).
- Inspect the detected text regions with coordinates in the console.
- Open the annotated image saved next to the original file.
- Press Enter to process another file, or `q` to exit.
🔍 Notes on Key Types
`VlmOcr` (LMKit.Extraction.Ocr): OCR engine built on top of a vision model.
- Construct with `new VlmOcr(model, VlmOcrIntent.OcrWithCoordinates)` to enable coordinate extraction.
- `Run(Attachment, pageIndex)` returns a result with `PageElement` containing `TextElement` instances.

`TextElement` (LMKit.Document.Layout): a recognized text region with spatial information.
- `Text`: the recognized string content.
- `Left`, `Top`: top-left corner position in source image pixels.
- `Width`, `Height`: bounding box dimensions in source image pixels.

`Canvas` (LMKit.Graphics.Drawing): a fluent drawing surface that wraps an `ImageBuffer`.
- `DrawRectangle(Rectangle, Pen)` renders an outline on the underlying image.
- All drawing is immediate and modifies the image in place.

`Pen` (LMKit.Graphics.Drawing): defines stroke color, width, and line join style.

`Attachment` (LMKit.Data): wraps external data (images or documents).
- `PageCount` exposes the number of pages (images are typically 1; PDFs can be many).
- `RenderPage(pageIndex, format)` renders a specific page of a multi-page document to an `ImageBuffer`.
🔧 Extend the Demo
- Change the box color or thickness by modifying the `Pen` constructor (e.g. green boxes with 4 px stroke).
- Add a semi-transparent fill behind each text region for better visibility.
- Write detected regions to a JSON file for downstream processing (see the sketch after this list).
- Filter regions by position or size to focus on specific areas of the document.
- Combine with LM-Kit's Structured Extraction to extract field values from specific regions.
- Add page selection for PDFs (`--pages 1,3-5`) to process only specific pages.
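As a starting point for the JSON export idea, here is a minimal sketch using System.Text.Json; it assumes the `result` variable from the integration snippet above, and the anonymous shape and output file name are illustrative:

```csharp
using System.Text.Json;

// `result` is the VlmOcr.VlmOcrResult produced in the integration snippet above.
var regions = new List<object>();

foreach (TextElement element in result.PageElement.TextElements)
{
    // Anonymous shape for brevity; swap in a proper DTO for production use.
    regions.Add(new { element.Text, element.Left, element.Top, element.Width, element.Height });
}

// Write the detected regions next to the original file for downstream processing.
File.WriteAllText("document_regions.json",
    JsonSerializer.Serialize(regions, new JsonSerializerOptions { WriteIndented = true }));
```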
📚 Additional Resources
- VlmOcr API Reference
- TextElement API Reference
- Canvas API Reference
- How-To: Locate Text Regions with Bounding Boxes Using VLM OCR
- VLM OCR Demo: general-purpose OCR demo with all intents