Locate Text Regions with Bounding Boxes Using VLM OCR

Some document processing workflows require more than plain text: you also need to know where each text region sits on the page. LM-Kit.NET's VlmOcr engine can return structured TextElement instances with bounding boxes when the underlying model supports coordinate output; PaddleOCR VL 1.5 is one such model. Combined with the OcrWithCoordinates intent, it emits normalized location tokens that LM-Kit.NET automatically translates back to pixel coordinates in the source image. This tutorial shows how to run OCR with coordinate extraction, iterate over the detected regions, and draw their bounding boxes onto the original image using the Canvas drawing API.


Prerequisites

Requirement   Minimum
.NET SDK      8.0+
VRAM          ~1 GB (PaddleOCR VL 1.5)
Disk          ~750 MB free for model download

Input formats: PNG, JPEG, TIFF, BMP, WebP, or any image format supported by LM-Kit.NET.


Create the Project

dotnet new console -n OcrWithCoordinates
cd OcrWithCoordinates
dotnet add package LM-Kit.NET

Complete Example

The following program loads an image, runs OCR with coordinate extraction, draws a red bounding box around every detected text region, prints each region's text and position, and saves the annotated image. Copy the entire snippet into Program.cs and replace the input path with your own image.

using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Extraction.Ocr;
using LMKit.Graphics.Drawing;
using LMKit.Graphics.Geometry;
using LMKit.Graphics.Primitives;
using LMKit.Media.Image;
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load PaddleOCR VL model
// ──────────────────────────────────────
Console.WriteLine("Loading PaddleOCR VL model...");
using LM model = LM.LoadFromModelID("paddleocr-vl:0.9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Run OCR with coordinate extraction
// ──────────────────────────────────────
var ocr = new VlmOcr(model, VlmOcrIntent.OcrWithCoordinates)
{
    MaximumCompletionTokens = 4096
};

string inputPath = "document.png";            // <── replace with your image
string outputPath = "document_annotated.png";

Console.WriteLine($"Running OCR on {inputPath}...\n");

using var attachment = new Attachment(inputPath);
VlmOcr.VlmOcrResult result = ocr.Run(attachment);
PageElement page = result.PageElement;

// ──────────────────────────────────────
// 3. Print detected text regions
// ──────────────────────────────────────
int index = 0;

foreach (TextElement element in page.TextElements)
{
    Console.WriteLine($"  [{index}] \"{element.Text}\"");
    Console.WriteLine($"       Position: ({element.Left:F1}, {element.Top:F1})  " +
                      $"Size: {element.Width:F1} x {element.Height:F1}");
    index++;
}

Console.WriteLine($"\nTotal regions: {index}");

// ──────────────────────────────────────
// 4. Draw bounding boxes on the image
// ──────────────────────────────────────
using ImageBuffer image = ImageBuffer.LoadAsRGB(inputPath);
var canvas = new Canvas(image) { Antialiasing = true };

var pen = new Pen(new Color32(255, 0, 0), 2)  // red, 2 px
{
    LineJoin = LineJoin.Miter
};

foreach (TextElement element in page.TextElements)
{
    var rect = Rectangle.FromSize(
        element.Left,
        element.Top,
        element.Width,
        element.Height);

    canvas.DrawRectangle(rect, pen);
}

// ──────────────────────────────────────
// 5. Save annotated image
// ──────────────────────────────────────
image.SaveAsPng(outputPath);

Console.WriteLine($"\nAnnotated image saved to {outputPath}");

How It Works

Coordinate pipeline

PaddleOCR VL emits eight <|LOC_nnn|> tokens per text region (four corners, each with an X and Y coordinate on a normalized 0..999 grid). LM-Kit.NET translates these through two steps:

  1. LOC grid to processed image pixels. The LOC values are denormalized against the content dimensions of the image that was actually fed to the model.
  2. Processed image pixels to source image pixels. The preprocessing transform (crop, scale) is reversed so the final coordinates match the user's original image.

This happens automatically inside VlmOcr when the intent is OcrWithCoordinates and the model supports it. The result is a PageElement populated with TextElement instances whose Left, Top, Width, and Height are expressed in source image pixels.
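The two steps above can be sketched as plain arithmetic. This is an illustrative sketch only, not LM-Kit.NET API: the name LocToSource and the parameters scale and cropOffset are assumptions standing in for whatever preprocessing transform was applied, and LM-Kit.NET performs this translation for you internally.

```csharp
using System;

// Illustrative sketch of the two-step LOC mapping described above.
// LocToSource, scale, and cropOffset are hypothetical names.
static double LocToSource(int loc, double processedExtent, double scale, double cropOffset)
{
    // Step 1: denormalize the 0..999 LOC value to processed-image pixels.
    double processedPx = loc / 999.0 * processedExtent;

    // Step 2: reverse the preprocessing transform so the coordinate
    // lands in source-image pixels (undo the scale, then the crop).
    return processedPx / scale + cropOffset;
}

// A LOC value of 999 maps to the far edge of the processed content;
// with a 0.5x downscale and no crop, that is twice the processed extent.
Console.WriteLine(LocToSource(999, 1024, 0.5, 0));  // 2048
```

Because the final division and offset are per-image, the same LOC value can map to different source pixels for differently preprocessed inputs, which is why the engine carries the transform alongside the model output.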

Drawing

The Canvas class wraps an ImageBuffer and provides fluent drawing primitives. DrawRectangle renders an outline using the supplied Pen. All drawing is immediate and modifies the underlying image in place. Saving the ImageBuffer afterwards writes the annotated result.
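If you want a translucent highlight behind each box in addition to the outline, the fill must be drawn before the stroke so the outline stays crisp. The fragment below is meant to slot into step 4 of the main example; the FillRectangle call is an assumption based on the Customization Ideas section, so verify the exact signature against the API reference.

```csharp
// Fragment for step 4 of the main example: fill, then stroke.
// FillRectangle usage is an assumption, not confirmed API.
var highlight = new Color32(255, 255, 0, 80);   // translucent yellow
foreach (TextElement element in page.TextElements)
{
    var rect = Rectangle.FromSize(element.Left, element.Top, element.Width, element.Height);
    canvas.FillRectangle(rect, highlight);   // fill first...
    canvas.DrawRectangle(rect, pen);         // ...then draw the outline on top
}
```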


Customization Ideas

- Different box color: change the Color32 in the Pen constructor (e.g. new Color32(0, 255, 0) for green).
- Thicker outlines: increase the pen size (e.g. new Pen(color, 4)).
- Semi-transparent fill behind text: call canvas.FillRectangle(rect, new Color32(255, 255, 0, 80)) before drawing the outline.
- Save as JPEG: replace SaveAsPng with SaveAsJpeg(outputPath, quality: 90).
- Process a PDF page: use new Attachment("file.pdf"), pass a page index via ocr.Run(attachment, pageIndex: 0), then render the page to an ImageBuffer for drawing.

Common Issues

Problem: PageElement.TextElements is empty
Cause:   the model did not emit LOC tokens
Fix:     ensure you use VlmOcrIntent.OcrWithCoordinates with PaddleOCR VL

Problem: boxes appear shifted
Cause:   the image was resized externally after OCR
Fix:     draw on the same image that was passed to VlmOcr, not a resized copy

Problem: output is truncated
Cause:   the token limit is too low for dense pages
Fix:     increase MaximumCompletionTokens to 8192 or higher

Next Steps