Locate Text Regions with Bounding Boxes Using VLM OCR

Some document processing workflows require more than plain text: you also need to know where each text region sits on the page. LM-Kit.NET's VlmOcr engine can return structured TextElement instances with bounding boxes when the underlying model supports coordinate output; PaddleOCR VL 1.5 is one such model. Combined with the OcrWithCoordinates intent, it emits normalized location tokens that LM-Kit.NET automatically translates back to pixel coordinates in the source image. This tutorial shows how to run OCR with coordinate extraction, iterate over the detected regions, and draw their bounding boxes onto the original image using the Canvas drawing API.


Prerequisites

Requirement   Minimum
.NET SDK      8.0+
VRAM          ~1 GB (PaddleOCR VL 1.5)
Disk          ~750 MB free for model download

Input formats: PNG, JPEG, TIFF, BMP, WebP, or any image format supported by LM-Kit.NET.


Create the Project

dotnet new console -n OcrWithCoordinates
cd OcrWithCoordinates
dotnet add package LM-Kit.NET

Complete Example

The following program loads an image, runs OCR with coordinate extraction, draws a red bounding box around every detected text region, prints each region's text and position, and saves the annotated image. Copy the entire snippet into Program.cs and replace the input path with your own image.

using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Extraction.Ocr;
using LMKit.Graphics.Drawing;
using LMKit.Graphics.Geometry;
using LMKit.Graphics.Primitives;
using LMKit.Media.Image;
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load PaddleOCR VL model
// ──────────────────────────────────────
Console.WriteLine("Loading PaddleOCR VL model...");
using LM model = LM.LoadFromModelID("paddleocr-vl:0.9b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Run OCR with coordinate extraction
// ──────────────────────────────────────
var ocr = new VlmOcr(model, VlmOcrIntent.OcrWithCoordinates)
{
    MaximumCompletionTokens = 4096
};

string inputPath = "document.png";            // <── replace with your image
string outputPath = "document_annotated.png";

Console.WriteLine($"Running OCR on {inputPath}...\n");

using var attachment = new Attachment(inputPath);
VlmOcr.VlmOcrResult result = ocr.Run(attachment);
PageElement page = result.PageElement;

// ──────────────────────────────────────
// 3. Print detected text regions
// ──────────────────────────────────────
int index = 0;

foreach (TextElement element in page.TextElements)
{
    Console.WriteLine($"  [{index}] \"{element.Text}\"");
    Console.WriteLine($"       Position: ({element.Left:F1}, {element.Top:F1})  " +
                      $"Size: {element.Width:F1} x {element.Height:F1}");
    index++;
}

Console.WriteLine($"\nTotal regions: {index}");

// ──────────────────────────────────────
// 4. Draw bounding boxes on the image
// ──────────────────────────────────────
using ImageBuffer image = ImageBuffer.LoadAsRGB(inputPath);
var canvas = new Canvas(image) { Antialiasing = true };

var pen = new Pen(new Color32(255, 0, 0), 2)  // red, 2 px
{
    LineJoin = LineJoin.Miter
};

foreach (TextElement element in page.TextElements)
{
    var rect = Rectangle.FromSize(
        element.Left,
        element.Top,
        element.Width,
        element.Height);

    canvas.DrawRectangle(rect, pen);
}

// ──────────────────────────────────────
// 5. Save annotated image
// ──────────────────────────────────────
image.SaveAsPng(outputPath);

Console.WriteLine($"\nAnnotated image saved to {outputPath}");

How It Works

Coordinate pipeline

PaddleOCR VL emits eight <|LOC_nnn|> tokens per text region (four corners, each with an X and Y coordinate on a normalized 0..999 grid). LM-Kit.NET translates these through two steps:

  1. LOC grid to processed image pixels. The LOC values are denormalized against the content dimensions of the image that was actually fed to the model.
  2. Processed image pixels to source image pixels. The preprocessing transform (crop, scale) is reversed so the final coordinates match the user's original image.

This happens automatically inside VlmOcr when the intent is OcrWithCoordinates and the model supports it. The result is a PageElement populated with TextElement instances whose Left, Top, Width, and Height are expressed in source image pixels.
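The two steps above can be sketched as plain arithmetic. This is an illustrative sketch only, not LM-Kit.NET API: the name LocToSource and the parameters scale and cropOffset are assumptions standing in for whatever preprocessing transform was applied, and LM-Kit.NET performs this translation for you internally.

```csharp
using System;

// Illustrative sketch of the two-step LOC mapping described above.
// LocToSource, scale, and cropOffset are hypothetical names.
static double LocToSource(int loc, double processedExtent, double scale, double cropOffset)
{
    // Step 1: denormalize the 0..999 LOC value to processed-image pixels.
    double processedPx = loc / 999.0 * processedExtent;

    // Step 2: reverse the preprocessing transform so the coordinate
    // lands in source-image pixels (undo the scale, then the crop).
    return processedPx / scale + cropOffset;
}

// A LOC value of 999 maps to the far edge of the processed content;
// with a 0.5x downscale and no crop, that is twice the processed extent.
Console.WriteLine(LocToSource(999, 1024, 0.5, 0));  // 2048
```

Because the final division and offset are per-image, the same LOC value can map to different source pixels for differently preprocessed inputs, which is why the engine carries the transform alongside the model output.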

Drawing

The Canvas class wraps an ImageBuffer and provides fluent drawing primitives. DrawRectangle renders an outline using the supplied Pen. All drawing is immediate and modifies the underlying image in place. Saving the ImageBuffer afterwards writes the annotated result.
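If you want a translucent highlight behind each box in addition to the outline, the fill must be drawn before the stroke so the outline stays crisp. The fragment below is meant to slot into step 4 of the main example; the FillRectangle call is an assumption based on the Customization Ideas section, so verify the exact signature against the API reference.

```csharp
// Fragment for step 4 of the main example: fill, then stroke.
// FillRectangle usage is an assumption, not confirmed API.
var highlight = new Color32(255, 255, 0, 80);   // translucent yellow
foreach (TextElement element in page.TextElements)
{
    var rect = Rectangle.FromSize(element.Left, element.Top, element.Width, element.Height);
    canvas.FillRectangle(rect, highlight);   // fill first...
    canvas.DrawRectangle(rect, pen);         // ...then draw the outline on top
}
```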


Customization Ideas

- Different box color: change the Color32 in the Pen constructor (e.g. new Color32(0, 255, 0) for green).
- Thicker outlines: increase the pen size (e.g. new Pen(color, 4)).
- Semi-transparent fill behind text: call canvas.FillRectangle(rect, new Color32(255, 255, 0, 80)) before drawing the outline.
- Save as JPEG: replace SaveAsPng with SaveAsJpeg(outputPath, quality: 90).
- Process a PDF page: use new Attachment("file.pdf"), pass a page index via ocr.Run(attachment, pageIndex: 0), then render the page to an ImageBuffer for drawing.

Common Issues

Problem: PageElement.TextElements is empty
Cause:   the model did not emit LOC tokens
Fix:     ensure you use VlmOcrIntent.OcrWithCoordinates with PaddleOCR VL

Problem: boxes appear shifted
Cause:   the image was resized externally after OCR
Fix:     draw on the same image that was passed to VlmOcr, not a resized copy

Problem: output is truncated
Cause:   the token limit is too low for dense pages
Fix:     increase MaximumCompletionTokens to 8192 or higher

Next Steps