
Understanding Optical Character Recognition (OCR) in LM-Kit.NET


TL;DR

Optical Character Recognition (OCR) is the technology that converts images of text into machine-readable strings. Modern OCR goes far beyond simple character recognition: it can preserve document layout, extract tables, recognize mathematical formulas, locate text regions with bounding boxes, and convert scanned documents to structured formats like Markdown. In LM-Kit.NET, two OCR engines are available: VlmOcr, powered by vision language models, for AI-driven transcription with deep understanding of document structure; and TesseractOcr, the industry-standard engine for fast, layout-aware text extraction with support for 34 languages.


What is OCR?

Definition: Optical Character Recognition (OCR) is the process of extracting text from images, scanned documents, photographs, or PDF pages. Traditional OCR uses pattern matching and character-level classifiers. Modern OCR combines classical techniques with deep learning and vision language models (VLMs) to understand not just individual characters, but entire document structures: headings, paragraphs, tables, formulas, stamps, and charts.

Why OCR Matters

  1. Digitization: Convert paper documents, scanned archives, and photographs into searchable, editable text.
  2. Document Intelligence: Feed OCR output into RAG pipelines, extraction systems, or classification engines for automated document processing.
  3. Accessibility: Make image-based content accessible to screen readers and search engines.
  4. Automation: Process invoices, contracts, receipts, and forms without manual data entry.
  5. Multilingual Support: Extract text in dozens of languages from a single pipeline.

Traditional OCR vs. VLM OCR

| Aspect | Traditional OCR (Tesseract) | VLM OCR |
| --- | --- | --- |
| Approach | Character segmentation + classifiers | Vision language model inference |
| Speed | Very fast (milliseconds per page) | Slower (seconds per page, depends on model size) |
| Accuracy on clean text | Excellent | Excellent |
| Accuracy on degraded/noisy images | Good with preprocessing | Better contextual understanding |
| Layout understanding | Bounding boxes and text blocks | Full semantic structure (headings, sections, tables) |
| Table extraction | Limited | Preserves tabular structure |
| Formula recognition | Not supported | Supported (LaTeX output) |
| Output format | Plain text with coordinates | Markdown, structured text, coordinates |
| GPU required | No | Yes (model inference) |
| Languages | 34 (trained data files) | Depends on model training |

Use Tesseract when you need fast, high-volume text extraction from clean documents. Use VLM OCR when you need deep structural understanding, table/formula recognition, or Markdown output.


VLM OCR Intents

The VlmOcrIntent enum controls what kind of content the VLM OCR engine focuses on:

| Intent | Description | Output |
| --- | --- | --- |
| PlainText | Extract unformatted text content | Raw text |
| Markdown | Extract with full document structure | Structured Markdown with headings, lists, tables |
| TableRecognition | Focus on tabular data | Markdown tables |
| FormulaRecognition | Extract mathematical expressions | LaTeX notation |
| ChartRecognition | Interpret charts and graphs | Textual description of data |
| OcrWithCoordinates | Extract text with spatial positions | Text with bounding box coordinates |
| SealRecognition | Identify stamps and seals | Recognized seal text |

Not every model supports every intent. Use VlmOcr.GetSupportedIntents(model) to query which intents a given model family natively supports.
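As a sketch of how this query can drive intent selection at runtime (treating the returned collection as enumerable is an assumption; the model ID matches the examples below):

```csharp
using System.Linq;
using LMKit.Model;
using LMKit.Extraction.Ocr;

var model = LM.LoadFromModelID("qwen2-vl:7b");

// Query the intents this model family natively supports
var supported = VlmOcr.GetSupportedIntents(model);

// Fall back to plain text when formula recognition is unavailable
var intent = supported.Contains(VlmOcrIntent.FormulaRecognition)
    ? VlmOcrIntent.FormulaRecognition
    : VlmOcrIntent.PlainText;

var vlmOcr = new VlmOcr(model, intent);
```

Selecting the intent up front avoids a failed or degraded extraction on models that only support a subset of the enum.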


Practical Application in LM-Kit.NET SDK

VlmOcr: AI-Powered OCR

The VlmOcr class uses a loaded vision language model to transcribe images and documents. It operates in the LMKit.Extraction.Ocr namespace and extends the abstract OcrEngine base class.

Key capabilities:

  • Intent-driven extraction: Choose between plain text, Markdown, tables, formulas, charts, or coordinates.
  • Attachment support: Process PDF pages, images, and multi-page documents via the Attachment class.
  • Instruction customization: Override the default instruction to guide the model for domain-specific tasks.
  • Post-processing options: Strip Markdown image markup (StripImageMarkup) and HTML style attributes (StripStyleAttributes) from output.
  • Supported model families: Qwen2-VL, Gemma 3 VL, MiniCPM-O, LightONOCR, PaddleOCR VL, and more.
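The post-processing options above can be combined with a custom instruction. A minimal sketch, assuming the instruction override is exposed as an `Instruction` property (the property name is an assumption; `StripImageMarkup` and `StripStyleAttributes` are the options listed above):

```csharp
using LMKit.Model;
using LMKit.Extraction.Ocr;

var model = LM.LoadFromModelID("qwen2-vl:7b");

// Markdown intent with both post-processing options enabled
var vlmOcr = new VlmOcr(model, VlmOcrIntent.Markdown)
{
    StripImageMarkup = true,      // drop image markup from the Markdown output
    StripStyleAttributes = true   // drop inline HTML style attributes
};

// Assumed property name: override the default instruction to steer
// the model toward a domain-specific transcription task.
vlmOcr.Instruction = "Transcribe this form. Preserve field labels exactly as printed.";

var image = ImageBuffer.FromFile("intake_form.png");
var result = await vlmOcr.RunAsync(image);
Console.WriteLine(result.TextGeneration.TextContent);
```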

TesseractOcr: Traditional OCR

The TesseractOcr class in LMKit.Integrations.Tesseract wraps the Tesseract 5.x engine with additional intelligence:

  • 34 languages with automatic model download from HuggingFace.
  • Optional VLM-based language detection: Attach a vision model to auto-detect document language before OCR.
  • Auto-orientation detection: Correct rotated pages (0/90/180/270 degrees) automatically.
  • Auto-deskew: Straighten slightly tilted scans.
  • Layout-aware output: Returns OcrResult with TextElement objects containing bounding boxes, confidence scores, and page geometry.
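A sketch of the optional VLM-assisted language detection described above. Only the idea of attaching a vision model comes from the capability list; the `LanguageDetectionModel` property name and the small-model ID are assumptions for illustration:

```csharp
using LMKit.Model;
using LMKit.Integrations.Tesseract;
using LMKit.Extraction.Ocr;

// Load a small vision model for language detection (model ID is illustrative)
var visionModel = LM.LoadFromModelID("qwen2-vl:2b");

using var tesseract = new TesseractOcr
{
    EnableOrientationDetection = true,
    EnableAutoDeskew = true,
    // Assumed property name: attach a vision model so the engine can
    // auto-detect the document language and pick the right trained data.
    LanguageDetectionModel = visionModel
};

var result = await tesseract.RunAsync(new OcrParameters
{
    Image = ImageBuffer.FromFile("multilingual_scan.png")
});

Console.WriteLine(result.PageText);
```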

Built-In OCR Tool

The ocr_recognize built-in tool wraps Tesseract for use in agent workflows. Agents can invoke OCR on images and PDFs as part of their reasoning loop, making document understanding a first-class tool capability.


Code Example

VLM OCR: Extract Markdown from a Document

using LMKit.Model;
using LMKit.Extraction.Ocr;

// Load a vision language model
var model = LM.LoadFromModelID("qwen2-vl:7b");

// Create VLM OCR engine with Markdown intent
var vlmOcr = new VlmOcr(model, VlmOcrIntent.Markdown);

// Process an image
var image = ImageBuffer.FromFile("scanned_report.png");
var result = await vlmOcr.RunAsync(image);

Console.WriteLine(result.TextGeneration.TextContent);
// Output: structured Markdown with headings, paragraphs, and tables

VLM OCR: Extract Tables from a PDF Page

using LMKit.Model;
using LMKit.Extraction.Ocr;
using LMKit.Document;

var model = LM.LoadFromModelID("qwen2-vl:7b");
var vlmOcr = new VlmOcr(model, VlmOcrIntent.TableRecognition);

// Process a specific page from a PDF
var attachment = await Attachment.CreateAsync("financial_report.pdf");
var result = await vlmOcr.RunAsync(attachment, pageIndex: 2);

Console.WriteLine(result.TextGeneration.TextContent);
// Output: Markdown tables preserving row/column structure

Tesseract OCR: Fast Text Extraction

using LMKit.Integrations.Tesseract;
using LMKit.Extraction.Ocr;

// Create Tesseract OCR engine
using var tesseract = new TesseractOcr
{
    DefaultLanguage = "eng",
    EnableOrientationDetection = true,
    EnableAutoDeskew = true
};

// Process a scanned page
var ocrParams = new OcrParameters
{
    Image = ImageBuffer.FromFile("scanned_page.tiff")
};

var result = await tesseract.RunAsync(ocrParams);

Console.WriteLine($"Rotation detected: {result.PageRotation} degrees");
Console.WriteLine($"Text: {result.PageText}");

// Access individual text elements with positions
foreach (var element in result.TextElements)
{
    Console.WriteLine($"[{element.BoundingBox}] {element.Text}");
}

VLM OCR: Locate Text with Bounding Boxes

using LMKit.Model;
using LMKit.Extraction.Ocr;

var model = LM.LoadFromModelID("qwen2-vl:7b");

// Query supported intents for this model
var supported = VlmOcr.GetSupportedIntents(model);
Console.WriteLine($"Supported intents: {string.Join(", ", supported)}");

// Extract text with coordinates
var vlmOcr = new VlmOcr(model, VlmOcrIntent.OcrWithCoordinates);
var image = ImageBuffer.FromFile("form.png");
var result = await vlmOcr.RunAsync(image);

Console.WriteLine(result.TextGeneration.TextContent);
// Output: text regions with bounding box coordinates

The OCR Pipeline

A typical document processing pipeline using OCR:

+-------------+     +----------------+     +-------------------+
| Input       | --> | Preprocessing  | --> | OCR Engine        |
| (image/PDF) |     | (deskew, crop, |     | (VlmOcr or        |
|             |     |  orientation)  |     |  TesseractOcr)    |
+-------------+     +----------------+     +-------------------+
                                                    |
                                                    v
                                           +-------------------+
                                           | Post-Processing   |
                                           | (cleanup, format) |
                                           +-------------------+
                                                    |
                           +------------------------+------------------------+
                           |                        |                        |
                           v                        v                        v
                    +-----------+           +---------------+        +---------------+
                    | RAG       |           | Extraction    |        | Classification|
                    | Pipeline  |           | (NER, fields) |        | (document     |
                    | (chunking)|           |               |        |  type, lang)  |
                    +-----------+           +---------------+        +---------------+

Key Terms

  • OCR (Optical Character Recognition): Converting images of text into machine-readable character strings.
  • VLM OCR: OCR powered by a vision language model, capable of understanding document structure, not just individual characters.
  • Tesseract: The open-source OCR engine maintained by Google, integrated in LM-Kit.NET via TesseractOcr.
  • VlmOcrIntent: An enum specifying the extraction goal (plain text, Markdown, table, formula, coordinates, etc.).
  • Bounding Box: A rectangular region on the page defining where a text element is located.
  • Deskew: Correcting the slight rotation (skew) of a scanned document so text lines are horizontal.
  • OcrResult: The structured output from the OcrEngine, containing page geometry, text elements, and layout information.
  • PageElement: The layout-friendly representation of recognized content within an OcrResult or VlmOcrResult.

Related API Types

  • VlmOcr: AI-powered OCR engine using vision language models
  • VlmOcrIntent: Enum defining extraction intents
  • OcrEngine: Abstract base class for all OCR implementations
  • OcrResult: Structured OCR output with page geometry and text elements
  • TesseractOcr: Traditional Tesseract-based OCR engine
  • ImageBuffer: Image container for OCR input
  • Attachment: Document container supporting PDFs and images


Summary

Optical Character Recognition (OCR) transforms images and scanned documents into machine-readable text. LM-Kit.NET offers two complementary engines: VlmOcr for AI-powered transcription with deep structural understanding (tables, formulas, Markdown, coordinates), and TesseractOcr for fast, language-aware text extraction with auto-orientation and deskew. Both engines extend the shared OcrEngine base class and produce OcrResult objects with layout-aware text elements. Combined with the ocr_recognize built-in tool, OCR integrates seamlessly into agent workflows, RAG pipelines, and document processing systems, enabling fully automated document intelligence on-device.
