
Understanding Optical Character Recognition (OCR) in LM-Kit.NET


TL;DR

Optical Character Recognition (OCR) is the technology that converts images of text into machine-readable strings. Modern OCR goes far beyond simple character recognition: it can preserve document layout, extract tables, recognize mathematical formulas, locate text regions with bounding boxes, and convert scanned documents to structured formats like Markdown. In LM-Kit.NET, two OCR engines are available: VlmOcr, powered by vision language models, for AI-driven transcription with deep understanding of document structure; and LMKitOcr, a high-throughput engine with very high accuracy on business documents, advanced page layout handling, and support for 34 languages.


What is OCR?

Definition: Optical Character Recognition (OCR) is the process of extracting text from images, scanned documents, photographs, or PDF pages. Traditional OCR uses pattern matching and character-level classifiers. Modern OCR combines classical techniques with deep learning and vision language models (VLMs) to understand not just individual characters, but entire document structures: headings, paragraphs, tables, formulas, stamps, and charts.

Why OCR Matters

  1. Digitization: Convert paper documents, scanned archives, and photographs into searchable, editable text.
  2. Document Intelligence: Feed OCR output into RAG pipelines, extraction systems, or classification engines for automated document processing.
  3. Accessibility: Make image-based content accessible to screen readers and search engines.
  4. Automation: Process invoices, contracts, receipts, and forms without manual data entry.
  5. Multilingual Support: Extract text in dozens of languages from a single pipeline.

Traditional OCR vs. VLM OCR

Aspect | LM-Kit OCR | VLM OCR
Approach | High-throughput engine with advanced layout analysis | Vision language model inference
Speed | Very fast, optimized for high throughput | Slower (seconds per page, depending on model size)
Accuracy on business documents | Very high (invoices, contracts, reports, forms) | Excellent
Accuracy on degraded/noisy images | Good with preprocessing | Better contextual understanding
Layout understanding | Advanced page layout handling with reading order reconstruction | Full semantic structure (headings, sections, tables)
Table extraction | Limited | Preserves tabular structure
Formula recognition | Not supported | Supported (LaTeX output)
Output format | Plain text with coordinates | Markdown, structured text, coordinates
GPU required | No | Yes (model inference)
Languages | 34 (dictionaries) | Depends on model training

Use LM-Kit OCR when you need high-throughput processing with very high accuracy on business documents and complex page layouts. Use VLM OCR when you need deep semantic understanding, table/formula recognition, or Markdown output.


VLM OCR Intents

The VlmOcrIntent enum controls what kind of content the VLM OCR engine focuses on:

Intent | Description | Output
PlainText | Extract unformatted text content | Raw text
Markdown | Extract with full document structure | Structured Markdown with headings, lists, tables
TableRecognition | Focus on tabular data | Markdown tables
FormulaRecognition | Extract mathematical expressions | LaTeX notation
ChartRecognition | Interpret charts and graphs | Textual description of data
OcrWithCoordinates | Extract text with spatial positions | Text with bounding box coordinates
SealRecognition | Identify stamps and seals | Recognized seal text

Not every model supports every intent. Use VlmOcr.GetSupportedIntents(model) to query which intents a given model family natively supports.


Practical Application in LM-Kit.NET SDK

VlmOcr: AI-Powered OCR

The VlmOcr class uses a loaded vision language model to transcribe images and documents. It operates in the LMKit.Extraction.Ocr namespace and extends the abstract OcrEngine base class.

Key capabilities:

  • Intent-driven extraction: Choose between plain text, Markdown, tables, formulas, charts, or coordinates.
  • Attachment support: Process PDF pages, images, and multi-page documents via the Attachment class.
  • Instruction customization: Override the default instruction to guide the model for domain-specific tasks.
  • Post-processing options: Strip Markdown image markup (StripImageMarkup) and HTML style attributes (StripStyleAttributes) from output.
  • Supported model families: Qwen2-VL, Gemma 4 VL, MiniCPM-O, LightONOCR, PaddleOCR VL, and more.
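The post-processing options above can be combined with an intent in a single configuration. A minimal sketch, assuming StripImageMarkup and StripStyleAttributes are settable properties on VlmOcr (the source names them as options; the exact member shape is an assumption):

```csharp
using LMKit.Model;
using LMKit.Extraction.Ocr;

// Load a vision language model
var model = LM.LoadFromModelID("qwen2-vl:7b");

// Markdown intent with both post-processing options enabled
// (assumes these are settable properties on VlmOcr)
var vlmOcr = new VlmOcr(model, VlmOcrIntent.Markdown)
{
    StripImageMarkup = true,      // drop ![...](...) image markup from the output
    StripStyleAttributes = true   // drop HTML style attributes from the output
};

var image = ImageBuffer.FromFile("product_sheet.png");
var result = await vlmOcr.RunAsync(image);

Console.WriteLine(result.TextGeneration.TextContent);
// Output: clean Markdown without image markup or inline styles
```

Enabling both options is useful when the Markdown will be fed into a RAG pipeline, where image references and style attributes add noise without adding meaning.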

LMKitOcr: High-Throughput OCR

The LMKitOcr class in LMKit.Extraction.Ocr is engineered for speed, accuracy on business documents, and complex page layout handling:

  • High throughput: optimized for large-scale batch processing workflows.
  • Very high accuracy on business documents: invoices, contracts, reports, forms, and similar structured documents.
  • Complex page layout handling: advanced layout analysis with intelligent reading order reconstruction for multi-column documents.
  • 34 languages with automatic model download from HuggingFace.
  • Optional VLM-based language detection: Attach a vision model to auto-detect document language before OCR.
  • Auto-orientation detection: Correct rotated pages (0/90/180/270 degrees) automatically.
  • Auto-deskew: Straighten slightly tilted scans.
  • Layout-aware output: Returns OcrResult with TextElement objects containing bounding boxes, confidence scores, and page geometry.
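Several of the capabilities above can be exercised in one pass. A minimal sketch, assuming TextElement exposes its confidence score through a Confidence member (the "confidence scores" capability is listed above; the member name itself is an assumption):

```csharp
using LMKit.Extraction.Ocr;

// High-throughput OCR with deskew enabled; "fra" selects the French dictionary
using var ocr = new LMKitOcr
{
    DefaultLanguage = "fra",
    EnableAutoDeskew = true
};

var ocrParams = new OcrParameters(
    ImageBuffer.FromFile("facture.png"),
    enableOrientationDetection: true);

var result = await ocr.RunAsync(ocrParams);

// Filter low-confidence elements before downstream processing
// (Confidence is assumed here, per the capability list above)
foreach (var element in result.TextElements)
{
    if (element.Confidence >= 0.5)
        Console.WriteLine($"[{element.BoundingBox}] {element.Text}");
}
```

Filtering on confidence before handing text to an extraction or classification stage is a common way to keep noisy regions of a degraded scan from polluting downstream results.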

Built-In OCR Tool

The ocr_recognize built-in tool wraps LM-Kit OCR for use in agent workflows. Agents can invoke OCR on images and PDFs as part of their reasoning loop, making document understanding a first-class tool capability.


Code Example

VLM OCR: Extract Markdown from a Document

using LMKit.Model;
using LMKit.Extraction.Ocr;

// Load a vision language model
var model = LM.LoadFromModelID("qwen2-vl:7b");

// Create VLM OCR engine with Markdown intent
var vlmOcr = new VlmOcr(model, VlmOcrIntent.Markdown);

// Process an image
var image = ImageBuffer.FromFile("scanned_report.png");
var result = await vlmOcr.RunAsync(image);

Console.WriteLine(result.TextGeneration.TextContent);
// Output: structured Markdown with headings, paragraphs, and tables

VLM OCR: Extract Tables from a PDF Page

using LMKit.Model;
using LMKit.Extraction.Ocr;
using LMKit.Document;

var model = LM.LoadFromModelID("qwen2-vl:7b");
var vlmOcr = new VlmOcr(model, VlmOcrIntent.TableRecognition);

// Process a specific page from a PDF
var attachment = await Attachment.CreateAsync("financial_report.pdf");
var result = await vlmOcr.RunAsync(attachment, pageIndex: 2);

Console.WriteLine(result.TextGeneration.TextContent);
// Output: Markdown tables preserving row/column structure

LM-Kit OCR: Fast Text Extraction

using LMKit.Extraction.Ocr;

// Create LM-Kit OCR engine
using var ocr = new LMKitOcr
{
    DefaultLanguage = "eng",
    EnableAutoDeskew = true
};

// Process a scanned page with orientation detection
var ocrParams = new OcrParameters(
    ImageBuffer.FromFile("scanned_page.tiff"),
    enableOrientationDetection: true);

var result = await ocr.RunAsync(ocrParams);

Console.WriteLine($"Rotation detected: {result.PageRotation} degrees");
Console.WriteLine($"Text: {result.PageText}");

// Access individual text elements with positions
foreach (var element in result.TextElements)
{
    Console.WriteLine($"[{element.BoundingBox}] {element.Text}");
}

VLM OCR: Locate Text with Bounding Boxes

using LMKit.Model;
using LMKit.Extraction.Ocr;

var model = LM.LoadFromModelID("qwen2-vl:7b");

// Query supported intents for this model
var supported = VlmOcr.GetSupportedIntents(model);
Console.WriteLine($"Supported intents: {string.Join(", ", supported)}");

// Extract text with coordinates
var vlmOcr = new VlmOcr(model, VlmOcrIntent.OcrWithCoordinates);
var image = ImageBuffer.FromFile("form.png");
var result = await vlmOcr.RunAsync(image);

Console.WriteLine(result.TextGeneration.TextContent);
// Output: text regions with bounding box coordinates

The OCR Pipeline

A typical document processing pipeline using OCR:

+-------------+     +----------------+     +-------------------+
| Input       | --> | Preprocessing  | --> | OCR Engine        |
| (image/PDF) |     | (deskew, crop, |     | (VlmOcr or        |
|             |     |  orientation)  |     |  LMKitOcr)        |
+-------------+     +----------------+     +-------------------+
                                                    |
                                                    v
                                           +-------------------+
                                           | Post-Processing   |
                                           | (cleanup, format) |
                                           +-------------------+
                                                    |
                           +------------------------+------------------------+
                           |                        |                        |
                           v                        v                        v
                    +-----------+           +---------------+        +---------------+
                    | RAG       |           | Extraction    |        | Classification|
                    | Pipeline  |           | (NER, fields) |        | (document     |
                    | (chunking)|           |               |        |  type, lang)  |
                    +-----------+           +---------------+        +---------------+
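The pipeline above can be sketched end-to-end: run the fast engine, then chunk the page text for a downstream RAG indexer. The fixed-size chunker is illustrative only; production pipelines typically use semantic or sentence-aware chunking:

```csharp
using LMKit.Extraction.Ocr;

// Steps 1-3: input, preprocessing (orientation detection, deskew), OCR
using var ocr = new LMKitOcr
{
    DefaultLanguage = "eng",
    EnableAutoDeskew = true
};
var result = await ocr.RunAsync(new OcrParameters(
    ImageBuffer.FromFile("archive_page.png"),
    enableOrientationDetection: true));

// Step 4: post-processing — naive fixed-size chunking for a RAG pipeline
const int chunkSize = 800;
string text = result.PageText;
var chunks = new List<string>();
for (int i = 0; i < text.Length; i += chunkSize)
{
    chunks.Add(text.Substring(i, Math.Min(chunkSize, text.Length - i)));
}

Console.WriteLine($"{chunks.Count} chunks ready for embedding and indexing");
```

Because OcrResult also carries bounding boxes and page geometry, the same output can instead be routed to the extraction or classification branches of the diagram without re-running OCR.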

Key Terms

  • OCR (Optical Character Recognition): Converting images of text into machine-readable character strings.
  • VLM OCR: OCR powered by a vision language model, capable of understanding document structure, not just individual characters.
  • LM-Kit OCR: The built-in traditional OCR engine in LM-Kit.NET, available via the LMKitOcr class.
  • VlmOcrIntent: An enum specifying the extraction goal (plain text, Markdown, table, formula, coordinates, etc.).
  • Bounding Box: A rectangular region on the page defining where a text element is located.
  • Deskew: Correcting the slight rotation (skew) of a scanned document so text lines are horizontal.
  • OcrResult: The structured output from the OcrEngine, containing page geometry, text elements, and layout information.
  • PageElement: The layout-friendly representation of recognized content within an OcrResult or VlmOcrResult.

Related API Types

  • VlmOcr: AI-powered OCR engine using vision language models
  • VlmOcrIntent: Enum defining extraction intents
  • OcrEngine: Abstract base class for all OCR implementations
  • OcrResult: Structured OCR output with page geometry and text elements
  • LMKitOcr: Traditional OCR engine for fast, layout-aware text extraction
  • ImageBuffer: Image container for OCR input
  • Attachment: Document container supporting PDFs and images



Summary

Optical Character Recognition (OCR) transforms images and scanned documents into machine-readable text. LM-Kit.NET offers two complementary engines: VlmOcr for AI-powered transcription with deep structural understanding (tables, formulas, Markdown, coordinates), and LMKitOcr for fast, language-aware text extraction with auto-orientation and deskew. Both engines extend the shared OcrEngine base class and produce OcrResult objects with layout-aware text elements. Combined with the ocr_recognize built-in tool, OCR integrates seamlessly into agent workflows, RAG pipelines, and document processing systems, enabling fully automated document intelligence on-device.
