
Understanding Optical Character Recognition (OCR) in LM-Kit.NET


TL;DR

Optical Character Recognition (OCR) is the technology that converts images of text into machine-readable strings. Modern OCR goes far beyond simple character recognition: it can preserve document layout, extract tables, recognize mathematical formulas, locate text regions with bounding boxes, and convert scanned documents to structured formats like Markdown. In LM-Kit.NET, two OCR engines are available: VlmOcr, powered by vision language models, for AI-driven transcription with deep understanding of document structure; and TesseractOcr, the industry-standard engine for fast, layout-aware text extraction with support for 34 languages.


What is OCR?

Definition: Optical Character Recognition (OCR) is the process of extracting text from images, scanned documents, photographs, or PDF pages. Traditional OCR uses pattern matching and character-level classifiers. Modern OCR combines classical techniques with deep learning and vision language models (VLMs) to understand not just individual characters, but entire document structures: headings, paragraphs, tables, formulas, stamps, and charts.

Why OCR Matters

  1. Digitization: Convert paper documents, scanned archives, and photographs into searchable, editable text.
  2. Document Intelligence: Feed OCR output into RAG pipelines, extraction systems, or classification engines for automated document processing.
  3. Accessibility: Make image-based content accessible to screen readers and search engines.
  4. Automation: Process invoices, contracts, receipts, and forms without manual data entry.
  5. Multilingual Support: Extract text in dozens of languages from a single pipeline.

Traditional OCR vs. VLM OCR

| Aspect | Traditional OCR (Tesseract) | VLM OCR |
| --- | --- | --- |
| Approach | Character segmentation + classifiers | Vision language model inference |
| Speed | Very fast (milliseconds per page) | Slower (seconds per page, depends on model size) |
| Accuracy on clean text | Excellent | Excellent |
| Accuracy on degraded/noisy images | Good with preprocessing | Better contextual understanding |
| Layout understanding | Bounding boxes and text blocks | Full semantic structure (headings, sections, tables) |
| Table extraction | Limited | Preserves tabular structure |
| Formula recognition | Not supported | Supported (LaTeX output) |
| Output format | Plain text with coordinates | Markdown, structured text, coordinates |
| GPU required | No | Yes (model inference) |
| Languages | 34 (trained data files) | Depends on model training |

Use Tesseract when you need fast, high-volume text extraction from clean documents. Use VLM OCR when you need deep structural understanding, table/formula recognition, or Markdown output.


VLM OCR Intents

The VlmOcrIntent enum controls what kind of content the VLM OCR engine focuses on:

| Intent | Description | Output |
| --- | --- | --- |
| PlainText | Extract unformatted text content | Raw text |
| Markdown | Extract with full document structure | Structured Markdown with headings, lists, tables |
| TableRecognition | Focus on tabular data | Markdown tables |
| FormulaRecognition | Extract mathematical expressions | LaTeX notation |
| ChartRecognition | Interpret charts and graphs | Textual description of data |
| OcrWithCoordinates | Extract text with spatial positions | Text with bounding box coordinates |
| SealRecognition | Identify stamps and seals | Recognized seal text |

Not every model supports every intent. Use VlmOcr.GetSupportedIntents(model) to query which intents a given model family natively supports.
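As a sketch of how this query can drive intent selection at runtime (treating the returned collection as enumerable is an assumption; the model ID matches the examples below):

```csharp
using System.Linq;
using LMKit.Model;
using LMKit.Extraction.Ocr;

var model = LM.LoadFromModelID("qwen2-vl:7b");

// Query the intents this model family natively supports
var supported = VlmOcr.GetSupportedIntents(model);

// Fall back to plain text when formula recognition is unavailable
var intent = supported.Contains(VlmOcrIntent.FormulaRecognition)
    ? VlmOcrIntent.FormulaRecognition
    : VlmOcrIntent.PlainText;

var vlmOcr = new VlmOcr(model, intent);
```

Selecting the intent up front avoids a failed or degraded extraction on models that only support a subset of the enum.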


Practical Application in LM-Kit.NET SDK

VlmOcr: AI-Powered OCR

The VlmOcr class uses a loaded vision language model to transcribe images and documents. It operates in the LMKit.Extraction.Ocr namespace and extends the abstract OcrEngine base class.

Key capabilities:

  • Intent-driven extraction: Choose between plain text, Markdown, tables, formulas, charts, or coordinates.
  • Attachment support: Process PDF pages, images, and multi-page documents via the Attachment class.
  • Instruction customization: Override the default instruction to guide the model for domain-specific tasks.
  • Post-processing options: Strip Markdown image markup (StripImageMarkup) and HTML style attributes (StripStyleAttributes) from output.
  • Supported model families: Qwen2-VL, Gemma 3 VL, MiniCPM-O, LightONOCR, PaddleOCR VL, and more.
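The post-processing options above can be combined with a custom instruction. A minimal sketch, assuming the instruction override is exposed as an `Instruction` property (the property name is an assumption; `StripImageMarkup` and `StripStyleAttributes` are the options listed above):

```csharp
using LMKit.Model;
using LMKit.Extraction.Ocr;

var model = LM.LoadFromModelID("qwen2-vl:7b");

// Markdown intent with both post-processing options enabled
var vlmOcr = new VlmOcr(model, VlmOcrIntent.Markdown)
{
    StripImageMarkup = true,      // drop image markup from the Markdown output
    StripStyleAttributes = true   // drop inline HTML style attributes
};

// Assumed property name: override the default instruction to steer
// the model toward a domain-specific transcription task.
vlmOcr.Instruction = "Transcribe this form. Preserve field labels exactly as printed.";

var image = ImageBuffer.FromFile("intake_form.png");
var result = await vlmOcr.RunAsync(image);
Console.WriteLine(result.TextGeneration.TextContent);
```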

TesseractOcr: Traditional OCR

The TesseractOcr class in LMKit.Integrations.Tesseract wraps the Tesseract 5.x engine with additional intelligence:

  • 34 languages with automatic model download from HuggingFace.
  • Optional VLM-based language detection: Attach a vision model to auto-detect document language before OCR.
  • Auto-orientation detection: Correct rotated pages (0/90/180/270 degrees) automatically.
  • Auto-deskew: Straighten slightly tilted scans.
  • Layout-aware output: Returns OcrResult with TextElement objects containing bounding boxes, confidence scores, and page geometry.
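A sketch of the optional VLM-assisted language detection described above. Only the idea of attaching a vision model comes from the capability list; the `LanguageDetectionModel` property name and the small-model ID are assumptions for illustration:

```csharp
using LMKit.Model;
using LMKit.Integrations.Tesseract;
using LMKit.Extraction.Ocr;

// Load a small vision model for language detection (model ID is illustrative)
var visionModel = LM.LoadFromModelID("qwen2-vl:2b");

using var tesseract = new TesseractOcr
{
    EnableOrientationDetection = true,
    EnableAutoDeskew = true,
    // Assumed property name: attach a vision model so the engine can
    // auto-detect the document language and pick the right trained data.
    LanguageDetectionModel = visionModel
};

var result = await tesseract.RunAsync(new OcrParameters
{
    Image = ImageBuffer.FromFile("multilingual_scan.png")
});

Console.WriteLine(result.PageText);
```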

Built-In OCR Tool

The ocr_recognize built-in tool wraps Tesseract for use in agent workflows. Agents can invoke OCR on images and PDFs as part of their reasoning loop, making document understanding a first-class tool capability.


Code Example

VLM OCR: Extract Markdown from a Document

using LMKit.Model;
using LMKit.Extraction.Ocr;

// Load a vision language model
var model = LM.LoadFromModelID("qwen2-vl:7b");

// Create VLM OCR engine with Markdown intent
var vlmOcr = new VlmOcr(model, VlmOcrIntent.Markdown);

// Process an image
var image = ImageBuffer.FromFile("scanned_report.png");
var result = await vlmOcr.RunAsync(image);

Console.WriteLine(result.TextGeneration.TextContent);
// Output: structured Markdown with headings, paragraphs, and tables

VLM OCR: Extract Tables from a PDF Page

using LMKit.Model;
using LMKit.Extraction.Ocr;
using LMKit.Document;

var model = LM.LoadFromModelID("qwen2-vl:7b");
var vlmOcr = new VlmOcr(model, VlmOcrIntent.TableRecognition);

// Process a specific page from a PDF
var attachment = await Attachment.CreateAsync("financial_report.pdf");
var result = await vlmOcr.RunAsync(attachment, pageIndex: 2);

Console.WriteLine(result.TextGeneration.TextContent);
// Output: Markdown tables preserving row/column structure

Tesseract OCR: Fast Text Extraction

using LMKit.Integrations.Tesseract;
using LMKit.Extraction.Ocr;

// Create Tesseract OCR engine
using var tesseract = new TesseractOcr
{
    DefaultLanguage = "eng",
    EnableOrientationDetection = true,
    EnableAutoDeskew = true
};

// Process a scanned page
var ocrParams = new OcrParameters
{
    Image = ImageBuffer.FromFile("scanned_page.tiff")
};

var result = await tesseract.RunAsync(ocrParams);

Console.WriteLine($"Rotation detected: {result.PageRotation} degrees");
Console.WriteLine($"Text: {result.PageText}");

// Access individual text elements with positions
foreach (var element in result.TextElements)
{
    Console.WriteLine($"[{element.BoundingBox}] {element.Text}");
}

VLM OCR: Locate Text with Bounding Boxes

using LMKit.Model;
using LMKit.Extraction.Ocr;

var model = LM.LoadFromModelID("qwen2-vl:7b");

// Query supported intents for this model
var supported = VlmOcr.GetSupportedIntents(model);
Console.WriteLine($"Supported intents: {string.Join(", ", supported)}");

// Extract text with coordinates
var vlmOcr = new VlmOcr(model, VlmOcrIntent.OcrWithCoordinates);
var image = ImageBuffer.FromFile("form.png");
var result = await vlmOcr.RunAsync(image);

Console.WriteLine(result.TextGeneration.TextContent);
// Output: text regions with bounding box coordinates

The OCR Pipeline

A typical document processing pipeline using OCR:

+-------------+     +----------------+     +-------------------+
| Input       | --> | Preprocessing  | --> | OCR Engine        |
| (image/PDF) |     | (deskew, crop, |     | (VlmOcr or        |
|             |     |  orientation)  |     |  TesseractOcr)    |
+-------------+     +----------------+     +-------------------+
                                                    |
                                                    v
                                           +-------------------+
                                           | Post-Processing   |
                                           | (cleanup, format) |
                                           +-------------------+
                                                    |
                           +------------------------+------------------------+
                           |                        |                        |
                           v                        v                        v
                    +-----------+           +---------------+        +---------------+
                    | RAG       |           | Extraction    |        | Classification|
                    | Pipeline  |           | (NER, fields) |        | (document     |
                    | (chunking)|           |               |        |  type, lang)  |
                    +-----------+           +---------------+        +---------------+

Key Terms

  • OCR (Optical Character Recognition): Converting images of text into machine-readable character strings.
  • VLM OCR: OCR powered by a vision language model, capable of understanding document structure, not just individual characters.
  • Tesseract: The open-source OCR engine maintained by Google, integrated in LM-Kit.NET via TesseractOcr.
  • VlmOcrIntent: An enum specifying the extraction goal (plain text, Markdown, table, formula, coordinates, etc.).
  • Bounding Box: A rectangular region on the page defining where a text element is located.
  • Deskew: Correcting the slight rotation (skew) of a scanned document so text lines are horizontal.
  • OcrResult: The structured output from the OcrEngine, containing page geometry, text elements, and layout information.
  • PageElement: The layout-friendly representation of recognized content within an OcrResult or VlmOcrResult.

Related API Types

  • VlmOcr: AI-powered OCR engine using vision language models
  • VlmOcrIntent: Enum defining extraction intents
  • OcrEngine: Abstract base class for all OCR implementations
  • OcrResult: Structured OCR output with page geometry and text elements
  • TesseractOcr: Traditional Tesseract-based OCR engine
  • ImageBuffer: Image container for OCR input
  • Attachment: Document container supporting PDFs and images


Summary

Optical Character Recognition (OCR) transforms images and scanned documents into machine-readable text. LM-Kit.NET offers two complementary engines: VlmOcr for AI-powered transcription with deep structural understanding (tables, formulas, Markdown, coordinates), and TesseractOcr for fast, language-aware text extraction with auto-orientation and deskew. Both engines extend the shared OcrEngine base class and produce OcrResult objects with layout-aware text elements. Combined with the ocr_recognize built-in tool, OCR integrates seamlessly into agent workflows, RAG pipelines, and document processing systems, enabling fully automated document intelligence on-device.
