Understanding Optical Character Recognition (OCR) in LM-Kit.NET
TL;DR
Optical Character Recognition (OCR) is the technology that converts images of text into machine-readable strings. Modern OCR goes far beyond simple character recognition: it can preserve document layout, extract tables, recognize mathematical formulas, locate text regions with bounding boxes, and convert scanned documents to structured formats like Markdown. In LM-Kit.NET, two OCR engines are available: VlmOcr, powered by vision language models, for AI-driven transcription with deep understanding of document structure; and TesseractOcr, the industry-standard engine for fast, layout-aware text extraction with support for 34 languages.
What is OCR?
Definition: Optical Character Recognition (OCR) is the process of extracting text from images, scanned documents, photographs, or PDF pages. Traditional OCR uses pattern matching and character-level classifiers. Modern OCR combines classical techniques with deep learning and vision language models (VLMs) to understand not just individual characters, but entire document structures: headings, paragraphs, tables, formulas, stamps, and charts.
Why OCR Matters
- Digitization: Convert paper documents, scanned archives, and photographs into searchable, editable text.
- Document Intelligence: Feed OCR output into RAG pipelines, extraction systems, or classification engines for automated document processing.
- Accessibility: Make image-based content accessible to screen readers and search engines.
- Automation: Process invoices, contracts, receipts, and forms without manual data entry.
- Multilingual Support: Extract text in dozens of languages from a single pipeline.
Traditional OCR vs. VLM OCR
| Aspect | Traditional OCR (Tesseract) | VLM OCR |
|---|---|---|
| Approach | Character segmentation + classifiers | Vision language model inference |
| Speed | Very fast (milliseconds per page) | Slower (seconds per page, depends on model size) |
| Accuracy on clean text | Excellent | Excellent |
| Accuracy on degraded/noisy images | Good with preprocessing | Better contextual understanding |
| Layout understanding | Bounding boxes and text blocks | Full semantic structure (headings, sections, tables) |
| Table extraction | Limited | Preserves tabular structure |
| Formula recognition | Not supported | Supported (LaTeX output) |
| Output format | Plain text with coordinates | Markdown, structured text, coordinates |
| GPU required | No | Recommended (model inference runs on CPU, but slowly) |
| Languages | 34 (trained data files) | Depends on model training |
Use Tesseract when you need fast, high-volume text extraction from clean documents. Use VLM OCR when you need deep structural understanding, table/formula recognition, or Markdown output.
VLM OCR Intents
The VlmOcrIntent enum controls what kind of content the VLM OCR engine focuses on:
| Intent | Description | Output |
|---|---|---|
| PlainText | Extract unformatted text content | Raw text |
| Markdown | Extract with full document structure | Structured Markdown with headings, lists, tables |
| TableRecognition | Focus on tabular data | Markdown tables |
| FormulaRecognition | Extract mathematical expressions | LaTeX notation |
| ChartRecognition | Interpret charts and graphs | Textual description of data |
| OcrWithCoordinates | Extract text with spatial positions | Text with bounding box coordinates |
| SealRecognition | Identify stamps and seals | Recognized seal text |
Not every model supports every intent. Use VlmOcr.GetSupportedIntents(model) to query which intents a given model family natively supports.
Practical Application in LM-Kit.NET SDK
VlmOcr: AI-Powered OCR
The VlmOcr class uses a loaded vision language model to transcribe images and documents. It operates in the LMKit.Extraction.Ocr namespace and extends the abstract OcrEngine base class.
Key capabilities:
- Intent-driven extraction: Choose between plain text, Markdown, tables, formulas, charts, or coordinates.
- Attachment support: Process PDF pages, images, and multi-page documents via the Attachment class.
- Instruction customization: Override the default instruction to guide the model for domain-specific tasks.
- Post-processing options: Strip Markdown image markup (StripImageMarkup) and HTML style attributes (StripStyleAttributes) from output.
- Supported model families: Qwen2-VL, Gemma 3 VL, MiniCPM-O, LightONOCR, PaddleOCR VL, and more.
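The post-processing options above can be combined with a custom instruction. The sketch below assumes an `Instruction` property as the way to override the default prompt; the source only states that the instruction can be overridden, so that property name is an assumption, while `StripImageMarkup` and `StripStyleAttributes` are the documented option names.

```csharp
using LMKit.Model;
using LMKit.Extraction.Ocr;

var model = LM.LoadFromModelID("qwen2-vl:7b");

// Markdown intent with both documented post-processing options enabled
var vlmOcr = new VlmOcr(model, VlmOcrIntent.Markdown)
{
    StripImageMarkup = true,     // remove Markdown image tags from the output
    StripStyleAttributes = true  // remove inline HTML style attributes
};

// Assumed property name for instruction customization (illustrative only)
vlmOcr.Instruction = "Transcribe only the invoice line items; ignore headers and footers.";

var image = ImageBuffer.FromFile("invoice.png");
var result = await vlmOcr.RunAsync(image);
Console.WriteLine(result.TextGeneration.TextContent);
```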
TesseractOcr: Traditional OCR
The TesseractOcr class in LMKit.Integrations.Tesseract wraps the Tesseract 5.x engine with additional intelligence:
- 34 languages with automatic model download from HuggingFace.
- Optional VLM-based language detection: Attach a vision model to auto-detect document language before OCR.
- Auto-orientation detection: Correct rotated pages (0/90/180/270 degrees) automatically.
- Auto-deskew: Straighten slightly tilted scans.
- Layout-aware output: Returns OcrResult with TextElement objects containing bounding boxes, confidence scores, and page geometry.
Built-In OCR Tool
The ocr_recognize built-in tool wraps Tesseract for use in agent workflows. Agents can invoke OCR on images and PDFs as part of their reasoning loop, making document understanding a first-class tool capability.
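As a rough sketch of how this might look in an agent workflow: the registration API below is hypothetical (the article names only the ocr_recognize tool, not the agent-side calls), so treat every identifier here as illustrative rather than the SDK's actual surface.

```csharp
// Hypothetical sketch: agent/tool registration names are illustrative,
// not confirmed SDK signatures. Only "ocr_recognize" is documented.
var agent = new Agent(model);
agent.Tools.Add("ocr_recognize");  // expose the built-in OCR tool to the agent

// The agent can now decide to run OCR on an attached scan mid-reasoning
var attachment = await Attachment.CreateAsync("contract_scan.pdf");
var answer = await agent.RunAsync("List the parties named in this contract.", attachment);
Console.WriteLine(answer);
```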
Code Example
VLM OCR: Extract Markdown from a Document
```csharp
using LMKit.Model;
using LMKit.Extraction.Ocr;

// Load a vision language model
var model = LM.LoadFromModelID("qwen2-vl:7b");

// Create VLM OCR engine with Markdown intent
var vlmOcr = new VlmOcr(model, VlmOcrIntent.Markdown);

// Process an image
var image = ImageBuffer.FromFile("scanned_report.png");
var result = await vlmOcr.RunAsync(image);

Console.WriteLine(result.TextGeneration.TextContent);
// Output: structured Markdown with headings, paragraphs, and tables
```
VLM OCR: Extract Tables from a PDF Page
```csharp
using LMKit.Model;
using LMKit.Extraction.Ocr;
using LMKit.Document;

var model = LM.LoadFromModelID("qwen2-vl:7b");
var vlmOcr = new VlmOcr(model, VlmOcrIntent.TableRecognition);

// Process a specific page from a PDF
var attachment = await Attachment.CreateAsync("financial_report.pdf");
var result = await vlmOcr.RunAsync(attachment, pageIndex: 2);

Console.WriteLine(result.TextGeneration.TextContent);
// Output: Markdown tables preserving row/column structure
```
Tesseract OCR: Fast Text Extraction
```csharp
using LMKit.Integrations.Tesseract;
using LMKit.Extraction.Ocr;

// Create Tesseract OCR engine
using var tesseract = new TesseractOcr
{
    DefaultLanguage = "eng",
    EnableOrientationDetection = true,
    EnableAutoDeskew = true
};

// Process a scanned page
var ocrParams = new OcrParameters
{
    Image = ImageBuffer.FromFile("scanned_page.tiff")
};

var result = await tesseract.RunAsync(ocrParams);

Console.WriteLine($"Rotation detected: {result.PageRotation} degrees");
Console.WriteLine($"Text: {result.PageText}");

// Access individual text elements with positions
foreach (var element in result.TextElements)
{
    Console.WriteLine($"[{element.BoundingBox}] {element.Text}");
}
```
VLM OCR: Locate Text with Bounding Boxes
```csharp
using LMKit.Model;
using LMKit.Extraction.Ocr;

var model = LM.LoadFromModelID("qwen2-vl:7b");

// Query supported intents for this model
var supported = VlmOcr.GetSupportedIntents(model);
Console.WriteLine($"Supported intents: {string.Join(", ", supported)}");

// Extract text with coordinates
var vlmOcr = new VlmOcr(model, VlmOcrIntent.OcrWithCoordinates);
var image = ImageBuffer.FromFile("form.png");
var result = await vlmOcr.RunAsync(image);

Console.WriteLine(result.TextGeneration.TextContent);
// Output: text regions with bounding box coordinates
```
The OCR Pipeline
A typical document processing pipeline using OCR:
```
+-------------+      +----------------+      +-------------------+
|   Input     | -->  | Preprocessing  | -->  |   OCR Engine      |
| (image/PDF) |      | (deskew, crop, |      |   (VlmOcr or      |
|             |      |  orientation)  |      |   TesseractOcr)   |
+-------------+      +----------------+      +-------------------+
                                                      |
                                                      v
                                            +-------------------+
                                            |  Post-Processing  |
                                            | (cleanup, format) |
                                            +-------------------+
                                                      |
             +------------------------+---------------+--------+
             |                        |                        |
             v                        v                        v
       +-----------+         +---------------+        +---------------+
       |   RAG     |         |  Extraction   |        | Classification|
       | Pipeline  |         | (NER, fields) |        | (document     |
       | (chunking)|         |               |        |  type, lang)  |
       +-----------+         +---------------+        +---------------+
```
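The first three stages of that pipeline can be sketched end-to-end with the Tesseract engine shown earlier. The OCR calls below follow the article's own examples; the cleanup and chunking stages use plain C# string handling purely for illustration, since the downstream RAG/chunking APIs are not covered here.

```csharp
using System.Linq;
using LMKit.Integrations.Tesseract;
using LMKit.Extraction.Ocr;

// OCR stage: Tesseract with preprocessing options enabled
using var tesseract = new TesseractOcr
{
    DefaultLanguage = "eng",
    EnableOrientationDetection = true,
    EnableAutoDeskew = true
};
var result = await tesseract.RunAsync(new OcrParameters
{
    Image = ImageBuffer.FromFile("scan.png")
});

// Post-processing stage: trivial cleanup (drop blank lines)
string text = string.Join("\n",
    result.PageText.Split('\n').Where(l => !string.IsNullOrWhiteSpace(l)));

// Downstream stage: naive fixed-size chunking for a RAG pipeline.
// (LM-Kit has its own chunking support; this plain-C# version is illustrative.)
const int chunkSize = 1000;
var chunks = Enumerable.Range(0, (text.Length + chunkSize - 1) / chunkSize)
    .Select(i => text.Substring(i * chunkSize,
                                Math.Min(chunkSize, text.Length - i * chunkSize)))
    .ToList();
```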
Key Terms
- OCR (Optical Character Recognition): Converting images of text into machine-readable character strings.
- VLM OCR: OCR powered by a vision language model, capable of understanding document structure, not just individual characters.
- Tesseract: The widely used open-source OCR engine (originally developed at HP, later sponsored by Google), integrated in LM-Kit.NET via TesseractOcr.
- VlmOcrIntent: An enum specifying the extraction goal (plain text, Markdown, table, formula, coordinates, etc.).
- Bounding Box: A rectangular region on the page defining where a text element is located.
- Deskew: Correcting the slight rotation (skew) of a scanned document so text lines are horizontal.
- OcrResult: The structured output from the OcrEngine, containing page geometry, text elements, and layout information.
- PageElement: The layout-friendly representation of recognized content within an OcrResult or VlmOcrResult.
Related API Documentation
- VlmOcr: AI-powered OCR engine using vision language models
- VlmOcrIntent: Enum defining extraction intents
- OcrEngine: Abstract base class for all OCR implementations
- OcrResult: Structured OCR output with page geometry and text elements
- TesseractOcr: Traditional Tesseract-based OCR engine
- ImageBuffer: Image container for OCR input
- Attachment: Document container supporting PDFs and images
Related Glossary Topics
- Vision Language Models (VLM): The multimodal models that power VLM OCR
- Intelligent Document Processing (IDP): End-to-end document pipelines where OCR is one stage
- Structured Data Extraction: Extracting typed fields from OCR output
- RAG (Retrieval-Augmented Generation): Feeding OCR output into retrieval pipelines
- Chunking: Splitting OCR output into retrievable segments
- Named Entity Recognition (NER): Identifying entities within OCR-extracted text
- Classification: Classifying documents after OCR extraction
- Inference: The model inference process that underlies VLM OCR
External Resources
- An Overview of the Tesseract OCR Engine (Smith, 2007): The paper describing Tesseract's architecture
- Qwen2-VL: Enhancing Vision-Language Model's Perception (Wang et al., 2024): The vision model behind LM-Kit.NET's primary VLM OCR backend
- LM-Kit VLM OCR Demo: Working sample for VLM-based OCR
- Extract Text with VLM OCR (How-To): Step-by-step VLM OCR guide
Summary
Optical Character Recognition (OCR) transforms images and scanned documents into machine-readable text. LM-Kit.NET offers two complementary engines: VlmOcr for AI-powered transcription with deep structural understanding (tables, formulas, Markdown, coordinates), and TesseractOcr for fast, language-aware text extraction with auto-orientation and deskew. Both engines extend the shared OcrEngine base class and produce OcrResult objects with layout-aware text elements. Combined with the ocr_recognize built-in tool, OCR integrates seamlessly into agent workflows, RAG pipelines, and document processing systems, enabling fully automated document intelligence on-device.