Process Scanned Documents with OCR and Vision Models
Many enterprise documents exist only as scanned images: legacy archives, signed contracts, faxed purchase orders, and handwritten inspection forms. These documents have no text layer, so standard text extraction returns nothing. LM-Kit.NET provides three built-in OCR engines: VlmOcr (Vision Language Model with Dynamic Sampling) for layout-aware understanding, TesseractOcr for traditional character recognition, and TextractOcr for cloud-based OCR via Amazon Textract. All three inherit from the OcrEngine abstract class, which you can also extend to integrate any other OCR provider. For engines that return word bounding boxes (TesseractOcr, TextractOcr, or any custom provider), LM-Kit.NET's internal layout analysis system reconstructs the full document structure: paragraphs with correct reading order, lines, and words. The InferenceModality setting controls how extraction and analysis use text, vision, or both. This tutorial builds a scanned document processor that selects the right OCR strategy per document type and shows how to plug in custom OCR backends.
Why Choosing the Right OCR Approach Matters
Two enterprise problems that a configurable OCR strategy solves:
- Mixed-quality document archives. An insurance company digitizing 20 years of claims has clean typed forms alongside handwritten adjuster notes and faded fax copies. VLM OCR handles degraded inputs and handwriting, while Tesseract OCR is faster for clean typed documents. A strategy that routes documents to the right engine maximizes throughput without sacrificing accuracy.
- Complex document layouts. Financial statements, engineering drawings, and medical forms combine tables, charts, stamps, and free-form text. LM-Kit.NET handles layout reconstruction at two levels. For bounding-box engines (TesseractOcr, TextractOcr, or custom providers), the internal layout analysis system reconstructs paragraphs, reading order, and line grouping from word coordinates. For VLM OCR with the recommended lightonocr-2:1b model, Dynamic Sampling produces structured Markdown that preserves tables and headings directly. Both paths enable accurate downstream extraction and search.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 2+ GB for VLM OCR, none for Tesseract |
| Disk | ~2 GB free for model download |
| Input formats | Scanned PDF, PNG, JPEG, TIFF, BMP, WebP |
Step 1: Create the Project
dotnet new console -n ScannedDocProcessor
cd ScannedDocProcessor
dotnet add package LM-Kit.NET
Step 2: Understand the OCR Architecture
All OCR engines in LM-Kit.NET inherit from the abstract OcrEngine class. This means any engine can be used interchangeably with TextExtraction, DocumentRag, and other document processing components.
Layout reconstruction. TesseractOcr and TextractOcr return word-level bounding boxes. LM-Kit.NET feeds these bounding boxes into its internal layout analysis system, which reconstructs the full document structure: paragraphs with correct reading order, lines, and words. As long as an OCR engine provides word bounding boxes, LM-Kit.NET can reconstruct the layout with very high precision. This layout analysis system is the result of continuous research in document layout understanding and is improved with every release.
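To see the bounding-box path in isolation, you can run TesseractOcr on its own and inspect the reconstructed text. The snippet below is a minimal sketch: it assumes the parameterless TesseractOcr constructor used in Step 9, reuses the OcrParameters/RunAsync pattern shown for Textract in Step 8, and uses a hypothetical input file name.
// Minimal sketch: run Tesseract directly and read the reconstructed layout
var tesseractOcr = new TesseractOcr();
var tesseractParams = new OcrParameters(new ImageBuffer("typed_letter.png"));
OcrResult tesseractResult = await tesseractOcr.RunAsync(tesseractParams);
// PageText contains the reconstructed layout: paragraphs in reading order, lines, and words
Console.WriteLine(tesseractResult.PageText);
// Word-level bounding boxes remain available for downstream processing
foreach (var word in tesseractResult.TextElements)
{
    Console.WriteLine($"  \"{word.Text}\" at ({word.X:F0}, {word.Y:F0})");
}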
VLM OCR with Dynamic Sampling. VlmOcr takes a different approach: it sends the page image directly to a Vision Language Model, which understands the layout visually and produces structured Markdown. When paired with the recommended lightonocr-2:1b model, LM-Kit.NET applies Dynamic Sampling technology on top of the model, achieving exceptional precision and speed for OCR workloads.
OcrEngine (abstract)
│
┌───────────────────┼───────────────────┐
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ VlmOcr │ │ TesseractOcr │ │ TextractOcr │
│ │ │ │ │ │
│ Vision LLM │ │ Traditional │ │ Amazon │
│ + Dynamic │ │ character │ │ Textract │
│ Sampling │ │ recognition │ │ cloud API │
│ │ │ │ │ │
│ Output: │ │ Output: │ │ Output: │
│ Structured │ │ Reconstructed │ │ Reconstructed │
│ Markdown │ │ layout via │ │ layout via │
│ (visual) │ │ bounding boxes │ │ bounding boxes │
└─────────────────┘ └─────────────────┘ └─────────────────┘
You can also subclass OcrEngine to add Google Vision,
Azure AI Vision, or any other OCR backend.
| Feature | VlmOcr | TesseractOcr | TextractOcr |
|---|---|---|---|
| Layout preservation | Structured Markdown (visual understanding) | Reconstructed paragraphs, lines, words via layout analysis | Reconstructed paragraphs, lines, words via layout analysis |
| Handwriting | Good (context-aware) | Limited | Good |
| Speed | Fast with lightonocr-2:1b + Dynamic Sampling | Faster (CPU-based) | Fast (cloud) |
| GPU required | Yes | No | No (cloud-based) |
| Internet required | No | No | Yes |
| Best for | Complex layouts, mixed content, degraded scans | Clean typed text, high-volume batch | High-throughput cloud workloads |
Step 3: VLM OCR for Complex Documents
VLM OCR sends each page image directly to a Vision Language Model, which visually interprets the layout and produces structured Markdown. The recommended model for OCR workloads is lightonocr-2:1b, a purpose-built OCR model that LM-Kit.NET enhances with Dynamic Sampling technology. Dynamic Sampling optimizes the token generation strategy at inference time, delivering exceptional accuracy and speed that surpass what the base model achieves alone.
using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Graphics;
using LMKit.Model;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load the recommended OCR model (lightonocr-2:1b with Dynamic Sampling)
// ──────────────────────────────────────
Console.WriteLine("Loading vision model for OCR...");
using LM visionModel = LM.LoadFromModelID("lightonocr-2:1b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Process a scanned image with VLM OCR
// ──────────────────────────────────────
var vlmOcr = new VlmOcr(visionModel)
{
MaximumCompletionTokens = 4096
};
Console.WriteLine("=== VLM OCR: Scanned Document ===\n");
string imagePath = "scanned_invoice.png";
if (File.Exists(imagePath))
{
var image = new ImageBuffer(imagePath);
Console.Write($"Processing {imagePath}... ");
VlmOcr.VlmOcrResult result = vlmOcr.Run(image);
string markdown = result.TextGeneration.Completion;
Console.WriteLine($"done ({result.TextGeneration.GeneratedTokenCount} tokens)\n");
Console.ForegroundColor = ConsoleColor.Cyan;
Console.WriteLine(markdown);
Console.ResetColor();
// Save as Markdown
File.WriteAllText("output.md", markdown);
Console.WriteLine("\nSaved to output.md");
}
Step 4: Custom OCR Instructions
Tailor OCR behavior for specific document types:
// Standard document transcription
vlmOcr.Instruction = "Transcribe this document as Markdown, preserving headings, tables, and lists.";
// Focus on tabular data
vlmOcr.Instruction = "This is a financial statement. Extract all tables as Markdown tables. " +
"Preserve column headers and alignment. Include all numeric values.";
// Handwritten notes
vlmOcr.Instruction = "This is a handwritten document. Transcribe the handwriting as accurately as possible. " +
"Use [illegible] for text that cannot be read.";
// Forms with labeled fields
vlmOcr.Instruction = "This is a filled form. Extract each field as 'Label: Value' on a separate line. " +
"Include checkboxes as [x] checked or [ ] unchecked.";
// Code or technical diagrams
vlmOcr.Instruction = "This contains source code. Transcribe as a fenced code block with language annotation.";
Step 5: Process Multi-Page Scanned PDFs
Console.WriteLine("\n=== Multi-Page Scanned PDF ===\n");
string pdfPath = "scanned_report.pdf";
if (File.Exists(pdfPath))
{
var attachment = new Attachment(pdfPath);
int pageCount = attachment.PageCount;
Console.WriteLine($"Processing {pageCount} pages from {Path.GetFileName(pdfPath)}...\n");
var fullDocument = new StringBuilder();
for (int page = 0; page < pageCount; page++)
{
Console.Write($" Page {page + 1}/{pageCount}... ");
VlmOcr.VlmOcrResult pageResult = vlmOcr.Run(attachment, pageIndex: page);
string pageMarkdown = pageResult.TextGeneration.Completion;
fullDocument.AppendLine($"## Page {page + 1}");
fullDocument.AppendLine();
fullDocument.AppendLine(pageMarkdown);
fullDocument.AppendLine();
Console.WriteLine($"{pageResult.TextGeneration.GeneratedTokenCount} tokens");
}
string outputPath = Path.ChangeExtension(pdfPath, ".md");
File.WriteAllText(outputPath, fullDocument.ToString());
Console.WriteLine($"\nSaved {pageCount} pages to {outputPath}");
}
Step 6: Using InferenceModality for Extraction
When combining OCR with data extraction, the InferenceModality property controls how the model processes the input:
using LMKit.Extraction;
using LMKit.Inference;
Console.WriteLine("\n=== Extraction from Scanned Documents ===\n");
// Load a general-purpose model for extraction
Console.WriteLine("Loading extraction model...");
using LM extractionModel = LM.LoadFromModelID("gemma3:4b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
var extractor = new TextExtraction(extractionModel)
{
Elements = new List<TextExtractionElement>
{
new("invoice_number", TextExtractionElement.ElementType.String, "Invoice number"),
new("vendor_name", TextExtractionElement.ElementType.String, "Vendor or company name"),
new("total_amount", TextExtractionElement.ElementType.Number, "Total amount"),
}
};
// Text mode: uses extracted text only (fast, needs text layer or pre-OCR)
extractor.PreferredInferenceModality = InferenceModality.Text;
// Vision mode: sends the image directly to the model (no OCR needed)
extractor.PreferredInferenceModality = InferenceModality.Vision;
// Multimodal: combines both text and image for best accuracy
extractor.PreferredInferenceModality = InferenceModality.Multimodal;
// BestModality: model picks the best single modality automatically
extractor.PreferredInferenceModality = InferenceModality.BestModality;
// Extract from scanned image using vision
extractor.PreferredInferenceModality = InferenceModality.Vision;
extractor.SetContent(new ImageBuffer("scanned_invoice.png"));
TextExtractionResult result = extractor.Parse();
Console.WriteLine($"Invoice #: {result.GetValue<string>("invoice_number")}");
Console.WriteLine($"Vendor: {result.GetValue<string>("vendor_name")}");
Console.WriteLine($"Total: {result.GetValue<double>("total_amount")}");
Step 7: OCR Engine Events
Monitor OCR processing with events:
vlmOcr.OcrStarting += (sender, e) =>
{
Console.WriteLine($" OCR starting for page...");
// Set e.Cancel = true to skip this page
};
vlmOcr.OcrCompleted += (sender, e) =>
{
Console.WriteLine($" OCR completed: {e.Result.TextGeneration.GeneratedTokenCount} tokens");
};
Step 8: Amazon Textract OCR
For cloud-based OCR with Amazon Textract, use TextractOcr. This sends images to the AWS Textract API and returns word-level bounding boxes. LM-Kit.NET's layout analysis system then reconstructs the full document structure (paragraphs with reading order, lines, and words) from these bounding boxes with very high precision:
using LMKit.Integrations.AWS;
using LMKit.Integrations.AWS.Ocr.Textract;
Console.WriteLine("\n=== Amazon Textract OCR ===\n");
// ──────────────────────────────────────
// Configure Textract with AWS credentials
// ──────────────────────────────────────
var textractOcr = new TextractOcr(
awsAccessKeyId: Environment.GetEnvironmentVariable("AWS_ACCESS_KEY_ID"),
awsSecretAccessKey: Environment.GetEnvironmentVariable("AWS_SECRET_ACCESS_KEY"),
region: AWSRegion.USEast1)
{
Timeout = TimeSpan.FromSeconds(30)
};
// Monitor progress with events (inherited from OcrEngine)
textractOcr.OcrStarting += (_, e) =>
{
Console.WriteLine($" Sending page to Textract...");
};
textractOcr.OcrCompleted += (_, e) =>
{
if (e.Exception != null)
Console.WriteLine($" Textract error: {e.Exception.Message}");
else
Console.WriteLine($" Textract completed: {e.Result.PageText.Length} chars");
};
// Process a scanned image
string imagePath = "scanned_invoice.png";
if (File.Exists(imagePath))
{
var parameters = new OcrParameters(new ImageBuffer(imagePath));
OcrResult textractResult = await textractOcr.RunAsync(parameters);
Console.ForegroundColor = ConsoleColor.Cyan;
Console.WriteLine($"\n{textractResult.PageText}");
Console.ResetColor();
// Access bounding box information for layout analysis
foreach (var element in textractResult.TextElements)
{
Console.WriteLine($" Text: \"{element.Text}\" at ({element.X:F0}, {element.Y:F0})");
}
}
You can parse the region from a string using AWSRegionConverter:
// Parse region from configuration
AWSRegion region = AWSRegionConverter.ParseRegion("eu-west-1");
string regionId = AWSRegionConverter.ToIdentifier(AWSRegion.EUWest1); // "eu-west-1"
Step 9: Use Any OCR Engine with TextExtraction and DocumentRag
Every OcrEngine subclass works interchangeably with TextExtraction and DocumentRag through the OcrEngine property:
using LMKit.Extraction;
using LMKit.Retrieval;
// ──────────────────────────────────────
// Use Textract with TextExtraction
// ──────────────────────────────────────
var extractor = new TextExtraction(extractionModel)
{
OcrEngine = textractOcr, // Swap in any OcrEngine implementation
Elements = new List<TextExtractionElement>
{
new("invoice_number", TextExtractionElement.ElementType.String, "Invoice number"),
new("vendor_name", TextExtractionElement.ElementType.String, "Vendor or company name"),
new("total_amount", TextExtractionElement.ElementType.Number, "Total amount"),
}
};
// ──────────────────────────────────────
// Use Textract with DocumentRag
// ──────────────────────────────────────
var rag = new DocumentRag(embeddingModel)
{
OcrEngine = textractOcr // Scanned pages use Textract for text extraction
};
// Switch to VLM OCR for vision-based understanding
rag.OcrEngine = vlmOcr;
// Switch to Tesseract for CPU-only environments
rag.OcrEngine = new TesseractOcr();
Step 10: Build a Custom OCR Provider
The OcrEngine abstract class lets you integrate any OCR backend (Google Cloud Vision, Azure AI Vision, ABBYY, or a custom service). Override the RunAsync method and return an OcrResult. If your OCR provider returns word bounding boxes, include them in the OcrResult so that LM-Kit.NET's layout analysis system can reconstruct paragraphs, reading order, lines, and words with high precision:
using LMKit.Extraction.Ocr;
public sealed class GoogleVisionOcr : OcrEngine
{
private readonly string _apiKey;
public GoogleVisionOcr(string apiKey)
{
_apiKey = apiKey;
}
public override async Task<OcrResult> RunAsync(
OcrParameters ocrParameters,
CancellationToken cancellationToken = default)
{
// 1. Get the image bytes from OcrParameters
byte[] imageBytes = ocrParameters.ImageData; // PNG-encoded image
string mime = ocrParameters.Mime; // Always "image/png"
// 2. Call your OCR service
// ... send imageBytes to Google Cloud Vision API ...
string extractedText = "Text from Google Vision...";
// 3. Return as OcrResult
// Option A: Simple text result (no bounding boxes, no layout reconstruction)
return new OcrResult(extractedText);
// Option B (recommended): With word bounding boxes for layout reconstruction.
// When you provide bounding boxes, LM-Kit.NET's layout analysis system
// automatically reconstructs paragraphs, reading order, lines, and words
// with very high precision.
// var textElements = new List<TextElement>
// {
// new TextElement("Invoice #123", x: 100, y: 50, width: 200, height: 20),
// new TextElement("Total: $500", x: 100, y: 300, width: 150, height: 20),
// };
// return new OcrResult(textElements,
// pageWidth: ocrParameters.Image.Width,
// pageHeight: ocrParameters.Image.Height);
}
}
// Use your custom provider anywhere an OcrEngine is accepted
var customOcr = new GoogleVisionOcr("your-api-key");
var extractor = new TextExtraction(model) { OcrEngine = customOcr };
var rag = new DocumentRag(embeddingModel) { OcrEngine = customOcr };
The OcrEngine base class provides OcrStarting and OcrCompleted events automatically, so any custom provider gets event support without additional code.
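The same event pattern shown in Steps 7 and 8 therefore works on the custom provider as well — a brief sketch:
// Monitor the custom provider through the inherited OcrEngine events
customOcr.OcrStarting += (_, e) =>
{
    Console.WriteLine("  Google Vision OCR starting...");
};
customOcr.OcrCompleted += (_, e) =>
{
    if (e.Exception != null)
        Console.WriteLine($"  OCR failed: {e.Exception.Message}");
    else
        Console.WriteLine($"  OCR completed: {e.Result.PageText.Length} chars");
};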
Step 11: Batch Processing with Adaptive Strategy
Route documents to the best OCR approach based on their characteristics:
Console.WriteLine("\n=== Adaptive Batch OCR ===\n");
string inputDir = "scanned_docs";
string outputDir = "ocr_output";
Directory.CreateDirectory(outputDir);
string[] files = Directory.GetFiles(inputDir)
    .Where(f => new[] { ".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp" }
.Contains(Path.GetExtension(f).ToLowerInvariant()))
.ToArray();
Console.WriteLine($"Processing {files.Length} file(s)...\n");
foreach (string file in files)
{
string fileName = Path.GetFileName(file);
Console.Write($" {fileName}: ");
var attachment = new Attachment(file);
var fullText = new StringBuilder();
for (int page = 0; page < Math.Max(1, attachment.PageCount); page++)
{
// Use VLM OCR for all scanned content
VlmOcr.VlmOcrResult pageResult = attachment.PageCount > 0
? vlmOcr.Run(attachment, pageIndex: page)
: vlmOcr.Run(new ImageBuffer(file));
fullText.AppendLine(pageResult.TextGeneration.Completion);
fullText.AppendLine();
}
string outPath = Path.Combine(outputDir, Path.ChangeExtension(fileName, ".md"));
File.WriteAllText(outPath, fullText.ToString());
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine($"VLM OCR → {outPath}");
Console.ResetColor();
}
Console.WriteLine($"\nAll files processed to {Path.GetFullPath(outputDir)}");
Model Selection for OCR
| Model ID | VRAM | Speed | Best For |
|---|---|---|---|
| lightonocr-2:1b (recommended) | ~2 GB | Fastest | Purpose-built OCR with Dynamic Sampling. Best precision and speed |
| qwen3-vl:2b | ~2.5 GB | Very fast | Lightweight multilingual OCR |
| qwen3-vl:4b | ~4 GB | Fast | Multilingual documents, good accuracy |
| gemma3:4b | ~5.7 GB | Moderate | Mixed text and vision tasks |
| qwen3-vl:8b | ~6.5 GB | Moderate | High-quality multilingual OCR |
| gemma3:12b | ~11 GB | Slow | Complex layouts, degraded scans, handwriting |
For dedicated OCR workloads, lightonocr-2:1b is the top recommendation. LM-Kit.NET applies Dynamic Sampling technology on top of this model, achieving precision and speed that outperform much larger models. For multilingual scanned documents, use the Qwen3-VL family.
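For multilingual documents, the only change is the model ID; the VlmOcr API stays the same. A brief sketch, assuming qwen3-vl:4b from the table above and a hypothetical instruction:
// Load an alternative VLM for multilingual OCR (progress callbacks return true to continue, as in Step 3)
using LM multilingualModel = LM.LoadFromModelID("qwen3-vl:4b",
    downloadingProgress: (_, len, read) => true,
    loadingProgress: p => true);
var multilingualOcr = new VlmOcr(multilingualModel)
{
    MaximumCompletionTokens = 4096,
    Instruction = "Transcribe this document as Markdown, preserving the original language of the text."
};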
When to Use Each Approach
| Document Type | Recommended Approach | Why |
|---|---|---|
| Clean typed text, receipts | TesseractOcr | Fast, no GPU needed |
| Tables, financial statements | VlmOcr | Preserves table structure |
| Handwritten notes | VlmOcr with large model | Context-aware recognition |
| Mixed typed/handwritten forms | VlmOcr with form instruction | Handles both content types |
| High-volume batch (1000+ pages) | TesseractOcr for triage, VlmOcr for flagged pages | Balance speed and quality |
| Multi-language scanned docs | VlmOcr with Qwen3-VL | Strong multilingual support |
| Cloud-first infrastructure | TextractOcr | No local GPU needed, scalable |
| Existing AWS pipeline | TextractOcr | Native integration with S3, Lambda |
| Air-gapped environments | VlmOcr or TesseractOcr | No internet required |
| Proprietary OCR service | Custom OcrEngine subclass | Integrate any backend |
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| VLM output truncated | MaximumCompletionTokens too low | Increase to 4096 or higher |
| Tables not properly formatted | Model too small | Use a larger model; add table-specific Instruction |
| Blank output from VlmOcr | Image too small or low contrast | Preprocess with CropAuto and Deskew first |
| Slow on large batches | VLM processes every page | Use lightonocr-2:1b for speed; process critical pages only |
| Tesseract returns garbled text | Image is skewed or noisy | Preprocess with deskew and crop before OCR |
| Textract timeout | Large image or slow network | Increase Timeout; reduce image resolution before sending |
| Textract authentication error | Invalid AWS credentials | Verify AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables |
| Custom OcrEngine returns empty text | RunAsync not returning proper OcrResult | Ensure you construct OcrResult with the extracted text string |
Agent-Based OCR with Built-In Tools
If you are building an AI agent that needs OCR as part of a larger workflow, LM-Kit.NET provides a built-in OcrTool that wraps Tesseract OCR with support for 34 languages. The agent can call OCR autonomously alongside other document tools:
using LMKit.Agents;
using LMKit.Agents.Tools.BuiltIn;
var agent = Agent.CreateBuilder(model)
.WithPersona("Scanned Document Processor")
.WithTools(tools =>
{
tools.Register(BuiltInTools.Ocr); // Tesseract OCR (34 languages)
tools.Register(BuiltInTools.ImageDeskew); // Correct page rotation
tools.Register(BuiltInTools.ImageCrop); // Remove borders
tools.Register(BuiltInTools.PdfSplit); // Split multi-document PDFs
tools.Register(BuiltInTools.DocumentText); // Extract text from PDFs
})
.Build();
var result = await agent.RunAsync(
"Deskew 'scan.png', then run OCR on it in French. " +
"Also extract the text from page 2 of 'report.pdf'.");
See Equip an Agent with Built-In Tools for the complete Document tools reference.
Next Steps
- Convert Documents to Markdown with VLM OCR: focused guide on document-to-Markdown conversion.
- Automatically Split Multi-Document PDFs with AI Vision: split multi-document scans into individual documents.
- Preprocess Images for Vision Pipelines: clean images before OCR.
- Extract Invoice Data from PDFs and Images: extract structured data from scanned invoices.
- Import and Query Documents with Vision Understanding: index scanned documents for RAG.
- Equip an Agent with Built-In Tools: use OcrTool and other document tools in agent workflows.