Process Scanned Documents with OCR and Vision Models
Many enterprise documents exist only as scanned images: legacy archives, signed contracts, faxed purchase orders, and handwritten inspection forms. These documents have no text layer, so standard text extraction returns nothing. LM-Kit.NET provides three built-in OCR engines: VlmOcr (Vision Language Model with Dynamic Sampling) for layout-aware understanding, TesseractOcr for traditional character recognition, and TextractOcr for cloud-based OCR via Amazon Textract. All three inherit from the OcrEngine abstract class, which you can also extend to integrate any other OCR provider. For engines that return word bounding boxes (TesseractOcr, TextractOcr, or any custom provider), LM-Kit.NET's internal layout analysis system reconstructs the full document structure: paragraphs with correct reading order, lines, and words. The InferenceModality setting controls how extraction and analysis use text, vision, or both. This tutorial builds a scanned document processor that selects the right OCR strategy per document type and shows how to plug in custom OCR backends.
Why Choosing the Right OCR Approach Matters
Two enterprise problems that a configurable OCR strategy solves:
- Mixed-quality document archives. An insurance company digitizing 20 years of claims has clean typed forms alongside handwritten adjuster notes and faded fax copies. VLM OCR handles degraded inputs and handwriting, while Tesseract OCR is faster for clean typed documents. A strategy that routes documents to the right engine maximizes throughput without sacrificing accuracy.
- Complex document layouts. Financial statements, engineering drawings, and medical forms combine tables, charts, stamps, and free-form text. LM-Kit.NET handles layout reconstruction at two levels. For bounding-box engines (TesseractOcr, TextractOcr, or custom providers), the internal layout analysis system reconstructs paragraphs, reading order, and line grouping from word coordinates. For VLM OCR with the recommended lightonocr-2:1b model, Dynamic Sampling produces structured Markdown that preserves tables and headings directly. Both paths enable accurate downstream extraction and search.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 2+ GB for VLM OCR, none for Tesseract |
| Disk | ~2 GB free for model download |
| Input formats | Scanned PDF, PNG, JPEG, TIFF, BMP, WebP |
Step 1: Create the Project
dotnet new console -n ScannedDocProcessor
cd ScannedDocProcessor
dotnet add package LM-Kit.NET
Step 2: Understand the OCR Architecture
All OCR engines in LM-Kit.NET inherit from the abstract OcrEngine class. This means any engine can be used interchangeably with TextExtraction, DocumentRag, and other document processing components.
Layout reconstruction. TesseractOcr and TextractOcr return word-level bounding boxes. LM-Kit.NET feeds these bounding boxes into its internal layout analysis system, which reconstructs the full document structure: paragraphs with correct reading order, lines, and words. As long as an OCR engine provides word bounding boxes, LM-Kit.NET can reconstruct the layout with very high precision. This layout analysis system is the result of continuous research in document layout understanding and is improved with every release.
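To see the bounding-box path in isolation, you can run TesseractOcr on its own and inspect the reconstructed text. The snippet below is a minimal sketch: it assumes the parameterless TesseractOcr constructor used in Step 9, reuses the OcrParameters/RunAsync pattern shown for Textract in Step 8, and uses a hypothetical input file name.
// Minimal sketch: run Tesseract directly and read the reconstructed layout
var tesseractOcr = new TesseractOcr();
var tesseractParams = new OcrParameters(new ImageBuffer("typed_letter.png"));
OcrResult tesseractResult = await tesseractOcr.RunAsync(tesseractParams);
// PageText contains the reconstructed layout: paragraphs in reading order, lines, and words
Console.WriteLine(tesseractResult.PageText);
// Word-level bounding boxes remain available for downstream processing
foreach (var word in tesseractResult.TextElements)
{
    Console.WriteLine($"  \"{word.Text}\" at ({word.X:F0}, {word.Y:F0})");
}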
VLM OCR with Dynamic Sampling. VlmOcr takes a different approach: it sends the page image directly to a Vision Language Model, which understands the layout visually and produces structured Markdown. When paired with the recommended lightonocr-2:1b model, LM-Kit.NET applies Dynamic Sampling technology on top of the model, achieving exceptional precision and speed for OCR workloads.
OcrEngine (abstract)
│
┌───────────────────┼───────────────────┐
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ VlmOcr │ │ TesseractOcr │ │ TextractOcr │
│ │ │ │ │ │
│ Vision LLM │ │ Traditional │ │ Amazon │
│ + Dynamic │ │ character │ │ Textract │
│ Sampling │ │ recognition │ │ cloud API │
│ │ │ │ │ │
│ Output: │ │ Output: │ │ Output: │
│ Structured │ │ Reconstructed │ │ Reconstructed │
│ Markdown │ │ layout via │ │ layout via │
│ (visual) │ │ bounding boxes │ │ bounding boxes │
└─────────────────┘ └─────────────────┘ └─────────────────┘
You can also subclass OcrEngine to add Google Vision,
Azure AI Vision, or any other OCR backend.
| Feature | VlmOcr | TesseractOcr | TextractOcr |
|---|---|---|---|
| Layout preservation | Structured Markdown (visual understanding) | Reconstructed paragraphs, lines, words via layout analysis | Reconstructed paragraphs, lines, words via layout analysis |
| Handwriting | Good (context-aware) | Limited | Good |
| Speed | Fast with lightonocr-2:1b + Dynamic Sampling | Faster (CPU-based) | Fast (cloud) |
| GPU required | Yes | No | No (cloud-based) |
| Internet required | No | No | Yes |
| Best for | Complex layouts, mixed content, degraded scans | Clean typed text, high-volume batch | High-throughput cloud workloads |
Step 3: VLM OCR for Complex Documents
VLM OCR sends each page image directly to a Vision Language Model, which visually interprets the layout and produces structured Markdown. The recommended model for OCR workloads is lightonocr-2:1b, a purpose-built OCR model that LM-Kit.NET enhances with Dynamic Sampling technology. Dynamic Sampling optimizes the token generation strategy at inference time, delivering exceptional accuracy and speed that surpass what the base model achieves alone.
using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Graphics;
using LMKit.Model;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load the recommended OCR model (lightonocr-2:1b with Dynamic Sampling)
// ──────────────────────────────────────
Console.WriteLine("Loading vision model for OCR...");
using LM visionModel = LM.LoadFromModelID("lightonocr-2:1b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Process a scanned image with VLM OCR
// ──────────────────────────────────────
var vlmOcr = new VlmOcr(visionModel)
{
MaximumCompletionTokens = 4096
};
Console.WriteLine("=== VLM OCR: Scanned Document ===\n");
string imagePath = "scanned_invoice.png";
if (File.Exists(imagePath))
{
var image = new ImageBuffer(imagePath);
Console.Write($"Processing {imagePath}... ");
VlmOcr.VlmOcrResult result = vlmOcr.Run(image);
string markdown = result.TextGeneration.Completion;
Console.WriteLine($"done ({result.TextGeneration.GeneratedTokenCount} tokens)\n");
Console.ForegroundColor = ConsoleColor.Cyan;
Console.WriteLine(markdown);
Console.ResetColor();
// Save as Markdown
File.WriteAllText("output.md", markdown);
Console.WriteLine("\nSaved to output.md");
}
Step 4: Custom OCR Instructions
Tailor OCR behavior for specific document types:
// Standard document transcription
vlmOcr.Instruction = "Transcribe this document as Markdown, preserving headings, tables, and lists.";
// Focus on tabular data
vlmOcr.Instruction = "This is a financial statement. Extract all tables as Markdown tables. " +
"Preserve column headers and alignment. Include all numeric values.";
// Handwritten notes
vlmOcr.Instruction = "This is a handwritten document. Transcribe the handwriting as accurately as possible. " +
"Use [illegible] for text that cannot be read.";
// Forms with labeled fields
vlmOcr.Instruction = "This is a filled form. Extract each field as 'Label: Value' on a separate line. " +
"Include checkboxes as [x] checked or [ ] unchecked.";
// Code or technical diagrams
vlmOcr.Instruction = "This contains source code. Transcribe as a fenced code block with language annotation.";
Step 5: Process Multi-Page Scanned PDFs
Console.WriteLine("\n=== Multi-Page Scanned PDF ===\n");
string pdfPath = "scanned_report.pdf";
if (File.Exists(pdfPath))
{
var attachment = new Attachment(pdfPath);
int pageCount = attachment.PageCount;
Console.WriteLine($"Processing {pageCount} pages from {Path.GetFileName(pdfPath)}...\n");
var fullDocument = new StringBuilder();
for (int page = 0; page < pageCount; page++)
{
Console.Write($" Page {page + 1}/{pageCount}... ");
VlmOcr.VlmOcrResult pageResult = vlmOcr.Run(attachment, pageIndex: page);
string pageMarkdown = pageResult.TextGeneration.Completion;
fullDocument.AppendLine($"## Page {page + 1}");
fullDocument.AppendLine();
fullDocument.AppendLine(pageMarkdown);
fullDocument.AppendLine();
Console.WriteLine($"{pageResult.TextGeneration.GeneratedTokenCount} tokens");
}
string outputPath = Path.ChangeExtension(pdfPath, ".md");
File.WriteAllText(outputPath, fullDocument.ToString());
Console.WriteLine($"\nSaved {pageCount} pages to {outputPath}");
}
Step 6: Using InferenceModality for Extraction
When combining OCR with data extraction, the InferenceModality property controls how the model processes the input:
using LMKit.Extraction;
using LMKit.Inference;
Console.WriteLine("\n=== Extraction from Scanned Documents ===\n");
// Load a general-purpose model for extraction
Console.WriteLine("Loading extraction model...");
using LM extractionModel = LM.LoadFromModelID("gemma3:4b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
var extractor = new TextExtraction(extractionModel)
{
Elements = new List<TextExtractionElement>
{
new("invoice_number", TextExtractionElement.ElementType.String, "Invoice number"),
new("vendor_name", TextExtractionElement.ElementType.String, "Vendor or company name"),
new("total_amount", TextExtractionElement.ElementType.Number, "Total amount"),
}
};
// Text mode: uses extracted text only (fast, needs text layer or pre-OCR)
extractor.PreferredInferenceModality = InferenceModality.Text;
// Vision mode: sends the image directly to the model (no OCR needed)
extractor.PreferredInferenceModality = InferenceModality.Vision;
// Multimodal: combines both text and image for best accuracy
extractor.PreferredInferenceModality = InferenceModality.Multimodal;
// BestModality: model picks the best single modality automatically
extractor.PreferredInferenceModality = InferenceModality.BestModality;
// Extract from scanned image using vision
extractor.PreferredInferenceModality = InferenceModality.Vision;
extractor.SetContent(new ImageBuffer("scanned_invoice.png"));
TextExtractionResult result = extractor.Parse();
Console.WriteLine($"Invoice #: {result.GetValue<string>("invoice_number")}");
Console.WriteLine($"Vendor: {result.GetValue<string>("vendor_name")}");
Console.WriteLine($"Total: {result.GetValue<double>("total_amount")}");
Step 7: OCR Engine Events
Monitor OCR processing with events:
vlmOcr.OcrStarting += (sender, e) =>
{
Console.WriteLine($" OCR starting for page...");
// Set e.Cancel = true to skip this page
};
vlmOcr.OcrCompleted += (sender, e) =>
{
Console.WriteLine($" OCR completed: {e.Result.TextGeneration.GeneratedTokenCount} tokens");
};
Step 8: Amazon Textract OCR
For cloud-based OCR with Amazon Textract, use TextractOcr. This sends images to the AWS Textract API and returns word-level bounding boxes. LM-Kit.NET's layout analysis system then reconstructs the full document structure (paragraphs with reading order, lines, and words) from these bounding boxes with very high precision:
using LMKit.Integrations.AWS;
using LMKit.Integrations.AWS.Ocr.Textract;
Console.WriteLine("\n=== Amazon Textract OCR ===\n");
// ──────────────────────────────────────
// Configure Textract with AWS credentials
// ──────────────────────────────────────
var textractOcr = new TextractOcr(
awsAccessKeyId: Environment.GetEnvironmentVariable("AWS_ACCESS_KEY_ID"),
awsSecretAccessKey: Environment.GetEnvironmentVariable("AWS_SECRET_ACCESS_KEY"),
region: AWSRegion.USEast1)
{
Timeout = TimeSpan.FromSeconds(30)
};
// Monitor progress with events (inherited from OcrEngine)
textractOcr.OcrStarting += (_, e) =>
{
Console.WriteLine($" Sending page to Textract...");
};
textractOcr.OcrCompleted += (_, e) =>
{
if (e.Exception != null)
Console.WriteLine($" Textract error: {e.Exception.Message}");
else
Console.WriteLine($" Textract completed: {e.Result.PageText.Length} chars");
};
// Process a scanned image
string imagePath = "scanned_invoice.png";
if (File.Exists(imagePath))
{
var parameters = new OcrParameters(new ImageBuffer(imagePath));
OcrResult textractResult = await textractOcr.RunAsync(parameters);
Console.ForegroundColor = ConsoleColor.Cyan;
Console.WriteLine($"\n{textractResult.PageText}");
Console.ResetColor();
// Access bounding box information for layout analysis
foreach (var element in textractResult.TextElements)
{
Console.WriteLine($" Text: \"{element.Text}\" at ({element.X:F0}, {element.Y:F0})");
}
}
You can parse the region from a string using AWSRegionConverter:
// Parse region from configuration
AWSRegion region = AWSRegionConverter.ParseRegion("eu-west-1");
string regionId = AWSRegionConverter.ToIdentifier(AWSRegion.EUWest1); // "eu-west-1"
Step 9: Use Any OCR Engine with TextExtraction and DocumentRag
Every OcrEngine subclass works interchangeably with TextExtraction and DocumentRag through the OcrEngine property:
using LMKit.Extraction;
using LMKit.Retrieval;
// ──────────────────────────────────────
// Use Textract with TextExtraction
// ──────────────────────────────────────
var extractor = new TextExtraction(extractionModel)
{
OcrEngine = textractOcr, // Swap in any OcrEngine implementation
Elements = new List<TextExtractionElement>
{
new("invoice_number", TextExtractionElement.ElementType.String, "Invoice number"),
new("vendor_name", TextExtractionElement.ElementType.String, "Vendor or company name"),
new("total_amount", TextExtractionElement.ElementType.Number, "Total amount"),
}
};
// ──────────────────────────────────────
// Use Textract with DocumentRag
// ──────────────────────────────────────
var rag = new DocumentRag(embeddingModel)
{
OcrEngine = textractOcr // Scanned pages use Textract for text extraction
};
// Switch to VLM OCR for vision-based understanding
rag.OcrEngine = vlmOcr;
// Switch to Tesseract for CPU-only environments
rag.OcrEngine = new TesseractOcr();
Step 10: Build a Custom OCR Provider
The OcrEngine abstract class lets you integrate any OCR backend (Google Cloud Vision, Azure AI Vision, ABBYY, or a custom service). Override the RunAsync method and return an OcrResult. If your OCR provider returns word bounding boxes, include them in the OcrResult so that LM-Kit.NET's layout analysis system can reconstruct paragraphs, reading order, lines, and words with high precision:
using LMKit.Extraction.Ocr;
public sealed class GoogleVisionOcr : OcrEngine
{
private readonly string _apiKey;
public GoogleVisionOcr(string apiKey)
{
_apiKey = apiKey;
}
public override async Task<OcrResult> RunAsync(
OcrParameters ocrParameters,
CancellationToken cancellationToken = default)
{
// 1. Get the image bytes from OcrParameters
byte[] imageBytes = ocrParameters.ImageData; // PNG-encoded image
string mime = ocrParameters.Mime; // Always "image/png"
// 2. Call your OCR service
// ... send imageBytes to Google Cloud Vision API ...
string extractedText = "Text from Google Vision...";
// 3. Return as OcrResult
// Option A: Simple text result (no bounding boxes, no layout reconstruction)
return new OcrResult(extractedText);
// Option B (recommended): With word bounding boxes for layout reconstruction.
// When you provide bounding boxes, LM-Kit.NET's layout analysis system
// automatically reconstructs paragraphs, reading order, lines, and words
// with very high precision.
// var textElements = new List<TextElement>
// {
// new TextElement("Invoice #123", x: 100, y: 50, width: 200, height: 20),
// new TextElement("Total: $500", x: 100, y: 300, width: 150, height: 20),
// };
// return new OcrResult(textElements,
// pageWidth: ocrParameters.Image.Width,
// pageHeight: ocrParameters.Image.Height);
}
}
// Use your custom provider anywhere an OcrEngine is accepted
var customOcr = new GoogleVisionOcr("your-api-key");
var extractor = new TextExtraction(model) { OcrEngine = customOcr };
var rag = new DocumentRag(embeddingModel) { OcrEngine = customOcr };
The OcrEngine base class provides OcrStarting and OcrCompleted events automatically, so any custom provider gets event support without additional code.
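The same event pattern shown in Steps 7 and 8 therefore works on the custom provider as well — a brief sketch:
// Monitor the custom provider through the inherited OcrEngine events
customOcr.OcrStarting += (_, e) =>
{
    Console.WriteLine("  Google Vision OCR starting...");
};
customOcr.OcrCompleted += (_, e) =>
{
    if (e.Exception != null)
        Console.WriteLine($"  OCR failed: {e.Exception.Message}");
    else
        Console.WriteLine($"  OCR completed: {e.Result.PageText.Length} chars");
};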
Step 11: Batch Processing with Adaptive Strategy
Route documents to the best OCR approach based on their characteristics:
Console.WriteLine("\n=== Adaptive Batch OCR ===\n");
string inputDir = "scanned_docs";
string outputDir = "ocr_output";
Directory.CreateDirectory(outputDir);
string[] files = Directory.GetFiles(inputDir)
    .Where(f => new[] { ".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp" }
.Contains(Path.GetExtension(f).ToLowerInvariant()))
.ToArray();
Console.WriteLine($"Processing {files.Length} file(s)...\n");
foreach (string file in files)
{
string fileName = Path.GetFileName(file);
Console.Write($" {fileName}: ");
var attachment = new Attachment(file);
var fullText = new StringBuilder();
for (int page = 0; page < Math.Max(1, attachment.PageCount); page++)
{
// Use VLM OCR for all scanned content
VlmOcr.VlmOcrResult pageResult = attachment.PageCount > 0
? vlmOcr.Run(attachment, pageIndex: page)
: vlmOcr.Run(new ImageBuffer(file));
fullText.AppendLine(pageResult.TextGeneration.Completion);
fullText.AppendLine();
}
string outPath = Path.Combine(outputDir, Path.ChangeExtension(fileName, ".md"));
File.WriteAllText(outPath, fullText.ToString());
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine($"VLM OCR → {outPath}");
Console.ResetColor();
}
Console.WriteLine($"\nAll files processed to {Path.GetFullPath(outputDir)}");
Model Selection for OCR
| Model ID | VRAM | Speed | Best For |
|---|---|---|---|
| lightonocr-2:1b (recommended) | ~2 GB | Fastest | Purpose-built OCR with Dynamic Sampling. Best precision and speed |
| qwen3-vl:2b | ~2.5 GB | Very fast | Lightweight multilingual OCR |
| qwen3-vl:4b | ~4 GB | Fast | Multilingual documents, good accuracy |
| gemma3:4b | ~5.7 GB | Moderate | Mixed text and vision tasks |
| qwen3-vl:8b | ~6.5 GB | Moderate | High-quality multilingual OCR |
| gemma3:12b | ~11 GB | Slow | Complex layouts, degraded scans, handwriting |
For dedicated OCR workloads, lightonocr-2:1b is the top recommendation. LM-Kit.NET applies Dynamic Sampling technology on top of this model, achieving precision and speed that outperform much larger models. For multilingual scanned documents, use the Qwen3-VL family.
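For multilingual documents, the only change is the model ID; the VlmOcr API stays the same. A brief sketch, assuming qwen3-vl:4b from the table above and a hypothetical instruction:
// Load an alternative VLM for multilingual OCR (progress callbacks return true to continue, as in Step 3)
using LM multilingualModel = LM.LoadFromModelID("qwen3-vl:4b",
    downloadingProgress: (_, len, read) => true,
    loadingProgress: p => true);
var multilingualOcr = new VlmOcr(multilingualModel)
{
    MaximumCompletionTokens = 4096,
    Instruction = "Transcribe this document as Markdown, preserving the original language of the text."
};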
When to Use Each Approach
| Document Type | Recommended Approach | Why |
|---|---|---|
| Clean typed text, receipts | TesseractOcr | Fast, no GPU needed |
| Tables, financial statements | VlmOcr | Preserves table structure |
| Handwritten notes | VlmOcr with large model | Context-aware recognition |
| Mixed typed/handwritten forms | VlmOcr with form instruction | Handles both content types |
| High-volume batch (1000+ pages) | TesseractOcr for triage, VlmOcr for flagged pages | Balance speed and quality |
| Multi-language scanned docs | VlmOcr with Qwen3-VL | Strong multilingual support |
| Cloud-first infrastructure | TextractOcr | No local GPU needed, scalable |
| Existing AWS pipeline | TextractOcr | Native integration with S3, Lambda |
| Air-gapped environments | VlmOcr or TesseractOcr | No internet required |
| Proprietary OCR service | Custom OcrEngine subclass | Integrate any backend |
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| VLM output truncated | MaximumCompletionTokens too low | Increase to 4096 or higher |
| Tables not properly formatted | Model too small | Use a larger model; add table-specific Instruction |
| Blank output from VlmOcr | Image too small or low contrast | Preprocess with CropAuto and Deskew first |
| Slow on large batches | VLM processes every page | Use lightonocr-2:1b for speed; process critical pages only |
| Tesseract returns garbled text | Image is skewed or noisy | Preprocess with deskew and crop before OCR |
| Textract timeout | Large image or slow network | Increase Timeout; reduce image resolution before sending |
| Textract authentication error | Invalid AWS credentials | Verify AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables |
| Custom OcrEngine returns empty text | RunAsync not returning proper OcrResult | Ensure you construct OcrResult with the extracted text string |
Agent-Based OCR with Built-In Tools
If you are building an AI agent that needs OCR as part of a larger workflow, LM-Kit.NET provides a built-in OcrTool that wraps Tesseract OCR with support for 34 languages. The agent can call OCR autonomously alongside other document tools:
using LMKit.Agents;
using LMKit.Agents.Tools.BuiltIn;
var agent = Agent.CreateBuilder(model)
.WithPersona("Scanned Document Processor")
.WithTools(tools =>
{
tools.Register(BuiltInTools.Ocr); // Tesseract OCR (34 languages)
tools.Register(BuiltInTools.ImageDeskew); // Correct page rotation
tools.Register(BuiltInTools.ImageCrop); // Remove borders
tools.Register(BuiltInTools.PdfSplit); // Split multi-document PDFs
tools.Register(BuiltInTools.DocumentText); // Extract text from PDFs
})
.Build();
var result = await agent.RunAsync(
"Deskew 'scan.png', then run OCR on it in French. " +
"Also extract the text from page 2 of 'report.pdf'.");
See Equip an Agent with Built-In Tools for the complete Document tools reference.
Next Steps
- Convert Documents to Markdown with VLM OCR: focused guide on document-to-Markdown conversion.
- Automatically Split Multi-Document PDFs with AI Vision: split multi-document scans into individual documents.
- Preprocess Images for Vision Pipelines: clean images before OCR.
- Extract Invoice Data from PDFs and Images: extract structured data from scanned invoices.
- Import and Query Documents with Vision Understanding: index scanned documents for RAG.
- Equip an Agent with Built-In Tools: use OcrTool and other document tools in agent workflows.