Process Scanned Documents with OCR and Vision Models

Many enterprise documents exist only as scanned images: legacy archives, signed contracts, faxed purchase orders, and handwritten inspection forms. These documents have no text layer, so standard text extraction returns nothing. LM-Kit.NET provides three built-in OCR engines: VlmOcr (Vision Language Model with Dynamic Sampling) for layout-aware understanding, TesseractOcr for traditional character recognition, and TextractOcr for cloud-based OCR via Amazon Textract. All three inherit from the OcrEngine abstract class, which you can also extend to integrate any other OCR provider. For engines that return word bounding boxes (TesseractOcr, TextractOcr, or any custom provider), LM-Kit.NET's internal layout analysis system reconstructs the full document structure: paragraphs with correct reading order, lines, and words. The InferenceModality setting controls how extraction and analysis use text, vision, or both. This tutorial builds a scanned document processor that selects the right OCR strategy per document type and shows how to plug in custom OCR backends.


Why Choosing the Right OCR Approach Matters

Two enterprise problems that a configurable OCR strategy solves:

  1. Mixed-quality document archives. An insurance company digitizing 20 years of claims has clean typed forms alongside handwritten adjuster notes and faded fax copies. VLM OCR handles degraded inputs and handwriting, while Tesseract OCR is faster for clean typed documents. A strategy that routes documents to the right engine maximizes throughput without sacrificing accuracy (a routing sketch follows this list).
  2. Complex document layouts. Financial statements, engineering drawings, and medical forms combine tables, charts, stamps, and free-form text. LM-Kit.NET handles layout reconstruction at two levels. For bounding-box engines (TesseractOcr, TextractOcr, or custom providers), the internal layout analysis system reconstructs paragraphs, reading order, and line grouping from word coordinates. For VLM OCR with the recommended lightonocr-2:1b model, Dynamic Sampling produces structured Markdown that preserves tables and headings directly. Both paths enable accurate downstream extraction and search.
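
A minimal routing sketch for the first scenario (see Step 3 for constructing a VlmOcr and Step 9 for the parameterless TesseractOcr). The IsCleanTypedScan helper is hypothetical and stands in for whatever quality signal your pipeline already has, such as a file-naming convention or an upstream scan-quality score:

using LMKit.Extraction.Ocr;

// Hypothetical heuristic: replace with your own signal for "clean typed scan".
static bool IsCleanTypedScan(string path) =>
    Path.GetFileName(path).Contains("typed", StringComparison.OrdinalIgnoreCase);

// Clean typed pages go to CPU-based Tesseract; degraded or handwritten pages
// go to VLM OCR. Both engines derive from OcrEngine, so callers can treat
// whichever engine is returned uniformly.
static OcrEngine ChooseEngine(string path, VlmOcr vlmOcr, TesseractOcr tesseractOcr) =>
    IsCleanTypedScan(path) ? tesseractOcr : vlmOcr;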

Prerequisites

| Requirement   | Minimum                                 |
|---------------|-----------------------------------------|
| .NET SDK      | 8.0+                                    |
| VRAM          | 2+ GB for VLM OCR, none for Tesseract   |
| Disk          | ~2 GB free for model download           |
| Input formats | Scanned PDF, PNG, JPEG, TIFF, BMP, WebP |

Step 1: Create the Project

dotnet new console -n ScannedDocProcessor
cd ScannedDocProcessor
dotnet add package LM-Kit.NET

Step 2: Understand the OCR Architecture

All OCR engines in LM-Kit.NET inherit from the abstract OcrEngine class. This means any engine can be used interchangeably with TextExtraction, DocumentRag, and other document processing components.

Layout reconstruction. TesseractOcr and TextractOcr return word-level bounding boxes. LM-Kit.NET feeds these bounding boxes into its internal layout analysis system, which reconstructs the full document structure: paragraphs with correct reading order, lines, and words. As long as an OCR engine provides word bounding boxes, LM-Kit.NET can reconstruct the layout with very high precision. This layout analysis system is the result of continuous research in document layout understanding and is improved with every release.

VLM OCR with Dynamic Sampling. VlmOcr takes a different approach: it sends the page image directly to a Vision Language Model, which understands the layout visually and produces structured Markdown. When paired with the recommended lightonocr-2:1b model, LM-Kit.NET applies Dynamic Sampling technology on top of the model, achieving exceptional precision and speed for OCR workloads.

                            OcrEngine (abstract)
                                  │
              ┌───────────────────┼───────────────────┐
              ▼                   ▼                   ▼
     ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
     │   VlmOcr        │ │  TesseractOcr   │ │  TextractOcr    │
     │                 │ │                 │ │                 │
     │  Vision LLM     │ │  Traditional    │ │  Amazon         │
     │  + Dynamic      │ │  character      │ │  Textract       │
     │  Sampling       │ │  recognition    │ │  cloud API      │
     │                 │ │                 │ │                 │
     │  Output:        │ │  Output:        │ │  Output:        │
     │  Structured     │ │  Reconstructed  │ │  Reconstructed  │
     │  Markdown       │ │  layout via     │ │  layout via     │
     │  (visual)       │ │  bounding boxes │ │  bounding boxes │
     └─────────────────┘ └─────────────────┘ └─────────────────┘

     You can also subclass OcrEngine to add Google Vision,
     Azure AI Vision, or any other OCR backend.

| Feature             | VlmOcr                                         | TesseractOcr                                               | TextractOcr                                                |
|---------------------|------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------|
| Layout preservation | Structured Markdown (visual understanding)     | Reconstructed paragraphs, lines, words via layout analysis | Reconstructed paragraphs, lines, words via layout analysis |
| Handwriting         | Good (context-aware)                           | Limited                                                    | Good                                                       |
| Speed               | Fast with lightonocr-2:1b + Dynamic Sampling   | Faster (CPU-based)                                         | Fast (cloud)                                               |
| GPU required        | Yes                                            | No                                                         | No (cloud-based)                                           |
| Internet required   | No                                             | No                                                         | Yes                                                        |
| Best for            | Complex layouts, mixed content, degraded scans | Clean typed text, high-volume batch                        | High-throughput cloud workloads                            |

Step 3: VLM OCR for Complex Documents

VLM OCR sends each page image directly to a Vision Language Model, which visually interprets the layout and produces structured Markdown. The recommended model for OCR workloads is lightonocr-2:1b, a purpose-built OCR model that LM-Kit.NET enhances with Dynamic Sampling technology. Dynamic Sampling optimizes the token generation strategy at inference time, delivering exceptional accuracy and speed that surpasses what the base model achieves alone.

using System.Text;
using LMKit.Data;
using LMKit.Extraction.Ocr;
using LMKit.Graphics;
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load the recommended OCR model (lightonocr-2:1b with Dynamic Sampling)
// ──────────────────────────────────────
Console.WriteLine("Loading vision model for OCR...");
using LM visionModel = LM.LoadFromModelID("lightonocr-2:1b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Process a scanned image with VLM OCR
// ──────────────────────────────────────
var vlmOcr = new VlmOcr(visionModel)
{
    MaximumCompletionTokens = 4096
};

Console.WriteLine("=== VLM OCR: Scanned Document ===\n");

string imagePath = "scanned_invoice.png";
if (File.Exists(imagePath))
{
    var image = new ImageBuffer(imagePath);
    Console.Write($"Processing {imagePath}... ");

    VlmOcr.VlmOcrResult result = vlmOcr.Run(image);
    string markdown = result.TextGeneration.Completion;

    Console.WriteLine($"done ({result.TextGeneration.GeneratedTokenCount} tokens)\n");
    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.WriteLine(markdown);
    Console.ResetColor();

    // Save as Markdown
    File.WriteAllText("output.md", markdown);
    Console.WriteLine("\nSaved to output.md");
}

Step 4: Custom OCR Instructions

Tailor OCR behavior for specific document types:

// Standard document transcription
vlmOcr.Instruction = "Transcribe this document as Markdown, preserving headings, tables, and lists.";

// Focus on tabular data
vlmOcr.Instruction = "This is a financial statement. Extract all tables as Markdown tables. " +
                     "Preserve column headers and alignment. Include all numeric values.";

// Handwritten notes
vlmOcr.Instruction = "This is a handwritten document. Transcribe the handwriting as accurately as possible. " +
                     "Use [illegible] for text that cannot be read.";

// Forms with labeled fields
vlmOcr.Instruction = "This is a filled form. Extract each field as 'Label: Value' on a separate line. " +
                     "Include checkboxes as [x] checked or [ ] unchecked.";

// Code or technical diagrams
vlmOcr.Instruction = "This contains source code. Transcribe as a fenced code block with language annotation.";

Step 5: Process Multi-Page Scanned PDFs

Console.WriteLine("\n=== Multi-Page Scanned PDF ===\n");

string pdfPath = "scanned_report.pdf";
if (File.Exists(pdfPath))
{
    var attachment = new Attachment(pdfPath);
    int pageCount = attachment.PageCount;
    Console.WriteLine($"Processing {pageCount} pages from {Path.GetFileName(pdfPath)}...\n");

    var fullDocument = new StringBuilder();

    for (int page = 0; page < pageCount; page++)
    {
        Console.Write($"  Page {page + 1}/{pageCount}... ");

        VlmOcr.VlmOcrResult pageResult = vlmOcr.Run(attachment, pageIndex: page);
        string pageMarkdown = pageResult.TextGeneration.Completion;

        fullDocument.AppendLine($"## Page {page + 1}");
        fullDocument.AppendLine();
        fullDocument.AppendLine(pageMarkdown);
        fullDocument.AppendLine();

        Console.WriteLine($"{pageResult.TextGeneration.GeneratedTokenCount} tokens");
    }

    string outputPath = Path.ChangeExtension(pdfPath, ".md");
    File.WriteAllText(outputPath, fullDocument.ToString());
    Console.WriteLine($"\nSaved {pageCount} pages to {outputPath}");
}

Step 6: Using InferenceModality for Extraction

When combining OCR with data extraction, the InferenceModality property controls how the model processes the input:

using LMKit.Extraction;
using LMKit.Inference;

Console.WriteLine("\n=== Extraction from Scanned Documents ===\n");

// Load a general-purpose model for extraction
Console.WriteLine("Loading extraction model...");
using LM extractionModel = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

var extractor = new TextExtraction(extractionModel)
{
    Elements = new List<TextExtractionElement>
    {
        new("invoice_number", TextExtractionElement.ElementType.String, "Invoice number"),
        new("vendor_name", TextExtractionElement.ElementType.String, "Vendor or company name"),
        new("total_amount", TextExtractionElement.ElementType.Number, "Total amount"),
    }
};

// Text mode: uses extracted text only (fast, needs text layer or pre-OCR)
extractor.PreferredInferenceModality = InferenceModality.Text;

// Vision mode: sends the image directly to the model (no OCR needed)
extractor.PreferredInferenceModality = InferenceModality.Vision;

// Multimodal: combines both text and image for best accuracy
extractor.PreferredInferenceModality = InferenceModality.Multimodal;

// BestModality: model picks the best single modality automatically
extractor.PreferredInferenceModality = InferenceModality.BestModality;

// Extract from scanned image using vision
extractor.PreferredInferenceModality = InferenceModality.Vision;
extractor.SetContent(new ImageBuffer("scanned_invoice.png"));
TextExtractionResult result = extractor.Parse();

Console.WriteLine($"Invoice #: {result.GetValue<string>("invoice_number")}");
Console.WriteLine($"Vendor:    {result.GetValue<string>("vendor_name")}");
Console.WriteLine($"Total:     {result.GetValue<double>("total_amount")}");

Step 7: OCR Engine Events

Monitor OCR processing with events:

vlmOcr.OcrStarting += (sender, e) =>
{
    Console.WriteLine($"  OCR starting for page...");
    // Set e.Cancel = true to skip this page
};

vlmOcr.OcrCompleted += (sender, e) =>
{
    Console.WriteLine($"  OCR completed: {e.Result.TextGeneration.GeneratedTokenCount} tokens");
};

Step 8: Amazon Textract OCR

For cloud-based OCR with Amazon Textract, use TextractOcr. This sends images to the AWS Textract API and returns word-level bounding boxes. LM-Kit.NET's layout analysis system then reconstructs the full document structure (paragraphs with reading order, lines, and words) from these bounding boxes with very high precision:

using LMKit.Integrations.AWS;
using LMKit.Integrations.AWS.Ocr.Textract;

Console.WriteLine("\n=== Amazon Textract OCR ===\n");

// ──────────────────────────────────────
// Configure Textract with AWS credentials
// ──────────────────────────────────────
var textractOcr = new TextractOcr(
    awsAccessKeyId: Environment.GetEnvironmentVariable("AWS_ACCESS_KEY_ID"),
    awsSecretAccessKey: Environment.GetEnvironmentVariable("AWS_SECRET_ACCESS_KEY"),
    region: AWSRegion.USEast1)
{
    Timeout = TimeSpan.FromSeconds(30)
};

// Monitor progress with events (inherited from OcrEngine)
textractOcr.OcrStarting += (_, e) =>
{
    Console.WriteLine($"  Sending page to Textract...");
};

textractOcr.OcrCompleted += (_, e) =>
{
    if (e.Exception != null)
        Console.WriteLine($"  Textract error: {e.Exception.Message}");
    else
        Console.WriteLine($"  Textract completed: {e.Result.PageText.Length} chars");
};

// Process a scanned image
string imagePath = "scanned_invoice.png";
if (File.Exists(imagePath))
{
    var parameters = new OcrParameters(new ImageBuffer(imagePath));
    OcrResult textractResult = await textractOcr.RunAsync(parameters);

    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.WriteLine($"\n{textractResult.PageText}");
    Console.ResetColor();

    // Access bounding box information for layout analysis
    foreach (var element in textractResult.TextElements)
    {
        Console.WriteLine($"  Text: \"{element.Text}\" at ({element.X:F0}, {element.Y:F0})");
    }
}

You can parse the region from a string using AWSRegionConverter:

// Parse region from configuration
AWSRegion region = AWSRegionConverter.ParseRegion("eu-west-1");
string regionId = AWSRegionConverter.ToIdentifier(AWSRegion.EUWest1);  // "eu-west-1"

Step 9: Use Any OCR Engine with TextExtraction and DocumentRag

Every OcrEngine subclass works interchangeably with TextExtraction and DocumentRag through the OcrEngine property:

using LMKit.Extraction;
using LMKit.Retrieval;

// ──────────────────────────────────────
// Use Textract with TextExtraction
// ──────────────────────────────────────
var extractor = new TextExtraction(extractionModel)
{
    OcrEngine = textractOcr,  // Swap in any OcrEngine implementation
    Elements = new List<TextExtractionElement>
    {
        new("invoice_number", TextExtractionElement.ElementType.String, "Invoice number"),
        new("vendor_name", TextExtractionElement.ElementType.String, "Vendor or company name"),
        new("total_amount", TextExtractionElement.ElementType.Number, "Total amount"),
    }
};

// ──────────────────────────────────────
// Use Textract with DocumentRag
// ──────────────────────────────────────
var rag = new DocumentRag(embeddingModel)
{
    OcrEngine = textractOcr  // Scanned pages use Textract for text extraction
};

// Switch to VLM OCR for vision-based understanding
rag.OcrEngine = vlmOcr;

// Switch to Tesseract for CPU-only environments
rag.OcrEngine = new TesseractOcr();

Step 10: Build a Custom OCR Provider

The OcrEngine abstract class lets you integrate any OCR backend (Google Cloud Vision, Azure AI Vision, ABBYY, or a custom service). Override the RunAsync method and return an OcrResult. If your OCR provider returns word bounding boxes, include them in the OcrResult so that LM-Kit.NET's layout analysis system can reconstruct paragraphs, reading order, lines, and words with high precision:

using LMKit.Extraction.Ocr;

public sealed class GoogleVisionOcr : OcrEngine
{
    private readonly string _apiKey;

    public GoogleVisionOcr(string apiKey)
    {
        _apiKey = apiKey;
    }

    public override async Task<OcrResult> RunAsync(
        OcrParameters ocrParameters,
        CancellationToken cancellationToken = default)
    {
        // 1. Get the image bytes from OcrParameters
        byte[] imageBytes = ocrParameters.ImageData;  // PNG-encoded image
        string mime = ocrParameters.Mime;              // Always "image/png"

        // 2. Call your OCR service
        // ... send imageBytes to Google Cloud Vision API ...
        string extractedText = "Text from Google Vision...";

        // 3. Return as OcrResult
        // Option A: Simple text result (no bounding boxes, no layout reconstruction)
        return new OcrResult(extractedText);

        // Option B (recommended): With word bounding boxes for layout reconstruction.
        // When you provide bounding boxes, LM-Kit.NET's layout analysis system
        // automatically reconstructs paragraphs, reading order, lines, and words
        // with very high precision.
        // var textElements = new List<TextElement>
        // {
        //     new TextElement("Invoice #123", x: 100, y: 50, width: 200, height: 20),
        //     new TextElement("Total: $500", x: 100, y: 300, width: 150, height: 20),
        // };
        // return new OcrResult(textElements,
        //     pageWidth: ocrParameters.Image.Width,
        //     pageHeight: ocrParameters.Image.Height);
    }
}

// Use your custom provider anywhere an OcrEngine is accepted
var customOcr = new GoogleVisionOcr("your-api-key");

var extractor = new TextExtraction(model) { OcrEngine = customOcr };
var rag = new DocumentRag(embeddingModel) { OcrEngine = customOcr };

The OcrEngine base class provides OcrStarting and OcrCompleted events automatically, so any custom provider gets event support without additional code.
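
For example, the same event pattern from Step 7 can be attached to the custom provider; this brief sketch reuses the customOcr instance above, with handler bodies mirroring the Textract example from Step 8:

customOcr.OcrStarting += (_, e) => Console.WriteLine("  Google Vision OCR starting...");

customOcr.OcrCompleted += (_, e) =>
{
    if (e.Exception != null)
        Console.WriteLine($"  OCR failed: {e.Exception.Message}");
    else
        Console.WriteLine($"  OCR completed: {e.Result.PageText.Length} chars");
};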


Step 11: Batch Processing with Adaptive Strategy

Process an entire folder of scans in one loop. The example below runs VLM OCR on every file; in production, the same loop can route each document to a different engine based on its characteristics (see When to Use Each Approach below):

Console.WriteLine("\n=== Adaptive Batch OCR ===\n");

string inputDir = "scanned_docs";
string outputDir = "ocr_output";
Directory.CreateDirectory(outputDir);

string[] files = Directory.GetFiles(inputDir)
    .Where(f => new[] { ".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp" }
        .Contains(Path.GetExtension(f).ToLowerInvariant()))
    .ToArray();

Console.WriteLine($"Processing {files.Length} file(s)...\n");

foreach (string file in files)
{
    string fileName = Path.GetFileName(file);
    Console.Write($"  {fileName}: ");

    var attachment = new Attachment(file);
    var fullText = new StringBuilder();

    for (int page = 0; page < Math.Max(1, attachment.PageCount); page++)
    {
        // Use VLM OCR for all scanned content
        VlmOcr.VlmOcrResult pageResult = attachment.PageCount > 0
            ? vlmOcr.Run(attachment, pageIndex: page)
            : vlmOcr.Run(new ImageBuffer(file));

        fullText.AppendLine(pageResult.TextGeneration.Completion);
        fullText.AppendLine();
    }

    string outPath = Path.Combine(outputDir, Path.ChangeExtension(fileName, ".md"));
    File.WriteAllText(outPath, fullText.ToString());

    Console.ForegroundColor = ConsoleColor.Green;
    Console.WriteLine($"VLM OCR → {outPath}");
    Console.ResetColor();
}

Console.WriteLine($"\nAll files processed to {Path.GetFullPath(outputDir)}");

Model Selection for OCR

| Model ID                      | VRAM    | Speed     | Best For                                                          |
|-------------------------------|---------|-----------|-------------------------------------------------------------------|
| lightonocr-2:1b (recommended) | ~2 GB   | Fastest   | Purpose-built OCR with Dynamic Sampling; best precision and speed |
| qwen3-vl:2b                   | ~2.5 GB | Very fast | Lightweight multilingual OCR                                      |
| qwen3-vl:4b                   | ~4 GB   | Fast      | Multilingual documents, good accuracy                             |
| gemma3:4b                     | ~5.7 GB | Moderate  | Mixed text and vision tasks                                       |
| qwen3-vl:8b                   | ~6.5 GB | Moderate  | High-quality multilingual OCR                                     |
| gemma3:12b                    | ~11 GB  | Slow      | Complex layouts, degraded scans, handwriting                      |

For dedicated OCR workloads, lightonocr-2:1b is the top recommendation. LM-Kit.NET applies Dynamic Sampling technology on top of this model, achieving precision and speed that outperforms much larger models. For multilingual scanned documents, use the Qwen3-VL family.
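
Switching models only changes the ID passed to LM.LoadFromModelID. A minimal sketch using qwen3-vl:4b from the table above, assuming the download and loading callbacks are optional (they are passed as named arguments in the earlier steps):

using LM multilingualModel = LM.LoadFromModelID("qwen3-vl:4b");

var multilingualOcr = new VlmOcr(multilingualModel)
{
    MaximumCompletionTokens = 4096
};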


When to Use Each Approach

| Document Type                   | Recommended Approach                              | Why                                |
|---------------------------------|---------------------------------------------------|------------------------------------|
| Clean typed text, receipts      | TesseractOcr                                      | Fast, no GPU needed                |
| Tables, financial statements    | VlmOcr                                            | Preserves table structure          |
| Handwritten notes               | VlmOcr with large model                           | Context-aware recognition          |
| Mixed typed/handwritten forms   | VlmOcr with form instruction                      | Handles both content types         |
| High-volume batch (1000+ pages) | TesseractOcr for triage, VlmOcr for flagged pages | Balance speed and quality          |
| Multi-language scanned docs     | VlmOcr with Qwen3-VL                              | Strong multilingual support        |
| Cloud-first infrastructure      | TextractOcr                                       | No local GPU needed, scalable      |
| Existing AWS pipeline           | TextractOcr                                       | Native integration with S3, Lambda |
| Air-gapped environments         | VlmOcr or TesseractOcr                            | No internet required               |
| Proprietary OCR service         | Custom OcrEngine subclass                         | Integrate any backend              |
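
The triage row above (TesseractOcr for triage, VlmOcr for flagged pages) can be sketched as a two-pass loop. This is a minimal illustration that assumes the vlmOcr instance from Step 3; the flagging condition is a naive length check on the Tesseract output, which you would replace with your own quality heuristic:

using LMKit.Extraction.Ocr;
using LMKit.Graphics;

var tesseract = new TesseractOcr();

foreach (string file in Directory.GetFiles("scanned_docs", "*.png"))
{
    // Pass 1: fast CPU triage with Tesseract.
    OcrResult quick = await tesseract.RunAsync(new OcrParameters(new ImageBuffer(file)));

    // Naive flag (hypothetical heuristic): very little recognized text suggests
    // a degraded or handwritten page that Tesseract could not read reliably.
    bool flagged = quick.PageText.Trim().Length < 50;

    // Pass 2: only flagged pages take the slower VLM OCR path.
    string text = flagged
        ? vlmOcr.Run(new ImageBuffer(file)).TextGeneration.Completion
        : quick.PageText;

    File.WriteAllText(Path.ChangeExtension(file, ".txt"), text);
}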

Common Issues

| Problem                             | Cause                                     | Fix                                                                          |
|-------------------------------------|-------------------------------------------|------------------------------------------------------------------------------|
| VLM output truncated                | MaximumCompletionTokens too low           | Increase to 4096 or higher                                                   |
| Tables not properly formatted       | Model too small                           | Use a larger model; add a table-specific Instruction                         |
| Blank output from VlmOcr            | Image too small or low contrast           | Preprocess with CropAuto and Deskew first                                    |
| Slow on large batches               | VLM processes every page                  | Use lightonocr-2:1b for speed; process critical pages only                   |
| Tesseract returns garbled text      | Image is skewed or noisy                  | Preprocess with deskew and crop before OCR                                   |
| Textract timeout                    | Large image or slow network               | Increase Timeout; reduce image resolution before sending                     |
| Textract authentication error       | Invalid AWS credentials                   | Verify the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables |
| Custom OcrEngine returns empty text | RunAsync not returning a proper OcrResult | Construct the OcrResult with the extracted text string                       |

Agent-Based OCR with Built-In Tools

If you are building an AI agent that needs OCR as part of a larger workflow, LM-Kit.NET provides a built-in OcrTool that wraps Tesseract OCR with support for 34 languages. The agent can call OCR autonomously alongside other document tools:

using LMKit.Agents;
using LMKit.Agents.Tools.BuiltIn;

var agent = Agent.CreateBuilder(model)
    .WithPersona("Scanned Document Processor")
    .WithTools(tools =>
    {
        tools.Register(BuiltInTools.Ocr);            // Tesseract OCR (34 languages)
        tools.Register(BuiltInTools.ImageDeskew);     // Correct page rotation
        tools.Register(BuiltInTools.ImageCrop);       // Remove borders
        tools.Register(BuiltInTools.PdfSplit);        // Split multi-document PDFs
        tools.Register(BuiltInTools.DocumentText);    // Extract text from PDFs
    })
    .Build();

var result = await agent.RunAsync(
    "Deskew 'scan.png', then run OCR on it in French. " +
    "Also extract the text from page 2 of 'report.pdf'.");

See Equip an Agent with Built-In Tools for the complete Document tools reference.


Next Steps