
Property OcrEngine

Namespace: LMKit.Document.Conversion
Assembly: LM-Kit.NET.dll

OcrEngine

Gets or sets the optional OCR engine used to recover text from raster content when Strategy is TextExtraction. Defaults to null, in which case the text-extraction strategy only reads the embedded text layer.

public OcrEngine OcrEngine { get; set; }

Property Value

OcrEngine

Examples

using LMKit.Document.Conversion;
using LMKit.Extraction.Ocr;

// Create the OCR engine; "using" disposes it when it goes out of scope.
using var ocr = new LMKitOcr();

var converter = new DocumentToMarkdown();

// OcrEngine is only consulted when Strategy is TextExtraction.
var result = converter.Convert("scan.pdf", new DocumentToMarkdownOptions
{
    Strategy  = DocumentToMarkdownStrategy.TextExtraction,
    OcrEngine = ocr
});

Remarks

When supplied (for example, LMKitOcr or any custom OcrEngine subclass), the engine is applied at three complementary points in the text-extraction pipeline:

  1. Image attachments (PNG, JPEG, TIFF, BMP, WEBP, GIF, ...) are routed through the engine instead of being ignored, so standalone scans produce Markdown without a vision-language model.
  2. Embedded raster images on each PDF page are OCRed and the recognised text is projected back into page coordinates, so rasterised charts, labels, and figure legends flow through the same Markdown pipeline as the native text layer. Concurrency across a page's images is controlled by OcrImageParallelism.
  3. Full-page fallback. When a PDF page's native text layer is essentially empty (scanned pages, flattened print-to-PDF output), the page is rendered as a full raster and OCRed end-to-end. This allows the fast text-extraction strategy to handle scanned PDFs without escalating to a vision-language model.

This property is ignored by VlmOcr and Hybrid; those strategies always use the vision-language model for pages that require OCR.
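
As a rough illustration of the remarks above, the sketch below attaches the engine only when the strategy is TextExtraction, since the other strategies ignore it. BuildOptions is a hypothetical helper written for this example and is not part of LM-Kit.NET. The sketch also assumes that OcrImageParallelism is a settable integer property on DocumentToMarkdownOptions; this page names the setting but does not give its type or owner, so check its own reference entry before relying on it.

```csharp
using LMKit.Document.Conversion;
using LMKit.Extraction.Ocr;

using var ocr = new LMKitOcr();

var converter = new DocumentToMarkdown();
var result = converter.Convert(
    "scan.pdf",
    BuildOptions(DocumentToMarkdownStrategy.TextExtraction, ocr));

// Hypothetical helper (not an LM-Kit.NET API): builds conversion options
// and wires up OCR only where the property is actually honoured.
static DocumentToMarkdownOptions BuildOptions(
    DocumentToMarkdownStrategy strategy,
    OcrEngine ocr)
{
    var options = new DocumentToMarkdownOptions { Strategy = strategy };

    if (strategy == DocumentToMarkdownStrategy.TextExtraction)
    {
        // Used for image attachments, embedded page images,
        // and the full-page fallback on scanned pages.
        options.OcrEngine = ocr;

        // Assumption: an int property on the options object that caps
        // concurrent OCR across a page's embedded images.
        options.OcrImageParallelism = 4;
    }

    // For VlmOcr and Hybrid the engine is left unset, because those
    // strategies ignore OcrEngine.
    return options;
}
```

Leaving OcrEngine at its null default for VlmOcr and Hybrid changes nothing functionally, because those strategies ignore the property; it only makes that explicit in calling code.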
