Convert Documents to Markdown

LM-Kit.NET ships a single universal converter, DocumentToMarkdown, that turns any supported document format into clean, LLM-ready Markdown. It replaces a whole stack of legacy components (PDF text extractors, Tesseract-style OCR, DOCX/XLSX parsers, email rippers, HTML-to-Markdown libraries) with one API, one result type, and one unified quality signal. Everything runs 100% on-device.

This guide walks through the three conversion strategies, per-page progress, and the common patterns you will actually use in production.


Supported Formats

| Category  | Formats                                |
|-----------|----------------------------------------|
| Documents | PDF, DOCX, PPTX, XLSX, TXT             |
| Email     | EML, MBOX                              |
| Web       | HTML                                   |
| Images    | PNG, JPG, JPEG, TIFF, BMP, WEBP, GIF   |

EML, MBOX, HTML, and DOCX flow through dedicated format-aware converters that preserve email headers, HTML structure, and DOCX tables in a single pass. Every other input (PDF, images, TXT, XLSX, PPTX) flows through the strategy-driven page pipeline described below.
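
The split described above can be pictured as a lookup on the file extension. The sketch below is illustrative only: DescribePipeline is a hypothetical helper, not an LM-Kit.NET API, and it assumes the extension alone decides the route, which may not match what the library does internally.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of LM-Kit.NET): maps a file name to the
// pipeline this guide says handles it, looking only at the extension.
static string DescribePipeline(string path)
{
    string ext = Path.GetExtension(path).TrimStart('.').ToUpperInvariant();
    return ext switch
    {
        "EML" or "MBOX" or "HTML" or "DOCX" => "dedicated converter",
        "PDF" or "TXT" or "XLSX" or "PPTX" or
        "PNG" or "JPG" or "JPEG" or "TIFF" or "BMP" or "WEBP" or "GIF" => "page pipeline",
        _ => "unsupported"
    };
}

Console.WriteLine(DescribePipeline("newsletter.eml")); // dedicated converter
Console.WriteLine(DescribePipeline("scan.TIFF"));      // page pipeline
Console.WriteLine(DescribePipeline("archive.zip"));    // unsupported
```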


The Three Strategies

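| Strategy             | Model Needed                   | Best For                                             | Speed       |
|----------------------|--------------------------------|------------------------------------------------------|-------------|
| TextExtraction       | No (or LMKitOcr for OCR paths) | Born-digital PDFs, DOCX, XLSX, PPTX, HTML, EML, MBOX | 🔥 Fastest  |
| VlmOcr               | Vision model                   | Scans, photos, handwriting, layout-heavy pages       | 🐢 Slowest  |
| Hybrid (recommended) | Vision model (lazy)            | Mixed PDFs, unknown corpora                          | ⚡ Adaptive |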

Under Hybrid, each page is inspected individually: pages with a clean text layer stay on the fast text path, while pages without extractable text or with embedded images are routed to VLM OCR. No pre-classification is required from the caller.
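
As a rough mental model, the per-page decision boils down to a two-input rule. The sketch below is an assumption about the shape of that rule, not the library's actual inspection code; PlanPage and both of its parameters are hypothetical names.

```csharp
using System;

// Illustrative only: the real inspection logic inside LM-Kit.NET is not shown in this guide.
// hasCleanTextLayer and hasEmbeddedImages are hypothetical inputs.
static string PlanPage(bool hasCleanTextLayer, bool hasEmbeddedImages) =>
    hasCleanTextLayer && !hasEmbeddedImages ? "TextExtraction" : "VlmOcr";

Console.WriteLine(PlanPage(true, false));  // TextExtraction
Console.WriteLine(PlanPage(true, true));   // VlmOcr
Console.WriteLine(PlanPage(false, false)); // VlmOcr
```

The planned route for each real page is reported through PageStarting (see Step 4), which is the reliable way to see what the converter actually chose.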

TextExtraction becomes a full traditional-OCR pipeline the moment you set options.OcrEngine: standalone images get transcribed, embedded raster images on each PDF page are OCRed and merged back into the page layout, and scanned PDFs fall back to a full-page OCR pass on the fly. See Step 7 for the details.


Prerequisites

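| Requirement | Minimum                                                        |
|-------------|----------------------------------------------------------------|
| .NET SDK    | 8.0+                                                           |
| VRAM        | ~2 GB if a vision strategy is used (default lightonocr-2:1b)   |
| Disk        | ~1 GB free for the model download on first run                 |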

TextExtraction on non-image formats (PDF, DOCX, XLSX, PPTX, EML, MBOX, HTML, TXT) needs no VRAM and no model.


Step 1: Create the Project

dotnet new console -n MarkdownQuickstart
cd MarkdownQuickstart
dotnet add package LM-Kit.NET

Step 2: Convert a PDF (Zero-Config, Hybrid)

The fastest path to production is the zero-config constructor. No model is loaded until a VLM-bound page actually needs one; if the PDF is fully born-digital, the engine stays on the CPU text path.

using System.Text;
using LMKit.Document.Conversion;

LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

var converter = new DocumentToMarkdown();

DocumentToMarkdownResult result = converter.Convert("report.pdf");

File.WriteAllText("report.md", result.Markdown);

Console.WriteLine($"Pages    : {result.Pages.Count}");
Console.WriteLine($"Strategy : {result.EffectiveStrategy}");
Console.WriteLine($"Elapsed  : {result.Elapsed.TotalSeconds:F2} s");

Step 3: Bring Your Own Vision Model

Pass an explicit LM when you want to reuse the model across converters, pick a different vision model, or take full control of download and loading progress.

using LMKit.Document.Conversion;
using LMKit.Model;

using LM model = LM.LoadFromModelID("lightonocr-2:1b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rDownloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}%   "); return true; });

var converter = new DocumentToMarkdown(model);

var result = await converter.ConvertAsync("scan.pdf", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.VlmOcr
});

File.WriteAllText("scan.md", result.Markdown);

Step 4: Stream Per-Page Progress

Subscribe to PageStarting and PageCompleted to drive a progress bar, log per-page diagnostics, or cancel mid-flight by flipping e.Cancel.

using LMKit.Document.Conversion;

var converter = new DocumentToMarkdown();

converter.PageStarting += (_, e) =>
    Console.WriteLine($"▶ Page {e.PageNumber}/{e.PageCount}  planned: {e.PlannedStrategy}");

converter.PageCompleted += (_, e) =>
{
    if (e.Exception != null)
    {
        Console.WriteLine($"✗ Page {e.PageNumber} failed: {e.Exception.Message}");
        return;
    }

    var p = e.PageResult!;
    string q = p.QualityScore.HasValue ? $", quality={p.QualityScore:F2}" : "";
    string t = p.GeneratedTokenCount > 0 ? $", {p.GeneratedTokenCount} tok" : "";
    Console.WriteLine($"✓ Page {p.PageNumber} in {p.Elapsed.TotalMilliseconds:F0} ms  [{p.StrategyUsed}{t}{q}]");
};

var result = converter.Convert("mixed.pdf", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.Hybrid
});
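
For the progress-bar case, the only logic needed is turning a page number and page count into a fixed-width string. The helper below is a minimal sketch; RenderBar is a hypothetical name and not part of LM-Kit.NET.

```csharp
using System;

// Hypothetical helper (not part of LM-Kit.NET): renders e.g. "[####------] 2/5 (40%)".
static string RenderBar(int done, int total, int width = 10)
{
    if (total <= 0) return "[" + new string('-', width) + "] 0/0";
    int clamped = Math.Clamp(done, 0, total);
    int filled = clamped * width / total;   // integer division rounds down
    int percent = clamped * 100 / total;
    return $"[{new string('#', filled)}{new string('-', width - filled)}] {clamped}/{total} ({percent}%)";
}

Console.WriteLine(RenderBar(2, 5)); // [####------] 2/5 (40%)
Console.WriteLine(RenderBar(5, 5)); // [##########] 5/5 (100%)
```

Inside a PageStarting handler, a call such as Console.Write("\r" + RenderBar(e.PageNumber - 1, e.PageCount)) would redraw the bar in place, assuming page numbers are 1-based as the sample output above suggests.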

Step 5: Pick Pages and Shape the Output

Use PageRange to slice large PDFs. Use EmitFrontMatter, IncludePageSeparators, and PreferMarkdownTablesForNonNested to shape the final Markdown for LLM ingestion or a static site.

using LMKit.Document.Conversion;

var converter = new DocumentToMarkdown();

var result = converter.Convert("big-report.pdf", new DocumentToMarkdownOptions
{
    Strategy                         = DocumentToMarkdownStrategy.Hybrid,
    PageRange                        = "1-5, 7, 10-12",   // pages 1 to 5, page 7, pages 10 to 12
    EmitFrontMatter                  = true,
    IncludePageSeparators            = true,
    PageSeparatorFormat              = "\n\n---\n\n<!-- Page {pageNumber} -->\n\n",
    PreferMarkdownTablesForNonNested = true,
    NormalizeWhitespace              = true
});
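
To make the PageRange syntax concrete, the sketch below expands an expression like the one above into the page numbers it selects. ExpandPageRange is a hypothetical helper written for this guide; the library parses PageRange itself, and its exact rules (open-ended ranges, validation, ordering) may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical parser, shown only to illustrate the "1-5, 7, 10-12" syntax.
// It assumes 1-based page numbers and well-formed input.
static List<int> ExpandPageRange(string range)
{
    var pages = new SortedSet<int>();
    foreach (string part in range.Split(',', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries))
    {
        string[] bounds = part.Split('-', StringSplitOptions.TrimEntries);
        int first = int.Parse(bounds[0]);
        int last = bounds.Length > 1 ? int.Parse(bounds[1]) : first;
        for (int p = first; p <= last; p++) pages.Add(p);
    }
    return pages.ToList();
}

Console.WriteLine(string.Join(" ", ExpandPageRange("1-5, 7, 10-12")));
// 1 2 3 4 5 7 10 11 12
```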

Step 6: Convert Straight to Disk

ConvertToFile / ConvertToFileAsync skip the intermediate in-memory string and stream the Markdown to the target path.

var converter = new DocumentToMarkdown();

await converter.ConvertToFileAsync("invoice.pdf", "out/invoice.md",
    new DocumentToMarkdownOptions { Strategy = DocumentToMarkdownStrategy.Hybrid });
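
This guide does not say whether ConvertToFileAsync creates a missing target folder such as out/, so creating it up front is a cheap precaution. The helper below is a sketch using only standard .NET file APIs; EnsureParentDirectory is a hypothetical name.

```csharp
using System;
using System.IO;

// Hypothetical helper: creates the parent folder of an output path if it is missing
// and returns the absolute path.
static string EnsureParentDirectory(string outputPath)
{
    string fullPath = Path.GetFullPath(outputPath);
    var parent = Path.GetDirectoryName(fullPath);
    if (!string.IsNullOrEmpty(parent))
        Directory.CreateDirectory(parent); // no-op when the folder already exists
    return fullPath;
}

string target = EnsureParentDirectory(Path.Combine("out", "invoice.md"));
Console.WriteLine(Directory.Exists(Path.GetDirectoryName(target))); // True
```

The returned path can then be passed as the second argument of ConvertToFileAsync.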

Step 7: Traditional OCR Without a Vision Model

When you want to run on very constrained hardware, pair TextExtraction with an OcrEngine such as LMKitOcr. Supplying the engine extends the text-extraction strategy at three complementary points:

  • Image attachments (PNG, JPEG, TIFF, BMP, WEBP, GIF, ...) are transcribed by the engine instead of producing empty Markdown.
  • Embedded raster images on PDF pages (charts, figure legends, scanned tables) are OCRed and their text projected back into the page's layout, so rasterised content flows alongside the native text.
  • Full-page fallback. PDF pages whose native text layer is empty (scans, flattened print-to-PDF) are rendered as a full-page raster and OCRed end-to-end.

using LMKit.Document.Conversion;
using LMKit.Extraction.Ocr;

using var ocr = new LMKitOcr();
var converter = new DocumentToMarkdown();

var result = converter.Convert("invoice.png", new DocumentToMarkdownOptions
{
    Strategy            = DocumentToMarkdownStrategy.TextExtraction,
    OcrEngine           = ocr,
    OcrImageParallelism = 4   // concurrent OCR calls per page (clamped to [1, 12])
});

Raise OcrImageParallelism on machines with spare CPU cores to speed up image-heavy PDFs; lower it to protect an OCR engine with its own internal thread pool. The converter also caps the per-image pipeline at 20 images per page: any page carrying more than that is routed to the full-page OCR fallback instead of spawning an unbounded number of per-image calls (DoS guard against pathological PDFs).
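
The limits quoted above can be summarised in a few lines. This is a sketch of the documented behaviour, not the converter's source: EffectiveParallelism and PlanOcr are hypothetical helpers, and the order in which the real converter applies these checks is an assumption.

```csharp
using System;

// Mirrors the documented clamp of OcrImageParallelism to [1, 12].
static int EffectiveParallelism(int requested) => Math.Clamp(requested, 1, 12);

// Hypothetical decision helper; the converter's internals are not shown in this guide.
static string PlanOcr(bool textLayerEmpty, int embeddedImageCount)
{
    const int maxImagesPerPage = 20; // per-image cap quoted in this guide
    if (textLayerEmpty) return "full-page OCR";
    if (embeddedImageCount > maxImagesPerPage) return "full-page OCR";
    return embeddedImageCount > 0 ? "per-image OCR" : "text only";
}

Console.WriteLine(EffectiveParallelism(32)); // 12
Console.WriteLine(EffectiveParallelism(0));  // 1
Console.WriteLine(PlanOcr(false, 3));        // per-image OCR
Console.WriteLine(PlanOcr(false, 25));       // full-page OCR
Console.WriteLine(PlanOcr(true, 0));         // full-page OCR
```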

Tip: TextExtraction + LMKitOcr gives you OCR on PDFs, scans, and standalone images with no language model loaded at all, which makes it the lightest possible deployment for a pure OCR pipeline.


Step 8: Batch a Folder

DocumentToMarkdown is stateless across calls, so the same instance can be reused to process a whole directory.

using LMKit.Document.Conversion;

string inputDir  = "inbox";
string outputDir = "markdown";
Directory.CreateDirectory(outputDir);

string[] files = Directory.GetFiles(inputDir, "*.*", SearchOption.TopDirectoryOnly);

var converter = new DocumentToMarkdown();
var options   = new DocumentToMarkdownOptions { Strategy = DocumentToMarkdownStrategy.Hybrid };

foreach (string file in files)
{
    string outPath = Path.Combine(outputDir, Path.GetFileNameWithoutExtension(file) + ".md");
    try
    {
        var result = await converter.ConvertToFileAsync(file, outPath, options);
        Console.WriteLine($"{Path.GetFileName(file),-40} {result.Pages.Count} page(s)  [{result.EffectiveStrategy}]");
    }
    catch (Exception ex)
    {
        Console.WriteLine($"{Path.GetFileName(file),-40} FAILED: {ex.Message}");
    }
}
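
One thing to watch in the loop above: two inputs that differ only by extension (report.pdf and report.docx) map to the same report.md, so the later one would presumably overwrite the earlier one. A minimal way around this, sketched with a hypothetical OutputPathFor helper, is to keep the source extension in the output name.

```csharp
using System;
using System.IO;

// Hypothetical helper: keeps the source extension in the output name,
// so "report.pdf" and "report.docx" do not both land on "report.md".
static string OutputPathFor(string inputFile, string outputDir) =>
    Path.Combine(outputDir, Path.GetFileName(inputFile) + ".md");

Console.WriteLine(Path.GetFileName(OutputPathFor("inbox/report.pdf", "markdown")));  // report.pdf.md
Console.WriteLine(Path.GetFileName(OutputPathFor("inbox/report.docx", "markdown"))); // report.docx.md
```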

Advanced: Tune the VLM Path

DocumentToMarkdownOptions exposes the VLM knobs that used to live on VlmOcr directly:

var options = new DocumentToMarkdownOptions
{
    Strategy                   = DocumentToMarkdownStrategy.VlmOcr,
    VlmImageDetail             = LMKit.Inference.Vision.ImageDetail.High,
    VlmMaximumCompletionTokens = 4096,
    VlmStripImageMarkup        = true,
    VlmStripStyleAttributes    = true
};

For workflows that need raw access to the vision model (intent selection, coordinate extraction, custom instructions), drop down to VlmOcr directly. See the VLM OCR demo and the VLM OCR with Coordinates demo.


Model Selection for the VLM Strategies

| Model ID        | VRAM    | Speed     | Quality   | Best For                              |
|-----------------|---------|-----------|-----------|---------------------------------------|
| lightonocr-2:1b | ~2 GB   | Fastest   | Very good | Purpose-built OCR specialist (default)|
| glm-ocr         | ~1 GB   | Very fast | Good      | Lightweight OCR specialist            |
| qwen3.5:2b      | ~2 GB   | Very fast | Good      | Lightweight multilingual OCR          |
| qwen3.5:4b      | ~3.5 GB | Fast      | Very good | Multilingual documents                |
| gemma4:e4b      | ~6 GB   | Moderate  | Very good | Mixed text and vision tasks           |
| minicpm-o-45    | ~5.9 GB | Moderate  | Very good | Strong all-round vision model         |
| qwen3.5:9b      | ~7 GB   | Moderate  | Excellent | High-quality multilingual OCR         |
| ministral3:8b   | ~6.5 GB | Moderate  | Very good | Complex document layouts              |
| glm-4.6v-flash  | ~7 GB   | Moderate  | Excellent | Highest fidelity on complex layouts   |
| qwen3.6:27b     | ~18 GB  | Slow      | Excellent | Critical documents, demanding layouts |

lightonocr-2:1b is a compact 1B model specifically trained for high-accuracy OCR and document understanding. It is the best default for dedicated OCR workloads. Switch to a larger model like qwen3.6:27b or glm-4.6v-flash when dealing with complex layouts, degraded scans, or handwriting.
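
The guidance above can be encoded as a small selection rule. The sketch below is one possible reading, not a recommendation from the library: PickModelId is a hypothetical helper, the VRAM thresholds are the approximate figures from the table, and "demanding input" stands for complex layouts, degraded scans, or handwriting.

```csharp
using System;

// Hypothetical helper: stays on the documented default unless the input is
// demanding and the VRAM budget (in GB) fits one of the larger models named above.
static string PickModelId(bool demandingInput, double vramBudgetGb)
{
    if (demandingInput && vramBudgetGb >= 18) return "qwen3.6:27b";
    if (demandingInput && vramBudgetGb >= 7) return "glm-4.6v-flash";
    return "lightonocr-2:1b"; // documented default
}

Console.WriteLine(PickModelId(false, 24)); // lightonocr-2:1b
Console.WriteLine(PickModelId(true, 24));  // qwen3.6:27b
Console.WriteLine(PickModelId(true, 8));   // glm-4.6v-flash
Console.WriteLine(PickModelId(true, 4));   // lightonocr-2:1b
```

The returned ID can be passed to LM.LoadFromModelID as in Step 3.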


Common Issues

| Problem | Cause | Fix |
|---------|-------|-----|
| Output truncated mid-sentence | VlmMaximumCompletionTokens too low | Raise to 4096+ or set to -1 for unlimited |
| Empty Markdown on an image input with TextExtraction | No OCR engine supplied | Set OcrEngine = new LMKitOcr() or switch to Hybrid / VlmOcr |
| Empty Markdown on a scanned PDF with TextExtraction | No OCR engine supplied; the text layer is empty | Add OcrEngine = new LMKitOcr(); the full-page OCR fallback kicks in automatically when the text layer is sparse |
| Chart labels / figure legends missing on a born-digital PDF | Text sits inside embedded raster images, not the text layer | Add OcrEngine = new LMKitOcr() under TextExtraction to recognise embedded images and merge their text back into the page layout |
| Tables rendered as HTML | Source <table> uses nested tables or rowspan/colspan | Leave as HTML (Markdown cannot express those layouts) or post-process |
| PDFs take a long time on scans | Every page is routed to VLM | Use Hybrid, which keeps born-digital pages on the CPU text path |
| Per-image OCR feels slow | Image-heavy page with default OcrImageParallelism = 4 | Raise OcrImageParallelism (up to 12) on CPU-rich machines |
| Model downloads on first run | First-time use of the default vision model | Pre-load with LM.LoadFromModelID("lightonocr-2:1b") and pass to the constructor |
| VLM page quality looks low | Dense layout or small fonts exceeding VlmImageDetail.Low | Set VlmImageDetail = ImageDetail.High (default) and raise VlmMaximumCompletionTokens |
