👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/document_to_markdown

Document-to-Markdown Universal Conversion Engine for C# .NET Applications


🎯 Purpose of the Demo

Document to Markdown showcases DocumentToMarkdown, LM-Kit.NET's state-of-the-art universal conversion engine. In a single API it replaces a whole stack of legacy components (PDF text extractors, Tesseract-style OCR, DOCX/XLSX parsers, email parsers, HTML-to-Markdown libraries) and turns any office file, email, web page, PDF, or image into clean, LLM-ready Markdown that keeps headings, tables, lists, code blocks, and reading order intact.

Everything runs 100% on-device: no cloud round trips, no per-page pricing, no data leaving your infrastructure.

The sample shows how to:

  • Build a DocumentToMarkdown instance with or without a vision model.
  • Switch between the three conversion strategies (Hybrid, TextExtraction, VlmOcr) and watch how the effective strategy is resolved per page.
  • Feed heterogeneous inputs (PDF, DOCX, PPTX, XLSX, EML, MBOX, HTML, TXT, images).
  • Subscribe to the live PageStarting and PageCompleted events for streaming progress and per-page diagnostics.
  • Emit YAML front matter, configure page separators, rewrite non-nested HTML tables into GitHub-flavored Markdown, and pick arbitrary page ranges.
  • Write the final Markdown straight to disk with ConvertToFile.
  • Plug a traditional OCR engine (LMKitOcr) into the TextExtraction strategy to cover image inputs, enrich PDF pages with OCR of their embedded raster images (charts, figure legends, scanned tables), and fall back to full-page OCR on scanned PDFs — all without a vision model.
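The first two bullets come down to a constructor choice. A minimal sketch, using the same constructor shapes as the snippets further down in this README:

```csharp
using LMKit.Document.Conversion;
using LMKit.Model;

// Without a model: nothing is loaded until a vision-dependent page
// is encountered (lazy lightonocr-2:1b fallback).
var lazyConverter = new DocumentToMarkdown();

// With an explicit vision model, loaded up front:
using var model = LM.LoadFromModelID("lightonocr-2:1b");
var eagerConverter = new DocumentToMarkdown(model);
```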

👥 Target Audience

  • Platform and Backend Engineers: add a single, unified "to-Markdown" step to any .NET ingestion or AI pipeline.
  • RAG and Knowledge Base Builders: produce the Markdown corpus that powers embeddings, search, and grounded generation.
  • Document Automation Teams: replace a legacy stack of PDF, DOCX, OCR, and email parsers with a single, governed component.
  • Compliance-sensitive Organizations: convert sensitive documents without sending them to a third-party API.

🚀 Problem Solved

  • One engine, every format. PDF, DOCX, PPTX, XLSX, EML, MBOX, HTML, TXT, and every common raster image (PNG, JPG, TIFF, BMP, WEBP, GIF).
  • Mixed-content PDFs, solved. The Hybrid strategy keeps born-digital pages on the fast text-layer path and automatically escalates scanned or image-heavy pages to vision OCR, with no pre-classification required from the caller.
  • Structural fidelity. Headings, tables, lists, code blocks, and reading order survive the round trip. Email headers and HTML structure are preserved by dedicated format-aware converters.
  • Deterministic fast path. When every page has a clean text layer, no model is loaded and conversion is CPU-only and deterministic.
  • Zero-config startup. Omit the model and DocumentToMarkdown will lazily load the bundled lightonocr-2:1b specialist only if a vision-dependent page is encountered.
  • Streaming observability. PageStarting and PageCompleted let you build progress bars, cancel mid-flight, or log per-page strategy, elapsed time, and quality score.

💻 Sample Application Description

Interactive console app that:

  1. Picks a strategy: Hybrid (default, recommended), TextExtraction, or VlmOcr.
  2. Loads a vision model (only if vision may be needed). Default is LightOnOCR 2 1B (~2 GB VRAM), with nine additional model options and custom-URI support.
  3. Builds a DocumentToMarkdown converter and hooks into the PageStarting and PageCompleted events to print a live per-page log.
  4. Runs a conversion loop that prompts for:
    • a document path (PDF, DOCX, PPTX, XLSX, EML, MBOX, HTML, TXT, or image),
    • an optional page range (1-5,7),
    • an optional .md output path.
  5. Prints the full Markdown (or a preview plus on-disk path) and a summary block with the requested/effective strategy, per-page breakdown (text vs VLM), total VLM tokens, character count, and elapsed times.
  6. Auto-detects image-only inputs when running the TextExtraction strategy and wires an LMKitOcr instance for that run so TextExtraction remains usable without a vision model.
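A stripped-down sketch of the conversion loop (steps 4-5), using only option names that appear elsewhere in this README; the real sample adds validation, the Markdown preview, and the summary block, and passing null for PageRange to mean "all pages" is an assumption here:

```csharp
using LMKit.Document.Conversion;

var converter = new DocumentToMarkdown();

while (true)
{
    Console.Write("Document path (q to quit): ");
    string? path = Console.ReadLine();
    if (string.IsNullOrEmpty(path) || path == "q") break;

    Console.Write("Page range (blank = all): ");
    string? range = Console.ReadLine();

    var result = converter.Convert(path, new DocumentToMarkdownOptions
    {
        Strategy  = DocumentToMarkdownStrategy.Hybrid,
        PageRange = string.IsNullOrWhiteSpace(range) ? null : range  // assumed: null = every page
    });

    Console.WriteLine(result.Markdown);
}
```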

✨ Key Features

  • 🧠 Universal engine: one API for every supported format.
  • 🔀 Hybrid routing: per-page decision between text extraction and VLM OCR.
  • 📩 Format-aware specialists: EML, MBOX, HTML, and DOCX are converted in a single pass by dedicated converters that preserve email headers, HTML structure, and DOCX tables.
  • 📏 Page ranges: convert "1-5, 7, 9-12" of a 500-page PDF.
  • 📊 Rich telemetry per page: StrategyUsed, Elapsed, GeneratedTokenCount, QualityScore, Warning, HasExtractableText.
  • 📝 YAML front matter and page separators: ready for LLM ingestion or static-site pipelines.
  • 📦 Lazy model loading: no model is downloaded or loaded until a VLM page actually needs it.
  • 🛡️ Local-first: nothing leaves the process.

🧭 Strategy Matrix

| Strategy | Model Needed | Best For | Speed |
| --- | --- | --- | --- |
| TextExtraction | No (or LMKitOcr for OCR paths) | Born-digital PDFs, DOCX, XLSX, PPTX, HTML, EML, MBOX | 🔥 Fastest |
| VlmOcr | Vision model | Scans, photos, handwriting, layout-heavy pages | 🐢 Slowest |
| Hybrid (recommended) | Vision model (lazy) | Mixed PDFs (born-digital plus scanned), unknown corpora | ⚡ Adaptive |

TextExtraction becomes a full OCR pipeline the moment you set options.OcrEngine: it transcribes image attachments, enriches PDF pages with OCR of their embedded raster images, and falls back to full-page OCR on scanned PDFs — no language model required.


🧰 Built-In Models (menu)

On startup, the sample exposes a vision-model menu (only prompted when vision may be used):

| Option | Model | Approx. VRAM |
| --- | --- | --- |
| 0 | LightOn LightOnOCR 2 1B (★ default) | ~2 GB |
| 1 | Z.ai GLM-OCR 0.9B | ~1 GB |
| 2 | Z.ai GLM-V 4.6 Flash 10B | ~7 GB |
| 3 | MiniCPM o 4.5 9B | ~5.9 GB |
| 4 | Alibaba Qwen 3.5 2B | ~2 GB |
| 5 | Alibaba Qwen 3.5 4B | ~3.5 GB |
| 6 | Alibaba Qwen 3.5 9B | ~7 GB |
| 7 | Google Gemma 4 E4B | ~6 GB |
| 8 | Alibaba Qwen 3.5 27B | ~18 GB |
| 9 | Mistral Ministral 3 8B | ~6.5 GB |
| other | Custom model URI | depends |

💻 Minimal Integration Snippet

using LMKit.Document.Conversion;
using LMKit.Model;

// Zero-config: lightonocr-2:1b is loaded lazily only if a VLM page is encountered.
var converter = new DocumentToMarkdown();

converter.PageStarting  += (_, e) => Console.WriteLine($"Page {e.PageNumber}/{e.PageCount}  [{e.PlannedStrategy}]");
converter.PageCompleted += (_, e) =>
{
    if (e.PageResult != null)
    {
        Console.WriteLine($"Page {e.PageResult.PageNumber} in {e.PageResult.Elapsed.TotalMilliseconds:F0} ms " +
                          $"[{e.PageResult.StrategyUsed}]");
    }
};

var result = converter.Convert("report.pdf", new DocumentToMarkdownOptions
{
    Strategy                         = DocumentToMarkdownStrategy.Hybrid,
    PageRange                        = "1-10",
    EmitFrontMatter                  = true,
    PreferMarkdownTablesForNonNested = true
});

File.WriteAllText("report.md", result.Markdown);

foreach (var page in result.Pages)
{
    Console.WriteLine($"Page {page.PageNumber}: {page.StrategyUsed}  {page.Elapsed.TotalMilliseconds:F0} ms");
}

Bring-your-own model

using var model = LM.LoadFromModelID("lightonocr-2:1b");
var converter = new DocumentToMarkdown(model);

Convert straight to disk

await converter.ConvertToFileAsync("invoice.pdf", "invoice.md",
    new DocumentToMarkdownOptions { Strategy = DocumentToMarkdownStrategy.Hybrid });

Pure TextExtraction with traditional OCR

Supplying OcrEngine extends TextExtraction at three points: standalone image inputs are transcribed, embedded raster images on each PDF page are OCRed and merged into the page layout (chart labels, figure legends), and scanned PDFs fall back to a full-page OCR pass. The whole pipeline runs with no language model loaded at all — the leanest possible OCR deployment.

using LMKit.Extraction.Ocr;
using LMKit.Document.Conversion;

using var ocr = new LMKitOcr();
var converter = new DocumentToMarkdown();

var result = converter.Convert("invoice.png", new DocumentToMarkdownOptions
{
    Strategy            = DocumentToMarkdownStrategy.TextExtraction,
    OcrEngine           = ocr,
    OcrImageParallelism = 4        // concurrent OCR calls per page (1..12)
});

The per-page pipeline caps at 20 images per page (a DoS guard against pathological PDFs); any page beyond that limit is transparently handled by the full-page OCR fallback instead of spawning an unbounded number of per-image calls.


🛠️ Getting Started

📋 Prerequisites

  • .NET 8.0 or later
  • ~2 GB VRAM if a vision strategy is selected (default model: lightonocr-2:1b)
  • No VRAM needed when running TextExtraction on paginated formats (PDF, DOCX, XLSX, PPTX, EML, MBOX, HTML, TXT)

📥 Download

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/document_to_markdown

▶️ Run

dotnet build
dotnet run

Then:

  1. Select a strategy (0 = Hybrid, 1 = TextExtraction, 2 = VlmOcr).
  2. If vision may be used, select a vision model or paste a custom URI.
  3. Enter a document path, an optional page range, and an optional output .md path.
  4. Read the per-page log, the Markdown preview, and the conversion summary.
  5. Press Enter to convert another file, or q to quit.

🔍 Notes on Key Types

  • DocumentToMarkdown (LMKit.Document.Conversion): entry point for every conversion. Accepts file paths, byte arrays, streams, ImageBuffer, Uri, and pre-built Attachment objects, with both synchronous and async overloads plus direct-to-file variants.
  • DocumentToMarkdownOptions: strategy, page range, OCR engine and per-image parallelism, VLM image detail and token budget, DOCX/email-specific toggles, and output shaping (front matter, separators, table rewriting, whitespace normalization).
  • DocumentToMarkdownStrategy: TextExtraction, VlmOcr, or Hybrid.
  • DocumentToMarkdownResult: aggregated Markdown plus Pages list, requested vs effective strategy, total elapsed time, and source name.
  • DocumentToMarkdownPageResult: per-page strategy, Markdown body, elapsed time, token count, quality score, warnings.
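The result types can be walked directly after a conversion. A sketch, using the property names listed in the telemetry bullet earlier (the null check on Warning assumes it is a nullable string):

```csharp
using LMKit.Document.Conversion;

var converter = new DocumentToMarkdown();
var result = converter.Convert("mixed.pdf",
    new DocumentToMarkdownOptions { Strategy = DocumentToMarkdownStrategy.Hybrid });

// Per-page breakdown: which strategy ran, how long it took, and quality.
foreach (var page in result.Pages)
{
    Console.WriteLine($"p{page.PageNumber}: {page.StrategyUsed}, " +
                      $"{page.Elapsed.TotalMilliseconds:F0} ms, " +
                      $"quality {page.QualityScore}");

    if (page.Warning != null)
        Console.WriteLine($"  warning: {page.Warning}");
}
```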

🔧 Extend the Demo

  • Add a batch mode that recursively walks a folder and writes one .md per input.
  • Pipe the Markdown into LM-Kit.NET's RAG or Structured Extraction stack to go from raw documents to Markdown to embeddings or structured JSON in one flow.
  • Add a cancellation UI by wiring CancellationToken or flipping DocumentToMarkdownPageStartingEventArgs.Cancel.
  • Replace the console log with a progress bar driven by PageStarting and PageCompleted.
  • Swap LMKitOcr for a custom OcrEngine subclass to integrate an in-house OCR service.
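For the cancellation idea above, a minimal sketch that flips the Cancel flag on the PageStarting event args (assuming cancelling simply stops the remaining pages and returns what has been converted so far):

```csharp
using LMKit.Document.Conversion;

var converter = new DocumentToMarkdown();

// Stop the run after page 3 by flipping the Cancel flag
// on the PageStarting event args.
converter.PageStarting += (_, e) =>
{
    if (e.PageNumber > 3)
        e.Cancel = true;
};

var result = converter.Convert("big-report.pdf", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.Hybrid
});
```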
