👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/document_to_markdown
# Document-to-Markdown Universal Conversion Engine for C# .NET Applications
## 🎯 Purpose of the Demo

Document to Markdown showcases `DocumentToMarkdown`, LM-Kit.NET's state-of-the-art
universal conversion engine. In a single API it replaces a whole stack of legacy components
(PDF text extractors, Tesseract-style OCR, DOCX/XLSX parsers, email rippers, HTML-to-Markdown
libraries) and turns any office file, email, web page, PDF, or image into clean, LLM-ready
Markdown that keeps headings, tables, lists, code blocks, and reading order intact.
Everything runs 100% on-device: no cloud round trips, no per-page pricing, no data leaving your infrastructure.
The sample shows how to:
- Build a `DocumentToMarkdown` instance with or without a vision model.
- Switch between the three conversion strategies (`Hybrid`, `TextExtraction`, `VlmOcr`) and watch how the effective strategy is resolved per page.
- Feed heterogeneous inputs (PDF, DOCX, PPTX, XLSX, EML, MBOX, HTML, TXT, images).
- Subscribe to the live `PageStarting` and `PageCompleted` events for streaming progress and per-page diagnostics.
- Emit YAML front matter, configure page separators, rewrite non-nested HTML tables into GitHub-flavored Markdown, and pick arbitrary page ranges.
- Write the final Markdown straight to disk with `ConvertToFile`.
- Plug a traditional OCR engine (`LMKitOcr`) into the `TextExtraction` strategy to cover image inputs, enrich PDF pages with OCR of their embedded raster images (charts, figure legends, scanned tables), and fall back to full-page OCR on scanned PDFs, all without a vision model.
## 👥 Target Audience
- Platform and Backend Engineers: add a single, unified "to-Markdown" step to any .NET ingestion or AI pipeline.
- RAG and Knowledge Base Builders: produce the Markdown corpus that powers embeddings, search, and grounded generation.
- Document Automation Teams: replace a legacy stack of PDF, DOCX, OCR, and email parsers with a single, governed component.
- Compliance-sensitive Organizations: convert sensitive documents without sending them to a third-party API.
## 🚀 Problem Solved
- One engine, every format. PDF, DOCX, PPTX, XLSX, EML, MBOX, HTML, TXT, and every common raster image (PNG, JPG, TIFF, BMP, WEBP, GIF).
- Mixed-content PDFs, solved. The `Hybrid` strategy keeps born-digital pages on the fast text-layer path and automatically escalates scanned or image-heavy pages to vision OCR, with no pre-classification required from the caller.
- Structural fidelity. Headings, tables, lists, code blocks, and reading order survive the round trip. Email headers and HTML structure are preserved by dedicated format-aware converters.
- Deterministic fast path. When every page has a clean text layer, no model is loaded and conversion is CPU-only and deterministic.
- Zero-config startup. Omit the model and `DocumentToMarkdown` will lazily load the bundled `lightonocr-2:1b` specialist only if a vision-dependent page is encountered.
- Streaming observability. `PageStarting` and `PageCompleted` let you build progress bars, cancel mid-flight, or log per-page strategy, elapsed time, and quality score.
## 💻 Sample Application Description

Interactive console app that:
- Picks a strategy: `Hybrid` (default, recommended), `TextExtraction`, or `VlmOcr`.
- Loads a vision model (only if vision may be needed). Default is LightOnOCR 2 1B (~2 GB VRAM), with nine additional model options and custom-URI support.
- Builds a `DocumentToMarkdown` converter and hooks into the `PageStarting` and `PageCompleted` events to print a live per-page log.
- Runs a conversion loop that prompts for:
  - a document path (PDF, DOCX, PPTX, XLSX, EML, MBOX, HTML, TXT, or image),
  - an optional page range (`1-5,7`),
  - an optional `.md` output path.
- Prints the full Markdown (or a preview plus on-disk path) and a summary block with the requested/effective strategy, per-page breakdown (text vs VLM), total VLM tokens, character count, and elapsed times.
- Auto-detects image-only inputs when running the `TextExtraction` strategy and wires an `LMKitOcr` instance for that run so `TextExtraction` remains usable without a vision model.
## ✨ Key Features
- 🧠 Universal engine: one API for every supported format.
- 🔀 Hybrid routing: per-page decision between text extraction and VLM OCR.
- 📩 Format-aware specialists: EML, MBOX, HTML, and DOCX are converted in a single pass by dedicated converters that preserve email headers, HTML structure, and DOCX tables.
- 📏 Page ranges: convert `"1-5, 7, 9-12"` of a 500-page PDF.
- 📊 Rich telemetry per page: `StrategyUsed`, `Elapsed`, `GeneratedTokenCount`, `QualityScore`, `Warning`, `HasExtractableText`.
- 📝 YAML front matter and page separators: ready for LLM ingestion or static-site pipelines.
- 📦 Lazy model loading: no model is downloaded or loaded until a VLM page actually needs it.
- 🛡️ Local-first: nothing leaves the process.
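To illustrate the page-range syntax in the feature list (`"1-5, 7, 9-12"`), here is a small stand-alone parser. This sketch only demonstrates the semantics of the syntax; `DocumentToMarkdownOptions` parses its `PageRange` string internally, and the helper name is our own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Expands a range expression such as "1-5, 7, 9-12" into an ordered page list.
// Illustrative only: the library handles PageRange parsing itself.
static IReadOnlyList<int> ExpandPageRange(string expression)
{
    var pages = new SortedSet<int>();
    foreach (var part in expression.Split(',',
        StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries))
    {
        var bounds = part.Split('-');
        int start = int.Parse(bounds[0]);
        int end = bounds.Length > 1 ? int.Parse(bounds[1]) : start;
        for (int p = start; p <= end; p++)
            pages.Add(p);
    }
    return pages.ToList();
}

Console.WriteLine(string.Join(",", ExpandPageRange("1-5, 7, 9-12")));
// → 1,2,3,4,5,7,9,10,11,12
```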
## 🧭 Strategy Matrix

| Strategy | Model Needed | Best For | Speed |
|---|---|---|---|
| `TextExtraction` | No (or `LMKitOcr` for OCR paths) | Born-digital PDFs, DOCX, XLSX, PPTX, HTML, EML, MBOX | 🔥 Fastest |
| `VlmOcr` | Vision model | Scans, photos, handwriting, layout-heavy pages | 🐢 Slowest |
| `Hybrid` (recommended) | Vision model (lazy) | Mixed PDFs (born-digital plus scanned), unknown corpora | ⚡ Adaptive |
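To make the `VlmOcr` row concrete, a minimal sketch that forces every page through the vision path by loading the model up front; the input file name is illustrative, and all API calls are the ones shown elsewhere in this README.

```csharp
using System;
using LMKit.Document.Conversion;
using LMKit.Model;

// Force every page through the vision model, e.g. for a pure scan
// (no text layer to extract). "scan.tiff" is a placeholder path.
using var model = LM.LoadFromModelID("lightonocr-2:1b");
var converter = new DocumentToMarkdown(model);

var result = converter.Convert("scan.tiff", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.VlmOcr
});

Console.WriteLine(result.Markdown);
```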
`TextExtraction` becomes a full OCR pipeline the moment you set `options.OcrEngine`:
it transcribes image attachments, enriches PDF pages with OCR of their embedded
raster images, and falls back to full-page OCR on scanned PDFs, no language
model required.
## 🧰 Built-In Models (menu)
On startup, the sample exposes a vision-model menu (only prompted when vision may be used):
| Option | Model | Approx. VRAM |
|---|---|---|
| 0 | LightOn LightOnOCR 2 1B (★ default) | ~2 GB |
| 1 | Z.ai GLM-OCR 0.9B | ~1 GB |
| 2 | Z.ai GLM-V 4.6 Flash 10B | ~7 GB |
| 3 | MiniCPM o 4.5 9B | ~5.9 GB |
| 4 | Alibaba Qwen 3.5 2B | ~2 GB |
| 5 | Alibaba Qwen 3.5 4B | ~3.5 GB |
| 6 | Alibaba Qwen 3.5 9B | ~7 GB |
| 7 | Google Gemma 4 E4B | ~6 GB |
| 8 | Alibaba Qwen 3.5 27B | ~18 GB |
| 9 | Mistral Ministral 3 8B | ~6.5 GB |
| other | Custom model URI | depends |
## 💻 Minimal Integration Snippet

```csharp
using System;
using System.IO;
using LMKit.Document.Conversion;
using LMKit.Model;

// Zero-config: lightonocr-2:1b is loaded lazily only if a VLM page is encountered.
var converter = new DocumentToMarkdown();

converter.PageStarting += (_, e) => Console.WriteLine($"Page {e.PageNumber}/{e.PageCount} [{e.PlannedStrategy}]");
converter.PageCompleted += (_, e) =>
{
    if (e.PageResult != null)
    {
        Console.WriteLine($"Page {e.PageResult.PageNumber} in {e.PageResult.Elapsed.TotalMilliseconds:F0} ms " +
                          $"[{e.PageResult.StrategyUsed}]");
    }
};

var result = converter.Convert("report.pdf", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.Hybrid,
    PageRange = "1-10",
    EmitFrontMatter = true,
    PreferMarkdownTablesForNonNested = true
});

File.WriteAllText("report.md", result.Markdown);

foreach (var page in result.Pages)
{
    Console.WriteLine($"Page {page.PageNumber}: {page.StrategyUsed} {page.Elapsed.TotalMilliseconds:F0} ms");
}
```
**Bring-your-own model:**

```csharp
using var model = LM.LoadFromModelID("lightonocr-2:1b");
var converter = new DocumentToMarkdown(model);
```
**Convert straight to disk:**

```csharp
await converter.ConvertToFileAsync("invoice.pdf", "invoice.md",
    new DocumentToMarkdownOptions { Strategy = DocumentToMarkdownStrategy.Hybrid });
```
**Pure TextExtraction with traditional OCR:**

Supplying `OcrEngine` extends `TextExtraction` at three points: standalone image
inputs are transcribed, embedded raster images on each PDF page are OCRed and
merged into the page layout (chart labels, figure legends), and scanned PDFs
fall back to a full-page OCR pass. The whole pipeline runs with no language
model loaded at all: the leanest possible OCR deployment.

```csharp
using LMKit.Extraction.Ocr;
using LMKit.Document.Conversion;

using var ocr = new LMKitOcr();
var converter = new DocumentToMarkdown();

var result = converter.Convert("invoice.png", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.TextExtraction,
    OcrEngine = ocr,
    OcrImageParallelism = 4 // concurrent OCR calls per page (1..12)
});
```
The per-page pipeline caps at 20 images per page (a DoS guard against pathological PDFs); any page beyond that limit is transparently handled by the full-page OCR fallback instead of spawning an unbounded number of per-image calls.
## 🛠️ Getting Started

### 📋 Prerequisites

- .NET 8.0 or later
- ~2 GB VRAM if a vision strategy is selected (default model: `lightonocr-2:1b`)
- No VRAM needed when running `TextExtraction` on paginated formats (PDF, DOCX, XLSX, PPTX, EML, MBOX, HTML, TXT)
### 📥 Download

```bash
git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/document_to_markdown
```
### ▶️ Run

```bash
dotnet build
dotnet run
```
Then:
- Select a strategy (0 = Hybrid, 1 = TextExtraction, 2 = VlmOcr).
- If vision may be used, select a vision model or paste a custom URI.
- Enter a document path, an optional page range, and an optional output `.md` path.
- Read the per-page log, the Markdown preview, and the conversion summary.
- Press Enter to convert another file, or `q` to quit.
## 🔍 Notes on Key Types

- `DocumentToMarkdown` (`LMKit.Document.Conversion`): entry point for every conversion. Accepts file paths, byte arrays, streams, `ImageBuffer`, `Uri`, and pre-built `Attachment` objects, with both synchronous and async overloads plus direct-to-file variants.
- `DocumentToMarkdownOptions`: strategy, page range, OCR engine and per-image parallelism, VLM image detail and token budget, DOCX/email-specific toggles, and output shaping (front matter, separators, table rewriting, whitespace normalization).
- `DocumentToMarkdownStrategy`: `TextExtraction`, `VlmOcr`, or `Hybrid`.
- `DocumentToMarkdownResult`: aggregated Markdown plus a `Pages` list, requested vs effective strategy, total elapsed time, and source name.
- `DocumentToMarkdownPageResult`: per-page strategy, Markdown body, elapsed time, token count, quality score, warnings.
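One possible way to put the per-page telemetry to work is to flag weak pages for review or a `VlmOcr` re-run. This is a sketch under assumptions: the 0.5 threshold is an arbitrary illustration, and `QualityScore` / `Warning` are assumed to be a numeric score and nullable warning text; only the property names come from the type notes above.

```csharp
using System;
using LMKit.Document.Conversion;

var converter = new DocumentToMarkdown();
var result = converter.Convert("report.pdf", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.Hybrid
});

// Flag pages whose conversion looks weak. The 0.5 cutoff is an arbitrary
// example value, not a library default.
foreach (var page in result.Pages)
{
    if (page.QualityScore < 0.5 || page.Warning != null)
    {
        Console.WriteLine($"Review page {page.PageNumber}: " +
                          $"score={page.QualityScore:F2}, warning={page.Warning}, " +
                          $"strategy={page.StrategyUsed}");
    }
}
```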
## 🔧 Extend the Demo

- Add a batch mode that recursively walks a folder and writes one `.md` per input.
- Pipe the Markdown into LM-Kit.NET's RAG or Structured Extraction stack to go from raw documents, to Markdown, to embeddings or structured JSON in one flow.
- Add a cancellation UI by wiring a `CancellationToken` or flipping `DocumentToMarkdownPageStartingEventArgs.Cancel`.
- Replace the console log with a progress bar driven by `PageStarting` and `PageCompleted`.
- Swap `LMKitOcr` for a custom `OcrEngine` subclass to integrate an in-house OCR service.
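The batch-mode idea above could be sketched as follows. The folder name, extension filter, and output naming are choices made for this example, and the synchronous `ConvertToFile` is assumed to mirror the `ConvertToFileAsync` overload shown earlier.

```csharp
using System;
using System.IO;
using LMKit.Document.Conversion;

// Walk a folder tree and emit one .md next to each supported input.
// "corpus" and the extension list are illustrative choices for this sketch.
var supported = new[] { ".pdf", ".docx", ".pptx", ".xlsx", ".eml", ".mbox", ".html", ".txt", ".png", ".jpg" };
var converter = new DocumentToMarkdown();
var options = new DocumentToMarkdownOptions { Strategy = DocumentToMarkdownStrategy.Hybrid };

foreach (var path in Directory.EnumerateFiles("corpus", "*", SearchOption.AllDirectories))
{
    if (Array.IndexOf(supported, Path.GetExtension(path).ToLowerInvariant()) < 0)
        continue;

    string output = Path.ChangeExtension(path, ".md");
    converter.ConvertToFile(path, output, options);
    Console.WriteLine($"Converted {path} -> {output}");
}
```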
## 📚 Related Content
- How-To: Convert Documents to Markdown: Step-by-step guide to the universal converter.
- VLM OCR Demo: Lower-level VLM OCR loop with intent selection.
- Chat with PDF Demo: Use the engine as the ingestion stage of a RAG pipeline.
- Glossary: Vision Language Models: Background on the multimodal models powering VLM OCR.
- Glossary: Optical Character Recognition: OCR fundamentals used by the text-extraction strategy's image path.