👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/document-intelligence/document-conversion/multi_format_to_markdown
Multi-Format to Markdown for C# .NET Applications
🎯 Purpose of the Demo
An interactive console app that converts PDF, DOCX, PPTX, XLSX, HTML, EML, MBOX, and image files to Markdown. It uses a hybrid pipeline: native text extraction where it works (deterministic, fast), with a VLM-OCR fallback for scanned pages and image attachments.
All conversion runs on-device.
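The hybrid idea can be sketched as a per-page decision: try native extraction first, and only pay for OCR when little or no text comes back. This is an illustrative sketch, not LM-Kit code. `HybridSketch`, `ConvertPage`, the two delegates, and the 16-character threshold are all hypothetical, and the library's actual heuristics may differ.

```csharp
using System;

// Hypothetical stand-in for the per-page choice a hybrid pipeline makes.
// None of these names come from LM-Kit; they only illustrate the idea.
public static class HybridSketch
{
    // Assumed heuristic: a page "has native text" when extraction yields
    // at least this many non-whitespace characters.
    private const int MinNativeChars = 16;

    public static (string Text, string Strategy) ConvertPage(
        Func<string> extractNativeText,   // fast, deterministic path
        Func<string> runOcr)              // slower OCR path, invoked only on demand
    {
        string native = extractNativeText() ?? string.Empty;

        // Count visible characters so whitespace-only pages fall through to OCR.
        int visible = 0;
        foreach (char c in native)
        {
            if (!char.IsWhiteSpace(c)) visible++;
        }

        return visible >= MinNativeChars
            ? (native, "TextExtraction")
            : (runOcr(), "VlmOcr");
    }
}
```

Passing the OCR step as a delegate means it never runs for pages that already carry usable text, which is where the speed advantage of a hybrid approach comes from.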
👥 Industry Target Audience
- RAG / search teams: normalize a heterogeneous corpus into one indexable format.
- Compliance / legal: produce searchable Markdown from PDFs, emails, and contracts.
- Documentation pipelines: convert legacy Word/PowerPoint to Markdown for static sites.
- Knowledge bases: turn email archives into queryable text.
- Fine-tuning data prep: build training datasets from mixed-format sources.
🚀 Problem Solved
Building a pipeline that reads every file type your business produces is months of plumbing: per-format parsers, OCR for scans, image extraction, layout reconstruction. DocumentToMarkdown collapses it into one call. The demo wraps it behind a menu that scales from a single document to a folder of thousands.
💻 Application Overview
Interactive menu (no command-line arguments) with two conversion modes and a quit option:
| Mode | What it does |
|---|---|
| File | Convert a single document. Prompts for the path, strategy (Hybrid / TextExtraction / VlmOcr), page separators, and output path. |
| Folder | Convert every supported file in a folder (recursive). Same option prompts. |
| Quit | Exit. |
The OCR model is loaded once at startup. Per-file output includes strategy, certainty, elapsed time, page count, and character count.
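For Folder mode, the recursive file discovery step might look like the sketch below. It uses only the .NET base class library and is not taken from the demo source. `FolderScan` and its members are hypothetical names, the document extensions mirror the formats listed in this README, and the image extensions are an assumption because the README only says "image files".

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class FolderScan
{
    // Document extensions follow the formats named in this README; the image
    // extensions are an assumption, not a list confirmed by the library.
    private static readonly HashSet<string> Supported = new(StringComparer.OrdinalIgnoreCase)
    {
        ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".eml", ".mbox",
        ".png", ".jpg", ".jpeg"
    };

    // Case-insensitive check on the file extension only.
    public static bool IsSupported(string path) =>
        Supported.Contains(Path.GetExtension(path));

    // Walks the folder tree recursively and returns supported files in a stable order.
    public static IEnumerable<string> FindSupportedFiles(string root) =>
        Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)
                 .Where(IsSupported)
                 .OrderBy(p => p, StringComparer.OrdinalIgnoreCase);
}
```

Each path returned by `FindSupportedFiles` would then be handed to the converter in turn, reusing the single OCR model loaded at startup.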
✨ Key Features
- `DocumentToMarkdown.Convert(path, options)`: one call, every supported format.
- Strategy picker: `Hybrid` (default), `TextExtraction`, `VlmOcr`.
- `OcrImageParallelism`: tune VLM-OCR throughput per page.
- `IncludePageSeparators`: keep page boundaries in the output Markdown.
- `DocumentToMarkdownResult`: rich result with `Markdown`, `Pages`, `Certainty`, `Elapsed`, `EffectiveStrategy`.
🧠 Model
- `paddleocr-vl-1.6:0.9b` (fast, low-VRAM VLM for OCR fallback).
🛠️ Getting Started
📋 Prerequisites
- .NET 8.0 or later
▶️ Running the Application
```bash
git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/document-intelligence/document-conversion/multi_format_to_markdown
dotnet run
```
Pick a mode from the menu and follow the prompts.