👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/document-intelligence/document-conversion/multi_format_to_markdown

Multi-Format to Markdown for C# .NET Applications


🎯 Purpose of the Demo

An interactive console app that converts PDF, DOCX, PPTX, XLSX, HTML, EML, MBOX, and image files to Markdown. It uses a hybrid pipeline: native text extraction where it works (deterministic and fast), with a VLM-OCR fallback for scanned pages and image attachments.

All conversion runs on-device.
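
The text-first, OCR-fallback idea behind the hybrid pipeline can be illustrated with a minimal sketch. This is not the library's implementation: the `ChooseStrategy` helper and the `MinNativeChars` threshold are hypothetical, and the heuristic LM-Kit actually uses to detect scanned pages is not described here. Only the strategy names `TextExtraction` and `VlmOcr` come from the demo.

```csharp
using System;
using System.Linq;

// Hypothetical threshold: pages with fewer visible characters than this are
// treated as scanned. The library's real heuristic is not documented here.
const int MinNativeChars = 20;

// Hypothetical helper: decide, per page, whether the embedded text layer is usable.
string ChooseStrategy(string nativeText)
{
    int visible = (nativeText ?? string.Empty).Count(c => !char.IsWhiteSpace(c));
    return visible >= MinNativeChars ? "TextExtraction" : "VlmOcr";
}

// Simulated pages: one with a text layer, one scanned (empty text layer).
string[] pages = { "Quarterly report: revenue grew in all regions.", "" };
for (int i = 0; i < pages.Length; i++)
{
    Console.WriteLine($"Page {i + 1}: {ChooseStrategy(pages[i])}");
}
```

The point of the sketch is the ordering: the cheap, deterministic path is tried first, and the model is only involved for pages where it yields nothing useful.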


👥 Industry Target Audience

  • RAG / search teams: normalize a heterogeneous corpus into one indexable format.
  • Compliance / legal: produce searchable Markdown from PDFs, emails, and contracts.
  • Documentation pipelines: convert legacy Word/PowerPoint to Markdown for static sites.
  • Knowledge bases: turn email archives into queryable text.
  • Fine-tuning data prep: build training datasets from mixed-format sources.

🚀 Problem Solved

Building a pipeline that reads every file type your business produces is months of plumbing: per-format parsers, OCR for scans, image extraction, layout reconstruction. DocumentToMarkdown collapses it into one call. The demo wraps it behind a menu that scales from a single document to a folder of thousands.


💻 Application Overview

Interactive menu (no command-line arguments) with two conversion modes and a quit option:

| Mode   | What it does |
|--------|--------------|
| File   | Convert a single document. Prompts for the path, strategy (Hybrid / TextExtraction / VlmOcr), page separators, and output path. |
| Folder | Convert every supported file in a folder (recursive). Same option prompts. |
| Quit   | Exit. |

The OCR model is loaded once at startup. Per-file output includes strategy, certainty, elapsed time, page count, and character count.
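
Folder mode's recursive scan can be approximated with standard .NET file APIs. The sketch below is an assumption about how such a scan might look, not the demo's actual code: `FindConvertibleFiles` is a hypothetical helper, and the image extensions are guesses, since the demo only says "image files".

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Extensions for the formats named in this README. The image extensions are
// an assumption; the list the demo really accepts may differ.
var supported = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    ".pdf", ".docx", ".pptx", ".xlsx", ".html", ".eml", ".mbox",
    ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp" // assumed image types
};

// Hypothetical helper: walk the folder recursively and keep supported files.
// With default EnumerationOptions, inaccessible, hidden, and system entries
// are skipped.
IEnumerable<string> FindConvertibleFiles(string root) =>
    Directory.EnumerateFiles(root, "*", new EnumerationOptions { RecurseSubdirectories = true })
             .Where(path => supported.Contains(Path.GetExtension(path)))
             .OrderBy(path => path, StringComparer.OrdinalIgnoreCase);

string folder = args.Length > 0 ? args[0] : Directory.GetCurrentDirectory();
foreach (string file in FindConvertibleFiles(folder))
{
    // The demo would pass each path to DocumentToMarkdown.Convert at this point.
    Console.WriteLine(file);
}
```

Matching extensions case-insensitively keeps files such as `REPORT.PDF` in the batch, and sorting the results makes repeated runs over the same folder produce output in a stable order.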

✨ Key Features

  • DocumentToMarkdown.Convert(path, options): one call, every supported format.
  • Strategy picker: Hybrid (default), TextExtraction, VlmOcr.
  • OcrImageParallelism: tune VLM-OCR throughput per page.
  • IncludePageSeparators: keep page boundaries in the output Markdown.
  • DocumentToMarkdownResult: rich result with Markdown, Pages, Certainty, Elapsed, EffectiveStrategy.

🧠 Model

  • paddleocr-vl-1.6:0.9b (fast, low-VRAM VLM for OCR fallback).

🛠️ Getting Started

📋 Prerequisites

  • .NET 8.0 or later

▶️ Running the Application

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/document-intelligence/document-conversion/multi_format_to_markdown
dotnet run

Pick a mode from the menu and follow the prompts.
