# Convert Documents to Markdown
LM-Kit.NET ships a single universal converter, `DocumentToMarkdown`, that turns any
supported document format into clean, LLM-ready Markdown. It replaces a whole stack of legacy
components (PDF text extractors, Tesseract-style OCR engines, DOCX/XLSX parsers, email parsers,
HTML-to-Markdown libraries) with one API, one result type, and one unified quality signal.
Everything runs 100% on-device.
This guide walks through the three conversion strategies, per-page progress, and the common patterns you will actually use in production.
## Supported Formats
| Category | Formats |
|---|---|
| Documents | PDF, DOCX, PPTX, XLSX, TXT |
| Email | EML, MBOX |
| Web | HTML |
| Images | PNG, JPG, JPEG, TIFF, BMP, WEBP, GIF |
EML, MBOX, HTML, and DOCX flow through dedicated format-aware converters that preserve email headers, HTML structure, and DOCX tables in a single pass. Every other input (PDF, images, TXT, XLSX, PPTX) flows through the strategy-driven page pipeline described below.
## The Three Strategies
| Strategy | Model Needed | Best For | Speed |
|---|---|---|---|
| `TextExtraction` | No (or `LMKitOcr` for OCR paths) | Born-digital PDFs, DOCX, XLSX, PPTX, HTML, EML, MBOX | 🔥 Fastest |
| `VlmOcr` | Vision model | Scans, photos, handwriting, layout-heavy pages | 🐢 Slowest |
| `Hybrid` (recommended) | Vision model (lazy) | Mixed PDFs, unknown corpora | ⚡ Adaptive |
Under `Hybrid`, each page is inspected individually: pages with a clean text layer stay on
the fast text path, while pages without extractable text, or with embedded images, are routed
to VLM OCR. No pre-classification is required from the caller.
`TextExtraction` becomes a full traditional-OCR pipeline the moment you set
`options.OcrEngine`: standalone images get transcribed, embedded raster images on each
PDF page are OCRed and merged back into the page layout, and scanned PDFs fall back
to a full-page OCR pass on the fly. See Step 7 for the details.
## Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | ~2 GB if a vision strategy is used (default `lightonocr-2:1b`) |
| Disk | ~1 GB free for the model download on first run |
`TextExtraction` on text-bearing formats (PDF, DOCX, XLSX, PPTX, EML, MBOX, HTML, TXT) needs no
VRAM and no model.
## Step 1: Create the Project

```shell
dotnet new console -n MarkdownQuickstart
cd MarkdownQuickstart
dotnet add package LM-Kit.NET
```
## Step 2: Convert a PDF (Zero-Config, Hybrid)
The fastest path to production is the zero-config constructor. No model is loaded until a VLM-bound page actually needs one; if the PDF is fully born-digital, the engine stays on the CPU text path.
```csharp
using System.Text;
using LMKit.Document.Conversion;

LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

var converter = new DocumentToMarkdown();
DocumentToMarkdownResult result = converter.Convert("report.pdf");

File.WriteAllText("report.md", result.Markdown);

Console.WriteLine($"Pages    : {result.Pages.Count}");
Console.WriteLine($"Strategy : {result.EffectiveStrategy}");
Console.WriteLine($"Elapsed  : {result.Elapsed.TotalSeconds:F2} s");
```
## Step 3: Bring Your Own Vision Model
Pass an explicit LM when you want to reuse the model across converters, pick a different
vision model, or take full control of download and loading progress.
```csharp
using LMKit.Document.Conversion;
using LMKit.Model;

using LM model = LM.LoadFromModelID("lightonocr-2:1b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rDownloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}% "); return true; });

var converter = new DocumentToMarkdown(model);
var result = await converter.ConvertAsync("scan.pdf", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.VlmOcr
});

File.WriteAllText("scan.md", result.Markdown);
```
## Step 4: Stream Per-Page Progress
Subscribe to `PageStarting` and `PageCompleted` to drive a progress bar, log per-page
diagnostics, or cancel mid-flight by flipping `e.Cancel`.
```csharp
using LMKit.Document.Conversion;

var converter = new DocumentToMarkdown();

converter.PageStarting += (_, e) =>
    Console.WriteLine($"▶ Page {e.PageNumber}/{e.PageCount} planned: {e.PlannedStrategy}");

converter.PageCompleted += (_, e) =>
{
    if (e.Exception != null)
    {
        Console.WriteLine($"✗ Page {e.PageNumber} failed: {e.Exception.Message}");
        return;
    }

    var p = e.PageResult!;
    string q = p.QualityScore.HasValue ? $", quality={p.QualityScore:F2}" : "";
    string t = p.GeneratedTokenCount > 0 ? $", {p.GeneratedTokenCount} tok" : "";
    Console.WriteLine($"✓ Page {p.PageNumber} in {p.Elapsed.TotalMilliseconds:F0} ms [{p.StrategyUsed}{t}{q}]");
};

var result = converter.Convert("mixed.pdf", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.Hybrid
});
```
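`PageStarting` is also the cancellation hook. A minimal sketch that bounds a long conversion with a timeout, assuming the engine stops scheduling pages once `e.Cancel` is set; the two-minute budget and the `large-scan.pdf` file name are arbitrary placeholders:

```csharp
using LMKit.Document.Conversion;

// Give the whole conversion a time budget.
var cts = new CancellationTokenSource(TimeSpan.FromMinutes(2));
var converter = new DocumentToMarkdown();

converter.PageStarting += (_, e) =>
{
    // Once the budget expires, stop before the next page is processed.
    if (cts.IsCancellationRequested)
        e.Cancel = true;
};

var result = converter.Convert("large-scan.pdf");
```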
## Step 5: Pick Pages and Shape the Output
Use `PageRange` to slice large PDFs. Use `EmitFrontMatter`, `IncludePageSeparators`, and
`PreferMarkdownTablesForNonNested` to shape the final Markdown for LLM ingestion or a static
site.
```csharp
var result = converter.Convert("big-report.pdf", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.Hybrid,
    PageRange = "1-5, 7, 10-12",
    EmitFrontMatter = true,
    IncludePageSeparators = true,
    PageSeparatorFormat = "\n\n---\n\n<!-- Page {pageNumber} -->\n\n",
    PreferMarkdownTablesForNonNested = true,
    NormalizeWhitespace = true
});
```
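The per-page objects on `result.Pages` expose the same fields used in the Step 4 events, so the unified quality signal can be audited after the fact. A sketch that flags low-scoring pages for review; the 0.70 cutoff is an arbitrary assumption, and `QualityScore` is null on pages that never produced a score:

```csharp
// Flag pages whose quality score fell below a review threshold.
foreach (var page in result.Pages)
{
    if (page.QualityScore is double score && score < 0.70)
        Console.WriteLine(
            $"Review page {page.PageNumber}: quality={score:F2} [{page.StrategyUsed}]");
}
```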
## Step 6: Convert Straight to Disk
`ConvertToFile` / `ConvertToFileAsync` skip the intermediate in-memory string and stream the
Markdown to the target path.
```csharp
await converter.ConvertToFileAsync("invoice.pdf", "out/invoice.md",
    new DocumentToMarkdownOptions { Strategy = DocumentToMarkdownStrategy.Hybrid });
```
## Step 7: Traditional OCR Without a Vision Model
When you want to run on very constrained hardware, pair `TextExtraction` with an
`OcrEngine` such as `LMKitOcr`. Supplying the engine extends the text-extraction
strategy at three complementary points:
- Image attachments (PNG, JPEG, TIFF, BMP, WEBP, GIF, ...) are transcribed by the engine instead of producing empty Markdown.
- Embedded raster images on PDF pages (charts, figure legends, scanned tables) are OCRed and their text projected back into the page's layout, so rasterised content flows alongside the native text.
- Full-page fallback: PDF pages whose native text layer is empty (scans, flattened print-to-PDF) are rendered as a full-page raster and OCRed end-to-end.
```csharp
using LMKit.Document.Conversion;
using LMKit.Extraction.Ocr;

using var ocr = new LMKitOcr();
var converter = new DocumentToMarkdown();

var result = converter.Convert("invoice.png", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.TextExtraction,
    OcrEngine = ocr,
    OcrImageParallelism = 4 // concurrent OCR calls per page (clamped to [1, 12])
});
```
Raise `OcrImageParallelism` on machines with spare CPU cores to speed up
image-heavy PDFs; lower it to protect an OCR engine with its own internal thread
pool. The converter also caps the per-image pipeline at 20 images per page: any
page carrying more than that is routed to the full-page OCR fallback instead of
spawning an unbounded number of per-image calls (a DoS guard against pathological
PDFs).
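One way to size `OcrImageParallelism` is to derive it from the core count and stay inside the documented [1, 12] clamp. A sketch, where the divide-by-two headroom is an arbitrary choice and `ocr` is the engine from the snippet above:

```csharp
// Use roughly half the cores for per-image OCR, bounded to the [1, 12] range.
int parallelism = Math.Clamp(Environment.ProcessorCount / 2, 1, 12);

var options = new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.TextExtraction,
    OcrEngine = ocr,
    OcrImageParallelism = parallelism
};
```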
Tip: `TextExtraction` + `LMKitOcr` gives you OCR on PDFs, scans, and standalone images with no language model loaded at all: the lightest possible deployment for a pure OCR pipeline.
## Step 8: Batch a Folder
`DocumentToMarkdown` is stateless across calls, so the same instance can be reused to process
a whole directory.
```csharp
using LMKit.Document.Conversion;

string inputDir = "inbox";
string outputDir = "markdown";
Directory.CreateDirectory(outputDir);

string[] files = Directory.GetFiles(inputDir, "*.*", SearchOption.TopDirectoryOnly);

var converter = new DocumentToMarkdown();
var options = new DocumentToMarkdownOptions { Strategy = DocumentToMarkdownStrategy.Hybrid };

foreach (string file in files)
{
    string outPath = Path.Combine(outputDir, Path.GetFileNameWithoutExtension(file) + ".md");
    try
    {
        var result = await converter.ConvertToFileAsync(file, outPath, options);
        Console.WriteLine($"{Path.GetFileName(file),-40} {result.Pages.Count} page(s) [{result.EffectiveStrategy}]");
    }
    catch (Exception ex)
    {
        Console.WriteLine($"{Path.GetFileName(file),-40} FAILED: {ex.Message}");
    }
}
```
## Advanced: Tune the VLM Path
`DocumentToMarkdownOptions` exposes the VLM knobs that used to live on `VlmOcr` directly:
```csharp
var options = new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.VlmOcr,
    VlmImageDetail = LMKit.Inference.Vision.ImageDetail.High,
    VlmMaximumCompletionTokens = 4096,
    VlmStripImageMarkup = true,
    VlmStripStyleAttributes = true
};
```
For workflows that need raw access to the vision model (intent selection, coordinate
extraction, custom instructions), drop down to `VlmOcr` directly. See the
VLM OCR demo and the VLM OCR with Coordinates demo.
## Model Selection for the VLM Strategies
| Model ID | VRAM | Speed | Quality | Best For |
|---|---|---|---|---|
| `lightonocr-2:1b` | ~2 GB | Fastest | Very good | Purpose-built OCR specialist (default) |
| `glm-ocr` | ~1 GB | Very fast | Good | Lightweight OCR specialist |
| `qwen3.5:2b` | ~2 GB | Very fast | Good | Lightweight multilingual OCR |
| `qwen3.5:4b` | ~3.5 GB | Fast | Very good | Multilingual documents |
| `gemma4:e4b` | ~6 GB | Moderate | Very good | Mixed text and vision tasks |
| `minicpm-o-45` | ~5.9 GB | Moderate | Very good | Strong all-round vision model |
| `qwen3.5:9b` | ~7 GB | Moderate | Excellent | High-quality multilingual OCR |
| `ministral3:8b` | ~6.5 GB | Moderate | Very good | Complex document layouts |
| `glm-4.6v-flash` | ~7 GB | Moderate | Excellent | Highest fidelity on complex layouts |
| `qwen3.6:27b` | ~18 GB | Slow | Excellent | Critical documents, demanding layouts |
`lightonocr-2:1b` is a compact 1B model specifically trained for high-accuracy OCR and
document understanding. It is the best default for dedicated OCR workloads. Switch to a
larger model like `qwen3.6:27b` or `glm-4.6v-flash` when dealing with complex layouts,
degraded scans, or handwriting.
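Swapping models from the table is a one-line change at load time. A sketch, assuming `LM.LoadFromModelID` can be called without the progress callbacks shown in Step 3 (they are optional named parameters there):

```csharp
using LMKit.Document.Conversion;
using LMKit.Model;

// Trade speed for fidelity on complex layouts by loading a larger vision model,
// then reuse it across converters.
using LM model = LM.LoadFromModelID("glm-4.6v-flash");
var converter = new DocumentToMarkdown(model);
```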
## Common Issues
| Problem | Cause | Fix |
|---|---|---|
| Output truncated mid-sentence | `VlmMaximumCompletionTokens` too low | Raise to 4096+ or set to -1 for unlimited |
| Empty Markdown on an image input with `TextExtraction` | No OCR engine supplied | Set `OcrEngine = new LMKitOcr()` or switch to `Hybrid` / `VlmOcr` |
| Empty Markdown on a scanned PDF with `TextExtraction` | No OCR engine supplied; the text layer is empty | Add `OcrEngine = new LMKitOcr()`; the full-page OCR fallback kicks in automatically when the text layer is sparse |
| Chart labels / figure legends missing on a born-digital PDF | Text sits inside embedded raster images, not the text layer | Add `OcrEngine = new LMKitOcr()` under `TextExtraction` to recognise embedded images and merge their text back into the page layout |
| Tables rendered as HTML | `<table>` uses nested tables or `rowspan`/`colspan` | Leave as HTML (Markdown cannot express those layouts) or post-process |
| PDFs take a long time on scans | Every page is routed to VLM | Use `Hybrid`, which keeps born-digital pages on the CPU text path |
| Per-image OCR feels slow | Image-heavy page with default `OcrImageParallelism = 4` | Raise `OcrImageParallelism` (up to 12) on CPU-rich machines |
| Model downloads on first run | First-time use of the default vision model | Pre-load with `LM.LoadFromModelID("lightonocr-2:1b")` and pass to the constructor |
| VLM page quality looks low | Dense layout or small fonts exceeding `VlmImageDetail.Low` | Set `VlmImageDetail = ImageDetail.High` (default) and raise `VlmMaximumCompletionTokens` |
## Next Steps
- Samples: Document to Markdown: the complete interactive demo.
- Samples: VLM OCR: drop down to `VlmOcr` for intent selection and custom instructions.
- Samples: VLM OCR with Coordinates: extract bounding boxes alongside text.
- Analyze Images with Vision Language Models: image Q&A and analysis beyond OCR.
- Extract Structured Data from Unstructured Text: turn the Markdown output into typed fields.
- Preprocess Images for Vision Pipelines: deskew, crop, and resize scans before conversion.