Class DocumentToMarkdown
- Namespace: LMKit.Document.Conversion
- Assembly: LM-Kit.NET.dll
Converts a document into Markdown using one of several strategies: text-layer extraction (optionally backed by a traditional OCR engine for images), vision-language OCR, or a hybrid per-page selection between the two.
public sealed class DocumentToMarkdown
- Inheritance: object → DocumentToMarkdown
Examples
Example 1: Convert a PDF using only the text layer (no model required).
using System.IO;
using LMKit.Document.Conversion;
var converter = new DocumentToMarkdown();
var result = converter.Convert("report.pdf");
// Persist the generated Markdown next to the source document.
File.WriteAllText("report.md", result.Markdown);
Example 2: Convert a scanned PDF using a vision-language model.
using LMKit.Document.Conversion;
using LMKit.Model;
var model = LM.LoadFromModelID("lightonocr-2:1b");
var converter = new DocumentToMarkdown(model);
var result = await converter.ConvertAsync("scan.pdf", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.VlmOcr
});
Example 3: Hybrid conversion of a mixed PDF (born-digital + scanned pages).
using System;
using LMKit.Document.Conversion;
using LMKit.Model;
var model = LM.LoadFromModelID("lightonocr-2:1b");
var converter = new DocumentToMarkdown(model);
// Report which strategy was planned for each page before it is processed.
converter.PageStarting += (s, e) =>
    Console.WriteLine($"Page {e.PageNumber}/{e.PageCount} ({e.PlannedStrategy})");
var result = await converter.ConvertAsync("mixed.pdf");
Example 4: Omit the model; the default lightonocr-2:1b is loaded on demand.
using LMKit.Document.Conversion;
var converter = new DocumentToMarkdown();
// No model supplied: when VLM is required (e.g. for a scanned page),
// the default "lightonocr-2:1b" is loaded automatically on first use.
var result = converter.Convert("scan.pdf", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.VlmOcr
});
Example 5: Convert a scanned image using a traditional OCR engine (no VLM required).
using LMKit.Document.Conversion;
using LMKit.Extraction.Ocr;
var converter = new DocumentToMarkdown();
var result = converter.Convert("invoice.png", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.TextExtraction,
    OcrEngine = new LMKitOcr()
});
Remarks
Format dispatch. When the input is a format that has a dedicated converter, DocumentToMarkdown delegates to that converter to produce structurally rich Markdown, bypassing the page-by-page text/VLM pipeline:
- EML email (message/rfc822) → EmlToMarkdown
- MBOX mailbox (application/mbox) → MboxToMarkdown
- HTML (text/html) → HtmlToMarkdown
- DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document) → DocxToMarkdown
Specialized conversion is single-pass and format-aware (preserves email headers, HTML structure, DOCX tables, etc.), so the Strategy setting is ignored for these formats. All other inputs (PDF, images, plain text, XLSX, PPTX, ...) flow through the strategy-driven pipeline described below.
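The dispatch rule above can be pictured as a lookup on the detected MIME type. The following is only an illustrative sketch: PickConverter is a hypothetical helper, not an LM-Kit API, and the detection logic inside DocumentToMarkdown may well differ. It returns the name of the dedicated converter listed above, or null when the input would go through the strategy-driven pipeline.

```csharp
using System;

// Hypothetical helper (not part of LM-Kit): maps a MIME type to the name of
// the dedicated converter listed above, or null for the strategy pipeline.
static string? PickConverter(string mimeType) => mimeType.ToLowerInvariant() switch
{
    "message/rfc822" => "EmlToMarkdown",
    "application/mbox" => "MboxToMarkdown",
    "text/html" => "HtmlToMarkdown",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document" => "DocxToMarkdown",
    _ => null // PDF, images, plain text, XLSX, PPTX, ...: the Strategy setting applies
};

Console.WriteLine(PickConverter("text/html") ?? "strategy pipeline");       // HtmlToMarkdown
Console.WriteLine(PickConverter("application/pdf") ?? "strategy pipeline"); // strategy pipeline
```

The null branch is the practical takeaway: only inputs that reach it are affected by the Strategy option.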
Strategies. See DocumentToMarkdownStrategy.
- TextExtraction reads the embedded text layer (fast, no model required). Supplying OcrEngine extends it into a full traditional-OCR pipeline: image attachments are transcribed, embedded raster images on PDF pages are OCRed and their text projected back into the page layout, and PDF pages whose native text layer is empty fall back to a full-page OCR render.
- VlmOcr rasterizes each page and asks a vision-language model to transcribe it, recovering content from scanned and image-heavy documents.
- Hybrid inspects each page individually: pages with a clean text layer and no embedded images stay on the fast text path, while pages that have no extractable text or contain embedded images are routed to VLM OCR. Image attachments always resolve to VLM OCR under Hybrid.
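The per-page Hybrid decision described above reduces to a small predicate. The sketch below restates it under stated assumptions: PlanStrategy and its two boolean flags are hypothetical stand-ins for whatever page analysis the library actually performs, and the returned strings merely echo the strategy names used on this page.

```csharp
using System;

// Hypothetical restatement of the Hybrid routing rule (not an LM-Kit API).
static string PlanStrategy(bool hasExtractableText, bool hasEmbeddedImages) =>
    hasExtractableText && !hasEmbeddedImages
        ? "TextExtraction" // clean text layer, no embedded images: fast text path
        : "VlmOcr";        // no extractable text, or embedded images: vision OCR

Console.WriteLine(PlanStrategy(true, false));  // TextExtraction
Console.WriteLine(PlanStrategy(false, false)); // VlmOcr
Console.WriteLine(PlanStrategy(true, true));   // VlmOcr
```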
Input modes. The converter mirrors the Attachment class in
accepting file paths, raw bytes, streams, ImageBuffer instances,
Uri downloads, and pre-built Attachment objects.
Every input mode ships with both a synchronous Convert and an asynchronous
ConvertAsync overload, plus a ConvertToFile / ConvertToFileAsync
variant that writes the Markdown directly to disk.
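As a rough illustration of the input modes, the sketch below feeds the same document to the converter as raw bytes, as a stream, and as a path with direct file output. It relies only on the overload signatures listed under Methods; report.pdf is a placeholder file, and the options and cancellation token are passed explicitly because this page does not show whether they are optional on these overloads.

```csharp
using System.IO;
using System.Threading;
using LMKit.Document.Conversion;

var converter = new DocumentToMarkdown();
var options = new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.TextExtraction // text layer only, no model
};

// Raw bytes: the file name is used only for MIME detection and logical naming.
byte[] bytes = File.ReadAllBytes("report.pdf");
var fromBytes = converter.Convert(bytes, "report.pdf", options, CancellationToken.None);

// Stream: the caller keeps ownership, so it is disposed here.
using (var stream = File.OpenRead("report.pdf"))
{
    var fromStream = converter.Convert(stream, "report.pdf", options, CancellationToken.None);
}

// Path in, Markdown file out.
converter.ConvertToFile("report.pdf", "report.md", options, CancellationToken.None);
```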
Observability. Subscribe to PageStarting and PageCompleted to report progress, cancel mid-flight, or log per-page diagnostics (strategy used, elapsed time, token count, quality score).
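A minimal sketch of the two events working together follows. It uses only members that appear on this page (PageNumber, PageCount, PlannedStrategy, and Cancel on the PageStarting arguments); the ten-page budget and the report.pdf path are illustrative assumptions. How a cancelled conversion surfaces to the caller (exception or partial result) is not stated here, so treat the final line as reached only when the call returns normally.

```csharp
using System;
using LMKit.Document.Conversion;

var converter = new DocumentToMarkdown();
int completed = 0;

converter.PageStarting += (s, e) =>
{
    Console.WriteLine($"Page {e.PageNumber}/{e.PageCount} ({e.PlannedStrategy})");

    // Illustrative budget: abort once the page number exceeds ten.
    if (e.PageNumber > 10)
        e.Cancel = true;
};

// Count every page that finishes, whether it succeeded or failed.
converter.PageCompleted += (s, e) => completed++;

var result = converter.Convert("report.pdf");
Console.WriteLine($"{completed} page(s) completed");
```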
Reuse and threading. The converter itself is stateless across calls, so one instance can drive an entire corpus sequentially. Page-level work runs one page at a time; to parallelise across documents, spin up one DocumentToMarkdown per worker (each carrying its own vision model reference if the VLM path is used).
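One way to follow the one-converter-per-worker advice is the thread-local overload of Parallel.ForEach, sketched below. The corpus directory is a placeholder, the text-only strategy is chosen so that no vision model is involved, and the options and token are passed explicitly because this page does not show whether they are optional on ConvertToFile.

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using LMKit.Document.Conversion;

string[] inputs = Directory.GetFiles("corpus", "*.pdf");

Parallel.ForEach(
    inputs,
    // localInit: each worker builds its own converter instance.
    () => new DocumentToMarkdown(),
    (path, loopState, converter) =>
    {
        var options = new DocumentToMarkdownOptions
        {
            Strategy = DocumentToMarkdownStrategy.TextExtraction
        };

        // Write the Markdown next to the source file.
        string output = Path.ChangeExtension(path, ".md");
        converter.ConvertToFile(path, output, options, CancellationToken.None);

        return converter; // hand the same instance to this worker's next item
    },
    // localFinally: nothing to clean up in this sketch.
    converter => { });
```

If the VLM path is needed, the same shape applies, with each worker constructing its converter from its own model reference as described above.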
Constructors
- DocumentToMarkdown()
Initializes a new instance of the DocumentToMarkdown class without a user-supplied vision model. All strategies remain available: if a vision-language strategy is selected, the default model lightonocr-2:1b (see DefaultVisionModelId) is loaded lazily on first use.
- DocumentToMarkdown(LM)
Initializes a new instance of the DocumentToMarkdown class with the specified vision-language model. The model is used for all vision-based strategies and must support text generation and vision.
Fields
- DefaultVisionModelId
Model ID loaded on demand when a vision-language strategy is requested but the converter was built without a user-supplied model.
Properties
- HasVisionModel
Gets a value indicating whether the caller supplied a vision-language model at construction time. Even when this returns
false, vision-based strategies remain available because the converter can lazily load DefaultVisionModelId on first use.
- ResolvedVisionModel
Gets the vision-language model currently in use by this converter, or null if no vision-based strategy has been run yet. After the first call that requires a vision model, this returns either the user-supplied model or the lazily loaded default (DefaultVisionModelId).
- VisionModel
Gets the vision-language model supplied by the caller, or null when the converter was built without one. The default model (see DefaultVisionModelId) is loaded lazily in that case and is exposed through ResolvedVisionModel after the first vision call.
Methods
- Convert(Attachment, DocumentToMarkdownOptions, CancellationToken)
Converts the supplied Attachment to Markdown. All other overloads delegate to this method after wrapping their input in an attachment. Use this overload directly when you already hold a pre-built attachment (e.g. carried through a pipeline) and want to avoid re-wrapping it.
- Convert(ImageBuffer, DocumentToMarkdownOptions, CancellationToken)
Converts a single in-memory image to Markdown. The vision model is used by default; pair TextExtraction with OcrEngine to run traditional OCR instead.
- Convert(byte[], string, DocumentToMarkdownOptions, CancellationToken)
Converts in-memory document bytes to Markdown. The fileName is used only for MIME detection and logical naming; no file is written.
- Convert(Stream, string, DocumentToMarkdownOptions, CancellationToken)
Converts a document read from the specified stream to Markdown. The stream is consumed in full; the caller retains ownership and is responsible for disposal.
- Convert(string, DocumentToMarkdownOptions, CancellationToken)
Converts the document at inputPath to Markdown.
- Convert(Uri, DocumentToMarkdownOptions, CancellationToken)
Downloads the document at the specified uri and converts it to Markdown.
- ConvertAsync(Attachment, DocumentToMarkdownOptions, CancellationToken)
Asynchronously converts the supplied Attachment to Markdown.
- ConvertAsync(ImageBuffer, DocumentToMarkdownOptions, CancellationToken)
Asynchronously converts a single in-memory image to Markdown using the vision OCR strategy.
- ConvertAsync(byte[], string, DocumentToMarkdownOptions, CancellationToken)
Asynchronously converts in-memory document bytes to Markdown.
- ConvertAsync(Stream, string, DocumentToMarkdownOptions, CancellationToken)
Asynchronously converts a document read from the specified stream to Markdown.
- ConvertAsync(string, DocumentToMarkdownOptions, CancellationToken)
Asynchronously converts the document at inputPath to Markdown.
- ConvertAsync(Uri, DocumentToMarkdownOptions, CancellationToken)
Asynchronously downloads the document at the specified uri and converts it to Markdown.
- ConvertToFile(Attachment, string, DocumentToMarkdownOptions, CancellationToken)
Converts the supplied Attachment and writes the Markdown to
outputPath.
- ConvertToFile(byte[], string, string, DocumentToMarkdownOptions, CancellationToken)
Converts in-memory document bytes and writes the Markdown to
outputPath.
- ConvertToFile(Stream, string, string, DocumentToMarkdownOptions, CancellationToken)
Converts a document read from a stream and writes the Markdown to
outputPath.
- ConvertToFile(string, string, DocumentToMarkdownOptions, CancellationToken)
Converts the document at inputPath and writes the Markdown to outputPath.
- ConvertToFileAsync(Attachment, string, DocumentToMarkdownOptions, CancellationToken)
Asynchronously converts the supplied Attachment and writes the Markdown to
outputPath.
- ConvertToFileAsync(byte[], string, string, DocumentToMarkdownOptions, CancellationToken)
Asynchronously converts in-memory document bytes and writes the Markdown to
outputPath.
- ConvertToFileAsync(Stream, string, string, DocumentToMarkdownOptions, CancellationToken)
Asynchronously converts a document read from a stream and writes the Markdown to
outputPath.
- ConvertToFileAsync(string, string, DocumentToMarkdownOptions, CancellationToken)
Asynchronously converts the document at inputPath and writes the Markdown to outputPath.
Events
- PageCompleted
Raised after a page has been processed, whether it succeeded or failed.
- PageStarting
Raised just before a page starts being processed. Set Cancel to true to abort the conversion.