Class DocumentToMarkdown

Namespace: LMKit.Document.Conversion

Assembly: LM-Kit.NET.dll

Converts a document into Markdown using one of several strategies: text-layer extraction (optionally backed by a traditional OCR engine for images), vision-language OCR, or a hybrid per-page selection between the two.

public sealed class DocumentToMarkdown

Inheritance: object → DocumentToMarkdown

Inherited Members: object.Equals(object)

object.Equals(object, object)

object.GetHashCode()

object.GetType()

object.ReferenceEquals(object, object)

object.ToString()

Examples

Example 1: Convert a PDF using only the text layer (no model required).

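using System.IO;
using LMKit.Document.Conversion;

// Default constructor: no vision model is supplied.
var converter = new DocumentToMarkdown();
var result = converter.Convert("report.pdf");

// Save the generated Markdown next to the source document.
File.WriteAllText("report.md", result.Markdown);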

Example 2: Convert a scanned PDF using a vision-language model.

using LMKit.Document.Conversion;
using LMKit.Model;
var model = LM.LoadFromModelID("lightonocr-2:1b");
var converter = new DocumentToMarkdown(model);
var result = await converter.ConvertAsync("scan.pdf", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.VlmOcr
});

Example 3: Hybrid conversion of a mixed PDF (born-digital + scanned pages).

using System;
using LMKit.Document.Conversion;
using LMKit.Model;

var model = LM.LoadFromModelID("lightonocr-2:1b");
var converter = new DocumentToMarkdown(model);

// Report which strategy is planned for each page before it is processed.
converter.PageStarting += (s, e) =>
    Console.WriteLine($"Page {e.PageNumber}/{e.PageCount} ({e.PlannedStrategy})");

var result = await converter.ConvertAsync("mixed.pdf");

Example 4: Omit the model. The default lightonocr-2:1b is loaded on demand.

using LMKit.Document.Conversion;

var converter = new DocumentToMarkdown();

// No model supplied: when VLM is required (e.g. for a scanned page),
// the default "lightonocr-2:1b" is loaded automatically on first use.
var result = converter.Convert("scan.pdf", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.VlmOcr
});

Example 5: Convert a scanned image using a traditional OCR engine (no VLM required).

using LMKit.Document.Conversion;
using LMKit.Extraction.Ocr;
var converter = new DocumentToMarkdown();
var result = converter.Convert("invoice.png", new DocumentToMarkdownOptions
{
    Strategy  = DocumentToMarkdownStrategy.TextExtraction,
    OcrEngine = new LMKitOcr()
});

Remarks

Format dispatch. When the input is a format that has a dedicated converter, DocumentToMarkdown delegates to that converter to produce structurally rich Markdown, bypassing the page-by-page text/VLM pipeline:

EML email (message/rfc822) → EmlToMarkdown
MBOX mailbox (application/mbox) → MboxToMarkdown
HTML (text/html) → HtmlToMarkdown
DOCX (application/vnd.openxmlformats-officedocument.wordprocessingml.document) → DocxToMarkdown

Specialized conversion is single-pass and format-aware (preserves email headers, HTML structure, DOCX tables, etc.), so the Strategy setting is ignored for these formats. All other inputs (PDF, images, plain text, XLSX, PPTX, ...) flow through the strategy-driven pipeline described below.
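Because dispatch is driven by the detected format, the calling code for a dispatched format looks the same as for any other input. The following is a minimal sketch, assuming the .html extension is enough for the input to be detected as text/html; the file name and HTML content are made up for illustration.

```csharp
using System;
using System.IO;
using LMKit.Document.Conversion;

// Create a small HTML input for the illustration (hypothetical content).
File.WriteAllText("page.html", "<html><body><h1>Title</h1><p>Hello.</p></body></html>");

var converter = new DocumentToMarkdown();

// HTML is one of the dispatched formats, so per the remarks above the
// Strategy value set here is ignored for this input.
var result = converter.Convert("page.html", new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.TextExtraction
});

Console.WriteLine(result.Markdown);
```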

Strategies. See DocumentToMarkdownStrategy.

TextExtraction reads the embedded text layer (fast, no model required). Supplying OcrEngine extends it into a full traditional-OCR pipeline: image attachments are transcribed, embedded raster images on PDF pages are OCRed and their text is projected back into the page layout, and PDF pages whose native text layer is empty fall back to a full-page OCR render.

VlmOcr rasterizes each page and asks a vision-language model to transcribe it, recovering content from scanned and image-heavy documents.

Hybrid inspects each page individually: pages with a clean text layer and no embedded images stay on the fast text path, while pages that have no extractable text or contain embedded images are routed to VLM OCR. Image attachments always resolve to VLM OCR under Hybrid.

Input modes. The converter mirrors the Attachment class in accepting file paths, raw bytes, streams, ImageBuffer instances, Uri downloads, and pre-built Attachment objects. Every input mode ships with both a synchronous Convert and an asynchronous ConvertAsync overload, plus a ConvertToFile / ConvertToFileAsync variant that writes the Markdown directly to disk.
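The input modes above can be sketched as follows. This is an illustration rather than a definitive usage pattern. It assumes that the cancellation-token parameter is optional on every overload (the examples above omit it for the path overloads) and that the string parameter of the stream overload is a file name, as it is for the byte-array overload. The file names are placeholders.

```csharp
using System.IO;
using LMKit.Document.Conversion;

// Create a small plain-text input so the sketch has something to read.
File.WriteAllText("notes.txt", "Meeting notes\nFirst item\nSecond item\n");

var converter = new DocumentToMarkdown();
var options = new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.TextExtraction
};

// Raw bytes: the file name is used only for MIME detection and logical naming.
byte[] bytes = File.ReadAllBytes("notes.txt");
var fromBytes = converter.Convert(bytes, "notes.txt", options);

// Stream: the caller keeps ownership, so the using declaration disposes it.
using var stream = File.OpenRead("notes.txt");
var fromStream = converter.Convert(stream, "notes.txt", options);

// Path in, Markdown file out.
converter.ConvertToFile("notes.txt", "notes.md", options);
```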

Observability. Subscribe to PageStarting and PageCompleted to report progress, cancel mid-flight, or log per-page diagnostics (strategy used, elapsed time, token count, quality score).
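A minimal sketch of progress reporting and mid-flight cancellation follows. It uses only the PageStarting members that appear elsewhere on this page (PageNumber, PageCount, PlannedStrategy, Cancel). The page budget maxPages is a hypothetical value for this illustration, and the way an aborted conversion surfaces to the caller is not specified here, so the catch block is an assumption rather than documented behavior.

```csharp
using System;
using LMKit.Document.Conversion;

var converter = new DocumentToMarkdown();
const int maxPages = 10; // hypothetical page budget for this illustration
int started = 0;

converter.PageStarting += (sender, e) =>
{
    started++;
    Console.WriteLine($"Starting page {e.PageNumber}/{e.PageCount} ({e.PlannedStrategy})");

    // Abort the conversion once the page budget is exceeded.
    if (e.PageNumber > maxPages)
    {
        e.Cancel = true;
    }
};

// PageCompleted fires after each page, whether it succeeded or failed.
converter.PageCompleted += (sender, e) => Console.WriteLine("Page finished.");

try
{
    var result = converter.Convert("report.pdf");
    Console.WriteLine($"Converted {result.Markdown?.Length ?? 0} characters.");
}
catch (OperationCanceledException)
{
    // Assumption: an aborted conversion may surface as a cancellation exception.
    Console.WriteLine("Conversion was cancelled.");
}
```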

Reuse and threading. The converter itself is stateless across calls, so one instance can drive an entire corpus sequentially. Page-level work runs one page at a time; to parallelise across documents, spin up one DocumentToMarkdown per worker (each carrying its own vision model reference if the VLM path is used).
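The one-converter-per-worker advice can be sketched with the thread-local overload of Parallel.ForEach, which creates one DocumentToMarkdown per worker and reuses it across that worker's documents. This is a sketch under stated assumptions: the corpus folder, the degree of parallelism, and the text-only strategy are illustrative choices, and the cancellation-token parameter of ConvertToFile is assumed to be optional.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using LMKit.Document.Conversion;

// Hypothetical input folder; an empty list is used when it does not exist.
string[] inputs = Directory.Exists("corpus")
    ? Directory.GetFiles("corpus", "*.pdf")
    : Array.Empty<string>();

var parallelOptions = new ParallelOptions { MaxDegreeOfParallelism = 4 };

Parallel.ForEach<string, DocumentToMarkdown>(
    inputs,
    parallelOptions,
    () => new DocumentToMarkdown(),            // one converter per worker
    (path, loopState, converter) =>
    {
        string output = Path.ChangeExtension(path, ".md");
        converter.ConvertToFile(path, output, new DocumentToMarkdownOptions
        {
            Strategy = DocumentToMarkdownStrategy.TextExtraction
        });
        return converter;                      // reused for this worker's next document
    },
    converter => { });                         // nothing to clean up in this sketch
```

If the VLM path is used instead, each worker's converter would carry its own vision model reference, as the paragraph above notes.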

Constructors

DocumentToMarkdown(): Initializes a new instance of the DocumentToMarkdown class without a user-supplied vision model. All strategies remain available: if a vision-language strategy is selected, the default model lightonocr-2:1b (see DefaultVisionModelId) is loaded lazily on first use.

DocumentToMarkdown(LM): Initializes a new instance of the DocumentToMarkdown class with the specified vision-language model. The model is used for all vision-based strategies and must support text generation and vision.

Fields

DefaultVisionModelId: Model ID loaded on demand when a vision-language strategy is requested but the converter was built without a user-supplied model.

Properties

HasVisionModel: Gets a value indicating whether the caller supplied a vision-language model at construction time. Even when this returns false, vision-based strategies remain available because the converter can lazily load DefaultVisionModelId on first use.

ResolvedVisionModel: Gets the vision-language model currently in use by this converter, or null if no vision-based strategy has been run yet. After the first call that requires a vision model, this returns either the user-supplied model or the lazily loaded default (DefaultVisionModelId).

VisionModel: Gets the vision-language model supplied by the caller, or null when the converter was built without one. The default model (see DefaultVisionModelId) is loaded lazily in that case and is exposed through ResolvedVisionModel after the first vision call.
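The relationship between these three properties can be illustrated on a converter built without a model. The values in the comments are what the property descriptions above imply, not the output of a recorded run.

```csharp
using System;
using LMKit.Document.Conversion;

var converter = new DocumentToMarkdown();

// No model was passed to the constructor.
Console.WriteLine(converter.HasVisionModel);              // expected: False
Console.WriteLine(converter.VisionModel == null);         // expected: True

// No vision-based strategy has run yet, so nothing has been resolved.
Console.WriteLine(converter.ResolvedVisionModel == null); // expected: True
```

After a first call that requires a vision model, ResolvedVisionModel would return the lazily loaded default, while HasVisionModel and VisionModel keep describing what the caller supplied.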

Methods

Convert(Attachment, DocumentToMarkdownOptions, CancellationToken): Converts the supplied Attachment to Markdown. All other overloads delegate to this method after wrapping their input in an attachment. Use this overload directly when you already hold a pre-built attachment (e.g. carried through a pipeline) and want to avoid re-wrapping it.

Convert(ImageBuffer, DocumentToMarkdownOptions, CancellationToken): Converts a single in-memory image to Markdown. The vision model is used by default; pair TextExtraction with OcrEngine to run traditional OCR instead.

Convert(byte[], string, DocumentToMarkdownOptions, CancellationToken): Converts in-memory document bytes to Markdown. The fileName is used only for MIME detection and logical naming; no file is written.

Convert(Stream, string, DocumentToMarkdownOptions, CancellationToken): Converts a document read from the specified stream to Markdown. The stream is consumed in full; the caller retains ownership and is responsible for disposal.

Convert(string, DocumentToMarkdownOptions, CancellationToken): Converts the document at inputPath to Markdown.

Convert(Uri, DocumentToMarkdownOptions, CancellationToken): Downloads the document at the specified uri and converts it to Markdown.

ConvertAsync(Attachment, DocumentToMarkdownOptions, CancellationToken): Asynchronously converts the supplied Attachment to Markdown.

ConvertAsync(ImageBuffer, DocumentToMarkdownOptions, CancellationToken): Asynchronously converts a single in-memory image to Markdown. As with the synchronous overload, the vision model is used by default.

ConvertAsync(byte[], string, DocumentToMarkdownOptions, CancellationToken): Asynchronously converts in-memory document bytes to Markdown.

ConvertAsync(Stream, string, DocumentToMarkdownOptions, CancellationToken): Asynchronously converts a document read from the specified stream to Markdown.

ConvertAsync(string, DocumentToMarkdownOptions, CancellationToken): Asynchronously converts the document at inputPath to Markdown.

ConvertAsync(Uri, DocumentToMarkdownOptions, CancellationToken): Asynchronously downloads the document at the specified uri and converts it to Markdown.

ConvertToFile(Attachment, string, DocumentToMarkdownOptions, CancellationToken): Converts the supplied Attachment and writes the Markdown to outputPath.

ConvertToFile(byte[], string, string, DocumentToMarkdownOptions, CancellationToken): Converts in-memory document bytes and writes the Markdown to outputPath.

ConvertToFile(Stream, string, string, DocumentToMarkdownOptions, CancellationToken): Converts a document read from a stream and writes the Markdown to outputPath.

ConvertToFile(string, string, DocumentToMarkdownOptions, CancellationToken): Converts the document at inputPath and writes the Markdown to outputPath.

ConvertToFileAsync(Attachment, string, DocumentToMarkdownOptions, CancellationToken): Asynchronously converts the supplied Attachment and writes the Markdown to outputPath.

ConvertToFileAsync(byte[], string, string, DocumentToMarkdownOptions, CancellationToken): Asynchronously converts in-memory document bytes and writes the Markdown to outputPath.

ConvertToFileAsync(Stream, string, string, DocumentToMarkdownOptions, CancellationToken): Asynchronously converts a document read from a stream and writes the Markdown to outputPath.

ConvertToFileAsync(string, string, DocumentToMarkdownOptions, CancellationToken): Asynchronously converts the document at inputPath and writes the Markdown to outputPath.

Events

PageCompleted: Raised after a page has been processed, whether it succeeded or failed.

PageStarting: Raised just before a page starts being processed. Set Cancel to true to abort the conversion.
