Class DocumentToMarkdownOptions

Namespace: LMKit.Document.Conversion
Assembly: LM-Kit.NET.dll

Options controlling the behavior of DocumentToMarkdown.

public sealed class DocumentToMarkdownOptions
Inheritance
DocumentToMarkdownOptions

Examples

Minimal: tune just the strategy and a page range.

using LMKit.Document.Conversion;

var options = new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.Hybrid,
    PageRange = "1-10"
};

Traditional OCR enrichment: charts, scans, and full-page fallback. Supplying an OcrEngine extends the text-extraction strategy so it (a) recognises embedded raster images on PDF pages, (b) falls back to full-page OCR when a PDF page has no text layer, and (c) turns standalone image files into Markdown without a vision model.

using LMKit.Document.Conversion;
using LMKit.Extraction.Ocr;

var options = new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.TextExtraction,
    OcrEngine = new LMKitOcr(),
    OcrImageParallelism = 4
};

Output shaping for LLM ingestion.

using LMKit.Document.Conversion;

var options = new DocumentToMarkdownOptions
{
    EmitFrontMatter = true,
    IncludePageSeparators = true,
    PageSeparatorFormat = "\n\n---\n\n<!-- Page {pageNumber} -->\n\n",
    PreferMarkdownTablesForNonNested = true,
    NormalizeWhitespace = true
};

Remarks

Instances of this class are plain data holders with sensible defaults. You typically construct one, tweak the values you care about, and pass it to one of the Convert / ConvertAsync overloads. Every property is independently optional: the defaults alone already produce LLM-ready Markdown for most inputs.

The properties fall into four groups:

Properties

EmitFrontMatter

Gets or sets a value indicating whether a YAML front-matter block is emitted at the top of the output containing conversion metadata (source name, page count, strategy, timestamp). Defaults to false. Enable when the Markdown feeds a static-site generator or a knowledge base that indexes front-matter fields.

EmlStripQuotes

Gets or sets a value indicating whether quoted and reply content is stripped from the email body when the input is an EML or MBOX message. Defaults to false: the full quote trail is retained. Enable when the downstream LLM only needs the newest reply.

IncludeEmptyParagraphs

Gets or sets a value indicating whether empty paragraphs are preserved when the input is a DOCX document. Defaults to false: blank paragraphs are collapsed so the Markdown is not padded with spurious whitespace.

IncludeHyperlinks

Gets or sets a value indicating whether hyperlinks are preserved when the input is a DOCX document. Defaults to true. Disable to keep only the visible anchor text.

IncludeImages

Gets or sets a value indicating whether image references are preserved when the input is a DOCX document. Defaults to true. Disable to drop the placeholder Markdown image syntax for a text-only output.

IncludePageSeparators

Gets or sets a value indicating whether a separator is inserted between pages in the aggregated Markdown output. Defaults to true. Set to false to produce a continuous stream of Markdown with no page boundaries.

IncludeTables

Gets or sets a value indicating whether tables are preserved when the input is a DOCX document. Defaults to true. When disabled, table cells are flattened into inline paragraphs, producing leaner Markdown for LLM ingestion at the cost of losing the row/column structure.

NormalizeWhitespace

Gets or sets a value indicating whether consecutive blank lines in the final output are collapsed into a single blank line. Defaults to true.
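The collapsing described above can be pictured with a small standalone sketch. It is not the library's implementation: the helper name CollapseBlankLines is hypothetical, and treating lines that hold only spaces or tabs as blank is an assumption.

```csharp
using System.Text.RegularExpressions;

public static class WhitespaceSketch
{
    // Hypothetical helper: collapses runs of two or more blank lines
    // into a single blank line.
    // Assumption: a line containing only spaces or tabs counts as blank.
    public static string CollapseBlankLines(string markdown)
    {
        // Normalise Windows line endings so the pattern only handles '\n'.
        string text = markdown.Replace("\r\n", "\n");

        // One newline ending the previous line, followed by at least two
        // blank lines, becomes exactly one blank line.
        return Regex.Replace(text, @"\n(?:[ \t]*\n){2,}", "\n\n");
    }
}
```

With this sketch, "a\n\n\n\nb" becomes "a\n\nb", while text that already uses single blank lines is left unchanged.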

OcrEngine

Gets or sets the optional OCR engine used to recover text from raster content when Strategy is TextExtraction. Defaults to null, in which case the text-extraction strategy only reads the embedded text layer.

OcrImageParallelism

Gets or sets the maximum number of concurrent OCR calls used when enriching text-extraction pages with their embedded raster images. Input is clamped to the [1, 12] range: values <= 1 run each image sequentially, values >= 12 are capped at 12 to avoid over-subscribing the OCR engine's internal worker pool. Defaults to 4.
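The clamping rule can be illustrated with a few lines of standalone C#. This is only a sketch of the documented [1, 12] bounds; the class and method names are hypothetical and say nothing about how the property setter is actually written.

```csharp
using System;

public static class ParallelismSketch
{
    // Bounds taken from the documented [1, 12] range.
    public const int Min = 1;
    public const int Max = 12;

    // Hypothetical helper mirroring the documented clamping behaviour.
    public static int ClampParallelism(int requested) =>
        Math.Clamp(requested, Min, Max);
}
```

Under this sketch, a request of 0 or -5 yields 1 (sequential OCR), 4 stays 4, and 64 yields 12.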

PageRange

Gets or sets the 1-based page range to convert (e.g. "1-5, 7, 9-12"). Use null, an empty string, or "*" to convert every page. Invalid page numbers are silently ignored.
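To make the range syntax concrete, here is a minimal standalone parser for strings such as "1-5, 7, 9-12". It is a sketch, not the library's parser: the Resolve helper is hypothetical, and trimming a range that runs past the last page (rather than dropping it entirely) is an assumption.

```csharp
using System;
using System.Collections.Generic;

public static class PageRangeSketch
{
    // Hypothetical helper: returns the sorted 1-based page numbers selected
    // by 'range' for a document that has 'pageCount' pages.
    public static List<int> Resolve(string range, int pageCount)
    {
        var pages = new SortedSet<int>();

        // null, empty, or "*" selects every page.
        if (string.IsNullOrEmpty(range) || range.Trim() == "*")
        {
            for (int p = 1; p <= pageCount; p++) pages.Add(p);
            return new List<int>(pages);
        }

        foreach (string token in range.Split(',', StringSplitOptions.RemoveEmptyEntries))
        {
            string[] bounds = token.Split('-');

            if (bounds.Length == 1 && int.TryParse(bounds[0].Trim(), out int single))
            {
                // Single page: kept only when it exists in the document.
                if (single >= 1 && single <= pageCount) pages.Add(single);
            }
            else if (bounds.Length == 2
                     && int.TryParse(bounds[0].Trim(), out int first)
                     && int.TryParse(bounds[1].Trim(), out int last))
            {
                // Assumption: a range is trimmed to the valid page interval.
                for (int p = Math.Max(first, 1); p <= Math.Min(last, pageCount); p++)
                    pages.Add(p);
            }
            // Any other token is skipped without raising an error.
        }

        return new List<int>(pages);
    }
}
```

For a 10-page document, Resolve("1-5, 7, 9-12", 10) yields pages 1, 2, 3, 4, 5, 7, 9, and 10 under these assumptions.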

PageSeparatorFormat

Gets or sets the separator format used between pages. The placeholder {pageNumber} is replaced with the 1-based page number. Defaults to a horizontal rule followed by an HTML comment carrying the page number. Ignored when IncludePageSeparators is false.
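The placeholder substitution can be sketched as follows. The JoinPages helper is hypothetical. The sketch assumes that the separator is emitted only between pages and that it carries the number of the page that follows it. Neither detail is confirmed by this page.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class PageSeparatorSketch
{
    // Hypothetical helper: joins per-page Markdown with a separator whose
    // {pageNumber} placeholder is replaced by a 1-based page number.
    public static string JoinPages(IReadOnlyList<string> pages, string separatorFormat)
    {
        var sb = new StringBuilder();

        for (int i = 0; i < pages.Count; i++)
        {
            if (i > 0)
            {
                // Assumption: the separator announces the page that follows it.
                string number = (i + 1).ToString(CultureInfo.InvariantCulture);
                sb.Append(separatorFormat.Replace("{pageNumber}", number));
            }

            sb.Append(pages[i]);
        }

        return sb.ToString();
    }
}
```

Calling JoinPages(new[] { "A", "B" }, "\n--{pageNumber}--\n") returns "A\n--2--\nB" in this sketch.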

PreferMarkdownTablesForNonNested

Gets or sets a value indicating whether non-nested HTML <table> fragments present in the converted Markdown are rewritten into GitHub-flavored Markdown table syntax. Tables that contain a nested <table>, or any cell using rowspan/colspan, are left as HTML because Markdown cannot express those layouts. Fenced code blocks are preserved verbatim so example HTML inside ```/~~~ fences is never altered. Defaults to false.
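The eligibility rule (no nested table, no rowspan or colspan) can be approximated with a short check. This is a rough sketch with a hypothetical helper name. It uses regular expressions rather than an HTML parser, so it would also reject a table whose cell text happens to contain "colspan=". The library's actual detection logic is not documented here.

```csharp
using System.Text.RegularExpressions;

public static class TableEligibilitySketch
{
    // Hypothetical helper: 'tableHtml' is one complete <table>...</table>
    // fragment. Returns true when it could be expressed as a Markdown table.
    public static bool CanRewriteAsMarkdown(string tableHtml)
    {
        // More than one opening <table> tag means a table is nested inside.
        int openTags = Regex.Matches(tableHtml, @"<table\b", RegexOptions.IgnoreCase).Count;
        if (openTags > 1) return false;

        // Any rowspan or colspan attribute forces the table to stay as HTML.
        return !Regex.IsMatch(tableHtml, @"\b(?:rowspan|colspan)\s*=", RegexOptions.IgnoreCase);
    }
}
```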

PreserveLineBreaks

Gets or sets a value indicating whether source line breaks are preserved when the input is a DOCX document. Defaults to true. When disabled, intra-paragraph line breaks are merged into flowing prose.
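The DOCX-specific switches (IncludeEmptyParagraphs, IncludeHyperlinks, IncludeImages, IncludeTables, PreserveLineBreaks) can be combined into a text-only preset. The fragment below is an illustrative configuration that uses only properties documented on this page. Whether this combination suits a given pipeline is a judgment call, and the fragment requires the LM-Kit.NET assembly to compile.

```csharp
using LMKit.Document.Conversion;

// Text-only preset for DOCX input: drop links, images, and table structure,
// and merge intra-paragraph line breaks into flowing prose.
var options = new DocumentToMarkdownOptions
{
    IncludeHyperlinks = false,      // keep only the visible anchor text
    IncludeImages = false,          // drop placeholder image syntax
    IncludeTables = false,          // flatten cells into inline paragraphs
    PreserveLineBreaks = false,     // merge line breaks within paragraphs
    IncludeEmptyParagraphs = false  // already the default, shown for clarity
};
```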

Strategy

Gets or sets the conversion strategy. Defaults to Hybrid, which adaptively picks the fastest strategy that can recover each page's content.

VlmImageDetail

Gets or sets the image-detail level forwarded to the vision model. Higher detail improves fidelity on dense pages (small fonts, tight tables, footnotes) but raises the token budget the VLM consumes per page. Defaults to High.

VlmMaximumCompletionTokens

Gets or sets the maximum number of completion tokens emitted by the vision model per page. Use -1 to disable the limit. Defaults to 3072.

VlmStripImageMarkup

Gets or sets a value indicating whether Markdown image references (the ![alt](url) image syntax) should be stripped from the VLM output. Defaults to true, which keeps the Markdown free of the placeholder image references VLMs sometimes emit for figures they cannot transcribe.

VlmStripStyleAttributes

Gets or sets a value indicating whether inline HTML style="..." attributes should be stripped from the VLM output. Defaults to true, which produces cleaner Markdown suited to LLM ingestion.
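The two VLM clean-up switches (VlmStripImageMarkup and VlmStripStyleAttributes) can be pictured as simple text filters. The sketch below uses hypothetical helper names and plain regular expressions. It is an approximation for illustration and does not reflect how the library performs the stripping.

```csharp
using System.Text.RegularExpressions;

public static class VlmCleanupSketch
{
    // Hypothetical helper: removes Markdown image references such as
    // ![alt](url). Surrounding whitespace is left untouched.
    public static string StripImageMarkup(string markdown) =>
        Regex.Replace(markdown, @"!\[[^\]]*\]\([^)]*\)", string.Empty);

    // Hypothetical helper: removes inline style="..." or style='...'
    // attributes, together with the whitespace that precedes them.
    public static string StripStyleAttributes(string markdown) =>
        Regex.Replace(
            markdown,
            @"\s+style\s*=\s*(?:""[^""]*""|'[^']*')",
            string.Empty,
            RegexOptions.IgnoreCase);
}
```

In this sketch, `<td style="color:red">x</td>` becomes `<td>x</td>`, and an image reference in the middle of a sentence is removed while the text around it stays in place.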
