Class DocumentToMarkdownOptions

Namespace: LMKit.Document.Conversion

Assembly: LM-Kit.NET.dll

Options controlling the behavior of DocumentToMarkdown.

public sealed class DocumentToMarkdownOptions

Inheritance: object → DocumentToMarkdownOptions

Inherited Members: object.Equals(object)

object.Equals(object, object)

object.GetHashCode()

object.GetType()

object.ReferenceEquals(object, object)

object.ToString()

Examples

Minimal: tune just the strategy and a page range.

using LMKit.Document.Conversion;

var options = new DocumentToMarkdownOptions
{
    Strategy  = DocumentToMarkdownStrategy.Hybrid, // the documented default, set explicitly here
    PageRange = "1-10"                             // 1-based page range
};

Traditional OCR enrichment: charts, scans, and full-page fallback. Supplying an OcrEngine extends the text-extraction strategy so it (a) recognises embedded raster images on PDF pages, (b) falls back to full-page OCR when a PDF page has no text layer, and (c) turns standalone image files into Markdown without a vision model.

using LMKit.Document.Conversion;
using LMKit.Extraction.Ocr;

var options = new DocumentToMarkdownOptions
{
    Strategy             = DocumentToMarkdownStrategy.TextExtraction,
    OcrEngine            = new LMKitOcr(), // enables raster-image OCR and the full-page fallback
    OcrImageParallelism  = 4               // maximum concurrent OCR calls (also the default)
};

Output shaping for LLM ingestion.

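using LMKit.Document.Conversion;

var options = new DocumentToMarkdownOptions
{
    EmitFrontMatter                  = true, // YAML front-matter block with conversion metadata
    IncludePageSeparators            = true,
    PageSeparatorFormat              = "\n\n---\n\n<!-- Page {pageNumber} -->\n\n", // {pageNumber} is 1-based
    PreferMarkdownTablesForNonNested = true, // rewrite simple HTML tables as Markdown tables
    NormalizeWhitespace              = true  // collapse consecutive blank lines
};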

Remarks

Instances of this class are plain data holders with sensible defaults. You typically construct one, tweak the values you care about, and pass it to one of the Convert / ConvertAsync overloads. Every property is independently optional: the defaults alone already produce LLM-ready Markdown for most inputs.

The properties fall into four groups:

Strategy & pagination: Strategy, PageRange.
Traditional OCR (image inputs and raster enrichment of PDF pages): OcrEngine, OcrLanguages, OcrImageParallelism.
Vision-language OCR: VlmImageDetail, VlmMaximumCompletionTokens, VlmStripImageMarkup, VlmStripStyleAttributes.
Output shaping & per-format knobs: IncludePageSeparators, PageSeparatorFormat, EmitFrontMatter, NormalizeWhitespace, PreferMarkdownTablesForNonNested, plus pass-through options for DOCX input (IncludeTables, IncludeImages, IncludeHyperlinks, IncludeEmptyParagraphs, PreserveLineBreaks) and for email input (EmlStripQuotes).

Properties

EmitFrontMatter: Gets or sets a value indicating whether a YAML front-matter block is emitted at the top of the output containing conversion metadata (source name, page count, strategy, timestamp). Defaults to false. Enable when the Markdown feeds a static-site generator or a knowledge base that indexes front-matter fields.

EmlStripQuotes: Gets or sets a value indicating whether quoted and reply content is stripped from the email body when the input is an EML or MBOX message. Defaults to false: the full quote trail is retained. Enable when the downstream LLM only needs the newest reply.

IncludeEmptyParagraphs: Gets or sets a value indicating whether empty paragraphs are preserved when the input is a DOCX document. Defaults to false: blank paragraphs are collapsed so the Markdown is not padded with spurious whitespace.

IncludeHyperlinks: Gets or sets a value indicating whether hyperlinks are preserved when the input is a DOCX document. Defaults to true. Disable to keep only the visible anchor text.

IncludeImages: Gets or sets a value indicating whether image references are preserved when the input is a DOCX document. Defaults to true. Disable to drop the placeholder Markdown image syntax for a text-only output.

IncludePageSeparators: Gets or sets a value indicating whether a separator is inserted between pages in the aggregated Markdown output. Defaults to true. Set to false to produce a continuous stream of Markdown with no page boundaries.

IncludeTables: Gets or sets a value indicating whether tables are preserved when the input is a DOCX document. Defaults to true. When disabled, table cells are flattened into inline paragraphs, producing leaner Markdown for LLM ingestion at the cost of losing the row/column structure.

NormalizeWhitespace: Gets or sets a value indicating whether consecutive blank lines in the final output are collapsed into a single blank line. Defaults to true.
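The collapsing behavior can be pictured with a small standalone sketch. CollapseBlankLines is a hypothetical helper, not part of LM-Kit.NET, and it assumes LF line endings and truly empty lines; the library's actual normalization may differ in those details.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not the library's implementation): three or more
// consecutive newlines mean two or more blank lines, so they are reduced
// to exactly two newlines, i.e. a single blank line.
static string CollapseBlankLines(string markdown) =>
    Regex.Replace(markdown, @"\n{3,}", "\n\n");

string raw = "# Title\n\n\n\nFirst paragraph.\n\nSecond paragraph.";
Console.WriteLine(CollapseBlankLines(raw));
```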

OcrEngine: Gets or sets the optional OCR engine used to recover text from raster content when Strategy is TextExtraction. Defaults to null, in which case the text-extraction strategy only reads the embedded text layer.

OcrImageParallelism: Gets or sets the maximum number of concurrent OCR calls used when enriching text-extraction pages with their embedded raster images. Input is clamped to the [1, 12] range: values <= 1 run each image sequentially, values >= 12 cap to 12 to avoid over-subscribing the OCR engine's internal worker pool. Defaults to 4.
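The clamping rule can be sketched in a few lines. ClampOcrParallelism is a hypothetical helper written for illustration only; it mirrors the documented [1, 12] range and is not the library's own code.

```csharp
using System;

// Hypothetical helper (not part of LM-Kit.NET): mirrors the documented
// clamping of OcrImageParallelism to the [1, 12] range.
static int ClampOcrParallelism(int requested) => Math.Clamp(requested, 1, 12);

Console.WriteLine(ClampOcrParallelism(0));   // 1  (images are processed sequentially)
Console.WriteLine(ClampOcrParallelism(4));   // 4  (the documented default passes through)
Console.WriteLine(ClampOcrParallelism(64));  // 12 (capped at the upper bound)
```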

OcrLanguages: Gets or sets the languages the OCR engine should recognize when OcrEngine is supplied. Defaults to null, in which case the OCR engine uses its own configured default language (or its language detection, when enabled). Supply one or more Language values to bias OCR toward the document's language, which improves recognition accuracy. Ignored when OcrEngine is null or when a vision-language strategy is used.

PageRange: Gets or sets the 1-based page range to convert (e.g. "1-5, 7, 9-12"). Use null, an empty string, or "*" to convert every page. Invalid page numbers are silently ignored.
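To make the range syntax concrete, here is a minimal standalone sketch of how such an expression could be expanded. ExpandPageRange and its pageCount parameter are hypothetical and not part of LM-Kit.NET. The sketch also assumes that "invalid" covers both unparsable tokens and page numbers outside the document, which the reference text does not spell out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not the library's parser): expands a 1-based range
// expression such as "1-5, 7, 9-12" into the set of selected page numbers.
static SortedSet<int> ExpandPageRange(string range, int pageCount)
{
    var pages = new SortedSet<int>();

    // null, an empty string, or "*" selects every page.
    if (string.IsNullOrWhiteSpace(range) || range.Trim() == "*")
    {
        for (int p = 1; p <= pageCount; p++) pages.Add(p);
        return pages;
    }

    foreach (var token in range.Split(',', StringSplitOptions.RemoveEmptyEntries))
    {
        var bounds = token.Split('-');
        if (bounds.Length == 1 && int.TryParse(bounds[0], out int single))
        {
            // A single page: keep it only if it exists in the document.
            if (single >= 1 && single <= pageCount) pages.Add(single);
        }
        else if (bounds.Length == 2
                 && int.TryParse(bounds[0], out int first)
                 && int.TryParse(bounds[1], out int last))
        {
            // A span: keep the part that overlaps the document.
            for (int p = Math.Max(first, 1); p <= Math.Min(last, pageCount); p++)
                pages.Add(p);
        }
        // Any other token is skipped without raising an error.
    }
    return pages;
}

// For a 10-page document, pages 11 and 12 of the last span are dropped.
Console.WriteLine(string.Join(", ", ExpandPageRange("1-3, 7, 9-12", 10))); // 1, 2, 3, 7, 9, 10
```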

PageSeparatorFormat: Gets or sets the separator format used between pages. The placeholder {pageNumber} is replaced with the 1-based page number. Defaults to a horizontal rule followed by an HTML comment carrying the page number. Ignored when IncludePageSeparators is false.
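The placeholder substitution can be illustrated with a short standalone sketch. RenderSeparator is a hypothetical helper, not a library method, and the format string is the one used in the Examples section rather than a confirmed copy of the built-in default.

```csharp
using System;
using System.Globalization;

// Hypothetical helper (not part of LM-Kit.NET): fills the {pageNumber}
// placeholder of a separator template with a 1-based page number.
static string RenderSeparator(string format, int pageNumber) =>
    format.Replace("{pageNumber}", pageNumber.ToString(CultureInfo.InvariantCulture));

string format = "\n\n---\n\n<!-- Page {pageNumber} -->\n\n";

// Trim() only removes the surrounding blank lines for display.
Console.WriteLine(RenderSeparator(format, 3).Trim());
```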

PreferMarkdownTablesForNonNested: Gets or sets a value indicating whether non-nested HTML <table> fragments present in the converted Markdown are rewritten into GitHub-flavored Markdown table syntax. Tables that contain a nested <table>, or any cell using rowspan/colspan, are left as HTML because Markdown cannot express those layouts. Fenced code blocks are preserved verbatim so example HTML inside ```/~~~ fences is never altered. Defaults to false.
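The eligibility rule (no nested table, no rowspan/colspan) can be approximated with a small check. CanRewriteAsMarkdownTable is a hypothetical helper for illustration; it uses naive regular expressions rather than an HTML parser, and it is not how the library itself decides.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of LM-Kit.NET): returns true when an HTML
// table fragment has no nested <table> and no rowspan/colspan attributes,
// i.e. when its layout can be expressed as a Markdown table.
static bool CanRewriteAsMarkdownTable(string tableHtml)
{
    // More than one opening <table> tag means a table is nested inside.
    int tableTags = Regex.Matches(tableHtml, @"<table\b", RegexOptions.IgnoreCase).Count;

    // Any rowspan/colspan attribute rules the table out.
    bool usesSpans = Regex.IsMatch(tableHtml, @"\b(rowspan|colspan)\s*=", RegexOptions.IgnoreCase);

    return tableTags == 1 && !usesSpans;
}

Console.WriteLine(CanRewriteAsMarkdownTable("<table><tr><td>a</td><td>b</td></tr></table>"));      // True
Console.WriteLine(CanRewriteAsMarkdownTable("<table><tr><td colspan=\"2\">a</td></tr></table>")); // False
```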

PreserveLineBreaks: Gets or sets a value indicating whether source line breaks are preserved when the input is a DOCX document. Defaults to true. When disabled, intra-paragraph line breaks are merged into flowing prose.

Strategy: Gets or sets the conversion strategy. Defaults to Hybrid, which adaptively picks the fastest strategy that can recover each page's content.

VlmImageDetail: Gets or sets the image-detail level forwarded to the vision model. Higher detail improves fidelity on dense pages (small fonts, tight tables, footnotes) but raises the token budget the VLM consumes per page. Defaults to High.

VlmMaximumCompletionTokens: Gets or sets the maximum number of completion tokens emitted by the vision model per page. Use -1 to disable the limit. Defaults to 3072.

VlmStripImageMarkup: Gets or sets a value indicating whether Markdown image references (the ![alt](url) syntax) are stripped from the VLM output. Defaults to true, which keeps the Markdown free of the placeholder image references VLMs sometimes emit for figures they cannot transcribe.

VlmStripStyleAttributes: Gets or sets a value indicating whether inline HTML style="..." attributes should be stripped from the VLM output. Defaults to true, which produces cleaner Markdown suited to LLM ingestion.
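The two VLM clean-up switches can be pictured with a rough standalone sketch. StripImageMarkup and StripStyleAttributes are hypothetical helpers, not library methods, and the regular expressions are simplifications that only cover the common forms of each construct.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: removes Markdown image references of the form ![alt](url).
static string StripImageMarkup(string markdown) =>
    Regex.Replace(markdown, @"!\[[^\]]*\]\([^)]*\)", "");

// Hypothetical helper: removes inline style="..." (or style='...') attributes
// together with the whitespace that precedes them.
static string StripStyleAttributes(string markdown) =>
    Regex.Replace(markdown, @"\s+style\s*=\s*(""[^""]*""|'[^']*')", "", RegexOptions.IgnoreCase);

string vlmOutput = "<td style=\"color:red\" align=\"left\">Total</td> ![Figure 1](fig1.png)";
Console.WriteLine(StripStyleAttributes(StripImageMarkup(vlmOutput)));
```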
