Class DocumentToMarkdownOptions
- Namespace
- LMKit.Document.Conversion
- Assembly
- LM-Kit.NET.dll
Options controlling the behavior of DocumentToMarkdown.
public sealed class DocumentToMarkdownOptions
- Inheritance
- object
DocumentToMarkdownOptions
- Inherited Members
Examples
Minimal: tune just the strategy and a page range.
using LMKit.Document.Conversion;
var options = new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.Hybrid,
    PageRange = "1-10"
};
Traditional OCR enrichment: charts, scans, and full-page fallback. Supplying an OcrEngine extends the text-extraction strategy so it (a) recognises embedded raster images on PDF pages, (b) falls back to full-page OCR when a PDF page has no text layer, and (c) turns standalone image files into Markdown without a vision model.
using LMKit.Document.Conversion;
using LMKit.Extraction.Ocr;
var options = new DocumentToMarkdownOptions
{
    Strategy = DocumentToMarkdownStrategy.TextExtraction,
    OcrEngine = new LMKitOcr(),
    OcrImageParallelism = 4
};
Output shaping for LLM ingestion.
using LMKit.Document.Conversion;
var options = new DocumentToMarkdownOptions
{
    EmitFrontMatter = true,
    IncludePageSeparators = true,
    PageSeparatorFormat = "\n\n---\n\n<!-- Page {pageNumber} -->\n\n",
    PreferMarkdownTablesForNonNested = true,
    NormalizeWhitespace = true
};
Remarks
Instances of this class are plain data holders with sensible defaults. You typically
construct one, tweak the values you care about, and pass it to one of the
Convert / ConvertAsync overloads. Every property is independently
optional: the defaults alone already produce LLM-ready Markdown for most inputs.
The properties fall into four groups:
- Strategy & pagination: Strategy, PageRange.
- Traditional OCR (image inputs and raster enrichment of PDF pages): OcrEngine, OcrImageParallelism.
- Vision-language OCR: VlmImageDetail, VlmMaximumCompletionTokens, VlmStripImageMarkup, VlmStripStyleAttributes.
- Output shaping & per-format knobs: IncludePageSeparators, PageSeparatorFormat, EmitFrontMatter, NormalizeWhitespace, PreferMarkdownTablesForNonNested, plus format-specific options for DOCX (IncludeTables, IncludeImages, IncludeHyperlinks, IncludeEmptyParagraphs, PreserveLineBreaks) and email (EmlStripQuotes).
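The examples in this page do not touch the vision-language or per-format groups, so the fragment below sketches how they might be set together. It uses only property names listed on this page; the property types (bool and int) are inferred from the documented defaults and are an assumption, and VlmImageDetail is left out because its enum type is not named here.

```csharp
using LMKit.Document.Conversion;

// Sketch only: property types are inferred from the documented defaults.
var options = new DocumentToMarkdownOptions
{
    // Vision-language OCR group
    VlmMaximumCompletionTokens = -1,  // -1 disables the per-page limit (default is 3072)
    VlmStripImageMarkup = false,      // keep image references emitted by the VLM
    VlmStripStyleAttributes = true,   // documented default, shown for clarity

    // DOCX options
    IncludeTables = false,            // flatten table cells into inline paragraphs
    IncludeImages = false,            // drop placeholder image syntax
    PreserveLineBreaks = false,       // merge intra-paragraph line breaks into prose

    // Email option
    EmlStripQuotes = true             // keep only the newest reply
};
```

The DOCX and email settings only apply when the input is of the matching format, so setting them alongside the VLM options should be harmless for other inputs.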
Properties
- EmitFrontMatter
Gets or sets a value indicating whether a YAML front-matter block is emitted at the top of the output containing conversion metadata (source name, page count, strategy, timestamp). Defaults to
false. Enable when the Markdown feeds a static-site generator or a knowledge base that indexes front-matter fields.
- EmlStripQuotes
Gets or sets a value indicating whether quoted and reply content is stripped from the email body when the input is an EML or MBOX message. Defaults to
false: the full quote trail is retained. Enable when the downstream LLM only needs the newest reply.
- IncludeEmptyParagraphs
Gets or sets a value indicating whether empty paragraphs are preserved when the input is a DOCX document. Defaults to
false: blank paragraphs are collapsed so the Markdown is not padded with spurious whitespace.
- IncludeHyperlinks
Gets or sets a value indicating whether hyperlinks are preserved when the input is a DOCX document. Defaults to
true. Disable to keep only the visible anchor text.
- IncludeImages
Gets or sets a value indicating whether image references are preserved when the input is a DOCX document. Defaults to
true. Disable to drop the placeholder Markdown image syntax for a text-only output.
- IncludePageSeparators
Gets or sets a value indicating whether a separator is inserted between pages in the aggregated Markdown output. Defaults to
true. Set to false to produce a continuous stream of Markdown with no page boundaries.
- IncludeTables
Gets or sets a value indicating whether tables are preserved when the input is a DOCX document. Defaults to
true. When disabled, table cells are flattened into inline paragraphs, producing leaner Markdown for LLM ingestion at the cost of losing the row/column structure.
- NormalizeWhitespace
Gets or sets a value indicating whether consecutive blank lines in the final output are collapsed into a single blank line. Defaults to
true.
- OcrEngine
Gets or sets the optional OCR engine used to recover text from raster content when Strategy is TextExtraction. Defaults to
null, in which case the text-extraction strategy only reads the embedded text layer.
- OcrImageParallelism
Gets or sets the maximum number of concurrent OCR calls used when enriching text-extraction pages with their embedded raster images. Input is clamped to the
[1, 12] range: values <= 1 run each image sequentially, values >= 12 cap to 12 to avoid over-subscribing the OCR engine's internal worker pool. Defaults to
4.
- PageRange
Gets or sets the 1-based page range to convert (e.g.
"1-5, 7, 9-12"). Use null, an empty string, or "*" to convert every page. Invalid page numbers are silently ignored.
- PageSeparatorFormat
Gets or sets the separator format used between pages. The placeholder
{pageNumber} is replaced with the 1-based page number. Defaults to a horizontal rule followed by an HTML comment carrying the page number. Ignored when IncludePageSeparators is false.
- PreferMarkdownTablesForNonNested
Gets or sets a value indicating whether non-nested HTML
<table> fragments present in the converted Markdown are rewritten into GitHub-flavored Markdown table syntax. Tables that contain a nested <table>, or any cell using rowspan/colspan, are left as HTML because Markdown cannot express those layouts. Fenced code blocks are preserved verbatim so example HTML inside ``` / ~~~ fences is never altered. Defaults to false.
- PreserveLineBreaks
Gets or sets a value indicating whether source line breaks are preserved when the input is a DOCX document. Defaults to
true. When disabled, intra-paragraph line breaks are merged into flowing prose.
- Strategy
Gets or sets the conversion strategy. Defaults to Hybrid, which adaptively picks the fastest strategy that can recover each page's content.
- VlmImageDetail
Gets or sets the image-detail level forwarded to the vision model. Higher detail improves fidelity on dense pages (small fonts, tight tables, footnotes) but raises the token budget the VLM consumes per page. Defaults to High.
- VlmMaximumCompletionTokens
Gets or sets the maximum number of completion tokens emitted by the vision model per page. Use
-1 to disable the limit. Defaults to 3072.
- VlmStripImageMarkup
Gets or sets a value indicating whether Markdown image references (the
![alt](url) image syntax) should be stripped from the VLM output. Defaults to true, which keeps the Markdown free of the placeholder image references VLMs sometimes emit for figures they cannot transcribe.
- VlmStripStyleAttributes
Gets or sets a value indicating whether inline HTML
style="..." attributes should be stripped from the VLM output. Defaults to true, which produces cleaner Markdown suited to LLM ingestion.
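Several of the property descriptions above specify small, mechanical behaviors: the [1, 12] clamp on OcrImageParallelism, the {pageNumber} placeholder in PageSeparatorFormat, the blank-line collapsing done by NormalizeWhitespace, and the PageRange syntax. The self-contained sketch below illustrates those documented semantics. It is not LM-Kit's implementation: the class and method names are hypothetical, line endings are assumed to be "\n", and treating out-of-bounds pages as "silently ignored" by clipping ranges to the page count is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helpers that mirror the documented option semantics.
// They do not call into LM-Kit and are not its internal code.
public static class OptionSemantics
{
    // OcrImageParallelism: input is clamped to the [1, 12] range.
    public static int ClampParallelism(int requested) =>
        Math.Clamp(requested, 1, 12);

    // PageSeparatorFormat: "{pageNumber}" becomes the 1-based page number.
    public static string RenderSeparator(string format, int pageNumber) =>
        format.Replace("{pageNumber}", pageNumber.ToString());

    // NormalizeWhitespace: runs of blank lines collapse into one blank line.
    // Assumes "\n" line endings; whitespace-only lines are not handled.
    public static string CollapseBlankLines(string markdown) =>
        Regex.Replace(markdown, @"\n{3,}", "\n\n");

    // PageRange: null, "" or "*" selects every page; otherwise a
    // comma-separated list of 1-based pages and inclusive ranges.
    // Assumption: unparsable parts and out-of-bounds pages are skipped.
    public static SortedSet<int> ParsePageRange(string? range, int pageCount)
    {
        var pages = new SortedSet<int>();
        if (string.IsNullOrEmpty(range) || range.Trim() == "*")
        {
            for (int p = 1; p <= pageCount; p++) pages.Add(p);
            return pages;
        }

        foreach (var part in range.Split(',', StringSplitOptions.RemoveEmptyEntries))
        {
            var bounds = part.Split('-');
            if (bounds.Length == 1 && int.TryParse(bounds[0].Trim(), out int single))
            {
                if (single >= 1 && single <= pageCount) pages.Add(single);
            }
            else if (bounds.Length == 2
                     && int.TryParse(bounds[0].Trim(), out int first)
                     && int.TryParse(bounds[1].Trim(), out int last))
            {
                // Clip the range to the pages that actually exist.
                for (int p = Math.Max(first, 1); p <= Math.Min(last, pageCount); p++)
                    pages.Add(p);
            }
            // Anything else is ignored without raising an error.
        }
        return pages;
    }
}
```

Under these assumptions, OptionSemantics.ParsePageRange("1-5, 7, 9-12", 10) selects pages 1 to 5, 7, 9 and 10, and OptionSemantics.ClampParallelism(64) returns 12. The library's actual handling of edge cases (reversed ranges, whitespace-only lines, "\r\n" endings) may differ.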