
Enum TextOutputMode

Namespace
LMKit.Document.Layout
Assembly
LM-Kit.NET.dll

Controls how extracted text is aggregated and formatted when exported as plain text.

public enum TextOutputMode

Fields

RawLines = 0

Output one line per detected text line with no layout analysis.

GridAligned = 1

Preserve approximate column alignment and indentation (grid-style spacing).

ParagraphFlow = 2

Group lines into paragraphs ordered for reading; insert blank lines between paragraphs.

Structured = 3

Preserve both paragraph flow and tabular structure; optimized for semantic extraction.

Markdown = 4

Render the page as Markdown, converting detected layout primitives into standard Markdown syntax: headings become #/## sections (driven by the IsHeader flag and font-size ratios), bullet and numbered items become lists, and tabular blocks are reconstructed as pipe tables. Smart quotes, typographic dashes, ellipses, and invisible control characters are normalized to ASCII equivalents, and consecutive blank lines are collapsed. Recommended for LLM prompting, Markdown-based chunking pipelines, and documentation export.

Auto = 5

Automatically evaluate page structure to choose the optimal formatting strategy.

Examples

Example: Extract text using different output modes.

using LMKit.Document.Layout;
using LMKit.Document.Pdf;

PdfInfo info = PdfInfo.Load("document.pdf");
PageElement page = info.Pages[0].GetLayout();

// Raw lines: one line per detected line, no formatting.
string raw = page.GetText(TextOutputMode.RawLines);

// Grid-aligned: preserves column indentation.
string grid = page.GetText(TextOutputMode.GridAligned);

// Paragraph flow: groups lines into readable paragraphs.
string flow = page.GetText(TextOutputMode.ParagraphFlow);

// Structured: best for RAG pipelines and semantic chunking.
string structured = page.GetText(TextOutputMode.Structured);

// Markdown: headings, lists, and tables rendered as Markdown.
string markdown = page.GetText(TextOutputMode.Markdown);

// Auto: let the engine pick the best mode for the page.
string auto = page.GetText(TextOutputMode.Auto);

Remarks

  • RawLines – one logical line per detected line; no grid/column analysis. Words are joined with single spaces; indentation and column alignment are not preserved.
  • GridAligned – preserves approximate columns and indentation by inserting spaces based on word positions within the page bounds, and inserts 0–5 blank lines between text lines based on the measured inter-line spacing.
  • ParagraphFlow – groups lines into paragraphs in reading order and separates paragraphs with a blank line; best for natural reading.
  • Structured – maintains paragraph boundaries and tabular layouts as logical blocks; ideal for RAG pipelines where semantic chunking and context preservation are critical.
  • Markdown – renders the page as Markdown: detected headings become #/## sections, bullet and numbered items become lists, and tables are emitted as Markdown pipe tables. Smart quotes, typographic dashes, ellipses, and invisible control characters are normalized to ASCII equivalents, and consecutive blank lines are collapsed. Best suited for downstream Markdown-aware pipelines such as LLM prompting, documentation generation, and Markdown-based chunking.
  • Auto – inspects document structure and selects the most suitable mode depending on layout characteristics and intended use case.
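
The typographic cleanup described for Markdown mode can be pictured with a small standalone sketch. This is not LM-Kit's implementation: MarkdownNormalizationSketch and Normalize are hypothetical names, and the exact character mappings (for example, both dash types becoming a single hyphen) are assumptions made for illustration. It also assumes the input already uses \n line endings.

```csharp
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of LM-Kit. It approximates the kind of
// cleanup the Markdown mode is described as performing.
static class MarkdownNormalizationSketch
{
    public static string Normalize(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            switch (c)
            {
                // Curly single quotes become an ASCII apostrophe.
                case '\u2018': case '\u2019': sb.Append('\''); break;
                // Curly double quotes become an ASCII double quote.
                case '\u201C': case '\u201D': sb.Append('"'); break;
                // En and em dashes become a hyphen (assumed mapping).
                case '\u2013': case '\u2014': sb.Append('-'); break;
                // Horizontal ellipsis becomes three periods.
                case '\u2026': sb.Append("..."); break;
                // Zero-width characters and the byte order mark are dropped.
                case '\u200B': case '\u200C': case '\u200D': case '\uFEFF': break;
                default:
                    // Keep everything except control characters; newline and tab stay.
                    if (!char.IsControl(c) || c == '\n' || c == '\t') sb.Append(c);
                    break;
            }
        }
        // Three or more consecutive newlines collapse to a single blank line.
        return Regex.Replace(sb.ToString(), @"\n{3,}", "\n\n");
    }
}
```

A helper of this shape can also be applied to text coming from other sources so that it compares cleanly against Markdown-mode output.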

All modes operate in a normalized "view space" (deskewed, de-rotated) for analysis, then return text in plain UTF-8 with Unix line endings; trailing whitespace is trimmed.
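
When comparing GetText output against text from another source (a fixture file saved with Windows line endings, for instance), it can help to bring that text to the same conventions first. The sketch below is one reading of the contract above, not library code: PlainTextContractSketch and ToUnixTrimmed are hypothetical names, and trimming both at the end of each line and at the end of the text is an assumption, since the page does not say which is meant.

```csharp
using System.Linq;

// Hypothetical helper, not part of LM-Kit.
static class PlainTextContractSketch
{
    public static string ToUnixTrimmed(string text)
    {
        // Convert Windows (\r\n) and old Mac (\r) line endings to Unix (\n).
        string unix = text.Replace("\r\n", "\n").Replace('\r', '\n');

        // Trim trailing spaces and tabs on each line; leading whitespace is
        // kept so grid-style indentation survives.
        var lines = unix.Split('\n').Select(line => line.TrimEnd(' ', '\t'));

        // Trim trailing whitespace (including blank lines) at the end of the text.
        return string.Join("\n", lines).TrimEnd();
    }
}
```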
