Enum TextOutputMode
Controls how extracted text is aggregated and formatted when exported as plain text.
public enum TextOutputMode
Fields
RawLines = 0
Output one line per detected text line with no layout analysis.
GridAligned = 1
Preserve approximate column alignment and indentation (grid-style spacing).
ParagraphFlow = 2
Group lines into paragraphs ordered for reading; insert blank lines between paragraphs.
Structured = 3
Preserve both paragraph flow and tabular structure; optimized for semantic extraction.
Markdown = 4
Render the page as Markdown, converting detected layout primitives into standard Markdown syntax: headings become #/## sections (driven by the IsHeader flag and font-size ratios), bullet and numbered items become lists, and tabular blocks are reconstructed as pipe tables. Smart quotes, typographic dashes, ellipses, and invisible control characters are normalized to ASCII equivalents, and consecutive blank lines are collapsed. Recommended for LLM prompting, Markdown-based chunking pipelines, and documentation export.
Auto = 5
Automatically evaluate page structure to choose the optimal formatting strategy.
Examples
Example: Extract text using different output modes.
using LMKit.Document.Layout;
using LMKit.Document.Pdf;
PdfInfo info = PdfInfo.Load("document.pdf");
PageElement page = info.Pages[0].GetLayout();
// Raw lines: one line per detected line, no formatting.
string raw = page.GetText(TextOutputMode.RawLines);
// Grid-aligned: preserves column indentation.
string grid = page.GetText(TextOutputMode.GridAligned);
// Paragraph flow: groups lines into readable paragraphs.
string flow = page.GetText(TextOutputMode.ParagraphFlow);
// Structured: best for RAG pipelines and semantic chunking.
string structured = page.GetText(TextOutputMode.Structured);
// Markdown: headings, lists, and tables rendered as Markdown.
string markdown = page.GetText(TextOutputMode.Markdown);
// Auto: let the engine pick the best mode for the page.
string auto = page.GetText(TextOutputMode.Auto);
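Because the members have fixed names and numeric values, an application can read the output mode from a configuration string. The following is a minimal sketch, not part of the library: `OutputModeSetting` and its `Parse` method are hypothetical names, and the enum is redeclared locally (mirroring the documented members) only so the snippet compiles on its own. It falls back to `Auto` when the input is missing, unknown, or a number outside the defined range.

```csharp
using System;

// Local stand-in mirroring the documented members and values, so the sketch
// compiles without the library. In a real project, use the library's enum instead.
public enum TextOutputMode
{
    RawLines = 0,
    GridAligned = 1,
    ParagraphFlow = 2,
    Structured = 3,
    Markdown = 4,
    Auto = 5
}

// Hypothetical helper (not a library API).
public static class OutputModeSetting
{
    // Maps a configuration string such as "markdown" or "2" to a mode.
    // Returns Auto for null, unrecognized names, or undefined numeric values.
    public static TextOutputMode Parse(string value)
    {
        if (Enum.TryParse(value, true, out TextOutputMode mode)
            && Enum.IsDefined(typeof(TextOutputMode), mode))
        {
            return mode;
        }

        return TextOutputMode.Auto;
    }
}
```

The `Enum.IsDefined` check matters because `Enum.TryParse` accepts any integer string, including values such as "7" that do not correspond to a documented member. The result can then be passed to `page.GetText(...)` as in the example above.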
Remarks
- RawLines – one logical line per detected line; no grid/column analysis. Words are joined with single spaces; indentation and column alignment are not preserved.
- GridAligned – preserves approximate columns/indentation by inserting spaces based on word positions within the page bounds; adds 0–5 blank lines based on measured inter-line spacing.
- ParagraphFlow – groups lines into paragraphs in reading order and separates paragraphs with a blank line; best for natural reading.
- Structured – maintains paragraph boundaries and tabular layouts as logical blocks; ideal for RAG pipelines where semantic chunking and context preservation are critical.
- Markdown – renders the page as Markdown: detected headings become #/## sections, bullet and numbered items become lists, and tables are emitted as Markdown pipe tables. Smart quotes, typographic dashes, and invisible control characters are normalized to ASCII equivalents. Best suited for downstream Markdown-aware pipelines such as LLM prompting, documentation generation, and Markdown-based chunking.
- Auto – inspects document structure and selects the most suitable mode depending on layout characteristics and intended use case.
All modes operate in a normalized "view space" (deskewed, de-rotated) for analysis, then return text in plain UTF-8 with Unix line endings; trailing whitespace is trimmed.
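Since the returned text uses Unix line endings and ParagraphFlow separates paragraphs with a blank line, the output can be split into paragraphs with plain string handling. The sketch below is one way to do that under those assumptions; `ParagraphSplitter` is a hypothetical helper, not a library type, and it treats any run of empty or whitespace-only lines as a single paragraph break.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not a library API).
public static class ParagraphSplitter
{
    // Splits text into paragraphs on blank lines, assuming "\n" line endings.
    // Lines inside a paragraph are kept and rejoined with "\n".
    public static List<string> Split(string text)
    {
        var paragraphs = new List<string>();
        var current = new List<string>();

        foreach (string line in text.Split('\n'))
        {
            if (line.Trim().Length == 0)
            {
                // Blank line: close the paragraph in progress, if any.
                if (current.Count > 0)
                {
                    paragraphs.Add(string.Join("\n", current));
                    current.Clear();
                }
            }
            else
            {
                current.Add(line);
            }
        }

        // Flush the final paragraph when the text does not end with a blank line.
        if (current.Count > 0)
        {
            paragraphs.Add(string.Join("\n", current));
        }

        return paragraphs;
    }
}
```

A typical call would be `ParagraphSplitter.Split(page.GetText(TextOutputMode.ParagraphFlow))`, with each returned string then treated as one chunk. The same helper is less appropriate for GridAligned output, where blank lines reflect measured inter-line spacing rather than paragraph boundaries.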