Enum TextOutputMode
Controls how extracted text is aggregated and formatted when exported as plain text.
public enum TextOutputMode
Fields
RawLines = 0
Output one line per detected text line with no layout analysis.
GridAligned = 1
Preserve approximate column alignment and indentation (grid-style spacing).
ParagraphFlow = 2
Group lines into paragraphs ordered for reading; insert blank lines between paragraphs.
Structured = 3
Preserve both paragraph flow and tabular structure; optimized for semantic extraction.
Auto = 4
Automatically evaluate page structure to choose the optimal formatting strategy.
Examples
Example: Extract text using different output modes.
using LMKit.Document.Layout;
using LMKit.Document.Pdf;
PdfInfo info = PdfInfo.Load("document.pdf");
PageElement page = info.Pages[0].GetLayout();
// Raw lines: one line per detected line, no formatting.
string raw = page.GetText(TextOutputMode.RawLines);
// Grid-aligned: preserves column indentation.
string grid = page.GetText(TextOutputMode.GridAligned);
// Paragraph flow: groups lines into readable paragraphs.
string flow = page.GetText(TextOutputMode.ParagraphFlow);
// Structured: best for RAG pipelines and semantic chunking.
string structured = page.GetText(TextOutputMode.Structured);
// Auto: let the engine pick the best mode for the page.
string auto = page.GetText(TextOutputMode.Auto);
Remarks
- RawLines – one logical line per detected line; no grid/column analysis. Words are joined with single spaces; indentation and column alignment are not preserved.
- GridAligned – preserves approximate columns and indentation by inserting spaces according to each word's position within the page bounds; inserts 0 to 5 blank lines between consecutive lines, depending on the measured inter-line spacing.
- ParagraphFlow – groups lines into paragraphs in reading order and separates paragraphs with a blank line; best for natural reading.
- Structured – maintains paragraph boundaries and tabular layouts as logical blocks; ideal for RAG pipelines where semantic chunking and context preservation are critical.
- Auto – inspects document structure and selects the most suitable mode depending on layout characteristics and intended use case.
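To make the GridAligned idea concrete, here is a minimal sketch of how word positions might be mapped to character columns. It is not LM-Kit's implementation: the `RenderGridLine` helper, the `(Text, X)` tuple shape, and the `pageWidth` and `columns` parameters are all hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper: renders one line of words into a fixed-width character grid.
// Each word carries its text and the X coordinate of its left edge in page units.
static string RenderGridLine(IEnumerable<(string Text, double X)> words, double pageWidth, int columns)
{
    var sb = new StringBuilder();
    foreach (var w in words.OrderBy(w => w.X))
    {
        // Map the word's horizontal position to a target character column.
        int target = Math.Max(0, (int)Math.Round(w.X / pageWidth * columns));

        // Keep at least one space between adjacent words when they would collide.
        if (sb.Length > 0 && target <= sb.Length)
            target = sb.Length + 1;

        // Pad with spaces up to the target column, then write the word.
        sb.Append(' ', target - sb.Length);
        sb.Append(w.Text);
    }
    return sb.ToString();
}

// Usage: three words at 0%, 50%, and 75% of a 100-unit-wide page, on a 40-column grid.
var line = new[] { ("Item", 0.0), ("Qty", 50.0), ("Price", 75.0) };
Console.WriteLine(RenderGridLine(line, pageWidth: 100.0, columns: 40));
```

With these inputs, "Qty" starts at column 20 and "Price" at column 30, so values in successive lines that share an X position line up vertically. RawLines, by contrast, would join the same words with single spaces.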
All modes analyze the page in a normalized "view space" (deskewed and de-rotated), then return plain UTF-8 text with Unix line endings; trailing whitespace is trimmed.
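The output conventions described above can be approximated on arbitrary strings, which may be useful when comparing extracted text against text from other sources. The sketch below is an illustration only: `NormalizeOutput` is a hypothetical helper, and it assumes that trailing whitespace is trimmed per line, which the description does not state explicitly.

```csharp
using System;
using System.Linq;

// Hypothetical helper: applies Unix line endings and trims trailing whitespace on each line.
static string NormalizeOutput(string text)
{
    // Convert Windows (CRLF) and bare CR line endings to Unix (LF).
    string unix = text.Replace("\r\n", "\n").Replace('\r', '\n');

    // Trim trailing spaces and tabs on every line (assumption: trimming is per line).
    var lines = unix.Split('\n').Select(l => l.TrimEnd(' ', '\t'));
    return string.Join("\n", lines);
}

// Usage: mixed line endings and trailing whitespace are normalized.
string cleaned = NormalizeOutput("Total  \r\nAmount\t\r Due ");
Console.WriteLine(cleaned.Replace("\n", "\\n"));
```

Leading whitespace is left untouched so that GridAligned indentation survives the normalization.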