Method DetectParagraphs

Namespace: LMKit.Document.Layout

Assembly: LM-Kit.NET.dll

DetectParagraphs()

Detects and groups lines into paragraphs via a joint layout-and-NLP analysis. Combines geometric cues (inter-line spacing ratios, indentation deltas, baseline alignment, left/right rag, column/region membership, style changes) with linguistic signals (sentence boundary confidence, discourse/continuation markers, list/quote/heading patterns, cross-line semantic cohesion). Designed to be robust to OCR noise, rotation/skew, and multilingual scripts. Returns paragraphs in reading order.

public List<ParagraphElement> DetectParagraphs()

Returns

List<ParagraphElement>: A list of ParagraphElement objects in reading order, positioned in the page's original coordinate space.

Remarks

Paragraph detection operates in normalized "view space" (deskewed and de-rotated) to improve stability, then remaps the result back to the page's original coordinate system before returning.
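The round trip between page space and view space can be pictured with a small geometric sketch. This is not LM-Kit's implementation: the ViewSpaceSketch class, its method names, and the assumption that deskewing is a pure rotation about a pivot point are all hypothetical and only illustrate the idea of normalizing coordinates and then mapping them back.

```csharp
using System;

// Hypothetical helper, not part of LM-Kit.NET.
// Assumes deskewing is a pure rotation around a pivot (cx, cy).
public static class ViewSpaceSketch
{
    // Rotates the point (x, y) by angleDegrees around (cx, cy).
    public static (double X, double Y) Rotate(
        double x, double y, double cx, double cy, double angleDegrees)
    {
        double radians = angleDegrees * Math.PI / 180.0;
        double cos = Math.Cos(radians);
        double sin = Math.Sin(radians);
        double dx = x - cx;
        double dy = y - cy;
        return (cx + dx * cos - dy * sin, cy + dx * sin + dy * cos);
    }

    // Page space -> view space: undo the detected skew.
    public static (double X, double Y) ToViewSpace(
        double x, double y, double cx, double cy, double skewDegrees)
        => Rotate(x, y, cx, cy, -skewDegrees);

    // View space -> page space: reapply the skew, so results land
    // back in the page's original coordinate system.
    public static (double X, double Y) ToPageSpace(
        double x, double y, double cx, double cy, double skewDegrees)
        => Rotate(x, y, cx, cy, skewDegrees);
}
```

Applying ToViewSpace followed by ToPageSpace with the same pivot and angle returns the original point up to floating-point error, which is the property that lets analysis run in the normalized space without changing the coordinates callers receive.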

Reading order. The returned paragraphs follow the page's reading order: typically top-to-bottom within each column/region, then column-by-column as inferred from the layout. For right-to-left scripts, the detected directionality is respected. When two items are nearly aligned, the tie is broken deterministically by Y-then-X position in view space. Within each paragraph, the contained LineElement instances are also in reading order.
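A Y-then-X tie-break can be sketched as a two-key sort. The TieBreakSketch class and the tuple shape below are hypothetical and not part of LM-Kit.NET. The sketch also assumes that Y grows downward from the top of the page, and it does not model how "nearly aligned" is decided, since that threshold is not documented here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of LM-Kit.NET.
public static class TieBreakSketch
{
    // Orders items by Y first, then by X for items with equal Y.
    // Assumes view-space coordinates with Y increasing downward.
    public static List<(string Name, double X, double Y)> OrderByYThenX(
        IEnumerable<(string Name, double X, double Y)> items)
    {
        return items
            .OrderBy(item => item.Y)   // primary key: vertical position
            .ThenBy(item => item.X)    // secondary key: horizontal position
            .ToList();
    }
}
```

Because the two keys fully determine the order whenever positions differ, and LINQ's OrderBy is a stable sort, the same input always yields the same output, which is what makes the ordering deterministic.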

If no layout information is available, the method returns a single ParagraphElement that contains a single LineElement wrapping all text elements.

Results are cached internally and invalidated whenever page content or geometry changes. Each call returns a new list composed of new ParagraphElement and LineElement instances that reference the original TextElements.
