Class PageElement
Represents the content of a single page, including its textual elements and a plain text aggregation. Typically used for layout-aware extraction results from documents such as PDFs or OCR-processed images.
public sealed class PageElement : ILayoutElement
- Inheritance
- object
PageElement
- Implements
- ILayoutElement
- Inherited Members
Constructors
- PageElement(IEnumerable<TextElement>, double, double, int, double)
Initializes a new instance of the PageElement class with structured text elements, page dimensions, rotation, and skew information. Each TextElement may represent a single character, a word, a line, or an entire paragraph; for best layout results you should supply word-level elements.
- PageElement(string)
Initializes a new instance of the PageElement class with plain, unstructured text only. Use this constructor when no layout or bounding box information is available.
Properties
- Bounds
Gets the page’s bounding quadrilateral in page coordinates, enclosing all TextElement bounds.
- Height
Gets the height of the page in the original document, in points.
- Rotation
Gets the detected rotation of the page in degrees clockwise (e.g., 0, 90, 180, or 270).
- Skew
Gets the detected skew angle of the page in degrees clockwise.
- Text
Gets the full textual content of the page as a single aggregated string.
- TextElements
Gets the collection of TextElement instances found on the page. Each text element may include bounding box coordinates describing its layout.
- Width
Gets the width of the page in the original document, in points.
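Because Width and Height are reported in points, converting them to inches or pixels is simple arithmetic, assuming the usual PDF point of 1/72 inch. The sketch below is illustrative only: `PageUnits`, `PointsToInches`, and `PointsToPixels` are hypothetical helper names that are not part of the library, and two plain `double` values stand in for the Width and Height properties.

```csharp
using System;

// US Letter is 612 x 792 points; these values stand in for page.Width and page.Height.
double widthPt = 612, heightPt = 792;

Console.WriteLine($"{PageUnits.PointsToInches(widthPt)} x {PageUnits.PointsToInches(heightPt)} in");
Console.WriteLine($"{PageUnits.PointsToPixels(widthPt, 300)} x {PageUnits.PointsToPixels(heightPt, 300)} px at 300 DPI");

// Hypothetical helper, not part of the library.
public static class PageUnits
{
    // Assumption: one point is 1/72 inch, as in PDF user space.
    public const double PointsPerInch = 72.0;

    public static double PointsToInches(double points) => points / PointsPerInch;

    // Rounds to the nearest whole pixel at the given resolution.
    public static int PointsToPixels(double points, double dpi) =>
        (int)Math.Round(points / PointsPerInch * dpi);
}
```

For a US Letter page this works out to 8.5 by 11 inches, or 2550 by 3300 pixels at 300 DPI. If the page was rotated by 90 or 270 degrees, the rendered width and height may need to be swapped, depending on how the consuming code treats Rotation.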
Methods
- Clone()
Creates a deep copy of this PageElement, including cloned text elements and preserving current layout settings (size, rotation, skew, and formatting flag).
- DetectLines()
Detects and groups TextElement instances into reading-order text lines.
- DetectParagraphs()
Detects and groups lines into paragraphs via a joint layout-and-NLP analysis. Combines geometric cues (inter-line spacing ratios, indentation deltas, baseline alignment, left/right rag, column/region membership, style changes) with linguistic signals (sentence boundary confidence, discourse/continuation markers, list/quote/heading patterns, cross-line semantic cohesion). Designed to be robust to OCR noise, rotation/skew, and multilingual scripts.
- FromJson(string)
Deserializes the specified JSON string into a PageElement instance.
- ToJson()
Serializes this PageElement into a JSON-formatted string.
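To make the geometric side of DetectLines() and DetectParagraphs() more concrete, here is a minimal sketch of grouping words into lines by vertical position and then splitting lines into paragraphs by an inter-line spacing ratio. It is not the library's algorithm: it ignores rotation, skew, columns, and every linguistic signal. The `Word` record and the `LayoutSketch` class are hypothetical stand-ins, and the coordinate convention (top-left origin, y growing downward) and the 1.5 spacing ratio are assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample words, deliberately out of order. Coordinates are in points with the
// origin at the top-left corner and y growing downward (an assumption).
var words = new List<Word>
{
    new("world", 50, 10, 30, 12),
    new("Hello", 10, 11, 35, 12),
    new("Second", 10, 26, 40, 12),
    new("line.", 55, 25, 25, 12),
    new("New", 10, 60, 25, 12),
    new("paragraph.", 40, 60, 60, 12),
};

var lines = LayoutSketch.GroupIntoLines(words);
var paragraphs = LayoutSketch.GroupIntoParagraphs(lines);

foreach (var paragraph in paragraphs)
    Console.WriteLine(string.Join(" / ", paragraph.Select(l => LayoutSketch.LineText(l))));

// Hypothetical stand-in for a word-level TextElement with an axis-aligned box.
public record Word(string Text, double Left, double Top, double Width, double Height)
{
    public double CenterY => Top + Height / 2;
}

// Hypothetical helper, not part of the library.
public static class LayoutSketch
{
    // Groups words into lines: a word joins the current line when its vertical
    // center falls inside the vertical extent of that line's first word.
    public static List<List<Word>> GroupIntoLines(IEnumerable<Word> words)
    {
        var lines = new List<List<Word>>();
        foreach (var word in words.OrderBy(w => w.CenterY))
        {
            var current = lines.LastOrDefault();
            if (current != null
                && word.CenterY >= current[0].Top
                && word.CenterY <= current[0].Top + current[0].Height)
                current.Add(word);
            else
                lines.Add(new List<Word> { word });
        }
        // Left-to-right reading order within each line.
        return lines.Select(l => l.OrderBy(w => w.Left).ToList()).ToList();
    }

    // Starts a new paragraph when the distance between consecutive line tops
    // exceeds spacingRatio times the tallest word of the previous line.
    public static List<List<List<Word>>> GroupIntoParagraphs(
        List<List<Word>> lines, double spacingRatio = 1.5)
    {
        var paragraphs = new List<List<List<Word>>>();
        for (int i = 0; i < lines.Count; i++)
        {
            bool startNew = i == 0;
            if (i > 0)
            {
                var previous = lines[i - 1];
                double pitch = lines[i].Min(w => w.Top) - previous.Min(w => w.Top);
                startNew = pitch > spacingRatio * previous.Max(w => w.Height);
            }
            if (startNew) paragraphs.Add(new List<List<Word>>());
            paragraphs[^1].Add(lines[i]);
        }
        return paragraphs;
    }

    public static string LineText(List<Word> line) =>
        string.Join(" ", line.Select(w => w.Text));
}
```

With the sample data, the first two lines sit 15 points apart (below the 18-point threshold) and stay in one paragraph, while the third line starts 35 points lower and opens a new one, so the program prints `Hello world / Second line.` followed by `New paragraph.`. A purely geometric rule like this breaks down on multi-column pages and on skewed scans, which is where the additional cues listed for DetectParagraphs() come in.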