Class PageElement
Represents the content of a single page, including its textual elements and a plain text aggregation. Typically used for layout-aware extraction results from documents such as PDFs or OCR-processed images.
[Obfuscation(Exclude = true)]
public sealed class PageElement : ILayoutElement
- Inheritance
- object
PageElement
- Implements
- ILayoutElement
Examples
Example 1: Load a PDF page layout and extract text.
using LMKit.Document.Layout;
using LMKit.Document.Pdf;
PdfInfo info = PdfInfo.Load("report.pdf");
PageElement page = info.Pages[0].GetLayout();
Console.WriteLine($"Page size: {page.Width} x {page.Height} pts");
Console.WriteLine($"Elements : {page.TextElements.Count}");
Console.WriteLine(page.GetText());
Example 2: Iterate text elements and inspect bounding boxes.
using LMKit.Document.Layout;
using LMKit.Document.Pdf;
PdfInfo info = PdfInfo.Load("invoice.pdf");
PageElement page = info.Pages[0].GetLayout();
foreach (TextElement element in page.TextElements)
{
Console.WriteLine($"\"{element.Text}\" at ({element.Left:F1}, {element.Top:F1}) " +
$"size {element.Width:F1}x{element.Height:F1}");
}
Example 3: Use different text output modes.
using LMKit.Document.Layout;
using LMKit.Document.Pdf;
PdfInfo info = PdfInfo.Load("table.pdf");
PageElement page = info.Pages[0].GetLayout();
// Grid-aligned preserves column alignment.
string gridText = page.GetText(TextOutputMode.GridAligned);
// Paragraph flow groups lines into readable paragraphs.
string flowText = page.GetText(TextOutputMode.ParagraphFlow);
// Structured is optimized for RAG pipelines.
string structuredText = page.GetText(TextOutputMode.Structured);
Constructors
- PageElement(IEnumerable<TextElement>, double, double, int, double, UnitMode)
Initializes a new instance of the PageElement class with structured text elements, page dimensions, rotation, and skew information. Each TextElement may represent a single character, a word, a line, or an entire paragraph; for best layout results you should supply word-level elements.
- PageElement(string)
Initializes a new instance of the PageElement class with plain unstructured text only. This constructor should be used when no layout or bounding box information is available.
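As a minimal sketch of the plain-text constructor, the snippet below builds a page from a string and reads it back through Text. The sample text is made up for illustration. How the library populates TextElements and the page dimensions for such a page is not stated here, so the sketch does not rely on them.

```csharp
using System;
using LMKit.Document.Layout;

// Hypothetical sample text; no layout or bounding box information is supplied.
string rawText = "Invoice 2024-001\nTotal due: 42.00 EUR";
PageElement page = new PageElement(rawText);

// Text returns the aggregated page content as a single string.
Console.WriteLine(page.Text);
```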
Properties
- Bounds
Gets the page's bounding quadrilateral in page coordinates, enclosing all TextElement bounds.
- Height
Gets the height of the page in the original document, in points.
- Rotation
Gets the detected rotation of the page in degrees clockwise (e.g., 0, 90, 180, or 270).
- Skew
Gets the detected skew angle of the page in degrees clockwise.
- Text
Gets the full textual content of the page as a single aggregated string. Uses GridAligned by default.
- TextElements
Gets the collection of TextElement instances found on the page. Each text element may optionally include bounding box coordinates describing its layout.
- Unit
Gets the measurement unit used for Width, Height, and all geometry carried by TextElements, such as their bounds.
- Width
Gets the width of the page in the original document, in points.
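The properties above can be inspected together before deciding how to process a page, for instance to spot rotated or skewed scans. This is only a sketch: the file name is a placeholder, and the values are printed through string interpolation so the code makes no assumption about the exact property types.

```csharp
using System;
using LMKit.Document.Layout;
using LMKit.Document.Pdf;

// "scan.pdf" is a placeholder file name used for illustration.
PdfInfo info = PdfInfo.Load("scan.pdf");
PageElement page = info.Pages[0].GetLayout();

// Report the page geometry and the unit its measurements are expressed in.
Console.WriteLine($"Unit     : {page.Unit}");
Console.WriteLine($"Size     : {page.Width} x {page.Height}");
Console.WriteLine($"Rotation : {page.Rotation} degrees clockwise");
Console.WriteLine($"Skew     : {page.Skew} degrees clockwise");
```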
Methods
- Clone()
Creates a deep copy of this PageElement, including cloned text elements and preserving current layout settings (size, rotation, skew, and formatting flag).
- DetectLines()
Detects and groups TextElement instances into reading-order text lines.
- DetectParagraphs()
Detects and groups lines into paragraphs via a joint layout-and-NLP analysis. Combines geometric cues (inter-line spacing ratios, indentation deltas, baseline alignment, left/right rag, column/region membership, style changes) with linguistic signals (sentence boundary confidence, discourse/continuation markers, list/quote/heading patterns, cross-line semantic cohesion). Designed to be robust to OCR noise, rotation/skew, and multilingual scripts. Returns the paragraphs in reading order.
- FromJson(string)
Deserializes the specified JSON string into a PageElement instance.
- GetText(TextOutputMode)
Aggregates the page's textual content using the specified TextOutputMode.
- ToJson()
Serializes this PageElement into a JSON-formatted string.
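A possible use of ToJson(), FromJson(string), and Clone() is caching an extracted layout and restoring it later. The sketch below assumes that FromJson is a static method on PageElement, which the summary above suggests but does not confirm. The return type of Clone() is not shown here either, so the copy is held in a `var`.

```csharp
using System;
using LMKit.Document.Layout;
using LMKit.Document.Pdf;

PdfInfo info = PdfInfo.Load("report.pdf");
PageElement page = info.Pages[0].GetLayout();

// Serialize the page layout, for example to store it in a cache.
string json = page.ToJson();

// Assumption: FromJson is a static factory method on PageElement.
PageElement restored = PageElement.FromJson(json);
Console.WriteLine($"Elements before: {page.TextElements.Count}, after: {restored.TextElements.Count}");

// Clone() creates a deep copy, so the copy is a separate object from the original page.
var copy = page.Clone();
Console.WriteLine($"Copy is a separate instance: {!ReferenceEquals(copy, page)}");
```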