Analyze Document Layout and Search by Region, Proximity, and Structure

Enterprise documents encode meaning through spatial arrangement: invoices place totals in the bottom-right corner, contracts indent sub-clauses under parent sections, and financial reports align figures in columns. Extracting text alone discards this structural information. LM-Kit.NET's PageElement class detects lines and paragraphs from raw text elements, and the LayoutSearchEngine provides six search modes that exploit spatial relationships: exact text, regex patterns, fuzzy matching, rectangular region queries, proximity search near anchor points, and between-text extraction. All operations are CPU-based and require no LLM model loading. This tutorial builds a layout-aware document analysis pipeline that handles multi-column, rotated, and OCR-processed documents.


Why Layout-Aware Analysis Matters

Structural document analysis solves two common enterprise problems:

  1. Automated invoice field extraction. Invoice layouts vary across vendors, but fields follow spatial conventions: the invoice number appears near the top, line items occupy the middle, and totals cluster in the bottom-right. Region and proximity search locates values relative to their labels, even when the exact position shifts between vendors.
  2. Regulatory document auditing. Compliance teams must verify that specific clauses, dates, and signatures appear in the correct sections of contracts and filings. Between-text extraction pulls content between section headers, and fuzzy search finds terms despite OCR errors in scanned originals.

Prerequisites

| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 4 GB |
| Input formats | PDF (with text layer), DOCX, images (with prior OCR) |

No GPU or LLM model is required. Layout analysis and search are CPU-based operations that run on any hardware.


Step 1: Create the Project

dotnet new console -n LayoutAnalysis
cd LayoutAnalysis
dotnet add package LM-Kit.NET

Step 2: Understand Document Layout Architecture

LM-Kit.NET processes document layout in two stages. First, the Attachment class parses the document and exposes PageElement objects that contain raw text elements with positional coordinates. Second, PageElement provides structure detection (DetectLines, DetectParagraphs) and the LayoutSearchEngine provides six spatial search modes.

Document File (PDF, DOCX)
        │
        ▼
    Attachment
        │
        ├── PageCount
        └── PageElements[]
                │
                ▼
           PageElement
            ├── TextElements     (raw positioned text)
            ├── DetectLines()    → List<LineElement>
            ├── DetectParagraphs() → List<ParagraphElement>
            └── GetText(mode)    → string (5 output modes)
┌────────────────────────────────────┐
│       LayoutSearchEngine           │
├────────────────────────────────────┤
│  FindText()      Exact substring   │
│  FindRegex()     Pattern matching  │
│  FindFuzzy()     Approximate match │
│  FindInRegion()  Spatial rectangle │
│  FindNear()      Proximity anchor  │
│  FindBetween()   Range extraction  │
└────────────────┬───────────────────┘
                 │
                 ▼
        List<TextMatch>
        ├── Text        (matched content)
        ├── Snippet     (surrounding context)
        ├── Score       (relevance 0..1)
        ├── Bounds      (bounding quadrilateral)
        ├── PageIndex   (source page)
        └── Elements    (constituent TextElements)

Every search method has two overloads: one that accepts a single PageElement for single-page queries, and one that accepts IEnumerable<PageElement> for cross-page searches.


Step 3: Load a Document and Access Layout Elements

using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Document.Search;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// LMKit.Licensing.LicenseManager.SetLicenseKey("YOUR_LICENSE_KEY");

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "financial_report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({attachment.PageCount} pages)\n");

// ──────────────────────────────────────
// 2. Inspect page metadata
// ──────────────────────────────────────
for (int i = 0; i < attachment.PageCount; i++)
{
    PageElement page = attachment.PageElements[i];
    Console.WriteLine($"Page {i + 1}: {page.Width:F0} x {page.Height:F0} ({page.Unit})");
    Console.WriteLine($"  Rotation: {page.Rotation}°, Skew: {page.Skew:F2}°");
    Console.WriteLine($"  Text elements: {page.TextElements.Count()}");
}

The Rotation property reports the detected page rotation in degrees (0, 90, 180, 270). The Skew property reports the text skew angle, which is useful for identifying scanned documents that were fed at an angle. Both values inform downstream layout analysis and search accuracy.
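
As a rough illustration of how these two values might drive a pipeline decision, the sketch below classifies a page from its rotation and skew. `DescribeOrientation` and the 1° skew tolerance are hypothetical choices made for this example, not part of LM-Kit.NET, and the helper assumes both values arrive as numeric degrees (for example, read from `page.Rotation` and `page.Skew`).

```csharp
using System;

// Hypothetical helper (not an LM-Kit.NET API): summarizes what a page's
// rotation and skew imply for downstream processing.
// skewToleranceDegrees is an illustrative threshold, not a library default.
static string DescribeOrientation(double rotationDegrees, double skewDegrees,
                                  double skewToleranceDegrees = 1.0)
{
    // Normalize rotation into [0, 360) so that -90 and 270 are treated alike.
    double rotation = ((rotationDegrees % 360) + 360) % 360;
    bool rotated = rotation != 0;
    bool skewed = Math.Abs(skewDegrees) > skewToleranceDegrees;

    if (rotated && skewed) return "rotated and skewed";
    if (rotated) return "rotated";
    if (skewed) return "skewed";
    return "upright";
}

Console.WriteLine(DescribeOrientation(0, 0.2));    // upright
Console.WriteLine(DescribeOrientation(90, 0.0));   // rotated
Console.WriteLine(DescribeOrientation(0, -2.5));   // skewed
Console.WriteLine(DescribeOrientation(270, 1.8));  // rotated and skewed
```

A pipeline could log this label per page and route "skewed" pages to manual review; the right tolerance depends on your scanner and document set.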


Step 4: Detect Lines and Paragraphs

DetectLines() groups individual text elements into reading-order lines, handling multi-column layouts and varying font sizes. DetectParagraphs() goes further by grouping lines into logical paragraphs based on spacing, indentation, and alignment patterns.

using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Document.Search;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// LMKit.Licensing.LicenseManager.SetLicenseKey("YOUR_LICENSE_KEY");

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "financial_report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);

PageElement page = attachment.PageElements[0];

// ──────────────────────────────────────
// 3. Detect reading-order lines
// ──────────────────────────────────────
List<LineElement> lines = page.DetectLines();
Console.WriteLine($"Detected {lines.Count} lines:\n");

foreach (LineElement line in lines.Take(5))
{
    Console.WriteLine($"  [{line.Left:F0},{line.Top:F0}] \"{line.Text}\"");
    Console.WriteLine($"    Size: {line.Width:F0}x{line.Height:F0}, " +
                      $"Words: {line.Words.Count}, " +
                      $"Angle: {line.DominantTextAngleDegrees:F1}°, " +
                      $"Direction: {line.TextDirection}");
}

// ──────────────────────────────────────
// 4. Detect logical paragraphs
// ──────────────────────────────────────
List<ParagraphElement> paragraphs = page.DetectParagraphs();
Console.WriteLine($"\nDetected {paragraphs.Count} paragraphs:\n");

foreach (ParagraphElement para in paragraphs)
{
    string preview = para.Text.Length > 80 ? para.Text[..80] + "..." : para.Text;
    Console.WriteLine($"  [{para.Left:F0},{para.Top:F0}] ({para.Lines.Count} lines) \"{preview}\"");
    Console.WriteLine($"    LayerId: {para.LayerId}, " +
                      $"Angle: {para.DominantTextAngleDegrees:F1}°");
}

LineElement.Words gives you an IReadOnlyList<TextElement> with per-word bounding boxes, enabling fine-grained coordinate access. ParagraphElement.LayerId identifies which layout layer the paragraph belongs to, which is useful for separating main body text from headers, footers, or sidebar content in multi-column documents.
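
To build intuition for what line detection involves, here is a deliberately simplified sketch that clusters positioned words into lines by their vertical position and then orders each line left to right. It is not the algorithm LM-Kit.NET uses: the tuple shape, the `GroupIntoLines` name, and the 3-unit tolerance are assumptions for illustration, and the sketch ignores multi-column layouts, rotation, and varying font sizes, which `DetectLines()` is described as handling.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified illustration only. Words are (Text, Left, Top) tuples in page
// units, assuming a top-left origin and unrotated, single-column text.
static List<string> GroupIntoLines(
    IEnumerable<(string Text, double Left, double Top)> words,
    double yTolerance = 3.0)
{
    var lines = new List<List<(string Text, double Left, double Top)>>();

    foreach (var word in words.OrderBy(w => w.Top))
    {
        // Join the most recent line when this word's top edge is within the
        // tolerance of that line's first word; otherwise start a new line.
        if (lines.Count > 0 && Math.Abs(lines[^1][0].Top - word.Top) <= yTolerance)
            lines[^1].Add(word);
        else
            lines.Add(new List<(string Text, double Left, double Top)> { word });
    }

    // Within each line, restore left-to-right reading order.
    return lines
        .Select(line => string.Join(" ", line.OrderBy(w => w.Left).Select(w => w.Text)))
        .ToList();
}

var sample = new (string Text, double Left, double Top)[]
{
    ("Revenue", 120, 50.5),
    ("Total", 72, 50),
    ("$4,200.00", 400, 51),
    ("Net", 72, 70),
    ("Income", 100, 70.4),
};

foreach (string line in GroupIntoLines(sample))
    Console.WriteLine(line);
// Total Revenue $4,200.00
// Net Income
```

Even this toy version shows why small vertical jitter (50 vs. 50.5 vs. 51) must be tolerated when grouping elements, and why a fixed tolerance breaks down once font sizes vary.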


Step 5: Extract Text with Different Output Modes

PageElement.GetText(TextOutputMode) converts the raw text elements into a formatted string. Each mode applies a different layout reconstruction strategy:

using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Document.Search;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// LMKit.Licensing.LicenseManager.SetLicenseKey("YOUR_LICENSE_KEY");

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "financial_report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({attachment.PageCount} pages)\n");

// ──────────────────────────────────────
// 2. Inspect page metadata
// ──────────────────────────────────────
for (int i = 0; i < attachment.PageCount; i++)
{
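    // Named "current" because a top-level "page" variable is declared after this loop;
    // reusing the name inside the loop would fail to compile (CS0136).
    PageElement current = attachment.PageElements[i];
    Console.WriteLine($"Page {i + 1}: {current.Width:F0} x {current.Height:F0} ({current.Unit})");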
}

PageElement page = attachment.PageElements[0];

// ──────────────────────────────────────
// 5. Compare text output modes
// ──────────────────────────────────────
TextOutputMode[] modes =
{
    TextOutputMode.RawLines,
    TextOutputMode.GridAligned,
    TextOutputMode.ParagraphFlow,
    TextOutputMode.Structured,
    TextOutputMode.Auto
};

foreach (TextOutputMode mode in modes)
{
    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine($"\n=== {mode} ===");
    Console.ResetColor();

    string text = page.GetText(mode);
    string preview = text.Length > 400 ? text[..400] + "\n[...]" : text;
    Console.WriteLine(preview);
}

| Mode | Behavior | Best For |
|---|---|---|
| RawLines | One line per text element, no grouping | Debugging, raw element inspection |
| GridAligned | Preserves column alignment and indentation with spaces | Multi-column documents, financial statements |
| ParagraphFlow | Groups lines into paragraphs separated by blank lines | Articles, contracts, legal documents |
| Structured | Preserves both tabular and paragraph structure | Tables, forms, mixed-layout reports |
| Auto | Evaluates the page structure and selects the optimal mode | Unknown document types, general pipelines |

For most document intelligence workflows, use Structured mode. It preserves the information hierarchy that downstream search, extraction, and RAG indexing need. Use GridAligned when column alignment is critical, such as financial statements with side-by-side figures. Use ParagraphFlow when you need clean, readable text for LLM prompt context.


Step 6: Search for Exact Text and Regex Patterns

The LayoutSearchEngine operates on PageElement objects and returns TextMatch results that include the matched text, a contextual snippet, a bounding box, and a relevance score.

using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Document.Search;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// LMKit.Licensing.LicenseManager.SetLicenseKey("YOUR_LICENSE_KEY");

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "financial_report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({attachment.PageCount} pages)\n");

// ──────────────────────────────────────
// 2. Inspect page metadata
// ──────────────────────────────────────
for (int i = 0; i < attachment.PageCount; i++)
{
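    // Named "current" because a top-level "page" variable is declared after this loop;
    // reusing the name inside the loop would fail to compile (CS0136).
    PageElement current = attachment.PageElements[i];
    Console.WriteLine($"Page {i + 1}: {current.Width:F0} x {current.Height:F0} ({current.Unit})");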
}

PageElement page = attachment.PageElements[0];

// ──────────────────────────────────────
// 6. Create the search engine
// ──────────────────────────────────────
var searchEngine = new LayoutSearchEngine();

// ──────────────────────────────────────
// 7. Exact text search
// ──────────────────────────────────────
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\n=== Exact Text Search ===");
Console.ResetColor();

List<TextMatch> textMatches = searchEngine.FindText(page, "Total Revenue",
    new TextSearchOptions
    {
        Comparison = StringComparison.OrdinalIgnoreCase,
        WholeWord = true,
        MaxResults = 10,
        ContextChars = 40
    });

Console.WriteLine($"Found {textMatches.Count} match(es) for \"Total Revenue\":\n");
foreach (TextMatch match in textMatches)
{
    Console.WriteLine($"  \"{match.Text}\" at ({match.Bounds.Left:F0}, {match.Bounds.Top:F0})");
    Console.WriteLine($"    Context: ...{match.Snippet}...");
    Console.WriteLine($"    Score: {match.Score:F2}, Index: {match.StartIndex}..{match.EndIndex}");
}

// ──────────────────────────────────────
// 8. Regex pattern search
// ──────────────────────────────────────
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\n=== Regex Search: Dollar Amounts ===");
Console.ResetColor();

List<TextMatch> amounts = searchEngine.FindRegex(page, @"\$[\d,]+\.?\d*",
    new RegexSearchOptions { MaxResults = 20, ContextChars = 30 });

Console.WriteLine($"Found {amounts.Count} monetary amount(s):\n");
foreach (TextMatch match in amounts)
{
    Console.WriteLine($"  {match.Text} at ({match.Bounds.Left:F0}, {match.Bounds.Top:F0})");
    Console.WriteLine($"    Context: ...{match.Snippet}...");
}

// Find dates in various formats
List<TextMatch> dates = searchEngine.FindRegex(page,
    @"\b\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}\b",
    new RegexSearchOptions { MaxResults = 50 });

Console.WriteLine($"\nFound {dates.Count} date(s):");
foreach (TextMatch match in dates)
{
    Console.WriteLine($"  {match.Text}");
}

TextSearchOptions.WholeWord prevents partial matches (for example, "Revenue" inside "RevenueStream"). ContextChars controls how many characters of surrounding text appear in the Snippet property. The RegexSearchOptions.RegexOptions property accepts standard System.Text.RegularExpressions.RegexOptions flags for multiline, single-line, and other regex behaviors.
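
The dollar-amount pattern used above is permissive: the optional `\.?` can capture a sentence-ending period, and `[\d,]+` can capture a trailing comma. The sketch below runs the same pattern through the standard `System.Text.RegularExpressions.Regex` class on a plain string and converts the matches to `decimal` values. `ParseAmounts` is a hypothetical helper for illustration, not an LM-Kit.NET API; in the tutorial's pipeline you would apply the same cleanup to `match.Text`.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: extracts dollar amounts from text and parses them.
static List<decimal> ParseAmounts(string text)
{
    // Same pattern as the FindRegex call: "$", digits and commas, optional decimals.
    const string amountPattern = @"\$[\d,]+\.?\d*";

    var values = new List<decimal>();
    foreach (Match m in Regex.Matches(text, amountPattern))
    {
        // Drop the currency sign and thousands separators, then trim a trailing
        // period that the permissive pattern picks up at the end of a sentence.
        string digits = m.Value.Replace("$", "").Replace(",", "").TrimEnd('.');

        if (decimal.TryParse(digits, NumberStyles.AllowDecimalPoint,
                             CultureInfo.InvariantCulture, out decimal value))
            values.Add(value);
    }
    return values;
}

string sampleText = "Total Revenue: $1,234,567.89. Fees were $45, up from $30.";
List<decimal> parsed = ParseAmounts(sampleText);

Console.WriteLine(string.Join(", ",
    parsed.Select(v => v.ToString(CultureInfo.InvariantCulture))));
// 1234567.89, 45, 30
```

If trailing punctuation is a problem for your documents, a stricter pattern such as `\$[\d,]+\.\d{2}` (which requires exactly two decimals) avoids the cleanup step at the cost of missing whole-dollar amounts.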


Step 7: Fuzzy Search for OCR-Noisy Text

Scanned documents often contain OCR errors: "l" becomes "1", "rn" becomes "m", and other visually similar characters are confused. Fuzzy search uses edit-distance matching to find terms despite these errors.

using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Document.Search;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// LMKit.Licensing.LicenseManager.SetLicenseKey("YOUR_LICENSE_KEY");

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "financial_report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({attachment.PageCount} pages)\n");

// ──────────────────────────────────────
// 2. Inspect page metadata
// ──────────────────────────────────────
for (int i = 0; i < attachment.PageCount; i++)
{
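    // Named "current" because a top-level "page" variable is declared after this loop;
    // reusing the name inside the loop would fail to compile (CS0136).
    PageElement current = attachment.PageElements[i];
    Console.WriteLine($"Page {i + 1}: {current.Width:F0} x {current.Height:F0} ({current.Unit})");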
}

PageElement page = attachment.PageElements[0];
var searchEngine = new LayoutSearchEngine();

// ──────────────────────────────────────
// 9. Fuzzy matching
// ──────────────────────────────────────
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\n=== Fuzzy Search ===");
Console.ResetColor();

// Find "Invoice Number" even with OCR errors like "lnvoice Nurnber"
List<TextMatch> fuzzyMatches = searchEngine.FindFuzzy(page, "Invoice Number",
    new FuzzySearchOptions
    {
        MaxEditDistance = 3,
        MinScore = 0.65,
        TokenAware = true,
        MaxResults = 10,
        ContextChars = 40
    });

Console.WriteLine($"Fuzzy matches for \"Invoice Number\":\n");
foreach (TextMatch match in fuzzyMatches)
{
    Console.WriteLine($"  \"{match.Text}\" (score: {match.Score:F2})");
    Console.WriteLine($"    Position: ({match.Bounds.Left:F0}, {match.Bounds.Top:F0})");
    Console.WriteLine($"    Context: ...{match.Snippet}...");
}

| Option | Default | Description |
|---|---|---|
| MaxEditDistance | 2 | Maximum Damerau-Levenshtein edit distance allowed |
| MinScore | 0.75 | Minimum similarity score (0..1) to include in results |
| TokenAware | false | When true, compares tokens (words) individually, improving accuracy for multi-word queries |

Setting TokenAware = true is important for multi-word queries. Without it, "Invoice Number" is treated as a single string. With it, each word is matched independently, so "lnvoice Nurnber" still scores highly because both individual tokens are close to their targets.
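
To make the edit-distance numbers concrete, the sketch below implements a textbook Damerau-Levenshtein distance (the optimal string alignment variant, which counts insertions, deletions, substitutions, and adjacent transpositions) and applies it to the OCR example. This is a teaching sketch, not LM-Kit.NET's implementation; how the library counts edits internally and derives its score is not specified here.

```csharp
using System;

// Illustrative Damerau-Levenshtein distance (optimal string alignment variant).
// Case-sensitive; not the LM-Kit.NET implementation.
static int EditDistance(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];

    // Transforming a prefix to or from the empty string costs its length.
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;

    for (int i = 1; i <= a.Length; i++)
    {
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;

            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1,      // deletion
                         d[i, j - 1] + 1),     // insertion
                d[i - 1, j - 1] + cost);       // substitution or match

            // Adjacent transposition, e.g. "ta" <-> "at".
            if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + 1);
        }
    }
    return d[a.Length, b.Length];
}

Console.WriteLine(EditDistance("Invoice", "lnvoice"));                // 1
Console.WriteLine(EditDistance("Number", "Nurnber"));                 // 2
Console.WriteLine(EditDistance("Invoice Number", "lnvoice Nurnber")); // 3
Console.WriteLine(EditDistance("Total", "Toatl"));                    // 1
```

Measured over the whole phrase, the OCR variant is three edits away from the query, which exceeds the default MaxEditDistance of 2. Per word, the distances are only 1 and 2, which is the intuition behind comparing tokens individually.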


Step 8: Search Within a Page Region

Region search restricts results to a rectangular area of the page. This is essential for extracting data from specific zones: headers, footers, sidebars, or form fields at known positions.

using LMKit.Graphics.Geometry;

// ──────────────────────────────────────
// 10. Region search
// ──────────────────────────────────────
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\n=== Region Search: Top-Right Quadrant ===");
Console.ResetColor();

// Define the top-right quadrant (where header fields such as the invoice number often appear)
var topRight = new Rectangle(
    page.Width / 2,           // left edge at page midpoint
    0,                         // top of page
    page.Width / 2,           // half the page width
    page.Height / 2           // half the page height
);

List<TextMatch> regionMatches = searchEngine.FindInRegion(page, topRight,
    new RegionSearchOptions
    {
        IncludePartial = true,     // include text that overlaps the boundary
        MergeTouching = true       // merge adjacent text elements into phrases
    });

Console.WriteLine($"Text in top-right quadrant ({regionMatches.Count} match(es)):\n");
foreach (TextMatch match in regionMatches)
{
    Console.WriteLine($"  \"{match.Text}\"");
    Console.WriteLine($"    Position: ({match.Bounds.Left:F0}, {match.Bounds.Top:F0})");
}

// Search a specific form field zone
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\n=== Region Search: Invoice Number Field ===");
Console.ResetColor();

var invoiceFieldZone = Rectangle.FromBounds(
    left: 300, top: 50, right: 550, bottom: 90);

List<TextMatch> fieldMatches = searchEngine.FindInRegion(page, invoiceFieldZone);

Console.WriteLine($"Text in invoice field zone:");
foreach (TextMatch match in fieldMatches)
{
    Console.WriteLine($"  \"{match.Text}\"");
}

IncludePartial (default true) includes text elements that partially overlap the region boundary. Set it to false when you need only elements fully contained within the rectangle. MergeTouching (default true) merges adjacent text elements into complete phrases, which is important when individual characters or words are stored as separate elements.
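
The difference between partial overlap and full containment can be sketched with plain geometry. The helpers below are hypothetical (not LM-Kit.NET APIs) and model boxes as `(Left, Top, Right, Bottom)` tuples, assuming a top-left origin with Y increasing downward; they only illustrate the two tests that an IncludePartial-style switch chooses between.

```csharp
using System;

// Hypothetical helpers. Boxes are (Left, Top, Right, Bottom) in page units,
// assuming a top-left origin with Y increasing downward.

// True when the box shares any area with the region (partial overlap counts).
static bool Overlaps(
    (double Left, double Top, double Right, double Bottom) region,
    (double Left, double Top, double Right, double Bottom) box) =>
    box.Left < region.Right && box.Right > region.Left &&
    box.Top < region.Bottom && box.Bottom > region.Top;

// True only when the box lies entirely inside the region.
static bool Contains(
    (double Left, double Top, double Right, double Bottom) region,
    (double Left, double Top, double Right, double Bottom) box) =>
    box.Left >= region.Left && box.Right <= region.Right &&
    box.Top >= region.Top && box.Bottom <= region.Bottom;

// Same bounds as the invoice field zone in the sample above.
var zone = (Left: 300.0, Top: 50.0, Right: 550.0, Bottom: 90.0);

var inside = (Left: 320.0, Top: 60.0, Right: 400.0, Bottom: 75.0);
var straddling = (Left: 520.0, Top: 60.0, Right: 600.0, Bottom: 75.0); // crosses the right edge
var outside = (Left: 600.0, Top: 60.0, Right: 700.0, Bottom: 75.0);

Console.WriteLine($"{Overlaps(zone, inside)} {Contains(zone, inside)}");         // True True
Console.WriteLine($"{Overlaps(zone, straddling)} {Contains(zone, straddling)}"); // True False
Console.WriteLine($"{Overlaps(zone, outside)} {Contains(zone, outside)}");       // False False
```

The straddling box is the case that matters: a partial-inclusion search would return it, while a strict containment search would not. Strict containment is usually the safer choice for tightly drawn form-field zones where neighboring fields sit close to the boundary.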


Step 9: Proximity Search and Between-Text Extraction

Proximity search finds text near an anchor point on the page. This is useful when you know where a label is but need to find the associated value nearby. Between-text extraction pulls all content between two text anchors, which is ideal for extracting section content.

using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Document.Search;
using LMKit.Graphics.Geometry; // Rectangle, used for the proximity anchor region

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// LMKit.Licensing.LicenseManager.SetLicenseKey("YOUR_LICENSE_KEY");

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "financial_report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({attachment.PageCount} pages)\n");

// ──────────────────────────────────────
// 2. Inspect page metadata
// ──────────────────────────────────────
for (int i = 0; i < attachment.PageCount; i++)
{
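    // Named "current" because a top-level "page" variable is declared after this loop;
    // reusing the name inside the loop would fail to compile (CS0136).
    PageElement current = attachment.PageElements[i];
    Console.WriteLine($"Page {i + 1}: {current.Width:F0} x {current.Height:F0} ({current.Unit})");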
}

PageElement page = attachment.PageElements[0];
var searchEngine = new LayoutSearchEngine();

// ──────────────────────────────────────
// 11. Proximity search
// ──────────────────────────────────────
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\n=== Proximity Search: Values Near 'Total' ===");
Console.ResetColor();

// Find dollar amounts near the "Total:" label
// First, locate the "Total:" label
List<TextMatch> totalLabel = searchEngine.FindText(page, "Total:");
if (totalLabel.Count > 0)
{
    TextMatch label = totalLabel[0];
    Console.WriteLine($"Found label \"Total:\" at ({label.Bounds.Left:F0}, {label.Bounds.Top:F0})\n");

    // Search for dollar amounts near that label
    var anchorRegion = new Rectangle(
        label.Bounds.Left, label.Bounds.Top,
        label.Bounds.Right - label.Bounds.Left,
        label.Bounds.Bottom - label.Bounds.Top);

    List<TextMatch> nearTotal = searchEngine.FindNear(page, @"\$[\d,]+\.\d{2}",
        new ProximityOptions
        {
            AnchorRegion = anchorRegion,
            Radius = 0.1     // search radius as fraction of page diagonal
        });

    Console.WriteLine($"Dollar amounts near \"Total:\" ({nearTotal.Count} match(es)):");
    foreach (TextMatch match in nearTotal)
    {
        Console.WriteLine($"  {match.Text} (score: {match.Score:F2})");
    }
}

// ──────────────────────────────────────
// 12. Between-text extraction
// ──────────────────────────────────────
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\n=== Between-Text Extraction ===");
Console.ResetColor();

// Extract content between "Terms and Conditions" and "Signature"
List<TextMatch> betweenMatches = searchEngine.FindBetween(page,
    "Terms and Conditions",
    "Signature",
    new BetweenOptions
    {
        Inclusive = false,      // exclude the anchor text itself
        MaxChars = 5000         // limit extraction length
    });

Console.WriteLine($"Content between anchors ({betweenMatches.Count} segment(s)):\n");
foreach (TextMatch match in betweenMatches)
{
    string preview = match.Text.Length > 200
        ? match.Text[..200] + "..."
        : match.Text;
    Console.WriteLine($"  Page {match.PageIndex + 1}:");
    Console.WriteLine($"    {preview.Replace("\n", " ")}");
    Console.WriteLine($"    Length: {match.Length} chars");
}

The ProximityOptions.Radius property is expressed as a fraction of the page diagonal. A value of 0.05 searches a small neighborhood; 0.1 covers a broader area. The ProximityOptions.TextOptions property lets you apply TextSearchOptions to the query pattern, enabling case-insensitive or whole-word proximity matching.
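
Because the radius is relative to the page diagonal, it helps to translate it into page units when tuning. The following sketch does that arithmetic for a US Letter page measured in points (612 x 792). `RadiusInPageUnits` is a hypothetical helper, and how the library measures distance from the anchor (for example, edge-to-edge or center-to-center) is not covered by this calculation.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: converts a diagonal-relative radius into page units.
static double RadiusInPageUnits(double pageWidth, double pageHeight, double radiusFraction)
{
    double diagonal = Math.Sqrt(pageWidth * pageWidth + pageHeight * pageHeight);
    return diagonal * radiusFraction;
}

// US Letter in points: 612 x 792, so the diagonal is about 1000.9 points.
double small = RadiusInPageUnits(612, 792, 0.05);
double broad = RadiusInPageUnits(612, 792, 0.10);

Console.WriteLine(small.ToString("F1", CultureInfo.InvariantCulture)); // 50.0
Console.WriteLine(broad.ToString("F1", CultureInfo.InvariantCulture)); // 100.1
```

On this page size, a radius of 0.1 corresponds to roughly 100 points, or about 1.4 inches at 72 points per inch, which is a useful sanity check when a label and its value sit in different table columns.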

BetweenOptions.Inclusive controls whether the anchor text ("Terms and Conditions" and "Signature" in the example) is included in the result. BetweenOptions.AnchorTextOptions lets you apply TextSearchOptions to the anchor queries, enabling case-insensitive anchor matching.
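
The inclusive and exclusive behaviors are easiest to see on a plain string. The sketch below is a string-only analogy, not the LM-Kit.NET implementation: `ExtractBetween` is a hypothetical helper that uses `String.IndexOf` with case-insensitive comparison, and it mirrors the roles of Inclusive and MaxChars without any layout awareness.

```csharp
using System;

// Hypothetical string-only analogy of between-anchor extraction.
// Returns an empty string when either anchor is missing.
static string ExtractBetween(string text, string startAnchor, string endAnchor,
                             bool inclusive, int maxChars = 5000)
{
    int start = text.IndexOf(startAnchor, StringComparison.OrdinalIgnoreCase);
    if (start < 0) return string.Empty;

    // Look for the end anchor only after the start anchor.
    int contentStart = start + startAnchor.Length;
    int end = text.IndexOf(endAnchor, contentStart, StringComparison.OrdinalIgnoreCase);
    if (end < 0) return string.Empty;

    // Inclusive keeps both anchors; exclusive keeps only what lies between them.
    int from = inclusive ? start : contentStart;
    int to = inclusive ? end + endAnchor.Length : end;

    string result = text.Substring(from, to - from).Trim();
    return result.Length > maxChars ? result.Substring(0, maxChars) : result;
}

string contract = "Terms and Conditions Payment is due within 30 days. Signature: ________";

Console.WriteLine(ExtractBetween(contract, "Terms and Conditions", "Signature", inclusive: false));
// Payment is due within 30 days.

Console.WriteLine(ExtractBetween(contract, "Terms and Conditions", "Signature", inclusive: true));
// Terms and Conditions Payment is due within 30 days. Signature
```

Exclusive extraction is usually what you want when the anchors are section headings; inclusive extraction is useful when the anchor text itself must appear in an audit record.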


Step 10: Cross-Page Search and Complete Pipeline

Every search method has a cross-page overload that accepts IEnumerable<PageElement>. This example builds a complete document analysis pipeline that loads a multi-page PDF, detects structure on each page, searches across all pages, extracts data from specific regions, and produces structured JSON output.

using System.Text.Json;

// ──────────────────────────────────────
// 13. Cross-page analysis pipeline
// ──────────────────────────────────────
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine("\n=== Complete Document Analysis Pipeline ===");
Console.ResetColor();

// Collect all pages
var allPages = new List<PageElement>();
for (int i = 0; i < attachment.PageCount; i++)
{
    allPages.Add(attachment.PageElements[i]);
}

// A. Detect structure on each page
Console.WriteLine("\n--- Structure Detection ---\n");
int totalLines = 0;
int totalParagraphs = 0;

for (int i = 0; i < allPages.Count; i++)
{
    List<LineElement> pageLines = allPages[i].DetectLines();
    List<ParagraphElement> pageParas = allPages[i].DetectParagraphs();
    totalLines += pageLines.Count;
    totalParagraphs += pageParas.Count;

    Console.WriteLine($"  Page {i + 1}: {pageLines.Count} lines, {pageParas.Count} paragraphs, " +
                      $"rotation: {allPages[i].Rotation}°");
}

Console.WriteLine($"\n  Total: {totalLines} lines, {totalParagraphs} paragraphs across " +
                  $"{allPages.Count} page(s)");

// B. Cross-page search for key terms
Console.WriteLine("\n--- Cross-Page Key Term Search ---\n");

string[] keyTerms = { "Revenue", "Net Income", "Total Assets", "Liabilities" };

foreach (string term in keyTerms)
{
    // Cross-page overload: pass the full list of pages
    List<TextMatch> termMatches = searchEngine.FindText(allPages, term,
        new TextSearchOptions
        {
            Comparison = StringComparison.OrdinalIgnoreCase,
            WholeWord = true
        });

    Console.WriteLine($"  \"{term}\": {termMatches.Count} occurrence(s)");
    foreach (TextMatch match in termMatches.Take(3))
    {
        Console.WriteLine($"    Page {match.PageIndex + 1}: ...{match.Snippet}...");
    }
}

// C. Extract all monetary values across pages
Console.WriteLine("\n--- Cross-Page Monetary Value Extraction ---\n");

List<TextMatch> allAmounts = searchEngine.FindRegex(allPages, @"\$[\d,]+\.?\d*",
    new RegexSearchOptions { MaxResults = 100, ContextChars = 30 });

Console.WriteLine($"  Found {allAmounts.Count} monetary value(s) across all pages");

// D. Build structured JSON output
var analysisResult = new
{
    FileName = Path.GetFileName(pdfPath),
    PageCount = attachment.PageCount,
    TotalLines = totalLines,
    TotalParagraphs = totalParagraphs,
    MonetaryValues = allAmounts.Select(m => new
    {
        Value = m.Text,
        Page = m.PageIndex + 1,
        Context = m.Snippet,
        Position = new
        {
            Left = Math.Round(m.Bounds.Left, 1),
            Top = Math.Round(m.Bounds.Top, 1)
        }
    }).ToArray(),
    KeyTermOccurrences = keyTerms.Select(term =>
    {
        var matches = searchEngine.FindText(allPages, term,
            new TextSearchOptions
            {
                Comparison = StringComparison.OrdinalIgnoreCase,
                WholeWord = true
            });
        return new
        {
            Term = term,
            Count = matches.Count,
            Pages = matches.Select(m => m.PageIndex + 1).Distinct().ToArray()
        };
    }).ToArray()
};

string json = JsonSerializer.Serialize(analysisResult, new JsonSerializerOptions
{
    WriteIndented = true
});

string jsonOutputPath = Path.ChangeExtension(pdfPath, ".analysis.json");
File.WriteAllText(jsonOutputPath, json);

Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine($"\nAnalysis saved to: {jsonOutputPath}");
Console.ResetColor();

Console.WriteLine($"\nJSON preview:\n{json[..Math.Min(json.Length, 500)]}");

Global Search Options

The LayoutSearchEngine constructor accepts a LayoutSearchOptions object that applies text normalization to all searches performed by that engine instance:

| Option | Default | Effect |
|---|---|---|
| NormalizeWhitespace | false | Collapses multiple spaces and newlines into single spaces |
| IgnoreDiacritics | false | Treats accented characters as their base form (e.g., "é" matches "e") |
| IgnorePunctuation | false | Strips punctuation before matching |
| IgnoreSymbols | false | Strips symbols (currency signs, math operators) before matching |

using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Document.Search;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// LMKit.Licensing.LicenseManager.SetLicenseKey("YOUR_LICENSE_KEY");

// ──────────────────────────────────────
// 1. Load the document
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "financial_report.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({attachment.PageCount} pages)\n");

// ──────────────────────────────────────
// 2. Inspect page metadata
// ──────────────────────────────────────
for (int i = 0; i < attachment.PageCount; i++)
{
    PageElement page = attachment.PageElements[i];
    Console.WriteLine($"Page {i + 1}: {page.Width:F0} x {page.Height:F0} ({page.Unit})");
    Console.WriteLine($"  Rotation: {page.Rotation}°, Skew: {page.Skew:F2}°");
    Console.WriteLine($"  Text elements: {page.TextElements.Count()}");
}

// Create an engine that normalizes whitespace and ignores diacritics
var normalizedSearch = new LayoutSearchEngine(new LayoutSearchOptions
{
    NormalizeWhitespace = true,
    IgnoreDiacritics = true
});

These options are useful when searching OCR-processed documents where whitespace is inconsistent or when searching multilingual documents where diacritical marks vary.
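
To show what these two normalizations mean in practice, the sketch below reproduces them on plain strings with standard .NET APIs (`Regex.Replace` and Unicode decomposition via `String.Normalize`). The helper names are hypothetical and this is only a conceptual equivalent; it does not claim to match how LM-Kit.NET normalizes text internally.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper: collapses any run of spaces, tabs, or newlines into one space.
static string CollapseWhitespace(string text) =>
    Regex.Replace(text, @"\s+", " ");

// Hypothetical helper: removes accents by decomposing characters and dropping combining marks.
static string StripDiacritics(string text)
{
    // FormD splits "é" into "e" followed by a combining acute accent.
    string decomposed = text.Normalize(NormalizationForm.FormD);

    var kept = decomposed.Where(c =>
        CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark);

    // Recompose whatever remains into the usual composed form.
    return new string(kept.ToArray()).Normalize(NormalizationForm.FormC);
}

Console.WriteLine(CollapseWhitespace("Total   \n  Revenue")); // Total Revenue
Console.WriteLine(StripDiacritics("Société Générale"));       // Societe Generale
```

With both normalizations applied to the query and the page text, "Societe  Generale" typed without accents and with a doubled space would still match the accented original.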


Common Issues

| Problem | Cause | Fix |
|---|---|---|
| No text elements on page | PDF is scanned (image-only) with no text layer | Run OCR first using VlmOcr or LMKitOcr to create a text layer |
| Lines detected out of order | Complex multi-column layout or overlapping text boxes | DetectLines() handles most multi-column layouts; verify with RawLines output mode |
| Fuzzy search returns too many results | MaxEditDistance too high or MinScore too low | Reduce MaxEditDistance to 1 or 2 and increase MinScore to 0.8 or higher |
| Region search misses expected text | Coordinate system mismatch or incorrect rectangle | PDF coordinates use points (72 points per inch); verify bounds with TextElement positions |
| Proximity search returns no results | Radius too small for the page scale | Increase Radius from 0.05 to 0.1 or 0.15; the value is relative to the page diagonal |
| Between search spans pages unexpectedly | Start anchor on one page and end anchor on another | Use the single-page overload when anchors are on the same page, or handle PageIndex in results |
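
For the coordinate mismatch case, a common source of error is defining a zone in inches (as measured on a printed template) while the page reports points. The sketch below converts an inch-based zone to points at 72 points per inch. `InchesToPoints` and the zone measurements are hypothetical, and the result only applies when the page's unit (as printed from `page.Unit` in Step 3) is points and the origin is the top-left corner.

```csharp
using System;

// Hypothetical helper: PDF user space uses 72 points per inch.
static double InchesToPoints(double inches) => inches * 72.0;

// A field measured on a printed template: 4 in from the left edge,
// 0.75 in from the top, 3.5 in wide, 0.5 in tall.
double left = InchesToPoints(4.0);
double top = InchesToPoints(0.75);
double right = InchesToPoints(4.0 + 3.5);
double bottom = InchesToPoints(0.75 + 0.5);

Console.WriteLine($"{left} {top} {right} {bottom}"); // 288 54 540 90

// Sanity check against a known page size: US Letter is 8.5 x 11 inches.
Console.WriteLine($"{InchesToPoints(8.5)} x {InchesToPoints(11)}"); // 612 x 792
```

The four values can then be used as the left, top, right, and bottom bounds of a region rectangle, as in the invoice field zone example in Step 8. If the converted page size does not match the reported `page.Width` and `page.Height`, the page is probably not measured in points.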
