Search and Locate Content Within Documents

Enterprise document workflows often require finding specific content inside large documents: matching invoice numbers, locating regulatory clauses, or finding all monetary amounts on a page. LM-Kit.NET's LayoutSearchEngine performs layout-aware searches that go beyond plain text matching. It supports exact text search, regular expressions, fuzzy matching, spatial region queries, proximity search, and range extraction between anchors. All searches return bounding-box coordinates alongside matched text, enabling downstream tasks like redaction, annotation, and visual highlighting.

Why Layout-Aware Search Matters

Layout-aware document search solves two common enterprise problems:

Regulatory compliance field extraction. Auditors need to locate specific clauses, dates, and amounts scattered across hundreds of pages in contracts or regulatory filings. A layout-aware search returns not just the text but its exact position on the page, enabling automated annotation and evidence collection.
Automated form field location. Insurance claims, tax forms, and government applications have labeled fields in fixed positions. Spatial region search finds values near known labels, even when the document layout varies slightly between versions.

Prerequisites

Requirement	Minimum
.NET SDK	8.0+
RAM	4 GB
Input formats	PDF (with text layer), DOCX, images (with OCR)

No GPU is required for search operations. A GPU is needed only if you also run OCR or VLM processing.

Step 1: Create the Project

dotnet new console -n DocumentSearch
cd DocumentSearch
dotnet add package LM-Kit.NET

Step 2: Understand the Search Capabilities

┌────────────────────────────────────┐
│         LayoutSearchEngine         │
├────────────────────────────────────┤
│  FindText()     Exact substring    │
│  FindRegex()    Pattern matching   │
│  FindFuzzy()    Approximate match  │
│  FindInRegion() Spatial area       │
│  FindNear()     Proximity search   │
│  FindBetween()  Range extraction   │
└────────────────┬───────────────────┘
                 │
                 ▼
        List<TextMatch>
        ├── Text        (matched content)
        ├── Snippet     (surrounding context)
        ├── Bounds      (bounding box)
        ├── Score       (relevance 0..1)
        └── PageIndex   (page location)

All search methods operate on PageElement objects, each of which represents a single page with positioned text elements. Every method also has a cross-page overload that accepts IEnumerable<PageElement>.

Step 3: Load a Document and Extract Page Layout

using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Document.Search;

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load the PDF and extract page elements
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "contract.pdf";

if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}

var attachment = new Attachment(pdfPath);
int pageCount = attachment.PageCount;
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({pageCount} pages)\n");

// Get page elements (layout-aware) for each page
IReadOnlyList<PageElement> pages = attachment.PageElements;

// ──────────────────────────────────────
// 2. Create the search engine
// ──────────────────────────────────────
var search = new LayoutSearchEngine();

Step 4: Exact Text Search

Find all occurrences of a specific string across the document:

Console.WriteLine("=== Exact Text Search ===\n");

List<TextMatch> matches = search.FindText(pages, "payment due");

Console.WriteLine($"Found {matches.Count} match(es) for \"payment due\":\n");

foreach (TextMatch match in matches)
{
    Console.WriteLine($"  Page {match.PageIndex + 1}: \"{match.Text}\"");
    Console.WriteLine($"    Context: ...{match.Snippet}...");
    Console.WriteLine($"    Position: ({match.Bounds.TopLeft.X:F0}, {match.Bounds.TopLeft.Y:F0})");
    Console.WriteLine();
}

Step 5: Regex Pattern Search

Find all monetary amounts, dates, or other patterns:

Console.WriteLine("=== Regex Search: Monetary Amounts ===\n");

// Find dollar amounts like $1,234.56
List<TextMatch> amounts = search.FindRegex(pages, @"\$[\d,]+\.\d{2}");

Console.WriteLine($"Found {amounts.Count} monetary amount(s):\n");

foreach (TextMatch match in amounts)
{
    Console.WriteLine($"  Page {match.PageIndex + 1}: {match.Text}");
    Console.WriteLine($"    Context: ...{match.Snippet}...");
}

Console.WriteLine();

// Find dates in MM/DD/YYYY format
List<TextMatch> dates = search.FindRegex(pages, @"\d{2}/\d{2}/\d{4}");

Console.WriteLine($"Found {dates.Count} date(s):\n");
foreach (TextMatch match in dates)
{
    Console.WriteLine($"  Page {match.PageIndex + 1}: {match.Text}");
}
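
Before running a pattern against a real document, it can help to check it on a plain string with .NET's Regex class. The sketch below does not use LayoutSearchEngine. It runs the MM/DD/YYYY pattern on a made-up sample string, with and without word boundaries, and then filters out strings that are not real calendar dates. The helper names FindDates and IsCalendarDate are hypothetical, and the validation step is an illustrative addition rather than part of LM-Kit.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text.RegularExpressions;

// Return every MM/DD/YYYY-shaped substring, with or without \b anchors.
static List<string> FindDates(string text, bool useWordBoundaries)
{
    string pattern = useWordBoundaries
        ? @"\b\d{2}/\d{2}/\d{4}\b"
        : @"\d{2}/\d{2}/\d{4}";
    return Regex.Matches(text, pattern).Select(m => m.Value).ToList();
}

// True only when the string is a real calendar date in MM/dd/yyyy form.
static bool IsCalendarDate(string value) =>
    DateTime.TryParseExact(value, "MM/dd/yyyy", CultureInfo.InvariantCulture,
                           DateTimeStyles.None, out _);

// Made-up sample: "112/31/20245" is a reference number, not a date.
string sample = "Ref 112/31/20245 due 03/15/2024 or 13/45/2024";

List<string> loose = FindDates(sample, useWordBoundaries: false);
List<string> strict = FindDates(sample, useWordBoundaries: true);
List<string> valid = strict.Where(IsCalendarDate).ToList();

Console.WriteLine($"Loose:  {string.Join(", ", loose)}");
Console.WriteLine($"Strict: {string.Join(", ", strict)}");
Console.WriteLine($"Valid:  {string.Join(", ", valid)}");
```

On this sample the loose pattern picks up 12/31/2024 from inside the reference number, the bounded pattern drops it, and only 03/15/2024 survives the calendar check.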

Step 6: Fuzzy Matching

Find text even when it contains typos or OCR errors:

Console.WriteLine("\n=== Fuzzy Search ===\n");

var fuzzyOptions = new FuzzySearchOptions
{
    MaxDistance = 2  // allow up to 2 character edits (Damerau-Levenshtein)
};

// Query with the correct spelling; misspelled or misread variants in the document still match.
List<TextMatch> fuzzyMatches = search.FindFuzzy(pages, "receivable", fuzzyOptions);

Console.WriteLine($"Fuzzy matches for \"receivable\" (max distance 2):\n");
foreach (TextMatch match in fuzzyMatches)
{
    Console.WriteLine($"  Page {match.PageIndex + 1}: \"{match.Text}\" (score: {match.Score:F2})");
}

Fuzzy search is particularly useful for scanned documents where OCR may introduce character-level errors. A search for "receivable" with max distance 2 will match "recievable", "receivble", and other common OCR misreads.
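
To get a feel for what a given MaxDistance allows, the edit distance between two strings can be computed by hand. The following is a minimal sketch of the optimal string alignment variant of Damerau-Levenshtein distance (insertions, deletions, substitutions, and swaps of adjacent characters). It is a standalone illustration; the function name EditDistance is hypothetical, and LM-Kit.NET's internal scoring may differ in its details.

```csharp
using System;

// Optimal string alignment distance: Levenshtein edits plus
// transposition of two adjacent characters, each costing 1.
static int EditDistance(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;   // delete all of a
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;   // insert all of b

    for (int i = 1; i <= a.Length; i++)
    {
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            int best = Math.Min(
                Math.Min(d[i - 1, j] + 1,      // deletion
                         d[i, j - 1] + 1),     // insertion
                d[i - 1, j - 1] + cost);       // substitution or match

            // Adjacent transposition, e.g. "ei" <-> "ie".
            if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                best = Math.Min(best, d[i - 2, j - 2] + 1);

            d[i, j] = best;
        }
    }
    return d[a.Length, b.Length];
}

Console.WriteLine(EditDistance("receivable", "recievable")); // 1 (swap "ei")
Console.WriteLine(EditDistance("receivable", "receivble"));  // 1 (missing "a")
Console.WriteLine(EditDistance("receivable", "recelvab1e")); // 2 (two misread characters)
```

All three variants fall within a maximum distance of 2, which is why that setting tolerates the misspelling recievable as well as typical single-character OCR misreads.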

Step 7: Spatial Region Search

Find all text within a specific rectangular area of a page:

Console.WriteLine("\n=== Region Search: Top-Right Header ===\n");

// Define a region in the top-right corner of page 1 (in points)
// PDF coordinates: origin at bottom-left, 72 points per inch
PageElement firstPage = pages[0];
var headerRegion = new LMKit.Graphics.Geometry.Rectangle(
    x: firstPage.Width * 0.5,     // right half
    y: firstPage.Height * 0.85,   // top 15%
    width: firstPage.Width * 0.5,
    height: firstPage.Height * 0.15
);

List<TextMatch> regionMatches = search.FindInRegion(firstPage, headerRegion);

Console.WriteLine($"Text in header region ({regionMatches.Count} match(es)):\n");
foreach (TextMatch match in regionMatches)
{
    Console.WriteLine($"  \"{match.Text}\"");
}
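
Region coordinates often come from tools that measure from the top-left corner of the page (image viewers, many OCR outputs), while the comment in the code above describes a bottom-left origin. The sketch below shows the arithmetic for converting between the two, assuming the rectangle's y value denotes its lower edge as in the header region above. The helper names ToBottomLeftOrigin and InchesToPoints are hypothetical and not part of LM-Kit.NET.

```csharp
using System;

// Convert a rectangle measured from the top-left corner of the page
// (y grows downward) to bottom-left-origin coordinates (y grows upward).
// The returned Y is the rectangle's lower edge.
static (double X, double Y, double Width, double Height) ToBottomLeftOrigin(
    double x, double yFromTop, double width, double height, double pageHeight)
{
    return (x, pageHeight - yFromTop - height, width, height);
}

// PDF user space uses 72 points per inch.
static double InchesToPoints(double inches) => inches * 72.0;

// US Letter page: 8.5 x 11 inches = 612 x 792 points.
double letterWidth = InchesToPoints(8.5);
double letterHeight = InchesToPoints(11);

// Right half, top 15% of the page, expressed from the top-left corner.
var region = ToBottomLeftOrigin(
    x: letterWidth * 0.5,
    yFromTop: 0,
    width: letterWidth * 0.5,
    height: letterHeight * 0.15,
    pageHeight: letterHeight);

Console.WriteLine(
    $"x={region.X:F1}, y={region.Y:F1}, w={region.Width:F1}, h={region.Height:F1}");
```

For a Letter page this yields a lower edge at 85% of the page height, which matches the y value used for the header region in the code above.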

Step 8: Proximity Search

Find text that appears near other text on the page:

Console.WriteLine("\n=== Proximity Search ===\n");

// Find text near the label "Total Due"
var proximityOptions = new ProximityOptions
{
    Anchor = "Total Due",
    MaxDistance = 100  // within 100 points of the anchor
};

List<TextMatch> nearTotal = search.FindNear(pages, "Total Due", proximityOptions);

Console.WriteLine($"Text near \"Total Due\":\n");
foreach (TextMatch match in nearTotal)
{
    // Score is a relevance value in the 0..1 range, not a distance in points.
    Console.WriteLine($"  Page {match.PageIndex + 1}: \"{match.Text}\" (score: {match.Score:F2})");
}
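
This tutorial does not spell out how the distance behind MaxDistance is measured. One plausible metric, shown here purely as an assumption, is the shortest gap between the edges of two bounding boxes. The sketch uses plain tuples instead of LM-Kit.NET types, and the function name RectGap and the sample coordinates are hypothetical.

```csharp
using System;

// Shortest distance between the edges of two axis-aligned rectangles,
// each given as (X, Y, W, H). Returns 0 when they touch or overlap.
static double RectGap(
    (double X, double Y, double W, double H) a,
    (double X, double Y, double W, double H) b)
{
    // Horizontal and vertical separation; negative values mean overlap on that axis.
    double dx = Math.Max(0, Math.Max(a.X - (b.X + b.W), b.X - (a.X + a.W)));
    double dy = Math.Max(0, Math.Max(a.Y - (b.Y + b.H), b.Y - (a.Y + a.H)));
    return Math.Sqrt(dx * dx + dy * dy);
}

// Made-up boxes for a label and the value printed to its right (in points).
var label = (X: 72.0, Y: 700.0, W: 60.0, H: 12.0);
var amount = (X: 150.0, Y: 700.0, W: 50.0, H: 12.0);

double gap = RectGap(label, amount);
Console.WriteLine($"Gap: {gap:F0} pt, within 100 pt: {gap <= 100}");
```

With these boxes the label ends at x = 132 and the value starts at x = 150 on the same line, so the gap is 18 points and falls well inside a 100-point limit.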

Step 9: Range Extraction Between Anchors

Extract all text that appears between two markers:

Console.WriteLine("\n=== Between Search: Extract Clause Content ===\n");

// Extract text between "TERMS AND CONDITIONS" and "SIGNATURES"
List<TextMatch> clauseContent = search.FindBetween(
    pages,
    startQuery: "TERMS AND CONDITIONS",
    endQuery: "SIGNATURES");

Console.WriteLine($"Content between markers ({clauseContent.Count} segment(s)):\n");
foreach (TextMatch match in clauseContent)
{
    Console.WriteLine($"  Page {match.PageIndex + 1}:");
    Console.WriteLine($"    {Truncate(match.Text, 120)}");
    Console.WriteLine();
}

static string Truncate(string text, int max)
{
    if (string.IsNullOrEmpty(text)) return "";
    string cleaned = text.Replace("\n", " ").Replace("\r", "");
    return cleaned.Length <= max ? cleaned : cleaned[..max] + "...";
}
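
The idea behind range extraction can be illustrated on plain strings: find the start marker, find the end marker after it, and keep what lies between. The sketch below is a simplified stand-in that ignores layout and bounding boxes entirely. It joins the text of two made-up pages first so that content crossing a page break is still captured. The helper name TryExtractBetween is hypothetical.

```csharp
using System;

// Extract the text between the first occurrence of `start` and the next
// occurrence of `end` (case-insensitive). Returns false if either is missing.
static bool TryExtractBetween(string text, string start, string end, out string content)
{
    content = "";
    int startIndex = text.IndexOf(start, StringComparison.OrdinalIgnoreCase);
    if (startIndex < 0) return false;

    int from = startIndex + start.Length;
    int endIndex = text.IndexOf(end, from, StringComparison.OrdinalIgnoreCase);
    if (endIndex < 0) return false;

    content = text[from..endIndex].Trim();
    return true;
}

// Made-up page texts; the clause continues from page 1 onto page 2.
string[] pageTexts =
{
    "Contract\nTERMS AND CONDITIONS\nPayment is due in 30 days.",
    "Late fees apply.\nSIGNATURES\nJane Doe"
};

string joined = string.Join("\n", pageTexts);

if (TryExtractBetween(joined, "TERMS AND CONDITIONS", "SIGNATURES", out string clause))
    Console.WriteLine(clause);
else
    Console.WriteLine("Markers not found.");
```

Searching each page separately would fail here because no single page contains both markers, which is the situation the cross-page overload is meant to handle.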

Step 10: Building a Complete Document Audit Tool

Combine multiple search types into a document audit pipeline:

Console.WriteLine("=== Document Audit Report ===\n");

// 1. Find all email addresses
var emails = search.FindRegex(pages, @"[\w.+-]+@[\w-]+\.[\w.-]+");
Console.WriteLine($"Email addresses found: {emails.Count}");
foreach (var e in emails)
    Console.WriteLine($"  p.{e.PageIndex + 1}: {e.Text}");

// 2. Find all phone numbers
var phones = search.FindRegex(pages, @"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}");
Console.WriteLine($"\nPhone numbers found: {phones.Count}");
foreach (var p in phones)
    Console.WriteLine($"  p.{p.PageIndex + 1}: {p.Text}");

// 3. Find all dates
var allDates = search.FindRegex(pages, @"\b\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}\b");
Console.WriteLine($"\nDates found: {allDates.Count}");
foreach (var d in allDates)
    Console.WriteLine($"  p.{d.PageIndex + 1}: {d.Text}");

// 4. Find all monetary values (a digit must follow the symbol; the decimal part is optional)
var money = search.FindRegex(pages, @"[$€£]\d[\d,]*(?:\.\d+)?");
Console.WriteLine($"\nMonetary values found: {money.Count}");
foreach (var m in money)
    Console.WriteLine($"  p.{m.PageIndex + 1}: {m.Text}");

Console.WriteLine("\n=== Audit Complete ===");
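
An audit usually needs the matched amounts as numbers rather than strings. The following sketch converts matched text such as "$1,234.56" into a decimal. It assumes US-style formatting (comma as thousands separator, period as decimal separator) and operates on made-up strings rather than real TextMatch results. The helper name TryParseAmount is hypothetical.

```csharp
using System;
using System.Globalization;

// Strip a leading currency symbol and parse the rest as a decimal,
// assuming "," groups thousands and "." marks the decimal part.
static bool TryParseAmount(string text, out decimal amount)
{
    string number = text.Trim().TrimStart('$', '€', '£');
    return decimal.TryParse(
        number,
        NumberStyles.AllowThousands | NumberStyles.AllowDecimalPoint,
        CultureInfo.InvariantCulture,
        out amount);
}

// Made-up values standing in for the Text of each monetary match.
string[] matchedTexts = { "$1,234.56", "€99", "£abc" };

foreach (string raw in matchedTexts)
{
    if (TryParseAmount(raw, out decimal parsed))
        Console.WriteLine($"{raw} -> {parsed.ToString(CultureInfo.InvariantCulture)}");
    else
        Console.WriteLine($"{raw} -> not a number");
}
```

Documents that use European formatting (for example 1.234,56) would need a different culture or a normalization step before parsing.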

Search Method Reference

Method	Purpose	Key Options
`FindText`	Exact substring match	Case sensitivity
`FindRegex`	Regular expression pattern	Standard .NET regex syntax
`FindFuzzy`	Approximate match (typo-tolerant)	`MaxDistance` (edit distance)
`FindInRegion`	Text within a spatial rectangle	Rectangle coordinates (points)
`FindNear`	Text near an anchor string	`Anchor`, `MaxDistance` (points)
`FindBetween`	Text between two markers	`startQuery`, `endQuery`

All methods return List<TextMatch> with Text, Snippet, Bounds, Score, and PageIndex.

Common Issues

Problem	Cause	Fix
No matches on scanned PDF	PDF has no text layer	Use `VlmOcr` or `LMKitOcr` to extract text first
Regex finds partial matches	Pattern too broad	Add word boundaries (`\b`) to patterns
Fuzzy search returns too many results	`MaxDistance` too high	Reduce to 1 or 2 for tighter matching
Region search misses text	Coordinate system mismatch	Express the rectangle in PDF points (72 per inch) with the origin at bottom-left
Between search misses content that continues on the next page	Text crosses a page break	Use the cross-page overload with `IEnumerable<PageElement>`

Next Steps

Search and Highlight Demo: interactive console app that searches PDFs and images, then produces highlighted copies with all matches visually marked.
Extract Text with Layout Preservation from PDFs: control how extracted text preserves columns, tables, and paragraphs.
Convert Documents to Markdown with VLM OCR: convert scanned documents to searchable text.
Extract Structured Data from Unstructured Text: pull typed fields from search results.
Extract PII and Redact Sensitive Data: find and redact sensitive information.
