Search and Locate Content Within Documents
Enterprise document workflows often require finding specific content inside large documents: matching invoice numbers, locating regulatory clauses, or finding all monetary amounts on a page. LM-Kit.NET's LayoutSearchEngine performs layout-aware searches that go beyond plain text matching. It supports exact text search, regular expressions, fuzzy matching, spatial region queries, proximity search, and range extraction between anchors. All searches return bounding-box coordinates alongside matched text, enabling downstream tasks like redaction, annotation, and visual highlighting.
Why Layout-Aware Search Matters
Two enterprise problems that layout-aware document search solves:
- Regulatory compliance field extraction. Auditors need to locate specific clauses, dates, and amounts scattered across hundreds of pages in contracts or regulatory filings. A layout-aware search returns not just the text but its exact position on the page, enabling automated annotation and evidence collection.
- Automated form field location. Insurance claims, tax forms, and government applications have labeled fields in fixed positions. Spatial region search finds values near known labels, even when the document layout varies slightly between versions.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 4 GB |
| Input formats | PDF (with text layer), DOCX, images (with OCR) |
No GPU is required for search operations. A GPU is needed only if you also run OCR or VLM processing.
Step 1: Create the Project
dotnet new console -n DocumentSearch
cd DocumentSearch
dotnet add package LM-Kit.NET
Step 2: Understand the Search Capabilities
┌────────────────────────────────────┐
│         LayoutSearchEngine         │
├────────────────────────────────────┤
│  FindText()      Exact substring   │
│  FindRegex()     Pattern matching  │
│  FindFuzzy()     Approximate match │
│  FindInRegion()  Spatial area      │
│  FindNear()      Proximity search  │
│  FindBetween()   Range extraction  │
└────────────────┬───────────────────┘
                 │
                 ▼
          List<TextMatch>
          ├── Text      (matched content)
          ├── Snippet   (surrounding context)
          ├── Bounds    (bounding box)
          ├── Score     (relevance 0..1)
          └── PageIndex (page location)
All search methods operate on PageElement objects, each of which represents a single page with positioned text elements. Every method also has a cross-page overload that accepts IEnumerable<PageElement>.
Step 3: Load a Document and Extract Page Layout
using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Document.Search;
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load the PDF and extract page elements
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "contract.pdf";
if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}
var attachment = new Attachment(pdfPath);
int pageCount = attachment.PageCount;
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({pageCount} pages)\n");
// Get page elements (layout-aware) for each page
IReadOnlyList<PageElement> pages = attachment.PageElements;
// ──────────────────────────────────────
// 2. Create the search engine
// ──────────────────────────────────────
var search = new LayoutSearchEngine();
Step 4: Exact Text Search
Find all occurrences of a specific string across the document:
using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Document.Search;
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load the PDF and extract page elements
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "contract.pdf";
if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}
var attachment = new Attachment(pdfPath);
int pageCount = attachment.PageCount;
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({pageCount} pages)\n");
// Get page elements (layout-aware) for each page
IReadOnlyList<PageElement> pages = attachment.PageElements;
// ──────────────────────────────────────
// 2. Create the search engine
// ──────────────────────────────────────
var search = new LayoutSearchEngine();
Console.WriteLine("=== Exact Text Search ===\n");
List<TextMatch> matches = search.FindText(pages, "payment due");
Console.WriteLine($"Found {matches.Count} match(es) for \"payment due\":\n");
foreach (TextMatch match in matches)
{
    Console.WriteLine($" Page {match.PageIndex + 1}: \"{match.Text}\"");
    Console.WriteLine($" Context: ...{match.Snippet}...");
    Console.WriteLine($" Position: ({match.Bounds.TopLeft.X:F0}, {match.Bounds.TopLeft.Y:F0})");
    Console.WriteLine();
}
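To make the idea of an exact substring search with a case-sensitivity switch concrete, here is a minimal plain-.NET sketch. It is not LM-Kit.NET's implementation and uses no LM-Kit types: the helper `AllIndexesOf` and the sample sentence are hypothetical, and the sketch returns character offsets rather than bounding boxes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical page text; in a real document this would come from the extracted layout.
string page = "Payment due on receipt. Late payment due fees apply. PAYMENT DUE: $50.";
string query = "payment due";

// Case-insensitive search finds every spelling variant of the phrase.
foreach (int offset in AllIndexesOf(page, query, StringComparison.OrdinalIgnoreCase))
    Console.WriteLine($"offset {offset}: \"{page.Substring(offset, query.Length)}\"");

// Returns the start offset of every non-overlapping occurrence of needle in haystack.
static List<int> AllIndexesOf(string haystack, string needle, StringComparison comparison)
{
    var hits = new List<int>();
    if (string.IsNullOrEmpty(needle)) return hits;

    int i = 0;
    while ((i = haystack.IndexOf(needle, i, comparison)) >= 0)
    {
        hits.Add(i);
        i += needle.Length; // resume after this hit
    }
    return hits;
}
```

Switching the comparison to `StringComparison.Ordinal` makes the same helper case-sensitive, which is the trade-off a case-sensitivity option controls.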
Step 5: Regex Pattern Search
Find all monetary amounts, dates, or other patterns:
using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Document.Search;
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load the PDF and extract page elements
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "contract.pdf";
if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}
var attachment = new Attachment(pdfPath);
int pageCount = attachment.PageCount;
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({pageCount} pages)\n");
// Get page elements (layout-aware) for each page
IReadOnlyList<PageElement> pages = attachment.PageElements;
// ──────────────────────────────────────
// 2. Create the search engine
// ──────────────────────────────────────
var search = new LayoutSearchEngine();
Console.WriteLine("=== Regex Search: Monetary Amounts ===\n");
// Find dollar amounts like $1,234.56
List<TextMatch> amounts = search.FindRegex(pages, @"\$[\d,]+\.\d{2}");
Console.WriteLine($"Found {amounts.Count} monetary amount(s):\n");
foreach (TextMatch match in amounts)
{
    Console.WriteLine($" Page {match.PageIndex + 1}: {match.Text}");
    Console.WriteLine($" Context: ...{match.Snippet}...");
}
Console.WriteLine();
// Find dates in MM/DD/YYYY format
List<TextMatch> dates = search.FindRegex(pages, @"\d{2}/\d{2}/\d{4}");
Console.WriteLine($"Found {dates.Count} date(s):\n");
foreach (TextMatch match in dates)
{
    Console.WriteLine($" Page {match.PageIndex + 1}: {match.Text}");
}
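Before running a pattern against a whole document, it can help to check it against a known sample line with the standard .NET `Regex` class. The sketch below reuses the two patterns from this step; the sample string and the `FindAll` helper are hypothetical, and no LM-Kit types are involved.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical sample line; replace it with text from your own documents.
string sample = "Invoice total $1,234.56 due 03/15/2025; deposit $200.00 paid 01/02/2025.";

// Same patterns as in the FindRegex calls of this step.
string[] amounts = FindAll(sample, @"\$[\d,]+\.\d{2}");
string[] dates = FindAll(sample, @"\d{2}/\d{2}/\d{4}");

Console.WriteLine($"Amounts: {string.Join(", ", amounts)}");
Console.WriteLine($"Dates:   {string.Join(", ", dates)}");

// Returns the text of every match of the pattern, in document order.
static string[] FindAll(string input, string pattern) =>
    Regex.Matches(input, pattern).Select(m => m.Value).ToArray();
```

Note that the date pattern only checks the shape of the text; it does not validate that the month or day is in range.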
Step 6: Fuzzy Matching
Find text even when it contains typos or OCR errors:
using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Document.Search;
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load the PDF and extract page elements
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "contract.pdf";
if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}
var attachment = new Attachment(pdfPath);
int pageCount = attachment.PageCount;
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({pageCount} pages)\n");
// Get page elements (layout-aware) for each page
IReadOnlyList<PageElement> pages = attachment.PageElements;
// ──────────────────────────────────────
// 2. Create the search engine
// ──────────────────────────────────────
var search = new LayoutSearchEngine();
Console.WriteLine("\n=== Fuzzy Search ===\n");
var fuzzyOptions = new FuzzySearchOptions
{
    MaxDistance = 2 // allow up to 2 character edits (Damerau-Levenshtein)
};
// Query with the correct spelling so that misspelled or misread variants are matched
List<TextMatch> fuzzyMatches = search.FindFuzzy(pages, "receivable", fuzzyOptions);
Console.WriteLine($"Fuzzy matches for \"receivable\" (max distance 2):\n");
foreach (TextMatch match in fuzzyMatches)
{
    Console.WriteLine($" Page {match.PageIndex + 1}: \"{match.Text}\" (score: {match.Score:F2})");
}
Fuzzy search is particularly useful for scanned documents where OCR may introduce character-level errors. A search for "receivable" with max distance 2 will match "recievable", "receivble", and other common OCR misreads.
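To see why "recievable" and "receivble" both fall within a small edit distance of "receivable", here is a minimal sketch of the optimal string alignment variant of Damerau-Levenshtein distance. It is an illustration only: the function name `OsaDistance` is hypothetical, and whether LM-Kit.NET uses this exact variant internally is not something the text above confirms.

```csharp
using System;

Console.WriteLine(OsaDistance("receivable", "recievable"));   // 1: swapped "ei" is one transposition
Console.WriteLine(OsaDistance("receivable", "receivble"));    // 1: one deleted character
Console.WriteLine(OsaDistance("receivable", "payable") <= 2); // False: the length difference alone is 3

// Optimal string alignment distance: insertions, deletions, substitutions,
// and swaps of two adjacent characters each cost 1.
static int OsaDistance(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i; // delete all of a's prefix
    for (int j = 0; j <= b.Length; j++) d[0, j] = j; // insert all of b's prefix

    for (int i = 1; i <= a.Length; i++)
    {
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            int best = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), // deletion, insertion
                d[i - 1, j - 1] + cost);                    // match or substitution

            // Adjacent transposition, e.g. "ei" <-> "ie".
            if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                best = Math.Min(best, d[i - 2, j - 2] + 1);

            d[i, j] = best;
        }
    }
    return d[a.Length, b.Length];
}
```

Because a transposition counts as a single edit, a max distance of 1 already covers the most common swapped-letter typos; a larger value trades precision for recall.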
Step 7: Spatial Region Search
Find all text within a specific rectangular area of a page:
using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Document.Search;
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load the PDF and extract page elements
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "contract.pdf";
if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}
var attachment = new Attachment(pdfPath);
int pageCount = attachment.PageCount;
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({pageCount} pages)\n");
// Get page elements (layout-aware) for each page
IReadOnlyList<PageElement> pages = attachment.PageElements;
// ──────────────────────────────────────
// 2. Create the search engine
// ──────────────────────────────────────
var search = new LayoutSearchEngine();
Console.WriteLine("\n=== Region Search: Top-Right Header ===\n");
// Define a region in the top-right corner of page 1 (in points)
// PDF coordinates: origin at bottom-left, 72 points per inch
PageElement firstPage = pages[0];
var headerRegion = new LMKit.Graphics.Geometry.Rectangle(
    x: firstPage.Width * 0.5,       // right half
    y: firstPage.Height * 0.85,     // top 15%
    width: firstPage.Width * 0.5,
    height: firstPage.Height * 0.15
);
List<TextMatch> regionMatches = search.FindInRegion(firstPage, headerRegion);
Console.WriteLine($"Text in header region ({regionMatches.Count} match(es)):\n");
foreach (TextMatch match in regionMatches)
{
    Console.WriteLine($" \"{match.Text}\"");
}
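Regions are often measured from the top-left corner (as in image viewers), while the code above assumes a bottom-left origin. The following sketch shows the conversion with plain tuples. It assumes that a rectangle's (x, y) denotes its lower-left corner in the bottom-left system, which matches the arithmetic in the example above but is not confirmed here as LM-Kit.NET's convention; `ToBottomLeftOrigin` is a hypothetical helper.

```csharp
using System;

// US Letter page in points: 8.5 x 11 inches at 72 points per inch.
double pageWidth = 8.5 * 72;
double pageHeight = 11 * 72;

// Region measured from the top-left corner: right half, top 15% of the page.
var topLeftRect = (X: pageWidth * 0.5, Y: 0.0, Width: pageWidth * 0.5, Height: pageHeight * 0.15);

var pdfRect = ToBottomLeftOrigin(topLeftRect, pageHeight);
Console.WriteLine($"x={pdfRect.X:F1}, y={pdfRect.Y:F1}, w={pdfRect.Width:F1}, h={pdfRect.Height:F1}");

// Flips the vertical axis: the rectangle's lower edge lies (Y + Height) below the
// top of the page, so its height above the bottom is fullHeight - (Y + Height).
static (double X, double Y, double Width, double Height) ToBottomLeftOrigin(
    (double X, double Y, double Width, double Height) rect, double fullHeight)
{
    double y = fullHeight - (rect.Y + rect.Height);
    return (rect.X, y, rect.Width, rect.Height);
}
```

For this page the converted y equals 85% of the page height, the same value the header region above uses.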
Step 8: Proximity Search
Find text that appears near other text on the page:
using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Document.Search;
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load the PDF and extract page elements
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "contract.pdf";
if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}
var attachment = new Attachment(pdfPath);
int pageCount = attachment.PageCount;
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({pageCount} pages)\n");
// Get page elements (layout-aware) for each page
IReadOnlyList<PageElement> pages = attachment.PageElements;
// ──────────────────────────────────────
// 2. Create the search engine
// ──────────────────────────────────────
var search = new LayoutSearchEngine();
Console.WriteLine("\n=== Proximity Search ===\n");
// Find text near the label "Total Due"
var proximityOptions = new ProximityOptions
{
    Anchor = "Total Due",
    MaxDistance = 100 // within 100 points of the anchor
};
List<TextMatch> nearTotal = search.FindNear(pages, "Total Due", proximityOptions);
Console.WriteLine($"Text near \"Total Due\":\n");
foreach (TextMatch match in nearTotal)
{
    // Score is a relevance value (0..1), not a distance in points
    Console.WriteLine($" Page {match.PageIndex + 1}: \"{match.Text}\" (score: {match.Score:F2})");
}
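A distance threshold in points only makes sense once "distance between two pieces of text" is defined. One plausible definition is the edge-to-edge gap between their bounding boxes, sketched below with plain tuples. This is an assumption for illustration: the text above does not say how LM-Kit.NET measures MaxDistance, and the box coordinates and the `Gap` helper are hypothetical.

```csharp
using System;

// Hypothetical bounding boxes in points: a "Total Due" label and two candidate values.
var label = (X: 72.0, Y: 500.0, Width: 60.0, Height: 12.0);
var sameRow = (X: 150.0, Y: 500.0, Width: 50.0, Height: 12.0);
var farAway = (X: 400.0, Y: 100.0, Width: 50.0, Height: 12.0);

Console.WriteLine(Gap(label, sameRow));        // 18: the label ends at x=132, the value starts at x=150
Console.WriteLine(Gap(label, farAway) <= 100); // False: far outside a 100-point radius

// Shortest edge-to-edge distance between two axis-aligned rectangles (0 if they overlap).
static double Gap((double X, double Y, double Width, double Height) a,
                  (double X, double Y, double Width, double Height) b)
{
    // Horizontal and vertical separation; negative values mean overlap on that axis.
    double dx = Math.Max(0, Math.Max(a.X - (b.X + b.Width), b.X - (a.X + a.Width)));
    double dy = Math.Max(0, Math.Max(a.Y - (b.Y + b.Height), b.Y - (a.Y + a.Height)));
    return Math.Sqrt(dx * dx + dy * dy);
}
```

An edge-to-edge metric treats a long label and a short value fairly; a center-to-center metric would penalize wide boxes even when they sit side by side.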
Step 9: Range Extraction Between Anchors
Extract all text that appears between two markers:
using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Document.Search;
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load the PDF and extract page elements
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "contract.pdf";
if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}
var attachment = new Attachment(pdfPath);
int pageCount = attachment.PageCount;
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({pageCount} pages)\n");
// Get page elements (layout-aware) for each page
IReadOnlyList<PageElement> pages = attachment.PageElements;
// ──────────────────────────────────────
// 2. Create the search engine
// ──────────────────────────────────────
var search = new LayoutSearchEngine();
Console.WriteLine("\n=== Between Search: Extract Clause Content ===\n");
// Extract text between "TERMS AND CONDITIONS" and "SIGNATURES"
List<TextMatch> clauseContent = search.FindBetween(
    pages,
    startQuery: "TERMS AND CONDITIONS",
    endQuery: "SIGNATURES");
Console.WriteLine($"Content between markers ({clauseContent.Count} segment(s)):\n");
foreach (TextMatch match in clauseContent)
{
    Console.WriteLine($" Page {match.PageIndex + 1}:");
    Console.WriteLine($" {Truncate(match.Text, 120)}");
    Console.WriteLine();
}
// Collapses line breaks and shortens long text for console display
static string Truncate(string text, int max)
{
    if (string.IsNullOrEmpty(text)) return "";
    string cleaned = text.Replace("\n", " ").Replace("\r", "");
    return cleaned.Length <= max ? cleaned : cleaned[..max] + "...";
}
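The anchor-based extraction idea can be sketched on a plain string: find the start marker, then find the end marker after it, and return what lies in between. This is a simplified illustration with a hypothetical `Between` helper and sample text; it ignores layout, page breaks, and bounding boxes, all of which the layout-aware search handles.

```csharp
using System;

// Hypothetical flattened document text.
string text = "PREAMBLE ... TERMS AND CONDITIONS Payment is due within 30 days. SIGNATURES John Doe";

string? clause = Between(text, "TERMS AND CONDITIONS", "SIGNATURES");
Console.WriteLine(clause ?? "(markers not found)");

// Returns the trimmed text between the first start marker and the next end marker,
// or null when either marker is missing.
static string? Between(string source, string start, string end)
{
    int s = source.IndexOf(start, StringComparison.OrdinalIgnoreCase);
    if (s < 0) return null;

    int from = s + start.Length;
    int e = source.IndexOf(end, from, StringComparison.OrdinalIgnoreCase);
    if (e < 0) return null;

    return source[from..e].Trim();
}
```

Searching for the end marker only after the start marker avoids returning an empty or reversed range when the end text also appears earlier in the document.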
Step 10: Building a Complete Document Audit Tool
Combine multiple search types into a document audit pipeline:
using System.Text;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Document.Search;
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load the PDF and extract page elements
// ──────────────────────────────────────
string pdfPath = args.Length > 0 ? args[0] : "contract.pdf";
if (!File.Exists(pdfPath))
{
    Console.WriteLine($"File not found: {pdfPath}");
    Console.WriteLine("Usage: dotnet run -- <path-to-pdf>");
    return;
}
var attachment = new Attachment(pdfPath);
int pageCount = attachment.PageCount;
Console.WriteLine($"Loaded: {Path.GetFileName(pdfPath)} ({pageCount} pages)\n");
// Get page elements (layout-aware) for each page
IReadOnlyList<PageElement> pages = attachment.PageElements;
// ──────────────────────────────────────
// 2. Create the search engine
// ──────────────────────────────────────
var search = new LayoutSearchEngine();
Console.WriteLine("=== Document Audit Report ===\n");
// 1. Find all email addresses
var emails = search.FindRegex(pages, @"[\w.+-]+@[\w-]+\.[\w.-]+");
Console.WriteLine($"Email addresses found: {emails.Count}");
foreach (var e in emails)
    Console.WriteLine($" p.{e.PageIndex + 1}: {e.Text}");
// 2. Find all phone numbers
var phones = search.FindRegex(pages, @"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}");
Console.WriteLine($"\nPhone numbers found: {phones.Count}");
foreach (var p in phones)
    Console.WriteLine($" p.{p.PageIndex + 1}: {p.Text}");
// 3. Find all dates
var allDates = search.FindRegex(pages, @"\b\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4}\b");
Console.WriteLine($"\nDates found: {allDates.Count}");
foreach (var d in allDates)
    Console.WriteLine($" p.{d.PageIndex + 1}: {d.Text}");
// 4. Find all monetary values
var money = search.FindRegex(pages, @"[\$\€\£][\d,]+\.?\d*");
Console.WriteLine($"\nMonetary values found: {money.Count}");
foreach (var m in money)
    Console.WriteLine($" p.{m.PageIndex + 1}: {m.Text}");
Console.WriteLine("\n=== Audit Complete ===");
Search Method Reference
| Method | Purpose | Key Options |
|---|---|---|
| FindText | Exact substring match | Case sensitivity |
| FindRegex | Regular expression pattern | Standard .NET regex syntax |
| FindFuzzy | Approximate match (typo-tolerant) | MaxDistance (edit distance) |
| FindInRegion | Text within a spatial rectangle | Rectangle coordinates (points) |
| FindNear | Text near an anchor string | Anchor, MaxDistance (points) |
| FindBetween | Text between two markers | startQuery, endQuery |
All methods return List<TextMatch> with Text, Snippet, Bounds, Score, and PageIndex.
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| No matches on scanned PDF | PDF has no text layer | Use VlmOcr or TesseractOcr to extract text first |
| Regex finds partial matches | Pattern too broad | Add word boundaries (\b) to patterns |
| Fuzzy search returns too many results | MaxDistance too high | Reduce to 1 or 2 for tighter matching |
| Region search misses text | Coordinate system mismatch | Express the rectangle in PDF points (72 per inch), measured from the bottom-left origin |
| Between search misses content that continues on another page | Text crosses a page break | Use the cross-page overload with IEnumerable<PageElement> |
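The effect of word boundaries on partial matches is easy to check with the standard .NET `Regex` class. In this sketch the sample line and the `Values` helper are hypothetical; the date pattern is the one used in the audit step, with and without `\b`.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical text: a reference number that merely looks like a date, plus a real date.
string line = "Ref 2024/11/052 and 3/15/2025";

string loose = @"\d{1,2}/\d{1,2}/\d{2,4}";
string strict = @"\b\d{1,2}/\d{1,2}/\d{2,4}\b";

// The loose pattern also matches "24/11/052" inside the reference number.
Console.WriteLine("Loose:  " + string.Join(" | ", Values(line, loose)));
// The strict pattern only matches the standalone date "3/15/2025".
Console.WriteLine("Strict: " + string.Join(" | ", Values(line, strict)));

// Returns the text of every match of the pattern, in order.
static string[] Values(string input, string pattern) =>
    Regex.Matches(input, pattern).Select(m => m.Value).ToArray();
```

The loose pattern can start in the middle of "2024" because nothing forces the match to begin at the edge of a token; `\b` adds that constraint at both ends.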
Next Steps
- Search and Highlight Demo: interactive console app that searches PDFs and images, then produces highlighted copies with all matches visually marked.
- Extract Text with Layout Preservation from PDFs: control how extracted text preserves columns, tables, and paragraphs.
- Convert Documents to Markdown with VLM OCR: convert scanned documents to searchable text.
- Extract Structured Data from Unstructured Text: pull typed fields from search results.
- Extract PII and Redact Sensitive Data: find and redact sensitive information.