Class LayoutSearchEngine
Provides advanced, layout-aware search capabilities over PageElement instances and their TextElement children. Supports exact, regex, fuzzy, region-based, proximity, and block-level queries, as well as cross-page overloads. Each match is returned with its bounding box and the elements that contribute to it.
public sealed class LayoutSearchEngine
- Inheritance
object
LayoutSearchEngine
Examples
Example 1: Exact text search on a single page.
using LMKit.Document.Layout;
using LMKit.Document.Pdf;
using LMKit.Document.Search;
// Load a PDF and get the first page layout.
PdfInfo info = PdfInfo.Load("invoice.pdf");
PageElement page = info.Pages[0].GetLayout();
var engine = new LayoutSearchEngine();
List<TextMatch> matches = engine.FindText(page, "Total Due");
foreach (TextMatch match in matches)
{
Console.WriteLine($"Found \"{match.Text}\" at {match.BoundingBox}");
}
Example 2: Regex search across all pages.
using LMKit.Document.Layout;
using LMKit.Document.Pdf;
using LMKit.Document.Search;
using System.Text.RegularExpressions;
PdfInfo info = PdfInfo.Load("report.pdf");
List<PageElement> pages = info.Pages
.Select(p => p.GetLayout())
.ToList();
var engine = new LayoutSearchEngine();
var regexOpts = new RegexSearchOptions
{
RegexOptions = RegexOptions.IgnoreCase | RegexOptions.Compiled,
MaxResults = 200
};
List<TextMatch> matches = engine.FindRegex(pages, @"\d{4}-\d{2}-\d{2}", regexOpts);
foreach (TextMatch match in matches)
{
Console.WriteLine($"Page {match.PageIndex + 1}: date \"{match.Text}\"");
}
Example 3: Fuzzy search for an approximate term.
using LMKit.Document.Layout;
using LMKit.Document.Pdf;
using LMKit.Document.Search;
PdfInfo info = PdfInfo.Load("scanned_contract.pdf");
PageElement page = info.Pages[0].GetLayout();
var engine = new LayoutSearchEngine();
var fuzzyOpts = new FuzzySearchOptions
{
MaxEditDistance = 2,
MinScore = 0.7
};
List<TextMatch> matches = engine.FindFuzzy(page, "indemnification", fuzzyOpts);
foreach (TextMatch match in matches)
{
Console.WriteLine($"Score {match.Score:F2}: \"{match.Text}\"");
}
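The FindFuzzy summary in this reference says matching is based on Damerau–Levenshtein distance, which is what MaxEditDistance bounds. As a rough illustration of that metric only, here is a minimal sketch of the restricted (optimal string alignment) variant. The EditDistanceSketch class is hypothetical and not part of LMKit, and the library's own windowing, normalization, and scoring are not documented here and may differ.

```csharp
using System;

// Hypothetical helper for illustration; not part of LMKit.
public static class EditDistanceSketch
{
    // Restricted Damerau-Levenshtein (optimal string alignment) distance:
    // counts insertions, deletions, substitutions, and adjacent transpositions.
    public static int Distance(string a, string b)
    {
        int[,] d = new int[a.Length + 1, b.Length + 1];

        // Converting a prefix to or from the empty string costs one edit per character.
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;

                int best = Math.Min(
                    Math.Min(d[i - 1, j] + 1,    // deletion
                             d[i, j - 1] + 1),   // insertion
                    d[i - 1, j - 1] + cost);     // substitution or match

                // Adjacent transposition, e.g. "ca" <-> "ac".
                if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                {
                    best = Math.Min(best, d[i - 2, j - 2] + 1);
                }

                d[i, j] = best;
            }
        }

        return d[a.Length, b.Length];
    }
}
```

Under this sketch, an OCR slip such as "indemnifcation" (one dropped letter) or "indemnifiaction" (two swapped letters) is at distance 1 from "indemnification", so both would fall within a MaxEditDistance of 2.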
Example 4: Extract text between two anchors.
using LMKit.Document.Layout;
using LMKit.Document.Pdf;
using LMKit.Document.Search;
PdfInfo info = PdfInfo.Load("letter.pdf");
PageElement page = info.Pages[0].GetLayout();
var engine = new LayoutSearchEngine();
var betweenOpts = new BetweenOptions { Inclusive = false, MaxChars = 5000 };
List<TextMatch> spans = engine.FindBetween(page, "Dear", "Sincerely", betweenOpts);
foreach (TextMatch span in spans)
{
Console.WriteLine($"Letter body ({span.Text.Length} chars): {span.Text}");
}
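To make the effect of the Inclusive flag concrete, the sketch below applies the same idea to a plain string. It assumes the end anchor is searched for after the start anchor and uses ordinal comparison. BetweenSketch is a hypothetical helper, not an LMKit type, and it does not model MaxChars, normalization, or layout.

```csharp
using System;

// Hypothetical helper for illustration; not part of LMKit.
public static class BetweenSketch
{
    // Finds the text between the first occurrence of start and the first
    // occurrence of end that follows it. Returns false when either anchor
    // is missing; result is then the empty string.
    public static bool TryBetween(string text, string start, string end,
                                  bool inclusive, out string result)
    {
        result = string.Empty;

        int startIndex = text.IndexOf(start, StringComparison.Ordinal);
        if (startIndex < 0) return false;

        // Begin looking for the end anchor just past the start anchor.
        int afterStart = startIndex + start.Length;
        int endIndex = text.IndexOf(end, afterStart, StringComparison.Ordinal);
        if (endIndex < 0) return false;

        result = inclusive
            ? text.Substring(startIndex, endIndex + end.Length - startIndex) // anchors kept
            : text.Substring(afterStart, endIndex - afterStart);             // anchors dropped
        return true;
    }
}
```

For the input "Dear Ms. Lee, thank you. Sincerely, Bob" with anchors "Dear" and "Sincerely", the exclusive result is " Ms. Lee, thank you. " and the inclusive result is "Dear Ms. Lee, thank you. Sincerely".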
Constructors
- LayoutSearchEngine(LayoutSearchOptions)
Initializes a new instance of the LayoutSearchEngine class.
Methods
- FindBetween(PageElement, string, string, BetweenOptions)
Extracts the text located between the first occurrence of startQuery and the first occurrence of endQuery. Can optionally include the anchors and cross line/block boundaries (within the same page).
- FindBetween(IEnumerable<PageElement>, string, string, BetweenOptions)
Extracts text located between the first occurrences of startQuery and endQuery within each page, across multiple pages. This overload does not span page boundaries.
- FindFuzzy(PageElement, string, FuzzySearchOptions)
Performs token-aware fuzzy search using Damerau–Levenshtein distance over sliding windows of the page text. Useful when the source contains OCR noise or minor typos. Normalization (whitespace/diacritics/optional char-stripping) is applied to both the page text and the query.
- FindFuzzy(IEnumerable<PageElement>, string, FuzzySearchOptions)
Performs fuzzy search across multiple pages.
- FindInRegion(PageElement, Rectangle, RegionSearchOptions)
Returns text matches within a geometric region. You can choose intersection or containment semantics and whether to merge adjacent elements.
- FindInRegion(IEnumerable<PageElement>, Rectangle, RegionSearchOptions)
Returns text matches found within the same region across multiple pages. The region rectangle is applied to each page independently, in page-local coordinates.
- FindNear(PageElement, string, ProximityOptions)
Finds instances of query located within proximity of the specified anchor region.
- FindNear(IEnumerable<PageElement>, string, ProximityOptions)
Finds instances of query located within proximity of the specified anchor region across multiple pages. The same anchor region and radius are applied to each page independently (page-local coordinates).
- FindRegex(PageElement, string, RegexSearchOptions)
Finds regular expression matches within a page's text and returns layout-aware results. The regex runs over the normalized page text (options are applied to the text, not the pattern).
- FindRegex(IEnumerable<PageElement>, string, RegexSearchOptions)
Finds regular expression matches across multiple pages.
- FindText(PageElement, string, TextSearchOptions)
Finds exact (substring) matches of query within a page's text, honoring textOptions. Results include the matched text, a context snippet, the union bounding box, and contributing elements. Normalization (whitespace/diacritics/optional char-stripping) is applied to both the page text and the query.
- FindText(IEnumerable<PageElement>, string, TextSearchOptions)
Finds exact matches across multiple pages and annotates each result with its page index.
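The FindInRegion summary distinguishes intersection from containment semantics, and FindText mentions a union bounding box. The sketch below shows one common reading of those three terms for axis-aligned rectangles. Box is a hypothetical type written for this illustration (it needs C# 10 for record struct). LMKit's own Rectangle type and its exact edge-handling rules are not shown in this reference and may differ.

```csharp
using System;

// Hypothetical axis-aligned box for illustration; not LMKit's Rectangle type.
public readonly record struct Box(double Left, double Top, double Right, double Bottom)
{
    // Intersection semantics: the other box overlaps this one with positive area.
    public bool Intersects(Box other) =>
        Left < other.Right && other.Left < Right &&
        Top < other.Bottom && other.Top < Bottom;

    // Containment semantics: the other box lies entirely inside this one.
    public bool Contains(Box other) =>
        Left <= other.Left && other.Right <= Right &&
        Top <= other.Top && other.Bottom <= Bottom;

    // Smallest box covering both, i.e. a union bounding box of two elements.
    public Box Union(Box other) => new Box(
        Math.Min(Left, other.Left), Math.Min(Top, other.Top),
        Math.Max(Right, other.Right), Math.Max(Bottom, other.Bottom));
}
```

With a region of (0, 0, 100, 50), an element at (10, 10, 40, 20) satisfies both tests, while an element at (90, 10, 120, 20) straddles the right edge and so passes only the intersection test. Containment is therefore the stricter choice when partial words at the region border should be excluded.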