👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/document_processing/search_and_highlight
Search and Highlight for C# .NET Applications (PDFs and Images)
🎯 Purpose of the Demo
Search and Highlight demonstrates how to use LM-Kit.NET to find text inside PDF documents or images and produce a highlighted copy with every match visually marked.
The sample shows how to:
- Open a PDF or image file as an Attachment.
- Detect whether the document has extractable text (digital PDFs) or requires OCR (scanned documents or images).
- Choose between two OCR engines: Tesseract (traditional, no model) or PaddleOCR-VL (vision-language model with coordinate output).
- Run a search query in three modes: exact text, regex, or fuzzy (Damerau-Levenshtein distance).
- Call SearchHighlightEngine.Highlight to find matches and produce an annotated output file.
- Save the result as a highlighted PDF (with annotations) or PNG (with overlays) and auto-open it.
Why Search and Highlight?
- Instant visual feedback: see every match highlighted directly in the output document.
- Flexible search: choose between exact substring, regular expressions, or approximate matching depending on your use case.
- OCR transparency: scanned PDFs and images are automatically OCR-processed before search, with no manual preprocessing.
- No cloud dependency: all processing runs locally on your machine.
👥 Target Audience
- Legal and Compliance: search contracts and filings for specific clauses, terms, or regulatory keywords.
- Finance and Accounting: locate totals, tax references, or account numbers across invoices and statements.
- Quality Assurance: verify that specific labels, warnings, or identifiers appear in printed documentation.
- Research and Archival: find references across large collections of scanned documents.
🚀 Problem Solved
- Finding text in documents without reading every page: enter a query and get highlighted matches with page numbers instantly.
- Searching scanned or image-based documents: OCR is triggered automatically when no extractable text layer is found.
- Approximate matching: fuzzy search catches typos, OCR artifacts, and minor variations that exact search would miss.
- Producing annotated output: the highlighted copy can be shared, archived, or used as evidence in review workflows.
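To make the idea of approximate matching concrete, here is a minimal sketch of the optimal string alignment distance, the restricted form of Damerau-Levenshtein that counts insertions, deletions, substitutions, and swaps of adjacent characters. It is a standalone illustration, not LM-Kit's implementation; the function name OsaDistance is an assumption made for this example.

```csharp
using System;

// Optimal string alignment distance (restricted Damerau-Levenshtein).
// Illustrative only: this is not the code LM-Kit.NET uses internally.
int OsaDistance(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i; // delete all of a
    for (int j = 0; j <= b.Length; j++) d[0, j] = j; // insert all of b

    for (int i = 1; i <= a.Length; i++)
    {
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            int best = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1), // deletion, insertion
                d[i - 1, j - 1] + cost);                    // match or substitution

            // Swap of two adjacent characters counts as a single edit.
            if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                best = Math.Min(best, d[i - 2, j - 2] + 1);

            d[i, j] = best;
        }
    }
    return d[a.Length, b.Length];
}

Console.WriteLine(OsaDistance("total", "totla"));     // 1 (adjacent swap)
Console.WriteLine(OsaDistance("invoice", "lnvoice")); // 1 (OCR-style substitution)
```

A fuzzy matcher built on this would accept a candidate when its distance to the query stays under a chosen threshold, which is why short queries combined with a broad threshold tend to produce many matches.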
💻 Sample Application Description
Console app that:
- Prompts for a file path (PDF or image).
- Checks whether the file has extractable text via Attachment.HasText.
- If OCR is needed, lets you choose between Tesseract (option 0) or PaddleOCR-VL (option 1).
- Prompts for a search query and a search mode (Text, Regex, or Fuzzy).
- Calls SearchHighlightEngine.Highlight(...) with either:
  - The file path directly (for digital PDFs with native text).
  - An Attachment plus PageElement array (for images or scanned PDFs after OCR).
- Displays results: query, mode, pages scanned, total matches, and up to 10 match snippets with page numbers.
- Saves the highlighted output next to the original file (e.g. invoice_highlighted.pdf).
- Auto-opens the output file for immediate inspection.
- Loops until you type q to quit.
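The naming of the saved file can be sketched as follows. This is a hypothetical helper (BuildHighlightedPath is not part of LM-Kit.NET) that assumes the output extension is derived from the result's MIME type, as in the integration snippet further down, and that the file is written to the same folder as the input.

```csharp
using System;
using System.IO;

// Hypothetical helper: derives "<name>_highlighted.<ext>" next to the input file.
// The MIME type values mirror those documented for SearchHighlightResult.OutputMimeType.
string BuildHighlightedPath(string inputPath, string outputMimeType)
{
    string ext = outputMimeType == "application/pdf" ? ".pdf" : ".png";
    string dir = Path.GetDirectoryName(inputPath) ?? "";
    string name = Path.GetFileNameWithoutExtension(inputPath);
    return Path.Combine(dir, name + "_highlighted" + ext);
}

Console.WriteLine(BuildHighlightedPath("invoice.pdf", "application/pdf")); // invoice_highlighted.pdf
Console.WriteLine(BuildHighlightedPath("scan.jpg", "image/png"));          // scan_highlighted.png
```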
✨ Key Features
📄 Dual input support: digital PDFs (native text) and images/scanned PDFs (via OCR).
🔍 Three search modes:
- Text: exact case-insensitive substring match.
- Regex: full regular expression pattern matching.
- Fuzzy: approximate matching using Damerau-Levenshtein distance.
🤖 Automatic OCR detection: OCR triggers only when the document has no text layer.
🔧 Two OCR engines:
- Tesseract: traditional OCR, no model download required.
- PaddleOCR-VL: vision-language model with coordinate-aware extraction (~1 GB).
🖍️ Highlighted output: PDF annotations or PNG overlays marking every match.
📊 Result summary: match count, page numbers, text snippets, and elapsed time.
❌ Friendly errors: clear messages for invalid paths, OCR failures, and search errors.
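As a rough illustration of what the Text mode describes (an exact, case-insensitive substring match), the sketch below collects the offsets of every non-overlapping occurrence in a plain string. FindAll is a name invented for this example; the real engine works on document pages and returns match objects with bounds, which this sketch does not attempt to reproduce.

```csharp
using System;
using System.Collections.Generic;

// Illustrative stand-in for a case-insensitive substring search over plain text.
List<int> FindAll(string text, string query)
{
    var offsets = new List<int>();
    if (string.IsNullOrEmpty(query)) return offsets;

    int index = 0;
    while ((index = text.IndexOf(query, index, StringComparison.OrdinalIgnoreCase)) >= 0)
    {
        offsets.Add(index);
        index += query.Length; // skip past this match (non-overlapping)
    }
    return offsets;
}

var hits = FindAll("Total due. Subtotal and TOTAL tax.", "total");
Console.WriteLine(string.Join(", ", hits)); // 0, 14, 24
```

Note that a substring match also hits "Subtotal"; a regex with word boundaries is one way to exclude such cases.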
🛠️ Commands and Flow
Inside the console loop:
File input
- The app prompts: enter PDF or image file path (or 'q' to quit):
- Type a file path and press Enter.
OCR (if needed)
- If Attachment.HasText is false, the app shows an OCR engine selection menu.
- OCR runs page-by-page and collects PageElement arrays for the search engine.
Search
- Enter a search query.
- Choose a mode: 0 (Text), 1 (Regex), or 2 (Fuzzy).
- The engine scans all pages and returns matches.
Output
- Results are printed to the console.
- If matches were found, the highlighted file is saved and opened automatically.
Quit
- At any prompt, typing q exits the app cleanly.
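The mode prompt described above could be handled along these lines. This is a simplified sketch, not the sample's actual code: ParseModeInput is a hypothetical helper, and it returns plain strings rather than the library's SearchMode values so that it runs without the LM-Kit package.

```csharp
using System;

// Hypothetical helper: maps raw console input to a mode name.
// "0", "1", "2" follow the menu above; "q" means quit; anything else is invalid.
string ParseModeInput(string input)
{
    string value = (input ?? "").Trim();
    if (value.Equals("q", StringComparison.OrdinalIgnoreCase)) return "Quit";

    return value switch
    {
        "0" => "Text",
        "1" => "Regex",
        "2" => "Fuzzy",
        _ => "Invalid"
    };
}

Console.WriteLine(ParseModeInput("1"));  // Regex
Console.WriteLine(ParseModeInput(" q")); // Quit
Console.WriteLine(ParseModeInput("9"));  // Invalid
```

In a real loop, an "Invalid" result would typically re-display the prompt instead of aborting.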
🗣️ Example Use Cases
Try the sample with:
- A digital PDF invoice to find all occurrences of "total", "VAT", or an account number.
- A scanned contract to search for party names or specific clauses with fuzzy matching.
- A photograph of a receipt to locate line items using regex patterns like \d+\.\d{2}.
- A multi-page report to search for a keyword and see which pages contain it.
After each run, inspect:
- The highlighted PDF or PNG to visually confirm match positions.
- The console output for match counts and page distribution.
- The elapsed time to gauge performance on your hardware.
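Before running a regex query through the demo, it can help to try the pattern on plain text. The sketch below applies the \d+\.\d{2} pattern from the receipt use case to a made-up receipt string using the standard .NET Regex class; FindAmounts and the sample text are assumptions for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Collects every substring that looks like an amount with two decimals.
List<string> FindAmounts(string text)
{
    var amounts = new List<string>();
    foreach (Match m in Regex.Matches(text, @"\d+\.\d{2}"))
        amounts.Add(m.Value);
    return amounts;
}

// Made-up receipt text; "Tip 1" has no decimals and is not matched.
string receipt = "Coffee 3.50\nBagel 2.25\nTip 1\nTotal 5.75";
Console.WriteLine(string.Join(" | ", FindAmounts(receipt))); // 3.50 | 2.25 | 5.75
```

The pattern is unanchored, so a value such as 12.345 would still yield 12.34; adding \b on both sides is one way to tighten it.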
💻 Minimal Integration Snippet
using System;
using System.IO;
using LMKit.Data;
using LMKit.Document.Search;
// Search a digital PDF (native text layer)
var options = new SearchHighlightOptions
{
SearchMode = SearchMode.Text
};
SearchHighlightResult result = SearchHighlightEngine.Highlight(
"invoice.pdf", "total", options);
Console.WriteLine($"Matches: {result.TotalMatches}");
// Save the highlighted output
string ext = result.OutputMimeType == "application/pdf" ? ".pdf" : ".png";
File.WriteAllBytes($"invoice_highlighted{ext}", result.OutputData);
For scanned documents or images that require OCR:
using System.IO;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Document.Search;
using LMKit.Integrations.Tesseract;
var attachment = new Attachment("scanned_invoice.png");
// Run OCR to extract text with coordinates
using var ocr = new TesseractOcr();
var pageElements = new PageElement[attachment.PageCount];
for (int i = 0; i < attachment.PageCount; i++)
{
var ocrResult = await ocr.RunAsync(attachment, i);
pageElements[i] = ocrResult.PageElement;
}
// Search and highlight using OCR results
var options = new SearchHighlightOptions
{
SearchMode = SearchMode.Fuzzy
};
SearchHighlightResult result = SearchHighlightEngine.Highlight(
attachment, "total", options, pageElements);
File.WriteAllBytes("scanned_invoice_highlighted.png", result.OutputData);
🛠️ Getting Started
📋 Prerequisites
- .NET 8.0 or later
- For PaddleOCR-VL: ~1 GB VRAM (model downloaded automatically)
- For Tesseract: no additional requirements (data files downloaded automatically)
📥 Download
git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/document_processing/search_and_highlight
Project Link: search_and_highlight (same path as above)
▶️ Run
dotnet build
dotnet run
Then:
- Type the path to a PDF or image file (or q to quit).
- If OCR is needed, select an OCR engine: 0 (Tesseract) or 1 (PaddleOCR-VL).
- Enter a search query.
- Choose a search mode: 0 (Text), 1 (Regex), or 2 (Fuzzy).
- Inspect the highlighted output file that opens automatically.
- Press Enter to process another file, or q to exit.
🔧 Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| "No extractable text found" on a digital PDF | The PDF may contain image-only pages | Choose an OCR engine when prompted |
| Fuzzy search returns too many matches | Default distance threshold is broad | Combine fuzzy search with a more specific query |
| PaddleOCR-VL download is slow | First run downloads ~1 GB model | Wait for completion; subsequent runs use the cached model |
| "Error: Unable to open" message | Invalid file path or unsupported format | Check the path and ensure the file is a PDF, PNG, JPG, BMP, TIFF, or WebP |
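To catch the unsupported-format case before opening a file, a pre-check on the extension is one option. The sketch below is a hypothetical helper, not part of the sample; the extension list follows the formats named in the table, and the .jpeg and .tif spellings are added as assumptions.

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical pre-check: true when the file extension is one of the listed formats.
// ".jpeg" and ".tif" are assumed aliases; the table only names JPG and TIFF.
bool HasSupportedExtension(string path)
{
    string[] supported = { ".pdf", ".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff", ".webp" };
    string ext = Path.GetExtension(path);
    return supported.Contains(ext, StringComparer.OrdinalIgnoreCase);
}

Console.WriteLine(HasSupportedExtension("invoice.pdf")); // True
Console.WriteLine(HasSupportedExtension("scan.TIFF"));   // True
Console.WriteLine(HasSupportedExtension("notes.docx"));  // False
```

An extension check only filters obvious mistakes; a file with a valid extension can still be corrupt or unreadable, so the error handling around opening it remains necessary.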
🔧 Extend the Demo
- Restrict the search to specific pages using SearchHighlightOptions.PageRange.
- Customize highlight appearance (color, opacity) via SearchHighlightOptions.
- Combine with LM-Kit's Structured Extraction to extract field values from matched regions.
- Build a batch pipeline that searches across a folder of documents.
- Export match metadata (page numbers, coordinates, text) to JSON or CSV for downstream processing.
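For the metadata export idea, a minimal CSV writer might look like the sketch below. It does not use LM-Kit types: each match is represented by a plain tuple whose fields (Page, Text, X, Y) are stand-ins chosen for this example, so a real pipeline would first copy the relevant values out of the library's match objects.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Quotes a field when it contains a comma, quote, or line break (RFC 4180 style).
string EscapeCsv(string field)
{
    bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) >= 0;
    return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
}

// Serializes hypothetical match tuples to CSV with a header row.
string ToCsv(List<(int Page, string Text, double X, double Y)> matches)
{
    var sb = new StringBuilder("page,text,x,y\n");
    foreach (var m in matches)
    {
        sb.Append(m.Page).Append(',')
          .Append(EscapeCsv(m.Text)).Append(',')
          .Append(m.X.ToString(CultureInfo.InvariantCulture)).Append(',')
          .Append(m.Y.ToString(CultureInfo.InvariantCulture)).Append('\n');
    }
    return sb.ToString();
}

var sample = new List<(int Page, string Text, double X, double Y)>
{
    (0, "Total, due", 10.5, 20.0),
    (1, "VAT", 42.25, 7.0)
};
Console.Write(ToCsv(sample));
```

Formatting numbers with the invariant culture keeps the decimal separator as a dot, which avoids clashing with the comma delimiter on machines configured for other locales.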
🔍 Notes on Key Types
- SearchHighlightEngine (LMKit.Document.Search): static API with Highlight(...) overloads.
  - File path overload: extracts text from the native PDF text layer.
  - Attachment + PageElement overload: uses pre-computed OCR results.
- SearchHighlightOptions: configures the search behavior.
  - SearchMode: Text, Regex, or Fuzzy.
  - MaxResults: maximum matches to return (default: no limit).
- SearchHighlightResult: the output of a search-and-highlight operation.
  - OutputData: byte array of the highlighted PDF or PNG.
  - OutputMimeType: "application/pdf" or "image/png".
  - Matches: list of SearchMatch instances with page index, text, snippet, and bounds.
  - TotalMatches, ScannedPages, PageCount: summary statistics.
- TesseractOcr (LMKit.Integrations.Tesseract): traditional OCR engine.
  - No model download required; data files are fetched automatically.
  - RunAsync(Attachment, pageIndex) returns an OcrResult with a PageElement.
- VlmOcr (LMKit.Extraction.Ocr): vision-language model OCR engine.
  - Requires a loaded LM model (e.g. paddleocr-vl:0.9b).
  - Use VlmOcrIntent.OcrWithCoordinates for coordinate-aware extraction.
📚 Additional Resources
- SearchHighlightEngine API Reference
- SearchHighlightOptions API Reference
- TesseractOcr API Reference
- VlmOcr API Reference
- How-To: Search and Locate Content Within Documents
📚 Related Content
- How-To: Search and Locate Content Within Documents: Step-by-step guide covering document search, including SearchHighlightEngine.
- How-To: Process Scanned Documents with OCR and Vision Models: Covers OCR preprocessing before search and extraction.
- VLM OCR with Coordinates Demo: Coordinate-aware OCR with bounding box visualization.
- Glossary: Optical Character Recognition: Covers OCR concepts including traditional and VLM-based approaches.