👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/document-intelligence/document-search/search_and_highlight

Search and Highlight for C# .NET Applications (PDFs and Images)

🎯 Purpose of the Demo

Search and Highlight demonstrates how to use LM-Kit.NET to find text inside PDF documents or images and produce a highlighted copy with every match visually marked.

The sample shows how to:

Open a PDF or image file as an Attachment.
Detect whether the document has extractable text (digital PDFs) or requires OCR (scanned documents or images).
Choose between two OCR engines: LM-Kit OCR (traditional, no model) or PaddleOCR-VL (vision-language model with coordinate output).
Run a search query in three modes: exact text, regex, or fuzzy (Damerau-Levenshtein distance).
Call SearchHighlightEngine.Highlight to find matches and produce an annotated output file.
Save the result as a highlighted PDF (with annotations) or PNG (with overlays) and auto-open it.

Why Search and Highlight?

Instant visual feedback: see every match highlighted directly in the output document.
Flexible search: choose between exact substring, regular expressions, or approximate matching depending on your use case.
Transparent OCR: scanned PDFs and images go through OCR before the search, so no separate preprocessing step is needed.
No cloud dependency: all processing runs locally on your machine.

👥 Target Audience

Legal and Compliance: search contracts and filings for specific clauses, terms, or regulatory keywords.
Finance and Accounting: locate totals, tax references, or account numbers across invoices and statements.
Quality Assurance: verify that specific labels, warnings, or identifiers appear in printed documentation.
Research and Archival: find references across large collections of scanned documents.

🚀 Problem Solved

Finding text in documents without reading every page: enter a query and get highlighted matches with their page numbers.
Searching scanned or image-based documents: OCR is triggered automatically when no extractable text layer is found.
Approximate matching: fuzzy search catches typos, OCR artifacts, and minor variations that exact search would miss.
Producing annotated output: the highlighted copy can be shared, archived, or used as evidence in review workflows.

💻 Sample Application Description

Console app that:

Prompts for a file path (PDF or image).
Checks whether the file has extractable text via Attachment.HasText.
If OCR is needed, lets you choose between LM-Kit OCR (option 0) or PaddleOCR-VL (option 1).
Prompts for a search query and a search mode (Text, Regex, or Fuzzy).
Calls SearchHighlightEngine.Highlight(...) with either:
- The file path directly (for digital PDFs with native text).
- An Attachment plus PageElement array (for images or scanned PDFs after OCR).
Displays results: query, mode, pages scanned, total matches, and up to 10 match snippets with page numbers.
Saves the highlighted output next to the original file (e.g. invoice_highlighted.pdf).
Auto-opens the output file for immediate inspection.
Loops until you type q to quit.
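
The save and auto-open steps can be sketched with standard .NET APIs. This is a minimal sketch, not the sample's actual code: OutputNaming, BuildOutputPath, and OpenWithDefaultApp are hypothetical names, and the _highlighted suffix simply follows the invoice_highlighted.pdf example above.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Demo: derive the output path for a PDF result (nothing is written or opened here).
Console.WriteLine(OutputNaming.BuildOutputPath("invoice.pdf", "application/pdf"));

// Hypothetical helper, not part of LM-Kit.NET.
static class OutputNaming
{
    // Places "<name>_highlighted.<ext>" in the same folder as the input file.
    public static string BuildOutputPath(string inputPath, string outputMimeType)
    {
        string ext = outputMimeType == "application/pdf" ? ".pdf" : ".png";
        string folder = Path.GetDirectoryName(Path.GetFullPath(inputPath)) ?? ".";
        string name = Path.GetFileNameWithoutExtension(inputPath) + "_highlighted" + ext;
        return Path.Combine(folder, name);
    }

    // Asks the operating system shell to open the file with its default viewer.
    public static void OpenWithDefaultApp(string path) =>
        Process.Start(new ProcessStartInfo(path) { UseShellExecute = true });
}
```

Choosing the extension from the result's MIME type rather than from the input file keeps the name correct when an image input produces a PNG and a PDF input produces a PDF.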

✨ Key Features

📄 Dual input support: digital PDFs (native text) and images/scanned PDFs (via OCR).
🔍 Three search modes:
- Text: exact case-insensitive substring match.
- Regex: full regular expression pattern matching.
- Fuzzy: approximate matching using Damerau-Levenshtein distance.
🤖 Automatic OCR detection: OCR triggers only when the document has no text layer.
🔧 Two OCR engines:
- LM-Kit OCR: traditional OCR, no model download required.
- PaddleOCR-VL: vision-language model with coordinate-aware extraction (~1 GB).
🖍️ Highlighted output: PDF annotations or PNG overlays marking every match.
📊 Result summary: match count, page numbers, text snippets, and elapsed time.
❌ Friendly errors: clear messages for invalid paths, OCR failures, and search errors.
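
To make the Fuzzy mode more concrete, here is a minimal sketch of the restricted Damerau-Levenshtein distance (the optimal string alignment variant), which counts insertions, deletions, substitutions, and swaps of adjacent characters. FuzzyDemo is a hypothetical helper, not LM-Kit's implementation; which variant, threshold, and case handling LM-Kit.NET uses internally is not covered here.

```csharp
using System;

Console.WriteLine(FuzzyDemo.Distance("total", "totla"));     // prints 1 (one adjacent swap)
Console.WriteLine(FuzzyDemo.Distance("invoice", "inv0ice")); // prints 1 (one substitution)

// Hypothetical helper, not part of LM-Kit.NET. Comparison is case-sensitive.
static class FuzzyDemo
{
    public static int Distance(string a, string b)
    {
        // d[i, j] = edit distance between the first i chars of a and the first j chars of b.
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
        for (int j = 0; j <= b.Length; j++) d[0, j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                int best = Math.Min(
                    Math.Min(d[i - 1, j] + 1,   // deletion
                             d[i, j - 1] + 1),  // insertion
                    d[i - 1, j - 1] + cost);    // substitution or match

                // Transposition of two adjacent characters.
                if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                {
                    best = Math.Min(best, d[i - 2, j - 2] + 1);
                }

                d[i, j] = best;
            }
        }

        return d[a.Length, b.Length];
    }
}
```

A distance of 1 between "total" and "totla" is the kind of near miss (a typo or an OCR slip) that fuzzy search tolerates and exact search rejects.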

🛠️ Commands and Flow

Inside the console loop:

File input
- The app prompts: enter PDF or image file path (or 'q' to quit):
- Type a file path and press Enter.
OCR (if needed)
- If Attachment.HasText is false, the app shows an OCR engine selection menu.
- OCR runs page-by-page and collects PageElement arrays for the search engine.
Search
- Enter a search query.
- Choose a mode: 0 (Text), 1 (Regex), or 2 (Fuzzy).
- The engine scans all pages and returns matches.
Output
- Results are printed to the console.
- If matches were found, the highlighted file is saved and opened automatically.
Quit
- At any prompt, typing q exits the app cleanly.
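
The prompt handling above can be sketched as two small parsing helpers. This is an illustrative sketch only: DemoSearchMode and Menu are hypothetical names (the real sample uses LM-Kit's own SearchMode type), and falling back to Text on unrecognized input is an assumption, not documented behavior.

```csharp
using System;

Console.WriteLine(Menu.ParseMode("2")); // prints Fuzzy
Console.WriteLine(Menu.IsQuit(" Q "));  // prints True

// Hypothetical stand-in for the three search modes: 0 (Text), 1 (Regex), 2 (Fuzzy).
enum DemoSearchMode { Text = 0, Regex = 1, Fuzzy = 2 }

// Hypothetical helper, not part of LM-Kit.NET.
static class Menu
{
    // True when the user typed q (any case, surrounding spaces ignored).
    public static bool IsQuit(string? input) =>
        string.Equals(input?.Trim(), "q", StringComparison.OrdinalIgnoreCase);

    // Maps the menu digit to a mode; anything else falls back to Text (an assumption).
    public static DemoSearchMode ParseMode(string? input) =>
        input?.Trim() switch
        {
            "1" => DemoSearchMode.Regex,
            "2" => DemoSearchMode.Fuzzy,
            _ => DemoSearchMode.Text
        };
}
```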

🗣️ Example Use Cases

Try the sample with:

A digital PDF invoice to find all occurrences of "total", "VAT", or an account number.
A scanned contract to search for party names or specific clauses with fuzzy matching.
A photograph of a receipt to locate line items using regex patterns like \d+\.\d{2}.
A multi-page report to search for a keyword and see which pages contain it.

After each run, inspect:

The highlighted PDF or PNG to visually confirm match positions.
The console output for match counts and page distribution.
The elapsed time to gauge performance on your hardware.
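
Before running the receipt use case, it can help to try the \d+\.\d{2} pattern against plain text with .NET's own Regex class. The sketch below assumes nothing about LM-Kit.NET: AmountFinder is a hypothetical helper and the receipt text is made-up sample data standing in for OCR output.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Made-up receipt text standing in for OCR output.
string receiptText = "Coffee 3.50\nBagel 2.25\nTotal 5.75 EUR";

foreach (string amount in AmountFinder.Find(receiptText))
{
    Console.WriteLine(amount);
}

// Hypothetical helper, not part of LM-Kit.NET.
static class AmountFinder
{
    // One or more digits, a literal dot, then two digits.
    private static readonly Regex Amount = new(@"\d+\.\d{2}");

    // Returns every substring of the text that matches the amount pattern, in order.
    public static List<string> Find(string text)
    {
        var found = new List<string>();
        foreach (Match m in Amount.Matches(text))
        {
            found.Add(m.Value);
        }
        return found;
    }
}
```

The pattern is not anchored, so a value such as 12.345 would still yield the match 12.34; appending \b to the pattern is one way to reject such cases.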

💻 Minimal Integration Snippet

using System;
using System.IO;
using LMKit.Data;
using LMKit.Document.Search;

// Search a digital PDF (native text layer)
var options = new SearchHighlightOptions
{
    SearchMode = SearchMode.Text
};

SearchHighlightResult result = SearchHighlightEngine.Highlight(
    "invoice.pdf", "total", options);

Console.WriteLine($"Matches: {result.TotalMatches}");

// Save the highlighted output
string ext = result.OutputMimeType == "application/pdf" ? ".pdf" : ".png";
File.WriteAllBytes($"invoice_highlighted{ext}", result.OutputData);

For scanned documents or images that require OCR:

using System.IO;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Document.Search;
using LMKit.Extraction.Ocr;

var attachment = new Attachment("scanned_invoice.png");

// Run OCR to extract text with coordinates
using var ocr = new LMKitOcr();
var pageElements = new PageElement[attachment.PageCount];

for (int i = 0; i < attachment.PageCount; i++)
{
    var ocrResult = await ocr.RunAsync(attachment, i);
    pageElements[i] = ocrResult.PageElement;
}

// Search and highlight using OCR results
var options = new SearchHighlightOptions
{
    SearchMode = SearchMode.Fuzzy
};

SearchHighlightResult result = SearchHighlightEngine.Highlight(
    attachment, "total", options, pageElements);

File.WriteAllBytes("scanned_invoice_highlighted.png", result.OutputData);

🛠️ Getting Started

📋 Prerequisites

.NET 8.0 or later
For PaddleOCR-VL: ~1 GB VRAM (model downloaded automatically)
For LM-Kit OCR: no additional requirements (data files downloaded automatically)

📥 Download

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/document-intelligence/document-search/search_and_highlight

Project Link: search_and_highlight (same path as above)

▶️ Run

dotnet build
dotnet run

Then:

Type the path to a PDF or image file (or q to quit).
If OCR is needed, select an OCR engine: 0 (LM-Kit OCR) or 1 (PaddleOCR-VL).
Enter a search query.
Choose a search mode: 0 (Text), 1 (Regex), or 2 (Fuzzy).
Inspect the highlighted output file that opens automatically.
Press Enter to process another file, or q to exit.

🔧 Troubleshooting

Symptom	Cause	Fix
"No extractable text found" on a digital PDF	The PDF may contain image-only pages	Choose an OCR engine when prompted
Fuzzy search returns too many matches	Default distance threshold is broad	Use a longer, more specific query, or switch to Text or Regex mode
PaddleOCR-VL download is slow	First run downloads ~1 GB model	Wait for completion; subsequent runs use the cached model
"Error: Unable to open" message	Invalid file path or unsupported format	Check the path and ensure the file is a PDF, PNG, JPG, BMP, TIFF, or WebP
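
The "Unable to open" case can often be caught before the file is handed to the engine. The sketch below checks the extension against the formats listed in the table; InputCheck is a hypothetical helper, and accepting the .jpeg and .tif spellings is an assumption beyond what the table states.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

Console.WriteLine(InputCheck.IsSupported("scan.TIFF"));  // prints True
Console.WriteLine(InputCheck.IsSupported("notes.docx")); // prints False

// Hypothetical helper, not part of LM-Kit.NET.
static class InputCheck
{
    // Extensions for PDF, PNG, JPG, BMP, TIFF, and WebP, compared case-insensitively.
    private static readonly HashSet<string> Supported = new(StringComparer.OrdinalIgnoreCase)
    {
        ".pdf", ".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff", ".webp"
    };

    // True when the path ends with one of the supported extensions.
    public static bool IsSupported(string path) =>
        Supported.Contains(Path.GetExtension(path));
}
```

This only inspects the file name; pairing it with File.Exists(path) covers the invalid-path half of the symptom.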

🔧 Extend the Demo

Restrict the search to specific pages using SearchHighlightOptions.PageRange.
Customize highlight appearance (color, opacity) via SearchHighlightOptions.
Combine with LM-Kit's Structured Extraction to extract field values from matched regions.
Build a batch pipeline that searches across a folder of documents.
Export match metadata (page numbers, coordinates, text) to JSON or CSV for downstream processing.
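
As one possible starting point for the export idea, the sketch below serializes match metadata to JSON with System.Text.Json. MatchRecord and MatchExport are hypothetical names and the rows are made-up sample data; in a real pipeline the values would be copied from the entries in SearchHighlightResult.Matches, whose exact property names are not shown here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Made-up sample rows; real values would come from the search result.
var rows = new List<MatchRecord>
{
    new(0, "total", "Sub total: 120.00"),
    new(2, "Total", "Total due: 144.00"),
};

Console.WriteLine(MatchExport.ToJson(rows));

// Hypothetical flat shape for one exported match.
record MatchRecord(int PageIndex, string Text, string Snippet);

// Hypothetical helper, not part of LM-Kit.NET.
static class MatchExport
{
    // Serializes the matches as an indented JSON array.
    public static string ToJson(IEnumerable<MatchRecord> matches) =>
        JsonSerializer.Serialize(matches, new JsonSerializerOptions { WriteIndented = true });
}
```

Copying the fields into a plain record first keeps the export format independent of the library's own types, which makes the JSON schema easier to keep stable.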

🔍 Notes on Key Types

SearchHighlightEngine (LMKit.Document.Search): static API with Highlight(...) overloads.
- File path overload: extracts text from the native PDF text layer.
- Attachment + PageElement overload: uses pre-computed OCR results.
SearchHighlightOptions: configures the search behavior.
- SearchMode: Text, Regex, or Fuzzy.
- MaxResults: maximum matches to return (default: no limit).
SearchHighlightResult: the output of a search-and-highlight operation.
- OutputData: byte array of the highlighted PDF or PNG.
- OutputMimeType: "application/pdf" or "image/png".
- Matches: list of SearchMatch instances with page index, text, snippet, and bounds.
- TotalMatches, ScannedPages, PageCount: summary statistics.
LMKitOcr (LMKit.Extraction.Ocr): traditional OCR engine.
- No model download required; data files are fetched automatically.
- RunAsync(Attachment, pageIndex) returns an OcrResult with a PageElement.
VlmOcr (LMKit.Extraction.Ocr): vision-language model OCR engine.
- Requires a loaded LM model (e.g. paddleocr-vl-1.6:0.9b).
- Use VlmOcrIntent.OcrWithCoordinates for coordinate-aware extraction.

📚 Additional Resources

How-To: Search and Locate Content Within Documents: Step-by-step guide covering document search, including SearchHighlightEngine.
How-To: Process Scanned Documents with OCR and Vision Models: Covers OCR preprocessing before search and extraction.
VLM OCR with Coordinates Demo: Coordinate-aware OCR with bounding box visualization.
Glossary: Optical Character Recognition: Covers OCR concepts including traditional and VLM-based approaches.
