👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/document_processing/search_and_highlight

Search and Highlight for C# .NET Applications (PDFs and Images)


🎯 Purpose of the Demo

Search and Highlight demonstrates how to use LM-Kit.NET to find text inside PDF documents or images and produce a highlighted copy with every match visually marked.

The sample shows how to:

  • Open a PDF or image file as an Attachment.
  • Detect whether the document has extractable text (digital PDFs) or requires OCR (scanned documents or images).
  • Choose between two OCR engines: Tesseract (traditional OCR, no model download) or PaddleOCR-VL (vision-language model with coordinate output).
  • Run a search query in three modes: exact text, regex, or fuzzy (Damerau-Levenshtein distance).
  • Call SearchHighlightEngine.Highlight to find matches and produce an annotated output file.
  • Save the result as a highlighted PDF (with annotations) or PNG (with overlays) and auto-open it.

Why Search and Highlight?

  • Instant visual feedback: see every match highlighted directly in the output document.
  • Flexible search: choose between exact substring, regular expressions, or approximate matching depending on your use case.
  • Transparent OCR: scanned PDFs and images are run through OCR before the search, with no manual preprocessing.
  • No cloud dependency: all processing runs locally on your machine.

👥 Target Audience

  • Legal and Compliance: search contracts and filings for specific clauses, terms, or regulatory keywords.
  • Finance and Accounting: locate totals, tax references, or account numbers across invoices and statements.
  • Quality Assurance: verify that specific labels, warnings, or identifiers appear in printed documentation.
  • Research and Archival: find references across large collections of scanned documents.

🚀 Problem Solved

  • Finding text in documents without reading every page: enter a query and get highlighted matches with their page numbers.
  • Searching scanned or image-based documents: OCR is triggered automatically when no extractable text layer is found.
  • Approximate matching: fuzzy search catches typos, OCR artifacts, and minor variations that exact search would miss.
  • Producing annotated output: the highlighted copy can be shared, archived, or used as evidence in review workflows.

💻 Sample Application Description

Console app that:

  • Prompts for a file path (PDF or image).

  • Checks whether the file has extractable text via Attachment.HasText.

  • If OCR is needed, lets you choose between Tesseract (option 0) or PaddleOCR-VL (option 1).

  • Prompts for a search query and a search mode (Text, Regex, or Fuzzy).

  • Calls SearchHighlightEngine.Highlight(...) with either:

    • The file path directly (for digital PDFs with native text).
    • An Attachment plus PageElement array (for images or scanned PDFs after OCR).
  • Displays results: query, mode, pages scanned, total matches, and up to 10 match snippets with page numbers.

  • Saves the highlighted output next to the original file (e.g. invoice_highlighted.pdf).

  • Auto-opens the output file for immediate inspection.

  • Loops until you type q to quit.
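The output naming step can be sketched with plain .NET path APIs. This is a minimal sketch, not the sample's actual code: the helper name BuildOutputPath is hypothetical, and the "_highlighted" suffix is inferred from the invoice_highlighted.pdf example above. The two MIME types are the ones listed for OutputMimeType.

```csharp
using System;
using System.IO;

// Hypothetical helper: builds "<name>_highlighted.<ext>" next to the input file.
// The extension follows the output MIME type, not the input extension,
// because an image input yields a PNG regardless of its original format.
string BuildOutputPath(string inputPath, string outputMimeType)
{
    string ext = outputMimeType == "application/pdf" ? ".pdf" : ".png";
    string dir = Path.GetDirectoryName(inputPath) ?? string.Empty;
    string name = Path.GetFileNameWithoutExtension(inputPath);
    return Path.Combine(dir, name + "_highlighted" + ext);
}

Console.WriteLine(BuildOutputPath("invoice.pdf", "application/pdf"));
Console.WriteLine(BuildOutputPath("receipt.jpg", "image/png"));
```

Deriving the extension from the MIME type avoids writing PNG bytes to a file that still carries a .jpg or .bmp extension.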

✨ Key Features

  • 📄 Dual input support: digital PDFs (native text) and images/scanned PDFs (via OCR).

  • 🔍 Three search modes:

    • Text: exact case-insensitive substring match.
    • Regex: full regular expression pattern matching.
    • Fuzzy: approximate matching using Damerau-Levenshtein distance.
  • 🤖 Automatic OCR detection: OCR triggers only when the document has no text layer.

  • 🔧 Two OCR engines:

    • Tesseract: traditional OCR, no model download required.
    • PaddleOCR-VL: vision-language model with coordinate-aware extraction (~1 GB).
  • 🖍️ Highlighted output: PDF annotations or PNG overlays marking every match.

  • 📊 Result summary: match count, page numbers, text snippets, and elapsed time.

  • ❌ Friendly errors: clear messages for invalid paths, OCR failures, and search errors.
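The three search modes can be illustrated with plain .NET string APIs. This is only a sketch of the matching concepts, not LM-Kit.NET's implementation: the helper names (ContainsText, MatchesRegex, OsaDistance, FuzzyEquals) and the distance threshold are assumptions, and the fuzzy helper uses the optimal string alignment variant of Damerau-Levenshtein distance, which counts an adjacent transposition as a single edit.

```csharp
using System;
using System.Text.RegularExpressions;

// Text mode concept: case-insensitive substring test.
bool ContainsText(string haystack, string query) =>
    haystack.Contains(query, StringComparison.OrdinalIgnoreCase);

// Regex mode concept: the query is treated as a regular expression.
bool MatchesRegex(string haystack, string pattern) =>
    Regex.IsMatch(haystack, pattern);

// Fuzzy mode concept: edit distance with insert, delete, substitute,
// and adjacent transposition (optimal string alignment variant).
int OsaDistance(string a, string b)
{
    var d = new int[a.Length + 1, b.Length + 1];
    for (int i = 0; i <= a.Length; i++) d[i, 0] = i;
    for (int j = 0; j <= b.Length; j++) d[0, j] = j;

    for (int i = 1; i <= a.Length; i++)
    {
        for (int j = 1; j <= b.Length; j++)
        {
            int cost = a[i - 1] == b[j - 1] ? 0 : 1;
            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);

            // Adjacent transposition, e.g. "ta" <-> "at".
            if (i > 1 && j > 1 && a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                d[i, j] = Math.Min(d[i, j], d[i - 2, j - 2] + 1);
        }
    }
    return d[a.Length, b.Length];
}

// Hypothetical threshold check; the engine's real threshold is not documented here.
bool FuzzyEquals(string a, string b, int maxDistance) =>
    OsaDistance(a.ToLowerInvariant(), b.ToLowerInvariant()) <= maxDistance;

Console.WriteLine(ContainsText("Invoice TOTAL: 42.00", "total"));    // True
Console.WriteLine(MatchesRegex("Invoice TOTAL: 42.00", @"\d+\.\d{2}")); // True
Console.WriteLine(OsaDistance("total", "toatl"));                    // 1 (transposition)
Console.WriteLine(FuzzyEquals("Total", "tota1", 1));                 // True (one substitution)
```

The last two lines show why fuzzy matching suits OCR output: a swapped letter pair or a "1" read in place of an "l" still falls within a distance of 1.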


🛠️ Commands and Flow

Inside the console loop:

  • File input

    • The app prompts: enter PDF or image file path (or 'q' to quit):
    • Type a file path and press Enter.
  • OCR (if needed)

    • If Attachment.HasText is false, the app shows an OCR engine selection menu.
    • OCR runs page-by-page and collects PageElement arrays for the search engine.
  • Search

    • Enter a search query.
    • Choose a mode: 0 (Text), 1 (Regex), or 2 (Fuzzy).
    • The engine scans all pages and returns matches.
  • Output

    • Results are printed to the console.
    • If matches were found, the highlighted file is saved and opened automatically.
  • Quit

    • At any prompt, typing q exits the app cleanly.
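The prompt handling described above can be sketched as two small helpers. This is an illustration under assumptions, not the sample's code: IsQuit and TryParseMode are hypothetical names, treating q as case-insensitive is a guess, and the inputs are simulated so the snippet runs without a console session. The 0/1/2 mapping comes from the menu described in the flow.

```csharp
using System;

// Hypothetical helper: true when the user asked to quit.
bool IsQuit(string input) =>
    input.Trim().Equals("q", StringComparison.OrdinalIgnoreCase);

// Hypothetical helper: maps the menu choice to a mode name.
// Returns false (and an empty mode) for anything other than 0, 1, or 2.
bool TryParseMode(string input, out string mode)
{
    mode = input.Trim() switch
    {
        "0" => "Text",
        "1" => "Regex",
        "2" => "Fuzzy",
        _ => ""
    };
    return mode.Length > 0;
}

// Simulated user entries instead of Console.ReadLine().
foreach (string line in new[] { "0", " 2 ", "7", "q" })
{
    if (IsQuit(line))
    {
        Console.WriteLine("Quitting.");
        break;
    }

    Console.WriteLine(TryParseMode(line, out string parsed)
        ? $"Mode: {parsed}"
        : $"Invalid choice: {line}");
}
```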

🗣️ Example Use Cases

Try the sample with:

  • A digital PDF invoice to find all occurrences of "total", "VAT", or an account number.
  • A scanned contract to search for party names or specific clauses with fuzzy matching.
  • A photograph of a receipt to locate line items using regex patterns like \d+\.\d{2}.
  • A multi-page report to search for a keyword and see which pages contain it.
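Before running a regex query against a document, it can help to try the pattern on a plain string. The sketch below uses System.Text.RegularExpressions on made-up receipt text standing in for OCR output; it does not call LM-Kit.NET. It also shows one pitfall of \d+\.\d{2}: without word boundaries, the pattern matches the first part of a longer number.

```csharp
using System;
using System.Text.RegularExpressions;

// Made-up sample text; a real run would search the OCR output instead.
string receiptText = "Coffee 3.50\nBagel 2.25\nTotal 5.75\nRef 1.2345";

// Pattern from the use case above: also matches "1.23" inside "1.2345".
MatchCollection loose = Regex.Matches(receiptText, @"\d+\.\d{2}");

// Same pattern with word boundaries: only complete amounts match.
MatchCollection strict = Regex.Matches(receiptText, @"\b\d+\.\d{2}\b");

Console.WriteLine($"Loose matches: {loose.Count}");   // 4
Console.WriteLine($"Strict matches: {strict.Count}"); // 3

foreach (Match m in strict)
{
    Console.WriteLine(m.Value);
}
```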

After each run, inspect:

  • The highlighted PDF or PNG to visually confirm match positions.
  • The console output for match counts and page distribution.
  • The elapsed time to gauge performance on your hardware.

💻 Minimal Integration Snippet

using System;
using System.IO;
using LMKit.Data;
using LMKit.Document.Search;

// Search a digital PDF (native text layer)
var options = new SearchHighlightOptions
{
    SearchMode = SearchMode.Text
};

SearchHighlightResult result = SearchHighlightEngine.Highlight(
    "invoice.pdf", "total", options);

Console.WriteLine($"Matches: {result.TotalMatches}");

// Save the highlighted output
string ext = result.OutputMimeType == "application/pdf" ? ".pdf" : ".png";
File.WriteAllBytes($"invoice_highlighted{ext}", result.OutputData);

For scanned documents or images that require OCR:

using System.IO;
using LMKit.Data;
using LMKit.Document.Layout;
using LMKit.Document.Search;
using LMKit.Integrations.Tesseract;

var attachment = new Attachment("scanned_invoice.png");

// Run OCR to extract text with coordinates
using var ocr = new TesseractOcr();
var pageElements = new PageElement[attachment.PageCount];

for (int i = 0; i < attachment.PageCount; i++)
{
    var ocrResult = await ocr.RunAsync(attachment, i);
    pageElements[i] = ocrResult.PageElement;
}

// Search and highlight using OCR results
var options = new SearchHighlightOptions
{
    SearchMode = SearchMode.Fuzzy
};

SearchHighlightResult result = SearchHighlightEngine.Highlight(
    attachment, "total", options, pageElements);

File.WriteAllBytes("scanned_invoice_highlighted.png", result.OutputData);

🛠️ Getting Started

📋 Prerequisites

  • .NET 8.0 or later
  • For PaddleOCR-VL: ~1 GB VRAM (model downloaded automatically)
  • For Tesseract: no additional requirements (data files downloaded automatically)

📥 Download

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/document_processing/search_and_highlight

Project link: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/document_processing/search_and_highlight

▶️ Run

dotnet build
dotnet run

Then:

  1. Type the path to a PDF or image file (or q to quit).
  2. If OCR is needed, select an OCR engine: 0 (Tesseract) or 1 (PaddleOCR-VL).
  3. Enter a search query.
  4. Choose a search mode: 0 (Text), 1 (Regex), or 2 (Fuzzy).
  5. Inspect the highlighted output file that opens automatically.
  6. Press Enter to process another file, or q to exit.

🔧 Troubleshooting

| Symptom | Cause | Fix |
| --- | --- | --- |
| "No extractable text found" on a digital PDF | The PDF may contain image-only pages | Choose an OCR engine when prompted |
| Fuzzy search returns too many matches | Default distance threshold is broad | Combine fuzzy search with a more specific query |
| PaddleOCR-VL download is slow | First run downloads ~1 GB model | Wait for completion; subsequent runs use the cached model |
| "Error: Unable to open" message | Invalid file path or unsupported format | Check the path and ensure the file is a PDF, PNG, JPG, BMP, TIFF, or WebP |
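The "Unable to open" case can often be caught before the file reaches the engine. The following is a minimal pre-check sketch in plain .NET: the helper name HasSupportedExtension is hypothetical, the base extension list mirrors the formats named in the table, and the .jpeg and .tif aliases are assumptions. An extension check does not prove the file content is valid, so the engine may still reject a file that passes it.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Formats named in the troubleshooting table, compared case-insensitively.
// ".jpeg" and ".tif" are assumed aliases, not confirmed by the documentation.
var supported = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
{
    ".pdf", ".png", ".jpg", ".jpeg", ".bmp", ".tiff", ".tif", ".webp"
};

// Hypothetical helper: checks only the extension, not the file content.
bool HasSupportedExtension(string path) =>
    supported.Contains(Path.GetExtension(path));

Console.WriteLine(HasSupportedExtension("scan.TIFF"));   // True
Console.WriteLine(HasSupportedExtension("notes.docx"));  // False
Console.WriteLine(HasSupportedExtension("no_extension")); // False
```

In a real prompt loop this check would typically be paired with File.Exists(path) so that a missing file and an unsupported format produce different messages.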

🔧 Extend the Demo

  • Restrict the search to specific pages using SearchHighlightOptions.PageRange.
  • Customize highlight appearance (color, opacity) via SearchHighlightOptions.
  • Combine with LM-Kit's Structured Extraction to extract field values from matched regions.
  • Build a batch pipeline that searches across a folder of documents.
  • Export match metadata (page numbers, coordinates, text) to JSON or CSV for downstream processing.
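As a starting point for the metadata export idea, match data can be serialized with System.Text.Json. The sketch below is an illustration only: the rows are made-up values, and the field names (File, Page, Text, X, Y, Width, Height) are hypothetical, since the property names of SearchMatch are not listed in this document. In a real pipeline each row would be filled from result.Matches.

```csharp
using System;
using System.Text.Json;

// Made-up rows with hypothetical field names; replace with values
// copied from the search result's matches in a real pipeline.
var rows = new[]
{
    new { File = "invoice.pdf", Page = 1, Text = "Total", X = 72.0, Y = 540.5, Width = 38.2, Height = 12.0 },
    new { File = "invoice.pdf", Page = 3, Text = "total", X = 310.0, Y = 96.0, Width = 36.4, Height = 12.0 },
};

// Indented JSON is easier to review; drop the option for compact output.
string json = JsonSerializer.Serialize(rows, new JsonSerializerOptions { WriteIndented = true });
Console.WriteLine(json);
```

Writing one such JSON document per input file keeps a batch run easy to resume, because each output is independent of the others.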

🔍 Notes on Key Types

  • SearchHighlightEngine (LMKit.Document.Search): static API with Highlight(...) overloads.

    • File path overload: extracts text from the native PDF text layer.
    • Attachment + PageElement overload: uses pre-computed OCR results.
  • SearchHighlightOptions: configures the search behavior.

    • SearchMode: Text, Regex, or Fuzzy.
    • MaxResults: maximum matches to return (default: no limit).
  • SearchHighlightResult: the output of a search-and-highlight operation.

    • OutputData: byte array of the highlighted PDF or PNG.
    • OutputMimeType: "application/pdf" or "image/png".
    • Matches: list of SearchMatch instances with page index, text, snippet, and bounds.
    • TotalMatches, ScannedPages, PageCount: summary statistics.
  • TesseractOcr (LMKit.Integrations.Tesseract): traditional OCR engine.

    • No model download required; data files are fetched automatically.
    • RunAsync(Attachment, pageIndex) returns an OcrResult with a PageElement.
  • VlmOcr (LMKit.Extraction.Ocr): vision-language model OCR engine.

    • Requires a loaded LM model (e.g. paddleocr-vl:0.9b).
    • Use VlmOcrIntent.OcrWithCoordinates for coordinate-aware extraction.
