👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/document-intelligence/pdf-toolkit/searchable_pdf_from_scans
Searchable PDF from Scans for C# .NET Applications
🎯 Purpose of the Demo
An interactive console app that turns image-only scanned PDFs into searchable PDFs: identical visual output, plus an invisible text layer that makes the file selectable, copyable, and indexable. Two modes: single file or whole folder. The OCR model is picked at startup so the cost is paid once.
All OCR runs on-device.
👥 Industry Target Audience
- Document-management systems: ingest scanned archives without losing search.
- Compliance: legal hold, GDPR access, FOIA - all require searchable PDFs.
- eDiscovery: text-search across millions of pages of evidence.
- Healthcare: patient records imported from scanners need search.
- Government / archives: turn microfilm-era scans into indexed assets.
🚀 Problem Solved
PDF -> PDF/OCR is one of the most-asked PDF conversions. Cloud OCR services are convenient but force confidential documents through a third-party. Self-hosted Tesseract pipelines plateau on layout-heavy or handwritten text. VLM-OCR models (paddleocr-vl, glm-ocr, lightonocr-2) match or exceed cloud accuracy on real-world scans and run entirely on-device. The demo collapses the pipeline into one SDK call wrapped in a menu.
💻 Application Overview
Interactive menu with two modes:
| Mode | What it does |
|---|---|
| File | Convert one scanned PDF. Prompts for input and output. |
| Folder | Convert every PDF in a folder. Output goes to a chosen directory, one .searchable.pdf per source. |
| Quit | Exit. |
The OCR model is selected at startup. Pages that already carry a text layer are skipped (TextPageHandling.Skip + TextDetectionStrategy.HasText), which makes the demo safe to run over mixed archives. Ctrl-C cancels cleanly.
✨ Key Features
PdfSearchableMaker.ConvertToFileAsync(...)for the conversion.PdfSearchableMakerOptions:TextPageHandling,TextDetectionStrategy,MaxDegreeOfParallelism,Progress.VlmOcr(model, VlmOcrIntent.Markdown)for layout-preserving OCR.- Model picker at startup: paddleocr-vl-1.6:0.9b (default), glm-ocr, lightonocr-2:1b.
- Per-page progress via
Progress<OcrProgressEventArgs>.
🧠 Models
paddleocr-vl-1.6:0.9b(default, fast, low-VRAM)glm-ocr(higher accuracy)lightonocr-2:1b(balanced)
🛠️ Getting Started
📋 Prerequisites
- .NET 8.0 or later
- A GPU is recommended for VLM-OCR (CPU works but is slower).
▶️ Running the Application
git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/document-intelligence/pdf-toolkit/searchable_pdf_from_scans
dotnet run
Pick the OCR model, then a mode from the menu, then follow the prompts.