👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/document-intelligence/pdf-toolkit/searchable_pdf_from_scans

Searchable PDF from Scans for C# .NET Applications

🎯 Purpose of the Demo

An interactive console app that turns image-only scanned PDFs into searchable PDFs: identical visual output, plus an invisible text layer that makes the file selectable, copyable, and indexable. Two modes: single file or whole folder. The OCR model is picked at startup so the cost is paid once.

All OCR runs on-device.

👥 Industry Target Audience

Document-management systems: ingest scanned archives without losing search.
Compliance: legal hold, GDPR access, FOIA - all require searchable PDFs.
eDiscovery: text-search across millions of pages of evidence.
Healthcare: patient records imported from scanners need search.
Government / archives: turn microfilm-era scans into indexed assets.

🚀 Problem Solved

PDF -> PDF/OCR is one of the most-asked PDF conversions. Cloud OCR services are convenient but force confidential documents through a third-party. Self-hosted Tesseract pipelines plateau on layout-heavy or handwritten text. VLM-OCR models (paddleocr-vl, glm-ocr, lightonocr-2) match or exceed cloud accuracy on real-world scans and run entirely on-device. The demo collapses the pipeline into one SDK call wrapped in a menu.

💻 Application Overview

Interactive menu with two modes:

Mode	What it does
File	Convert one scanned PDF. Prompts for input and output.
Folder	Convert every PDF in a folder. Output goes to a chosen directory, one `.searchable.pdf` per source.
Quit	Exit.

The OCR model is selected at startup. Pages that already carry a text layer are skipped (TextPageHandling.Skip + TextDetectionStrategy.HasText), which makes the demo safe to run over mixed archives. Ctrl-C cancels cleanly.

✨ Key Features

PdfSearchableMaker.ConvertToFileAsync(...) for the conversion.
PdfSearchableMakerOptions: TextPageHandling, TextDetectionStrategy, MaxDegreeOfParallelism, Progress.
VlmOcr(model, VlmOcrIntent.Markdown) for layout-preserving OCR.
Model picker at startup: paddleocr-vl-1.6:0.9b (default), glm-ocr, lightonocr-2:1b.
Per-page progress via Progress<OcrProgressEventArgs>.

🧠 Models

paddleocr-vl-1.6:0.9b (default, fast, low-VRAM)
glm-ocr (higher accuracy)
lightonocr-2:1b (balanced)

🛠️ Getting Started

📋 Prerequisites

.NET 8.0 or later
A GPU is recommended for VLM-OCR (CPU works but is slower).

▶️ Running the Application

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/document-intelligence/pdf-toolkit/searchable_pdf_from_scans
dotnet run

Pick the OCR model, then a mode from the menu, then follow the prompts.

Table of Contents