👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/document-intelligence/pdf-toolkit/searchable_pdf_from_scans

Searchable PDF from Scans for C# .NET Applications


🎯 Purpose of the Demo

An interactive console app that turns image-only scanned PDFs into searchable PDFs: identical visual output, plus an invisible text layer that makes the file selectable, copyable, and indexable. It has two modes: single file or whole folder. The OCR model is picked once at startup, so the model-loading cost is paid a single time rather than per file.

All OCR runs on-device.


👥 Industry Target Audience

  • Document-management systems: ingest scanned archives without losing search.
  • Compliance: legal holds, GDPR access requests, and FOIA requests all depend on searchable PDFs.
  • eDiscovery: text-search across millions of pages of evidence.
  • Healthcare: patient records imported from scanners need search.
  • Government / archives: turn microfilm-era scans into indexed assets.

🚀 Problem Solved

Converting an image-only PDF into a searchable (OCR'd) PDF is one of the most requested PDF conversions. Cloud OCR services are convenient but force confidential documents through a third party. Self-hosted Tesseract pipelines plateau on layout-heavy or handwritten text. VLM-OCR models (paddleocr-vl, glm-ocr, lightonocr-2) match or exceed cloud accuracy on real-world scans and run entirely on-device. The demo collapses the pipeline into one SDK call wrapped in a menu.


💻 Application Overview

Interactive menu with two modes:

Mode     What it does
File     Convert one scanned PDF. Prompts for input and output.
Folder   Convert every PDF in a folder. Output goes to a chosen directory, one .searchable.pdf per source.
Quit     Exit.
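
The Folder mode naming rule above ("one .searchable.pdf per source") can be sketched with standard .NET file APIs. This is a minimal illustration, not the demo's actual code: the class and method names (FolderModePlanner, Plan, TargetFor) are hypothetical, and it assumes only PDFs directly inside the input folder are converted, with no recursion into subfolders.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: plans Folder-mode work without touching any OCR API.
public static class FolderModePlanner
{
    // Pairs every PDF directly inside inputDir with its target path in outputDir.
    public static IEnumerable<(string Source, string Target)> Plan(string inputDir, string outputDir)
    {
        foreach (string source in Directory.EnumerateFiles(inputDir))
        {
            // Compare the extension case-insensitively so "SCAN.PDF" is picked up too.
            if (Path.GetExtension(source).Equals(".pdf", StringComparison.OrdinalIgnoreCase))
            {
                yield return (source, TargetFor(source, outputDir));
            }
        }
    }

    // "invoice.pdf" becomes "<outputDir>/invoice.searchable.pdf".
    public static string TargetFor(string sourcePath, string outputDir)
    {
        string stem = Path.GetFileNameWithoutExtension(sourcePath);
        return Path.Combine(outputDir, stem + ".searchable.pdf");
    }
}
```

Each (Source, Target) pair would then be handed to the conversion call. Note that if the output directory is the same as the input directory, a second run of this sketch would also pick up the .searchable.pdf files written by the first run.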

The OCR model is selected at startup. Pages that already carry a text layer are skipped (TextPageHandling.Skip + TextDetectionStrategy.HasText), which makes the demo safe to run over mixed archives. Ctrl-C cancels cleanly.
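
One common way to get a clean Ctrl-C in a .NET console app is to route Console.CancelKeyPress into a CancellationTokenSource. The sketch below uses only standard .NET APIs. The wrapper name (CancellableRun) is hypothetical, and it assumes the conversion call accepts a CancellationToken, which this page does not show.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper: runs async work with a token that Ctrl-C cancels.
public static class CancellableRun
{
    // Returns true when the work completed, false when the user pressed Ctrl-C.
    public static async Task<bool> RunAsync(Func<CancellationToken, Task> work)
    {
        using var cts = new CancellationTokenSource();
        ConsoleCancelEventHandler onCancel = (_, e) =>
        {
            e.Cancel = true; // keep the process alive so cleanup can run
            cts.Cancel();    // ask the running work to stop
        };

        Console.CancelKeyPress += onCancel;
        try
        {
            await work(cts.Token);
            return true;
        }
        catch (OperationCanceledException) when (cts.IsCancellationRequested)
        {
            // Cancellation we requested ourselves: report it instead of crashing.
            return false;
        }
        finally
        {
            // Unsubscribe before the token source is disposed.
            Console.CancelKeyPress -= onCancel;
        }
    }
}
```

A caller would wrap the conversion in the delegate, for example `await CancellableRun.RunAsync(token => ConvertAsync(token))`, where ConvertAsync is a placeholder for whatever starts the conversion, and then return to the menu when the result is false.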

✨ Key Features

  • PdfSearchableMaker.ConvertToFileAsync(...) for the conversion.
  • PdfSearchableMakerOptions: TextPageHandling, TextDetectionStrategy, MaxDegreeOfParallelism, Progress.
  • VlmOcr(model, VlmOcrIntent.Markdown) for layout-preserving OCR.
  • Model picker at startup: paddleocr-vl-1.6:0.9b (default), glm-ocr, lightonocr-2:1b.
  • Per-page progress via Progress<OcrProgressEventArgs>.
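
The per-page progress pattern can be sketched with the standard IProgress<T> interface. The members of OcrProgressEventArgs are not listed on this page, so the sketch uses a hypothetical stand-in record (PageProgress) with an assumed zero-based page index and page count. It also uses a small synchronous reporter, because in a console app System.Progress<T> has no synchronization context to capture and invokes its callback on the thread pool, so lines may print out of order.

```csharp
using System;

// Hypothetical stand-in for the SDK's OcrProgressEventArgs (real members not shown here).
public sealed record PageProgress(int PageIndex, int PageCount);

// Invokes the handler on the reporting thread, so output order matches report order.
public sealed class SynchronousProgress<T> : IProgress<T>
{
    private readonly Action<T> _handler;

    public SynchronousProgress(Action<T> handler) => _handler = handler;

    public void Report(T value) => _handler(value);
}

public static class ProgressFormatting
{
    // Formats a zero-based page index as a one-based "page i/n (p%)" line.
    public static string Format(PageProgress p)
    {
        int done = p.PageIndex + 1;
        int percent = p.PageCount > 0 ? 100 * done / p.PageCount : 0;
        return $"page {done}/{p.PageCount} ({percent}%)";
    }
}
```

Usage: `var progress = new SynchronousProgress<PageProgress>(p => Console.WriteLine(ProgressFormatting.Format(p)));`. Reporting `new PageProgress(0, 4)` prints `page 1/4 (25%)`. With MaxDegreeOfParallelism above 1, pages may finish out of order regardless of which reporter is used.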

🧠 Models

  • paddleocr-vl-1.6:0.9b (default, fast, low-VRAM)
  • glm-ocr (higher accuracy)
  • lightonocr-2:1b (balanced)

🛠️ Getting Started

📋 Prerequisites

  • .NET 8.0 or later
  • A GPU is recommended for VLM-OCR (CPU works but is slower).

▶️ Running the Application

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/document-intelligence/pdf-toolkit/searchable_pdf_from_scans
dotnet run

Pick the OCR model, then a mode from the menu, then follow the prompts.
