👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/document-intelligence/pdf-toolkit/tiff_to_pdfa_archive
Multipage TIFF to PDF/A-1B Archive for C# .NET Applications
🎯 Purpose of the Demo
An interactive console app that converts scanned multipage TIFFs into searchable PDF/A archive files. The original raster is preserved as the visible page; an invisible OCR text layer is overlaid so the output is selectable, copyable, and full-text searchable while remaining a faithful visual archive that conforms to ISO 19005-1 (PDF/A-1B) by default.
All OCR runs on-device.
👥 Industry Target Audience
- Records management: long-term archival of scanned correspondence, contracts, and case files.
- Government / public sector: FOIA, tax records, legal filings that mandate PDF/A.
- Legal / eDiscovery: archival and searchability in one step, with no need to re-run OCR later.
- Banking / finance: KYC, signed-disclosure storage with reproducible visual rendering.
- Healthcare: archive paper-era patient records as searchable, audit-ready PDFs.
- Fax-inbound pipelines: TIFF is still the dominant fax format, while PDF/A is the standard archival target.
🚀 Problem Solved
Multipage TIFFs are the standard raster output of scanners and fax servers, but they have two problems: they are not searchable, and they are not a recognized long-term archival format. The ImageToSearchablePdf API solves both in one pass: it runs OCR on every page on-device, overlays an invisible text layer, and writes a PDF/A-1B (or 2B / 3B) compliant file that meets the visual-fidelity requirements of ISO 19005.
💻 Application Overview
Interactive menu (no command-line arguments) with two conversion modes and a quit option:
| Mode | What it does |
|---|---|
| File | Convert one multipage TIFF. Prompts for archive level (1B / 2B / 3B), OCR languages, page range, and parallelism. |
| Folder | Convert every .tif / .tiff in a folder (optional recurse). Writes an archive_manifest.csv with source, output, pages, bytes, elapsed_ms, status per file. |
| Quit | Exit. |
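
The folder mode described in the table can be approximated with standard .NET file APIs. The following is a minimal sketch, not the demo's actual code: the `ManifestRow` record and the `ArchiveManifest` helper are hypothetical names introduced for illustration, and the exact header text and quoting rules of the real archive_manifest.csv are assumptions based on the column list above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical row type; the fields mirror the manifest columns listed in the table.
public sealed record ManifestRow(
    string Source, string Output, int Pages, long Bytes, long ElapsedMs, string Status);

public static class ArchiveManifest
{
    // Assumed header; the demo's real header line may differ.
    public const string Header = "source,output,pages,bytes,elapsed_ms,status";

    // Finds .tif / .tiff files, optionally recursing into subfolders.
    public static List<string> FindTiffs(string folder, bool recurse)
    {
        var option = recurse ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly;
        return Directory.EnumerateFiles(folder, "*", option)
            .Where(p => p.EndsWith(".tif", StringComparison.OrdinalIgnoreCase)
                     || p.EndsWith(".tiff", StringComparison.OrdinalIgnoreCase))
            .OrderBy(p => p, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }

    // Quotes a field only when it contains a comma, a quote, or a line break.
    public static string Escape(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Formats one row with culture-invariant numbers so the CSV stays portable.
    public static string FormatRow(ManifestRow r) =>
        string.Join(",",
            Escape(r.Source),
            Escape(r.Output),
            r.Pages.ToString(CultureInfo.InvariantCulture),
            r.Bytes.ToString(CultureInfo.InvariantCulture),
            r.ElapsedMs.ToString(CultureInfo.InvariantCulture),
            Escape(r.Status));

    // Writes the header plus one line per converted file, UTF-8 without BOM.
    public static void Write(string path, IEnumerable<ManifestRow> rows)
    {
        var lines = new List<string> { Header };
        lines.AddRange(rows.Select(FormatRow));
        File.WriteAllLines(path, lines, new UTF8Encoding(false));
    }
}
```

A caller would collect one `ManifestRow` per source file (including failed conversions, with a non-success status) and call `ArchiveManifest.Write` once at the end, so a partial run still leaves an audit trail.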
The OCR engine loads once at startup. Per-page progress is reported via IProgress<OcrProgressEventArgs>. Ctrl-C cancels cleanly.
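The progress and cancellation behavior described above follows a common .NET pattern. The sketch below shows that pattern with standard library types only: `PageProgress` is a hypothetical stand-in for the library's OcrProgressEventArgs, and `ProcessPagesAsync` simulates per-page work rather than calling the real conversion API.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the library's OcrProgressEventArgs.
public sealed record PageProgress(int PageNumber, int PageCount);

public static class CancellableRun
{
    // Simulated per-page work; a real run would call the conversion API here.
    public static async Task<int> ProcessPagesAsync(
        int pageCount, IProgress<PageProgress> progress, CancellationToken ct)
    {
        int done = 0;
        for (int page = 1; page <= pageCount; page++)
        {
            ct.ThrowIfCancellationRequested();   // stop between pages when cancelled
            await Task.Delay(10, ct);            // placeholder for OCR work
            done++;
            progress.Report(new PageProgress(page, pageCount));
        }
        return done;
    }

    // Wires Ctrl-C to a CancellationTokenSource and reports progress to the console.
    // Returns the number of pages processed, or -1 if the run was cancelled.
    public static async Task<int> RunAsync(int pageCount)
    {
        using var cts = new CancellationTokenSource();
        ConsoleCancelEventHandler onCancel = (_, e) =>
        {
            e.Cancel = true;   // keep the process alive so cleanup can run
            cts.Cancel();
        };
        Console.CancelKeyPress += onCancel;
        try
        {
            // Progress<T> invokes the callback on the captured synchronization
            // context, or on the thread pool when there is none (console apps).
            var progress = new Progress<PageProgress>(
                p => Console.WriteLine($"page {p.PageNumber}/{p.PageCount}"));
            return await ProcessPagesAsync(pageCount, progress, cts.Token);
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("Cancelled.");
            return -1;
        }
        finally
        {
            Console.CancelKeyPress -= onCancel;
        }
    }
}
```

Calling `await CancellableRun.RunAsync(5);` from an async `Main` runs the simulated job. Setting `e.Cancel = true` is what keeps Ctrl-C from killing the process immediately, which gives the `catch` and `finally` blocks a chance to run.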
✨ Key Features
- ImageToSearchablePdf.ConvertAsync(tiffPath, ocrEngine, outputPdf, options, ct): one call, end-to-end pipeline.
- LMKitOcr: on-device, Tesseract-based, no LLM required for default English.
- PdfGenerationOptions:
  - Version: PdfA1b (default), PdfA2b, PdfA3b.
  - Languages: pick from the Language enum (English, French, German, Japanese, Chinese, Arabic, ...).
  - PageRange: "1-3,5" style, 1-based.
  - MaxDegreeOfParallelism: parallel OCR across pages.
  - EnableOrientationDetection: auto-rotate sideways pages.
  - Creator / Producer: written to PDF Info dictionary and PDF/A XMP metadata.
- Progress<OcrProgressEventArgs>: per-page progress callback safe for concurrent scenarios.
- CSV manifest in folder mode: portable audit trail (bytes, page count, elapsed ms, status per source).
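To make the PageRange syntax concrete, here is a small illustrative parser for "1-3,5" style, 1-based ranges. It is a sketch only: the library parses this option internally, and `PageRangeSketch` is a hypothetical helper. Whether the real option deduplicates or sorts pages, or how it reports malformed input, is an assumption here.

```csharp
using System;
using System.Collections.Generic;

public static class PageRangeSketch
{
    // Expands a spec such as "1-3,5" into sorted, distinct 1-based page numbers.
    // Throws when a segment is malformed or falls outside 1..pageCount.
    public static List<int> Parse(string spec, int pageCount)
    {
        var pages = new SortedSet<int>();
        var segments = spec.Split(
            ',', StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);

        foreach (var segment in segments)
        {
            // "5" yields one bound, "1-3" yields two; anything else is malformed.
            var bounds = segment.Split('-', StringSplitOptions.TrimEntries);
            if (bounds.Length > 2
                || !int.TryParse(bounds[0], out int first)
                || !int.TryParse(bounds[^1], out int last))
            {
                throw new FormatException($"Invalid page range segment: '{segment}'");
            }

            if (first < 1 || last < first || last > pageCount)
            {
                throw new ArgumentOutOfRangeException(
                    nameof(spec), $"Segment '{segment}' is outside 1..{pageCount}");
            }

            for (int page = first; page <= last; page++)
            {
                pages.Add(page);
            }
        }

        return new List<int>(pages);
    }
}
```

For a 10-page TIFF, `PageRangeSketch.Parse("1-3,5", 10)` yields pages 1, 2, 3, and 5. Validating the range before starting OCR lets an interactive prompt reject bad input early instead of failing partway through a long conversion.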
🧠 Model
LMKitOcr (built-in, Tesseract-based). No LLM download required for the default English path. Optionally, set LMKitOcr.VisionModel to enable auto language detection on a vision-capable LM.
🛠️ Getting Started
📋 Prerequisites
- .NET 8.0 or later
▶️ Running the Application
git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/document-intelligence/pdf-toolkit/tiff_to_pdfa_archive
dotnet run
Pick a mode from the menu and follow the prompts.
📚 Related Content
- Searchable PDF from Scans Demo: same idea, but starting from a PDF instead of a TIFF.
- PDF to Multi-page TIFF Archive Demo: the inverse operation (PDF → multipage TIFF).
- OCR Demo: VLM-OCR for layout-rich pages.
- PDF Text Search with Highlights Demo: full-text search across the produced PDF/A files.