👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/document-intelligence/pdf-toolkit/tiff_to_pdfa_archive

Multipage TIFF to PDF/A-1B Archive for C# .NET Applications

🎯 Purpose of the Demo

An interactive console app that converts scanned multipage TIFFs into searchable PDF/A archive files. The original raster is preserved as the visible page; an invisible OCR text layer is overlaid so the output is selectable, copyable, and full-text searchable while remaining a faithful visual archive that conforms to ISO 19005-1 (PDF/A-1B) by default.

All OCR runs on-device.

👥 Industry Target Audience

Records management: long-term archival of scanned correspondence, contracts, and case files.
Government / public sector: FOIA, tax records, legal filings that mandate PDF/A.
Legal / eDiscovery: archive + searchability without re-OCRing later.
Banking / finance: KYC, signed-disclosure storage with reproducible visual rendering.
Healthcare: archive paper-era patient records as searchable, audit-ready PDFs.
Fax-inbound pipelines: TIFF is still the dominant fax format; PDF/A is the canonical archive format.

🚀 Problem Solved

Multipage TIFF is the universal raster format coming out of scanners and fax servers, but it has two problems: it is not searchable, and it is not a recognized long-term archival format. The ImageToSearchablePdf API solves both in one pass: it OCRs every page on-device, overlays an invisible text layer, and writes a PDF/A-1B (or 2B / 3B) compliant file that meets the visual-fidelity requirements of ISO 19005.

💻 Application Overview

Interactive menu (no command-line arguments) with two conversion modes and a Quit option:

| Mode | What it does |
| --- | --- |
| File | Convert one multipage TIFF. Prompts for archive level (1B / 2B / 3B), OCR languages, page range, and parallelism. |
| Folder | Convert every `.tif` / `.tiff` in a folder (optionally recursing into subfolders). Writes an `archive_manifest.csv` with `source, output, pages, bytes, elapsed_ms, status` per file. |
| Quit | Exit. |
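
The folder-mode manifest described above can be sketched with nothing but the .NET base library. This is a minimal illustration, not the demo's actual code: only the column names and the `.tif` / `.tiff` filter come from the description above, while `ManifestRow`, `ArchiveManifest`, and the `"ok"` status value are hypothetical names chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical row type; only the column names come from the demo description.
public sealed record ManifestRow(
    string Source, string Output, int Pages, long Bytes, long ElapsedMs, string Status);

public static class ArchiveManifest
{
    public const string Header = "source,output,pages,bytes,elapsed_ms,status";

    // Quote a field only when it contains a comma, a quote, or a line break.
    public static string Escape(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Numbers use the invariant culture so the file reads the same on any machine.
    public static string FormatRow(ManifestRow r) => string.Join(",",
        Escape(r.Source),
        Escape(r.Output),
        r.Pages.ToString(CultureInfo.InvariantCulture),
        r.Bytes.ToString(CultureInfo.InvariantCulture),
        r.ElapsedMs.ToString(CultureInfo.InvariantCulture),
        Escape(r.Status));

    // Writes the header followed by one line per converted file (UTF-8, no BOM).
    public static void Write(string path, IEnumerable<ManifestRow> rows)
    {
        var lines = new[] { Header }.Concat(rows.Select(FormatRow));
        File.WriteAllLines(path, lines, new UTF8Encoding(false));
    }

    // Enumerates .tif / .tiff files, optionally descending into subfolders.
    public static IEnumerable<string> FindTiffs(string folder, bool recurse) =>
        Directory.EnumerateFiles(folder, "*",
                recurse ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly)
            .Where(f =>
            {
                var ext = Path.GetExtension(f);
                return ext.Equals(".tif", StringComparison.OrdinalIgnoreCase)
                    || ext.Equals(".tiff", StringComparison.OrdinalIgnoreCase);
            });
}
```

Quoting fields matters here because scanned-file paths often contain commas; a row for `in/a,b.tif` keeps its six columns only if the source path is quoted.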

The OCR engine loads once at startup. Per-page progress is reported via IProgress<OcrProgressEventArgs>. Ctrl-C cancels cleanly.
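
The progress and cancellation pattern mentioned above follows standard .NET idioms, which can be sketched without the library itself. In this illustration, `PageProgress` is a hypothetical stand-in for `OcrProgressEventArgs` (whose members are not listed in this README), and `ProcessPagesAsync` is a placeholder for the OCR pipeline that simulates per-page work with a short delay.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the library's progress payload.
public sealed record PageProgress(int PageIndex, int PageCount);

public static class CancellableRun
{
    // Placeholder for the OCR pipeline: reports after each page and honors cancellation.
    public static async Task<int> ProcessPagesAsync(
        int pageCount, IProgress<PageProgress> progress, CancellationToken ct)
    {
        int done = 0;
        for (int i = 0; i < pageCount; i++)
        {
            ct.ThrowIfCancellationRequested();
            await Task.Delay(10, ct);   // simulated per-page work
            done++;
            progress?.Report(new PageProgress(i + 1, pageCount));
        }
        return done;
    }

    // Returns the number of pages processed, or -1 if the run was cancelled.
    public static async Task<int> RunAsync(int pageCount)
    {
        using var cts = new CancellationTokenSource();
        ConsoleCancelEventHandler onCancel = (_, e) =>
        {
            e.Cancel = true;   // keep the process alive so cleanup can run
            cts.Cancel();      // ask the pipeline to stop at the next checkpoint
        };
        Console.CancelKeyPress += onCancel;
        try
        {
            var progress = new Progress<PageProgress>(p =>
                Console.WriteLine($"page {p.PageIndex}/{p.PageCount}"));
            return await ProcessPagesAsync(pageCount, progress, cts.Token);
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("Cancelled.");
            return -1;
        }
        finally
        {
            Console.CancelKeyPress -= onCancel;
        }
    }
}
```

Setting `e.Cancel = true` is what makes Ctrl-C a clean cancellation rather than an abrupt process exit. Note that in a console app `Progress<T>` dispatches callbacks to the thread pool, so progress lines may print slightly out of order.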

✨ Key Features

ImageToSearchablePdf.ConvertAsync(tiffPath, ocrEngine, outputPdf, options, ct): one call, end-to-end pipeline.
LMKitOcr: on-device, Tesseract-based, no LLM required for default English.
PdfGenerationOptions:
- Version: PdfA1b (default), PdfA2b, PdfA3b.
- Languages: pick from the Language enum (English, French, German, Japanese, Chinese, Arabic, ...).
- PageRange: "1-3,5" style, 1-based.
- MaxDegreeOfParallelism: parallel OCR across pages.
- EnableOrientationDetection: auto-rotate sideways pages.
- Creator / Producer: written to the PDF Info dictionary and the PDF/A XMP metadata.
Progress<OcrProgressEventArgs>: per-page progress callback that is safe to use when pages are OCRed in parallel.
CSV manifest in folder mode: portable audit trail (bytes, page count, elapsed ms, status per source).
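
To make the `"1-3,5"` page-range notation concrete, here is a small standalone parser. It is only a sketch of how such a 1-based specification can be expanded; `PageRangeParser` is a hypothetical helper, and how the library itself validates or interprets `PageRange` (for example, out-of-range pages) is an assumption here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class PageRangeParser
{
    // Expands a 1-based spec such as "1-3,5" into sorted, distinct page numbers.
    public static IReadOnlyList<int> Parse(string spec, int pageCount)
    {
        var pages = new SortedSet<int>();
        var parts = spec.Split(',',
            StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries);

        foreach (var part in parts)
        {
            var bounds = part.Split('-', StringSplitOptions.TrimEntries);
            if (bounds.Length > 2)
                throw new FormatException($"Invalid page range segment '{part}'.");

            int first = ParsePage(bounds[0], part);
            int last = bounds.Length == 2 ? ParsePage(bounds[1], part) : first;

            // Assumption: pages outside 1..pageCount and reversed ranges are rejected.
            if (first < 1 || last < first || last > pageCount)
                throw new ArgumentOutOfRangeException(
                    nameof(spec), $"Segment '{part}' is outside 1..{pageCount}.");

            for (int p = first; p <= last; p++) pages.Add(p);
        }
        return new List<int>(pages);
    }

    // Accepts plain digits only, so signs and whitespace inside a number are rejected.
    private static int ParsePage(string text, string part) =>
        int.TryParse(text, NumberStyles.None, CultureInfo.InvariantCulture, out var n)
            ? n
            : throw new FormatException($"Invalid page range segment '{part}'.");
}
```

With a 10-page document, `PageRangeParser.Parse("1-3,5", 10)` yields pages 1, 2, 3, and 5; overlapping segments such as `"1-3,2-4"` are merged rather than duplicated.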

🧠 Model

LMKitOcr (built-in, Tesseract-based). No LLM download is required for the default English path. Optionally, set LMKitOcr.VisionModel to a vision-capable language model to enable automatic language detection.

🛠️ Getting Started

📋 Prerequisites

.NET 8.0 or later

▶️ Running the Application

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/document-intelligence/pdf-toolkit/tiff_to_pdfa_archive
dotnet run

Pick a mode from the menu and follow the prompts.

🔗 Related Demos

Searchable PDF from Scans Demo: same idea, but starting from a PDF instead of a TIFF.
PDF to Multi-page TIFF Archive Demo: the inverse operation (PDF → multipage TIFF).
OCR Demo: VLM-OCR for layout-rich pages.
PDF Text Search with Highlights Demo: full-text search across the produced PDF/A files.
