👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/document-intelligence/pdf-toolkit/tiff_to_pdfa_archive

Multipage TIFF to PDF/A-1B Archive for C# .NET Applications


🎯 Purpose of the Demo

An interactive console app that converts scanned multipage TIFFs into searchable PDF/A archive files. The original raster is preserved as the visible page; an invisible OCR text layer is overlaid so the output is selectable, copyable, and full-text searchable while remaining a faithful visual archive that conforms to ISO 19005-1 (PDF/A-1B) by default.

All OCR runs on-device.


👥 Industry Target Audience

  • Records management: long-term archival of scanned correspondence, contracts, and case files.
  • Government / public sector: FOIA, tax records, legal filings that mandate PDF/A.
  • Legal / eDiscovery: archive + searchability without re-OCRing later.
  • Banking / finance: KYC, signed-disclosure storage with reproducible visual rendering.
  • Healthcare: archive paper-era patient records as searchable, audit-ready PDFs.
  • Fax-inbound pipelines: TIFF is still the dominant fax format; PDF/A is the canonical archive format.

🚀 Problem Solved

Multipage TIFFs are the de facto raster output of scanners and fax servers, but they have two problems: they are not searchable, and they are not a recognized long-term archival format. The ImageToSearchablePdf API solves both in one pass: it OCRs every page on-device, overlays an invisible text layer, and writes a PDF/A-1B (or 2B / 3B) compliant file that meets the visual-fidelity requirements of ISO 19005.


💻 Application Overview

Interactive menu (no command-line arguments) with two conversion modes and a Quit option:

| Mode | What it does |
|---|---|
| File | Convert one multipage TIFF. Prompts for archive level (1B / 2B / 3B), OCR languages, page range, and parallelism. |
| Folder | Convert every .tif / .tiff in a folder (optional recurse). Writes an archive_manifest.csv with source, output, pages, bytes, elapsed_ms, status per file. |
| Quit | Exit. |
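
To make the Folder mode description concrete, here is a minimal sketch of how TIFF discovery and the manifest rows could be implemented with the .NET base class library alone. It is not taken from the demo source: `ManifestEntry` and `ArchiveManifest` are hypothetical names, and the header line simply assumes the column order listed above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

// Hypothetical helper types for illustration; not part of LM-Kit or the demo source.
public sealed record ManifestEntry(
    string Source, string Output, int Pages, long Bytes, long ElapsedMs, string Status);

public static class ArchiveManifest
{
    // Assumed header, matching the column order described for archive_manifest.csv.
    public const string Header = "source,output,pages,bytes,elapsed_ms,status";

    // Returns every .tif / .tiff file under root, optionally descending into subfolders.
    public static IEnumerable<string> FindTiffs(string root, bool recurse)
    {
        var option = recurse ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly;
        return Directory.EnumerateFiles(root, "*", option)
            .Where(p =>
            {
                string ext = Path.GetExtension(p);
                return ext.Equals(".tif", StringComparison.OrdinalIgnoreCase)
                    || ext.Equals(".tiff", StringComparison.OrdinalIgnoreCase);
            })
            .OrderBy(p => p, StringComparer.OrdinalIgnoreCase);
    }

    // Quotes a field when it contains a comma, quote, or line break (RFC 4180 style).
    public static string Escape(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Formats one manifest row with culture-invariant numbers.
    public static string ToCsvLine(ManifestEntry e) =>
        string.Join(",",
            Escape(e.Source),
            Escape(e.Output),
            e.Pages.ToString(CultureInfo.InvariantCulture),
            e.Bytes.ToString(CultureInfo.InvariantCulture),
            e.ElapsedMs.ToString(CultureInfo.InvariantCulture),
            Escape(e.Status));

    // Writes the header followed by one line per converted file.
    public static void Write(string path, IEnumerable<ManifestEntry> entries)
    {
        File.WriteAllLines(path, new[] { Header }.Concat(entries.Select(ToCsvLine)));
    }
}
```

Quoting fields matters here because scanned-file paths often contain commas, and invariant-culture formatting keeps the numeric columns parseable on any machine.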

The OCR engine loads once at startup. Per-page progress is reported via IProgress<OcrProgressEventArgs>. Ctrl-C cancels cleanly.
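
The progress and cancellation behavior described above follows a common .NET pattern, sketched below under stated assumptions. `PageProgress` is a hypothetical stand-in for the library's `OcrProgressEventArgs` (whose members are not shown in this README), and `ProcessPagesAsync` only simulates per-page work; it is not the demo's conversion code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for OcrProgressEventArgs; the real type may expose different members.
public sealed record PageProgress(int PageNumber, int PageCount);

public static class ProgressDemo
{
    // Simulates per-page work; a real run would perform OCR where the delay is.
    public static async Task<int> ProcessPagesAsync(
        int pageCount, IProgress<PageProgress> progress, CancellationToken ct)
    {
        int done = 0;
        for (int page = 1; page <= pageCount; page++)
        {
            ct.ThrowIfCancellationRequested();   // stop between pages when cancelled
            await Task.Delay(1, ct);             // placeholder for the OCR work
            done++;
            progress.Report(new PageProgress(page, pageCount));
        }
        return done;
    }

    // Wires Ctrl-C to a CancellationTokenSource so the work can unwind instead of being killed.
    public static CancellationTokenSource HookCtrlC()
    {
        var cts = new CancellationTokenSource();
        Console.CancelKeyPress += (_, e) =>
        {
            e.Cancel = true;   // keep the process alive and let the token stop the work
            cts.Cancel();
        };
        return cts;
    }

    public static async Task RunAsync()
    {
        using var cts = HookCtrlC();
        var progress = new Progress<PageProgress>(
            p => Console.WriteLine($"Page {p.PageNumber}/{p.PageCount}"));
        try
        {
            int pages = await ProcessPagesAsync(5, progress, cts.Token);
            Console.WriteLine($"Done: {pages} pages.");
        }
        catch (OperationCanceledException)
        {
            Console.WriteLine("Cancelled.");
        }
    }
}
```

Setting `e.Cancel = true` in the handler is what turns Ctrl-C into a cooperative cancellation request rather than an immediate process exit, which gives the caller a chance to clean up partial output.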

✨ Key Features

  • ImageToSearchablePdf.ConvertAsync(tiffPath, ocrEngine, outputPdf, options, ct): one call, end-to-end pipeline.
  • LMKitOcr: on-device, Tesseract-based, no LLM required for default English.
  • PdfGenerationOptions:
    • Version: PdfA1b (default), PdfA2b, PdfA3b.
    • Languages: pick from the Language enum (English, French, German, Japanese, Chinese, Arabic, ...).
    • PageRange: "1-3,5" style, 1-based.
    • MaxDegreeOfParallelism: parallel OCR across pages.
    • EnableOrientationDetection: auto-rotate sideways pages.
    • Creator / Producer: written to PDF Info dictionary and PDF/A XMP metadata.
  • Progress<OcrProgressEventArgs>: per-page progress callback that is safe to use when pages are OCRed in parallel.
  • CSV manifest in folder mode: portable audit trail (bytes, page count, elapsed ms, status per source).
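
As an illustration of the "1-3,5" style accepted by PageRange, the sketch below expands such a specification into 1-based page numbers. It is an assumption about how a range of this shape can be interpreted, not the library's own parser: `PageRangeParser` is a hypothetical name, and rejecting out-of-bounds ranges (rather than clamping them) is a design choice made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper for illustration; LM-Kit parses PageRange internally.
public static class PageRangeParser
{
    // Expands a 1-based spec such as "1-3,5" into sorted, distinct page numbers.
    public static IReadOnlyList<int> Parse(string spec, int pageCount)
    {
        var pages = new SortedSet<int>();
        var options = StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries;
        foreach (string part in spec.Split(',', options))
        {
            string[] bounds = part.Split('-', StringSplitOptions.TrimEntries);
            if (bounds.Length > 2)
                throw new FormatException($"Invalid range '{part}'.");

            int first = ParsePage(bounds[0], part);
            int last = bounds.Length == 2 ? ParsePage(bounds[1], part) : first;

            // Reject ranges that are reversed or fall outside the document.
            if (first < 1 || last < first || last > pageCount)
                throw new ArgumentOutOfRangeException(
                    nameof(spec), $"Range '{part}' is outside 1..{pageCount}.");

            for (int p = first; p <= last; p++) pages.Add(p);
        }
        return new List<int>(pages);
    }

    // Accepts digits only, so signs and stray characters are treated as format errors.
    private static int ParsePage(string text, string part) =>
        int.TryParse(text, NumberStyles.None, CultureInfo.InvariantCulture, out int value)
            ? value
            : throw new FormatException($"Invalid page number in '{part}'.");
}
```

For a 10-page TIFF, `PageRangeParser.Parse("1-3,5", 10)` yields pages 1, 2, 3, and 5; overlapping entries such as "1-3,2-4" are merged because the pages are collected in a set.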

🧠 Model

  • LMKitOcr (built-in, Tesseract-based). No LLM download is required for the default English path. Optionally, set LMKitOcr.VisionModel to a vision-capable language model to enable automatic language detection.

🛠️ Getting Started

📋 Prerequisites

  • .NET 8.0 or later

▶️ Running the Application

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/document-intelligence/pdf-toolkit/tiff_to_pdfa_archive
dotnet run

Pick a mode from the menu and follow the prompts.
