👉 Try the demo:
https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/batch_pii_extraction

Batch PII Extraction with AI in .NET Applications


🎯 Purpose of the Sample

Batch PII Extraction demonstrates how to detect and extract Personally Identifiable Information (PII) from large volumes of documents using LM-Kit.NET. The sample processes files in parallel, applies OCR when needed, and outputs structured JSON results for each document.

It is designed for high-throughput compliance, privacy, and data governance workflows.


👥 Target Audience

  • Compliance & Legal – GDPR, HIPAA, and privacy audits
  • Enterprise & B2B Apps – PII detection pipelines
  • Security & Risk – sensitive data discovery
  • Back-office & Ops – bulk document processing
  • Benchmarking & Demos – throughput and scalability evaluation

🚀 Problem Solved

  • Replaces manual PII review across thousands of documents
  • Handles mixed input formats, including scanned documents
  • Scales extraction with controlled parallelism
  • Produces structured outputs ready for redaction or storage

💻 Sample Application Description

Console application that:

  • Loads a predefined LM-Kit PII extraction model.
  • Recursively scans an input directory.
  • Runs parallel PII extraction with adaptive thread count.
  • Applies OCR automatically for non-text documents.
  • Extracts entities such as names, addresses, IDs, and more.
  • Writes one JSON output file per input document.
  • Displays live progress, throughput, and performance metrics.
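
The flow described above (recursive scan, bounded parallelism, one engine per worker) can be sketched with the standard Task Parallel Library. This is a minimal illustration, not the sample's actual code: `BatchPipeline`, `TEngine`, `engineFactory`, `extract`, and `onResult` are hypothetical names, and the real LM-Kit.NET extraction call would go wherever `extract` is supplied.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class BatchPipeline
{
    // Scans inputDir recursively and processes every file in parallel.
    // TEngine is a hypothetical per-worker extraction engine type.
    // engineFactory runs once per worker task, so no engine instance is
    // shared between threads. onResult receives (input path, JSON) and
    // may be called concurrently, so it must be thread-safe.
    public static int Run<TEngine>(
        string inputDir,
        Func<TEngine> engineFactory,
        Func<TEngine, string, string> extract,
        Action<string, string> onResult,
        int maxWorkers)
    {
        IEnumerable<string> files =
            Directory.EnumerateFiles(inputDir, "*", SearchOption.AllDirectories);
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxWorkers };
        int processed = 0;

        Parallel.ForEach(
            files,
            options,
            engineFactory,                        // local init: one engine per worker task
            (file, loopState, engine) =>
            {
                onResult(file, extract(engine, file));
                Interlocked.Increment(ref processed);
                return engine;                    // hand the same engine to the next item
            },
            engine => (engine as IDisposable)?.Dispose()); // clean up when the worker ends

        return processed;
    }
}
```

The thread-local overload of `Parallel.ForEach` is used here so that engine construction cost is paid once per worker task rather than once per file. The runtime may start more worker tasks over the life of the loop than `maxWorkers`, but never runs more than `maxWorkers` at the same time.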

📂 Supported Inputs

  • PDFs and scanned documents
  • Images via OCR (Tesseract by default)
  • Text-based files supported by LM-Kit attachments

🔐 Extracted Information

Depending on the model and configuration, extracted entities may include:

  • Person names
  • Addresses
  • Phone numbers
  • Email addresses
  • National identifiers
  • Payment and banking information
  • Other sensitive data types (optional)

⚙️ Key Features

  • ⚡ High-Throughput Batch Processing – multi-threaded execution
  • 🧠 OCR-Aware Extraction – automatic OCR integration
  • 📄 Page-Level Metrics – documents and pages per second
  • 📊 Live Console Dashboard – progress table with throughput
  • 🧩 Structured JSON Output – ready for redaction or analysis
  • 🔁 Thread-Safe Statistics – real-time performance snapshots

🛠️ Getting Started

📋 Prerequisites

  • .NET Framework 4.6.2 or .NET 8.0+
  • LM-Kit.NET license key
  • Tesseract OCR dependencies (default OCR engine)

📥 Download

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/batch_pii_extraction

▶️ Run

dotnet build
dotnet run

The sample uses configurable input and output directories and automatically determines the optimal thread count based on hardware and model size.
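
The sample's exact thread-count logic is not described here. As a purely hypothetical illustration of how hardware and model size can bound parallelism, the sketch below caps the worker count by both the core count and the memory left after the shared model is loaded. `WorkerCount.Estimate` and all of its parameters are assumptions made for this example.

```csharp
using System;

public static class WorkerCount
{
    // Hypothetical heuristic, not the sample's actual logic:
    // workers = min(logical cores, memory left after the model / per-worker footprint),
    // with a floor of one worker.
    public static int Estimate(
        int logicalCores, long memoryBudgetBytes, long modelBytes, long perWorkerBytes)
    {
        if (perWorkerBytes <= 0)
            throw new ArgumentOutOfRangeException(nameof(perWorkerBytes));

        long afterModel = memoryBudgetBytes - modelBytes;     // the model is loaded once
        long byMemory = afterModel <= 0 ? 1 : afterModel / perWorkerBytes;
        long workers = Math.Min(Math.Max(logicalCores, 1), Math.Max(byMemory, 1));
        return (int)workers;
    }
}
```

For example, with 8 logical cores, a 16 GiB budget, a 4 GiB model, and an assumed 2 GiB per worker, `Estimate` returns 6: memory, not the CPU, is the limit. In practice `logicalCores` would typically come from `Environment.ProcessorCount`.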


📁 Output Structure

  • One .json file per input document
  • Preserves the input folder hierarchy
  • JSON includes detected entities and metadata
  • Null values omitted for clean output
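
One way to get this output layout is to mirror each input file's relative path under the output root and serialize with null properties ignored. The sketch below assumes .NET 8 and `System.Text.Json`. The `PiiEntity` and `DocumentResult` records and the `.json` suffix convention are hypothetical, and the sample's real schema and file naming may differ.

```csharp
using System.IO;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical result shape for illustration only.
public record PiiEntity(string Type, string Value, double? Confidence);
public record DocumentResult(string Source, int? PageCount, PiiEntity[] Entities);

public static class OutputWriter
{
    private static readonly JsonSerializerOptions JsonOptions = new()
    {
        WriteIndented = true,
        // Properties whose value is null are left out of the JSON.
        DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull
    };

    // Maps <inputRoot>/sub/a.pdf to <outputRoot>/sub/a.pdf.json.
    public static string MapOutputPath(string inputRoot, string outputRoot, string inputFile)
    {
        string relative = Path.GetRelativePath(inputRoot, inputFile);
        return Path.Combine(outputRoot, relative + ".json");
    }

    public static string Serialize(DocumentResult result) =>
        JsonSerializer.Serialize(result, JsonOptions);

    // Creates any missing output subfolders, then writes the JSON file.
    public static void Write(
        string inputRoot, string outputRoot, string inputFile, DocumentResult result)
    {
        string target = MapOutputPath(inputRoot, outputRoot, inputFile);
        Directory.CreateDirectory(Path.GetDirectoryName(target) ?? outputRoot);
        File.WriteAllText(target, Serialize(result));
    }
}
```

Appending `.json` to the full file name, rather than replacing the extension, keeps `report.pdf` and `report.png` from colliding on the same output file. `Path.GetRelativePath` is not available on .NET Framework 4.6.2, so that target would need a manual relative-path computation.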

📈 Runtime Metrics

Displayed during execution:

  • Documents processed
  • Pages processed
  • Per-document processing time
  • Documents per second
  • Pages per second

Final summary includes total runtime and averages.
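
Counters like these have to be updated from several workers at once. A minimal sketch of thread-safe statistics, assuming .NET 8, uses `Interlocked` for the counters and the static `Stopwatch.GetTimestamp()` for elapsed time. `BatchStats` and `StatsSnapshot` are hypothetical names, not types from the sample.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public readonly record struct StatsSnapshot(
    long Documents, long Pages, double DocsPerSecond, double PagesPerSecond);

public sealed class BatchStats
{
    private long _documents;
    private long _pages;
    private readonly long _start = Stopwatch.GetTimestamp();

    // Safe to call from any worker thread.
    public void RecordDocument(int pageCount)
    {
        Interlocked.Increment(ref _documents);
        Interlocked.Add(ref _pages, pageCount);
    }

    // Returns the current totals and average throughput since construction.
    public StatsSnapshot Snapshot()
    {
        long docs = Interlocked.Read(ref _documents);
        long pages = Interlocked.Read(ref _pages);
        double seconds = (Stopwatch.GetTimestamp() - _start) / (double)Stopwatch.Frequency;
        seconds = Math.Max(seconds, 1e-9);   // avoid division by zero right after start
        return new StatsSnapshot(docs, pages, docs / seconds, pages / seconds);
    }
}
```

The two counters are read separately, so a snapshot taken mid-run can be off by the document currently being recorded. That is usually acceptable for a live dashboard; a lock around both updates would give strictly consistent pairs at the cost of some contention.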


🔍 Notes

  • The model is loaded once and shared across threads.
  • Each worker initializes its own extraction engine.
  • OCR can be replaced with a custom implementation.
  • Additional PII entity definitions can be added programmatically.
  • Preferred inference modality can be adjusted for redaction scenarios.

🔧 Extend the Demo

  • Add confidence thresholds for filtering results.
  • Integrate automatic redaction pipelines.
  • Store outputs in databases or object storage.
  • Replace OCR with cloud-based engines.
  • Combine with document classification or routing workflows.
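
As an example of the first extension, confidence filtering can be a small post-processing step over the extracted entities. The sketch below is hypothetical: `DetectedEntity` and `ConfidenceFilter` are illustrative names, and it assumes each entity carries a confidence score between 0 and 1.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity shape; the real extraction result type may differ.
public record DetectedEntity(string Type, string Value, double Confidence);

public static class ConfidenceFilter
{
    // Keeps entities at or above their threshold. perTypeThresholds overrides
    // defaultThreshold for specific entity types and may be empty.
    public static List<DetectedEntity> Apply(
        IEnumerable<DetectedEntity> entities,
        double defaultThreshold,
        IReadOnlyDictionary<string, double> perTypeThresholds)
    {
        return entities
            .Where(e =>
            {
                double threshold = perTypeThresholds.TryGetValue(e.Type, out double t)
                    ? t
                    : defaultThreshold;
                return e.Confidence >= threshold;
            })
            .ToList();
    }
}
```

Per-type thresholds allow a stricter cutoff for noisy categories such as names while keeping high-risk types like national identifiers at a lower cutoff, so that borderline matches are still surfaced for review.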