👉 Try the demo:
https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/batch_pii_extraction

Batch PII Extraction with AI in .NET Applications

🎯 Purpose of the Demo

Batch PII Extraction demonstrates how to detect and extract Personally Identifiable Information (PII) from large volumes of documents using LM-Kit.NET. The sample processes files in parallel, applies OCR when needed, and outputs structured JSON results for each document.

It is designed for high-throughput compliance, privacy, and data governance workflows.

👥 Target Audience

Compliance & Legal – GDPR, HIPAA, and privacy audits
Enterprise & B2B Apps – PII detection pipelines
Security & Risk – sensitive data discovery
Back-office & Ops – bulk document processing
Benchmarking & Demos – throughput and scalability evaluation

🚀 Problem Solved

Replaces manual PII review across thousands of documents
Handles mixed input formats, including scanned documents
Scales extraction with controlled parallelism
Produces structured outputs ready for redaction or storage

💻 Sample Application Description

Console application that:

Loads a predefined LM-Kit PII extraction model.
Recursively scans an input directory.
Runs parallel PII extraction with adaptive thread count.
Applies OCR automatically for non-text documents.
Extracts entities such as names, addresses, IDs, and more.
Writes one JSON output file per input document.
Displays live progress, throughput, and performance metrics.

📂 Supported Inputs

PDFs and scanned documents
Images via OCR (Tesseract by default)
Text-based files supported by LM-Kit attachments

🔐 Extracted Information

Depending on the model and configuration, extracted entities may include:

Person names
Addresses
Phone numbers
Email addresses
National identifiers
Payment and banking information
Other sensitive data types (optional)

⚙️ Key Features

⚡ High-Throughput Batch Processing – multi-threaded execution
🧠 OCR-Aware Extraction – automatic OCR integration
📄 Page-Level Metrics – documents and pages per second
📊 Live Console Dashboard – progress table with throughput
🧩 Structured JSON Output – ready for redaction or analysis
🔁 Thread-Safe Statistics – real-time performance snapshots

🛠️ Getting Started

📋 Prerequisites

.NET 8.0 or later
LM-Kit.NET license key
Tesseract OCR dependencies (default OCR engine)

📥 Download

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/batch_pii_extraction

▶️ Run

dotnet build
dotnet run

The sample uses configurable input and output directories and automatically determines the optimal thread count based on hardware and model size.
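The recursive scan and bounded parallelism described above can be sketched with standard .NET APIs only. This is a minimal illustration, not the sample's actual code: `BatchRunner`, `ChooseWorkerCount`, and the `processFile` callback are hypothetical names, and the "half the logical cores" rule is an assumed placeholder for whatever heuristic the sample really applies based on hardware and model size.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class BatchRunner
{
    // Hypothetical heuristic (an assumption, not the sample's rule):
    // use half the logical cores, but never fewer than one worker.
    public static int ChooseWorkerCount(int logicalCores) =>
        Math.Max(1, logicalCores / 2);

    // Recursively enumerates every file under inputDir and runs processFile
    // with at most `workers` concurrent calls. Returns the number of files
    // for which processFile completed without throwing.
    public static int Run(string inputDir, int workers, Action<string> processFile)
    {
        var files = Directory.EnumerateFiles(inputDir, "*", SearchOption.AllDirectories);
        var options = new ParallelOptions { MaxDegreeOfParallelism = workers };
        int succeeded = 0;

        Parallel.ForEach(files, options, file =>
        {
            try
            {
                processFile(file);
                Interlocked.Increment(ref succeeded);
            }
            catch (Exception ex)
            {
                // One bad document should not abort the whole batch.
                Console.Error.WriteLine($"Failed: {file}: {ex.Message}");
            }
        });

        return succeeded;
    }
}
```

A call such as `BatchRunner.Run("input", BatchRunner.ChooseWorkerCount(Environment.ProcessorCount), path => Console.WriteLine(path))` would visit every file below `input`; in a real pipeline the callback would invoke the extraction engine instead of printing the path.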

📁 Output Structure

One .json file per input document
Preserves the input folder hierarchy
JSON includes detected entities and metadata
Null values omitted for clean output
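One way to obtain this layout with `System.Text.Json` is sketched below. The `OutputWriter` helper and the `PiiEntity` / `DocumentResult` records are hypothetical shapes introduced for illustration; the sample's real JSON schema and file-naming rule may differ. Replacing the input extension with `.json` is also an assumption.

```csharp
using System.IO;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical result shapes, for illustration only.
public sealed record PiiEntity(string Type, string Value, double? Confidence);
public sealed record DocumentResult(string Source, int PageCount, PiiEntity[] Entities, string? Error);

public static class OutputWriter
{
    private static readonly JsonSerializerOptions Options = new JsonSerializerOptions
    {
        WriteIndented = true,
        // Skip properties whose value is null so the output stays compact.
        DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull
    };

    // Maps <inputRoot>/sub/file.pdf to <outputRoot>/sub/file.json,
    // mirroring the input folder hierarchy under the output root.
    public static string MapOutputPath(string inputRoot, string outputRoot, string inputFile)
    {
        string relative = Path.GetRelativePath(inputRoot, inputFile);
        return Path.Combine(outputRoot, Path.ChangeExtension(relative, ".json"));
    }

    public static string Serialize<T>(T result) => JsonSerializer.Serialize(result, Options);

    // Creates any missing output folders, then writes the JSON file.
    public static void Write<T>(string outputPath, T result)
    {
        var dir = Path.GetDirectoryName(outputPath);
        if (!string.IsNullOrEmpty(dir)) Directory.CreateDirectory(dir);
        File.WriteAllText(outputPath, Serialize(result));
    }
}
```

Note that with this naming rule two inputs differing only by extension (for example `scan.pdf` and `scan.png`) would map to the same output file; appending `.json` to the full file name instead would avoid that collision.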

📈 Runtime Metrics

Displayed during execution:

Documents processed
Pages processed
Per-document processing time
Documents per second
Pages per second

Final summary includes total runtime and averages.
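Counters like these have to be updated from several worker threads at once. The following is a minimal sketch of a thread-safe statistics collector built on `Interlocked` and the static `Stopwatch` helpers (available in .NET 7 and later); `BatchStats` and `Snapshot` are hypothetical names, not types from the sample.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public sealed class BatchStats
{
    // Static Stopwatch helpers are safe to call from any thread.
    private readonly long _start = Stopwatch.GetTimestamp();
    private long _documents;
    private long _pages;

    // Called by a worker after it finishes one document.
    public void RecordDocument(int pageCount)
    {
        Interlocked.Increment(ref _documents);
        Interlocked.Add(ref _pages, pageCount);
    }

    // Returns the totals and average throughput since construction.
    // The two counters are read one after the other, so a snapshot taken
    // mid-run may include a document whose pages are not yet counted.
    public Snapshot TakeSnapshot()
    {
        long docs = Interlocked.Read(ref _documents);
        long pages = Interlocked.Read(ref _pages);
        double seconds = Math.Max(Stopwatch.GetElapsedTime(_start).TotalSeconds, 1e-9);
        return new Snapshot(docs, pages, docs / seconds, pages / seconds);
    }

    public readonly record struct Snapshot(
        long Documents, long Pages, double DocsPerSecond, double PagesPerSecond);
}
```

A dashboard loop could call `TakeSnapshot()` on a timer while workers call `RecordDocument(pageCount)`; no lock is needed because each counter update is a single atomic operation.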

🔍 Notes

The model is loaded once and shared across threads.
Each worker initializes its own extraction engine.
OCR can be replaced with a custom implementation.
Additional PII entity definitions can be added programmatically.
Preferred inference modality can be adjusted for redaction scenarios.
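The "one shared model, one engine per worker" arrangement from the first two notes maps naturally onto the thread-local overload of `Parallel.ForEach`. The sketch below uses generic placeholders (`TModel`, `TEngine`, `createEngine`, `processFile`) rather than LM-Kit types, since the actual class names are not given here; treat all of them as assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class PerWorkerEngines
{
    // sharedModel is loaded once by the caller and only read by workers.
    // createEngine builds a private engine on top of it for each worker task.
    public static void Run<TModel, TEngine>(
        IEnumerable<string> files,
        TModel sharedModel,
        Func<TModel, TEngine> createEngine,
        Action<TEngine, string> processFile,
        int workers)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = workers };

        Parallel.ForEach(
            files,
            options,
            // Runs once per worker task (not once per file).
            localInit: () => createEngine(sharedModel),
            body: (file, loopState, engine) =>
            {
                processFile(engine, file);
                return engine; // hand the same engine to this task's next file
            },
            // Runs when a worker task ends; release the engine if it is disposable.
            localFinally: engine => (engine as IDisposable)?.Dispose());
    }
}
```

Because the runtime may retire and replace worker tasks during a long loop, `createEngine` can be invoked more times than `workers` in total, although no more than `workers` engines are in use at the same moment. If engine creation is expensive, a fixed pool of long-lived worker threads is an alternative design.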

🔧 Extend the Demo

Add confidence thresholds for filtering results.
Integrate automatic redaction pipelines.
Store outputs in databases or object storage.
Replace OCR with cloud-based engines.
Combine with document classification or routing workflows.
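As an illustration of the first extension idea, a confidence threshold can be applied as a simple post-processing filter. This sketch assumes each extracted entity carries a numeric confidence score between 0 and 1; `ScoredEntity` and `ConfidenceFilter` are hypothetical names, and whether and how the extraction engine exposes such a score should be checked against the LM-Kit.NET documentation.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity shape with an assumed 0..1 confidence score.
public sealed record ScoredEntity(string Type, string Value, double Confidence);

public static class ConfidenceFilter
{
    // Keeps only entities whose confidence is at or above minConfidence.
    public static IReadOnlyList<ScoredEntity> Apply(
        IEnumerable<ScoredEntity> entities, double minConfidence) =>
        entities.Where(e => e.Confidence >= minConfidence).ToList();
}
```

For redaction workflows a lower threshold is usually the safer choice, since a missed entity is more costly than an unnecessary redaction.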

How-To: Extract PII and Redact Data: Step-by-step guide to detecting and redacting personally identifiable information.
Glossary: PII Detection: Explains PII detection concepts, entity types, and compliance applications.
Glossary: Named Entity Recognition: Covers NER concepts that underpin PII extraction.
PII Extraction Demo: Single-file PII extraction demo with interactive prompts and custom entity definitions.
