👉 Try the demo:
https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/batch_pii_extraction
Batch PII Extraction with AI in .NET Applications
🎯 Purpose of the Sample
Batch PII Extraction demonstrates how to detect and extract Personally Identifiable Information (PII) from large volumes of documents using LM-Kit.NET. The sample processes files in parallel, applies OCR when needed, and outputs structured JSON results for each document.
It is designed for high-throughput compliance, privacy, and data governance workflows.
👥 Target Audience
- Compliance & Legal – GDPR, HIPAA, and privacy audits
- Enterprise & B2B Apps – PII detection pipelines
- Security & Risk – sensitive data discovery
- Back-office & Ops – bulk document processing
- Benchmarking & Demos – throughput and scalability evaluation
🚀 Problem Solved
- Manual PII review across thousands of documents
- Mixed input formats including scanned documents
- Scalable extraction with controlled parallelism
- Structured outputs ready for redaction or storage
💻 Sample Application Description
Console application that:
- Loads a predefined LM-Kit PII extraction model.
- Recursively scans an input directory.
- Runs parallel PII extraction with adaptive thread count.
- Applies OCR automatically for non-text documents.
- Extracts entities such as names, addresses, IDs, and more.
- Writes one JSON output file per input document.
- Displays live progress, throughput, and performance metrics.
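The scan-and-dispatch steps above can be sketched with standard .NET APIs. This is a minimal illustration, not the sample's actual code: `BatchRunner`, `Run`, and the `processFile` delegate are hypothetical names, and the delegate stands in for whatever extraction call the sample performs on each file.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class BatchRunner
{
    // Hypothetical helper: walks inputRoot recursively and runs processFile
    // on every file, with at most maxWorkers files in flight at once.
    // A failing file is recorded and does not stop the rest of the batch.
    public static (int Succeeded, IReadOnlyCollection<string> Failed) Run(
        string inputRoot, int maxWorkers, Action<string> processFile)
    {
        var options = new ParallelOptions
        {
            // MaxDegreeOfParallelism must be -1 or positive, so clamp to 1.
            MaxDegreeOfParallelism = Math.Max(1, maxWorkers)
        };

        int succeeded = 0;
        var failed = new ConcurrentBag<string>();

        // EnumerateFiles streams paths lazily instead of building a full list.
        var files = Directory.EnumerateFiles(inputRoot, "*", SearchOption.AllDirectories);

        Parallel.ForEach(files, options, path =>
        {
            try
            {
                processFile(path);
                Interlocked.Increment(ref succeeded);
            }
            catch (Exception)
            {
                failed.Add(path);
            }
        });

        return (succeeded, failed);
    }
}
```

Catching exceptions per file is a design choice for long batches: one unreadable document is reported at the end rather than aborting thousands of others.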
📂 Supported Inputs
- PDFs and scanned documents
- Images via OCR (Tesseract by default)
- Text-based files supported by LM-Kit attachments
🔐 Extracted Information
Depending on the model and configuration, extracted entities may include:
- Person names
- Addresses
- Phone numbers
- Email addresses
- National identifiers
- Payment and banking information
- Other sensitive data types (optional)
⚙️ Key Features
- ⚡ High-Throughput Batch Processing – multi-threaded execution
- 🧠 OCR-Aware Extraction – automatic OCR integration
- 📄 Page-Level Metrics – documents and pages per second
- 📊 Live Console Dashboard – progress table with throughput
- 🧩 Structured JSON Output – ready for redaction or analysis
- 🔁 Thread-Safe Statistics – real-time performance snapshots
🛠️ Getting Started
📋 Prerequisites
- .NET Framework 4.6.2 or .NET 8.0+
- LM-Kit.NET license key
- Tesseract OCR dependencies (default OCR engine)
📥 Download
git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/batch_pii_extraction
▶️ Run
dotnet build
dotnet run
The sample uses configurable input and output directories and automatically determines the optimal thread count based on hardware and model size.
📁 Output Structure
- One .json file per input document
- Preserves the input folder hierarchy
- JSON includes detected entities and metadata
- Null values omitted for clean output
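One way to get this layout is to mirror each input path under the output root and serialize with null values ignored. The sketch below assumes .NET 8 and `System.Text.Json`; `OutputWriter` and its methods are hypothetical names, and the sample itself may use a different serializer or path logic.

```csharp
using System.IO;
using System.Text.Json;
using System.Text.Json.Serialization;

public static class OutputWriter
{
    private static readonly JsonSerializerOptions Options = new JsonSerializerOptions
    {
        WriteIndented = true,
        // Skip properties whose value is null so the output stays compact.
        DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull
    };

    // Maps <inputRoot>/sub/scan.pdf to <outputRoot>/sub/scan.json.
    public static string MapOutputPath(string inputRoot, string outputRoot, string inputFile)
    {
        string relative = Path.GetRelativePath(inputRoot, inputFile);
        return Path.Combine(outputRoot, Path.ChangeExtension(relative, ".json"));
    }

    public static string Serialize<T>(T result) => JsonSerializer.Serialize(result, Options);

    // Creates any missing parent folders, then writes the JSON file.
    public static void Write<T>(string outputPath, T result)
    {
        var directory = Path.GetDirectoryName(outputPath);
        if (!string.IsNullOrEmpty(directory))
        {
            Directory.CreateDirectory(directory);
        }
        File.WriteAllText(outputPath, Serialize(result));
    }
}
```

Note that replacing the extension means two inputs such as `scan.pdf` and `scan.png` in the same folder would map to the same output file; appending `.json` to the full file name avoids that if it matters for your data.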
📈 Runtime Metrics
Displayed during execution:
- Documents processed
- Pages processed
- Per-document processing time
- Documents per second
- Pages per second
Final summary includes total runtime and averages.
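Counters like these are updated from several worker threads at once, so they need atomic updates. The following is a small sketch of one way to do that with `Interlocked`; `ThroughputStats` is a hypothetical class, not the sample's own statistics type.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public sealed class ThroughputStats
{
    // Stopwatch.GetTimestamp is static, so it is safe to call from any thread.
    private readonly long _startTicks = Stopwatch.GetTimestamp();
    private long _documents;
    private long _pages;

    // Called by a worker after it finishes one document.
    public void RecordDocument(int pageCount)
    {
        Interlocked.Increment(ref _documents);
        Interlocked.Add(ref _pages, pageCount);
    }

    // Returns the totals so far and the average rates since construction.
    public (long Documents, long Pages, double DocsPerSecond, double PagesPerSecond) Snapshot()
    {
        long docs = Interlocked.Read(ref _documents);
        long pages = Interlocked.Read(ref _pages);
        double seconds = (Stopwatch.GetTimestamp() - _startTicks) / (double)Stopwatch.Frequency;
        seconds = Math.Max(seconds, 1e-9); // avoid dividing by zero right after start
        return (docs, pages, docs / seconds, pages / seconds);
    }
}
```

The two counters are read one after the other, so a snapshot taken mid-run can be off by a document that was being recorded at that moment. That is usually acceptable for a live dashboard; the final summary is exact once all workers have finished.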
🔍 Notes
- The model is loaded once and shared across threads.
- Each worker initializes its own extraction engine.
- OCR can be replaced with a custom implementation.
- Additional PII entity definitions can be added programmatically.
- Preferred inference modality can be adjusted for redaction scenarios.
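The first two notes describe a common pattern: one expensive shared object, plus a cheap per-worker object built on top of it. A generic sketch using the thread-local overload of `Parallel.ForEach` is shown below. `TModel`, `TEngine`, `createEngine`, and `extract` are placeholders for illustration and do not correspond to specific LM-Kit types or methods.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class PerWorkerState
{
    // Runs extract on every file. The shared model is passed in once;
    // createEngine is invoked once per worker task to build private state.
    // Returns how many engines were created during the loop.
    public static int ProcessAll<TModel, TEngine>(
        TModel sharedModel,
        IEnumerable<string> files,
        int maxWorkers,
        Func<TModel, TEngine> createEngine,
        Action<TEngine, string> extract)
    {
        int enginesCreated = 0;
        var options = new ParallelOptions { MaxDegreeOfParallelism = Math.Max(1, maxWorkers) };

        Parallel.ForEach<string, TEngine>(
            files,
            options,
            // localInit: build this worker's private engine from the shared model.
            () =>
            {
                Interlocked.Increment(ref enginesCreated);
                return createEngine(sharedModel);
            },
            // body: reuse the worker's engine for each file it picks up.
            (file, loopState, engine) =>
            {
                extract(engine, file);
                return engine;
            },
            // localFinally: release the engine if it holds disposable resources.
            engine => (engine as IDisposable)?.Dispose());

        return enginesCreated;
    }
}
```

The runtime may start replacement worker tasks during a long loop, so the number of engines created can be higher than `maxWorkers`. Only the number running concurrently is bounded.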
🔧 Extend the Demo
- Add confidence thresholds for filtering results.
- Integrate automatic redaction pipelines.
- Store outputs in databases or object storage.
- Replace OCR with cloud-based engines.
- Combine with document classification or routing workflows.
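As an illustration of the first extension, a confidence threshold can be applied as a post-processing step before writing the JSON. The sketch assumes each extracted entity carries a numeric confidence score; `PiiEntity` and its properties are a hypothetical shape, not the library's result type, so adapt the property names to what the extraction output actually exposes.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical entity shape used only for this example.
public sealed class PiiEntity
{
    public string Type { get; set; } = "";
    public string Value { get; set; } = "";
    public double Confidence { get; set; }
}

public static class ConfidenceFilter
{
    // Keeps entities whose confidence is at or above minConfidence.
    public static List<PiiEntity> Apply(IEnumerable<PiiEntity> entities, double minConfidence)
        => entities.Where(e => e.Confidence >= minConfidence).ToList();
}
```

For redaction workflows a lower threshold is often the safer default, since a missed identifier usually costs more than an unnecessary redaction.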