
👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/text-analysis/pii-extraction/pii_detector_redactor

Folder PII Redaction Tool for C# .NET Applications


🎯 Purpose of the Demo

An interactive console app that detects PII in text content and produces position-accurate redacted copies plus a full audit trail. Built on LM-Kit.NET's PiiExtraction engine. Designed for compliance, security, and data-room workflows where the redaction must be defensible (every span recorded with offsets, label, original text, redaction value, and confidence).

All inference runs on-device. No PII ever leaves the host.


👥 Industry Target Audience

  • Compliance & legal: GDPR / HIPAA / CCPA pre-share redaction.
  • Security: pre-export scrubbing of logs, support tickets, transcripts.
  • HR: scrub identifiers from CVs and applications before LLM review.
  • Healthcare: redact patient identifiers in clinical notes.

🚀 Problem Solved

The naïve approach to redaction is a case-insensitive search-and-replace over the entity values returned by the detector. That mangles documents where the same string appears in both a PII context and a benign one ("John" the name vs. "St. John's" the place), and it misses substring edge cases entirely.
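To make the failure concrete, here is a small sketch of the naïve approach. The sample sentence and the detected value are invented for illustration and do not come from the demo.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: "John" is a person in the first clause
// and part of a place name in the second.
string text = "John flew to St. John's last week.";
string detectedValue = "John"; // value a detector might return for a PERSON entity

// Naive redaction: replace every case-insensitive match of the detected value,
// with no regard for where the detector actually found it.
string naive = Regex.Replace(text, Regex.Escape(detectedValue), "[PERSON]", RegexOptions.IgnoreCase);

Console.WriteLine(naive); // [PERSON] flew to St. [PERSON]'s last week.
```

The place name is rewritten along with the person's name, because the replacement works on values rather than on detected positions.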

This demo uses what the LM-Kit engine actually returns: PiiExtractedEntity.Occurrences[] carries each detection's StartIndex and EndIndex in the source text. The demo splices the redactions in by offset, right-to-left so earlier positions stay valid as later spans are rewritten. Overlapping spans are resolved deterministically (longest wins, then highest confidence on ties). Every redaction is recorded in an audit CSV.
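The right-to-left splice can be sketched as follows. This is a minimal illustration, not the demo's source: spans are plain tuples rather than LM-Kit types, `Splice` is a hypothetical helper name, and the end offset is assumed to be exclusive (check how `EndIndex` is defined before reusing this).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Replace each [Start, End) span with its replacement text.
// Spans are applied from the highest start offset to the lowest, so an edit
// never shifts the offsets of spans that are still waiting to be applied.
// Assumes the spans do not overlap.
static string Splice(string text, IEnumerable<(int Start, int End, string Replacement)> spans)
{
    var sb = new StringBuilder(text);
    foreach (var span in spans.OrderByDescending(s => s.Start))
    {
        sb.Remove(span.Start, span.End - span.Start);
        sb.Insert(span.Start, span.Replacement);
    }
    return sb.ToString();
}

// Hypothetical detections for the sample sentence (offsets counted by hand).
string source = "Call John at 555-0100.";
var detected = new List<(int Start, int End, string Replacement)>
{
    (5, 9, "[PERSON]"),   // "John"
    (13, 21, "[PHONE]"),  // "555-0100"
};

string redacted = Splice(source, detected);
Console.WriteLine(redacted); // Call [PERSON] at [PHONE].
```

Applying the same spans left to right would require adjusting every later offset by the length difference of each replacement; going right to left avoids that bookkeeping.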


💻 Application Overview

Interactive menu (no command-line arguments) with three modes plus a runtime-selectable redaction style. Model load happens once at startup.

| Mode | What it does |
|---|---|
| Live | Paste text, pick a redaction style, see the entities and redacted output. |
| File | Redact one .txt / .md / .eml / .log file, write companion audit CSV. |
| Folder | Walk a folder, optional confidence floor, redact every text file preserving the relative path, write redaction_audit.csv with every redacted span. |
| Quit | Exit. |
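The Folder mode's "preserving the relative path" behavior might look roughly like the sketch below. `RedactFolder` and its parameters are hypothetical names, the `redact` callback is a stand-in for the real offset-based redactor, and the confidence floor and audit CSV are left out. The output folder is assumed to sit outside the input folder.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Walk inputRoot recursively, redact every supported text file, and write the
// result under outputRoot at the same relative path. Returns the relative
// paths that were written.
static List<string> RedactFolder(string inputRoot, string outputRoot, Func<string, string> redact)
{
    var allowed = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".txt", ".md", ".eml", ".log" };
    var written = new List<string>();
    foreach (var file in Directory.EnumerateFiles(inputRoot, "*", SearchOption.AllDirectories))
    {
        if (!allowed.Contains(Path.GetExtension(file))) continue; // skip non-text inputs
        string relative = Path.GetRelativePath(inputRoot, file);
        string target = Path.Combine(outputRoot, relative);
        Directory.CreateDirectory(Path.GetDirectoryName(target) ?? outputRoot);
        File.WriteAllText(target, redact(File.ReadAllText(file)));
        written.Add(relative);
    }
    return written;
}

// Build a throwaway folder tree to exercise the walk.
string root = Path.Combine(Path.GetTempPath(), "pii_demo_" + Guid.NewGuid().ToString("N"));
string input = Path.Combine(root, "in");
string output = Path.Combine(root, "out");
Directory.CreateDirectory(Path.Combine(input, "tickets"));
File.WriteAllText(Path.Combine(input, "tickets", "t1.txt"), "Call John.");
File.WriteAllText(Path.Combine(input, "image.png"), "not a text file");

// Stand-in redactor only; a real run would splice by detected offsets.
var done = RedactFolder(input, output, text => text.Replace("John", "[PERSON]"));
Console.WriteLine(string.Join(", ", done));
```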

Three redaction styles, picked per run:

| Style | Example transform | Use when |
|---|---|---|
| mask (default) | "John Smith" → "**** *****" | You want to preserve the shape of the text (length, whitespace, digits-vs-letters). |
| label | "John Smith" → "[PERSON]" | You need a clean, label-tagged version for downstream LLM consumption. |
| hash | "John Smith" → "[PERSON#a1b2c3]" | You need a stable per-value token so two occurrences of the same identity remain linked across redacted documents. |
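A rough sketch of the three styles, written to reproduce the example transforms above. The helper names are hypothetical, and the details are assumptions: the mask here keeps whitespace and turns every other character into `*` (it does not attempt a digits-vs-letters distinction), and the hash token uses the first three bytes of an unsalted SHA-256 digest, which may differ from what the demo actually computes.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// mask: keep whitespace, replace every other character with '*'.
static string Mask(string value) =>
    new string(value.Select(c => char.IsWhiteSpace(c) ? c : '*').ToArray());

// label: replace the value with its entity label in brackets.
static string LabelTag(string label) => $"[{label}]";

// hash: label plus a short, stable digest of the value, so the same value
// always maps to the same token. Assumption: SHA-256, first 3 bytes, lowercase hex.
static string HashToken(string label, string value)
{
    byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes(value));
    string shortHex = Convert.ToHexString(digest, 0, 3).ToLowerInvariant();
    return $"[{label}#{shortHex}]";
}

Console.WriteLine(Mask("John Smith"));                // **** *****
Console.WriteLine(LabelTag("PERSON"));                // [PERSON]
Console.WriteLine(HashToken("PERSON", "John Smith")); // [PERSON#xxxxxx], six hex digits
```

Because the token depends only on the label and the value, the same name redacted in two different files yields the same token, which is what keeps occurrences linked.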

✨ Key Features

  • PiiExtraction.Extract + Occurrences[].StartIndex/EndIndex: position-accurate spans straight from the API. No regex, no string-replace.
  • Deterministic overlap resolution: longest-wins, then highest-confidence-wins on ties.
  • Three redaction styles: choose at runtime per session.
  • Audit CSV lists every span: source, label, original, redaction, start, end, confidence.
  • Interactive, no flags: every input is a console prompt.
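The overlap rule and the audit row from the list above can be sketched as follows. This is an illustration under assumptions, not the demo's code: spans are plain tuples with exclusive end offsets, `ResolveOverlaps`, `Csv`, and `AuditRow` are hypothetical helpers, the labels and offsets are made up, and the two-decimal confidence format is a choice made here.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Rank spans longest first, then by confidence, then by start offset so the
// ordering is fully deterministic. Keep a span only if it does not overlap
// one that was already kept.
static List<(int Start, int End, string Label, double Confidence)> ResolveOverlaps(
    IEnumerable<(int Start, int End, string Label, double Confidence)> spans)
{
    var kept = new List<(int Start, int End, string Label, double Confidence)>();
    var ranked = spans
        .OrderByDescending(s => s.End - s.Start)
        .ThenByDescending(s => s.Confidence)
        .ThenBy(s => s.Start);
    foreach (var candidate in ranked)
    {
        bool overlaps = kept.Any(k => candidate.Start < k.End && k.Start < candidate.End);
        if (!overlaps) kept.Add(candidate);
    }
    return kept.OrderBy(k => k.Start).ToList();
}

// Quote a CSV field when it contains a comma, quote, or line break.
static string Csv(string field) =>
    field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) >= 0
        ? "\"" + field.Replace("\"", "\"\"") + "\""
        : field;

// One audit line: source, label, original, redaction, start, end, confidence.
static string AuditRow(string source, string label, string original, string redaction,
                       int start, int end, double confidence) =>
    string.Join(",", Csv(source), Csv(label), Csv(original), Csv(redaction),
        start.ToString(CultureInfo.InvariantCulture),
        end.ToString(CultureInfo.InvariantCulture),
        confidence.ToString("0.00", CultureInfo.InvariantCulture));

var candidates = new List<(int Start, int End, string Label, double Confidence)>
{
    (4, 14, "PERSON", 0.80),    // longer span
    (4, 8, "PERSON", 0.95),     // nested in the span above: dropped, longest wins
    (20, 30, "EMAIL", 0.70),    // same range as the next span: dropped on confidence
    (20, 30, "USERNAME", 0.90),
};
var resolved = ResolveOverlaps(candidates);
foreach (var r in resolved)
    Console.WriteLine($"{r.Start}-{r.End} {r.Label}");

string row = AuditRow("notes.txt", "PERSON", "Smith, John", "[PERSON]", 4, 15, 0.8);
Console.WriteLine(row); // notes.txt,PERSON,"Smith, John",[PERSON],4,15,0.80
```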

🧠 Supported Models

  • Alibaba Qwen 3.5 2B (~2 GB VRAM): fast default.
  • Alibaba Qwen 3.5 4B / 9B (~3.5 / 7 GB VRAM).
  • Google Gemma 4 E4B (~6 GB VRAM).
  • Alibaba Qwen 3.6 27B (~18 GB VRAM).
  • Any custom model URI.

Supported Inputs

Text files: .txt, .md, .eml, .log. PDFs and images can be inspected with the NER demo; position-aware redaction of binary types requires a different pipeline.


🛠️ Getting Started

📋 Prerequisites

  • .NET 8.0 or later

▶️ Running the Application

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/text-analysis/pii-extraction/pii_detector_redactor
dotnet run

Pick a model, pick a mode, follow the prompts.
