Extract PII and Redact Sensitive Data

Applications that process user content (support tickets, uploaded documents, form submissions) need to detect and handle personally identifiable information (PII) before storing, sharing, or analyzing that data. LM-Kit.NET's PiiExtraction class identifies 11 built-in PII types (names, emails, phone numbers, SSNs, credit cards, and more) with confidence scores and occurrence positions. This tutorial builds a PII detection and redaction system that processes text and documents locally.


Why Local PII Extraction Matters

On-device PII detection solves two enterprise problems:

  1. Comply with privacy regulations without sending data to third parties. GDPR, CCPA, and HIPAA all require knowing what PII you hold, but sending documents to a cloud PII detection API means a third party processes the very data you are trying to protect. Local extraction keeps sensitive data within your infrastructure during the detection phase.
  2. Redact before sharing. Customer support logs shared with analytics teams, medical records shared with researchers, and legal documents shared with external counsel all need PII stripped first. A local system redacts at the source, before the data ever moves.

Prerequisites

Requirement | Minimum
.NET SDK | 8.0+
VRAM | 4+ GB
Disk | ~3 GB free for model download

Step 1: Create the Project

dotnet new console -n PiiQuickstart
cd PiiQuickstart
dotnet add package LM-Kit.NET

Step 2: Understand PII Entity Types

LM-Kit.NET detects these PII types out of the box:

Type | Examples
Person | "John Smith", "Dr. Sarah Chen"
EmailAddress | "user@example.com"
PhoneNumber | "+1-650-555-1234"
PostalAddress | "1600 Amphitheatre Parkway, Mountain View, CA"
Url | "https://example.com/profile/12345"
IpAddress | "192.168.0.1", "2001:db8::1"
DateOfBirth | "01/15/1980", "March 3rd, 1992"
SocialSecurityNumber | "123-45-6789"
CreditCardNumber | "4111 1111 1111 1111"
BankAccountNumber | "000123456789"
Other | Catch-all for unclassified PII (opt-in)

You can also define custom PII types for domain-specific identifiers (patient IDs, employee numbers, account codes).


Step 3: Basic PII Extraction

using System.Text;
using LMKit.Model;
using LMKit.TextAnalysis;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Extract PII
// ──────────────────────────────────────
var pii = new PiiExtraction(model);

string text = """
    Dear Support Team,

    My name is James Wilson and I'm writing about order #45832. I purchased a laptop
    on my Visa card ending in 4242 (full number: 4532-1234-5678-4242). The delivery
    address is 742 Evergreen Terrace, Springfield, IL 62704.

    You can reach me at james.wilson@email.com or call (555) 867-5309. My date of
    birth for verification is 03/15/1985, and my SSN on file is 234-56-7890.

    Thanks,
    James Wilson
    """;

List<PiiExtraction.PiiExtractedEntity> entities = pii.Extract(text);

Console.WriteLine($"Found {entities.Count} PII entities (confidence: {pii.Confidence:P0}):\n");

foreach (var entity in entities)
{
    Console.ForegroundColor = ConsoleColor.Red;
    Console.Write($"  {entity.EntityDefinition.Label,-25}");
    Console.ResetColor();
    Console.Write($"  {entity.Value,-40}");
    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.WriteLine($"  ({entity.Confidence:P0})");
    Console.ResetColor();
}

Step 4: Redact PII from Text

Build a redaction function that replaces PII with type-labeled placeholders:

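string RedactText(string originalText, List<PiiExtraction.PiiExtractedEntity> entities)
{
    string redacted = originalText;

    // Replace longer values first so a value that is a substring of another
    // (e.g. "James" inside "James Wilson") cannot split the longer match.
    // Empty values are skipped because string.Replace rejects an empty search string.
    var sorted = entities
        .Where(e => !string.IsNullOrEmpty(e.Value))
        .OrderByDescending(e => e.Value.Length);

    foreach (var entity in sorted)
    {
        // ToUpperInvariant keeps placeholders identical regardless of the machine's culture
        string placeholder = $"[{entity.EntityDefinition.Label.ToUpperInvariant()}]";

        // Replaces every occurrence of the detected value in the text
        redacted = redacted.Replace(entity.Value, placeholder);
    }

    return redacted;
}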

// Extract and redact
var piiEntities = pii.Extract(text);
string redacted = RedactText(text, piiEntities);

Console.WriteLine("Redacted output:\n");
Console.WriteLine(redacted);

Expected output:

Dear Support Team,

My name is [PERSON] and I'm writing about order #45832. I purchased a laptop
on my Visa card ending in 4242 (full number: [CREDITCARDNUMBER]). The delivery
address is [POSTALADDRESS].

You can reach me at [EMAILADDRESS] or call [PHONENUMBER]. My date of
birth for verification is [DATEOFBIRTH], and my SSN on file is [SOCIALSECURITYNUMBER].

Thanks,
[PERSON]
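
The longest-first ordering in RedactText matters whenever one detected value contains another. The following is a minimal standalone sketch of that idea, not the library's API: it uses plain (Label, Value) tuples as a hypothetical stand-in for the extracted entities, and the helper name RedactLongestFirst is made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the extractor's output: plain (Label, Value) pairs.
static string RedactLongestFirst(string text, IEnumerable<(string Label, string Value)> entities)
{
    // Longer values go first so that a short value contained in a longer one
    // cannot break the longer match apart before it is replaced.
    foreach (var (label, value) in entities
        .Where(e => !string.IsNullOrEmpty(e.Value))
        .OrderByDescending(e => e.Value.Length))
    {
        text = text.Replace(value, $"[{label.ToUpperInvariant()}]");
    }
    return text;
}

var sample = "Contact James Wilson. James asked for a refund.";
var detected = new List<(string Label, string Value)>
{
    ("Person", "James"),        // short value, listed first on purpose
    ("Person", "James Wilson"), // longer value that contains the short one
};

string result = RedactLongestFirst(sample, detected);
Console.WriteLine(result); // Contact [PERSON]. [PERSON] asked for a refund.
```

If the short value were replaced first, the text would become "Contact [PERSON] Wilson. ..." and the surname would leak, because "James Wilson" would no longer be found.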

Step 5: Extract PII from Documents

Process PDFs, images, and scanned documents:

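// Place this using directive at the top of Program.cs, next to the others
using LMKit.Data;

// Reuses the `pii` instance created in Step 3. If this step runs as a
// separate program, create one first: var pii = new PiiExtraction(model);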

string filePath = "customer_application.pdf";
var attachment = new Attachment(filePath);

List<PiiExtraction.PiiExtractedEntity> docEntities = pii.Extract(attachment);

Console.WriteLine($"PII found in {Path.GetFileName(filePath)}:\n");

// Group by type for a compliance report
var grouped = docEntities
    .GroupBy(e => e.EntityDefinition.Label)
    .OrderBy(g => g.Key);

foreach (var group in grouped)
{
    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine($"  {group.Key} ({group.Count()}):");
    Console.ResetColor();

    foreach (var entity in group)
    {
        Console.WriteLine($"    {entity.Value} ({entity.Confidence:P0})");
    }
}

Step 6: Custom PII Definitions

Add domain-specific PII types that the default set does not cover:

var customPii = new PiiExtraction(model, new List<PiiExtraction.PiiEntityDefinition>
{
    // Keep the standard types you need
    new(PiiExtraction.PiiEntityType.Person),
    new(PiiExtraction.PiiEntityType.EmailAddress),
    new(PiiExtraction.PiiEntityType.PhoneNumber),
    new(PiiExtraction.PiiEntityType.SocialSecurityNumber),

    // Add custom types
    new("PatientID"),
    new("MedicalRecordNumber"),
    new("InsurancePolicyNumber"),
    new("EmployeeID")
});

customPii.Guidance = "This is a healthcare document. " +
    "Patient IDs follow the format P-XXXXX. " +
    "Medical record numbers follow the format MRN-XXXXXXXX.";

string medicalText = """
    Patient: Maria Garcia (P-48231)
    MRN: MRN-20241589
    Insurance: BlueCross Policy #BC-9912-4456-01
    Emergency Contact: Carlos Garcia, (555) 234-8901
    SSN: 567-89-0123
    Employee ID: EMP-3391 (referring physician)
    """;

var medicalEntities = customPii.Extract(medicalText);

foreach (var entity in medicalEntities)
{
    Console.WriteLine($"  [{entity.EntityDefinition.Label}] {entity.Value}");
}
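
Because the Guidance text above states explicit formats, extracted custom values can be cross-checked against them after extraction. The sketch below is an optional post-processing idea, not part of LM-Kit.NET: it assumes each X in the Guidance stands for one decimal digit, and the helper name MatchesExpectedFormat is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Assumption: every X in "P-XXXXX" and "MRN-XXXXXXXX" is one decimal digit.
var formats = new Dictionary<string, Regex>
{
    ["PatientID"] = new Regex(@"^P-[0-9]{5}\z"),
    ["MedicalRecordNumber"] = new Regex(@"^MRN-[0-9]{8}\z"),
};

// Returns true when the label has no known format, or when the value matches it.
bool MatchesExpectedFormat(string label, string value) =>
    !formats.TryGetValue(label, out var pattern) || pattern.IsMatch(value);

Console.WriteLine(MatchesExpectedFormat("PatientID", "P-48231"));                // True
Console.WriteLine(MatchesExpectedFormat("MedicalRecordNumber", "MRN-20241589")); // True
Console.WriteLine(MatchesExpectedFormat("PatientID", "P-4823"));                 // False (only four digits)
Console.WriteLine(MatchesExpectedFormat("EmployeeID", "EMP-3391"));              // True (no format registered)
```

A mismatch is best treated as a reason for manual review rather than proof of a false detection, since real documents may contain typos or legacy formats.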

Step 7: Batch PII Audit

Scan a directory of files and generate a compliance report:

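// Match extensions case-insensitively so files such as "SCAN.PDF" are not skipped
string[] files = Directory.GetFiles("customer_data", "*.*")
    .Where(f => f.EndsWith(".txt", StringComparison.OrdinalIgnoreCase)
             || f.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase))
    .ToArray();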

var report = new List<string>();
report.Add("file,pii_type,value,confidence");

int totalPii = 0;

Console.WriteLine($"Scanning {files.Length} files for PII...\n");

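foreach (string file in files)
{
    string fileName = Path.GetFileName(file);

    // PDFs are binary, so pass them to the extractor as an Attachment (as in Step 5);
    // only plain-text files are read directly as strings.
    var fileEntities = file.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase)
        ? pii.Extract(new LMKit.Data.Attachment(file))
        : pii.Extract(File.ReadAllText(file));

    totalPii += fileEntities.Count;

    foreach (var entity in fileEntities)
    {
        // Double embedded quotes so the value stays a single CSV field
        string value = entity.Value.Replace("\"", "\"\"");

        // Invariant culture keeps '.' as the decimal separator on every machine
        string confidence = entity.Confidence.ToString("F2",
            System.Globalization.CultureInfo.InvariantCulture);

        report.Add($"\"{fileName}\",\"{entity.EntityDefinition.Label}\"," +
            $"\"{value}\",{confidence}");
    }

    ConsoleColor color = fileEntities.Count > 0 ? ConsoleColor.Yellow : ConsoleColor.Green;
    Console.ForegroundColor = color;
    Console.WriteLine($"  {fileName}: {fileEntities.Count} PII entities");
    Console.ResetColor();
}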

File.WriteAllLines("pii_audit_report.csv", report);
Console.WriteLine($"\nTotal PII found: {totalPii} across {files.Length} files");
Console.WriteLine("Report saved to pii_audit_report.csv");
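
Detected values can contain commas and quotes (postal addresses often do), so the report rows need consistent CSV quoting. The sketch below shows one way to factor that out. CsvField and CsvNumber are hypothetical helper names, and the quoting follows the usual RFC 4180 convention of wrapping a field in double quotes and doubling any embedded quote.

```csharp
using System;
using System.Globalization;

// Wrap a field in double quotes and double any embedded quote (RFC 4180 style).
static string CsvField(string value) => "\"" + value.Replace("\"", "\"\"") + "\"";

// Format a score with '.' as the decimal separator, whatever the machine's culture is.
static string CsvNumber(float value) => value.ToString("F2", CultureInfo.InvariantCulture);

string row = string.Join(",",
    CsvField("notes \"final\".txt"), // file name containing quotes
    CsvField("Person"),
    CsvField("Wilson, James"),       // value containing a comma
    CsvNumber(0.93f));

Console.WriteLine(row); // "notes ""final"".txt","Person","Wilson, James",0.93
```

Keep in mind that a report built this way stores the raw PII values, so the CSV file itself needs the same access controls as the source documents.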

Step 8: Confidence-Based Filtering

Not all detections are equally certain. Use confidence scores to separate confirmed PII from uncertain matches:

var entities = pii.Extract(text);

var confirmed = entities.Where(e => e.Confidence >= 0.85f).ToList();
var uncertain = entities.Where(e => e.Confidence < 0.85f).ToList();

Console.WriteLine($"Confirmed PII ({confirmed.Count}):");
foreach (var e in confirmed)
    Console.WriteLine($"  {e.EntityDefinition.Label}: {e.Value} ({e.Confidence:P0})");

Console.WriteLine($"\nUncertain (needs review) ({uncertain.Count}):");
foreach (var e in uncertain)
    Console.WriteLine($"  {e.EntityDefinition.Label}: {e.Value} ({e.Confidence:P0})");

// Auto-redact confirmed, flag uncertain for human review
string autoRedacted = RedactText(text, confirmed);

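This two-tier approach limits false-positive redactions while still handling high-confidence PII automatically. Uncertain matches remain in autoRedacted until a reviewer resolves them, so treat that text as sensitive until the review is complete.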
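
For uncertain credit card matches, a checksum can help a reviewer prioritize. The sketch below implements the standard Luhn check as an optional cross-check. It is not something the tutorial's library is stated to perform, and the helper name PassesLuhn and the 12-digit minimum are assumptions made for illustration.

```csharp
using System;
using System.Linq;

// Luhn checksum: true when the digits form a checksum-valid card-style number.
static bool PassesLuhn(string candidate)
{
    // Keep digits only, so spaces and dashes between groups are ignored.
    int[] digits = candidate.Where(c => c >= '0' && c <= '9').Select(c => c - '0').ToArray();
    if (digits.Length < 12) return false; // assumed minimum length for a card number

    int sum = 0;
    bool doubleIt = false; // the rightmost digit is never doubled
    for (int i = digits.Length - 1; i >= 0; i--)
    {
        int d = digits[i];
        if (doubleIt)
        {
            d *= 2;
            if (d > 9) d -= 9; // same as adding the two digits of the product
        }
        sum += d;
        doubleIt = !doubleIt;
    }
    return sum % 10 == 0;
}

Console.WriteLine(PassesLuhn("4111 1111 1111 1111")); // True
Console.WriteLine(PassesLuhn("4532-1234-5678-4242")); // False
```

The second value is the made-up card number from the Step 3 sample text, which fails the checksum. A failed check is therefore only a hint: a mistyped or fictional number may still be something you want redacted.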


Common Issues

Problem | Cause | Fix
Missing PII types | Default set does not include Other | Pass includeOtherType: true to the constructor, or add custom definitions
False positives on product codes | Model confuses identifiers | Add Guidance to describe your document context
Poor detection on scanned PDFs | Image quality too low | Set pii.OcrEngine to a configured OcrEngine instance
Slow on large documents | Full document processed at once | Set MaxContextLength to limit the processing window
Custom types not detected | Label too vague | Use descriptive labels ("PatientID", not "ID") and add Guidance with format examples

Next Steps