Process Email Archives for Compliance and Legal Discovery

Corporate email archives contain critical evidence for litigation holds, regulatory audits, HR investigations, and compliance reviews. Manually sifting through thousands of emails is slow and error-prone. This guide builds a local pipeline that ingests EML and MBOX email files, converts them to searchable text, detects personally identifiable information (PII), classifies emails by topic, and indexes everything into a retrieval-augmented generation (RAG) knowledge base for natural-language search. All processing happens on your machine with no data sent to external services.


Why Local Email Processing Matters

Two enterprise problems that on-device email analysis solves:

  1. Legal discovery without third-party data exposure. During litigation, a legal team receives a 50 GB MBOX export from the email server. Uploading it to a cloud analysis service means a third party processes privileged attorney-client communications and trade secrets. Local processing keeps all email content within your infrastructure, preserving privilege and complying with court-ordered data handling requirements.
  2. Regulatory compliance audits on internal data. Financial institutions under FINRA or SEC supervision must review employee communications for policy violations. Healthcare organizations must audit email for inadvertent PHI disclosure. A local pipeline scans email archives for sensitive content, flags PII, and generates compliance reports without exposing regulated data to external vendors.

Prerequisites

| Requirement | Minimum |
| ----------- | ------- |
| .NET SDK    | 8.0+ |
| VRAM        | 4+ GB (for chat model) + 1 GB (for embedding model) |
| Disk        | ~5 GB for model downloads |
| Email files | .eml or .mbox format |

Step 1: Create the Project

dotnet new console -n EmailComplianceTool
cd EmailComplianceTool
dotnet add package LM-Kit.NET

Step 2: Understand Email Formats

LM-Kit.NET treats emails as first-class documents through the Attachment class, which handles both individual messages and archives.

| Format | Extension | Description | How LM-Kit Handles It |
| ------ | --------- | ----------- | --------------------- |
| EML    | .eml      | Single email message (RFC 5322) | Loaded as a single-page document |
| MBOX   | .mbox     | Archive containing multiple emails | Each email becomes a separate "page" |

  ┌──────────────┐        ┌─────────────────────────────────┐
  │  email.eml   │───────►│  Attachment (PageCount = 1)     │
  └──────────────┘        │    GetText() → full email text  │
                          └─────────────────────────────────┘

  ┌──────────────┐        ┌─────────────────────────────────┐
  │ archive.mbox │───────►│  Attachment (PageCount = N)     │
  └──────────────┘        │    GetPageText(0) → email #1    │
                          │    GetPageText(1) → email #2    │
                          │    ...                          │
                          └─────────────────────────────────┘

Step 3: Load and Parse Email Files

The Attachment class auto-detects the format and provides a unified API for accessing email content.

using System.Text;
using LMKit.Data;

Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// Load a single EML file
// ──────────────────────────────────────
var singleEmail = new Attachment("message.eml");
Console.WriteLine($"Format: {singleEmail.Mime}");
Console.WriteLine($"Pages:  {singleEmail.PageCount}");  // Always 1 for EML
Console.WriteLine($"Text:\n{singleEmail.GetText()}");

// Access embedded attachments (PDFs, images, etc.)
foreach (var att in singleEmail.Attachments)
{
    Console.WriteLine($"  Attachment: {att.Name} ({att.Mime}, {att.Length} bytes)");
}

// ──────────────────────────────────────
// Load an MBOX archive
// ──────────────────────────────────────
var archive = new Attachment("archive.mbox");
Console.WriteLine($"\nArchive contains {archive.PageCount} emails");

for (int i = 0; i < archive.PageCount; i++)
{
    string emailText = archive.GetPageText(i);
    // First line typically contains the From/Subject headers
    string preview = emailText.Length > 100 ? emailText[..100] + "..." : emailText;
    Console.WriteLine($"  Email {i + 1}: {preview}");
}
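
If you need individual header fields rather than a raw preview, a small parser can pull them out of the text. The sketch below is plain .NET and assumes the text returned by GetText() or GetPageText() begins with RFC 5322 style "Name: value" lines followed by a blank line; that layout is an assumption, so check it against your own output first. ReadHeaders is a hypothetical helper, not part of LM-Kit.NET.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: collects "Name: value" lines from the top of an email's
// text, stopping at the first blank line (the header/body boundary).
static Dictionary<string, string> ReadHeaders(string emailText)
{
    var headers = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
    using var reader = new StringReader(emailText);
    string? line;
    string? lastName = null;

    while ((line = reader.ReadLine()) != null && line.Length > 0)
    {
        // A line starting with whitespace continues the previous header (folding).
        if (line[0] == ' ' || line[0] == '\t')
        {
            if (lastName != null) headers[lastName] += " " + line.Trim();
            continue;
        }

        int colon = line.IndexOf(':');
        string name = colon > 0 ? line[..colon] : "";
        if (name.Length == 0 || name.Contains(' '))
        {
            lastName = null; // not a header line (e.g. an MBOX "From " separator)
            continue;
        }

        lastName = name;
        headers[name] = line[(colon + 1)..].Trim(); // a repeated header keeps its last value
    }
    return headers;
}

// Sample text standing in for one email's extracted text.
string sampleEmail =
    "From: alice@example.com\nTo: bob@example.com\n" +
    "Subject: Contract renewal\n terms for 2025\n\nHi Bob: see attached.";

var parsedHeaders = ReadHeaders(sampleEmail);
Console.WriteLine($"From:    {parsedHeaders["From"]}");
Console.WriteLine($"Subject: {parsedHeaders["Subject"]}");
```

If your extracted text does not start with header lines, the Markdown converters in Step 4 are the better route to sender, recipient, and subject metadata.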

Step 4: Convert Emails to Structured Markdown

For richer processing that preserves metadata (sender, recipients, dates, subject, attachments), use the markdown converters. This produces clean, structured text that is ideal for classification and extraction.

using LMKit.Document.Conversion;

// ──────────────────────────────────────
// Convert single EML to Markdown
// ──────────────────────────────────────
string markdown = EmlMarkdownConverter.EmlToMarkdown(
    "message.eml",
    stripQuotes: true);  // Remove quoted reply chains

Console.WriteLine(markdown);
// Output includes: From, To, Cc, Date, Subject header table,
// body text converted from HTML to Markdown, and an attachment list.

// ──────────────────────────────────────
// Convert an MBOX archive to Markdown
// ──────────────────────────────────────
string archiveMarkdown = MboxMarkdownConverter.MboxToMarkdown(
    "archive.mbox",
    stripQuotes: true);

Console.WriteLine(archiveMarkdown);

Step 5: Detect PII in Email Content

Email is one of the most common sources of inadvertent PII exposure: names, phone numbers, social security numbers, and credit card numbers routinely appear in the body text. Use PiiExtraction to find and flag sensitive data.

using System.Text;
using LMKit.Model;
using LMKit.Data;
using LMKit.TextAnalysis;

LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.Write("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rLoading model... {(double)read / len.Value * 100:F1}%");
        return true;
    });
Console.WriteLine(" done.");

// ──────────────────────────────────────
// 2. Scan emails for PII
// ──────────────────────────────────────
var pii = new PiiExtraction(model);

var archive = new Attachment("archive.mbox");
Console.WriteLine($"Scanning {archive.PageCount} emails for PII...\n");

for (int i = 0; i < archive.PageCount; i++)
{
    string emailText = archive.GetPageText(i);
    pii.SetContent(emailText);

    var entities = pii.ExtractEntities();

    if (entities.Count > 0)
    {
        Console.ForegroundColor = ConsoleColor.Yellow;
        Console.WriteLine($"Email {i + 1}: {entities.Count} PII entities found");
        Console.ResetColor();

        foreach (var entity in entities)
        {
            Console.WriteLine($"  [{entity.Type}] \"{entity.Value}\" (confidence: {entity.Confidence:F2})");
        }
        Console.WriteLine();
    }
}

For more PII processing options, see Extract PII and Redact Sensitive Data.
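
Once entities are flagged, reviewers often want a copy of the email with those values masked. The following is a minimal sketch of that idea using plain string replacement; it is not LM-Kit.NET's own redaction feature. The Redact helper and the sample findings are hypothetical, and it assumes you have projected each detected entity's Type and Value to strings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: masks each detected value with a "[TYPE]" placeholder.
// Longer values are replaced first so that a short value contained in a longer
// one (a surname inside a full name) does not leave fragments behind.
static string Redact(string text, IEnumerable<(string Type, string Value)> findings)
{
    foreach (var (type, value) in findings
                 .Where(f => !string.IsNullOrEmpty(f.Value))
                 .OrderByDescending(f => f.Value.Length))
    {
        text = text.Replace(value, $"[{type}]", StringComparison.Ordinal);
    }
    return text;
}

// Sample data standing in for the Type/Value pairs printed by the scan loop.
var sampleFindings = new List<(string Type, string Value)>
{
    ("Person", "Jane Doe"),
    ("Phone", "555-0142"),
};

string redacted = Redact("Call Jane Doe at 555-0142.", sampleFindings);
Console.WriteLine(redacted); // Call [Person] at [Phone].
```

Exact-match replacement only masks values as they were detected; a differently formatted copy of the same phone number elsewhere in the text would be left untouched.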


Step 6: Classify Emails by Topic

Classification assigns each email to a category (legal, finance, HR, marketing) so you can prioritize review efforts. This is critical for legal discovery, where only certain topics are relevant to the case.

using LMKit.Model;
using LMKit.Data;
using LMKit.TextAnalysis;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

using LM model = LM.LoadFromModelID("gemma3:4b");
var categorizer = new Categorization(model);

// Define categories relevant to your investigation
var categories = new List<string>
{
    "Legal and Contracts",
    "Finance and Accounting",
    "Human Resources",
    "Sales and Marketing",
    "Technical and Engineering",
    "General Correspondence"
};

var archive = new Attachment("archive.mbox");

// Build a classification report
var report = new Dictionary<string, List<int>>();
foreach (var cat in categories)
    report[cat] = new List<int>();

for (int i = 0; i < archive.PageCount; i++)
{
    string emailText = archive.GetPageText(i);
    int bestIndex = categorizer.GetBestCategory(categories, emailText);
    report[categories[bestIndex]].Add(i + 1);
}

// Print summary
Console.WriteLine("Classification Summary:");
Console.WriteLine(new string('-', 50));
foreach (var (category, emailNumbers) in report)
{
    Console.WriteLine($"  {category}: {emailNumbers.Count} emails");
}

For advanced classification, see Classify Documents with Custom Categories.
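
To turn the summary into a review priority list, you can rank categories by volume and show each one's share of the archive. This sketch uses only LINQ over a dictionary shaped like the `report` built above; the sample numbers are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sample data in the same shape as the report dictionary:
// category name -> 1-based email numbers assigned to it.
var sampleReport = new Dictionary<string, List<int>>
{
    ["Legal and Contracts"] = new List<int> { 2, 5, 9 },
    ["Human Resources"] = new List<int> { 1 },
    ["General Correspondence"] = new List<int> { 3, 4, 6, 7, 8, 10 },
};

int totalEmails = sampleReport.Values.Sum(list => list.Count);

// Largest categories first, with each category's percentage of all emails.
var ranked = sampleReport
    .OrderByDescending(kv => kv.Value.Count)
    .Select(kv => (Category: kv.Key,
                   Count: kv.Value.Count,
                   Share: totalEmails == 0 ? 0.0 : 100.0 * kv.Value.Count / totalEmails))
    .ToList();

foreach (var row in ranked)
{
    Console.WriteLine($"  {row.Category,-24} {row.Count,4} emails  {row.Share,5:F1}%");
}
```

Because the report keeps the email numbers for each category, the same ordering can drive which emails are handed to reviewers first.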


Step 7: Index Emails for Natural-Language Search

Make the entire archive searchable by indexing email content into a RAG vector database. This lets investigators ask questions in plain English instead of constructing keyword searches.

using System.Text;
using LMKit.Model;
using LMKit.Data;
using LMKit.Retrieval;

LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load embedding model
// ──────────────────────────────────────
using LM embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rEmbedding model: {(double)read / len.Value * 100:F1}%");
        return true;
    });
Console.WriteLine();

// ──────────────────────────────────────
// 2. Build the index
// ──────────────────────────────────────
var ragEngine = new RagEngine(embeddingModel);

var archive = new Attachment("archive.mbox");
Console.WriteLine($"Indexing {archive.PageCount} emails...");

for (int i = 0; i < archive.PageCount; i++)
{
    string emailText = archive.GetPageText(i);
    ragEngine.ImportText(emailText, "email-archive", $"email-{i + 1}");
    Console.Write($"\r  Indexed {i + 1}/{archive.PageCount}");
}
Console.WriteLine(" done.\n");

// ──────────────────────────────────────
// 3. Search with natural language
// ──────────────────────────────────────
string[] queries =
{
    "emails discussing contract renewal terms",
    "messages containing financial projections for Q3",
    "correspondence about employee termination"
};

foreach (string query in queries)
{
    Console.WriteLine($"Query: \"{query}\"");
    var results = ragEngine.FindMatchingPartitions(query, topK: 3, minScore: 0.3f);

    foreach (var match in results)
    {
        Console.WriteLine($"  Score: {match.Score:F3} | Section: {match.SectionId}");
    }
    Console.WriteLine();
}

For advanced retrieval features, see Build a RAG Pipeline Over Your Own Documents and Boost Retrieval with Hybrid Search.
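
The search loop prints section IDs such as "email-42", but an investigator usually wants to open the matching message. Because the indexing loop names sections "email-N" with N starting at 1, the page index can be recovered from the ID. TryGetPageIndex below is a hypothetical helper written in plain .NET; it assumes the naming scheme used in this guide.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: recovers the zero-based page index from a section ID
// of the form "email-N", where N is the 1-based number used during indexing.
static bool TryGetPageIndex(string sectionId, out int pageIndex)
{
    const string prefix = "email-";
    pageIndex = -1;

    if (sectionId is null || !sectionId.StartsWith(prefix, StringComparison.Ordinal))
        return false;

    // NumberStyles.None accepts digits only (no sign, no whitespace).
    if (!int.TryParse(sectionId.AsSpan(prefix.Length), NumberStyles.None,
                      CultureInfo.InvariantCulture, out int number) || number < 1)
        return false;

    pageIndex = number - 1;
    return true;
}

foreach (string id in new[] { "email-42", "email-0", "notes-3" })
{
    Console.WriteLine(TryGetPageIndex(id, out int index)
        ? $"{id} -> page index {index}"
        : $"{id} -> not an email section ID");
}
```

With a valid index, archive.GetPageText(index) returns the text of the matching email, which you can truncate for a preview next to the score.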


Step 8: Complete Pipeline

Here is the full pipeline combining parsing, PII detection, classification, and indexing into a single pass.

using System.Text;
using LMKit.Model;
using LMKit.Data;
using LMKit.TextAnalysis;
using LMKit.Retrieval;
using LMKit.Document.Conversion;

LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load models
// ──────────────────────────────────────
Console.WriteLine("Loading models...");
using LM chatModel = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rChat model: {(double)read / len.Value * 100:F1}%");
        return true;
    });
Console.WriteLine();

using LM embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rEmbedding model: {(double)read / len.Value * 100:F1}%");
        return true;
    });
Console.WriteLine("\nModels loaded.\n");

// ──────────────────────────────────────
// 2. Initialize processing engines
// ──────────────────────────────────────
var piiDetector = new PiiExtraction(chatModel);
var categorizer = new Categorization(chatModel);
var ragEngine = new RagEngine(embeddingModel);

var categories = new List<string>
{
    "Legal", "Finance", "HR", "Sales", "Technical", "General"
};

// ──────────────────────────────────────
// 3. Process each email
// ──────────────────────────────────────
var archive = new Attachment("archive.mbox");
int totalPii = 0;

Console.WriteLine($"Processing {archive.PageCount} emails...\n");

for (int i = 0; i < archive.PageCount; i++)
{
    string emailText = archive.GetPageText(i);
    string emailId = $"email-{i + 1}";

    // Classify
    int catIndex = categorizer.GetBestCategory(categories, emailText);
    string category = categories[catIndex];

    // Detect PII
    piiDetector.SetContent(emailText);
    var piiEntities = piiDetector.ExtractEntities();
    totalPii += piiEntities.Count;

    // Index for search
    ragEngine.ImportText(emailText, "compliance-review", emailId);

    // Report
    string piiFlag = piiEntities.Count > 0 ? $" | PII: {piiEntities.Count} entities" : "";
    Console.WriteLine($"  [{emailId}] {category}{piiFlag}");
}

Console.WriteLine($"\nProcessing complete.");
Console.WriteLine($"  Emails processed: {archive.PageCount}");
Console.WriteLine($"  Total PII entities: {totalPii}");
Console.WriteLine($"  Index ready for search.\n");

// ──────────────────────────────────────
// 4. Interactive search
// ──────────────────────────────────────
Console.WriteLine("Enter a search query (or 'quit' to exit):");
while (true)
{
    Console.Write("> ");
    string? query = Console.ReadLine();
    if (string.IsNullOrEmpty(query) || query.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;

    var results = ragEngine.FindMatchingPartitions(query, topK: 5, minScore: 0.3f);
    if (results.Count == 0)
    {
        Console.WriteLine("  No matching emails found.\n");
        continue;
    }

    foreach (var match in results)
    {
        Console.WriteLine($"  [{match.SectionId}] Score: {match.Score:F3}");
    }
    Console.WriteLine();
}
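
The pipeline above reports to the console only. For an audit trail, the same per-email results can be written to a CSV file. This is a minimal sketch in plain .NET: the row shape (email ID, category, PII count), the CsvField and BuildCsv helpers, and the output file name are all illustrative choices, and the sample rows are invented.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

// Quotes a field when it contains a comma, quote, or line break (RFC 4180 style).
static string CsvField(string value)
{
    if (value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0) return value;
    return "\"" + value.Replace("\"", "\"\"") + "\"";
}

// Hypothetical row shape: one line per processed email.
static string BuildCsv(IEnumerable<(string EmailId, string Category, int PiiCount)> rows)
{
    var sb = new StringBuilder();
    sb.Append("EmailId,Category,PiiCount\r\n");
    foreach (var (emailId, category, piiCount) in rows)
    {
        sb.Append(CsvField(emailId)).Append(',')
          .Append(CsvField(category)).Append(',')
          .Append(piiCount.ToString(CultureInfo.InvariantCulture)).Append("\r\n");
    }
    return sb.ToString();
}

// Sample rows; the second category contains a comma to show quoting.
var sampleRows = new List<(string EmailId, string Category, int PiiCount)>
{
    ("email-1", "Legal", 0),
    ("email-2", "Finance, Q3", 3),
};

string csv = BuildCsv(sampleRows);
Console.Write(csv);
// To persist the report: System.IO.File.WriteAllText("compliance-report.csv", csv);
```

In the pipeline, you would add one row inside the processing loop from emailId, category, and piiEntities.Count, then write the file after the loop finishes.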

Common Issues

| Problem | Cause | Fix |
| ------- | ----- | --- |
| Attachment throws on .eml file | File is not valid RFC 5322 | Verify the file opens in an email client; some exports use .msg format (not supported) |
| MBOX PageCount is 0 | File does not use standard From line separators | Ensure the export uses standard MBOX format |
| PII detection misses domain-specific IDs | Built-in types do not cover your identifiers | Add custom PiiEntityDefinition entries for patient IDs, account codes, etc. |
| RAG search returns low scores | Emails are too short for meaningful embedding | Lower minScore or use hybrid search combining BM25 with vectors |
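
When PageCount is 0, a quick check is to count the lines that start with "From " (capital F, trailing space), which is how common MBOX variants separate messages. The sketch below streams the file line by line in plain .NET, so it also works on large exports. CountFromSeparators is a hypothetical helper, and the count is only approximate for MBOX variants that do not escape body lines beginning with "From ".

```csharp
using System;
using System.IO;

// Counts lines that begin with "From " (the MBOX message separator).
// Escaped body lines such as ">From ..." are not counted.
static int CountFromSeparators(TextReader reader)
{
    int count = 0;
    string? line;
    while ((line = reader.ReadLine()) != null)
    {
        if (line.StartsWith("From ", StringComparison.Ordinal))
            count++;
    }
    return count;
}

// Small in-memory sample; for a real file use File.OpenText("archive.mbox").
string sampleMbox =
    "From alice@example.com Mon Jan  6 09:15:00 2025\n" +
    "Subject: Q3 forecast\n\n" +
    "Numbers attached.\n\n" +
    "From bob@example.com Mon Jan  6 10:02:00 2025\n" +
    "Subject: Re: Q3 forecast\n\n" +
    ">From my side this looks fine.\n";

int separators = CountFromSeparators(new StringReader(sampleMbox));
Console.WriteLine($"Separator lines found: {separators}"); // Separator lines found: 2
```

A result of 0 on your export suggests the file is not in MBOX format at all, or that it was saved with a different container format under an .mbox extension.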

Next Steps
