Build a Real-Time Document Monitoring and Indexing Agent
Law firms, financial institutions, and healthcare organizations receive a constant stream of documents: contracts, invoices, lab reports, correspondence. Manually sorting, classifying, and indexing each file is slow, and mistakes lead to missed deadlines or compliance violations. This guide builds a local document monitoring service that watches an intake folder, automatically classifies each new file, extracts structured data, and indexes the content into a searchable RAG knowledge base. All processing runs on your machine with no cloud dependencies.
Why Automated Document Monitoring Matters
Two enterprise problems that a local document monitoring agent solves:
- Law firm document intake at scale. A litigation team receives hundreds of documents per week: contracts, exhibits, deposition transcripts, and correspondence. Each must be classified by matter, indexed for search, and flagged for privileged content. A monitoring agent processes files as they arrive, classifying them instantly and making them searchable within minutes, not days.
- Financial compliance with audit trails. Accounting departments receive invoices, receipts, and statements from multiple sources. Regulations require each document to be categorized, checked for completeness, and stored with structured metadata. A monitoring agent automates this triage, reducing human error and creating a searchable archive for auditors.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 4+ GB (chat model) + 1 GB (embedding model) |
| Disk | ~5 GB for model downloads |
| Document formats | PDF, DOCX, EML, images, or plain text |
Step 1: Create the Project
dotnet new console -n DocumentMonitor
cd DocumentMonitor
dotnet add package LM-Kit.NET
Step 2: Understand the Pipeline
The monitoring agent reacts to new files and processes them through a classification, extraction, and indexing pipeline.
┌───────────────────┐
│ Intake Folder │ (FileSystemWatcher monitors for new files)
│ /documents/new/ │
└────────┬──────────┘
│ New file detected
▼
┌───────────────────┐
│ 1. Load Document │ Attachment handles PDF, DOCX, EML, images, text
└────────┬──────────┘
│
▼
┌──────────────────┐
│ 2. Classify │ Categorization assigns a topic label
└────────┬─────────┘
│
▼
┌──────────────────┐
│ 3. Extract │ TextExtraction pulls structured fields
└────────┬─────────┘
│
▼
┌──────────────────┐
│ 4. Index │ RagEngine indexes for natural-language search
└────────┬─────────┘
│
▼
┌──────────────────┐
│ 5. Move to │ File archived after processing
│ /processed/ │
└──────────────────┘
Step 3: Set Up File System Monitoring
Use .NET's FileSystemWatcher to detect new files as they arrive in an intake folder.
using System.Text;
Console.OutputEncoding = Encoding.UTF8;
string intakePath = Path.Combine(AppContext.BaseDirectory, "documents", "new");
string processedPath = Path.Combine(AppContext.BaseDirectory, "documents", "processed");
Directory.CreateDirectory(intakePath);
Directory.CreateDirectory(processedPath);
using var watcher = new FileSystemWatcher(intakePath);
watcher.IncludeSubdirectories = false;
// React to new files (subscribe before enabling events so none are missed)
watcher.Created += (sender, e) =>
{
Console.WriteLine($"New file detected: {e.Name}");
// Processing happens here (see next steps)
};
watcher.EnableRaisingEvents = true;
Console.WriteLine($"Monitoring: {intakePath}");
Console.WriteLine("Drop files into the folder to process them. Press Enter to stop.");
Console.ReadLine();
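One caveat: FileSystemWatcher raises Created as soon as the directory entry appears, which can be before the writing process has finished. A small polling helper is more robust than a fixed delay; the sketch below uses only the .NET base library, and the `WaitForFileReadyAsync` name is my own, not part of LM-Kit:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

static class FileReadiness
{
    // Polls until the file can be opened exclusively, meaning no other
    // process still holds a write handle. Returns false on timeout.
    public static async Task<bool> WaitForFileReadyAsync(
        string path, int maxAttempts = 10, int delayMs = 500)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            try
            {
                using var stream = new FileStream(
                    path, FileMode.Open, FileAccess.Read, FileShare.None);
                return true; // opened exclusively: the writer is done
            }
            catch (IOException)
            {
                // Still locked by the writer; back off and retry
                await Task.Delay(delayMs);
            }
        }
        return false;
    }
}
```

Awaiting this helper at the top of the Created handler avoids "file in use" errors on large or slow copies.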
Step 4: Classify Incoming Documents
Classification assigns each document to a category so it can be routed to the right team or stored in the correct folder.
using LMKit.Model;
using LMKit.Data;
using LMKit.TextAnalysis;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
using LM model = LM.LoadFromModelID("gemma3:4b");
var categorizer = new Categorization(model);
var categories = new List<string>
{
"Invoice",
"Contract",
"Correspondence",
"Legal Filing",
"Financial Statement",
"Technical Report"
};
// Classify a document
var doc = new Attachment("documents/new/sample.pdf");
string docText = doc.GetText();
int bestIndex = categorizer.GetBestCategory(categories, docText);
string label = categories[bestIndex];
Console.WriteLine($"Document classified as: {label}");
For more classification options, see Classify Documents with Custom Categories.
Step 5: Extract Structured Data
Use structured data extraction to pull specific fields from each document type. Define different extraction schemas per category.
using LMKit.Model;
using LMKit.Data;
using LMKit.Extraction;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
using LM model = LM.LoadFromModelID("gemma3:4b");
var extractor = new TextExtraction(model);
// Define fields for invoices
var invoiceFields = new List<TextExtractionElement>
{
new TextExtractionElement("vendor_name", ElementType.String),
new TextExtractionElement("invoice_number", ElementType.String),
new TextExtractionElement("invoice_date", ElementType.Date),
new TextExtractionElement("total_amount", ElementType.Double),
new TextExtractionElement("currency", ElementType.String)
};
// Define fields for contracts
var contractFields = new List<TextExtractionElement>
{
new TextExtractionElement("parties", ElementType.Array),
new TextExtractionElement("effective_date", ElementType.Date),
new TextExtractionElement("expiration_date", ElementType.Date),
new TextExtractionElement("contract_type", ElementType.String),
new TextExtractionElement("governing_law", ElementType.String)
};
// Extract from a document
var doc = new Attachment("documents/new/invoice.pdf");
extractor.Elements = invoiceFields;
extractor.SetContent(doc);
TextExtractionResult result = extractor.Parse();
Console.WriteLine($"Extracted data:\n{result.Json}");
For more extraction techniques, see Extract Structured Data from Unstructured Text and Extract Invoice Data from PDFs and Images.
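Regulated archives usually need the extracted fields stored alongside the document itself. One simple approach, sketched here with only the .NET base library (the `MetadataSidecar` name is illustrative, not an LM-Kit API), is to write the JSON from `TextExtractionResult.Json` to a sidecar file:

```csharp
using System.IO;

static class MetadataSidecar
{
    // Writes the extraction JSON next to the document, e.g.
    // "invoice.pdf" -> "invoice.metadata.json", and returns the sidecar path.
    public static string Save(string documentPath, string extractedJson)
    {
        string sidecarPath = Path.ChangeExtension(documentPath, ".metadata.json");
        File.WriteAllText(sidecarPath, extractedJson);
        return sidecarPath;
    }
}
```

After Parse() succeeds, saving `result.Json` this way keeps the structured metadata next to the archived file, where auditors and downstream tools can read it without rerunning extraction.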
Step 6: Index Into a Searchable Knowledge Base
Feed processed documents into a RAG engine so the entire archive becomes searchable with natural-language queries.
using LMKit.Model;
using LMKit.Data;
using LMKit.Retrieval;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
using LM embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b");
var ragEngine = new RagEngine(embeddingModel);
// Index a document
string filePath = "documents/new/contract.pdf";
var doc = new Attachment(filePath);
string category = "Contract"; // From classification step
string docId = Path.GetFileNameWithoutExtension(filePath);
ragEngine.ImportTextFromAttachment(doc, category, docId);
Console.WriteLine($"Indexed: {docId} under '{category}'");
// Later: search across all indexed documents
var results = ragEngine.FindMatchingPartitions(
"contracts expiring in 2025",
topK: 5,
minScore: 0.3f);
foreach (var match in results)
{
Console.WriteLine($" Score: {match.Score:F3} | Section: {match.SectionId}");
}
Step 7: Wire It All Together
Here is the complete monitoring service combining file watching, classification, extraction, and indexing.
using System.Text;
using LMKit.Model;
using LMKit.Data;
using LMKit.TextAnalysis;
using LMKit.Extraction;
using LMKit.Retrieval;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load models
// ──────────────────────────────────────
Console.WriteLine("Loading models...");
using LM chatModel = LM.LoadFromModelID("gemma3:4b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\rChat model: {(double)read / len.Value * 100:F1}%");
return true;
});
Console.WriteLine();
using LM embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\rEmbedding model: {(double)read / len.Value * 100:F1}%");
return true;
});
Console.WriteLine("\nModels loaded.\n");
// ──────────────────────────────────────
// 2. Initialize engines
// ──────────────────────────────────────
var categorizer = new Categorization(chatModel);
var extractor = new TextExtraction(chatModel);
var ragEngine = new RagEngine(embeddingModel);
var categories = new List<string>
{
"Invoice", "Contract", "Correspondence",
"Legal Filing", "Financial Statement", "Technical Report"
};
// Extraction schemas per category
var extractionSchemas = new Dictionary<string, List<TextExtractionElement>>
{
["Invoice"] = new()
{
new TextExtractionElement("vendor_name", ElementType.String),
new TextExtractionElement("invoice_number", ElementType.String),
new TextExtractionElement("total_amount", ElementType.Double),
new TextExtractionElement("invoice_date", ElementType.Date)
},
["Contract"] = new()
{
new TextExtractionElement("parties", ElementType.Array),
new TextExtractionElement("effective_date", ElementType.Date),
new TextExtractionElement("contract_type", ElementType.String)
}
};
// ──────────────────────────────────────
// 3. Set up folder monitoring
// ──────────────────────────────────────
string intakePath = Path.Combine(AppContext.BaseDirectory, "documents", "new");
string processedPath = Path.Combine(AppContext.BaseDirectory, "documents", "processed");
Directory.CreateDirectory(intakePath);
Directory.CreateDirectory(processedPath);
int processedCount = 0;
using var watcher = new FileSystemWatcher(intakePath);
watcher.EnableRaisingEvents = true;
watcher.Created += async (sender, e) =>
{
// Give the writer a moment to finish before opening the file
await Task.Delay(500);
if (!File.Exists(e.FullPath))
return;
string fileName = e.Name ?? Path.GetFileName(e.FullPath);
string docId = Path.GetFileNameWithoutExtension(fileName);
try
{
Console.WriteLine($"\n{new string('=', 50)}");
Console.WriteLine($"Processing: {fileName}");
// Load document
var doc = new Attachment(e.FullPath);
string docText = doc.GetText();
if (string.IsNullOrWhiteSpace(docText))
{
Console.WriteLine(" Skipped: no text content.");
return;
}
// Classify
int catIndex = categorizer.GetBestCategory(categories, docText);
string category = categories[catIndex];
Console.WriteLine($" Category: {category}");
// Extract (if schema exists for this category)
if (extractionSchemas.TryGetValue(category, out var fields))
{
extractor.Elements = fields;
extractor.SetContent(doc);
var extractionResult = extractor.Parse();
Console.WriteLine($" Extracted: {extractionResult.Json}");
}
// Index for search
ragEngine.ImportTextFromAttachment(doc, category, docId);
Console.WriteLine($" Indexed: {docId} in '{category}'");
// Move to processed folder
string destPath = Path.Combine(processedPath, fileName);
if (File.Exists(destPath))
destPath = Path.Combine(processedPath, $"{docId}_{DateTime.Now:yyyyMMddHHmmss}{Path.GetExtension(fileName)}");
doc.Dispose();
File.Move(e.FullPath, destPath);
Console.WriteLine($" Archived: {destPath}");
processedCount++;
}
catch (Exception ex)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine($" Error: {ex.Message}");
Console.ResetColor();
}
};
Console.WriteLine("Document Monitor started.");
Console.WriteLine($" Intake folder: {intakePath}");
Console.WriteLine($" Processed folder: {processedPath}");
Console.WriteLine($" Categories: {string.Join(", ", categories)}");
Console.WriteLine();
Console.WriteLine("Drop files into the intake folder. Type 'search' to query, 'quit' to stop.\n");
// ──────────────────────────────────────
// 4. Interactive search loop
// ──────────────────────────────────────
while (true)
{
Console.Write("> ");
string? input = Console.ReadLine();
if (input is null || input.Equals("quit", StringComparison.OrdinalIgnoreCase))
break;
if (input.Length == 0)
continue;
if (input.Equals("search", StringComparison.OrdinalIgnoreCase))
{
Console.Write("Query: ");
string? query = Console.ReadLine();
if (string.IsNullOrWhiteSpace(query))
continue;
var results = ragEngine.FindMatchingPartitions(query, topK: 5, minScore: 0.3f);
if (results.Count == 0)
{
Console.WriteLine(" No results found.\n");
continue;
}
foreach (var match in results)
{
Console.WriteLine($" [{match.SectionId}] Score: {match.Score:F3}");
}
Console.WriteLine();
}
else if (input.Equals("status", StringComparison.OrdinalIgnoreCase))
{
Console.WriteLine($" Documents processed: {processedCount}");
}
}
Console.WriteLine("Monitor stopped.");
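One gap worth noting: files already sitting in the intake folder when the service starts never raise a Created event. A startup sweep covers them; in this sketch, `processFile` stands in for the classify/extract/index logic from the handler above:

```csharp
using System;
using System.IO;

static class StartupSweep
{
    // Enumerates files already present in the intake folder so they can be
    // fed through the same pipeline as newly created ones.
    public static int Run(string intakePath, Action<string> processFile)
    {
        int count = 0;
        foreach (string path in Directory.EnumerateFiles(intakePath))
        {
            processFile(path); // same classify/extract/index logic as the handler
            count++;
        }
        return count;
    }
}
```

Run the sweep once after creating the watcher but before entering the input loop, passing in the same processing logic the Created handler uses.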
Supported Document Formats
The Attachment class auto-detects file format by content, not just extension. All these formats are handled transparently:
| Format | Extensions | Notes |
|---|---|---|
| PDF | .pdf | Text extraction with layout preservation |
| Word | .docx | Full text and metadata extraction |
| Email | .eml, .mbox | Headers, body, and attachment extraction (see Process Email Archives) |
| Images | .png, .jpg, .bmp, .gif, .webp | Requires OCR or VLM for text |
| HTML | .html | Cleaned text extraction |
| Plain text | .txt | Direct text processing |
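The watcher itself does not filter by type; if you want to skip unsupported files before they reach the model pipeline, a small allow-list check in the Created handler is enough. The sketch below is illustrative and simply mirrors the extensions in the table above:

```csharp
using System;
using System.IO;
using System.Linq;

static class FormatFilter
{
    // Allow-list mirroring the supported-formats table.
    private static readonly string[] SupportedExtensions =
    {
        ".pdf", ".docx", ".eml", ".mbox",
        ".png", ".jpg", ".bmp", ".gif", ".webp",
        ".html", ".txt"
    };

    public static bool IsSupported(string path) =>
        SupportedExtensions.Contains(
            Path.GetExtension(path), StringComparer.OrdinalIgnoreCase);
}
```

Calling this at the top of the handler lets unrecognized files fall through to a quarantine folder instead of being sent to the models.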
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| File locked on move | Application or OS still writing | Increase the delay after Created event (e.g., Task.Delay(1000)) |
| Classification is inconsistent | Short documents lack context | Supply category descriptions via the GetBestCategory overload that accepts them |
| Extraction returns nulls | Document does not contain the expected fields | Use NullOnDoubt = true (default) and check for null values in results |
| Memory grows over time | Embedding index accumulates in memory | Periodically save the RAG index or use an external vector store like Qdrant |
Next Steps
- Build a Multi-Format Document Ingestion Pipeline for advanced ingestion patterns
- Process Email Archives for Compliance and Legal Discovery to add email processing
- Build a Persistent Document Knowledge Base with Vector Storage for durable storage
- Classify Documents with Custom Categories for fine-grained classification
- Extract Invoice Data from PDFs and Images for specialized invoice processing
- Automate Contract and Compliance Document Review for compliance workflows