Build a Real-Time Document Monitoring and Indexing Agent
Law firms, financial institutions, and healthcare organizations receive a constant stream of documents: contracts, invoices, lab reports, correspondence. Manually sorting, classifying, and indexing each file is slow, and mistakes lead to missed deadlines or compliance violations. This guide builds a local document monitoring service that watches an intake folder, automatically classifies each new file, extracts structured data, and indexes the content into a searchable RAG knowledge base. All processing runs on your machine with no cloud dependencies.
Why Automated Document Monitoring Matters
Two enterprise problems that a local document monitoring agent solves:
- Law firm document intake at scale. A litigation team receives hundreds of documents per week: contracts, exhibits, deposition transcripts, and correspondence. Each must be classified by matter, indexed for search, and flagged for privileged content. A monitoring agent processes files as they arrive, classifying them instantly and making them searchable within minutes, not days.
- Financial compliance with audit trails. Accounting departments receive invoices, receipts, and statements from multiple sources. Regulations require each document to be categorized, checked for completeness, and stored with structured metadata. A monitoring agent automates this triage, reducing human error and creating a searchable archive for auditors.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 4+ GB (chat model) + 1 GB (embedding model) |
| Disk | ~5 GB for model downloads |
| Document formats | PDF, DOCX, EML, images, or plain text |
Step 1: Create the Project
dotnet new console -n DocumentMonitor
cd DocumentMonitor
dotnet add package LM-Kit.NET
Step 2: Understand the Pipeline
The monitoring agent reacts to new files and processes them through a classification, extraction, and indexing pipeline.
┌───────────────────┐
│ Intake Folder │ (FileSystemWatcher monitors for new files)
│ /documents/new/ │
└────────┬──────────┘
│ New file detected
▼
┌───────────────────┐
│ 1. Load Document │ Attachment handles PDF, DOCX, EML, images, text
└────────┬──────────┘
│
▼
┌──────────────────┐
│ 2. Classify │ Categorization assigns a topic label
└────────┬─────────┘
│
▼
┌──────────────────┐
│ 3. Extract │ TextExtraction pulls structured fields
└────────┬─────────┘
│
▼
┌──────────────────┐
│ 4. Index │ RagEngine indexes for natural-language search
└────────┬─────────┘
│
▼
┌──────────────────┐
│ 5. Move to │ File archived after processing
│ /processed/ │
└──────────────────┘
Step 3: Set Up File System Monitoring
Use .NET's FileSystemWatcher to detect new files as they arrive in an intake folder.
using System.Text;
Console.OutputEncoding = Encoding.UTF8;
string intakePath = Path.Combine(AppContext.BaseDirectory, "documents", "new");
string processedPath = Path.Combine(AppContext.BaseDirectory, "documents", "processed");
Directory.CreateDirectory(intakePath);
Directory.CreateDirectory(processedPath);
using var watcher = new FileSystemWatcher(intakePath);
watcher.IncludeSubdirectories = false;
// React to new files (subscribe before enabling events so none are missed)
watcher.Created += (sender, e) =>
{
Console.WriteLine($"New file detected: {e.Name}");
// Processing happens here (see next steps)
};
watcher.EnableRaisingEvents = true;
Console.WriteLine($"Monitoring: {intakePath}");
Console.WriteLine("Drop files into the folder to process them. Press Enter to stop.");
Console.ReadLine();
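One caveat: FileSystemWatcher raises Created as soon as the directory entry appears, which can be before the writing process has finished. A small polling helper is more robust than a fixed delay; the sketch below uses only the .NET base library, and the `WaitForFileReadyAsync` name is my own, not part of LM-Kit:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

static class FileReadiness
{
    // Polls until the file can be opened exclusively, meaning no other
    // process still holds a write handle. Returns false on timeout.
    public static async Task<bool> WaitForFileReadyAsync(
        string path, int maxAttempts = 10, int delayMs = 500)
    {
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            try
            {
                using var stream = new FileStream(
                    path, FileMode.Open, FileAccess.Read, FileShare.None);
                return true; // opened exclusively: the writer is done
            }
            catch (IOException)
            {
                // Still locked by the writer; back off and retry
                await Task.Delay(delayMs);
            }
        }
        return false;
    }
}
```

Awaiting this helper at the top of the Created handler avoids "file in use" errors on large or slow copies.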
Step 4: Classify Incoming Documents
Classification assigns each document to a category so it can be routed to the right team or stored in the correct folder.
using LMKit.Model;
using LMKit.Data;
using LMKit.TextAnalysis;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
using LM model = LM.LoadFromModelID("gemma3:4b");
var categorizer = new Categorization(model);
var categories = new List<string>
{
"Invoice",
"Contract",
"Correspondence",
"Legal Filing",
"Financial Statement",
"Technical Report"
};
// Classify a document
var doc = new Attachment("documents/new/sample.pdf");
string docText = doc.GetText();
int bestIndex = categorizer.GetBestCategory(categories, docText);
string label = categories[bestIndex];
Console.WriteLine($"Document classified as: {label}");
For more classification options, see Classify Documents with Custom Categories.
Step 5: Extract Structured Data
Use structured data extraction to pull specific fields from each document type. Define different extraction schemas per category.
using LMKit.Model;
using LMKit.Data;
using LMKit.Extraction;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
using LM model = LM.LoadFromModelID("gemma3:4b");
var extractor = new TextExtraction(model);
// Define fields for invoices
var invoiceFields = new List<TextExtractionElement>
{
new TextExtractionElement("vendor_name", ElementType.String),
new TextExtractionElement("invoice_number", ElementType.String),
new TextExtractionElement("invoice_date", ElementType.Date),
new TextExtractionElement("total_amount", ElementType.Double),
new TextExtractionElement("currency", ElementType.String)
};
// Define fields for contracts
var contractFields = new List<TextExtractionElement>
{
new TextExtractionElement("parties", ElementType.Array),
new TextExtractionElement("effective_date", ElementType.Date),
new TextExtractionElement("expiration_date", ElementType.Date),
new TextExtractionElement("contract_type", ElementType.String),
new TextExtractionElement("governing_law", ElementType.String)
};
// Extract from a document
var doc = new Attachment("documents/new/invoice.pdf");
extractor.Elements = invoiceFields;
extractor.SetContent(doc);
TextExtractionResult result = extractor.Parse();
Console.WriteLine($"Extracted data:\n{result.Json}");
For more extraction techniques, see Extract Structured Data from Unstructured Text and Extract Invoice Data from PDFs and Images.
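Regulated archives usually need the extracted fields stored alongside the document itself. One simple approach, sketched here with only the .NET base library (the `MetadataSidecar` name is illustrative, not an LM-Kit API), is to write the JSON from `TextExtractionResult.Json` to a sidecar file:

```csharp
using System.IO;

static class MetadataSidecar
{
    // Writes the extraction JSON next to the document, e.g.
    // "invoice.pdf" -> "invoice.metadata.json", and returns the sidecar path.
    public static string Save(string documentPath, string extractedJson)
    {
        string sidecarPath = Path.ChangeExtension(documentPath, ".metadata.json");
        File.WriteAllText(sidecarPath, extractedJson);
        return sidecarPath;
    }
}
```

After Parse() succeeds, saving `result.Json` this way keeps the structured metadata next to the archived file, where auditors and downstream tools can read it without rerunning extraction.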
Step 6: Index Into a Searchable Knowledge Base
Feed processed documents into a RAG engine so the entire archive becomes searchable with natural-language queries.
using LMKit.Model;
using LMKit.Data;
using LMKit.Retrieval;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
using LM embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b");
var ragEngine = new RagEngine(embeddingModel);
// Index a document
string filePath = "documents/new/contract.pdf";
var doc = new Attachment(filePath);
string category = "Contract"; // From classification step
string docId = Path.GetFileNameWithoutExtension(filePath);
ragEngine.ImportTextFromAttachment(doc, category, docId);
Console.WriteLine($"Indexed: {docId} under '{category}'");
// Later: search across all indexed documents
var results = ragEngine.FindMatchingPartitions(
"contracts expiring in 2025",
topK: 5,
minScore: 0.3f);
foreach (var match in results)
{
Console.WriteLine($" Score: {match.Score:F3} | Section: {match.SectionId}");
}
Step 7: Wire It All Together
Here is the complete monitoring service combining file watching, classification, extraction, and indexing.
using System.Text;
using LMKit.Model;
using LMKit.Data;
using LMKit.TextAnalysis;
using LMKit.Extraction;
using LMKit.Retrieval;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load models
// ──────────────────────────────────────
Console.WriteLine("Loading models...");
using LM chatModel = LM.LoadFromModelID("gemma3:4b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\rChat model: {(double)read / len.Value * 100:F1}%");
return true;
});
Console.WriteLine();
using LM embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\rEmbedding model: {(double)read / len.Value * 100:F1}%");
return true;
});
Console.WriteLine("\nModels loaded.\n");
// ──────────────────────────────────────
// 2. Initialize engines
// ──────────────────────────────────────
var categorizer = new Categorization(chatModel);
var extractor = new TextExtraction(chatModel);
var ragEngine = new RagEngine(embeddingModel);
var categories = new List<string>
{
"Invoice", "Contract", "Correspondence",
"Legal Filing", "Financial Statement", "Technical Report"
};
// Extraction schemas per category
var extractionSchemas = new Dictionary<string, List<TextExtractionElement>>
{
["Invoice"] = new()
{
new TextExtractionElement("vendor_name", ElementType.String),
new TextExtractionElement("invoice_number", ElementType.String),
new TextExtractionElement("total_amount", ElementType.Double),
new TextExtractionElement("invoice_date", ElementType.Date)
},
["Contract"] = new()
{
new TextExtractionElement("parties", ElementType.Array),
new TextExtractionElement("effective_date", ElementType.Date),
new TextExtractionElement("contract_type", ElementType.String)
}
};
// ──────────────────────────────────────
// 3. Set up folder monitoring
// ──────────────────────────────────────
string intakePath = Path.Combine(AppContext.BaseDirectory, "documents", "new");
string processedPath = Path.Combine(AppContext.BaseDirectory, "documents", "processed");
Directory.CreateDirectory(intakePath);
Directory.CreateDirectory(processedPath);
int processedCount = 0;
using var watcher = new FileSystemWatcher(intakePath);
watcher.EnableRaisingEvents = true;
watcher.Created += async (sender, e) =>
{
// Give the writer a moment to finish before opening the file
await Task.Delay(500);
if (!File.Exists(e.FullPath))
return;
string fileName = e.Name ?? Path.GetFileName(e.FullPath);
string docId = Path.GetFileNameWithoutExtension(fileName);
try
{
Console.WriteLine($"\n{new string('=', 50)}");
Console.WriteLine($"Processing: {fileName}");
// Load document
var doc = new Attachment(e.FullPath);
string docText = doc.GetText();
if (string.IsNullOrWhiteSpace(docText))
{
Console.WriteLine(" Skipped: no text content.");
return;
}
// Classify
int catIndex = categorizer.GetBestCategory(categories, docText);
string category = categories[catIndex];
Console.WriteLine($" Category: {category}");
// Extract (if schema exists for this category)
if (extractionSchemas.TryGetValue(category, out var fields))
{
extractor.Elements = fields;
extractor.SetContent(doc);
var extractionResult = extractor.Parse();
Console.WriteLine($" Extracted: {extractionResult.Json}");
}
// Index for search
ragEngine.ImportTextFromAttachment(doc, category, docId);
Console.WriteLine($" Indexed: {docId} in '{category}'");
// Move to processed folder
string destPath = Path.Combine(processedPath, fileName);
if (File.Exists(destPath))
destPath = Path.Combine(processedPath, $"{docId}_{DateTime.Now:yyyyMMddHHmmss}{Path.GetExtension(fileName)}");
doc.Dispose();
File.Move(e.FullPath, destPath);
Console.WriteLine($" Archived: {destPath}");
processedCount++;
}
catch (Exception ex)
{
Console.ForegroundColor = ConsoleColor.Red;
Console.WriteLine($" Error: {ex.Message}");
Console.ResetColor();
}
};
Console.WriteLine("Document Monitor started.");
Console.WriteLine($" Intake folder: {intakePath}");
Console.WriteLine($" Processed folder: {processedPath}");
Console.WriteLine($" Categories: {string.Join(", ", categories)}");
Console.WriteLine();
Console.WriteLine("Drop files into the intake folder. Type 'search' to query, 'quit' to stop.\n");
// ──────────────────────────────────────
// 4. Interactive search loop
// ──────────────────────────────────────
while (true)
{
Console.Write("> ");
string? input = Console.ReadLine();
if (input is null || input.Equals("quit", StringComparison.OrdinalIgnoreCase))
break;
if (input.Length == 0)
continue;
if (input.Equals("search", StringComparison.OrdinalIgnoreCase))
{
Console.Write("Query: ");
string? query = Console.ReadLine();
if (string.IsNullOrWhiteSpace(query))
continue;
var results = ragEngine.FindMatchingPartitions(query, topK: 5, minScore: 0.3f);
if (results.Count == 0)
{
Console.WriteLine(" No results found.\n");
continue;
}
foreach (var match in results)
{
Console.WriteLine($" [{match.SectionId}] Score: {match.Score:F3}");
}
Console.WriteLine();
}
else if (input.Equals("status", StringComparison.OrdinalIgnoreCase))
{
Console.WriteLine($" Documents processed: {processedCount}");
}
}
Console.WriteLine("Monitor stopped.");
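One gap worth noting: files already sitting in the intake folder when the service starts never raise a Created event. A startup sweep covers them; in this sketch, `processFile` stands in for the classify/extract/index logic from the handler above:

```csharp
using System;
using System.IO;

static class StartupSweep
{
    // Enumerates files already present in the intake folder so they can be
    // fed through the same pipeline as newly created ones.
    public static int Run(string intakePath, Action<string> processFile)
    {
        int count = 0;
        foreach (string path in Directory.EnumerateFiles(intakePath))
        {
            processFile(path); // same classify/extract/index logic as the handler
            count++;
        }
        return count;
    }
}
```

Run the sweep once after creating the watcher but before entering the input loop, passing in the same processing logic the Created handler uses.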
Supported Document Formats
The Attachment class auto-detects file format by content, not just extension. All these formats are handled transparently:
| Format | Extensions | Notes |
|---|---|---|
| PDF | .pdf | Text extraction with layout preservation |
| Word | .docx | Full text and metadata extraction |
| Email | .eml, .mbox | Headers, body, and attachment extraction (see Process Email Archives) |
| Images | .png, .jpg, .bmp, .gif, .webp | Requires OCR or VLM for text |
| HTML | .html | Cleaned text extraction |
| Plain text | .txt | Direct text processing |
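The watcher itself does not filter by type; if you want to skip unsupported files before they reach the model pipeline, a small allow-list check in the Created handler is enough. The sketch below is illustrative and simply mirrors the extensions in the table above:

```csharp
using System;
using System.IO;
using System.Linq;

static class FormatFilter
{
    // Allow-list mirroring the supported-formats table.
    private static readonly string[] SupportedExtensions =
    {
        ".pdf", ".docx", ".eml", ".mbox",
        ".png", ".jpg", ".bmp", ".gif", ".webp",
        ".html", ".txt"
    };

    public static bool IsSupported(string path) =>
        SupportedExtensions.Contains(
            Path.GetExtension(path), StringComparer.OrdinalIgnoreCase);
}
```

Calling this at the top of the handler lets unrecognized files fall through to a quarantine folder instead of being sent to the models.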
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| File locked on move | Application or OS still writing | Increase the delay after Created event (e.g., Task.Delay(1000)) |
| Classification is inconsistent | Short documents lack context | Supply category descriptions via the GetBestCategory overload that accepts them |
| Extraction returns nulls | Document does not contain the expected fields | Use NullOnDoubt = true (default) and check for null values in results |
| Memory grows over time | Embedding index accumulates in memory | Periodically save the RAG index or use an external vector store like Qdrant |
Next Steps
- Build a Multi-Format Document Ingestion Pipeline for advanced ingestion patterns
- Process Email Archives for Compliance and Legal Discovery to add email processing
- Build a Persistent Document Knowledge Base with Vector Storage for durable storage
- Classify Documents with Custom Categories for fine-grained classification
- Extract Invoice Data from PDFs and Images for specialized invoice processing
- Automate Contract and Compliance Document Review for compliance workflows