👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/text-analysis/named-entity-recognition/ner_entity_extractor
Multi-Document Entity Registry for C# .NET Applications
🎯 Purpose of the Demo
An interactive console app that builds a cross-document entity registry from a folder of mixed documents (PDF, DOCX, TXT, MD, EML, PNG, JPG, …). Each entity is recognised by LM-Kit.NET's NamedEntityRecognition engine and then deduplicated across the whole corpus, so "Acme Corp.", "Acme Corp", and "ACME CORP" collapse into a single registry row with full per-document provenance.
All processing runs on-device. No data leaves the host.
👥 Industry Target Audience
- Legal & compliance: party / address / date extraction across a contract set.
- Due diligence: who/what is named across the data room?
- Discovery / e-discovery: surface entities that recur across exhibits.
- Healthcare records: cross-document patient/medication mentions.
- Government / records management: redaction prep, registries.
🚀 Problem Solved
A per-document NER dump is the wrong artefact. The team asks: which entities show up across documents, and which document does each occurrence come from? This demo answers exactly that:
- Run NER per document with a confidence floor.
- Normalise each entity value (case-fold, trim trailing punctuation, collapse whitespace).
- Group by (label, normalised value) → one registry row per logical entity, carrying occurrence count, distinct-document count, max confidence, and the list of source documents.
- Keep every individual hit in a separate occurrences CSV so the audit trail is preserved.
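The normalise-and-group steps above can be sketched in plain C#. This is a minimal illustration, not the demo's actual code: the `Occurrence` and `RegistryRow` records and the `EntityRegistry` class are hypothetical names, upper-casing stands in for case-folding, the punctuation set is a guess, and picking the highest-confidence surface form as the representative is an assumption. The NER call itself is left out; the sketch starts from detections that have already been collected.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical shapes for illustration; the demo's own types may differ.
public record Occurrence(string Document, string Label, string Value, double Confidence);

public record RegistryRow(
    string Label,
    string Normalised,
    string Representative,
    int Occurrences,
    int DistinctDocuments,
    double MaxConfidence,
    IReadOnlyList<string> Documents);

public static class EntityRegistry
{
    // Collapse whitespace runs, trim trailing punctuation and spaces, then case-fold.
    public static string Normalise(string value)
    {
        string collapsed = Regex.Replace(value.Trim(), @"\s+", " ");
        string trimmed = collapsed.TrimEnd('.', ',', ';', ':', '!', '?', ' ');
        return trimmed.ToUpperInvariant();
    }

    // Drop detections below the confidence floor, then group by (label, normalised value).
    public static List<RegistryRow> Build(IEnumerable<Occurrence> hits, double minConfidence)
    {
        return hits
            .Where(h => h.Confidence >= minConfidence)
            .GroupBy(h => (h.Label, Normalised: Normalise(h.Value)))
            .Select(g => new RegistryRow(
                g.Key.Label,
                g.Key.Normalised,
                // Assumption: the representative is the highest-confidence surface form.
                g.OrderByDescending(h => h.Confidence).First().Value,
                g.Count(),
                g.Select(h => h.Document).Distinct().Count(),
                g.Max(h => h.Confidence),
                g.Select(h => h.Document).Distinct()
                    .OrderBy(d => d, StringComparer.Ordinal).ToList()))
            .OrderByDescending(r => r.DistinctDocuments)
            .ThenBy(r => r.Normalised, StringComparer.Ordinal)
            .ToList();
    }
}
```

With this sketch, "Acme Corp.", "Acme Corp", and "ACME  CORP" all normalise to the same key, so `EntityRegistry.Build(hits, 0.5)` would return one row for them while the original `Occurrence` list remains available for the audit CSV.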
💻 Application Overview
An interactive menu (no command-line arguments) offers three modes plus a Quit option. The model is loaded once at startup.
| Mode | What it does |
|---|---|
| Live | Paste a paragraph; entities appear immediately with confidence. |
| File | Recognise entities in a single document (any supported format). |
| Folder | Walk a folder, optionally recurse, apply a min-confidence filter, build the cross-document registry, emit CSVs. |
| Quit | Exit. |
Folder mode emits two artefacts:
- entities_registry.csv: one row per (label, normalised value) with columns label, normalised, representative, occurrences, distinct_documents, max_confidence, documents.
- entities_occurrences.csv: one row per detection with columns document, label, value, normalised, confidence.
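As a rough sketch of how a registry row could be serialised with the columns listed above, the helper below builds one CSV line. The `RegistryCsv` class is a hypothetical name, and two details are assumptions rather than the demo's confirmed behaviour: the list of source documents is joined with "; " inside a single field, and confidence is written with up to three decimals.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class RegistryCsv
{
    // Column order as listed for entities_registry.csv.
    public const string Header =
        "label,normalised,representative,occurrences,distinct_documents,max_confidence,documents";

    // Quote a field when it contains a comma, a double quote, or a line break,
    // doubling any embedded quotes (RFC 4180 style).
    public static string Escape(string field)
    {
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    public static string FormatRow(
        string label, string normalised, string representative,
        int occurrences, int distinctDocuments, double maxConfidence,
        IEnumerable<string> documents)
    {
        var fields = new[]
        {
            Escape(label),
            Escape(normalised),
            Escape(representative),
            occurrences.ToString(CultureInfo.InvariantCulture),
            distinctDocuments.ToString(CultureInfo.InvariantCulture),
            // Invariant culture keeps the decimal separator a dot on every host.
            maxConfidence.ToString("0.###", CultureInfo.InvariantCulture),
            // Assumption: documents are packed into one field, separated by "; ".
            Escape(string.Join("; ", documents)),
        };
        return string.Join(",", fields);
    }
}
```

Formatting numbers with the invariant culture matters here because a host locale that uses a decimal comma would otherwise split the confidence value across two CSV columns.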
✨ Key Features
- NamedEntityRecognition.Recognize(attachment | text): one call per document, multimodal under the hood (text or vision via VLM).
- Cross-document dedup: surface what recurs, not what's per-file noise.
- Confidence floor: ignore weak detections without writing custom code.
- Two-tier output: aggregated registry for the team, occurrence-level CSV for audit.
- Interactive, no flags: every input is a console prompt.
🧠 Supported Models
- Alibaba Qwen 3.5 2B (~2 GB VRAM) — fast default.
- Alibaba Qwen 3.5 4B (~3.5 GB VRAM).
- Alibaba Qwen 3.5 9B (~7 GB VRAM) — Recommended for accuracy.
- Google Gemma 4 E4B (~6 GB VRAM).
- Alibaba Qwen 3.6 27B (~18 GB VRAM).
- Alibaba Qwen 3.6 35B-A3B (~22 GB VRAM).
- Any custom model URI.
Supported Inputs
- Documents: PDF, DOCX, TXT, MD, EML.
- Images: PNG, JPG/JPEG, TIFF, BMP, WebP.
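For Folder mode, the walk over supported files could look roughly like the following. This is a sketch using only standard .NET file APIs; `CorpusScanner` is a hypothetical name, and the extension set is simply transcribed from the list above, so the demo's real filter may differ.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class CorpusScanner
{
    // Extensions transcribed from the Supported Inputs list (case-insensitive match).
    private static readonly HashSet<string> Supported = new(StringComparer.OrdinalIgnoreCase)
    {
        ".pdf", ".docx", ".txt", ".md", ".eml",
        ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp",
    };

    public static bool IsSupported(string path) =>
        Supported.Contains(Path.GetExtension(path));

    // Enumerate supported files, optionally descending into subfolders.
    public static IEnumerable<string> FindDocuments(string folder, bool recurse)
    {
        var option = recurse ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly;
        return Directory.EnumerateFiles(folder, "*", option)
            .Where(IsSupported)
            .OrderBy(p => p, StringComparer.OrdinalIgnoreCase);
    }
}
```

Each path returned by `CorpusScanner.FindDocuments(folder, recurse)` would then be handed to the recognition step one document at a time.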
🛠️ Getting Started
📋 Prerequisites
- .NET 8.0 or later
- VRAM appropriate to the selected model (roughly 2–22 GB for the models listed above)
▶️ Running the Application
git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/text-analysis/named-entity-recognition/ner_entity_extractor
dotnet run
Pick a model, then a mode. Everything else is a prompt.