
👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/text-analysis/named-entity-recognition/ner_entity_extractor

Multi-Document Entity Registry for C# .NET Applications


🎯 Purpose of the Demo

An interactive console app that builds a cross-document entity registry from a folder of mixed documents (PDF, DOCX, TXT, MD, EML, PNG, JPG, …). Each entity is recognised by LM-Kit.NET's NamedEntityRecognition engine and then deduplicated across the whole corpus, so "Acme Corp.", "Acme Corp", and "ACME CORP" collapse into a single registry row with full per-document provenance.

All processing runs on-device. No data leaves the host.


👥 Industry Target Audience

  • Legal & compliance: party / address / date extraction across a contract set.
  • Due diligence: who/what is named across the data room?
  • Discovery / e-discovery: surface entities that recur across exhibits.
  • Healthcare records: cross-document patient/medication mentions.
  • Government / records management: redaction prep, registries.

🚀 Problem Solved

A per-document NER dump is the wrong artefact. The team asks: which entities show up across documents, and which document does each occurrence come from? This demo answers exactly that:

  1. Run NER per document with a confidence floor.
  2. Normalise each entity value (case-fold, trim trailing punctuation, collapse whitespace).
  3. Group by (label, normalised value) → one registry row per logical entity, carrying occurrence count, distinct-document count, max confidence, and the list of source documents.
  4. Keep every individual hit in a separate occurrences CSV so the audit trail is preserved.
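
Steps 1 to 3 can be sketched in plain C# without touching the NER engine itself. This is a minimal illustration, not the demo's actual source: the `Occurrence` and `RegistryRow` records, the `Registry` class, the punctuation set, and the rule that the highest-confidence surface form becomes the representative are all assumptions made for the example. `ToLowerInvariant` stands in for case-folding.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical detections; in the demo these would come from the NER engine.
var hits = new List<Occurrence>
{
    new("contract_a.pdf", "Organization", "Acme Corp.", 0.94),
    new("contract_b.docx", "Organization", "ACME  CORP", 0.88),
    new("contract_b.docx", "Organization", "Acme Corp", 0.91),
    new("contract_a.pdf", "Person", "Jane Doe", 0.42),
};

foreach (var row in Registry.Build(hits, minConfidence: 0.5))
{
    Console.WriteLine(
        $"{row.Label} | {row.Normalised} | {row.Occurrences} hits | {row.DistinctDocuments} docs");
}

// One raw detection (assumed shape, mirrors the occurrences CSV columns).
public record Occurrence(string Document, string Label, string Value, double Confidence);

// One aggregated entity (assumed shape, mirrors the registry CSV columns).
public record RegistryRow(
    string Label, string Normalised, string Representative,
    int Occurrences, int DistinctDocuments, double MaxConfidence,
    IReadOnlyList<string> Documents);

public static class Registry
{
    // Step 2: collapse whitespace, trim trailing punctuation, lower-case.
    public static string Normalise(string value)
    {
        var collapsed = Regex.Replace(value, @"\s+", " ").Trim();
        var trimmed = collapsed.TrimEnd('.', ',', ';', ':', '!', '?', ' ');
        return trimmed.ToLowerInvariant();
    }

    // Steps 1 and 3: apply the confidence floor, then group by (label, normalised value).
    public static List<RegistryRow> Build(IEnumerable<Occurrence> hits, double minConfidence)
    {
        return hits
            .Where(h => h.Confidence >= minConfidence)
            .GroupBy(h => (h.Label, Normalised: Normalise(h.Value)))
            .Select(g => new RegistryRow(
                g.Key.Label,
                g.Key.Normalised,
                // Assumption: the most confident surface form represents the group.
                g.OrderByDescending(h => h.Confidence).First().Value,
                g.Count(),
                g.Select(h => h.Document).Distinct().Count(),
                g.Max(h => h.Confidence),
                g.Select(h => h.Document).Distinct()
                    .OrderBy(d => d, StringComparer.Ordinal).ToList()))
            .OrderByDescending(r => r.DistinctDocuments)
            .ThenByDescending(r => r.Occurrences)
            .ToList();
    }
}
```

With the sample data, the three Acme variants collapse into one row with 3 occurrences across 2 documents, and the Jane Doe hit is dropped by the 0.5 floor. The unfiltered `hits` list is what would feed the occurrences CSV in step 4.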

💻 Application Overview

Interactive menu with three modes plus Quit; the app takes no command-line arguments. The model loads once at startup.

| Mode   | What it does |
|--------|--------------|
| Live   | Paste a paragraph; entities appear immediately with confidence. |
| File   | Recognise entities in a single document (any supported format). |
| Folder | Walk a folder, optionally recurse, apply a min-confidence filter, build the cross-document registry, emit CSVs. |
| Quit   | Exit. |

Folder mode emits two artefacts:

  • entities_registry.csv — one row per (label, normalised value) with: label, normalised, representative, occurrences, distinct_documents, max_confidence, documents.
  • entities_occurrences.csv — one row per detection with: document, label, value, normalised, confidence.
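
As a rough sketch of how the registry file could be written, the snippet below emits the column order listed above using standard CSV quoting (fields containing commas, quotes, or line breaks are wrapped in double quotes, with embedded quotes doubled). The `RegistryRow` record, the `Csv` helper, the `;` separator inside the documents column, and the three-decimal confidence format are assumptions for illustration; the demo's actual formatting may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

// Hypothetical aggregated row, shaped after the registry columns.
var sample = new List<RegistryRow>
{
    new("Organization", "acme corp", "Acme Corp.", 3, 2, 0.94,
        new[] { "contract_a.pdf", "contract_b.docx" }),
};

File.WriteAllLines("entities_registry.csv", Csv.RegistryLines(sample));

public record RegistryRow(
    string Label, string Normalised, string Representative,
    int Occurrences, int DistinctDocuments, double MaxConfidence,
    IReadOnlyList<string> Documents);

public static class Csv
{
    // Quote a field only when it contains a delimiter, a quote, or a line break.
    public static string Escape(string field) =>
        field.IndexOfAny(new[] { ',', '"', '\n', '\r' }) >= 0
            ? "\"" + field.Replace("\"", "\"\"") + "\""
            : field;

    public static IEnumerable<string> RegistryLines(IEnumerable<RegistryRow> rows)
    {
        yield return "label,normalised,representative,occurrences,distinct_documents,max_confidence,documents";
        foreach (var r in rows)
        {
            yield return string.Join(",", new[]
            {
                Escape(r.Label),
                Escape(r.Normalised),
                Escape(r.Representative),
                r.Occurrences.ToString(CultureInfo.InvariantCulture),
                r.DistinctDocuments.ToString(CultureInfo.InvariantCulture),
                // Invariant culture keeps the decimal point stable across locales.
                r.MaxConfidence.ToString("0.000", CultureInfo.InvariantCulture),
                // Assumption: source documents are joined with ';' in a single cell.
                Escape(string.Join(";", r.Documents)),
            });
        }
    }
}
```

The occurrences file would follow the same pattern with its own header (document, label, value, normalised, confidence) and one line per detection.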

✨ Key Features

  • NamedEntityRecognition.Recognize(attachment | text): one call per document, multimodal under the hood (text or vision via VLM).
  • Cross-document dedup: surface what recurs, not what's per-file noise.
  • Confidence floor: ignore weak detections without writing custom code.
  • Two-tier output: aggregated registry for the team, occurrence-level CSV for audit.
  • Interactive, no flags: every input is a console prompt.

🧠 Supported Models

  • Alibaba Qwen 3.5 2B (~2 GB VRAM) — fast default.
  • Alibaba Qwen 3.5 4B (~3.5 GB VRAM).
  • Alibaba Qwen 3.5 9B (~7 GB VRAM) — Recommended for accuracy.
  • Google Gemma 4 E4B (~6 GB VRAM).
  • Alibaba Qwen 3.6 27B (~18 GB VRAM).
  • Alibaba Qwen 3.6 35B-A3B (~22 GB VRAM).
  • Any custom model URI.

Supported Inputs

  • Documents: PDF, DOCX, TXT, MD, EML.
  • Images: PNG, JPG/JPEG, TIFF, BMP, WebP.
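
For Folder mode, the list above translates naturally into an extension filter applied while walking the directory. The following is a sketch under assumptions: the `InputFiles` class and its members are hypothetical names, and the real demo may detect formats differently (for example by content rather than extension).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Example usage: list every supported file under the current directory.
foreach (var path in InputFiles.Enumerate(Directory.GetCurrentDirectory(), recurse: true))
{
    Console.WriteLine(path);
}

public static class InputFiles
{
    // Extensions taken from the Supported Inputs list; matching is case-insensitive.
    private static readonly HashSet<string> Supported = new(StringComparer.OrdinalIgnoreCase)
    {
        ".pdf", ".docx", ".txt", ".md", ".eml",
        ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".webp",
    };

    public static bool IsSupported(string path) =>
        Supported.Contains(Path.GetExtension(path));

    // Walks the folder lazily; 'recurse' toggles descent into subfolders.
    public static IEnumerable<string> Enumerate(string folder, bool recurse)
    {
        var option = recurse ? SearchOption.AllDirectories : SearchOption.TopDirectoryOnly;
        return Directory.EnumerateFiles(folder, "*", option)
            .Where(IsSupported)
            .OrderBy(p => p, StringComparer.OrdinalIgnoreCase);
    }
}
```

Sorting the paths keeps the processing order, and therefore the CSV row order, stable between runs.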

🛠️ Getting Started

📋 Prerequisites

  • .NET 8.0 or later
  • VRAM appropriate to the selected model (roughly 2–22 GB for the models listed above)

▶️ Running the Application

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/text-analysis/named-entity-recognition/ner_entity_extractor
dotnet run

Pick a model, then a mode. Everything else is a prompt.
