👉 Try the demo:
https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/batch_document_classification

Batch Document Classification with AI in .NET Applications


🎯 Purpose of the Sample

Batch Document Classification demonstrates how to classify large volumes of heterogeneous documents using LM-Kit.NET. The sample processes files in parallel, assigns each document to the most relevant category, and automatically organizes outputs into category-based folders.

It supports images, PDFs, Office documents, text, and HTML files, making it suitable for real-world document pipelines.


👥 Target Audience

  • Enterprise & B2B Apps – automate document intake and routing
  • Back-office & Ops – sort incoming documents at scale
  • Compliance & Archiving – pre-classify documents before review
  • Demo & Benchmarking – measure throughput and confidence at scale

🚀 Problem Solved

  • Replaces manual sorting of mixed document folders
  • Scales classification with configurable parallelism
  • Applies a consistent taxonomy across thousands of files
  • Organizes outputs by category automatically

💻 Sample Application Description

Console application that:

  • Loads a local LM-Kit classification model.
  • Scans an input directory recursively.
  • Filters supported file types (images, PDFs, Office documents, text, and HTML).
  • Classifies each document into a predefined category list.
  • Runs in parallel with configurable thread count.
  • Copies files into category-based output folders.
  • Displays real-time progress, confidence, and performance metrics.

📂 Supported File Types

  • Images: PNG, JPG, JPEG, TIFF, WEBP, BMP, GIF, PSD, HDR, TGA
  • Documents: PDF, DOCX, XLSX, PPTX
  • Text: TXT, HTML
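
The recursive scan and file-type filter can be sketched with the base class library alone. This is a minimal illustration, not the sample's actual code: the SupportedFiles class and its members are hypothetical names, and it assumes that only the exact extensions listed above are accepted (variants such as .tif or .htm are not covered here).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper, not part of LM-Kit.NET.
public static class SupportedFiles
{
    // Extensions copied from the list above; matching ignores case.
    private static readonly HashSet<string> Extensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".png", ".jpg", ".jpeg", ".tiff", ".webp", ".bmp", ".gif", ".psd", ".hdr", ".tga",
            ".pdf", ".docx", ".xlsx", ".pptx",
            ".txt", ".html"
        };

    // True when the file's extension is in the supported set.
    public static bool IsSupported(string path) =>
        Extensions.Contains(Path.GetExtension(path));

    // Lazily walks the input folder and all subfolders, keeping supported files only.
    public static IEnumerable<string> Scan(string inputFolder) =>
        Directory.EnumerateFiles(inputFolder, "*", SearchOption.AllDirectories)
                 .Where(IsSupported);
}
```

Because Scan is lazy, files start flowing to the classifier before the whole directory tree has been enumerated.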

🏷️ Supported Categories

Examples include:

  • Invoice, Receipt, Purchase Order
  • Contract, Letter, Resume
  • Bank Statement, Utility Bill, Pay Stub
  • Passport, ID Card, Driver License
  • Medical Record, Insurance Policy
  • Shipping Document, Shipping Label
  • Unknown (fallback)

⚙️ Key Features

  • ⚡ Parallel Processing – configurable number of threads
  • 📁 Auto-Sorting – output folders per detected category
  • 🧠 Confidence Scoring – per-document confidence value
  • 📊 Live Metrics – throughput (documents per second) and average latency
  • 🧩 Mixed Inputs – images and documents handled uniformly
  • 🔁 Thread-Safe Design – shared model, per-thread categorizer
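
The "shared model, per-thread categorizer" pattern can be approximated with the thread-local overload of Parallel.ForEach. The sketch below deliberately avoids LM-Kit.NET types: the classifier is a plain delegate, and ClassificationResult, BatchRunner, and createClassifier are hypothetical names introduced for illustration. In a real integration, the factory would presumably capture the single loaded model and build one categorizer per worker.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical result type, not an LM-Kit.NET class.
public sealed class ClassificationResult
{
    public ClassificationResult(string filePath, string category, float confidence)
    {
        FilePath = filePath;
        Category = category;
        Confidence = confidence;
    }

    public string FilePath { get; }
    public string Category { get; }
    public float Confidence { get; }
}

public static class BatchRunner
{
    // createClassifier is invoked once per worker, so a classifier instance
    // is never used by two workers at the same time.
    public static ClassificationResult[] Run(
        IEnumerable<string> files,
        int threadCount,
        Func<Func<string, ClassificationResult>> createClassifier)
    {
        var results = new ConcurrentBag<ClassificationResult>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = threadCount };

        Parallel.ForEach(
            files,
            options,
            createClassifier,                 // localInit: one classifier per worker
            (file, state, classify) =>
            {
                results.Add(classify(file));  // thread-safe collection for the output
                return classify;              // reuse the same instance on the next file
            },
            classify => { });                 // localFinally: nothing to release in this sketch

        return results.ToArray();
    }
}
```

The order of the returned results is not guaranteed, since workers finish independently.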

🛠️ Getting Started

📋 Prerequisites

  • .NET Framework 4.6.2 or .NET 8.0+
  • LM-Kit.NET license key

📥 Download

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/batch_document_classification

▶️ Run

dotnet build
dotnet run

You will be prompted for:

  • Input folder containing documents
  • Output folder for classified files
  • Number of processing threads

📈 Runtime Output

During execution, the console displays:

  • File name and detected category
  • Confidence score
  • Per-document processing time
  • Global progress and average latency

At completion:

  • Total documents processed
  • Documents per second
  • Average confidence
  • Error count (if any)
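
The summary figures above can be accumulated from parallel workers with a small thread-safe counter. This is a sketch under stated assumptions: BatchMetrics is a hypothetical class, failed documents are counted separately and excluded from the averages, and throughput is computed against wall-clock time supplied by the caller.

```csharp
using System;

// Hypothetical metrics accumulator, not part of LM-Kit.NET.
public sealed class BatchMetrics
{
    private readonly object _gate = new object();
    private int _processed;
    private int _errors;
    private double _confidenceSum;
    private double _latencySumMs;

    // Called by a worker after a document was classified successfully.
    public void RecordSuccess(double confidence, TimeSpan elapsed)
    {
        lock (_gate)
        {
            _processed++;
            _confidenceSum += confidence;
            _latencySumMs += elapsed.TotalMilliseconds;
        }
    }

    // Called by a worker when a document could not be processed.
    public void RecordError()
    {
        lock (_gate) { _errors++; }
    }

    public int Processed { get { lock (_gate) { return _processed; } } }

    public int Errors { get { lock (_gate) { return _errors; } } }

    public double AverageConfidence
    {
        get { lock (_gate) { return _processed == 0 ? 0 : _confidenceSum / _processed; } }
    }

    public double AverageLatencyMs
    {
        get { lock (_gate) { return _processed == 0 ? 0 : _latencySumMs / _processed; } }
    }

    // Throughput over the whole run; wallClock is measured by the caller.
    public double DocsPerSecond(TimeSpan wallClock)
    {
        lock (_gate)
        {
            return wallClock.TotalSeconds > 0 ? _processed / wallClock.TotalSeconds : 0;
        }
    }
}
```

Average latency here is the mean per-document time, which differs from the inverse of throughput as soon as more than one thread is running.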

🔍 Notes

  • The model is loaded once and shared across threads.
  • Each worker thread uses its own Categorization instance.
  • Unknown or ambiguous documents are routed to the Unknown category.
  • Output file names are auto-deduplicated.
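
One way to deduplicate output file names is to append a counter when the target already exists. The sample's actual naming scheme is not described here, so the " (1)", " (2)" suffix format and the OutputPaths helper below are assumptions made for illustration.

```csharp
using System.IO;

// Hypothetical helper, not part of LM-Kit.NET.
public static class OutputPaths
{
    // Returns a path inside categoryFolder that does not collide with an
    // existing file, inserting " (1)", " (2)", ... before the extension.
    public static string GetUniquePath(string categoryFolder, string sourceFile)
    {
        string name = Path.GetFileNameWithoutExtension(sourceFile);
        string ext = Path.GetExtension(sourceFile);
        string candidate = Path.Combine(categoryFolder, name + ext);

        for (int i = 1; File.Exists(candidate); i++)
        {
            candidate = Path.Combine(categoryFolder, $"{name} ({i}){ext}");
        }

        return candidate;
    }
}
```

The existence check and the later copy are not atomic. With several worker threads writing to the same category folder, the call and the copy should be wrapped in a lock, or the copy should be retried when it fails because the file already exists.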

🔧 Extend the Demo

  • Customize the category taxonomy.
  • Persist results to a database instead of folders.
  • Add confidence thresholds for rejection or review queues.
  • Integrate OCR preprocessing for scanned documents.
  • Combine with RAG or extraction pipelines for downstream processing.
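
As a starting point for the confidence-threshold extension, low-confidence results can be diverted to a review folder instead of their predicted category. This is a minimal sketch: ReviewRouting, the 0.7 default threshold, and the "_review" folder name are all hypothetical and should be tuned against real data.

```csharp
// Hypothetical routing helper, not part of LM-Kit.NET.
public static class ReviewRouting
{
    // Returns the output folder name for a classified document: the predicted
    // category when confidence meets the threshold, otherwise the review folder.
    public static string ResolveFolder(
        string category,
        float confidence,
        float threshold = 0.7f,          // arbitrary default, chosen for illustration
        string reviewFolder = "_review") // assumed folder name
    {
        return confidence >= threshold ? category : reviewFolder;
    }
}
```

For example, ResolveFolder("Invoice", 0.55f) sends the file to the review folder, while ResolveFolder("Invoice", 0.92f) keeps it under Invoice.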