👉 Try the demo:
https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/document_classification

Document Classification in .NET Applications


🎯 Purpose of the Sample

Document Classification demonstrates how to use LM-Kit.NET to build an intelligent document categorization system that uses vision-capable models to automatically classify documents into predefined categories.

The sample shows how to:

  • Download and load vision models with progress callbacks.
  • Create a Categorization instance for document classification.
  • Define custom category sets tailored to your document types.
  • Classify documents using the Attachment class for multi-format support.
  • Process single files or entire directories in batch mode.
  • Retrieve confidence scores to assess classification reliability.
  • Handle unknown documents that don't fit predefined categories.

Why Document Classification with LM-Kit.NET?

  • Local-first: all processing runs on your hardware, with no cloud dependencies for sensitive documents.
  • Multi-format support: classifies images, PDFs, Office documents, and text files with a single API.
  • Vision-powered: uses multimodal models to understand document layouts, headers, and visual structure.
  • Confidence scoring: know how certain the model is about each classification.
  • Batch processing: classify entire folders with detailed summaries and statistics.
  • Flexible categories: define your own category sets for any domain or use case.

👥 Target Audience

  • Enterprise Developers: build automated document routing and intake systems.
  • Back-Office & RPA: automate mailroom sorting and document triage workflows.
  • Legal & Compliance: automatically categorize contracts, filings, and regulatory documents.
  • Finance & Accounting: route invoices, receipts, and financial statements to appropriate queues.
  • Healthcare: classify medical records, insurance claims, and patient forms.
  • Demo & Education: explore vision-based document understanding in a practical C# example.

🚀 Problem Solved

  • Automate document sorting: eliminate manual classification of incoming documents.
  • Handle diverse formats: process scanned images, PDFs, and Office documents uniformly.
  • Scale document intake: batch-process entire folders with performance metrics.
  • Assess classification quality: confidence scores help identify documents needing human review.
  • Integrate easily: simple API for embedding into existing document workflows.
  • Customize categories: adapt to any industry or organizational taxonomy.

💻 Sample Application Description

Console app that:

  • Lets you choose a vision model for classification (or paste a custom model URI).
  • Downloads the model if needed, with live progress updates.
  • Creates a Categorization instance configured for document analysis.
  • Enters an interactive loop where you can:
    • Classify single documents by entering a file path.
    • Batch-process folders by entering a directory path.
    • View available categories and supported formats.
  • Displays classification results with category, confidence, and timing.
  • Provides batch summaries grouped by category with average confidence.
  • Loops until you type exit or press Enter on an empty prompt.

✨ Key Features

  • 📚 Multi-format support: images (PNG, JPEG, TIFF, WebP, etc.), PDFs, Word, Excel, PowerPoint, HTML, and plain text.
  • 🏷️ 22 predefined categories: invoice, passport, driver_license, bank_statement, tax_form, receipt, contract, resume, and more.
  • 📊 Confidence scoring: percentage-based confidence with color-coded display (green ≥80%, yellow ≥50%, red <50%).
  • 📁 Batch processing: process entire directories with per-file results and aggregate statistics.
  • ❓ Unknown handling: documents that don't match any category are classified as "unknown".
  • ⏱️ Performance metrics: timing for each classification and total batch duration.
  • 🎨 Rich console output: formatted tables, progress bars, and color-coded results.

🧰 Built-In Models (menu)

On startup, the sample shows a model selection menu:

| Option | Model | Approx. VRAM Needed |
|--------|-------|---------------------|
| 0 | MiniCPM 2.6 o 8.1B | ~5.9 GB VRAM |
| 1 | Alibaba Qwen 3 VL 2B | ~2.5 GB VRAM |
| 2 | Alibaba Qwen 3 VL 4B | ~4.5 GB VRAM |
| 3 | Alibaba Qwen 3 VL 8B | ~6.5 GB VRAM |
| 4 | Google Gemma 3 4B | ~5.7 GB VRAM |
| 5 | Google Gemma 3 12B | ~11 GB VRAM |
| 6 | Mistral Ministral 3 3B | ~3.5 GB VRAM |
| 7 | Mistral Ministral 3 8B | ~6.5 GB VRAM |
| 8 | Mistral Ministral 3 14B | ~12 GB VRAM |
| other | Custom model URI | depends on model |

Any input other than 0-8 is treated as a custom model URI and passed directly to the LM constructor.


🧠 Supported Models

The sample is pre-wired to LM-Kit's predefined model cards:

  • minicpm-o
  • qwen3-vl:2b / qwen3-vl:4b / qwen3-vl:8b
  • gemma3:4b / gemma3:12b
  • ministral3:3b / ministral3:8b / ministral3:14b

Internally:

// Model selection via predefined cards
string modelLink = ModelCard
    .GetPredefinedModelCardByModelID("qwen3-vl:4b")
    .ModelUri.ToString();

// Load model with progress callbacks
var model = new LM(
    new Uri(modelLink),
    downloadingProgress: ModelDownloadingProgress,
    loadingProgress: ModelLoadingProgress);

You can also provide any valid model URI manually (including local paths or custom model servers) by typing/pasting it when prompted.
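The menu behavior described above (0-8 selects a predefined card, anything else is a custom URI) could be sketched as follows. This is a minimal illustration rather than the sample's actual code: `ModelMenu` and `Resolve` are hypothetical names, and the pairing of menu options with model IDs is assumed from the order in which the two lists appear.

```csharp
// Hypothetical helper: maps menu input to a predefined model ID or a custom URI string.
public static class ModelMenu
{
    // Assumption: IDs are listed in the same order as menu options 0-8.
    private static readonly string[] ModelIds =
    {
        "minicpm-o",
        "qwen3-vl:2b", "qwen3-vl:4b", "qwen3-vl:8b",
        "gemma3:4b", "gemma3:12b",
        "ministral3:3b", "ministral3:8b", "ministral3:14b"
    };

    // Returns a model ID for a single digit 0-8; otherwise returns the
    // trimmed input unchanged so the caller can treat it as a custom URI.
    public static (bool IsPredefined, string Value) Resolve(string input)
    {
        string trimmed = (input ?? string.Empty).Trim();

        if (trimmed.Length == 1 && trimmed[0] >= '0' && trimmed[0] <= '8')
        {
            return (true, ModelIds[trimmed[0] - '0']);
        }

        return (false, trimmed);
    }
}
```

A predefined ID would then go through `ModelCard.GetPredefinedModelCardByModelID(...)` as in the snippet above, while a custom value would be wrapped in `new Uri(...)` directly.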


🏷️ Predefined Categories

The sample includes 22 common document categories:

| Category | Description |
|----------|-------------|
| invoice | Commercial invoices and bills |
| passport | Passport documents and ID pages |
| driver_license | Driver's licenses and permits |
| bank_statement | Bank account statements |
| tax_form | Tax returns and related forms |
| receipt | Purchase receipts and transaction records |
| contract | Legal contracts and agreements |
| resume | CVs and resumes |
| medical_record | Medical reports and health records |
| insurance_claim | Insurance claim forms |
| purchase_order | Purchase orders and requisitions |
| shipping_label | Shipping and mailing labels |
| company_registration | Business registration documents |
| utility_bill | Utility bills (electric, gas, water) |
| pay_stub | Payroll stubs and salary slips |
| business_card | Business cards |
| id_card | ID cards and badges |
| birth_certificate | Birth certificates |
| marriage_certificate | Marriage certificates |
| loan_application | Loan and credit applications |
| check | Bank checks and money orders |
| letter | General correspondence and letters |

📄 Supported File Formats

The sample processes documents in these formats:

| Format Type | Extensions |
|-------------|------------|
| Images | .png, .jpg, .jpeg, .gif, .bmp, .tiff, .webp, .tga, .psd, .pic, .pnm, .hdr |
| Documents | .pdf, .docx, .xlsx, .pptx |
| Text | .txt, .html |

🛠️ Commands & Flow

Startup Flow

  1. Model selection: choose a vision-capable model (0-8) or paste a custom URI.
  2. Model loading: model downloads (if needed) and loads with progress reporting.
  3. Interactive loop: enter file/folder paths to classify documents.

Interactive Commands

Inside the main loop, type these commands instead of a path:

| Command | Description |
|---------|-------------|
| help | Show available commands and supported formats. |
| categories | Display all predefined document categories. |
| clear | Clear the console and show the header. |
| exit | Exit the application. |
| (empty) | Press Enter on an empty prompt to exit. |
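One way to picture how a line of input is routed between these commands and a path is the sketch below. It is an assumption-laden illustration, not the sample's code: `InputKind` and `InputRouter` are hypothetical names, and the case-insensitive command matching and stripping of surrounding quotes are guesses about reasonable behavior.

```csharp
using System.IO;

// Hypothetical classification of one line typed at the prompt.
public enum InputKind { Exit, Help, Categories, Clear, FilePath, DirectoryPath, NotFound }

public static class InputRouter
{
    public static InputKind Classify(string input)
    {
        string text = (input ?? string.Empty).Trim();

        // An empty prompt exits, as described in the command table.
        if (text.Length == 0) return InputKind.Exit;

        // Assumption: commands are matched case-insensitively.
        switch (text.ToLowerInvariant())
        {
            case "exit": return InputKind.Exit;
            case "help": return InputKind.Help;
            case "categories": return InputKind.Categories;
            case "clear": return InputKind.Clear;
        }

        // Assumption: surrounding double quotes (e.g. from drag-and-drop) are removed.
        string path = text.Trim('"');

        if (File.Exists(path)) return InputKind.FilePath;
        if (Directory.Exists(path)) return InputKind.DirectoryPath;
        return InputKind.NotFound;
    }
}
```

Checking commands before paths means a file literally named `help` in the working directory could not be classified by name alone; a real implementation might handle that differently.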

Single File Classification

  1. Enter the path to a document file.
  2. The document is loaded via the Attachment class.
  3. Classification runs with the Categorization.GetBestCategory() method.
  4. Results display: category, confidence percentage, and processing time.
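The processing time shown in step 4 can be captured with a small timing wrapper. The sketch below is illustrative only: `Timing.Measure` is a hypothetical helper, and the sample may measure time differently.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: runs an operation and reports how long it took.
public static class Timing
{
    public static (T Result, long ElapsedMs) Measure<T>(Func<T> action)
    {
        var stopwatch = Stopwatch.StartNew();
        T result = action();   // e.g. a call that classifies one document
        stopwatch.Stop();
        return (result, stopwatch.ElapsedMilliseconds);
    }
}
```

For instance, wrapping the classification call as `Timing.Measure(() => categorizer.GetBestCategory(Categories, attachment))` would yield both the category index and the elapsed milliseconds.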

Batch Directory Processing

  1. Enter the path to a folder containing documents.
  2. All supported files in the directory are discovered.
  3. Each file is classified with individual results shown.
  4. A summary displays:
    • Total files processed and elapsed time.
    • Breakdown by category with document counts.
    • Average confidence per category.
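The per-category summary in step 4 amounts to a group-and-average over the individual results. A minimal sketch, assuming results are held as `(File, Category, Confidence)` tuples and that categories are listed by descending document count; `BatchSummary` is a hypothetical name, not part of the sample or of LM-Kit:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical aggregation of batch results by category.
public static class BatchSummary
{
    // Returns one entry per category with its document count and mean
    // confidence, ordered by count (descending), then by category name.
    public static List<(string Category, int Count, double AverageConfidence)> Summarize(
        IEnumerable<(string File, string Category, double Confidence)> results)
    {
        return results
            .GroupBy(r => r.Category)
            .Select(g => (Category: g.Key,
                          Count: g.Count(),
                          AverageConfidence: g.Average(r => r.Confidence)))
            .OrderByDescending(s => s.Count)
            .ThenBy(s => s.Category, StringComparer.Ordinal)
            .ToList();
    }

    // Formats one summary line; confidence is scaled from 0.0-1.0 to a percentage.
    public static string FormatLine((string Category, int Count, double AverageConfidence) s)
    {
        return $"  {s.Category,-20} : {s.Count,3} document(s)  (avg confidence: {s.AverageConfidence * 100:0}%)";
    }
}
```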

🗣️ Example Use Cases

Try the sample with:

  • A folder of scanned mail → batch-classify invoices, letters, and bills for routing.
  • ID verification documents → classify passports, driver's licenses, and ID cards.
  • Financial document inbox → sort receipts, bank statements, and tax forms.
  • HR document processing → categorize resumes, contracts, and pay stubs.
  • Legal document intake → route contracts, certificates, and applications.

After each classification, inspect:

  • Category assignment: does it match the document type?
  • Confidence score: high confidence (≥80%) suggests reliable classification.
  • Processing time: is the latency acceptable for your use case?

📊 Understanding Results

Single File Output

┌──────────────────────────────────────────────────────┐
│ Category:   invoice                                  │
│ Confidence: 94%                                      │
│ Time:       342 ms                                   │
└──────────────────────────────────────────────────────┘

Batch Processing Output

Processing 15 document(s)...
──────────────────────────────────────────────────────────────────────
  ✓ invoice_001.pdf                   invoice             94%  (287 ms)
  ✓ receipt_scan.jpg                  receipt             87%  (312 ms)
  ✓ contract_draft.docx               contract            91%  (445 ms)
  ✓ unknown_doc.png                   unknown             23%  (298 ms)
  ...
──────────────────────────────────────────────────────────────────────
Batch complete: 15/15 processed in 4,523 ms

Summary by category:
  invoice              :   5 document(s)  (avg confidence: 89%)
  receipt              :   4 document(s)  (avg confidence: 82%)
  contract             :   3 document(s)  (avg confidence: 88%)
  letter               :   2 document(s)  (avg confidence: 76%)
  unknown              :   1 document(s)  (avg confidence: 23%)

Confidence Color Coding

| Color | Confidence Range | Interpretation |
|-------|------------------|----------------|
| 🟢 Green | ≥ 80% | High confidence: reliable classification |
| 🟡 Yellow | 50% – 79% | Medium confidence: may need verification |
| 🔴 Red | < 50% | Low confidence: consider manual review |
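The thresholds above translate directly into a small mapping from score to console color. The following is a sketch under the assumption that confidence is a 0.0 to 1.0 value and that the thresholds apply to the raw score before rounding for display; `ConfidenceDisplay` is a hypothetical name.

```csharp
using System;

// Hypothetical helper mirroring the color bands in the table above.
public static class ConfidenceDisplay
{
    // Maps a 0.0-1.0 confidence score to green (>= 0.80), yellow (>= 0.50), or red.
    public static ConsoleColor ColorFor(double confidence)
    {
        if (confidence >= 0.80) return ConsoleColor.Green;
        if (confidence >= 0.50) return ConsoleColor.Yellow;
        return ConsoleColor.Red;
    }

    // Writes the score as a whole-number percentage in its band color,
    // then restores the previous foreground color.
    public static void WriteConfidence(double confidence)
    {
        ConsoleColor previous = Console.ForegroundColor;
        Console.ForegroundColor = ColorFor(confidence);
        Console.WriteLine($"Confidence: {confidence * 100:0}%");
        Console.ForegroundColor = previous;
    }
}
```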

🔧 Advanced Configuration

Custom Categories

Modify the Categories list to define your own document types:

private static readonly List<string> Categories = new()
{
    "purchase_order",
    "invoice",
    "packing_slip",
    "bill_of_lading",
    "customs_declaration",
    "certificate_of_origin"
};

Unknown Category Handling

The sample enables unknown category detection:

var categorizer = new Categorization(model)
{
    AllowUnknownCategory = true  // Returns -1 for unrecognized documents
};

int result = categorizer.GetBestCategory(Categories, attachment);
string category = result == -1 ? "unknown" : Categories[result];

Set AllowUnknownCategory = false to force classification into one of the defined categories.

Accessing Confidence Scores

After classification, retrieve the confidence score:

int result = categorizer.GetBestCategory(Categories, attachment);
double confidence = categorizer.Confidence;  // 0.0 to 1.0

if (confidence < 0.5)
{
    Console.WriteLine("Low confidence: consider manual review");
}

Adding Custom File Formats

Extend the supported extensions array:

private static readonly string[] SupportedExtensions =
    { ".png", ".bmp", ".gif", ".jpeg", ".jpg", ".pdf", ".docx", 
      ".xlsx", ".pptx", ".txt", ".html", ".rtf", ".odt" };  // Added RTF and ODT
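For context, an extension list like this is typically used to filter a directory listing before classification. The sketch below shows one way that could look; `DocumentDiscovery` is a hypothetical name, the extension set is shortened for illustration, and scanning only the top level of the folder with case-insensitive matching is an assumption about the sample's behavior.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper: finds files whose extension is in a supported set.
public static class DocumentDiscovery
{
    // Shortened list for illustration; comparison ignores case so ".JPG" matches ".jpg".
    private static readonly HashSet<string> SupportedExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            ".png", ".jpg", ".jpeg", ".pdf", ".docx", ".txt"
        };

    public static bool IsSupported(string path)
    {
        return SupportedExtensions.Contains(Path.GetExtension(path));
    }

    // Returns supported files directly inside the directory (no recursion), sorted by path.
    public static List<string> FindDocuments(string directory)
    {
        return Directory.EnumerateFiles(directory)
            .Where(IsSupported)
            .OrderBy(p => p, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

Whether a newly added extension actually works depends on what the `Attachment` class can load, so it is worth testing each added format with a real file.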

⚙️ Behavior & Policies (quick reference)

  • Model selection: exactly one model per process. To change models, restart the app.
  • Download & load:
    • ModelDownloadingProgress displays a progress bar with percentage.
    • ModelLoadingProgress displays loading progress and clears when done.
  • Classification:
    • Uses Categorization.GetBestCategory() for single-label classification.
    • Returns index into the categories list, or -1 if unknown.
    • Confidence property provides reliability score (0.0–1.0).
  • Batch processing:
    • Scans directory for files with supported extensions.
    • Processes sequentially with individual timing.
    • Aggregates results by category with statistics.
  • Licensing:
    • You can set an optional license key via LicenseManager.SetLicenseKey("").
    • A free community license is available from the LM-Kit website.

💻 Minimal Integration Snippet

using System;
using System.Collections.Generic;
using System.IO;
using LMKit.Data;
using LMKit.Model;
using LMKit.TextAnalysis;

public class DocumentClassifier
{
    private readonly Categorization _categorizer;
    private readonly List<string> _categories;

    public DocumentClassifier(string modelUri, List<string> categories)
    {
        var model = new LM(new Uri(modelUri));
        _categorizer = new Categorization(model)
        {
            AllowUnknownCategory = true
        };
        _categories = categories;
    }

    public (string Category, double Confidence) Classify(string filePath)
    {
        var attachment = new Attachment(filePath);
        int result = _categorizer.GetBestCategory(_categories, attachment);
        
        string category = result == -1 ? "unknown" : _categories[result];
        double confidence = _categorizer.Confidence;
        
        return (category, confidence);
    }

    public Dictionary<string, List<string>> ClassifyDirectory(string directoryPath)
    {
        var results = new Dictionary<string, List<string>>();
        
        foreach (var file in Directory.GetFiles(directoryPath))
        {
            var (category, _) = Classify(file);
            
            if (!results.ContainsKey(category))
                results[category] = new List<string>();
            
            results[category].Add(file);
        }
        
        return results;
    }
}

Use this pattern to integrate document classification into web APIs, desktop apps, or document processing pipelines.


🛠️ Getting Started

📋 Prerequisites

  • .NET Framework 4.6.2 or .NET 8.0+

📥 Download

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/document_classification

Project Link: document_classification (same path as above)

▢️ Run

dotnet build
dotnet run

Then:

  1. Select a model by typing 0-8, or paste a custom model URI.
  2. Wait for the model to download (first run) and load.
  3. Enter a file path to classify a single document.
  4. Or enter a folder path to batch-process all documents.
  5. Use commands (help, categories, clear) as needed.
  6. Type exit or press Enter on an empty prompt to quit.

🔍 Notes on Key Types

Core Classes

  • Categorization (LMKit.TextAnalysis) - main class for document classification:

    • Classifies documents into predefined categories.
    • Supports unknown category detection via AllowUnknownCategory.
    • Provides confidence scores via the Confidence property.
  • Attachment (LMKit.Data) - multi-format document wrapper:

    • Loads documents from file paths.
    • Supports images, PDFs, Office documents, and text files.
    • Handles format detection and content extraction automatically.
  • LM (LMKit.Model) - language model loader:

    • Downloads models from URIs with progress callbacks.
    • Manages model lifecycle and inference.
  • ModelCard (LMKit.Model) - predefined model registry:

    • Access curated models via GetPredefinedModelCardByModelID().
    • Provides model URIs, metadata, and recommended configurations.

Key Properties

| Property | Type | Description |
|----------|------|-------------|
| Categorization.AllowUnknownCategory | bool | When true, returns -1 for unrecognized documents. |
| Categorization.Confidence | double | Confidence score (0.0–1.0) for the last classification. |

Key Methods

| Method | Returns | Description |
|--------|---------|-------------|
| Categorization.GetBestCategory(categories, attachment) | int | Index of best matching category, or -1 if unknown. |
| Attachment(filePath) | Attachment | Creates an attachment from a file path. |

⚠️ Troubleshooting

  • "Path not found"

    • Verify the file or folder path is correct.
    • Try using absolute paths or quoting paths with spaces.
  • "No supported documents found in directory"

    • The folder contains no files with supported extensions.
    • Check the supported formats list with the help command.
  • Low confidence scores

    • Document may not match any predefined category well.
    • Try a larger model for better accuracy.
    • Consider adding more specific categories for your document types.
    • Ensure document quality (clear scans, readable text).
  • Out-of-memory or driver errors

    • VRAM is insufficient for the selected model.
    • Pick smaller models (e.g., Qwen 3 VL 2B, Ministral 3 3B).
    • Close other GPU-intensive applications.
  • Slow classification

    • First classification may be slower due to model warm-up.
    • Use a smaller model for faster processing.
    • Consider GPU acceleration for production workloads.
  • Incorrect classifications

    • Review if your categories adequately cover your document types.
    • Add more specific categories or remove ambiguous ones.
    • Try a larger, more capable model.
    • Ensure documents are clear and legible.
  • "Unknown" classifications for known document types

    • Set AllowUnknownCategory = false to force classification into one of the defined categories.
    • Add the specific document type to your categories list.
    • Check that the document is clear enough for visual analysis.

🔧 Extend the Demo

  • Web API integration: expose classification as a REST endpoint for document intake services.
  • Custom categories: define industry-specific taxonomies (healthcare, legal, finance).
  • Multi-label classification: extend to support documents belonging to multiple categories.
  • Confidence thresholds: route low-confidence documents to human review queues.
  • Database logging: store classification results with metadata for audit trails.
  • Watch folders: monitor directories for new documents and classify automatically.
  • Integration with other LM-Kit features:
    • Chain with Structured Extraction to pull specific fields from classified documents.
    • Use PdfChat for follow-up questions on classified documents.
    • Connect to Function Calling for automated document workflows.
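As an example of the confidence-threshold idea in the list above, a routing decision could be as small as the sketch below. Everything here is hypothetical: the `Route` enum, the `ReviewRouter` class, and the default threshold of 0.8 (borrowed from the green band of the color coding) are illustrative choices, not part of the sample or of LM-Kit.

```csharp
// Hypothetical outcome of a routing decision for one classified document.
public enum Route { AutoProcess, HumanReview }

public static class ReviewRouter
{
    // Sends a document to human review when it was classified as "unknown"
    // or when its confidence (0.0-1.0) falls below the threshold.
    public static Route Decide(string category, double confidence, double threshold = 0.8)
    {
        if (category == "unknown" || confidence < threshold)
        {
            return Route.HumanReview;
        }

        return Route.AutoProcess;
    }
}
```

The right threshold depends on the cost of a misrouted document in your workflow, so it is worth calibrating against a labeled sample of your own documents.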

📚 Additional Resources