Build a Classification and Extraction Pipeline

Real-world document processing rarely involves a single AI step. Incoming documents arrive in mixed types (invoices, contracts, support tickets) and each type requires different extraction logic. This tutorial builds an automated pipeline that classifies documents by type, then routes each type to a specialized extraction schema, producing structured JSON output.


Why Local Classification Pipelines Matter

On-device classification pipelines address two common enterprise problems:

  1. Accounts payable automation. Organizations receive thousands of documents monthly (invoices, purchase orders, receipts, credit notes) in varying formats. A classification pipeline identifies the document type and extracts the relevant fields, routing structured data directly into ERP systems without human sorting.
  2. Support ticket triage. Incoming support requests need to be categorized by department, priority, and issue type before extracting actionable fields (account numbers, product names, error codes). Running this on-premises avoids sending customer PII to external services.

Prerequisites

| Requirement | Minimum |
|-------------|---------|
| .NET SDK    | 8.0+    |
| VRAM        | 4+ GB   |

Step 1: Create the Project

dotnet new console -n ClassifyExtractPipeline
cd ClassifyExtractPipeline
dotnet add package LM-Kit.NET

Step 2: Understand the Pipeline

  Incoming      ┌──────────────┐        ┌──────────────────────┐
  document ───► │ Categorize   │  ───►  │ Route to schema      │
                │ (what type?) │        │ (invoice? contract?) │
                └──────────────┘        └──────────┬───────────┘
                                                   │
                              ┌────────────────────┼────────────────────┐
                              ▼                    ▼                    ▼
                       Invoice schema      Contract schema      Ticket schema
                       (vendor, total)     (parties, dates)     (issue, priority)
                              │                    │                    │
                              ▼                    ▼                    ▼
                       Structured JSON     Structured JSON     Structured JSON

This pipeline uses two LM-Kit.NET classes:

  • Categorization to classify the document type
  • TextExtraction to extract typed fields using the appropriate schema
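
The routing step in the diagram amounts to a lookup from category label to schema. The sketch below illustrates that idea without any LM-Kit.NET dependency, assuming each schema is represented simply as a list of field names. `SchemaRouter` and `TryRoute` are hypothetical names introduced for illustration, not part of the library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: maps a category label to the field names of its schema.
public static class SchemaRouter
{
    private static readonly Dictionary<string, string[]> Routes = new()
    {
        ["invoice"] = new[] { "VendorName", "Total" },
        ["contract"] = new[] { "PartyA", "PartyB", "EffectiveDate" },
        ["support_ticket"] = new[] { "IssueType", "Priority" }
    };

    // Returns true and the schema fields when the label is known.
    // Returns false and an empty array otherwise, so unknown document
    // types can be skipped instead of causing an exception.
    public static bool TryRoute(string label, out string[] fields)
    {
        if (Routes.TryGetValue(label, out var found))
        {
            fields = found;
            return true;
        }

        fields = Array.Empty<string>();
        return false;
    }
}
```

Calling `SchemaRouter.TryRoute("invoice", out var fields)` succeeds and yields the invoice fields, while an unrecognized label such as `"memo"` returns false. Keeping the mapping in one table means a new document type needs only one new entry.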

Step 3: The Complete Pipeline

using System.Text;
using LMKit.Data;
using LMKit.Extraction;
using LMKit.Model;
using LMKit.TextAnalysis;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Define document categories
// ──────────────────────────────────────
string[] categories = { "invoice", "support_ticket", "contract" };

string[] descriptions =
{
    "A billing document with vendor info, line items, and amounts due",
    "A customer request for help with a product or service issue",
    "A legal agreement between parties with terms and conditions"
};

var categorizer = new Categorization(model)
{
    AllowUnknownCategory = true
};

// ──────────────────────────────────────
// 3. Define extraction schemas per type
// ──────────────────────────────────────
var invoiceSchema = new List<TextExtractionElement>
{
    new("VendorName", ElementType.String, "Name of the vendor or supplier", isRequired: true),
    new("InvoiceNumber", ElementType.String, "Invoice identifier"),
    new("InvoiceDate", ElementType.Date, "Date the invoice was issued"),
    new("Total", ElementType.Double, "Total amount due", isRequired: true),
    new("Currency", ElementType.String, "Currency code (e.g. USD, EUR)")
};

var ticketSchema = new List<TextExtractionElement>
{
    new("CustomerName", ElementType.String, "Name of the customer"),
    new("IssueType", ElementType.String, "Category of the problem"),
    new("Priority", ElementType.String, "Urgency level"),
    new("ProductName", ElementType.String, "Product or service mentioned"),
    new("Description", ElementType.String, "Summary of the issue")
};

var contractSchema = new List<TextExtractionElement>
{
    new("PartyA", ElementType.String, "First party in the agreement"),
    new("PartyB", ElementType.String, "Second party in the agreement"),
    new("EffectiveDate", ElementType.Date, "When the contract takes effect"),
    new("ExpirationDate", ElementType.Date, "When the contract expires"),
    new("ContractValue", ElementType.Double, "Total value of the contract")
};

// ──────────────────────────────────────
// 4. Process sample documents
// ──────────────────────────────────────
string[] sampleDocuments =
{
    "Invoice #4821 from Acme Supplies, dated January 15, 2025. " +
    "Items: 50 units of Widget A at $12.00 each, 20 units of Gadget B at $45.00 each. " +
    "Subtotal: $1,500.00. Tax: $120.00. Total due: $1,620.00 USD. Payment due February 14, 2025.",

    "Hi, my name is Sarah Chen and I've been unable to log into my account for 3 days. " +
    "I keep getting error code E-4012 on the CloudSync Pro dashboard. " +
    "This is blocking my entire team. Please treat this as urgent.",

    "This Service Agreement is entered into between TechCorp Inc. and Global Logistics Ltd., " +
    "effective March 1, 2025, and expiring February 28, 2026. " +
    "TechCorp will provide cloud infrastructure services valued at $240,000 annually."
};

var extractor = new TextExtraction(model);

Console.WriteLine("Processing documents:\n");

foreach (string doc in sampleDocuments)
{
    // Step A: Classify
    int categoryIndex = categorizer.GetBestCategory(categories, descriptions, doc);
    string docType = categoryIndex >= 0 ? categories[categoryIndex] : "unknown";
    float classifyConfidence = categorizer.Confidence;

    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine($"  Type: {docType} ({classifyConfidence:P0})");
    Console.ResetColor();

    if (categoryIndex < 0)
    {
        Console.WriteLine("  Skipping unknown document type.\n");
        continue;
    }

    // Step B: Route to appropriate schema
    extractor.Elements = docType switch
    {
        "invoice" => invoiceSchema,
        "support_ticket" => ticketSchema,
        "contract" => contractSchema,
        _ => null
    };

    if (extractor.Elements == null) continue;

    // Step C: Extract
    extractor.SetContent(doc);
    TextExtractionResult result = extractor.Parse();

    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.WriteLine($"  Extraction confidence: {result.Confidence:P0}");
    Console.ResetColor();
    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.WriteLine($"  {result.Json}");
    Console.ResetColor();
    Console.WriteLine();
}
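
Before handing extracted data to a downstream system, it can help to confirm that the fields marked as required actually came back. The following is a minimal sketch using System.Text.Json. It assumes the extraction JSON is a flat object keyed by element name, which is an assumption about the output shape rather than something this tutorial specifies; `RequiredFieldCheck` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical post-processing helper, independent of LM-Kit.NET.
public static class RequiredFieldCheck
{
    // Returns the names of required fields that are absent or null.
    // Assumes the JSON root is a flat object keyed by field name.
    public static List<string> Missing(string json, IEnumerable<string> required)
    {
        var missing = new List<string>();
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement root = doc.RootElement;

        // A non-object root cannot contain named fields: report all as missing.
        if (root.ValueKind != JsonValueKind.Object)
        {
            missing.AddRange(required);
            return missing;
        }

        foreach (string name in required)
        {
            bool present = root.TryGetProperty(name, out JsonElement value)
                           && value.ValueKind != JsonValueKind.Null;
            if (!present) missing.Add(name);
        }

        return missing;
    }
}
```

For the invoice schema, passing the extracted JSON together with `new[] { "VendorName", "Total" }` returns an empty list when both fields are populated. A non-empty list is a reasonable trigger for manual review.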

Step 4: Processing File Attachments

The pipeline works with files (PDFs, images, Word documents) just as easily as raw text:

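// Schemas in the same order as the categories array, so a category index selects its schema
var schemas = new[] { invoiceSchema, ticketSchema, contractSchema };

// Classify a PDF file
var attachment = new Attachment("incoming-document.pdf");
int typeIndex = categorizer.GetBestCategory(categories, descriptions, attachment);

// A negative index means no category matched (AllowUnknownCategory = true)
if (typeIndex >= 0)
{
    // Extract from the same file
    extractor.Elements = schemas[typeIndex];
    extractor.SetContent(attachment);
    var result = extractor.Parse();
    Console.WriteLine(result.Json);
}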

Step 5: Multi-Label Classification

Some documents may belong to multiple categories. Use GetTopCategories to get ranked matches:

string[] tags = { "billing", "technical", "account_access", "feature_request", "bug_report" };

string ticket = "I can't access my billing dashboard to download invoices. " +
                "The page shows error 500. This started after the latest update.";

List<int> topTags = categorizer.GetTopCategories(tags, ticket, maxCategories: 3);

Console.WriteLine("Tags:");
foreach (int tagIndex in topTags)
{
    Console.WriteLine($"  {tags[tagIndex]}");
}

// Confidence is a single property, so it cannot report a separate score per tag
Console.WriteLine($"Confidence: {categorizer.Confidence:P0}");
// Possible tags for this ticket: billing, account_access, bug_report

Step 6: Confidence Thresholds

In production, reject low-confidence classifications rather than processing them incorrectly:

const float MinClassificationConfidence = 0.70f;
const float MinExtractionConfidence = 0.60f;

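// Runs inside the foreach loop from Step 3. `schemas` is assumed to hold the three
// schema lists in the same order as `categories`.
int typeIndex = categorizer.GetBestCategory(categories, doc);

// A negative index means no category matched; treat it like a low-confidence result
if (typeIndex < 0 || categorizer.Confidence < MinClassificationConfidence)
{
    Console.WriteLine("Low classification confidence. Routing to manual review.");
    continue;
}

extractor.Elements = schemas[typeIndex];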
extractor.SetContent(doc);
var result = extractor.Parse();

if (result.Confidence < MinExtractionConfidence)
{
    Console.WriteLine("Low extraction confidence. Flagging for human verification.");
}
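
The two thresholds above can also be expressed as a small pure function, which keeps the routing policy unit-testable without loading a model. This is only a sketch: `ReviewDecision` and `ConfidenceGate` are hypothetical names, and the threshold values are the ones used in this step. In a real pipeline the classification check runs before extraction, so the extraction score would only be available for documents that pass the first gate.

```csharp
using System;

// Hypothetical outcome labels for the two-stage confidence check.
public enum ReviewDecision { Accept, ManualReview, VerifyExtraction }

public static class ConfidenceGate
{
    public const float MinClassificationConfidence = 0.70f;
    public const float MinExtractionConfidence = 0.60f;

    // Applies the classification threshold first, then the extraction threshold.
    public static ReviewDecision Decide(float classification, float extraction)
    {
        if (classification < MinClassificationConfidence)
            return ReviewDecision.ManualReview;

        if (extraction < MinExtractionConfidence)
            return ReviewDecision.VerifyExtraction;

        return ReviewDecision.Accept;
    }
}
```

For example, `ConfidenceGate.Decide(0.65f, 0.90f)` returns `ManualReview`, because the classification score is below 0.70 regardless of how well extraction scored.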

Step 7: Adding Category Descriptions and Guidance

Improve classification accuracy with descriptions and extraction guidance:

// Descriptions help the model distinguish similar categories
string[] categories = { "purchase_order", "invoice", "credit_note" };
string[] descriptions =
{
    "A request to buy goods or services, issued by the buyer before delivery",
    "A bill for goods or services already delivered, issued by the seller",
    "A document reducing the amount owed, issued by the seller as a correction"
};

int result = categorizer.GetBestCategory(categories, descriptions, documentText);

// Guidance helps extraction understand domain-specific formatting
extractor.Guidance = "This is a European document. Dates are in DD/MM/YYYY format. " +
                     "Amounts use comma as decimal separator (e.g., 1.234,56).";

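If a field comes back as a raw string in the European formats described in the guidance text, downstream code still has to parse it. The sketch below shows one way to do that with standard .NET APIs only. `EuropeanFormats` is a hypothetical helper, and whether extracted values arrive as raw strings or already normalized is an assumption that should be checked against actual output.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for values written as DD/MM/YYYY and 1.234,56.
public static class EuropeanFormats
{
    // '.' groups thousands and ',' marks decimals, matching "1.234,56".
    private static readonly NumberFormatInfo AmountFormat = new NumberFormatInfo
    {
        NumberDecimalSeparator = ",",
        NumberGroupSeparator = "."
    };

    public static bool TryParseAmount(string text, out decimal amount) =>
        decimal.TryParse(text, NumberStyles.Number, AmountFormat, out amount);

    // Accepts only the exact day/month/year layout, e.g. "15/01/2025".
    public static bool TryParseDate(string text, out DateTime date) =>
        DateTime.TryParseExact(text, "dd/MM/yyyy", CultureInfo.InvariantCulture,
                               DateTimeStyles.None, out date);
}
```

Building the `NumberFormatInfo` by hand avoids depending on a specific installed culture, so the parsing behaves the same on machines configured for invariant globalization.
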
Common Issues

| Problem | Cause | Fix |
|---------|-------|-----|
| Wrong category assigned | Categories too similar without descriptions | Add descriptions to distinguish them; use Guidance for context |
| Extraction returns nulls | Schema descriptions too vague | Write specific descriptions: "Total amount in USD including tax" instead of "Total" |
| Unknown documents classified anyway | AllowUnknownCategory is false (default) | Set AllowUnknownCategory = true to handle unrecognized types |
| Slow on large documents | Model processes entire text | Set categorizer.MaxInputTokens to limit analysis to first N tokens |
| Pipeline fails on PDFs | Missing native PDF library | Ensure LM-Kit.NET native binaries are in the output directory |

Next Steps