Classify Documents with Custom Categories
Every organization has its own taxonomy: support ticket types, document classes, content tags, compliance labels. LM-Kit.NET's Categorization class sorts text and documents into any set of categories you define, without training data. You provide category names and optional descriptions, and the model classifies content on the spot. This tutorial builds a flexible document classifier that covers single-label and multi-label classification, unknown-category handling, file attachments, batch export, and guidance.
Why Zero-Shot Classification Matters
Two enterprise problems that custom classification solves:
- No training data required. Traditional classifiers need hundreds of labeled examples per category. Zero-shot classification works immediately with just category names and descriptions. Change your taxonomy at runtime without retraining.
- Adapt to any domain. Support tickets, legal documents, medical records, compliance reports. Each domain has unique categories. A single model handles all of them by swapping the category list.
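Because the taxonomy is only data, it can live in a configuration file and be reloaded without touching the model. The following is a minimal sketch, assuming a hypothetical JSON layout and a hypothetical `ParseTaxonomy` helper (neither is part of LM-Kit.NET). It produces the parallel name and description arrays that the steps below pass to the classifier.

```csharp
using System;
using System.Linq;
using System.Text.Json;

// Hypothetical helper (not part of LM-Kit.NET): turns a JSON array of
// { "name": ..., "description": ... } objects into two parallel arrays.
(string[] Names, string[] Descriptions) ParseTaxonomy(string json)
{
    using JsonDocument doc = JsonDocument.Parse(json);
    var entries = doc.RootElement.EnumerateArray()
        .Select(e => (
            Name: e.GetProperty("name").GetString() ?? "",
            Description: e.TryGetProperty("description", out var d) ? d.GetString() ?? "" : ""))
        .ToList();

    // Reject empty taxonomies and unnamed categories early.
    if (entries.Count == 0 || entries.Any(e => string.IsNullOrWhiteSpace(e.Name)))
        throw new ArgumentException("Every category needs a non-empty name.");

    return (entries.Select(e => e.Name).ToArray(),
            entries.Select(e => e.Description).ToArray());
}

// Assumed file layout; in practice this string could come from File.ReadAllText.
string taxonomyJson = """
[
  { "name": "billing", "description": "Questions about invoices, payments, or refunds" },
  { "name": "bug_report", "description": "Reports of software defects" }
]
""";

var (loadedNames, loadedDescriptions) = ParseTaxonomy(taxonomyJson);
Console.WriteLine(string.Join(", ", loadedNames));
```

Keeping names and descriptions in one record per category avoids the two arrays drifting out of sync when the taxonomy is edited.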
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 4+ GB |
| Disk | ~3 GB free for model download |
Step 1: Create the Project
dotnet new console -n ClassifyQuickstart
cd ClassifyQuickstart
dotnet add package LM-Kit.NET
Step 2: Single-Label Classification
Assign each document to exactly one category:
using System.Text;
using LMKit.Model;
using LMKit.TextAnalysis;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Define categories
// ──────────────────────────────────────
string[] categories =
{
"billing",
"technical_support",
"feature_request",
"bug_report",
"account_management"
};
string[] descriptions =
{
"Questions about invoices, payments, refunds, or subscription pricing",
"Issues with product functionality, errors, crashes, or performance",
"Requests for new features, improvements, or integrations",
"Reports of software defects, incorrect behavior, or unexpected results",
"Account creation, deletion, password resets, or profile changes"
};
var classifier = new Categorization(model)
{
AllowUnknownCategory = false
};
// ──────────────────────────────────────
// 3. Classify support tickets
// ──────────────────────────────────────
string[] tickets =
{
"I was charged twice for my subscription this month. Please refund the extra payment.",
"The export feature crashes when I try to generate a PDF with more than 50 pages.",
"It would be great if you could add dark mode to the web interface.",
"My password reset email never arrived. I've checked spam folders.",
"The search results are wrong. Searching for 'invoices' returns customer profiles."
};
Console.WriteLine("Classifying support tickets:\n");
foreach (string ticket in tickets)
{
int index = classifier.GetBestCategory(categories, descriptions, ticket);
string label = categories[index];
float confidence = classifier.Confidence;
Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write($" [{label,-20}]");
Console.ResetColor();
Console.ForegroundColor = ConsoleColor.DarkGray;
Console.Write($" ({confidence:P0}) ");
Console.ResetColor();
Console.WriteLine(ticket.Length > 55 ? ticket.Substring(0, 55) + "..." : ticket);
}
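The confidence value printed above can also drive routing decisions. As a hedged sketch, the hypothetical `RouteByConfidence` helper below keeps the predicted label only when the confidence clears a threshold and otherwise sends the ticket to a review queue. The 0.60 threshold and the `needs_review` queue name are arbitrary assumptions, and the helper assumes confidence is reported on a 0 to 1 scale, as the percentage formatting in this step suggests.

```csharp
using System;

// Hypothetical helper (not part of LM-Kit.NET): returns the predicted label when
// confidence meets the threshold, otherwise a fallback queue name.
string RouteByConfidence(string label, float confidence,
    float threshold = 0.60f, string fallback = "needs_review")
{
    if (threshold < 0f || threshold > 1f)
        throw new ArgumentOutOfRangeException(nameof(threshold));
    return confidence >= threshold ? label : fallback;
}

// The literals below stand in for categories[index] and classifier.Confidence.
Console.WriteLine(RouteByConfidence("billing", 0.93f));     // billing
Console.WriteLine(RouteByConfidence("bug_report", 0.41f));  // needs_review
```

A suitable threshold depends on the model and the taxonomy, so it is worth tuning against a small set of tickets with known labels.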
Step 3: Multi-Label Classification
Some content fits multiple categories. Use GetTopCategories to detect all applicable labels:
using System.Text;
using LMKit.Model;
using LMKit.TextAnalysis;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Create the classifier and define tags
// ──────────────────────────────────────
var classifier = new Categorization(model)
{
AllowUnknownCategory = false
};
string[] tags =
{
"performance", "security", "usability",
"documentation", "compatibility", "accessibility"
};
string[] tagDescriptions =
{
"Speed, latency, resource usage, or throughput issues",
"Authentication, authorization, encryption, or vulnerability concerns",
"User interface clarity, workflow simplicity, or ease of use",
"Missing, outdated, or incorrect documentation",
"Cross-platform, browser, or version compatibility problems",
"Screen reader support, keyboard navigation, or WCAG compliance"
};
string feedback = """
The login page takes 8 seconds to load on mobile browsers and the password
field doesn't support screen readers. Also, the security documentation hasn't
been updated since the OAuth migration last year.
""";
List<int> matchingTags = classifier.GetTopCategories(
tags, tagDescriptions, feedback, maxCategories: 3);
Console.WriteLine("Applicable tags:");
foreach (int idx in matchingTags)
{
Console.WriteLine($" {tags[idx]}");
}
Step 4: Unknown Category Handling
When content does not fit any predefined category, enable unknown category detection:
using System.Text;
using LMKit.Model;
using LMKit.TextAnalysis;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
var flexibleClassifier = new Categorization(model)
{
AllowUnknownCategory = true
};
string[] departmentCategories = { "engineering", "marketing", "sales", "hr" };
string[] inputs =
{
"We need to refactor the authentication module before the release.",
"The new campaign assets are ready for review.",
"A customer in the lobby is asking for a product demo.",
"The office air conditioning is broken and it's 35 degrees inside."
};
foreach (string input in inputs)
{
int index = flexibleClassifier.GetBestCategory(departmentCategories, input);
if (index == -1)
{
Console.ForegroundColor = ConsoleColor.Yellow;
Console.Write(" [unknown] ");
}
else
{
Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write($" [{departmentCategories[index],-20}]");
}
Console.ResetColor();
Console.ForegroundColor = ConsoleColor.DarkGray;
Console.Write($" ({flexibleClassifier.Confidence:P0}) ");
Console.ResetColor();
Console.WriteLine(input.Length > 55 ? input.Substring(0, 55) + "..." : input);
}
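Once unknown detection is enabled, indexing the category array directly is no longer safe, because the result may be -1. One way to centralize that check is a small lookup helper. The sketch below uses a hypothetical `LabelOrUnknown` function that is not part of LM-Kit.NET; it treats -1 as the unknown label and rejects any other out-of-range index.

```csharp
using System;

// Hypothetical helper: maps a classifier result to a label, treating -1 as "unknown".
string LabelOrUnknown(string[] categories, int index, string unknownLabel = "unknown")
{
    if (index == -1) return unknownLabel;
    if (index < 0 || index >= categories.Length)
        throw new ArgumentOutOfRangeException(nameof(index));
    return categories[index];
}

string[] departments = { "engineering", "marketing", "sales", "hr" };
Console.WriteLine(LabelOrUnknown(departments, 0));   // engineering
Console.WriteLine(LabelOrUnknown(departments, -1));  // unknown
```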
Step 5: Classify Documents and Images
Classify file attachments (PDFs, images, Office documents):
using LMKit.Data;
string[] docTypes = { "invoice", "contract", "receipt", "report", "letter" };
string[] docDescriptions =
{
"A bill requesting payment for goods or services",
"A legal agreement between two or more parties",
"Proof of payment or transaction confirmation",
"An analytical document with findings, data, or recommendations",
"Formal correspondence addressed to a person or organization"
};
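string filePath = "scanned_document.pdf";
// Wrap the file so the classifier can read its content.
var attachment = new Attachment(filePath);
// `classifier` is the Categorization instance created in Step 2.
int index = classifier.GetBestCategory(docTypes, docDescriptions, attachment);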
string docType = docTypes[index];
Console.WriteLine($"Document: {Path.GetFileName(filePath)}");
Console.WriteLine($"Type: {docType} ({classifier.Confidence:P0})");
Step 6: Batch Classification with Export
Process a directory of files and export classification results:
using System.Text;
using LMKit.Model;
using LMKit.TextAnalysis;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Define categories
// ──────────────────────────────────────
string[] categories =
{
"billing",
"technical_support",
"feature_request",
"bug_report",
"account_management"
};
string[] descriptions =
{
"Questions about invoices, payments, refunds, or subscription pricing",
"Issues with product functionality, errors, crashes, or performance",
"Requests for new features, improvements, or integrations",
"Reports of software defects, incorrect behavior, or unexpected results",
"Account creation, deletion, password resets, or profile changes"
};
var classifier = new Categorization(model)
{
AllowUnknownCategory = false
};
// ──────────────────────────────────────
// 3. Classify files and export to CSV
// ──────────────────────────────────────
string[] files = Directory.GetFiles("inbox", "*.*")
.Where(f => new[] { ".txt", ".pdf", ".docx" }
.Contains(Path.GetExtension(f).ToLowerInvariant()))
.ToArray();
var output = new List<string>();
output.Add("file,category,confidence");
Console.WriteLine($"Classifying {files.Length} documents...\n");
foreach (string file in files)
{
// Wrap each file in an Attachment (as in Step 5) so PDF and DOCX content is
// parsed properly; File.ReadAllText would decode their binary data as garbled text.
var attachment = new LMKit.Data.Attachment(file);
int idx = classifier.GetBestCategory(categories, descriptions, attachment);
string label = categories[idx];
string fileName = Path.GetFileName(file);
Console.WriteLine($" {fileName}: {label} ({classifier.Confidence:P0})");
output.Add($"\"{fileName}\",\"{label}\",{classifier.Confidence:F2}");
}
File.WriteAllLines("classification_results.csv", output);
Console.WriteLine($"\nExported to classification_results.csv");
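The export above wraps fields in quotes but does not escape quotes inside file names, and the `F2` format follows the machine's culture, which can emit a decimal comma. The following sketch shows one way to harden the row formatting. `CsvField` and `CsvRow` are hypothetical helpers written for illustration, following the quoting rules of RFC 4180.

```csharp
using System;
using System.Globalization;

// Quotes a CSV field only when needed and doubles embedded quotes (RFC 4180 style).
string CsvField(string value)
{
    bool needsQuotes = value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
    return needsQuotes ? "\"" + value.Replace("\"", "\"\"") + "\"" : value;
}

// Formats one result row. The confidence uses the invariant culture so the
// decimal separator is always '.', regardless of the machine's locale.
string CsvRow(string fileName, string label, float confidence) =>
    string.Join(",",
        CsvField(fileName),
        CsvField(label),
        confidence.ToString("F2", CultureInfo.InvariantCulture));

// The literals stand in for Path.GetFileName(file), categories[idx], and classifier.Confidence.
Console.WriteLine(CsvRow("Q3 report, final.pdf", "billing", 0.87f));
```

In the batch loop, `output.Add(CsvRow(fileName, label, classifier.Confidence))` would replace the interpolated string.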
Step 7: Using Guidance for Better Accuracy
The Guidance property provides context that helps the model make better decisions:
using System.Text;
using LMKit.Model;
using LMKit.TextAnalysis;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
return true;
},
loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Define categories
// ──────────────────────────────────────
string[] categories =
{
"billing",
"technical_support",
"feature_request",
"bug_report",
"account_management"
};
string[] descriptions =
{
"Questions about invoices, payments, refunds, or subscription pricing",
"Issues with product functionality, errors, crashes, or performance",
"Requests for new features, improvements, or integrations",
"Reports of software defects, incorrect behavior, or unexpected results",
"Account creation, deletion, password resets, or profile changes"
};
var classifier = new Categorization(model)
{
AllowUnknownCategory = false
};
// ──────────────────────────────────────
// 3. Add guidance and classify an ambiguous ticket
// ──────────────────────────────────────
classifier.Guidance = "These are customer support tickets from a SaaS company " +
"that sells project management software. " +
"Tickets about Gantt charts, task boards, and timelines are feature_request " +
"unless they describe something broken.";
string ambiguousTicket = "The Gantt chart doesn't show weekends. Can you add that?";
int idx = classifier.GetBestCategory(categories, descriptions, ambiguousTicket);
Console.WriteLine($"Category: {categories[idx]} ({classifier.Confidence:P0})");
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| Wrong category for ambiguous text | Categories overlap | Add clearer descriptions that distinguish edge cases |
| Everything classified as unknown | AllowUnknownCategory too aggressive | Set to false to force a best-match; or add broader categories |
| Low confidence scores | Categories too similar | Merge overlapping categories; add Guidance for context |
| Slow with many categories | Each category evaluated | Keep category lists under 20; use hierarchical classification for larger taxonomies |
| Image classification fails | Model not vision-capable | Use a vision-language model (VLM) such as gemma3:4b |
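The table suggests hierarchical classification for large taxonomies. One way to approach it is to classify in two passes: first pick a top-level category, then pick among only that category's children. The sketch below is an illustration under stated assumptions, not an LM-Kit.NET feature. The taxonomy, the `ClassifyPath` helper, and the keyword-based `FakePick` stand-in are all hypothetical; in a real program the `pickBest` delegate could wrap a call such as `classifier.GetBestCategory(cats, text)`.

```csharp
using System;
using System.Collections.Generic;

// Illustrative two-level taxonomy: each top-level category maps to its subcategories.
var taxonomy = new Dictionary<string, string[]>
{
    ["billing"] = new[] { "refund", "invoice_question", "pricing" },
    ["technical_support"] = new[] { "crash", "performance", "login_problem" }
};

// Hypothetical helper: pickBest returns the index of the best category, or -1 for none.
string ClassifyPath(Dictionary<string, string[]> tree, string text,
    Func<string[], string, int> pickBest)
{
    string[] parents = new string[tree.Count];
    tree.Keys.CopyTo(parents, 0);

    // Pass 1: choose among the top-level categories only.
    int p = pickBest(parents, text);
    if (p < 0) return "unknown";

    // Pass 2: choose among the children of the selected parent.
    string[] children = tree[parents[p]];
    int c = pickBest(children, text);
    return c < 0 ? parents[p] : parents[p] + "/" + children[c];
}

// Keyword-based stand-in so the sketch runs without a model: matches the first
// word of each category name against the lowercased text.
int FakePick(string[] cats, string text)
{
    string lower = text.ToLowerInvariant();
    return Array.FindIndex(cats, cat => lower.Contains(cat.Split('_')[0]));
}

Console.WriteLine(ClassifyPath(taxonomy, "Please refund the duplicate billing charge", FakePick));
```

Each pass sees only a handful of candidates, which keeps every individual category list short even when the full taxonomy is large.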
Next Steps
- Build a Classification and Extraction Pipeline: classify then extract structured data.
- Analyze Customer Sentiment at Scale: combine classification with sentiment analysis.
- Samples: Custom Classification: custom classification demo.
- Samples: Document Classification: document classification demo.