Extract Keywords from Text

Tagging articles, indexing documents, and building search features all require identifying the most important terms in a piece of text. LM-Kit.NET's KeywordExtraction class pulls the top keywords and key phrases from text and documents, with support for multi-word n-grams, a configurable keyword count, and language-aware extraction. This tutorial builds a keyword extraction tool that handles raw text, documents, and batches of files.


Why Local Keyword Extraction Matters

Two enterprise problems that on-device keyword extraction solves:

  1. Auto-tag content without exposing it. Internal documents, customer support tickets, and proprietary research papers need tags for search and categorization. Sending them to a cloud API for keyword extraction means a third party reads your content. Local extraction keeps the content private.
  2. Feed downstream pipelines. Keywords extracted locally feed into search indexes, recommendation engines, and topic clustering systems without network latency or API rate limits.

Prerequisites

| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 4+ GB |
| Disk | ~3 GB free for model download |

Step 1: Create the Project

dotnet new console -n KeywordQuickstart
cd KeywordQuickstart
dotnet add package LM-Kit.NET

Step 2: Basic Keyword Extraction

using System.Text;
using LMKit.Model;
using LMKit.TextAnalysis;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Extract keywords
// ──────────────────────────────────────
var extractor = new KeywordExtraction(model)
{
    KeywordCount = 8,
    MaxNgramSize = 3
};

string article = """
    Kubernetes has become the de facto standard for container orchestration in cloud-native
    applications. Organizations are increasingly adopting microservices architectures to
    improve scalability and deployment flexibility. Service mesh technologies like Istio
    provide observability, traffic management, and security for inter-service communication.
    Meanwhile, serverless computing platforms such as AWS Lambda and Azure Functions offer
    event-driven execution without infrastructure management, reducing operational overhead
    for development teams focused on rapid iteration.
    """;

List<KeywordExtraction.KeywordItem> keywords = extractor.ExtractKeywords(article);

Console.WriteLine($"Top {keywords.Count} keywords (confidence: {extractor.Confidence:P0}):\n");

for (int i = 0; i < keywords.Count; i++)
{
    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.Write($"  {i + 1}. ");
    Console.ResetColor();
    Console.WriteLine(keywords[i].Value);
}

Step 3: Configure Extraction Parameters

Tune the extraction for your use case:

using System.Text;
using LMKit.Model;
using LMKit.TextAnalysis;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Configure extractors
// ──────────────────────────────────────
// Few broad keywords for high-level categorization
var broadExtractor = new KeywordExtraction(model)
{
    KeywordCount = 3,
    MaxNgramSize = 1  // Single words only
};

// Many detailed key phrases for fine-grained indexing
var detailedExtractor = new KeywordExtraction(model)
{
    KeywordCount = 15,
    MaxNgramSize = 4  // Up to 4-word phrases
};

// Domain-focused extraction with guidance
var domainExtractor = new KeywordExtraction(model)
{
    KeywordCount = 10,
    MaxNgramSize = 3,
    Guidance = "Focus on technical terms, product names, and programming frameworks. " +
        "Ignore generic words like 'application' or 'system'."
};

string text = "React and Next.js are popular choices for building server-rendered web applications.";

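// Run the same text through each configuration to compare results
var broad = broadExtractor.ExtractKeywords(text);
var detailed = detailedExtractor.ExtractKeywords(text);
var domain = domainExtractor.ExtractKeywords(text);

Console.WriteLine("Broad:    " + string.Join(", ", broad.Select(k => k.Value)));
Console.WriteLine("Detailed: " + string.Join(", ", detailed.Select(k => k.Value)));
Console.WriteLine("Domain:   " + string.Join(", ", domain.Select(k => k.Value)));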
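
To make MaxNgramSize concrete: an n-gram is a run of n consecutive words, so a limit of 3 admits single words, two-word phrases, and three-word phrases. The sketch below only enumerates candidate word n-grams from a short string to show what that limit means. It is not how LM-Kit.NET selects keywords internally, and the helper name WordNgrams is hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: enumerate every run of 1..maxNgramSize consecutive words.
IEnumerable<string> WordNgrams(string text, int maxNgramSize)
{
    string[] words = text.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
    for (int n = 1; n <= maxNgramSize; n++)
        for (int start = 0; start + n <= words.Length; start++)
            yield return string.Join(" ", words, start, n);
}

foreach (string gram in WordNgrams("container orchestration platform", 2))
    Console.WriteLine(gram);

// Prints:
// container
// orchestration
// platform
// container orchestration
// orchestration platform
```

With a limit of 1 only the first three lines would appear, which is why MaxNgramSize = 1 yields single-word keywords.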

Step 4: Extract Keywords from Documents

Process PDFs, Word documents, and images:

using LMKit.Data;
using LMKit.TextAnalysis;

// Assumes `model` has already been loaded as shown in Step 2.
var extractor = new KeywordExtraction(model)
{
    KeywordCount = 10,
    MaxNgramSize = 3
};

string filePath = "research_paper.pdf";
var attachment = new Attachment(filePath);

List<KeywordExtraction.KeywordItem> docKeywords = extractor.ExtractKeywords(attachment);

Console.WriteLine($"Keywords from {Path.GetFileName(filePath)}:\n");
foreach (var kw in docKeywords)
{
    Console.WriteLine($"  {kw.Value}");
}

Step 5: Batch Keyword Extraction

Tag a collection of documents and export results:

using System.Text;
using LMKit.Model;
using LMKit.TextAnalysis;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Extract keywords from each file
// ──────────────────────────────────────
string[] files = Directory.GetFiles("articles", "*.txt");
var output = new List<string>();
output.Add("file,keywords");

var extractor = new KeywordExtraction(model)
{
    KeywordCount = 5,
    MaxNgramSize = 3
};

Console.WriteLine($"Extracting keywords from {files.Length} files...\n");

foreach (string file in files)
{
    string content = File.ReadAllText(file);
    var keywords = extractor.ExtractKeywords(content);
    string fileName = Path.GetFileName(file);
    string tags = string.Join("; ", keywords.Select(k => k.Value));

    Console.WriteLine($"  {fileName}: {tags}");

    output.Add($"\"{fileName}\",\"{tags}\"");
}

File.WriteAllLines("keyword_index.csv", output);
Console.WriteLine($"\nExported to keyword_index.csv");
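
The export wraps each field in quotes but does not escape quotes inside a field, so a file name or keyword containing a double quote would produce a malformed row. Below is a minimal sketch of field escaping that follows the common CSV convention described in RFC 4180 (double any embedded quote, then wrap the field in quotes). CsvField is a hypothetical helper and not part of LM-Kit.NET.

```csharp
using System;

// Hypothetical helper: quote a CSV field, doubling any embedded double quotes.
string CsvField(string value) => "\"" + value.Replace("\"", "\"\"") + "\"";

string fileName = "notes.txt";
string tags = "service mesh; \"zero trust\" networking";

Console.WriteLine($"{CsvField(fileName)},{CsvField(tags)}");
// Prints: "notes.txt","service mesh; ""zero trust"" networking"
```

Inside the batch loop, the row could then be built as output.Add($"{CsvField(fileName)},{CsvField(tags)}").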

Step 6: Language-Specific Extraction

Extract keywords in a specific language:

using LMKit.Global;
using LMKit.TextAnalysis;

// Assumes `model` has already been loaded as shown in Step 2.
var extractor = new KeywordExtraction(model)
{
    KeywordCount = 5,
    MaxNgramSize = 3,
    TargetLanguage = Language.French
};

string frenchText = """
    L'intelligence artificielle transforme les secteurs de la santé et de la finance.
    Les algorithmes d'apprentissage automatique permettent une analyse prédictive plus
    précise, tandis que le traitement du langage naturel améliore l'interaction
    homme-machine dans les applications grand public.
    """;

var keywords = extractor.ExtractKeywords(frenchText);

Console.WriteLine("French keywords:");
foreach (var kw in keywords)
{
    Console.WriteLine($"  {kw.Value}");
}

When TargetLanguage is set to Language.Undefined (the default), the model auto-detects the language.


Common Issues

| Problem | Cause | Fix |
|---|---|---|
| Too many generic keywords | No guidance provided | Add Guidance to steer toward domain-specific terms |
| Single-character keywords | MaxNgramSize set to 1 with short text | Increase MaxNgramSize to 2 or 3 for multi-word phrases |
| Keywords not in source language | TargetLanguage left at Language.Undefined and auto-detection picked the wrong language | Set TargetLanguage to the source language explicitly |
| Slow on long documents | Document exceeds context window | Set MaximumContextLength to limit input size |
| Duplicate keywords | Similar phrases counted separately | Post-process results to deduplicate by stemming or similarity |
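
For the duplicate-keyword case, here is a minimal post-processing sketch. It deduplicates by a normalized key: lowercased, whitespace collapsed, and a naive trailing-"s" strip standing in for real stemming. The helper names NormalizeKey and Deduplicate are hypothetical, the input strings stand in for the Value of each extracted keyword, and a production pipeline would likely use a proper stemmer or embedding similarity instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: build a comparison key for a keyword.
// The plural strip is a crude stand-in for stemming, for illustration only.
string NormalizeKey(string keyword)
{
    var words = keyword
        .ToLowerInvariant()
        .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
        .Select(w => w.Length > 3 && w.EndsWith('s') && !w.EndsWith("ss", StringComparison.Ordinal)
            ? w[..^1]
            : w);
    return string.Join(" ", words);
}

// Hypothetical helper: keep the first keyword seen for each key, preserving order.
List<string> Deduplicate(IEnumerable<string> keywords)
{
    var seen = new HashSet<string>();
    var result = new List<string>();
    foreach (string keyword in keywords)
    {
        if (seen.Add(NormalizeKey(keyword)))
            result.Add(keyword);
    }
    return result;
}

var raw = new[] { "Microservices", "microservice", "Service Mesh", "service  mesh", "Kubernetes" };
Console.WriteLine(string.Join(", ", Deduplicate(raw)));
// Prints: Microservices, Service Mesh, Kubernetes
```

The naive plural strip misses irregular forms (for example "meshes" becomes "meshe"), which is one reason to swap in a real stemmer for anything beyond a quick cleanup pass.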
