Build a Document Summarization Pipeline for Large Archives

Organizations accumulate vast document archives: years of meeting minutes, research papers, regulatory filings, and internal memos. Finding relevant information across thousands of documents requires either reading them all or having structured summaries. LM-Kit.NET's Summarizer class generates both titles and content summaries from text, PDFs, and images, with built-in overflow handling for documents that exceed the model's context window. This tutorial builds a batch summarization pipeline that processes entire document archives and produces a searchable summary catalog.


Why Local Document Summarization Matters

Two enterprise problems that on-device summarization solves:

  1. Confidential document archives. Legal discovery, M&A due diligence, and internal audit processes require summarizing thousands of documents that contain privileged or confidential information. Cloud-based summarization services create data exposure risk and may violate legal holds. Local processing keeps every document on your infrastructure.
  2. Knowledge management at scale. Engineering teams, research labs, and consulting firms accumulate years of project reports and technical documents. A summarization pipeline creates a searchable index of document summaries, enabling staff to find relevant prior work without reading full documents.

Prerequisites

Requirement   Minimum
-----------   -------
.NET SDK      8.0+
VRAM          4+ GB
Disk          ~3 GB free for model download

Step 1: Create the Project

dotnet new console -n DocSummarizer
cd DocSummarizer
dotnet add package LM-Kit.NET

Step 2: Understand the Summarizer

                    ┌───────────────────┐
                    │    Summarizer     │
                    ├───────────────────┤
  Input ──────────► │  Summarize()      │
  (text/PDF/image)  │                   │
                    │  Overflow?        │
                    │  ├── Truncate     │
                    │  ├── Recursive    │◄── splits, summarizes
                    │  │   Summarize    │    each chunk, merges
                    │  └── Exception    │
                    │                   │
                    │  Output:          │
                    │  ├── Title        │
                    │  └── Content      │
                    └───────────────────┘

Property           Default              Purpose
--------           -------              -------
MaxContentWords    200                  Maximum words in the summary
MaxTitleWords      10                   Maximum words in the title
GenerateTitle      true                 Include a generated title
GenerateContent    true                 Include summary content
OverflowStrategy   RecursiveSummarize   How to handle documents exceeding the context window
Guidance           empty                Custom instructions for the summarization
Intent             Classification       Classification (label/categorize) or Abstraction (rewrite in own words)

Step 3: Summarize a Single Document

using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Create the summarizer
// ──────────────────────────────────────
var summarizer = new Summarizer(model)
{
    MaxContentWords = 150,
    MaxTitleWords = 10,
    GenerateTitle = true,
    GenerateContent = true,
    Intent = Summarizer.SummarizationIntent.Abstraction
};

// ──────────────────────────────────────
// 3. Summarize a text string
// ──────────────────────────────────────
string reportText =
    "The Q3 2024 quarterly review meeting was held on October 12, 2024, with all department heads " +
    "present. Key highlights: Revenue grew 15% year-over-year to $47.2M, exceeding the $44M target. " +
    "The engineering team shipped version 3.0 of the platform with 42 new features, reducing customer " +
    "churn by 8%. Marketing launched the enterprise campaign in September, generating 340 qualified leads. " +
    "HR reported 12 new hires in engineering and 5 in sales, with overall headcount reaching 287. " +
    "Challenges discussed: supply chain delays affecting hardware shipments, increased cloud infrastructure " +
    "costs (+22%), and two key competitor product launches. Action items: CFO to present cost optimization " +
    "plan by November 1, CTO to evaluate multi-cloud strategy, VP Sales to accelerate Q4 pipeline.";

Console.WriteLine("=== Single Document Summary ===\n");

Summarizer.SummarizerResult result = summarizer.Summarize(reportText);

Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine($"Title:   {result.Title}");
Console.ResetColor();
Console.WriteLine($"Summary: {result.Summary}");
Console.WriteLine();

Step 4: Summarize PDF Documents

Process PDF files directly using Attachment:

using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Create the summarizer
// ──────────────────────────────────────
var summarizer = new Summarizer(model)
{
    MaxContentWords = 150,
    MaxTitleWords = 10,
    GenerateTitle = true,
    GenerateContent = true,
    Intent = Summarizer.SummarizationIntent.Abstraction
};

Console.WriteLine("=== PDF Summarization ===\n");

string pdfPath = "annual-report.pdf";

if (File.Exists(pdfPath))
{
    var attachment = new Attachment(pdfPath);

    Console.Write($"Summarizing {Path.GetFileName(pdfPath)}... ");

    Summarizer.SummarizerResult pdfResult = summarizer.Summarize(attachment);

    Console.WriteLine("done.\n");
    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine($"Title:   {pdfResult.Title}");
    Console.ResetColor();
    Console.WriteLine($"Summary: {pdfResult.Summary}");
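}
else
{
    // Report the missing file instead of exiting silently
    Console.WriteLine($"File not found: {Path.GetFullPath(pdfPath)}");
}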

Step 5: Handle Large Documents with Overflow Strategies

When a document exceeds the model's context window, the OverflowStrategy controls behavior:

using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Create the summarizer
// ──────────────────────────────────────
var summarizer = new Summarizer(model)
{
    MaxContentWords = 150,
    MaxTitleWords = 10,
    GenerateTitle = true,
    GenerateContent = true,
    Intent = Summarizer.SummarizationIntent.Abstraction
};

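// The three assignments below are alternatives shown side by side. Keep only one:
// if all three run in sequence, the last assignment wins.

// Strategy 1: Recursive Summarize (default, recommended)
// Splits the document, summarizes each part, then summarizes the summaries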
summarizer.OverflowStrategy = Summarizer.OverflowResolutionStrategy.RecursiveSummarize;

// Strategy 2: Truncate
// Keeps only the content that fits within the context window
summarizer.OverflowStrategy = Summarizer.OverflowResolutionStrategy.Truncate;

// Strategy 3: Exception
// Throws an exception when content exceeds the limit
summarizer.OverflowStrategy = Summarizer.OverflowResolutionStrategy.RaiseException;

Strategy             Speed     Quality   Use When
--------             -----     -------   --------
RecursiveSummarize   Slower    Best      You need full-document coverage and cannot miss details
Truncate             Fastest   Lower     Executive summaries where the beginning contains key information
RaiseException       N/A       N/A       You need to control chunking manually
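
The split, summarize, and merge loop behind RecursiveSummarize can be sketched without a model. This is a conceptual illustration only, not LM-Kit.NET's internal implementation: FakeSummarize is a hypothetical stand-in for a model call (it just keeps the first few words), and the limits are counted in words rather than tokens.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Demo: 50 numbered words, a 20-word "window", 5-word summaries.
string doc = string.Join(" ", Enumerable.Range(1, 50).Select(i => $"w{i}"));
Console.WriteLine(RecursiveSummarize(doc, windowWords: 20, summaryWords: 5));
// prints "w1 w2 w3 w4 w5"

// Splits text into words on spaces.
static string[] Words(string text) =>
    text.Split(' ', StringSplitOptions.RemoveEmptyEntries);

// Hypothetical stand-in for a model call: keeps the first maxWords words.
static string FakeSummarize(string text, int maxWords) =>
    string.Join(" ", Words(text).Take(maxWords));

static string RecursiveSummarize(string text, int windowWords, int summaryWords)
{
    // Each pass must shrink the text, otherwise the recursion would not end.
    if (summaryWords >= windowWords)
        throw new ArgumentException("summaryWords must be smaller than windowWords.");

    string[] words = Words(text);
    if (words.Length <= windowWords)
        return FakeSummarize(text, summaryWords);

    // Split into window-sized chunks and summarize each one.
    var partials = new List<string>();
    for (int i = 0; i < words.Length; i += windowWords)
    {
        string chunk = string.Join(" ", words.Skip(i).Take(windowWords));
        partials.Add(FakeSummarize(chunk, summaryWords));
    }

    // Merge the partial summaries and repeat until the text fits the window.
    return RecursiveSummarize(string.Join(" ", partials), windowWords, summaryWords);
}
```

The sketch also shows why recursive summarization is slower than truncation: every chunk costs one summarization call, and the merged result may need further passes.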

Step 6: Batch Archive Summarization

Process an entire folder of documents and generate a summary catalog:

using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

Console.WriteLine("=== Batch Archive Summarization ===\n");

string archiveFolder = "documents";
string outputFile = "summary_catalog.csv";

if (!Directory.Exists(archiveFolder))
{
    Console.WriteLine($"Create a '{archiveFolder}' folder with documents, then run again.");
    return;
}

string[] supportedExtensions = { ".pdf", ".docx", ".txt", ".md", ".html" };

string[] files = Directory.GetFiles(archiveFolder)
    .Where(f => supportedExtensions.Contains(Path.GetExtension(f).ToLowerInvariant()))
    .OrderBy(f => f)
    .ToArray();

Console.WriteLine($"Found {files.Length} document(s) in '{archiveFolder}'\n");

var catalogLines = new List<string>();
catalogLines.Add("file,title,summary,words");

var summarizeBatch = new Summarizer(model)
{
    MaxContentWords = 100,
    MaxTitleWords = 8,
    GenerateTitle = true,
    GenerateContent = true,
    Intent = Summarizer.SummarizationIntent.Abstraction,
    OverflowStrategy = Summarizer.OverflowResolutionStrategy.RecursiveSummarize
};

int successCount = 0;
int failCount = 0;

foreach (string filePath in files)
{
    string fileName = Path.GetFileName(filePath);
    Console.Write($"  {fileName}... ");

    try
    {
        var attachment = new Attachment(filePath);
        Summarizer.SummarizerResult r = summarizeBatch.Summarize(attachment);

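        // Escape for CSV: double embedded quotes and flatten line breaks so each row stays on one line
        string safeName = fileName.Replace("\"", "\"\"");
        string title = (r.Title ?? "").Replace("\"", "\"\"").Replace("\r", "").Replace("\n", " ");
        string content = (r.Summary ?? "").Replace("\"", "\"\"").Replace("\r", "").Replace("\n", " ");
        int wordCount = content.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;

        catalogLines.Add($"\"{safeName}\",\"{title}\",\"{content}\",{wordCount}");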

        Console.ForegroundColor = ConsoleColor.Green;
        Console.WriteLine($"[{title}] ({wordCount} words)");
        Console.ResetColor();
        successCount++;
    }
    catch (Exception ex)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine($"FAILED: {ex.Message}");
        Console.ResetColor();
        failCount++;
    }
}

File.WriteAllLines(outputFile, catalogLines);

Console.WriteLine($"\n=== Batch Summary ===");
Console.WriteLine($"  Succeeded: {successCount}");
Console.WriteLine($"  Failed:    {failCount}");
Console.WriteLine($"  Catalog:   {Path.GetFullPath(outputFile)}");
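
Once the catalog exists, searching it is a matter of parsing the quoted CSV rows and matching a keyword against the title and summary columns. The following is a minimal sketch under stated assumptions: ParseCsvLine and SearchCatalog are hypothetical helpers, the sample rows are made up for illustration, and the parser handles only the single-line, double-quoted format written by the batch step. In practice the lines would come from File.ReadAllLines("summary_catalog.csv").

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Illustrative rows in the same layout as summary_catalog.csv (made-up data).
string[] catalog =
{
    "file,title,summary,words",
    "\"q3.pdf\",\"Q3 Revenue Review\",\"Revenue grew 15%, churn fell.\",5",
    "\"audit.txt\",\"Internal \"\"Audit\"\" Notes\",\"Supply chain delays, rising cloud costs.\",6",
};

foreach (var hit in SearchCatalog(catalog, "cloud"))
    Console.WriteLine($"{hit.File}: {hit.Title}");
// prints: audit.txt: Internal "Audit" Notes

// Splits one CSV line into fields, honoring quotes and doubled ("") quotes.
static List<string> ParseCsvLine(string line)
{
    var fields = new List<string>();
    var sb = new StringBuilder();
    bool inQuotes = false;
    for (int i = 0; i < line.Length; i++)
    {
        char c = line[i];
        if (inQuotes)
        {
            if (c == '"' && i + 1 < line.Length && line[i + 1] == '"') { sb.Append('"'); i++; }
            else if (c == '"') inQuotes = false;
            else sb.Append(c);
        }
        else if (c == '"') inQuotes = true;
        else if (c == ',') { fields.Add(sb.ToString()); sb.Clear(); }
        else sb.Append(c);
    }
    fields.Add(sb.ToString());
    return fields;
}

// Returns (file, title) for rows whose title or summary contains the keyword.
static List<(string File, string Title)> SearchCatalog(IEnumerable<string> lines, string keyword)
{
    var hits = new List<(string File, string Title)>();
    foreach (string line in lines.Skip(1)) // skip the header row
    {
        List<string> f = ParseCsvLine(line);
        if (f.Count < 3) continue;
        if (f[1].Contains(keyword, StringComparison.OrdinalIgnoreCase) ||
            f[2].Contains(keyword, StringComparison.OrdinalIgnoreCase))
            hits.Add((f[0], f[1]));
    }
    return hits;
}
```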

Step 7: Domain-Specific Summarization with Guidance

The Guidance property customizes summarization for specific domains:

using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Create the summarizer
// ──────────────────────────────────────
var summarizer = new Summarizer(model)
{
    MaxContentWords = 150,
    MaxTitleWords = 10,
    GenerateTitle = true,
    GenerateContent = true,
    Intent = Summarizer.SummarizationIntent.Abstraction
};

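// The assignments below are alternatives: set the one that matches your documents
// before calling Summarize(). If all four run in sequence, only the last applies.

// Legal document summaries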
summarizer.Guidance = "Focus on parties involved, key obligations, dates, " +
                      "financial terms, and termination conditions. " +
                      "Flag any unusual clauses or risk indicators.";

// Technical report summaries
summarizer.Guidance = "Focus on methodology, key findings, quantitative results, " +
                      "and actionable recommendations. Include specific numbers and metrics.";

// Meeting minutes summaries
summarizer.Guidance = "Focus on decisions made, action items assigned (with owners and deadlines), " +
                      "and unresolved issues. Skip routine status updates.";

// Financial document summaries
summarizer.Guidance = "Focus on revenue, expenses, profit margins, year-over-year changes, " +
                      "and forward guidance. Include all specific dollar amounts mentioned.";

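In a mixed archive, the guidance can be chosen per document rather than fixed once. The sketch below assumes a hypothetical layout in which each document type lives in its own subfolder (legal, technical, minutes, financial); the folder names and the GuidanceFor helper are illustrative and not part of LM-Kit.NET. The guidance strings are shortened versions of the ones above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Guidance per document type, keyed by a hypothetical subfolder name.
var guidanceByFolder = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
{
    ["legal"] = "Focus on parties involved, key obligations, dates, financial terms, and termination conditions.",
    ["technical"] = "Focus on methodology, key findings, quantitative results, and actionable recommendations.",
    ["minutes"] = "Focus on decisions made, action items assigned (with owners and deadlines), and unresolved issues.",
    ["financial"] = "Focus on revenue, expenses, profit margins, year-over-year changes, and forward guidance.",
};

Console.WriteLine(GuidanceFor(Path.Combine("documents", "legal", "nda.pdf")));

// Returns the guidance for the file's parent folder, or "" (no guidance) if the folder is unknown.
string GuidanceFor(string filePath)
{
    string folder = Path.GetFileName(Path.GetDirectoryName(filePath) ?? "") ?? "";
    return guidanceByFolder.TryGetValue(folder, out string? guidance) ? guidance : "";
}
```

Inside a batch loop, the result would be assigned to the Guidance property before each Summarize call.
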
Step 8: Multilingual Summarization

Generate summaries in a specific language, regardless of the source document language:

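using LMKit.TextGeneration;

// Assumes `summarizer` was created as in the previous steps, and that
// germanDocument and englishDocument are strings holding the source text.

// Summarize a German document into English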
summarizer.TargetLanguage = Language.English;
var result = summarizer.Summarize(germanDocument);

// Summarize any document into French
summarizer.TargetLanguage = Language.French;
var frResult = summarizer.Summarize(englishDocument);

// Auto-detect: summarize in the same language as the source (default)
summarizer.TargetLanguage = Language.Undefined;

Model Selection

Model ID     VRAM      Speed      Quality     Best For
--------     ----      -----      -------     --------
gemma3:4b    ~3.5 GB   Fast       Good        High-volume batch processing
qwen3:4b     ~3.5 GB   Fast       Good        Multilingual document archives
qwen3:8b     ~6 GB     Moderate   Very good   Technical and legal documents
gemma3:12b   ~8 GB     Slower     Excellent   Complex documents requiring nuanced understanding

For batch archive processing, gemma3:4b provides the best throughput. For legal or technical documents where precision matters, upgrade to qwen3:8b or larger.
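
The selection rule can be written down as a small helper. This is a rough sketch, not an official sizing guide: PickModelId is a hypothetical function, its thresholds are the approximate VRAM figures from the table, and real memory use also depends on context size and other workloads on the GPU.

```csharp
using System;

Console.WriteLine(PickModelId(vramGb: 6.0, precisionMatters: true));   // prints "qwen3:8b"
Console.WriteLine(PickModelId(vramGb: 12.0, precisionMatters: false)); // prints "gemma3:4b"

// Picks a model ID from the table: larger models only when precision matters
// and the approximate VRAM requirement is met; otherwise the fast default.
static string PickModelId(double vramGb, bool precisionMatters)
{
    if (precisionMatters && vramGb >= 8.0) return "gemma3:12b";
    if (precisionMatters && vramGb >= 6.0) return "qwen3:8b";
    return "gemma3:4b";
}
```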


Common Issues

Problem                                  Cause                                                          Fix
-------                                  -----                                                          ---
Summary too short or vague               MaxContentWords too low                                        Increase to 200+ for detailed summaries
Title not generated                      GenerateTitle is false                                         Set GenerateTitle = true
Out of memory on large PDFs              Document exceeds context window with RaiseException strategy   Switch to RecursiveSummarize strategy
Summaries miss second half of document   Using Truncate strategy on long documents                      Switch to RecursiveSummarize for full coverage
Wrong language in output                 TargetLanguage not set                                         Set TargetLanguage to the desired output language
Slow on large batches                    Sequential processing                                          Normal behavior; use a smaller model for throughput
