Build a Document Summarization Pipeline for Large Archives
Organizations accumulate vast document archives: years of meeting minutes, research papers, regulatory filings, and internal memos. Finding relevant information across thousands of documents requires either reading them all or having structured summaries. LM-Kit.NET's Summarizer class generates both titles and content summaries from text, PDFs, and images, with built-in overflow handling for documents that exceed the model's context window. This tutorial builds a batch summarization pipeline that processes entire document archives and produces a searchable summary catalog.
Why Local Document Summarization Matters
Two enterprise problems that on-device summarization solves:
- Confidential document archives. Legal discovery, M&A due diligence, and internal audit processes require summarizing thousands of documents that contain privileged or confidential information. Cloud-based summarization services create data exposure risk and may violate legal holds. Local processing keeps every document on your infrastructure.
- Knowledge management at scale. Engineering teams, research labs, and consulting firms accumulate years of project reports and technical documents. A summarization pipeline creates a searchable index of document summaries, enabling staff to find relevant prior work without reading full documents.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 4+ GB |
| Disk | ~3 GB free for model download |
Step 1: Create the Project
dotnet new console -n DocSummarizer
cd DocSummarizer
dotnet add package LM-Kit.NET
Step 2: Understand the Summarizer
                    ┌───────────────────┐
                    │    Summarizer     │
                    ├───────────────────┤
  Input ──────────► │  Summarize()      │
  (text/PDF/image)  │                   │
                    │  Overflow?        │
                    │  ├── Truncate     │
                    │  ├── Recursive    │◄── splits, summarizes
                    │  │   Summarize    │    each chunk, merges
                    │  └── Exception    │
                    │                   │
                    │  Output:          │
                    │  ├── Title        │
                    │  └── Content      │
                    └───────────────────┘
| Property | Default | Purpose |
|---|---|---|
| MaxContentWords | 200 | Maximum words in the summary |
| MaxTitleWords | 10 | Maximum words in the title |
| GenerateTitle | true | Include a generated title |
| GenerateContent | true | Include summary content |
| OverflowStrategy | RecursiveSummarize | How to handle documents exceeding context window |
| Guidance | empty | Custom instructions for the summarization |
| Intent | Classification | Classification (label/categorize) or Abstraction (rewrite in own words) |
Step 3: Summarize a Single Document
using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Create the summarizer
// ──────────────────────────────────────
var summarizer = new Summarizer(model)
{
    MaxContentWords = 150,
    MaxTitleWords = 10,
    GenerateTitle = true,
    GenerateContent = true,
    Intent = Summarizer.SummarizationIntent.Abstraction
};

// ──────────────────────────────────────
// 3. Summarize a text string
// ──────────────────────────────────────
string reportText =
    "The Q3 2024 quarterly review meeting was held on October 12, 2024, with all department heads " +
    "present. Key highlights: Revenue grew 15% year-over-year to $47.2M, exceeding the $44M target. " +
    "The engineering team shipped version 3.0 of the platform with 42 new features, reducing customer " +
    "churn by 8%. Marketing launched the enterprise campaign in September, generating 340 qualified leads. " +
    "HR reported 12 new hires in engineering and 5 in sales, with overall headcount reaching 287. " +
    "Challenges discussed: supply chain delays affecting hardware shipments, increased cloud infrastructure " +
    "costs (+22%), and two key competitor product launches. Action items: CFO to present cost optimization " +
    "plan by November 1, CTO to evaluate multi-cloud strategy, VP Sales to accelerate Q4 pipeline.";

Console.WriteLine("=== Single Document Summary ===\n");
Summarizer.SummarizerResult result = summarizer.Summarize(reportText);

Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine($"Title: {result.Title}");
Console.ResetColor();
Console.WriteLine($"Summary: {result.Summary}");
Console.WriteLine();
Step 4: Summarize PDF Documents
Process PDF files directly using Attachment:
using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Create the summarizer
// ──────────────────────────────────────
var summarizer = new Summarizer(model)
{
    MaxContentWords = 150,
    MaxTitleWords = 10,
    GenerateTitle = true,
    GenerateContent = true,
    Intent = Summarizer.SummarizationIntent.Abstraction
};

Console.WriteLine("=== PDF Summarization ===\n");
string pdfPath = "annual-report.pdf";

if (File.Exists(pdfPath))
{
    // Wrap the file in an Attachment so the summarizer reads the PDF directly
    var attachment = new Attachment(pdfPath);
    Console.Write($"Summarizing {Path.GetFileName(pdfPath)}... ");
    Summarizer.SummarizerResult pdfResult = summarizer.Summarize(attachment);
    Console.WriteLine("done.\n");

    Console.ForegroundColor = ConsoleColor.Yellow;
    Console.WriteLine($"Title: {pdfResult.Title}");
    Console.ResetColor();
    Console.WriteLine($"Summary: {pdfResult.Summary}");
}
else
{
    Console.WriteLine($"File not found: {pdfPath}");
}
Step 5: Handle Large Documents with Overflow Strategies
When a document exceeds the model's context window, the OverflowStrategy property determines what the summarizer does:
using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Create the summarizer
// ──────────────────────────────────────
var summarizer = new Summarizer(model)
{
    MaxContentWords = 150,
    MaxTitleWords = 10,
    GenerateTitle = true,
    GenerateContent = true,
    Intent = Summarizer.SummarizationIntent.Abstraction
};

// The three assignments below are alternatives: keep only the one you need.
// As written, the last assignment wins.

// Strategy 1: Recursive Summarize (default, recommended)
// Splits the document, summarizes each part, then summarizes the summaries
summarizer.OverflowStrategy = Summarizer.OverflowResolutionStrategy.RecursiveSummarize;

// Strategy 2: Truncate
// Keeps only the content that fits within the context window
summarizer.OverflowStrategy = Summarizer.OverflowResolutionStrategy.Truncate;

// Strategy 3: Exception
// Throws an exception when content exceeds the limit
summarizer.OverflowStrategy = Summarizer.OverflowResolutionStrategy.RaiseException;
| Strategy | Speed | Quality | Use When |
|---|---|---|---|
| RecursiveSummarize | Slower | Best | You need full-document coverage and cannot miss details |
| Truncate | Fastest | Lower | Executive summaries where the beginning contains key information |
| RaiseException | N/A | N/A | You need to control chunking manually |
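To make the RecursiveSummarize idea concrete, here is a minimal sketch of the general split, summarize, and merge technique. It is not LM-Kit's internal implementation: `Words`, `SplitIntoChunks`, and `RecursiveSummarize` are hypothetical helpers, the `summarizeChunk` delegate stands in for a real model call, and the budget is counted in words rather than tokens.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Whitespace tokenizer; a word count stands in for the model's token budget.
static string[] Words(string text) =>
    text.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

// Split text into consecutive chunks of at most maxWords words.
static List<string> SplitIntoChunks(string text, int maxWords)
{
    string[] words = Words(text);
    var chunks = new List<string>();
    for (int i = 0; i < words.Length; i += maxWords)
        chunks.Add(string.Join(' ', words.Skip(i).Take(maxWords)));
    return chunks;
}

// Summarize each chunk, merge the partial summaries, and repeat until one chunk remains.
static string RecursiveSummarize(string text, int maxWords, Func<string, string> summarizeChunk)
{
    List<string> chunks = SplitIntoChunks(text, maxWords);
    if (chunks.Count <= 1)
        return summarizeChunk(text);

    string merged = string.Join(' ', chunks.Select(summarizeChunk));

    // Guard: if the pass did not shrink the text, fall back to the first chunk
    // (equivalent to truncation) rather than recursing forever.
    if (Words(merged).Length >= Words(text).Length)
        return summarizeChunk(chunks[0]);

    return RecursiveSummarize(merged, maxWords, summarizeChunk);
}

// Hypothetical stand-in for a model call: keeps the first three words of a chunk.
Func<string, string> firstThreeWords = chunk => string.Join(' ', Words(chunk).Take(3));

string longText = string.Join(' ', Enumerable.Range(1, 40).Select(n => $"w{n}"));
string summary = RecursiveSummarize(longText, 10, firstThreeWords);
Console.WriteLine(summary); // w1 w2 w3
```

The same chunking helper is a reasonable starting point if you choose RaiseException and want to decide yourself where a document is split.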
Step 6: Batch Archive Summarization
Process an entire folder of documents and generate a summary catalog:
using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");

Console.WriteLine("=== Batch Archive Summarization ===\n");
string archiveFolder = "documents";
string outputFile = "summary_catalog.csv";

if (!Directory.Exists(archiveFolder))
{
    Console.WriteLine($"Create a '{archiveFolder}' folder with documents, then run again.");
    return;
}

// Scans the top-level folder only; use
// Directory.GetFiles(archiveFolder, "*", SearchOption.AllDirectories) to include subfolders.
string[] supportedExtensions = { ".pdf", ".docx", ".txt", ".md", ".html" };
string[] files = Directory.GetFiles(archiveFolder)
    .Where(f => supportedExtensions.Contains(Path.GetExtension(f).ToLowerInvariant()))
    .OrderBy(f => f)
    .ToArray();
Console.WriteLine($"Found {files.Length} document(s) in '{archiveFolder}'\n");

var catalogLines = new List<string>();
catalogLines.Add("file,title,summary,words");

var summarizeBatch = new Summarizer(model)
{
    MaxContentWords = 100,
    MaxTitleWords = 8,
    GenerateTitle = true,
    GenerateContent = true,
    Intent = Summarizer.SummarizationIntent.Abstraction,
    OverflowStrategy = Summarizer.OverflowResolutionStrategy.RecursiveSummarize
};

int successCount = 0;
int failCount = 0;

foreach (string filePath in files)
{
    string fileName = Path.GetFileName(filePath);
    Console.Write($" {fileName}... ");
    try
    {
        var attachment = new Attachment(filePath);
        Summarizer.SummarizerResult r = summarizeBatch.Summarize(attachment);

        // Escape quotes and flatten every kind of line break (\r\n, \r, \n)
        // so each catalog row stays on a single CSV line.
        string safeName = fileName.Replace("\"", "\"\"");
        string title = (r.Title ?? "").Replace("\"", "\"\"").ReplaceLineEndings(" ");
        string content = (r.Summary ?? "").Replace("\"", "\"\"").ReplaceLineEndings(" ");
        int wordCount = content.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;
        catalogLines.Add($"\"{safeName}\",\"{title}\",\"{content}\",{wordCount}");

        Console.ForegroundColor = ConsoleColor.Green;
        Console.WriteLine($"[{title}] ({wordCount} words)");
        Console.ResetColor();
        successCount++;
    }
    catch (Exception ex)
    {
        Console.ForegroundColor = ConsoleColor.Red;
        Console.WriteLine($"FAILED: {ex.Message}");
        Console.ResetColor();
        failCount++;
    }
}

File.WriteAllLines(outputFile, catalogLines);

Console.WriteLine($"\n=== Batch Summary ===");
Console.WriteLine($" Succeeded: {successCount}");
Console.WriteLine($" Failed: {failCount}");
Console.WriteLine($" Catalog: {Path.GetFullPath(outputFile)}");
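The catalog rows in this step are built by escaping each field inline. As a sketch, the same quoting rules can live in one helper so that every field, including the file name, is treated alike. `CsvField` and `CsvRow` are hypothetical names introduced here, not LM-Kit APIs; the quoting follows the usual CSV convention (RFC 4180) of wrapping each field in double quotes and doubling any embedded quote, and line breaks are flattened to spaces so each row stays on one line.

```csharp
using System;
using System.Linq;

// Quote one CSV field: flatten line breaks to spaces, double embedded quotes, wrap in quotes.
static string CsvField(string value)
{
    string text = (value ?? "")
        .Replace("\r\n", " ")
        .Replace('\r', ' ')
        .Replace('\n', ' ')
        .Replace("\"", "\"\"");
    return $"\"{text}\"";
}

// Join already-quoted fields into one catalog row.
static string CsvRow(params string[] fields) =>
    string.Join(",", fields.Select(f => CsvField(f)));

string row = CsvRow("report \"final\".pdf", "Q3 Review", "Revenue grew 15%.\r\nChurn fell 8%.");
Console.WriteLine(row);
// "report ""final"".pdf","Q3 Review","Revenue grew 15%. Churn fell 8%."
```

Quoting every field, even ones that need no escaping, keeps the writer simple and is accepted by common spreadsheet and CSV tools.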
Step 7: Domain-Specific Summarization with Guidance
The Guidance property customizes summarization for specific domains:
using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Create the summarizer
// ──────────────────────────────────────
var summarizer = new Summarizer(model)
{
    MaxContentWords = 150,
    MaxTitleWords = 10,
    GenerateTitle = true,
    GenerateContent = true,
    Intent = Summarizer.SummarizationIntent.Abstraction
};

// The assignments below are alternatives: keep the one that matches your
// document type. As written, the last assignment wins.

// Legal document summaries
summarizer.Guidance = "Focus on parties involved, key obligations, dates, " +
    "financial terms, and termination conditions. " +
    "Flag any unusual clauses or risk indicators.";

// Technical report summaries
summarizer.Guidance = "Focus on methodology, key findings, quantitative results, " +
    "and actionable recommendations. Include specific numbers and metrics.";

// Meeting minutes summaries
summarizer.Guidance = "Focus on decisions made, action items assigned (with owners and deadlines), " +
    "and unresolved issues. Skip routine status updates.";

// Financial document summaries
summarizer.Guidance = "Focus on revenue, expenses, profit margins, year-over-year changes, " +
    "and forward guidance. Include all specific dollar amounts mentioned.";
Step 8: Multilingual Summarization
Generate summaries in a specific language, regardless of the source document language:
using LMKit.TextGeneration;

// Assumes the summarizer created in the previous steps. germanDocument and
// englishDocument are placeholders for your own source documents.

// Summarize a German document into English
summarizer.TargetLanguage = Language.English;
var result = summarizer.Summarize(germanDocument);

// Summarize any document into French
summarizer.TargetLanguage = Language.French;
var frResult = summarizer.Summarize(englishDocument);

// Auto-detect: summarize in the same language as the source (default)
summarizer.TargetLanguage = Language.Undefined;
Model Selection
| Model ID | VRAM | Speed | Quality | Best For |
|---|---|---|---|---|
| gemma3:4b | ~3.5 GB | Fast | Good | High-volume batch processing |
| qwen3:4b | ~3.5 GB | Fast | Good | Multilingual document archives |
| qwen3:8b | ~6 GB | Moderate | Very good | Technical and legal documents |
| gemma3:12b | ~8 GB | Slower | Excellent | Complex documents requiring nuanced understanding |
For batch archive processing, gemma3:4b provides the best throughput. For legal or technical documents where precision matters, upgrade to qwen3:8b or larger.
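If the pipeline runs on machines with different GPUs, the table can be turned into a small lookup. The sketch below is only an illustration: `PickModel` is a hypothetical helper, the VRAM figures are the approximate values from the table, and detecting the available VRAM is left to the caller.

```csharp
using System;
using System.Linq;

// (model ID, approximate VRAM in GB) copied from the table above.
// gemma3:4b is listed before qwen3:4b so it wins the tie at 3.5 GB.
var models = new (string Id, double VramGb)[]
{
    ("gemma3:4b", 3.5),
    ("qwen3:4b", 3.5),
    ("qwen3:8b", 6.0),
    ("gemma3:12b", 8.0),
};

// Return the largest listed model that fits the given VRAM, or null when none fits.
string PickModel(double availableVramGb) =>
    models.Where(m => m.VramGb <= availableVramGb)
          .OrderByDescending(m => m.VramGb) // stable sort keeps table order on ties
          .Select(m => m.Id)
          .FirstOrDefault();

Console.WriteLine(PickModel(4.0));  // gemma3:4b
Console.WriteLine(PickModel(7.0));  // qwen3:8b
Console.WriteLine(PickModel(2.0) ?? "no listed model fits");
```

The returned ID can then be passed to LM.LoadFromModelID as in the earlier steps. Picking the largest model that fits favors quality; for high-volume batches you may prefer to stay on gemma3:4b regardless of headroom.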
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| Summary too short or vague | MaxContentWords too low | Increase to 200+ for detailed summaries |
| Title not generated | GenerateTitle is false | Set GenerateTitle = true |
| Out of memory on large PDFs | Document exceeds context window with RaiseException strategy | Switch to RecursiveSummarize strategy |
| Summaries miss second half of document | Using Truncate strategy on long documents | Switch to RecursiveSummarize for full coverage |
| Wrong language in output | TargetLanguage not set | Set TargetLanguage to the desired output language |
| Slow on large batches | Sequential processing | Normal behavior; use a smaller model for throughput |
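Because large batches run sequentially and can take a long time, it can help to make a run resumable so an interrupted job does not start over. The sketch below is one possible approach, not part of LM-Kit: `FirstField` and `LoadProcessed` are hypothetical helpers that read the file names already present in a catalog with the `file,title,summary,words` layout used in Step 6, assuming each row occupies a single line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

// Extract the first CSV field of a row; inside quotes, "" is an escaped quote.
static string FirstField(string row)
{
    if (row.Length == 0 || row[0] != '"') return row.Split(',')[0];
    var sb = new StringBuilder();
    for (int i = 1; i < row.Length; i++)
    {
        if (row[i] == '"')
        {
            if (i + 1 < row.Length && row[i + 1] == '"') { sb.Append('"'); i++; }
            else break; // closing quote of the field
        }
        else sb.Append(row[i]);
    }
    return sb.ToString();
}

// Collect file names already present in the catalog (header row skipped).
static HashSet<string> LoadProcessed(string catalogPath) =>
    File.Exists(catalogPath)
        ? File.ReadLines(catalogPath).Skip(1).Select(r => FirstField(r))
              .ToHashSet(StringComparer.OrdinalIgnoreCase)
        : new HashSet<string>(StringComparer.OrdinalIgnoreCase);

// Demo with a throwaway catalog containing one finished document.
string catalog = Path.Combine(Path.GetTempPath(), $"catalog_{Guid.NewGuid():N}.csv");
File.WriteAllLines(catalog, new[]
{
    "file,title,summary,words",
    "\"a.pdf\",\"Title A\",\"Summary A\",2",
});

HashSet<string> done = LoadProcessed(catalog);
string[] pending = new[] { "a.pdf", "b.pdf" }.Where(f => !done.Contains(f)).ToArray();
Console.WriteLine(string.Join(", ", pending)); // b.pdf
File.Delete(catalog);
```

To use this in the batch loop, filter the file list against the returned set and append new rows with File.AppendAllLines instead of rewriting the whole catalog at the end.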
Next Steps
- Summarize Documents and Text: core summarization guide with additional options.
- Chat with PDF Documents: interactive Q&A instead of one-shot summarization.
- Build a Multi-Format Document Ingestion Pipeline: ingest and index documents for RAG.
- Classify Documents with Custom Categories: classify documents before summarization.