Automatically Split Multi-Document PDFs with AI Vision
Scanned PDF batches often contain multiple unrelated documents stapled together: an invoice followed by a contract, a pay slip next to an ID card. Manually sorting these pages is tedious and error-prone. LM-Kit.NET's DocumentSplitting class uses a vision language model to detect where one document ends and another begins, returning page ranges and labels for each detected segment. This tutorial builds a document splitter that processes multi-document PDFs and identifies logical boundaries.
Why Local Document Splitting Matters
On-device document splitting solves two common enterprise problems:
- Mailroom and scanning automation. Large organizations scan incoming mail in bulk, producing multi-document PDFs. A splitting pipeline identifies individual documents (invoices, letters, forms, ID cards) so each can be routed to the correct department or extraction pipeline without manual page-by-page sorting.
- Compliance and record management. Regulatory filings often arrive as combined PDFs. Splitting them into individual documents allows automated classification, archival, and audit trail creation while keeping sensitive content on-premises.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0 |
| VRAM | 4 GB (vision model required) |
Step 1: Create the Project
dotnet new console -n DocumentSplitter
cd DocumentSplitter
dotnet add package LM-Kit.NET
Step 2: Split a Multi-Document PDF
using System.Text;
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load a vision model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3-vl:8b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Create the splitter
// ──────────────────────────────────────
var splitter = new DocumentSplitting(model);
// ──────────────────────────────────────
// 3. Analyze the PDF
// ──────────────────────────────────────
string pdfPath = "scanned_batch.pdf";
var attachment = new Attachment(pdfPath);
Console.WriteLine($"Analyzing {attachment.PageCount} pages...\n");
DocumentSplittingResult result = splitter.Split(attachment);
// ──────────────────────────────────────
// 4. Display results
// ──────────────────────────────────────
Console.WriteLine($"Documents found: {result.DocumentCount}");
Console.WriteLine($"Multiple documents: {result.ContainsMultipleDocuments}");
Console.WriteLine($"Confidence: {result.Confidence:P0}\n");
foreach (DocumentSegment segment in result.Segments)
{
    Console.WriteLine($" {segment}");
}
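The confidence value printed above can also decide whether a result is safe to automate. The following is a minimal sketch, assuming the score lies between 0 and 1 (as the `P0` percentage formatting suggests). The helper name `NeedsReview` and the 0.80 threshold are hypothetical illustrations, not part of LM-Kit.NET.

```csharp
using System;

// Hypothetical helper: decide whether a split result should go to a human reviewer.
// Assumes confidence is a score in [0, 1]; the 0.80 default is an arbitrary example value.
bool NeedsReview(double confidence, int documentCount, double threshold = 0.80)
{
    // Nothing detected, or the model is unsure: do not automate.
    return documentCount == 0 || confidence < threshold;
}

Console.WriteLine(NeedsReview(0.95, 3)); // False
Console.WriteLine(NeedsReview(0.55, 3)); // True
```

In the program above, the call would presumably look like `NeedsReview(result.Confidence, result.DocumentCount)`. Tune the threshold against a sample of your own scans before relying on it.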
Step 3: Use Guidance for Better Accuracy
If you know what types of documents the file contains, set the Guidance property when creating the splitter (this replaces the plain constructor call from Step 2):
var splitter = new DocumentSplitting(model)
{
    Guidance = "The file contains a mix of invoices, contracts, and receipts."
};
DocumentSplittingResult result = splitter.Split(new Attachment("mixed_batch.pdf"));
Step 4: Physically Split the PDF into Separate Files
After detecting document boundaries, use PdfSplitter to extract each segment into a separate PDF file. The snippet reuses the splitter created in the earlier steps:
using LMKit.Document.Pdf;
var attachment = new Attachment("scanned_batch.pdf");
DocumentSplittingResult result = splitter.Split(attachment);
if (result.ContainsMultipleDocuments)
{
    // Split into separate files based on detected boundaries
    List<string> outputFiles = PdfSplitter.SplitToFiles(
        attachment,
        result,
        outputDirectory: "split_output",
        fileNamePrefix: "document");
    foreach (string file in outputFiles)
    {
        Console.WriteLine($"Created: {file}");
    }
}
Step 5: Process Each Detected Document
After splitting, route each segment to downstream processing:
DocumentSplittingResult result = splitter.Split(new Attachment("scanned_batch.pdf"));
foreach (DocumentSegment segment in result.Segments)
{
    Console.WriteLine($"Processing: {segment}");
    Console.WriteLine($" Pages {segment.StartPage} to {segment.EndPage} ({segment.PageCount} pages)");
    Console.WriteLine($" Label: {segment.Label}");
    // Route to specific extraction pipeline based on label
    // For example: extract invoice fields, archive contracts, etc.
}
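The routing comment in the loop above can be made concrete with a small keyword lookup on the segment label. This is only a sketch: the labels a model returns are free text and may vary, and the keywords, queue names, and the `RouteFor` helper are hypothetical placeholders rather than LM-Kit.NET features.

```csharp
using System;

// Hypothetical routing table: keyword found in the label -> downstream queue name.
// Both the keywords and the queue names are illustrative placeholders.
var routes = new (string Keyword, string Queue)[]
{
    ("invoice", "accounts-payable"),
    ("contract", "legal-archive"),
    ("receipt", "expenses"),
};

// Returns the queue for a label, or "manual-review" when nothing matches.
string RouteFor(string? label)
{
    if (!string.IsNullOrWhiteSpace(label))
    {
        foreach (var (keyword, queue) in routes)
        {
            // Case-insensitive substring match tolerates labels like "Invoice #2024-017".
            if (label.Contains(keyword, StringComparison.OrdinalIgnoreCase))
                return queue;
        }
    }
    return "manual-review";
}

Console.WriteLine(RouteFor("Invoice #2024-017"));   // accounts-payable
Console.WriteLine(RouteFor("Employment Contract")); // legal-archive
Console.WriteLine(RouteFor("ID card"));             // manual-review
```

Inside the loop, `RouteFor(segment.Label)` would pick the destination for each segment. Falling back to a manual-review queue keeps unexpected labels from being silently dropped.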
Step 6: Async Processing
For UI applications or web services, use the async API:
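// cancellationToken comes from the caller, e.g. a request-abort token or a CancellationTokenSource.
DocumentSplittingResult result = await splitter.SplitAsync(
    new Attachment("large_batch.pdf"),
    cancellationToken);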
Console.WriteLine($"Found {result.DocumentCount} documents");
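One way to obtain a cancellation token is a `CancellationTokenSource` with a timeout, so that a very large batch cannot run indefinitely. The sketch below uses only standard .NET types. `WithTimeoutAsync` and `FakeSplitAsync` are hypothetical names, and the fake method merely stands in for the real split call so the example runs without a model.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: run cancellable work with an upper time limit.
async Task<T> WithTimeoutAsync<T>(Func<CancellationToken, Task<T>> work, TimeSpan timeout)
{
    // The source cancels its token automatically once the timeout elapses.
    using var cts = new CancellationTokenSource(timeout);
    return await work(cts.Token);
}

// Stand-in for a long-running split: waits, then reports a document count.
async Task<int> FakeSplitAsync(TimeSpan duration, CancellationToken ct)
{
    await Task.Delay(duration, ct); // throws TaskCanceledException if ct fires first
    return 3;
}

// Finishes well within the limit.
int count = await WithTimeoutAsync(
    ct => FakeSplitAsync(TimeSpan.FromMilliseconds(10), ct),
    TimeSpan.FromSeconds(5));
Console.WriteLine($"Completed: {count} documents");

// Exceeds the limit and is cancelled.
try
{
    await WithTimeoutAsync(
        ct => FakeSplitAsync(TimeSpan.FromSeconds(5), ct),
        TimeSpan.FromMilliseconds(50));
}
catch (OperationCanceledException)
{
    Console.WriteLine("Timed out");
}
```

With the tutorial's API, the call would presumably be `WithTimeoutAsync(ct => splitter.SplitAsync(attachment, ct), TimeSpan.FromMinutes(5))`. How quickly the split stops after cancellation is not covered here, so treat the timeout as a request rather than a hard limit.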
Step 7: Batch Processing
Process a folder of multi-document PDFs:
string[] pdfFiles = Directory.GetFiles("inbox", "*.pdf");
Console.WriteLine($"Processing {pdfFiles.Length} files...\n");
foreach (string file in pdfFiles)
{
    string fileName = Path.GetFileName(file);
    var attachment = new Attachment(file);
    DocumentSplittingResult result = splitter.Split(attachment);
    Console.WriteLine($"{fileName}: {result.DocumentCount} document(s)");
    foreach (DocumentSegment segment in result.Segments)
    {
        Console.WriteLine($" {segment}");
    }
    Console.WriteLine();
}
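In an unattended batch, a single corrupt or unreadable file would stop the loop above with an exception. A possible pattern, sketched here with a fake splitter so it runs on its own, is to isolate each file in a try/catch and report failures at the end. `ProcessAll` and the `splitOne` delegate are hypothetical names, and which exceptions the library actually throws is not covered here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: apply splitOne to every file, collecting failures instead of aborting.
// splitOne stands in for the real split call and returns the number of documents found.
List<(string File, string Error)> ProcessAll(IEnumerable<string> files, Func<string, int> splitOne)
{
    var failures = new List<(string File, string Error)>();
    foreach (string file in files)
    {
        try
        {
            int documents = splitOne(file);
            Console.WriteLine($"{file}: {documents} document(s)");
        }
        catch (Exception ex)
        {
            // Record the problem and continue with the next file.
            failures.Add((file, ex.Message));
        }
    }
    return failures;
}

// Demo with a fake splitter that rejects one file.
var failed = ProcessAll(
    new[] { "a.pdf", "corrupt.pdf", "b.pdf" },
    path => path.StartsWith("corrupt") ? throw new InvalidOperationException("unreadable PDF") : 2);

foreach (var (name, error) in failed)
    Console.WriteLine($"FAILED {name}: {error}");
```

In the batch loop, the delegate would wrap the real call, for example `path => splitter.Split(new Attachment(path)).DocumentCount`. Catching every exception type is a deliberate simplification; production code would usually let cancellation and out-of-memory conditions propagate.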
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| All pages grouped as one document | Pages are a continuation of the same document | This is correct behavior when pages share headers, page numbering (1/5, 2/5), or layout |
| Too many segments (every page separate) | Model treating related pages as distinct | Add Guidance describing the expected document types |
| Low confidence | Complex or poor-quality scans | Use a larger vision model for better accuracy |
| Single-page PDF returns one segment | Only one page in file | Expected behavior; ContainsMultipleDocuments will be false |
Agent-Based PDF Splitting with Built-In Tools
If you are building an AI agent that needs to split PDFs as part of a larger workflow, LM-Kit.NET provides a built-in PdfSplitTool that agents can call autonomously. Combined with other document tools, this enables end-to-end document processing:
using LMKit.Agents;
using LMKit.Agents.Tools.BuiltIn;
var agent = Agent.CreateBuilder(model)
    .WithPersona("Document Processing Agent")
    .WithTools(tools =>
    {
        tools.Register(BuiltInTools.PdfSplit); // Split PDFs by page ranges
        tools.Register(BuiltInTools.PdfMerge); // Merge multiple PDFs
        tools.Register(BuiltInTools.PdfInfo); // Get page count and metadata
        tools.Register(BuiltInTools.DocumentText); // Extract text content
    })
    .Build();
var result = await agent.RunAsync(
    "Extract pages 1-3 from 'batch_scan.pdf' into 'invoice.pdf', " +
    "then extract pages 4-8 into 'contract.pdf'.");
See Equip an Agent with Built-In Tools for the complete Document tools reference.
Next Steps
- Build a Classification and Extraction Pipeline: classify split documents and extract structured data from each.
- Extract Invoice Data from PDFs and Images: extract structured fields from individual documents after splitting.
- Convert Documents to Markdown with VLM OCR: convert split pages to text for further processing.
- Equip an Agent with Built-In Tools: use PdfSplit, PdfMerge, and other document tools in agent workflows.