Automatically Split Multi-Document PDFs with AI Vision

Scanned PDF batches often contain multiple unrelated documents stapled together: an invoice followed by a contract, a pay slip next to an ID card. Manually sorting these pages is tedious and error-prone. LM-Kit.NET's DocumentSplitting class uses a vision language model to detect where one document ends and another begins, returning page ranges and labels for each detected segment. This tutorial builds a document splitter that processes multi-document PDFs and identifies logical boundaries.


Why Local Document Splitting Matters

Two enterprise problems that on-device document splitting solves:

  1. Mailroom and scanning automation. Large organizations scan incoming mail in bulk, producing multi-document PDFs. A splitting pipeline identifies individual documents (invoices, letters, forms, ID cards) so each can be routed to the correct department or extraction pipeline without manual page-by-page sorting.
  2. Compliance and record management. Regulatory filings often arrive as combined PDFs. Splitting them into individual documents allows automated classification, archival, and audit trail creation while keeping sensitive content on-premises.

Prerequisites

| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 4+ GB (vision model required) |

Step 1: Create the Project

dotnet new console -n DocumentSplitter
cd DocumentSplitter
dotnet add package LM-Kit.NET

Step 2: Split a Multi-Document PDF

using System.Text;
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load a vision model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3-vl:8b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Create the splitter
// ──────────────────────────────────────
var splitter = new DocumentSplitting(model);

// ──────────────────────────────────────
// 3. Analyze the PDF
// ──────────────────────────────────────
string pdfPath = "scanned_batch.pdf";
var attachment = new Attachment(pdfPath);

Console.WriteLine($"Analyzing {attachment.PageCount} pages...\n");

DocumentSplittingResult result = splitter.Split(attachment);

// ──────────────────────────────────────
// 4. Display results
// ──────────────────────────────────────
Console.WriteLine($"Documents found: {result.DocumentCount}");
Console.WriteLine($"Multiple documents: {result.ContainsMultipleDocuments}");
Console.WriteLine($"Confidence: {result.Confidence:P0}\n");

foreach (DocumentSegment segment in result.Segments)
{
    Console.WriteLine($"  {segment}");
}

Step 3: Use Guidance for Better Accuracy

If you know what types of documents the file contains, provide guidance to help the model:

var splitter = new DocumentSplitting(model)
{
    Guidance = "The file contains a mix of invoices, contracts, and receipts."
};

DocumentSplittingResult result = splitter.Split(new Attachment("mixed_batch.pdf"));

Step 4: Physically Split the PDF into Separate Files

After detecting document boundaries, use PdfSplitter to extract each segment into a separate PDF file:

using LMKit.Document.Pdf;

var attachment = new Attachment("scanned_batch.pdf");
DocumentSplittingResult result = splitter.Split(attachment);

if (result.ContainsMultipleDocuments)
{
    // Split into separate files based on detected boundaries
    List<string> outputFiles = PdfSplitter.SplitToFiles(
        attachment,
        result,
        outputDirectory: "split_output",
        fileNamePrefix: "document");

    foreach (string file in outputFiles)
    {
        Console.WriteLine($"Created: {file}");
    }
}

Step 5: Process Each Detected Document

After splitting, route each segment to downstream processing:

DocumentSplittingResult result = splitter.Split(new Attachment("scanned_batch.pdf"));

foreach (DocumentSegment segment in result.Segments)
{
    Console.WriteLine($"Processing: {segment}");
    Console.WriteLine($"  Pages {segment.StartPage} to {segment.EndPage} ({segment.PageCount} pages)");
    Console.WriteLine($"  Label: {segment.Label}");

    // Route to specific extraction pipeline based on label
    // For example: extract invoice fields, archive contracts, etc.
}
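
One way to fill in the routing step is a small keyword match on the label. The sketch below is self-contained and does not call LM-Kit.NET: the tuple list stands in for the values you would read from segment.Label, segment.StartPage, and segment.EndPage, and the keywords, pipeline names, and the ChoosePipeline helper are all hypothetical. The labels the model actually produces may differ, so treat the keyword list as something to tune against your own batches.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for detected segments: (label, start page, end page).
// In the tutorial these values would come from each DocumentSegment.
var segments = new List<(string Label, int StartPage, int EndPage)>
{
    ("Invoice", 1, 3),
    ("Employment contract", 4, 8),
    ("ID card", 9, 9),
};

// Map a free-form label to a pipeline name using case-insensitive keyword matching.
// Keywords and pipeline names are illustrative assumptions.
static string ChoosePipeline(string label)
{
    if (label.Contains("invoice", StringComparison.OrdinalIgnoreCase)) return "invoice-extraction";
    if (label.Contains("contract", StringComparison.OrdinalIgnoreCase)) return "contract-archive";
    return "manual-review"; // fallback for labels we do not recognize
}

foreach (var (label, start, end) in segments)
{
    Console.WriteLine($"{label} (pages {start}-{end}) -> {ChoosePipeline(label)}");
}
```

Keeping an explicit fallback means an unexpected label sends the segment to a person instead of silently dropping it.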

Step 6: Async Processing

For UI applications or web services, use the async API:

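// cancellationToken is supplied by the caller, for example from a
// CancellationTokenSource in a UI handler or the request token in a web service.
DocumentSplittingResult result = await splitter.SplitAsync(
    new Attachment("large_batch.pdf"),
    cancellationToken);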

Console.WriteLine($"Found {result.DocumentCount} documents");
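
If you do not already have a token, a CancellationTokenSource with a timeout is one common way to create one. The sketch below is a generic helper, not part of LM-Kit.NET: TryRunWithTimeoutAsync is a hypothetical name, and Task.Delay stands in for the split call so the example runs on its own. It assumes the operation signals cancellation by throwing OperationCanceledException, which is the usual .NET convention; whether SplitAsync does so is not confirmed here.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Run a cancellable operation with a time limit.
// Returns true if it finished, false if the timeout cancelled it first.
static async Task<bool> TryRunWithTimeoutAsync(
    Func<CancellationToken, Task> operation, TimeSpan timeout)
{
    using var cts = new CancellationTokenSource(timeout);
    try
    {
        await operation(cts.Token);
        return true;
    }
    catch (OperationCanceledException)
    {
        return false;
    }
}

int documents = 0;

// Stand-in for a fast split: finishes well inside the limit.
bool fast = await TryRunWithTimeoutAsync(
    async ct => { await Task.Delay(10, ct); documents = 3; },
    TimeSpan.FromSeconds(5));

// Stand-in for a slow split: the 50 ms limit cancels the 5 s delay.
bool slow = await TryRunWithTimeoutAsync(
    async ct => { await Task.Delay(5000, ct); },
    TimeSpan.FromMilliseconds(50));

Console.WriteLine($"fast finished: {fast} ({documents} documents), slow finished: {slow}");
```

With the real API, the lambda would look something like async ct => result = await splitter.SplitAsync(attachment, ct).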

Step 7: Batch Processing

Process a folder of multi-document PDFs:

string[] pdfFiles = Directory.GetFiles("inbox", "*.pdf");

Console.WriteLine($"Processing {pdfFiles.Length} files...\n");

foreach (string file in pdfFiles)
{
    string fileName = Path.GetFileName(file);
    var attachment = new Attachment(file);

    DocumentSplittingResult result = splitter.Split(attachment);

    Console.WriteLine($"{fileName}: {result.DocumentCount} document(s)");

    foreach (DocumentSegment segment in result.Segments)
    {
        Console.WriteLine($"  {segment}");
    }

    Console.WriteLine();
}
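
In a real inbox, one unreadable file should not stop the whole run. The loop above does not show error handling, so here is a minimal sketch of one way to isolate failures per file. ProcessAll is a hypothetical helper, and the stand-in processor simply throws for one made-up file name; in practice the delegate would wrap the splitter.Split call.

```csharp
using System;
using System.Collections.Generic;

// Apply `process` to every item and record failures instead of aborting the batch.
static (List<string> Succeeded, List<(string Item, string Error)> Failed) ProcessAll(
    IEnumerable<string> items, Action<string> process)
{
    var succeeded = new List<string>();
    var failed = new List<(string Item, string Error)>();
    foreach (string item in items)
    {
        try
        {
            process(item);
            succeeded.Add(item);
        }
        catch (Exception ex)
        {
            failed.Add((item, ex.Message));
        }
    }
    return (succeeded, failed);
}

// Demo with a stand-in processor that rejects one hypothetical file name.
var outcome = ProcessAll(
    new[] { "a.pdf", "corrupt.pdf", "b.pdf" },
    file =>
    {
        if (file.StartsWith("corrupt", StringComparison.Ordinal))
            throw new InvalidOperationException("unreadable");
    });

Console.WriteLine($"{outcome.Succeeded.Count} succeeded, {outcome.Failed.Count} failed");
foreach (var (item, error) in outcome.Failed)
{
    Console.WriteLine($"  {item}: {error}");
}
```

Collecting the failures in a list makes it easy to retry them later or move the files to a quarantine folder.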

Common Issues

| Problem | Cause | Fix |
|---|---|---|
| All pages grouped as one document | Pages are a continuation of the same document | Correct behavior if pages share headers, page numbers (1/5, 2/5), or layout |
| Too many segments (every page separate) | Model treats related pages as distinct | Add Guidance describing the expected document types |
| Low confidence | Complex or poor-quality scans | Use a larger vision model for better accuracy |
| Single-page PDF returns one segment | Only one page in file | Expected behavior; ContainsMultipleDocuments will be false |
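
Some of these issues can be caught automatically before results reach downstream pipelines. The following is a rough sketch of such a check, not an LM-Kit.NET feature. Triage is a hypothetical helper, and the 0.80 threshold is an arbitrary starting point, not a library default. It assumes the confidence is a fraction between 0 and 1, which matches the P0 percent formatting used in Step 2. The inputs would come from result.Confidence, result.DocumentCount, and attachment.PageCount.

```csharp
using System;

// Decide whether a split result can be accepted without human review.
// minConfidence is an illustrative threshold, not a library default.
static string Triage(double confidence, int documentCount, int pageCount, double minConfidence = 0.80)
{
    if (confidence < minConfidence) return "review: low confidence";

    // One segment per page in a multi-page file may indicate over-segmentation.
    if (pageCount > 1 && documentCount == pageCount) return "review: possible over-segmentation";

    return "accept";
}

Console.WriteLine(Triage(0.95, 3, 9));
Console.WriteLine(Triage(0.55, 3, 9));
Console.WriteLine(Triage(0.90, 9, 9));
```

A batch of genuine single-page documents would also trip the second rule, so it is a prompt for a quick look, not proof of an error.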

Agent-Based PDF Splitting with Built-In Tools

If you are building an AI agent that needs to split PDFs as part of a larger workflow, LM-Kit.NET provides a built-in PdfSplitTool that agents can call autonomously. Combined with other document tools, this enables end-to-end document processing:

using LMKit.Agents;
using LMKit.Agents.Tools.BuiltIn;

var agent = Agent.CreateBuilder(model)
    .WithPersona("Document Processing Agent")
    .WithTools(tools =>
    {
        tools.Register(BuiltInTools.PdfSplit);        // Split PDFs by page ranges
        tools.Register(BuiltInTools.PdfMerge);        // Merge multiple PDFs
        tools.Register(BuiltInTools.PdfInfo);         // Get page count and metadata
        tools.Register(BuiltInTools.DocumentText);    // Extract text content
    })
    .Build();

var result = await agent.RunAsync(
    "Extract pages 1-3 from 'batch_scan.pdf' into 'invoice.pdf', " +
    "then extract pages 4-8 into 'contract.pdf'.");

See "Equip an Agent with Built-In Tools" for the complete Document tools reference.


Next Steps