Automatically Split Multi-Document PDFs with AI Vision

Scanned PDF batches often contain multiple unrelated documents stapled together: an invoice followed by a contract, a pay slip next to an ID card. Manually sorting these pages is tedious and error-prone. LM-Kit.NET's DocumentSplitting class uses a vision language model to detect where one document ends and another begins, returning page ranges and labels for each detected segment. This tutorial builds a document splitter that processes multi-document PDFs and identifies logical boundaries.

Why Local Document Splitting Matters

On-device document splitting solves two common enterprise problems:

Mailroom and scanning automation. Large organizations scan incoming mail in bulk, producing multi-document PDFs. A splitting pipeline identifies individual documents (invoices, letters, forms, ID cards) so each can be routed to the correct department or extraction pipeline without manual page-by-page sorting.
Compliance and record management. Regulatory filings often arrive as combined PDFs. Splitting them into individual documents allows automated classification, archival, and audit trail creation while keeping sensitive content on-premises.

Prerequisites

Requirement	Minimum
.NET SDK	8.0+
VRAM	4+ GB (vision model required)

Step 1: Create the Project

dotnet new console -n DocumentSplitter
cd DocumentSplitter
dotnet add package LM-Kit.NET

Step 2: Split a Multi-Document PDF

using System.Text;
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load a vision model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3-vl:8b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Create the splitter
// ──────────────────────────────────────
var splitter = new DocumentSplitting(model);

// ──────────────────────────────────────
// 3. Analyze the PDF
// ──────────────────────────────────────
string pdfPath = "scanned_batch.pdf";
var attachment = new Attachment(pdfPath);

Console.WriteLine($"Analyzing {attachment.PageCount} pages...\n");

DocumentSplittingResult result = splitter.Split(attachment);

// ──────────────────────────────────────
// 4. Display results
// ──────────────────────────────────────
Console.WriteLine($"Documents found: {result.DocumentCount}");
Console.WriteLine($"Multiple documents: {result.ContainsMultipleDocuments}");
Console.WriteLine($"Confidence: {result.Confidence:P0}\n");

foreach (DocumentSegment segment in result.Segments)
{
    Console.WriteLine($"  {segment}");
}

Step 3: Use Guidance for Better Accuracy

If you know which types of documents the file contains, set the Guidance property to tell the model what to expect:

using System.Text;
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load a vision model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3-vl:8b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

var splitter = new DocumentSplitting(model)
{
    Guidance = "The file contains a mix of invoices, contracts, and receipts."
};

DocumentSplittingResult result = splitter.Split(new Attachment("mixed_batch.pdf"));

Step 4: Physically Split the PDF into Separate Files

After detecting document boundaries, use PdfSplitter to extract each segment into a separate PDF file:

using LMKit.Document.Pdf;

// Reuses the `splitter` instance created in Step 2
var attachment = new Attachment("scanned_batch.pdf");
DocumentSplittingResult result = splitter.Split(attachment);

if (result.ContainsMultipleDocuments)
{
    // Split into separate files based on detected boundaries
    List<string> outputFiles = PdfSplitter.SplitToFiles(
        attachment,
        result,
        outputDirectory: "split_output",
        fileNamePrefix: "document");

    foreach (string file in outputFiles)
    {
        Console.WriteLine($"Created: {file}");
    }
}

Step 5: Process Each Detected Document

After splitting, route each segment to downstream processing:

using System.Text;
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load a vision model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3-vl:8b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Create the splitter
// ──────────────────────────────────────
var splitter = new DocumentSplitting(model);

DocumentSplittingResult result = splitter.Split(new Attachment("scanned_batch.pdf"));

foreach (DocumentSegment segment in result.Segments)
{
    Console.WriteLine($"Processing: {segment}");
    Console.WriteLine($"  Pages {segment.StartPage} to {segment.EndPage} ({segment.PageCount} pages)");
    Console.WriteLine($"  Label: {segment.Label}");

    // Route to specific extraction pipeline based on label
    // For example: extract invoice fields, archive contracts, etc.
}
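The routing comment in the loop above could be fleshed out along these lines. This is a minimal sketch, not part of LM-Kit.NET: the `handlers` table, the `Route` function, and the label strings ("invoice", "contract") are all hypothetical, and the labels a vision model actually returns may be worded differently, so inspect real output before hard-coding keys.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical handlers keyed by label; keys are matched case-insensitively.
var handlers = new Dictionary<string, Action<int, int>>(StringComparer.OrdinalIgnoreCase)
{
    ["invoice"] = (start, end) => Console.WriteLine($"Extract invoice fields from pages {start}-{end}"),
    ["contract"] = (start, end) => Console.WriteLine($"Archive contract on pages {start}-{end}"),
};

// Dispatches one segment. Returns false when no handler matches,
// so the caller can queue the pages for manual review.
bool Route(string label, int startPage, int endPage)
{
    if (label != null && handlers.TryGetValue(label.Trim(), out var handler))
    {
        handler(startPage, endPage);
        return true;
    }

    Console.WriteLine($"No handler for label '{label}'; pages {startPage}-{endPage} need manual review");
    return false;
}

// Stand-in values; in the loop above these would come from
// segment.Label, segment.StartPage and segment.EndPage.
Route("Invoice", 1, 3);
Route("pay slip", 4, 4);
```

Returning a flag rather than throwing keeps unknown labels from interrupting the batch, which matters when the label text is model-generated.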

Step 6: Async Processing

For UI applications or web services, use the async API:

using System.Text;
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load a vision model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3-vl:8b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Create the splitter
// ──────────────────────────────────────
var splitter = new DocumentSplitting(model);

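// A token source lets the caller (a UI button, a request abort) cancel the analysis
using var cts = new System.Threading.CancellationTokenSource();

DocumentSplittingResult result = await splitter.SplitAsync(
    new Attachment("large_batch.pdf"),
    cts.Token);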

Console.WriteLine($"Found {result.DocumentCount} documents");
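In a web service it is often useful to bound how long an analysis may run. The sketch below shows one way to do that with a standard .NET `CancellationTokenSource` that cancels itself after a timeout. `RunWithTimeoutAsync` is a hypothetical helper, not an LM-Kit.NET API, and it assumes the wrapped operation reacts to cancellation by throwing `OperationCanceledException`, which is the usual .NET convention but is not confirmed here for `SplitAsync`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Runs `work` with a token that is cancelled automatically after `timeout`.
// A cancellation caused by that timeout is reported as a TimeoutException.
static async Task<T> RunWithTimeoutAsync<T>(
    Func<CancellationToken, Task<T>> work, TimeSpan timeout)
{
    using var cts = new CancellationTokenSource(timeout);
    try
    {
        return await work(cts.Token);
    }
    catch (OperationCanceledException) when (cts.IsCancellationRequested)
    {
        throw new TimeoutException($"Operation did not finish within {timeout}.");
    }
}

// Stand-in workload; a real call would pass the token to the async API instead.
int value = await RunWithTimeoutAsync(
    async ct =>
    {
        await Task.Delay(10, ct);
        return 42;
    },
    TimeSpan.FromSeconds(5));

Console.WriteLine(value);
```

Under the stated assumption, the splitter call would look like `await RunWithTimeoutAsync(ct => splitter.SplitAsync(attachment, ct), TimeSpan.FromMinutes(5))`.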

Step 7: Batch Processing

Process a folder of multi-document PDFs:

using System.Text;
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load a vision model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3-vl:8b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Create the splitter
// ──────────────────────────────────────
var splitter = new DocumentSplitting(model);

// ──────────────────────────────────────
// 3. Process every PDF in the inbox folder
// ──────────────────────────────────────

string[] pdfFiles = Directory.GetFiles("inbox", "*.pdf");

Console.WriteLine($"Processing {pdfFiles.Length} files...\n");

foreach (string file in pdfFiles)
{
    string fileName = Path.GetFileName(file);
    var attachment = new Attachment(file);

    DocumentSplittingResult result = splitter.Split(attachment);

    Console.WriteLine($"{fileName}: {result.DocumentCount} document(s)");

    foreach (DocumentSegment segment in result.Segments)
    {
        Console.WriteLine($"  {segment}");
    }

    Console.WriteLine();
}
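In a long batch, one unreadable file should probably not abort the whole run. The sketch below isolates failures per file. `ProcessAll` is a hypothetical helper and the file names are stand-ins; in the loop above, the `processFile` callback would wrap the `new Attachment(file)` and `splitter.Split(...)` calls.

```csharp
using System;
using System.Collections.Generic;

// Runs processFile on each path and collects failures instead of
// stopping at the first exception.
static List<(string File, string Error)> ProcessAll(
    IEnumerable<string> files, Action<string> processFile)
{
    var failed = new List<(string File, string Error)>();
    foreach (string file in files)
    {
        try
        {
            processFile(file);
        }
        catch (Exception ex)
        {
            // Record the failure and continue with the next file.
            failed.Add((file, ex.Message));
        }
    }
    return failed;
}

// Demo with a stand-in processor that rejects one file name.
var failures = ProcessAll(
    new[] { "a.pdf", "corrupt.pdf", "b.pdf" },
    path =>
    {
        if (path.StartsWith("corrupt")) throw new InvalidOperationException("unreadable PDF");
        Console.WriteLine($"{path}: ok");
    });

foreach (var (name, error) in failures)
    Console.WriteLine($"{name}: FAILED ({error})");
```

Collecting the failures in a list makes it easy to retry them or move them to a quarantine folder after the batch completes.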

Common Issues

Problem	Cause	Fix
All pages grouped as one document	Pages are a continuation of the same document	Expected behavior when pages share headers, page numbers (1/5, 2/5), or layout
Too many segments (every page separate)	Model treating related pages as distinct	Add `Guidance` describing the expected document types
Low confidence	Complex or poor-quality scans	Use a larger vision model for better accuracy
Single-page PDF returns one segment	Only one page in file	Expected behavior; `ContainsMultipleDocuments` will be `false`
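When diagnosing the issues above, it can help to confirm that the detected page ranges cover the file exactly once. The following is a minimal sketch under stated assumptions: `FindCoverageProblems` is a hypothetical helper, and it assumes page numbers are 1-based with an inclusive end page (adjust `firstPage` if your results are 0-based).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Returns human-readable problems; an empty list means the ranges cover
// pages firstPage..(firstPage + totalPages - 1) exactly once.
static List<string> FindCoverageProblems(
    IEnumerable<(int Start, int End)> ranges, int totalPages, int firstPage = 1)
{
    var problems = new List<string>();
    int expected = firstPage; // next page that should begin a segment

    foreach (var (start, end) in ranges.OrderBy(r => r.Start))
    {
        if (end < start)
        {
            problems.Add($"Range {start}-{end} is reversed.");
            continue;
        }

        if (start > expected)
            problems.Add($"Pages {expected}-{start - 1} are not assigned to any segment.");
        else if (start < expected)
            problems.Add($"Range {start}-{end} overlaps an earlier segment.");

        expected = Math.Max(expected, end + 1);
    }

    int lastPage = firstPage + totalPages - 1;
    if (expected <= lastPage)
        problems.Add($"Pages {expected}-{lastPage} are not assigned to any segment.");
    else if (expected > lastPage + 1)
        problems.Add($"Segments extend past the last page ({lastPage}).");

    return problems;
}

// Stand-in ranges with a gap at page 4.
var issues = FindCoverageProblems(new[] { (1, 3), (5, 8) }, totalPages: 8);
foreach (string issue in issues)
    Console.WriteLine(issue);
```

With a real result, the ranges could be built from `result.Segments.Select(s => (s.StartPage, s.EndPage))` and the total from `attachment.PageCount`, assuming those properties are integers.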

Agent-Based PDF Splitting with Built-In Tools

If you are building an AI agent that needs to split PDFs as part of a larger workflow, LM-Kit.NET provides a built-in PdfSplitTool that agents can call autonomously. Combined with other document tools, this enables end-to-end document processing:

using LMKit.Agents;
using LMKit.Agents.Tools.BuiltIn;

// `model` is the LM instance loaded as in Step 2
var agent = Agent.CreateBuilder(model)
    .WithPersona("Document Processing Agent")
    .WithTools(tools =>
    {
        tools.Register(BuiltInTools.PdfSplit);        // Split PDFs by page ranges
        tools.Register(BuiltInTools.PdfMerge);        // Merge multiple PDFs
        tools.Register(BuiltInTools.PdfMetadata);      // Get page count and metadata
        tools.Register(BuiltInTools.DocumentTextExtract);    // Extract text content
    })
    .Build();

var result = await agent.RunAsync(
    "Extract pages 1-3 from 'batch_scan.pdf' into 'invoice.pdf', " +
    "then extract pages 4-8 into 'contract.pdf'.");

See Equip an Agent with Built-In Tools for the complete Document tools reference.

Next Steps

Build a Classification and Extraction Pipeline: classify split documents and extract structured data from each.
Extract Invoice Data from PDFs and Images: extract structured fields from individual documents after splitting.
Convert Documents to Markdown with VLM OCR: convert split pages to text for further processing.
Equip an Agent with Built-In Tools: use PdfSplit, PdfMerge, and other document tools in agent workflows.
