👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/document_splitting

Intelligent Document Splitting for C# .NET Applications


🎯 Purpose of the Sample

The Document Splitting demo shows how to use the LM-Kit.NET SDK to automatically detect logical document boundaries within a multi-page PDF and split it into individual document segments, each with a page range and a descriptive label.

It uses a vision-language model (VLM) to analyze each page and determine where one document ends and another begins, without requiring manual page-by-page review.


👥 Industry Target Audience

This demo is useful for developers and businesses involved in:

  • Mailroom and scanning automation: process bulk-scanned mail containing invoices, letters, forms, and ID cards mixed together.
  • Insurance and financial services: separate combined claim documents, policy forms, and supporting documentation.
  • Legal and compliance: split multi-document filings into individual contracts, exhibits, and correspondence.
  • Healthcare administration: separate patient intake forms, insurance cards, and medical records from combined scans.

🚀 Problem Solved

Multi-page PDFs produced by bulk scanning often contain multiple unrelated documents. Sorting them manually is slow and error-prone.

This demo automates the workflow:

  • A vision model analyzes each page's layout, headers, and content
  • Logical document boundaries are detected automatically
  • Each detected segment is returned with its page range and a descriptive label

💻 Sample Application Description

The Document Splitting demo is a console app that:

  • Lets you select a vision-language model (or enter a custom model URI)
  • Downloads and loads the model with progress feedback
  • Prompts you for a PDF file path
  • Analyzes the PDF and detects document boundaries
  • Prints each detected segment with page ranges and labels

✨ Key Features

  • Vision-based analysis: uses a VLM to understand page layouts and detect document types.
  • Automatic boundary detection: identifies where one document ends and another begins.
  • Descriptive labels: each segment gets a label like "Invoice", "National ID Card", "Pay Slip".
  • Physical PDF splitting: optionally export each detected segment as a separate PDF file using PdfSplitter.
  • Optional guidance: provide hints about expected document types for improved accuracy.
  • Sync and async API: both Split and SplitAsync methods are available.

💻 Minimal Integration Snippet

using System;
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;

// Load a vision model (8B+ recommended)
using LM model = LM.LoadFromModelID("qwen3-vl:8b");

// Create splitter
var splitter = new DocumentSplitting(model);

// Analyze PDF
DocumentSplittingResult result = splitter.Split(new Attachment("batch_scan.pdf"));

// Process results
Console.WriteLine($"Documents found: {result.DocumentCount}");

foreach (DocumentSegment segment in result.Segments)
{
    Console.WriteLine($"  Pages {segment.StartPage}-{segment.EndPage}: {segment.Label}");
}

🛠️ Getting Started

📋 Prerequisites

  • .NET 8.0 or later

📥 Download the Project

▶️ Running the Application

  1. Clone the repository:
     git clone https://github.com/LM-Kit/lm-kit-net-samples
  2. Navigate to the project directory:
     cd lm-kit-net-samples/console_net/document_splitting
  3. Build and run the application:
     dotnet build
     dotnet run
  4. Follow the on-screen prompts to select a model and provide a multi-page PDF path.

💡 Example Usage

  1. Select a vision-language model: choose a model from the list (8B+ recommended for best accuracy).

  2. Provide a PDF file: enter the path to a multi-page PDF containing mixed documents.

  3. Review results:

    • number of detected documents
    • page ranges for each segment
    • descriptive labels assigned by the model
    • overall confidence score

  4. Process more files: press any key to continue and run another PDF.


🔍 Notes on Key Types

  • DocumentSplitting (LMKit.Extraction): main class that drives the splitting process. Requires a vision-capable model.

  • DocumentSplittingResult (LMKit.Extraction): result container with Segments, DocumentCount, ContainsMultipleDocuments, and Confidence.

  • DocumentSegment (LMKit.Extraction): a single detected document with StartPage, EndPage, PageCount, and Label.

  • PdfSplitter (LMKit.Document.Pdf): physically splits a PDF into separate files based on detected segments. Use PdfSplitter.SplitToFiles(attachment, result, outputDir, prefix) to export each segment as a separate PDF file.

  • Attachment (LMKit.Data): wraps PDF files for input. PageCount exposes the total number of pages.
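
The page ranges carried by each segment lend themselves to a quick sanity check before any downstream processing. The sketch below is not part of LM-Kit.NET: CoversAllPages is a hypothetical helper, and it assumes that page numbers are 1-based and inclusive and that every page should belong to exactly one segment. It works on plain tuples so it can be tried without loading a model.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not an LM-Kit.NET API): returns true when the ranges
// cover pages 1..pageCount exactly once, with no gaps and no overlaps.
// Assumes 1-based, inclusive page numbers.
static bool CoversAllPages(IEnumerable<(int Start, int End)> ranges, int pageCount)
{
    int expectedStart = 1;
    foreach (var (start, end) in ranges.OrderBy(r => r.Start))
    {
        // Each range must begin right after the previous one ends.
        if (start != expectedStart || end < start)
            return false;
        expectedStart = end + 1;
    }
    // The last range must end on the final page.
    return expectedStart == pageCount + 1;
}

// Example: a 6-page scan split into three documents.
var complete = new List<(int Start, int End)> { (1, 2), (3, 3), (4, 6) };
Console.WriteLine(CoversAllPages(complete, 6));   // True

// Example: page 3 is missing from the detected segments.
var withGap = new List<(int Start, int End)> { (1, 2), (4, 6) };
Console.WriteLine(CoversAllPages(withGap, 6));    // False
```

With the SDK types, the ranges could presumably be built from result.Segments using each segment's StartPage and EndPage, and the page count taken from Attachment.PageCount, assuming those properties are integers.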


🔧 Extend the Demo

  • Combine with TextExtraction to extract structured data from each detected segment.
  • Add a classification step to route segments to different processing pipelines.
  • Build a batch processor that scans an inbox folder and splits all incoming PDFs.
  • Use PdfSplitter to export detected segments as separate PDF files automatically.
  • Use PdfMerger to recombine selected segments into new documents.
  • Integrate with a document management system for automated filing.
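
As a starting point for the batch-processing and export ideas above, here is a minimal sketch that walks an inbox folder, splits each PDF, and writes the detected segments to disk. It uses only the calls named in this README (Split, PdfSplitter.SplitToFiles with the argument order given in the key types notes, and the DocumentCount, Confidence, and ContainsMultipleDocuments properties). The folder names and the OutputDirFor helper are hypothetical, and the return value of SplitToFiles is deliberately ignored because the README does not describe it.

```csharp
using System;
using System.IO;
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
using LMKit.Document.Pdf;

// Hypothetical helper: one output folder per input file, named after the file.
static string OutputDirFor(string outputRoot, string pdfPath) =>
    Path.Combine(outputRoot, Path.GetFileNameWithoutExtension(pdfPath));

// Hypothetical folder names, chosen for illustration only.
string inboxDir = "inbox";
string outputRoot = "split_output";

if (!Directory.Exists(inboxDir))
{
    Console.WriteLine($"Inbox folder '{inboxDir}' not found; nothing to do.");
}
else
{
    // Load the vision model once and reuse it for every file.
    using LM model = LM.LoadFromModelID("qwen3-vl:8b");
    var splitter = new DocumentSplitting(model);

    foreach (string pdfPath in Directory.EnumerateFiles(inboxDir, "*.pdf"))
    {
        var attachment = new Attachment(pdfPath);
        DocumentSplittingResult result = splitter.Split(attachment);

        Console.WriteLine(
            $"{Path.GetFileName(pdfPath)}: {result.DocumentCount} document(s), " +
            $"confidence {result.Confidence}");

        // A single-document PDF needs no physical split.
        if (!result.ContainsMultipleDocuments)
            continue;

        string outputDir = OutputDirFor(outputRoot, pdfPath);
        Directory.CreateDirectory(outputDir);

        // Argument order as listed in the key types notes; the prefix is the
        // input file name without its extension.
        PdfSplitter.SplitToFiles(
            attachment, result, outputDir, Path.GetFileNameWithoutExtension(pdfPath));
    }
}
```

Loading the model once outside the loop avoids repeating the most expensive step for every file. A production version would likely add error handling per file and move processed PDFs out of the inbox; both are left out here to keep the sketch short.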