👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/document_splitting
Intelligent Document Splitting for C# .NET Applications
🎯 Purpose of the Sample
The Document Splitting demo shows how to use the LM-Kit.NET SDK to automatically detect logical document boundaries within a multi-page PDF and split it into individual document segments, each with page ranges and descriptive labels.
It uses a vision language model to analyze each page and determine where one document ends and another begins, without requiring manual page-by-page review.
👥 Industry Target Audience
This demo is useful for developers and businesses involved in:
- Mailroom and scanning automation: process bulk-scanned mail containing invoices, letters, forms, and ID cards mixed together.
- Insurance and financial services: separate combined claim documents, policy forms, and supporting documentation.
- Legal and compliance: split multi-document filings into individual contracts, exhibits, and correspondence.
- Healthcare administration: separate patient intake forms, insurance cards, and medical records from combined scans.
🚀 Problem Solved
Multi-page PDFs produced by bulk scanning often contain multiple unrelated documents. Sorting them manually is slow and error-prone.
This demo automates the workflow:
- A vision model analyzes each page's layout, headers, and content
- Logical document boundaries are detected automatically
- Each detected segment is returned with its page range and a descriptive label
💻 Sample Application Description
The Document Splitting demo is a console app that:
- Lets you select a vision-language model (or enter a custom model URI)
- Downloads and loads the model with progress feedback
- Prompts you for a PDF file path
- Analyzes the PDF and detects document boundaries
- Prints each detected segment with page ranges and labels
✨ Key Features
- Vision-based analysis: uses a VLM to understand page layouts and detect document types.
- Automatic boundary detection: identifies where one document ends and another begins.
- Descriptive labels: each segment gets a label like "Invoice", "National ID Card", "Pay Slip".
- Physical PDF splitting: optionally export each detected segment as a separate PDF file using PdfSplitter.
- Optional guidance: provide hints about expected document types for improved accuracy.
- Sync and async API: both Split and SplitAsync methods available.
💻 Minimal Integration Snippet
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
// Load a vision model (8B+ recommended)
using LM model = LM.LoadFromModelID("qwen3-vl:8b");
// Create splitter
var splitter = new DocumentSplitting(model);
// Analyze PDF
DocumentSplittingResult result = splitter.Split(new Attachment("batch_scan.pdf"));
// Process results
Console.WriteLine($"Documents found: {result.DocumentCount}");
foreach (DocumentSegment segment in result.Segments)
{
    Console.WriteLine($"  Pages {segment.StartPage}-{segment.EndPage}: {segment.Label}");
}
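For applications that should not block the calling thread, the SplitAsync method listed under Key Features can be used in much the same way. The following is a minimal sketch, not the SDK's documented usage: it assumes SplitAsync accepts the same Attachment argument as Split and returns a Task<DocumentSplittingResult>. Check the LM-Kit.NET API reference for the exact signature and any overloads.

```csharp
using System;
using System.Threading.Tasks;
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;

// Load a vision model, as in the synchronous snippet.
using LM model = LM.LoadFromModelID("qwen3-vl:8b");

var splitter = new DocumentSplitting(model);

// Assumption: SplitAsync mirrors Split(Attachment) and returns
// Task<DocumentSplittingResult>. Verify against the SDK reference.
DocumentSplittingResult result =
    await splitter.SplitAsync(new Attachment("batch_scan.pdf"));

Console.WriteLine($"Documents found: {result.DocumentCount}");
foreach (DocumentSegment segment in result.Segments)
{
    Console.WriteLine($"  Pages {segment.StartPage}-{segment.EndPage}: {segment.Label}");
}
```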
🛠️ Getting Started
📋 Prerequisites
- .NET 8.0 or later
📥 Download the Project
▶️ Running the Application
- Clone the repository:
git clone https://github.com/LM-Kit/lm-kit-net-samples
- Navigate to the project directory:
cd lm-kit-net-samples/console_net/document_splitting
- Build and run the application:
dotnet build
dotnet run
- Follow the on-screen prompts to select a model and provide a multi-page PDF path.
💡 Example Usage
Select a vision-language model: choose a model from the list (8B+ recommended for best accuracy).
Provide a PDF file: enter the path to a multi-page PDF containing mixed documents.
Review results:
- number of detected documents
- page ranges for each segment
- descriptive labels assigned by the model
- overall confidence score
Process more files: press any key to continue and run another PDF.
🔍 Notes on Key Types
- DocumentSplitting (LMKit.Extraction): main class that drives the splitting process. Requires a vision-capable model.
- DocumentSplittingResult (LMKit.Extraction): result container with Segments, DocumentCount, ContainsMultipleDocuments, and Confidence.
- DocumentSegment (LMKit.Extraction): a single detected document with StartPage, EndPage, PageCount, and Label.
- PdfSplitter (LMKit.Document.Pdf): physically splits a PDF into separate files based on detected segments. Use PdfSplitter.SplitToFiles(attachment, result, outputDir, prefix) to export each segment as a separate PDF file.
- Attachment (LMKit.Data): wraps PDF files for input. PageCount exposes the total number of pages.
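The types above can be combined to inspect a result and then export each segment to disk. This is a rough sketch built only from the members named in this section. The output folder name "split_output" and the "segment" prefix are illustrative choices. It also assumes that ContainsMultipleDocuments is a bool and that outputDir and prefix are plain strings, so confirm both against the SDK reference.

```csharp
using System;
using System.IO;
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
using LMKit.Document.Pdf;

using LM model = LM.LoadFromModelID("qwen3-vl:8b");
var splitter = new DocumentSplitting(model);

// Keep a reference to the attachment: it is needed again for the export.
var attachment = new Attachment("batch_scan.pdf");
DocumentSplittingResult result = splitter.Split(attachment);

Console.WriteLine($"Input pages: {attachment.PageCount}");
Console.WriteLine($"Documents found: {result.DocumentCount}");
Console.WriteLine($"Confidence: {result.Confidence}");

foreach (DocumentSegment segment in result.Segments)
{
    Console.WriteLine(
        $"  {segment.Label}: pages {segment.StartPage}-{segment.EndPage} ({segment.PageCount} page(s))");
}

// Assumption: ContainsMultipleDocuments is a bool.
if (result.ContainsMultipleDocuments)
{
    // Illustrative output location and file prefix (not SDK defaults).
    string outputDir = Path.Combine(Environment.CurrentDirectory, "split_output");
    Directory.CreateDirectory(outputDir);

    // Call shape taken from the PdfSplitter note in this section.
    PdfSplitter.SplitToFiles(attachment, result, outputDir, "segment");
    Console.WriteLine($"Segments exported to {outputDir}");
}
```

Skipping the export when only one document is detected avoids writing a copy of the original file. Whether that is the right behavior depends on the downstream workflow.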
🔧 Extend the Demo
- Combine with TextExtraction to extract structured data from each detected segment.
- Add a classification step to route segments to different processing pipelines.
- Build a batch processor that scans an inbox folder and splits all incoming PDFs.
- Use PdfSplitter to export detected segments as separate PDF files automatically.
- Use PdfMerger to recombine selected segments into new documents.
- Integrate with a document management system for automated filing.
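As a starting point for the routing idea, a segment's label can be mapped to a pipeline name with simple keyword matching. The sketch below is hypothetical and does not use any LM-Kit.NET API: SegmentRouter, its keywords, and the pipeline names are made up for illustration. A production system would more likely use a dedicated classification step.

```csharp
using System;

// Hypothetical router: maps a segment label to a pipeline name.
// Keywords and pipeline names are illustrative, not part of LM-Kit.NET.
public static class SegmentRouter
{
    private static readonly (string Keyword, string Pipeline)[] Rules =
    {
        ("invoice", "accounts-payable"),
        ("pay slip", "payroll"),
        ("id card", "identity-verification"),
    };

    // Returns the first pipeline whose keyword appears in the label
    // (case-insensitive), or "manual-review" when nothing matches.
    public static string Route(string label)
    {
        if (string.IsNullOrWhiteSpace(label))
        {
            return "manual-review";
        }

        foreach (var (keyword, pipeline) in Rules)
        {
            if (label.Contains(keyword, StringComparison.OrdinalIgnoreCase))
            {
                return pipeline;
            }
        }

        return "manual-review";
    }
}
```

Inside the result loop, SegmentRouter.Route(segment.Label) would then pick the queue for each segment. Unknown or empty labels fall back to manual review so that no segment is silently dropped.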