Class DocumentSplitting
- Namespace: LMKit.Extraction
- Assembly: LM-Kit.NET.dll
Provides functionality to detect logical document boundaries within a multi-page file using a vision language model (VLM).
public sealed class DocumentSplitting
- Inheritance: object → DocumentSplitting
Examples
Example: Detect and split documents in a multi-page PDF
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Document.Pdf;
using LMKit.Data;
using System;
using System.Collections.Generic;
// Load a vision-capable model (8B or larger recommended)
LM model = LM.LoadFromModelID("qwen3-vl:8b");
// Create the splitter
DocumentSplitting splitter = new DocumentSplitting(model);
// Detect logical boundaries
var source = new Attachment("multi_document_scan.pdf");
DocumentSplittingResult result = splitter.Split(source);
// Display results
Console.WriteLine($"Document count: {result.DocumentCount}");
Console.WriteLine($"Confidence: {result.Confidence:P0}");
foreach (DocumentSegment segment in result.Segments)
{
    Console.WriteLine($"  Pages {segment.StartPage}-{segment.EndPage}: {segment.Label} ({segment.PageCount} pages)");
}
// Physically split the PDF into separate files using PdfSplitter
if (result.ContainsMultipleDocuments)
{
    List<Attachment> documents = PdfSplitter.Split(source, result);
    Console.WriteLine($"Split into {documents.Count} separate PDFs");
}
Example: Detect and split in one call
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
using System;
// Load a vision-capable model (8B or larger recommended)
LM model = LM.LoadFromModelID("qwen3-vl:8b");
// Provide guidance about the expected document types
DocumentSplitting splitter = new DocumentSplitting(model)
{
    Guidance = "The file contains a mix of invoices and purchase orders."
};
// Detect boundaries AND split the PDF into separate files in one call
DocumentSplittingResult result = splitter.Split(
    new Attachment("scanned_batch.pdf"),
    splitDocument: true,
    outputDirectory: "output/split_docs");
for (int i = 0; i < result.Segments.Count; i++)
{
    DocumentSegment segment = result.Segments[i];
    Console.WriteLine($"{segment.Label}: pages {segment.StartPage}-{segment.EndPage} => {result.Documents[i]}");
}
Remarks
The DocumentSplitting class analyzes a multi-page PDF attachment and determines whether it contains multiple logical documents. For each detected document, it returns the corresponding page range.
This class requires a vision-capable language model. The model must have
HasVision set to true. Page images are fed directly to
the VLM for visual boundary detection, which allows reliable splitting even on
scanned documents or documents with complex layouts.
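Because splitting depends on vision input, it can help to check the model's capability before constructing the splitter. The following is a minimal sketch, assuming HasVision is exposed as a readable boolean property on the loaded LM instance, as the requirement above suggests. The exception type and message are illustrative choices and do not describe what the library itself throws.

```csharp
using LMKit.Model;
using LMKit.Extraction;
using System;

// Load the model; the model ID is the one used in the examples above.
LM model = LM.LoadFromModelID("qwen3-vl:8b");

// Assumption: HasVision is a readable bool property on LM.
if (!model.HasVision)
{
    // Illustrative guard only; not the library's own error handling.
    throw new InvalidOperationException(
        "DocumentSplitting requires a vision-capable model.");
}

DocumentSplitting splitter = new DocumentSplitting(model);
```

Failing early in application code keeps the error close to the model-loading step rather than surfacing later during a split.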
Key Features
- Detect whether a multi-page PDF contains multiple logical documents
- Identify the page range for each detected document
- Optionally split the source PDF into separate attachments per detected segment
- Accept optional guidance text to improve detection accuracy
Typical Workflow
- Create a DocumentSplitting instance with a vision language model
- Optionally configure Guidance
- Call Split(Attachment, CancellationToken) or SplitAsync(Attachment, CancellationToken) to detect boundaries only, or use the splitDocument overloads to also split the PDF
- Access results via DocumentSplittingResult, including Documents when splitting was requested
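The workflow above can also be followed asynchronously. The sketch below uses the SplitAsync(Attachment, CancellationToken) overload with a timeout-based cancellation token. It assumes that SplitAsync returns an awaitable task yielding a DocumentSplittingResult and that cancellation surfaces as an OperationCanceledException, which is the usual .NET convention but is not confirmed by this page. The file name and the five-minute timeout are placeholders.

```csharp
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
using System;
using System.Threading;

// Step 1: create the splitter with a vision-capable model.
LM model = LM.LoadFromModelID("qwen3-vl:8b");

// Step 2: optionally configure guidance.
DocumentSplitting splitter = new DocumentSplitting(model)
{
    Guidance = "The file contains a mix of invoices and purchase orders."
};

// Illustrative timeout: request cancellation after five minutes.
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(5));

try
{
    // Step 3: detect boundaries only (no physical split).
    // Assumption: the returned task yields a DocumentSplittingResult.
    DocumentSplittingResult result = await splitter.SplitAsync(
        new Attachment("scanned_batch.pdf"),
        cts.Token);

    // Step 4: read the detected segments.
    foreach (DocumentSegment segment in result.Segments)
    {
        Console.WriteLine($"{segment.Label}: pages {segment.StartPage}-{segment.EndPage}");
    }
}
catch (OperationCanceledException)
{
    // Assumption: cancellation is reported through this exception type.
    Console.WriteLine("Splitting was cancelled before it completed.");
}
```

Passing a token tied to a timeout is one way to bound long-running detection on large scans; a token linked to a user action would work the same way.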
Constructors
- DocumentSplitting(LM)
Initializes a new instance of the DocumentSplitting class with the specified vision language model.
Properties
- Guidance
Gets or sets semantic guidance for the splitting process.
- MaximumContextLength
Gets or sets the maximum context length (in tokens) allowed for the language model during splitting.
- Model
Gets the vision language model instance used to drive the document splitting process.
Methods
- Split(Attachment, bool, string, CancellationToken)
Detects logical document boundaries synchronously within the specified attachment, and optionally splits the source PDF into separate files for each detected segment.
- Split(Attachment, CancellationToken)
Detects logical document boundaries synchronously within the specified attachment.
- SplitAsync(Attachment, bool, string, CancellationToken)
Asynchronously detects logical document boundaries within the specified attachment, and optionally splits the source PDF into separate files for each detected segment.
- SplitAsync(Attachment, CancellationToken)
Asynchronously detects logical document boundaries within the specified attachment.