Class DocumentSplitting

Namespace: LMKit.Extraction

Assembly: LM-Kit.NET.dll

Provides functionality to detect logical document boundaries within a multi-page file using a vision language model (VLM).

public sealed class DocumentSplitting

Inheritance: object -> DocumentSplitting

Inherited Members: object.Equals(object)

object.Equals(object, object)

object.GetHashCode()

object.GetType()

object.ReferenceEquals(object, object)

object.ToString()

Examples

Example: Detect and split documents in a multi-page PDF

using LMKit.Model;
using LMKit.Extraction;
using LMKit.Document.Pdf;
using LMKit.Data;
using System;
using System.Collections.Generic;

// Load a vision-capable model (8B or larger recommended)
LM model = LM.LoadFromModelID("qwen3-vl:8b");

// Create the splitter
DocumentSplitting splitter = new DocumentSplitting(model);

// Detect logical boundaries
var source = new Attachment("multi_document_scan.pdf");
DocumentSplittingResult result = splitter.Split(source);

// Display results
Console.WriteLine($"Document count: {result.DocumentCount}");
Console.WriteLine($"Confidence: {result.Confidence:P0}");

foreach (DocumentSegment segment in result.Segments)
{
    Console.WriteLine($"  Pages {segment.StartPage}-{segment.EndPage}: {segment.Label} ({segment.PageCount} pages)");
}

// Physically split the PDF into separate files using PdfSplitter
if (result.ContainsMultipleDocuments)
{
    List<Attachment> documents = PdfSplitter.Split(source, result);
    Console.WriteLine($"Split into {documents.Count} separate PDFs");
}

Example: Detect and split in one call

using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
using System;

// Load a vision-capable model (8B or larger recommended)
LM model = LM.LoadFromModelID("qwen3-vl:8b");

// Provide guidance about the expected document types
DocumentSplitting splitter = new DocumentSplitting(model)
{
    Guidance = "The file contains a mix of invoices and purchase orders."
};

// Detect boundaries AND split the PDF into separate files in one call
DocumentSplittingResult result = splitter.Split(
    new Attachment("scanned_batch.pdf"),
    splitDocument: true,
    outputDirectory: "output/split_docs");

for (int i = 0; i < result.Segments.Count; i++)
{
    DocumentSegment segment = result.Segments[i];
    Console.WriteLine($"{segment.Label}: pages {segment.StartPage}-{segment.EndPage} => {result.Documents[i]}");
}

Remarks

The DocumentSplitting class analyzes a multi-page PDF attachment and determines whether it contains multiple logical documents. It returns the page range of each detected document.

This class requires a vision-capable language model. The model must have HasVision set to true. Page images are fed directly to the VLM for visual boundary detection, which allows reliable splitting even on scanned documents or documents with complex layouts.

Key Features

Detect whether a multi-page PDF contains multiple logical documents
Identify the page range for each detected document
Optionally split the source PDF into separate attachments per detected segment
Accept optional guidance text to improve detection accuracy

Typical Workflow

Create a DocumentSplitting instance with a vision language model
Optionally configure Guidance
Call Split(Attachment, CancellationToken) or SplitAsync(Attachment, CancellationToken) to detect boundaries only, or use the splitDocument overloads to also split the PDF
Access results via DocumentSplittingResult, including Documents when splitting was requested
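
The workflow above can also be run asynchronously. The following is a minimal sketch, not a definitive implementation: it uses the SplitAsync(Attachment, CancellationToken) overload listed under Methods and assumes that it returns a Task<DocumentSplittingResult> and that cancellation surfaces as an OperationCanceledException. Neither assumption is stated on this page. The file name and the five-minute timeout are placeholders.

```csharp
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
using System;
using System.Threading;

// Load a vision-capable model (same model ID as in the Examples section)
LM model = LM.LoadFromModelID("qwen3-vl:8b");

// Steps 1 and 2: create the splitter and optionally configure Guidance
DocumentSplitting splitter = new DocumentSplitting(model)
{
    Guidance = "The file contains a mix of invoices and purchase orders."
};

// Request cancellation automatically after five minutes (arbitrary placeholder value)
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(5));

try
{
    // Step 3: detect boundaries only.
    // Assumption: SplitAsync returns a Task<DocumentSplittingResult>.
    DocumentSplittingResult result = await splitter.SplitAsync(
        new Attachment("scanned_batch.pdf"),
        cts.Token);

    // Step 4: read the detected segments
    foreach (DocumentSegment segment in result.Segments)
    {
        Console.WriteLine($"{segment.Label}: pages {segment.StartPage}-{segment.EndPage}");
    }
}
catch (OperationCanceledException)
{
    // Assumption: a cancelled call throws OperationCanceledException
    Console.WriteLine("Splitting was cancelled.");
}
```

Passing a token tied to a timeout keeps a long-running VLM call from blocking a batch pipeline indefinitely. The same token could be passed to the splitDocument overload when the separate files are also needed.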

Constructors

DocumentSplitting(LM): Initializes a new instance of the DocumentSplitting class with the specified vision language model.

Properties

Guidance: Gets or sets semantic guidance for the splitting process.

MaximumContextLength: Gets or sets the maximum context length (in tokens) allowed for the language model during splitting.

Model: Gets the vision language model instance used to drive the document splitting process.

Methods

Split(Attachment, bool, string, CancellationToken): Detects logical document boundaries synchronously within the specified attachment, and optionally splits the source PDF into separate files for each detected segment.

Split(Attachment, CancellationToken): Detects logical document boundaries synchronously within the specified attachment.

SplitAsync(Attachment, bool, string, CancellationToken): Asynchronously detects logical document boundaries within the specified attachment, and optionally splits the source PDF into separate files for each detected segment.

SplitAsync(Attachment, CancellationToken): Asynchronously detects logical document boundaries within the specified attachment.