
Class DocumentSplitting

Namespace
LMKit.Extraction
Assembly
LM-Kit.NET.dll

Provides functionality to detect logical document boundaries within a multi-page file using a vision language model (VLM).

public sealed class DocumentSplitting
Inheritance
DocumentSplitting

Examples

Example: Detect and split documents in a multi-page PDF

using LMKit.Model;
using LMKit.Extraction;
using LMKit.Document.Pdf;
using LMKit.Data;
using System;
using System.Collections.Generic;

// Load a vision-capable model (8B or larger recommended)
LM model = LM.LoadFromModelID("qwen3-vl:8b");

// Create the splitter
DocumentSplitting splitter = new DocumentSplitting(model);

// Detect logical boundaries
var source = new Attachment("multi_document_scan.pdf");
DocumentSplittingResult result = splitter.Split(source);

// Display results
Console.WriteLine($"Document count: {result.DocumentCount}");
Console.WriteLine($"Confidence: {result.Confidence:P0}");

foreach (DocumentSegment segment in result.Segments)
{
    Console.WriteLine($" Pages {segment.StartPage}-{segment.EndPage}: {segment.Label} ({segment.PageCount} pages)");
}

// Physically split the PDF into separate files using PdfSplitter
if (result.ContainsMultipleDocuments)
{
    List<Attachment> documents = PdfSplitter.Split(source, result);
    Console.WriteLine($"Split into {documents.Count} separate PDFs");
}

Example: Detect and split in one call

using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
using System;

// Load a vision-capable model (8B or larger recommended)
LM model = LM.LoadFromModelID("qwen3-vl:8b");

// Provide guidance about the expected document types
DocumentSplitting splitter = new DocumentSplitting(model)
{
    Guidance = "The file contains a mix of invoices and purchase orders."
};

// Detect boundaries AND split the PDF into separate files in one call
DocumentSplittingResult result = splitter.Split(
    new Attachment("scanned_batch.pdf"),
    splitDocument: true,
    outputDirectory: "output/split_docs");

for (int i = 0; i < result.Segments.Count; i++)
{
    DocumentSegment segment = result.Segments[i];
    Console.WriteLine($"{segment.Label}: pages {segment.StartPage}-{segment.EndPage} => {result.Documents[i]}");
}

Remarks

The DocumentSplitting class analyzes a multi-page PDF attachment and determines whether it contains multiple logical documents. It returns the page range of each detected document.

This class requires a vision-capable language model. The model must have HasVision set to true. Page images are fed directly to the VLM for visual boundary detection, which allows reliable splitting even on scanned documents or documents with complex layouts.
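
The vision requirement can be checked before a splitter is created. The following is a minimal sketch that uses only the members named on this page (LM.LoadFromModelID, HasVision, and the DocumentSplitting constructor). How the constructor itself reacts to a model without vision support is not described here, so the explicit guard and its exception type are assumptions chosen for illustration.

```csharp
using LMKit.Model;
using LMKit.Extraction;
using System;

// Load the model that will drive boundary detection
LM model = LM.LoadFromModelID("qwen3-vl:8b");

// Fail early with a clear message if the model cannot process page images.
// Assumption: InvalidOperationException is an illustrative choice, not
// necessarily what the library itself would throw.
if (!model.HasVision)
{
    throw new InvalidOperationException(
        "DocumentSplitting requires a vision-capable model (HasVision must be true).");
}

// Safe to construct the splitter once the capability check has passed
DocumentSplitting splitter = new DocumentSplitting(model);
```

Checking up front keeps a misconfigured model ID from surfacing as a harder-to-diagnose failure later in a batch run.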

Key Features

  • Detect whether a multi-page PDF contains multiple logical documents
  • Identify the page range for each detected document
  • Optionally split the source PDF into separate attachments per detected segment
  • Accept optional guidance text to improve detection accuracy

Typical Workflow

  1. Create a DocumentSplitting instance with a vision language model
  2. Optionally configure Guidance
  3. Call Split(Attachment, CancellationToken) or SplitAsync(Attachment, CancellationToken) to detect boundaries only, or use the splitDocument overloads to also split the PDF
  4. Access results via DocumentSplittingResult, including Documents when splitting was requested
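
The workflow above might look as follows with the asynchronous, detection-only overload. This is a sketch under stated assumptions: SplitAsync(Attachment, CancellationToken) is taken to return an awaitable that yields a DocumentSplittingResult, cancellation is assumed to surface as an OperationCanceledException, and the five-minute timeout is an arbitrary illustrative value. The file name and guidance text are reused from the examples on this page.

```csharp
using LMKit.Model;
using LMKit.Extraction;
using LMKit.Data;
using System;
using System.Threading;
using System.Threading.Tasks;

// Step 1: create the splitter with a vision language model
LM model = LM.LoadFromModelID("qwen3-vl:8b");
DocumentSplitting splitter = new DocumentSplitting(model)
{
    // Step 2: optionally describe the expected document types
    Guidance = "The file contains a mix of invoices and purchase orders."
};

// Assumption: an arbitrary five-minute budget for the whole operation
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(5));

try
{
    // Step 3: detect boundaries only (no files are written).
    // Assumption: the awaited value is a DocumentSplittingResult.
    DocumentSplittingResult result = await splitter.SplitAsync(
        new Attachment("scanned_batch.pdf"), cts.Token);

    // Step 4: read the detected segments
    foreach (DocumentSegment segment in result.Segments)
    {
        Console.WriteLine($"{segment.Label}: pages {segment.StartPage}-{segment.EndPage}");
    }
}
catch (OperationCanceledException)
{
    // Assumption: this is how a triggered token is reported
    Console.WriteLine("Splitting was cancelled before it completed.");
}
```

Passing a token with a timeout is one way to keep a long-running VLM call from blocking a service indefinitely; a token tied to a user-initiated cancel action would work the same way.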

Constructors

DocumentSplitting(LM)

Initializes a new instance of the DocumentSplitting class with the specified vision language model.

Properties

Guidance

Gets or sets semantic guidance for the splitting process.

MaximumContextLength

Gets or sets the maximum context length (in tokens) allowed for the language model during splitting.

Model

Gets the vision language model instance used to drive the document splitting process.

Methods

Split(Attachment, bool, string, CancellationToken)

Detects logical document boundaries synchronously within the specified attachment, and optionally splits the source PDF into separate files for each detected segment.

Split(Attachment, CancellationToken)

Detects logical document boundaries synchronously within the specified attachment.

SplitAsync(Attachment, bool, string, CancellationToken)

Asynchronously detects logical document boundaries within the specified attachment, and optionally splits the source PDF into separate files for each detected segment.

SplitAsync(Attachment, CancellationToken)

Asynchronously detects logical document boundaries within the specified attachment.
