Class TextChunking

Namespace
LMKit.Retrieval
Assembly
LM-Kit.NET.dll

Implements a recursive chunking strategy that partitions text into manageable segments ("chunks") for retrieval-augmented generation (RAG) tasks.
The strategy is particularly effective for long texts, which it systematically breaks down into smaller segments.
Unlike linear chunking methods, which divide text sequentially, the recursive strategy adjusts segmentation to the complexity and structure of the text.
This makes it better suited to nested or hierarchical content.
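
The recursive idea described above can be sketched as follows. This is an illustration only, not LM-Kit's implementation: it counts whitespace-separated words as a stand-in for tokens, and the separator hierarchy (paragraphs, then lines, then sentences, then words) as well as the `SplitRecursive` and `CountWords` helpers are assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical separator hierarchy, coarsest first (an assumption, not LM-Kit's list).
string[] hierarchy = { "\n\n", "\n", ". ", " " };

// Word count stands in for a real tokenizer's token count.
static int CountWords(string text) =>
    text.Split(new[] { ' ', '\n', '\r', '\t' }, StringSplitOptions.RemoveEmptyEntries).Length;

static List<string> SplitRecursive(string text, int maxWords, string[] separators, int level = 0)
{
    var chunks = new List<string>();
    text = text.Trim();
    if (text.Length == 0) return chunks;

    // Small enough, or no finer separator left: emit the text as one chunk.
    if (CountWords(text) <= maxWords || level >= separators.Length)
    {
        chunks.Add(text);
        return chunks;
    }

    // Split on the current separator and greedily merge neighbouring pieces
    // while the merged text still fits in the budget.
    string buffer = "";
    foreach (string piece in text.Split(separators[level], StringSplitOptions.RemoveEmptyEntries))
    {
        string candidate = buffer.Length == 0 ? piece : buffer + separators[level] + piece;
        if (CountWords(candidate) <= maxWords)
        {
            buffer = candidate;
            continue;
        }
        // The buffer is full: hand it to the next, finer separator level.
        if (buffer.Length > 0)
            chunks.AddRange(SplitRecursive(buffer, maxWords, separators, level + 1));
        buffer = piece;
    }
    if (buffer.Length > 0)
        chunks.AddRange(SplitRecursive(buffer, maxWords, separators, level + 1));
    return chunks;
}

string doc = "Alpha beta gamma. Delta epsilon.\n\nZeta eta theta iota kappa lambda.";
foreach (string chunk in SplitRecursive(doc, 4, hierarchy))
    Console.WriteLine($"[{chunk}]");
```

Coarse boundaries are tried first, and a finer separator is used only for pieces that still exceed the budget, so short paragraphs and sentences stay intact where possible. The sketch omits overlap between chunks and drops the separator it splits on.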

public sealed class TextChunking : IChunking
Inheritance
TextChunking
Implements
IChunking

Examples

Example: Configure chunking for RagEngine

using LMKit.Model;
using LMKit.Retrieval;
using System;

LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
RagEngine ragEngine = new RagEngine(embeddingModel);

// Configure custom chunking
var chunker = new TextChunking
{
    MaxChunkSize = 300,    // Smaller chunks for precise retrieval
    MaxOverlapSize = 30,   // Some overlap to preserve context
    KeepSpacings = false   // Normalize whitespace
};

ragEngine.DefaultIChunking = chunker;

// Import text with custom chunking
ragEngine.ImportText("Your long document text here...", "docs", "section1");
Console.WriteLine("Text imported with custom chunking settings.");

Example: Chunking for code or formatted text

using LMKit.Model;
using LMKit.Retrieval;
using System.IO;  // Required for File.ReadAllText

LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
RagEngine ragEngine = new RagEngine(embeddingModel);

// Configure chunking to preserve code formatting
var codeChunker = new TextChunking
{
    MaxChunkSize = 400,
    MaxOverlapSize = 50,
    KeepSpacings = true  // Preserve indentation and spacing
};

ragEngine.DefaultIChunking = codeChunker;

// Import source code
string sourceCode = File.ReadAllText("Program.cs");
ragEngine.ImportText(sourceCode, "codebase", "Program.cs");

Remarks

Key Features

  • Configurable chunk size via MaxChunkSize
  • Overlap between chunks to preserve context via MaxOverlapSize
  • Option to preserve original spacing via KeepSpacings
  • Recursive splitting for better handling of complex text structures

Usage with RagEngine
TextChunking is the default chunker used by RagEngine. You can customize chunking behavior by creating a configured instance and assigning it to DefaultIChunking.

Chunk Size Guidelines
Smaller chunks (200-300 tokens) work better for precise retrieval of specific facts. Larger chunks (500-800 tokens) preserve more context but may reduce retrieval precision. The overlap size helps maintain context continuity between adjacent chunks.
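
To compare settings, it can help to estimate how many chunks a document will produce. The sketch below assumes every chunk is filled to the maximum, so each chunk after the first contributes `maxChunkSize - maxOverlapSize` new tokens. A recursive splitter that respects text boundaries will usually produce more chunks than this, so treat the result as a lower bound. `EstimateChunkCount` is a hypothetical helper, not part of LM-Kit, and the 10,000-token document and the 800/80 setting are example values.

```csharp
using System;

// Hypothetical helper: lower-bound estimate of the number of chunks.
static int EstimateChunkCount(int totalTokens, int maxChunkSize, int maxOverlapSize)
{
    if (maxChunkSize <= 0)
        throw new ArgumentOutOfRangeException(nameof(maxChunkSize));
    if (maxOverlapSize < 0 || maxOverlapSize >= maxChunkSize)
        throw new ArgumentOutOfRangeException(nameof(maxOverlapSize));
    if (totalTokens <= 0) return 0;
    if (totalTokens <= maxChunkSize) return 1;

    // Each chunk after the first adds this many previously unseen tokens.
    int stride = maxChunkSize - maxOverlapSize;

    // Integer form of ceil((totalTokens - maxOverlapSize) / stride).
    return (totalTokens - maxOverlapSize + stride - 1) / stride;
}

Console.WriteLine(EstimateChunkCount(10_000, 300, 30)); // 37
Console.WriteLine(EstimateChunkCount(10_000, 800, 80)); // 14
```

Under these assumptions, the smaller setting yields roughly two and a half times as many chunks to embed and search, which is the cost side of the more precise retrieval.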

Fields

KeepSpacings

Determines whether the system preserves multiple consecutive spaces and maintains the original text layout.
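
As a rough illustration of what not preserving spacing can mean for formatted text, the sketch below collapses runs of spaces and tabs into a single space. The exact normalization LM-Kit applies when KeepSpacings is false is not described here, so the `CollapseSpaces` helper and its rule (line breaks are left untouched) are assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical normalization: collapse each run of spaces or tabs to one space.
static string CollapseSpaces(string text) => Regex.Replace(text, @"[ \t]+", " ");

string code = "if (ok)\n    return   x;";
Console.WriteLine(CollapseSpaces(code));
// The four-space indent and the triple space both shrink to a single space.
```

For prose this kind of normalization is harmless and saves tokens. For source code or aligned tables it destroys indentation, which is why the code example above sets KeepSpacings to true.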

Properties

MaxChunkSize

Gets or sets the maximum number of tokens that each text chunk can contain.
This property sets the upper bound on the size of the chunks into which the text is divided.

MaxOverlapSize

Gets or sets the maximum number of tokens to be duplicated (overlapped) between consecutive text chunks. The overlap keeps context from being lost at chunk boundaries and maintains continuity across chunks, which is especially important for cohesive text analysis and generation.
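
The effect of overlap can be seen in a minimal sliding-window sketch. It treats words as tokens and uses fixed-size windows, which is a simplification rather than LM-Kit's recursive algorithm; `WindowWithOverlap` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: fixed-size windows in which each window repeats the
// last maxOverlapSize tokens of the previous one.
static List<string[]> WindowWithOverlap(string[] tokens, int maxChunkSize, int maxOverlapSize)
{
    if (maxOverlapSize < 0 || maxOverlapSize >= maxChunkSize)
        throw new ArgumentOutOfRangeException(nameof(maxOverlapSize));

    var windows = new List<string[]>();
    int stride = maxChunkSize - maxOverlapSize;
    for (int start = 0; start < tokens.Length; start += stride)
    {
        int length = Math.Min(maxChunkSize, tokens.Length - start);
        windows.Add(tokens[start..(start + length)]);
        if (start + length >= tokens.Length) break; // Last window reached the end.
    }
    return windows;
}

string[] words = "the quick brown fox jumps over the lazy dog".Split(' ');
foreach (string[] window in WindowWithOverlap(words, 4, 1))
    Console.WriteLine(string.Join(" ", window));
// the quick brown fox
// fox jumps over the
// the lazy dog
```

The last token of each window reappears as the first token of the next, so a phrase that straddles a boundary is still seen in context by at least one chunk.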