Class TextChunking
Implements a recursive chunking strategy that partitions text into smaller segments ("chunks") for retrieval-augmented generation (RAG) tasks.
It is particularly effective for long texts, which it systematically breaks down into segments that are easier to handle.
Unlike linear chunking methods, which divide text sequentially, the recursive strategy adjusts the segmentation to the complexity and structure of the text.
This allows more precise handling of text data, especially nested or hierarchical information.
public sealed class TextChunking : IChunking
- Inheritance
- object
TextChunking
- Implements
- IChunking
Examples
Example: Configure chunking for RagEngine
using LMKit.Model;
using LMKit.Retrieval;
using System;
LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
RagEngine ragEngine = new RagEngine(embeddingModel);
// Configure custom chunking
var chunker = new TextChunking
{
    MaxChunkSize = 300,   // Smaller chunks for precise retrieval
    MaxOverlapSize = 30,  // Some overlap to preserve context
    KeepSpacings = false  // Normalize whitespace
};
ragEngine.DefaultIChunking = chunker;
// Import text with custom chunking
ragEngine.ImportText("Your long document text here...", "docs", "section1");
Console.WriteLine("Text imported with custom chunking settings.");
Example: Chunking for code or formatted text
using LMKit.Model;
using LMKit.Retrieval;
using System;
using System.IO; // Required for File.ReadAllText
LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
RagEngine ragEngine = new RagEngine(embeddingModel);
// Configure chunking to preserve code formatting
var codeChunker = new TextChunking
{
    MaxChunkSize = 400,
    MaxOverlapSize = 50,
    KeepSpacings = true  // Preserve indentation and spacing
};
ragEngine.DefaultIChunking = codeChunker;
// Import source code
string sourceCode = File.ReadAllText("Program.cs");
ragEngine.ImportText(sourceCode, "codebase", "Program.cs");
Remarks
Key Features
- Configurable chunk size via MaxChunkSize
- Overlap between chunks to preserve context via MaxOverlapSize
- Option to preserve original spacing via KeepSpacings
- Recursive splitting for better handling of complex text structures
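The recursive splitting listed above can be illustrated with a minimal sketch. This is not the TextChunking implementation, whose internals are not documented here. It assumes a hypothetical separator hierarchy (paragraph, line, sentence, word) and approximates tokens by counting whitespace-separated words. The names `separators`, `CountTokens`, and `SplitRecursive` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical separator hierarchy, coarsest first (an assumption, not LM-Kit's list).
string[] separators = { "\n\n", "\n", ". ", " " };

// Assumption: approximate the token count by counting whitespace-separated words.
int CountTokens(string s) =>
    s.Split(new[] { ' ', '\n', '\r', '\t' }, StringSplitOptions.RemoveEmptyEntries).Length;

// Split on the current separator, pack adjacent pieces up to the limit,
// and recurse with a finer separator on any piece that is still too large.
List<string> SplitRecursive(string text, int maxTokens, int level = 0)
{
    if (string.IsNullOrWhiteSpace(text)) return new List<string>();

    // Base case: the text fits, or no finer separator is left to try.
    if (CountTokens(text) <= maxTokens || level >= separators.Length)
        return new List<string> { text.Trim() };

    string sep = separators[level];
    var packed = new List<string>();
    string current = "";
    foreach (string piece in text.Split(new[] { sep }, StringSplitOptions.RemoveEmptyEntries))
    {
        if (string.IsNullOrWhiteSpace(piece)) continue;
        string candidate = current.Length == 0 ? piece : current + sep + piece;
        if (CountTokens(candidate) <= maxTokens)
        {
            current = candidate; // Keep packing pieces into the current chunk.
            continue;
        }
        if (current.Length > 0) packed.Add(current.Trim());
        current = "";
        if (CountTokens(piece) <= maxTokens)
            current = piece; // Start a new chunk with this piece.
        else
            packed.AddRange(SplitRecursive(piece, maxTokens, level + 1)); // Too big: go finer.
    }
    if (current.Length > 0) packed.Add(current.Trim());
    return packed;
}

string sample = "Alpha beta gamma. Delta epsilon.\n\nZeta eta theta iota kappa lambda.";
List<string> result = SplitRecursive(sample, maxTokens: 4);
foreach (string chunk in result) Console.WriteLine($"[{chunk}]");
```

The sketch only falls back to a finer separator when a piece exceeds the limit, so short paragraphs stay intact while long ones are broken at sentence or word boundaries.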
Usage with RagEngine
TextChunking is the default chunker used by RagEngine. You can customize
chunking behavior by creating a configured instance and assigning it to DefaultIChunking.
Chunk Size Guidelines
Smaller chunks (200-300 tokens) work better for precise retrieval of specific facts.
Larger chunks (500-800 tokens) preserve more context but may reduce retrieval precision.
The overlap size helps maintain context continuity between adjacent chunks.
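To get a feel for the trade-off in these guidelines, the following sketch estimates how many chunks a document produces for a given chunk size and overlap. It assumes a simple fixed-stride window (each chunk starts `maxChunkSize - maxOverlapSize` tokens after the previous one), which is a simplification: a recursive chunker splits at structural boundaries, so actual counts may differ. `EstimateChunkCount` is a hypothetical helper, not part of the library.

```csharp
using System;

// Rough estimate under a fixed-stride assumption (not the library's algorithm).
int EstimateChunkCount(int totalTokens, int maxChunkSize, int maxOverlapSize)
{
    if (maxChunkSize <= 0 || maxOverlapSize < 0 || maxOverlapSize >= maxChunkSize)
        throw new ArgumentException("Require 0 <= overlap < chunk size.");
    if (totalTokens <= 0) return 0;
    if (totalTokens <= maxChunkSize) return 1;

    int stride = maxChunkSize - maxOverlapSize;
    // One full chunk, plus enough strides to cover the remaining tokens (ceiling division).
    return 1 + (totalTokens - maxChunkSize + stride - 1) / stride;
}

int smallChunks = EstimateChunkCount(3000, maxChunkSize: 300, maxOverlapSize: 30);
int largeChunks = EstimateChunkCount(3000, maxChunkSize: 800, maxOverlapSize: 30);
Console.WriteLine($"300-token chunks: {smallChunks}"); // 11
Console.WriteLine($"800-token chunks: {largeChunks}"); // 4
```

Under this assumption, a 3000-token document yields roughly 11 small chunks or 4 large ones, so smaller chunks mean more embeddings to store and search in exchange for finer-grained matches.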
Fields
- KeepSpacings
Determines whether the system preserves multiple consecutive spaces and maintains the original text layout.
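As an illustration of what discarding the original spacing can mean for formatted text, the sketch below collapses every run of whitespace into a single space. This is an assumed normalization for demonstration only; the exact behavior of TextChunking when KeepSpacings is false (for example, whether line breaks are kept) is not specified here, and `NormalizeSpacing` is a hypothetical helper.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumed normalization: collapse each run of whitespace into one space.
string NormalizeSpacing(string text) => Regex.Replace(text, @"\s+", " ").Trim();

string formatted = "if (x)\n{\n    return   y;\n}";
string normalized = NormalizeSpacing(formatted);

Console.WriteLine(formatted);   // Layout preserved, as with KeepSpacings = true
Console.WriteLine(normalized);  // Layout flattened onto one line
```

Flattening is usually harmless for prose, but for source code or tabular text the indentation carries meaning, which is why the second example above sets KeepSpacings to true.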
Properties
- MaxChunkSize
Gets or sets the maximum number of tokens that each text chunk can contain.
This property determines the size of the chunks into which the text is divided.
- MaxOverlapSize
Gets or sets the maximum number of tokens duplicated (overlapped) between consecutive text chunks. The overlap keeps context from being lost at chunk boundaries and maintains the continuity of the text across chunks, which is especially important for cohesive text analysis and generation.
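The effect of the overlap can be seen in a small sketch that cuts a token sequence into windows sharing their boundary tokens. It assumes a plain sliding window over pre-split tokens rather than the recursive strategy used by TextChunking, and `SlidingWindows` is a hypothetical helper written for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sliding window with overlap (a simplification, not the library's algorithm).
List<string[]> SlidingWindows(string[] input, int maxChunkSize, int maxOverlapSize)
{
    if (maxChunkSize <= 0 || maxOverlapSize < 0 || maxOverlapSize >= maxChunkSize)
        throw new ArgumentException("Require 0 <= overlap < chunk size.");

    var output = new List<string[]>();
    int stride = maxChunkSize - maxOverlapSize; // Each window starts this far after the last.
    for (int start = 0; start < input.Length; start += stride)
    {
        output.Add(input.Skip(start).Take(maxChunkSize).ToArray());
        if (start + maxChunkSize >= input.Length) break; // This window reached the end.
    }
    return output;
}

string[] sampleTokens = "a b c d e f g h i j".Split(' ');
List<string[]> windows = SlidingWindows(sampleTokens, maxChunkSize: 4, maxOverlapSize: 1);
foreach (string[] window in windows) Console.WriteLine(string.Join(" ", window));
```

With a chunk size of 4 and an overlap of 1, the last token of each window reappears as the first token of the next, so a sentence that straddles a boundary is still partly visible from both sides.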