Understanding Chunking in RAG and Embedding Pipelines
TL;DR
Chunking is the process of splitting large documents into smaller, semantically coherent segments called chunks (or partitions) before embedding them for retrieval. Good chunking preserves meaning across boundaries, keeps each chunk within embedding model limits, and directly impacts the relevance of search results. In LM-Kit.NET, the IChunking interface defines the chunking contract, with three concrete implementations: TextChunking for plain text, MarkdownChunking for Markdown, and HtmlChunking for HTML. These integrate directly with the RagEngine to control how documents are partitioned before embedding.
What is Chunking?
Definition: Chunking is the step in a Retrieval-Augmented Generation (RAG) pipeline that divides source documents into smaller units suitable for embedding and similarity search. Each chunk becomes a Partition in LM-Kit.NET, with its own embedding vector that can be compared against a user query during retrieval.
Why Chunking Matters
Chunking is the bridge between raw documents and effective retrieval. The quality of your chunks directly determines:
- Retrieval Precision: Chunks that are too large dilute relevant information with noise. The retrieval system returns the right document but buries the answer inside irrelevant context.
- Retrieval Recall: Chunks that are too small lose the surrounding context needed to understand a fact. The system retrieves a fragment that the model cannot interpret meaningfully.
- Embedding Quality: Embedding models have a maximum input length. Text exceeding that limit is truncated, losing information silently.
- Context Window Efficiency: Each retrieved chunk consumes tokens in the model's context window. Well-sized chunks maximize useful information per token.
- Generation Quality: When the model receives coherent, self-contained chunks as context, it produces more accurate, grounded responses with fewer hallucinations.
Chunking Strategies
Fixed-Size Chunking
The simplest approach: split text into segments of N tokens with an overlap of M tokens between consecutive segments. Fast and predictable, but blind to document structure.
Document:  [===========================================]
Fixed chunks (overlap = 50 tokens):
Chunk 1:   [==========]
Chunk 2:           [==========]
Chunk 3:                   [==========]
Chunk 4:                           [==========]
Trade-offs: Easy to implement, consistent chunk sizes, but frequently breaks sentences, paragraphs, and logical sections mid-thought.
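The fixed-size strategy above can be sketched in a few lines. This is a minimal, language-agnostic illustration in Python (tokens approximated as list items), not LM-Kit.NET code:

```python
def fixed_size_chunks(tokens, max_size, overlap):
    """Split a token list into chunks of up to max_size tokens,
    with `overlap` tokens shared between consecutive chunks."""
    if overlap >= max_size:
        raise ValueError("overlap must be smaller than max_size")
    step = max_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_size])
        if start + max_size >= len(tokens):
            break  # the last window already reached the end
    return chunks

# 200 pseudo-tokens, 60-token chunks, 10-token overlap -> 4 chunks
words = ("the quick brown fox " * 50).split()
chunks = fixed_size_chunks(words, max_size=60, overlap=10)
```

Note that each chunk begins with the last 10 tokens of its predecessor, which is exactly what keeps boundary-spanning sentences retrievable.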
Structure-Aware Chunking
Splits text along natural boundaries: paragraphs, headings, list items, table rows, or code blocks. Each chunk respects the document's logical organization, preserving semantic coherence.
Markdown document:
# Section A --> Chunk 1 (heading + paragraph)
Paragraph text...
## Section A.1 --> Chunk 2 (subheading + list)
- Item 1
- Item 2
# Section B --> Chunk 3 (heading + paragraph)
Paragraph text...
Trade-offs: Better semantic coherence, but requires parsing the document format. Chunk sizes vary depending on the document structure.
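A minimal heading-based splitter illustrates the idea. This is a Python sketch for the diagram above, not the SDK's MarkdownChunking (which handles far more structure):

```python
import re

def split_markdown_sections(text):
    """Split Markdown into chunks at heading boundaries, so each
    chunk is a heading plus the body text that follows it."""
    sections, current = [], []
    for line in text.splitlines():
        # An ATX heading (1-6 '#' characters) starts a new chunk.
        if re.match(r"^#{1,6}\s", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return sections

doc = """# Section A
Paragraph text...
## Section A.1
- Item 1
- Item 2
# Section B
Paragraph text..."""

sections = split_markdown_sections(doc)  # three chunks, as in the diagram
```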
Recursive Chunking
A hybrid that first tries structure-aware splits (headings, paragraphs), then falls back to sentence boundaries, and finally to fixed-size splits for very long sections. This is the strategy used by LM-Kit.NET's TextChunking class.
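The fallback cascade can be sketched as follows. This Python illustration measures size in characters for simplicity; a real implementation such as TextChunking works in tokens and uses richer boundary detection:

```python
def recursive_chunks(text, max_size, seps=("\n\n", ". ")):
    """Recursively split text: paragraphs first, then sentences,
    then a hard fixed-size split as a last resort."""
    if len(text) <= max_size:
        return [text]
    for i, sep in enumerate(seps):
        parts = [p for p in text.split(sep) if p]
        if len(parts) > 1:
            # Each part may still be too long; retry with finer separators.
            chunks = []
            for part in parts:
                chunks.extend(recursive_chunks(part, max_size, seps[i + 1:]))
            return chunks
    # No natural boundary found: fall back to fixed-size slices.
    return [text[j:j + max_size] for j in range(0, len(text), max_size)]

# A short paragraph survives intact; a 120-char run of text with no
# sentence boundaries is hard-split into 50-char slices.
text = "Para one sentence. Another sentence here.\n\n" + "x" * 120
chunks = recursive_chunks(text, max_size=50)
```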
Practical Application in LM-Kit.NET SDK
LM-Kit.NET provides three chunking implementations through the IChunking interface. Each respects a different document format and can be plugged into the RagEngine pipeline.
1. TextChunking (Plain Text)
The default chunker. Uses a recursive strategy that respects paragraph boundaries and sentences before falling back to token-level splits.
- MaxChunkSize: Target maximum chunk size in tokens. Default: 500.
- MaxOverlapSize: Number of overlapping tokens between consecutive chunks. Default: 50.
- KeepSpacings: Preserve original whitespace and layout. Default: false.
2. MarkdownChunking (Markdown)
Parses Markdown structure and splits along heading boundaries, keeping each section as a self-contained chunk. Ideal for documentation, README files, and knowledge bases authored in Markdown.
- MaxChunkSize: Target maximum chunk size in tokens. Default: 500.
3. HtmlChunking (HTML)
Parses the HTML DOM and splits along structural boundaries (headings, sections, tables, block elements). The most feature-rich chunker, with options for boilerplate removal and heading-context preservation.
- MaxChunkSize: Target maximum chunk size in tokens. Default: 500.
- MaxOverlapSize: Overlap between consecutive chunks. Default: 50.
- StripBoilerplate: Remove navigation, footer, and sidebar elements before chunking. Default: true.
- PreserveHeadingContext: Prepend a breadcrumb trail of parent headings to each chunk. Default: true.
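The heading-context idea is simple to illustrate. The following with_heading_context helper is a hypothetical Python sketch of the concept, not part of the SDK:

```python
def with_heading_context(headings, chunk_text):
    """Prepend a breadcrumb of parent headings so the chunk
    remains interpretable when retrieved in isolation."""
    breadcrumb = " > ".join(headings)
    return f"[{breadcrumb}]\n{chunk_text}"

# A chunk that just says "Run the installer." is ambiguous on its own;
# with its heading trail, the retriever and the LLM both know the scope.
print(with_heading_context(["Installation", "Windows"], "Run the MSI installer."))
```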
Code Example
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Embeddings;
// Load models
var chatModel = LM.LoadFromModelID("gemma3:12b");
var embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b");
// Create RAG engine with embedding support
var ragEngine = new RagEngine(chatModel, new Embedder(embeddingModel));
// Option 1: Use TextChunking (default) with custom settings
var textChunking = new TextChunking
{
MaxChunkSize = 400,
MaxOverlapSize = 80
};
ragEngine.DefaultIChunking = textChunking;
// Option 2: Use MarkdownChunking for .md files
var markdownChunking = new MarkdownChunking
{
MaxChunkSize = 600
};
// Option 3: Use HtmlChunking for web pages
var htmlChunking = new HtmlChunking
{
MaxChunkSize = 500,
StripBoilerplate = true,
PreserveHeadingContext = true
};
// Import documents with format-specific chunking
var dataSource = ragEngine.AddDataSource("knowledge-base");
dataSource.ImportText("Plain text content..."); // Uses DefaultIChunking
dataSource.ImportText("# Heading\nMarkdown...", markdownChunking); // Markdown-aware splits
dataSource.ImportText("<html>...</html>", htmlChunking); // HTML-aware splits
// Retrieve relevant chunks
var matches = ragEngine.FindMatchingPartitions("How does chunking work?", topK: 5);
foreach (var match in matches)
{
    // Guard the slice: Text[..80] throws if the partition is shorter than 80 chars
    var preview = match.Partition.Text[..Math.Min(80, match.Partition.Text.Length)];
    Console.WriteLine($"Score: {match.Similarity:F3} | {preview}...");
}
Choosing the Right Chunk Size
| Chunk Size | Pros | Cons | Best For |
|---|---|---|---|
| Small (100-200 tokens) | High precision, focused retrieval | May lose context, more partitions to embed | Factoid Q&A, keyword-heavy search |
| Medium (300-500 tokens) | Good balance of precision and context | May still split some sections | General-purpose RAG, documentation |
| Large (500-1000 tokens) | Preserves full context per chunk | Lower precision, fewer chunks per context window | Summarization, complex reasoning |
The Overlap Trade-Off
Overlap ensures that information near chunk boundaries is not lost. A typical overlap of 10-20% of the chunk size works well. Too much overlap creates redundant partitions that inflate storage and slow retrieval. Too little overlap risks splitting a key sentence between two chunks where neither contains the full thought.
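The storage cost of overlap is easy to quantify: fixed-size chunking advances by chunk_size − overlap tokens per step, so a document of N tokens yields roughly ⌈(N − overlap) / (chunk_size − overlap)⌉ chunks. A small Python sketch (illustrative arithmetic, not SDK code):

```python
import math

def chunk_count(total_tokens, chunk_size, overlap):
    """Approximate number of chunks produced by fixed-size
    chunking with the given overlap."""
    step = chunk_size - overlap  # effective advance per chunk
    return max(1, math.ceil((total_tokens - overlap) / step))

# 10,000-token document with 500-token chunks:
print(chunk_count(10_000, 500, 0))    # no overlap
print(chunk_count(10_000, 500, 50))   # 10% overlap
print(chunk_count(10_000, 500, 250))  # 50% overlap
```

At 10% overlap (50 of 500 tokens) the document grows from 20 to 23 partitions; at 50% overlap it nearly doubles to 39, which is why the 10-20% guideline above is a sensible default.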
Key Terms
- Chunk: A segment of text produced by a chunking strategy. Called a Partition in LM-Kit.NET.
- Partition: The LM-Kit.NET representation of a chunk, complete with embedding vector, token count, and metadata.
- Overlap: The number of tokens shared between consecutive chunks, preventing information loss at boundaries.
- Boilerplate: Structural HTML elements (navigation, footers, sidebars) that are irrelevant to content retrieval.
- Heading Context: A breadcrumb trail of parent headings prepended to a chunk so it remains interpretable in isolation.
- Recursive Chunking: A strategy that tries progressively finer split boundaries (sections, paragraphs, sentences, tokens) until the chunk fits within the size limit.
Related API Documentation
- IChunking: Interface defining the chunking contract
- TextChunking: Recursive plain-text chunking with overlap
- MarkdownChunking: Structure-aware Markdown chunking
- HtmlChunking: DOM-aware HTML chunking with boilerplate removal
- RagEngine: Core RAG engine that consumes chunked partitions
- Partition: Individual chunk with embedding and metadata
- DataSource: Repository of chunked content for retrieval
Related Glossary Topics
- RAG (Retrieval-Augmented Generation): The pipeline that chunking feeds into
- Embeddings: Vector representations computed per chunk
- Semantic Similarity: How chunks are matched to queries
- Vector Database: Persistent storage for chunk embeddings
- Context Windows: Token budgets that constrain how many chunks fit in a prompt
- Tokenization: How text is converted to tokens that determine chunk size
- Reranking: Improving retrieval precision after initial chunk matching
- Intelligent Document Processing (IDP): Document pipelines where chunking is one stage
External Resources
- Retrieval-Augmented Generation for Large Language Models: A Survey (Gao et al., 2023): Survey of retrieval-augmented generation techniques including chunking
- RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval (Sarthi et al., 2024): Hierarchical chunking and summarization for retrieval
- LM-Kit RAG Demo: End-to-end RAG sample with chunking
- Chunk HTML and Markdown for RAG: Step-by-step chunking guide
Summary
Chunking is the critical step that determines how documents are divided into retrievable units in a RAG pipeline. In LM-Kit.NET, the IChunking interface provides a pluggable abstraction with three implementations: TextChunking for recursive plain-text splitting, MarkdownChunking for heading-aware Markdown splitting, and HtmlChunking for DOM-aware HTML splitting with boilerplate removal. By choosing the right strategy and tuning chunk size and overlap for your content, you directly improve retrieval precision and the quality of AI-generated responses.