Understanding Chunking in RAG and Embedding Pipelines
TL;DR
Chunking is the process of splitting large documents into smaller, semantically coherent segments called chunks (or partitions) before embedding them for retrieval. Good chunking preserves meaning across boundaries, keeps each chunk within embedding model limits, and directly impacts the relevance of search results. In LM-Kit.NET, the IChunking interface defines the chunking contract, with three concrete implementations: TextChunking for plain text, MarkdownChunking for Markdown, and HtmlChunking for HTML. These integrate directly with the RagEngine to control how documents are partitioned before embedding.
What is Chunking?
Definition: Chunking is the step in a Retrieval-Augmented Generation (RAG) pipeline that divides source documents into smaller units suitable for embedding and similarity search. Each chunk becomes a Partition in LM-Kit.NET, with its own embedding vector that can be compared against a user query during retrieval.
Why Chunking Matters
Chunking is the bridge between raw documents and effective retrieval. The quality of your chunks directly determines:
- Retrieval Precision: Chunks that are too large dilute relevant information with noise. The retrieval system returns the right document but buries the answer inside irrelevant context.
- Retrieval Recall: Chunks that are too small lose the surrounding context needed to understand a fact. The system retrieves a fragment that the model cannot interpret meaningfully.
- Embedding Quality: Embedding models have a maximum input length. Text exceeding that limit is truncated, losing information silently.
- Context Window Efficiency: Each retrieved chunk consumes tokens in the model's context window. Well-sized chunks maximize useful information per token.
- Generation Quality: When the model receives coherent, self-contained chunks as context, it produces more accurate, grounded responses with fewer hallucinations.
Chunking Strategies
Fixed-Size Chunking
The simplest approach: split text into segments of N tokens with an overlap of M tokens between consecutive segments. Fast and predictable, but blind to document structure.
Document:  [===========================================]
Fixed chunks (overlap = 50 tokens):
Chunk 1:   [==========]
Chunk 2:           [==========]
Chunk 3:                   [==========]
Chunk 4:                           [==========]
Trade-offs: Easy to implement, consistent chunk sizes, but frequently breaks sentences, paragraphs, and logical sections mid-thought.
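The fixed-size strategy above can be sketched in a few lines. This is a minimal, language-agnostic illustration in Python (tokens approximated as list items), not LM-Kit.NET code:

```python
def fixed_size_chunks(tokens, max_size, overlap):
    """Split a token list into chunks of up to max_size tokens,
    with `overlap` tokens shared between consecutive chunks."""
    if overlap >= max_size:
        raise ValueError("overlap must be smaller than max_size")
    step = max_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_size])
        if start + max_size >= len(tokens):
            break  # the last window already reached the end
    return chunks

# 200 pseudo-tokens, 60-token chunks, 10-token overlap -> 4 chunks
words = ("the quick brown fox " * 50).split()
chunks = fixed_size_chunks(words, max_size=60, overlap=10)
```

Note that each chunk begins with the last 10 tokens of its predecessor, which is exactly what keeps boundary-spanning sentences retrievable.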
Structure-Aware Chunking
Splits text along natural boundaries: paragraphs, headings, list items, table rows, or code blocks. Each chunk respects the document's logical organization, preserving semantic coherence.
Markdown document:
# Section A --> Chunk 1 (heading + paragraph)
Paragraph text...
## Section A.1 --> Chunk 2 (subheading + list)
- Item 1
- Item 2
# Section B --> Chunk 3 (heading + paragraph)
Paragraph text...
Trade-offs: Better semantic coherence, but requires parsing the document format. Chunk sizes vary depending on the document structure.
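A minimal heading-based splitter illustrates the idea. This is a Python sketch for the diagram above, not the SDK's MarkdownChunking (which handles far more structure):

```python
import re

def split_markdown_sections(text):
    """Split Markdown into chunks at heading boundaries, so each
    chunk is a heading plus the body text that follows it."""
    sections, current = [], []
    for line in text.splitlines():
        # An ATX heading (1-6 '#' characters) starts a new chunk.
        if re.match(r"^#{1,6}\s", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return sections

doc = """# Section A
Paragraph text...
## Section A.1
- Item 1
- Item 2
# Section B
Paragraph text..."""

sections = split_markdown_sections(doc)  # three chunks, as in the diagram
```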
Recursive Chunking
A hybrid that first tries structure-aware splits (headings, paragraphs), then falls back to sentence boundaries, and finally to fixed-size splits for very long sections. This is the strategy used by LM-Kit.NET's TextChunking class.
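The fallback cascade can be sketched as follows. This Python illustration measures size in characters for simplicity; a real implementation such as TextChunking works in tokens and uses richer boundary detection:

```python
def recursive_chunks(text, max_size, seps=("\n\n", ". ")):
    """Recursively split text: paragraphs first, then sentences,
    then a hard fixed-size split as a last resort."""
    if len(text) <= max_size:
        return [text]
    for i, sep in enumerate(seps):
        parts = [p for p in text.split(sep) if p]
        if len(parts) > 1:
            # Each part may still be too long; retry with finer separators.
            chunks = []
            for part in parts:
                chunks.extend(recursive_chunks(part, max_size, seps[i + 1:]))
            return chunks
    # No natural boundary found: fall back to fixed-size slices.
    return [text[j:j + max_size] for j in range(0, len(text), max_size)]

# A short paragraph survives intact; a 120-char run of text with no
# sentence boundaries is hard-split into 50-char slices.
text = "Para one sentence. Another sentence here.\n\n" + "x" * 120
chunks = recursive_chunks(text, max_size=50)
```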
Practical Application in LM-Kit.NET SDK
LM-Kit.NET provides three chunking implementations through the IChunking interface. Each respects a different document format and can be plugged into the RagEngine pipeline.
1. TextChunking (Plain Text)
The default chunker. Uses a recursive strategy that respects paragraph boundaries and sentences before falling back to token-level splits.
- MaxChunkSize: Target maximum chunk size in tokens. Default: 500.
- MaxOverlapSize: Number of overlapping tokens between consecutive chunks. Default: 50.
- KeepSpacings: Preserve original whitespace and layout. Default: false.
2. MarkdownChunking (Markdown)
Parses Markdown structure and splits along heading boundaries, keeping each section as a self-contained chunk. Ideal for documentation, README files, and knowledge bases authored in Markdown.
- MaxChunkSize: Target maximum chunk size in tokens. Default: 500.
3. HtmlChunking (HTML)
Parses the HTML DOM and splits along structural boundaries (headings, sections, tables, block elements). The most feature-rich chunker, with options for boilerplate removal and heading-context preservation.
- MaxChunkSize: Target maximum chunk size in tokens. Default: 500.
- MaxOverlapSize: Overlap between consecutive chunks. Default: 50.
- StripBoilerplate: Remove navigation, footer, and sidebar elements before chunking. Default: true.
- PreserveHeadingContext: Prepend a breadcrumb trail of parent headings to each chunk. Default: true.
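The heading-context idea is simple to illustrate. The following with_heading_context helper is a hypothetical Python sketch of the concept, not part of the SDK:

```python
def with_heading_context(headings, chunk_text):
    """Prepend a breadcrumb of parent headings so the chunk
    remains interpretable when retrieved in isolation."""
    breadcrumb = " > ".join(headings)
    return f"[{breadcrumb}]\n{chunk_text}"

# A chunk that just says "Run the installer." is ambiguous on its own;
# with its heading trail, the retriever and the LLM both know the scope.
print(with_heading_context(["Installation", "Windows"], "Run the MSI installer."))
```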
Code Example
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Embeddings;
// Load models
var chatModel = LM.LoadFromModelID("gemma3:12b");
var embeddingModel = LM.LoadFromModelID("qwen3-embedding:0.6b");
// Create RAG engine with embedding support
var ragEngine = new RagEngine(chatModel, new Embedder(embeddingModel));
// Option 1: Use TextChunking (default) with custom settings
var textChunking = new TextChunking
{
MaxChunkSize = 400,
MaxOverlapSize = 80
};
ragEngine.DefaultIChunking = textChunking;
// Option 2: Use MarkdownChunking for .md files
var markdownChunking = new MarkdownChunking
{
MaxChunkSize = 600
};
// Option 3: Use HtmlChunking for web pages
var htmlChunking = new HtmlChunking
{
MaxChunkSize = 500,
StripBoilerplate = true,
PreserveHeadingContext = true
};
// Import documents with format-specific chunking
var dataSource = ragEngine.AddDataSource("knowledge-base");
dataSource.ImportText("Plain text content..."); // Uses DefaultIChunking
dataSource.ImportText("# Heading\nMarkdown...", markdownChunking); // Markdown-aware splits
dataSource.ImportText("<html>...</html>", htmlChunking); // HTML-aware splits
// Retrieve relevant chunks
var matches = ragEngine.FindMatchingPartitions("How does chunking work?", topK: 5);
foreach (var match in matches)
{
    // Guard the slice: Text[..80] throws if the partition is shorter than 80 chars
    var preview = match.Partition.Text[..Math.Min(80, match.Partition.Text.Length)];
    Console.WriteLine($"Score: {match.Similarity:F3} | {preview}...");
}
Choosing the Right Chunk Size
| Chunk Size | Pros | Cons | Best For |
|---|---|---|---|
| Small (100-200 tokens) | High precision, focused retrieval | May lose context, more partitions to embed | Factoid Q&A, keyword-heavy search |
| Medium (300-500 tokens) | Good balance of precision and context | May still split some sections | General-purpose RAG, documentation |
| Large (500-1000 tokens) | Preserves full context per chunk | Lower precision, fewer chunks per context window | Summarization, complex reasoning |
The Overlap Trade-Off
Overlap ensures that information near chunk boundaries is not lost. A typical overlap of 10-20% of the chunk size works well. Too much overlap creates redundant partitions that inflate storage and slow retrieval. Too little overlap risks splitting a key sentence between two chunks where neither contains the full thought.
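The storage cost of overlap is easy to quantify: fixed-size chunking advances by chunk_size − overlap tokens per step, so a document of N tokens yields roughly ⌈(N − overlap) / (chunk_size − overlap)⌉ chunks. A small Python sketch (illustrative arithmetic, not SDK code):

```python
import math

def chunk_count(total_tokens, chunk_size, overlap):
    """Approximate number of chunks produced by fixed-size
    chunking with the given overlap."""
    step = chunk_size - overlap  # effective advance per chunk
    return max(1, math.ceil((total_tokens - overlap) / step))

# 10,000-token document with 500-token chunks:
print(chunk_count(10_000, 500, 0))    # no overlap
print(chunk_count(10_000, 500, 50))   # 10% overlap
print(chunk_count(10_000, 500, 250))  # 50% overlap
```

At 10% overlap (50 of 500 tokens) the document grows from 20 to 23 partitions; at 50% overlap it nearly doubles to 39, which is why the 10-20% guideline above is a sensible default.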
Key Terms
- Chunk: A segment of text produced by a chunking strategy. Called a Partition in LM-Kit.NET.
- Partition: The LM-Kit.NET representation of a chunk, complete with embedding vector, token count, and metadata.
- Overlap: The number of tokens shared between consecutive chunks, preventing information loss at boundaries.
- Boilerplate: Structural HTML elements (navigation, footers, sidebars) that are irrelevant to content retrieval.
- Heading Context: A breadcrumb trail of parent headings prepended to a chunk so it remains interpretable in isolation.
- Recursive Chunking: A strategy that tries progressively finer split boundaries (sections, paragraphs, sentences, tokens) until the chunk fits within the size limit.
Related API Documentation
- IChunking: Interface defining the chunking contract
- TextChunking: Recursive plain-text chunking with overlap
- MarkdownChunking: Structure-aware Markdown chunking
- HtmlChunking: DOM-aware HTML chunking with boilerplate removal
- RagEngine: Core RAG engine that consumes chunked partitions
- Partition: Individual chunk with embedding and metadata
- DataSource: Repository of chunked content for retrieval
Related Glossary Topics
- RAG (Retrieval-Augmented Generation): The pipeline that chunking feeds into
- Embeddings: Vector representations computed per chunk
- Semantic Similarity: How chunks are matched to queries
- Vector Database: Persistent storage for chunk embeddings
- Context Windows: Token budgets that constrain how many chunks fit in a prompt
- Tokenization: How text is converted to tokens that determine chunk size
- Reranking: Improving retrieval precision after initial chunk matching
- Intelligent Document Processing (IDP): Document pipelines where chunking is one stage
External Resources
- Retrieval-Augmented Generation for Large Language Models: A Survey (Gao et al., 2023): Survey of retrieval-augmented generation techniques including chunking
- RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval (Sarthi et al., 2024): Hierarchical chunking and summarization for retrieval
- LM-Kit RAG Demo: End-to-end RAG sample with chunking
- Chunk HTML and Markdown for RAG: Step-by-step chunking guide
Summary
Chunking is the critical step that determines how documents are divided into retrievable units in a RAG pipeline. In LM-Kit.NET, the IChunking interface provides a pluggable abstraction with three implementations: TextChunking for recursive plain-text splitting, MarkdownChunking for heading-aware Markdown splitting, and HtmlChunking for DOM-aware HTML splitting with boilerplate removal. By choosing the right strategy and tuning chunk size and overlap for your content, you directly improve retrieval precision and the quality of AI-generated responses.