Optimize RAG Results with Custom Chunking Strategies
Chunking is how your documents get split into passages before they are embedded and indexed. It is the single biggest lever for RAG quality after model selection. Chunks that are too large dilute the embedding signal, making retrieval imprecise. Chunks that are too small lose surrounding context, producing incomplete answers. LM-Kit.NET ships three chunking strategies out of the box (TextChunking, MarkdownChunking, and HtmlChunking) and lets you configure them per import, so you can tailor splitting to each document type in your pipeline.
TL;DR
using LMKit.Model;
using LMKit.Retrieval;
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
var rag = new RagEngine(embeddingModel);
// Use Markdown-aware chunking for structured content
rag.DefaultIChunking = new MarkdownChunking
{
MaxChunkSize = 400
};
// Or use text chunking with custom overlap for plain text
rag.DefaultIChunking = new TextChunking
{
MaxChunkSize = 300,
MaxOverlapSize = 75
};
// Or use HTML-aware chunking for web content
rag.DefaultIChunking = new HtmlChunking
{
MaxChunkSize = 400,
StripBoilerplate = true,
PreserveHeadingContext = true
};
Why This Matters
Two problems that custom chunking solves:
FAQ and support documents return vague answers. When chunk size is too large, a single chunk may contain answers to several unrelated questions. The embedding becomes an average of all those topics, reducing similarity to any one question. Smaller chunks isolate each answer, producing precise matches.
Technical documentation loses structure. Splitting Markdown or code files on a fixed token count can break a code block in half or separate a heading from its content. Structure-aware chunking keeps headings, code fences, and tables intact, preserving the semantic coherence of each chunk.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 8 GB recommended |
| VRAM | 1 GB (embedding model) |
| Disk | ~500 MB free for model download |
Step 1: Understand the Default Behavior
When you create a RagEngine without configuring chunking, it uses TextChunking with default settings:
using LMKit.Model;
using LMKit.Retrieval;
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
var rag = new RagEngine(embeddingModel);
// This is equivalent to the default:
rag.DefaultIChunking = new TextChunking
{
MaxChunkSize = 500, // 500 tokens per chunk
MaxOverlapSize = 50, // 50 tokens overlap between adjacent chunks
KeepSpacings = false // normalize whitespace
};
TextChunking splits text using a priority cascade:
- Paragraph breaks (\n\n)
- Line breaks (\n)
- Sentence punctuation (., ;, ?, !)
- Whitespace
- Hard token limit (forced split)
This priority system means the chunker always prefers to split at natural boundaries. A hard split only happens when a single paragraph or sentence exceeds MaxChunkSize.
Overlap works by prepending or appending tokens from adjacent chunks at sentence or paragraph boundaries, so context is not lost at the edges. When the final chunk in a document is very small, it is merged with the previous chunk to avoid tiny trailing fragments.
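The cascade can be pictured as a toy character-based splitter. This sketch is illustrative only: the real TextChunking operates on token counts and handles overlap and merging, but the descending boundary search is the same idea.

```csharp
using System;

// Toy character-based version of the boundary-priority cascade. The real
// TextChunking operates on token counts; this sketch only illustrates the
// descending search order.
static class CascadeSplitter
{
    // Returns the cut index so the first piece is at most `max` characters,
    // preferring the highest-priority boundary available in the window.
    public static int FindSplit(string text, int max)
    {
        if (text.Length <= max) return text.Length;
        string window = text.Substring(0, max);

        int i = window.LastIndexOf("\n\n", StringComparison.Ordinal); // paragraph break
        if (i > 0) return i + 2;

        i = window.LastIndexOf('\n');                                 // line break
        if (i > 0) return i + 1;

        i = window.LastIndexOfAny(new[] { '.', ';', '?', '!' });      // sentence punctuation
        if (i > 0) return i + 1;

        i = window.LastIndexOf(' ');                                  // whitespace
        if (i > 0) return i + 1;

        return max;                                                   // hard limit (forced)
    }
}
```

The splitter only falls through to the hard limit when no natural boundary exists inside the window, which mirrors when you would see HardLimit splits in practice.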
Step 2: Tune Chunk Size
Chunk size controls the trade-off between retrieval precision and context completeness.
using LMKit.Model;
using LMKit.Retrieval;
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
// Small chunks: precise retrieval, less context per chunk
var faqChunker = new TextChunking
{
MaxChunkSize = 200
};
// Medium chunks: balanced (the default)
var generalChunker = new TextChunking
{
MaxChunkSize = 500
};
// Large chunks: more context, less precise matching
var longFormChunker = new TextChunking
{
MaxChunkSize = 800
};
Trade-offs:
- Smaller chunks (200 to 300 tokens) produce embeddings that represent a single idea, making similarity search more precise. However, each chunk may lack enough context for the LLM to generate a complete answer. Best for FAQ databases, glossaries, and short-answer knowledge bases.
- Medium chunks (400 to 600 tokens) balance precision with context. This range works well for general documentation, product manuals, and policy documents.
- Larger chunks (800+ tokens) capture more surrounding context, which helps the LLM produce coherent answers from long-form content such as research papers, legal contracts, or book chapters. The downside is that embedding similarity becomes less discriminating when each chunk covers multiple subtopics.
MaxChunkSize is specified in tokens and is clamped between 50 and the model's embedding size.
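The clamping rule works out as below. This helper is illustrative, not an LM-Kit.NET API; `embeddingModelLimit` stands in for the model-derived upper bound.

```csharp
using System;

// Sketch of the documented clamping rule: requested sizes below 50 tokens
// are raised to 50, and sizes above the embedding model's limit are lowered
// to it. (The library applies this internally; this helper is illustrative.)
static class ChunkSizeRules
{
    public static int EffectiveChunkSize(int requested, int embeddingModelLimit)
        => Math.Clamp(requested, 50, embeddingModelLimit);
}
```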
Step 3: Tune Overlap
Overlap prevents information loss at chunk boundaries. When a sentence spans two chunks, overlap ensures that both chunks contain the full sentence.
using LMKit.Model;
using LMKit.Retrieval;
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
// No overlap: maximum storage efficiency, risk of boundary splits
var noOverlap = new TextChunking
{
MaxChunkSize = 400,
MaxOverlapSize = 0
};
// Moderate overlap (default): good boundary coverage
var moderate = new TextChunking
{
MaxChunkSize = 400,
MaxOverlapSize = 50
};
// Higher overlap: better boundary coverage, more storage and compute
var highOverlap = new TextChunking
{
MaxChunkSize = 400,
MaxOverlapSize = 100
};
MaxOverlapSize is clamped between 0 and MaxChunkSize / 4. Setting it to 100 on a 400-token chunk means up to 25% of each chunk may be shared with its neighbors.
When to increase overlap:
- Documents with long, complex sentences that are likely to land on chunk boundaries.
- Content where key information appears at the end of one paragraph and the beginning of the next.
When to decrease overlap:
- Storage and embedding compute are constrained.
- Documents are already well-structured with clear paragraph boundaries.
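The overlap clamp described above, and the resulting shared fraction, can be expressed as simple arithmetic (illustrative, not library code):

```csharp
using System;

// Illustrative arithmetic for the overlap clamp: overlap is capped at a
// quarter of the chunk size, so the fraction shared between neighboring
// chunks never exceeds 25%. (Not library code.)
static class OverlapRules
{
    public static int EffectiveOverlap(int requested, int maxChunkSize)
        => Math.Clamp(requested, 0, maxChunkSize / 4);

    public static double SharedFraction(int requested, int maxChunkSize)
        => (double)EffectiveOverlap(requested, maxChunkSize) / maxChunkSize;
}
```

Requesting more than a quarter of the chunk size is silently reduced, so there is no benefit to setting MaxOverlapSize above MaxChunkSize / 4.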
Step 4: Use Markdown Chunking for Structured Content
MarkdownChunking understands Markdown structure and splits along document boundaries instead of raw text boundaries.
using LMKit.Retrieval;
var markdownChunker = new MarkdownChunking
{
MaxChunkSize = 400
};
Splitting priority for MarkdownChunking:
- H1 headings (#)
- H2 headings (##)
- H3+ headings (###, ####, etc.)
- Code fences and thematic breaks (---, ***)
- Tables, lists, and blockquotes
- Paragraph breaks
- Newlines
- Hard token limit
Key differences from TextChunking:
- No overlap support. This is intentional. Because Markdown chunks align to structural boundaries (headings, code blocks), the content within each chunk is self-contained. Overlap would duplicate heading content across chunks without adding value.
- Fence-aware splitting. The chunker avoids splitting inside fenced code blocks. A code example stays in one chunk even if it approaches MaxChunkSize.
- Preserves formatting. Unlike TextChunking with KeepSpacings = false, Markdown chunking preserves the original whitespace and formatting so that Markdown rendering remains intact.
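Fence-awareness boils down to tracking whether a candidate split line sits inside an open fence. A minimal sketch (hypothetical helper, not the library's parser):

```csharp
using System;

// Toy fence tracker showing the idea behind fence-aware splitting: a chunker
// should refuse to cut at a line that falls inside an open code fence.
// (Illustrative only; not the library's parser.)
static class FenceTracker
{
    static readonly string Fence = new string('`', 3);

    // Returns true when the line at `index` lies inside an open fenced block.
    public static bool IsInsideFence(string[] lines, int index)
    {
        bool inFence = false;
        for (int i = 0; i < index; i++)
            if (lines[i].TrimStart().StartsWith(Fence, StringComparison.Ordinal))
                inFence = !inFence;
        return inFence;
    }
}
```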
Step 5: Use HTML Chunking for Web Content
HtmlChunking parses HTML with AngleSharp and splits along semantic DOM boundaries, making it ideal for web pages, knowledge base articles, and any HTML content ingested into a RAG pipeline.
using LMKit.Retrieval;
var htmlChunker = new HtmlChunking
{
MaxChunkSize = 400,
MaxOverlapSize = 40,
StripBoilerplate = true, // remove nav, footer, sidebar, ad containers
PreserveHeadingContext = true // prepend heading breadcrumb to each chunk
};
Key features:
- Boilerplate stripping. When StripBoilerplate is true (default), the chunker removes <nav>, <footer>, <aside>, and elements with common sidebar, cookie, or advertisement class/id patterns before extracting text.
- Heading breadcrumb. When PreserveHeadingContext is true (default), each chunk is prefixed with its heading hierarchy (e.g., "Products > Pricing > Enterprise"), giving the embedding a clear topic signal even when the chunk text alone is ambiguous.
- Table preservation. Tables are extracted as pipe-delimited text and kept as a single chunk when they fit within MaxChunkSize. Oversized tables are sub-split using the plain text partitioner.
- Preformatted block preservation. <pre> content is extracted verbatim without whitespace normalization.
- Overlap support. Like TextChunking, overlap tokens are added at chunk boundaries to preserve context continuity.
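The heading-breadcrumb idea can be sketched as a small stack-based helper. This is illustrative only; HtmlChunking performs the equivalent internally when PreserveHeadingContext is true.

```csharp
using System;
using System.Collections.Generic;

// Stack-based sketch of heading-breadcrumb construction: keep the current
// heading at each level and join them into a prefix such as
// "Products > Pricing > Enterprise". (Illustrative; not the library's code.)
static class Breadcrumb
{
    // headings: (level, text) pairs in document order up to the current chunk.
    public static string Build(IEnumerable<(int Level, string Text)> headings)
    {
        var stack = new List<(int Level, string Text)>();
        foreach (var h in headings)
        {
            // A new heading closes any open headings at the same or deeper level.
            while (stack.Count > 0 && stack[^1].Level >= h.Level)
                stack.RemoveAt(stack.Count - 1);
            stack.Add(h);
        }
        return string.Join(" > ", stack.ConvertAll(h => h.Text));
    }
}
```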
using LMKit.Model;
using LMKit.Retrieval;
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
var rag = new RagEngine(embeddingModel);
rag.DefaultIChunking = new HtmlChunking
{
MaxChunkSize = 400,
StripBoilerplate = true,
PreserveHeadingContext = true
};
string html = File.ReadAllText("knowledge-base-article.html");
rag.ImportText(html, "kb", "article-1");
Step 6: Choose the Right Strategy for Your Content
| Content Type | Strategy | Why |
|---|---|---|
| Plain text files | TextChunking | Paragraph and sentence splitting with overlap |
| Markdown documentation | MarkdownChunking | Heading-aware splitting preserves structure |
| Code in Markdown fences | MarkdownChunking | Fence-aware splitting keeps code intact |
| HTML pages and articles | HtmlChunking | DOM-aware splitting with boilerplate removal |
| Web scraped content | HtmlChunking | Strips navigation, ads; preserves heading context |
| Mixed document types | DocumentRag | Auto-selects the right chunking strategy |
| PDF documents | DocumentRag | Built-in parsing with auto strategy selection |
DocumentRag exposes its own MaxChunkSize property (default 500) and automatically selects the appropriate chunking strategy based on the processing mode.
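If you need similar auto-selection when wiring chunkers yourself, a simple extension-based picker works. The mapping below is an assumption for illustration, not DocumentRag's actual logic.

```csharp
using System;
using System.IO;

// Hedged sketch of strategy selection by file extension. The mapping is an
// illustrative assumption, not DocumentRag's internal rule set.
static class StrategyPicker
{
    public static string ForFile(string path) =>
        Path.GetExtension(path).ToLowerInvariant() switch
        {
            ".md" or ".markdown" => "MarkdownChunking",
            ".html" or ".htm"    => "HtmlChunking",
            _                    => "TextChunking",
        };
}
```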
Step 7: Apply Per-Import Chunking
You can use different chunking strategies for different document types within the same RagEngine by passing a chunker to each ImportText call.
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Data;
using LM embeddingModel = LM.LoadFromModelID(
"embeddinggemma-300m");
var dataSource = DataSource.CreateInMemoryDataSource(
"KnowledgeBase", embeddingModel);
var rag = new RagEngine(embeddingModel);
rag.AddDataSource(dataSource);
// FAQ: small chunks, no overlap for short Q&A pairs
var faqChunker = new TextChunking
{
MaxChunkSize = 200,
MaxOverlapSize = 0
};
// Technical docs: structure-aware Markdown splitting
var docsChunker = new MarkdownChunking
{
MaxChunkSize = 400
};
// Long-form reports: large chunks with overlap
var reportChunker = new TextChunking
{
MaxChunkSize = 800,
MaxOverlapSize = 100
};
// Import each document type with its own strategy
string faqContent = File.ReadAllText("docs/faq.txt");
rag.ImportText(
faqContent, chunker: faqChunker,
"KnowledgeBase", "faq");
string docsContent = File.ReadAllText("docs/api-reference.md");
rag.ImportText(
docsContent, chunker: docsChunker,
"KnowledgeBase", "api-reference");
string reportContent = File.ReadAllText("docs/annual-report.txt");
rag.ImportText(
reportContent, chunker: reportChunker,
"KnowledgeBase", "annual-report");
The per-import chunker parameter overrides rag.DefaultIChunking for that specific call. You can also use ImportTextAsync with the same parameter.
Step 8: Inspect Chunk Quality
After importing, iterate over the partitions in your data source to verify that chunks are well-formed.
using LMKit.Data;
using LMKit.Retrieval;
// After importing documents...
foreach (var section in dataSource.Sections)
{
Console.WriteLine($"\nSection: {section.Identifier}");
int count = section.Partitions.Count();
Console.WriteLine($" Partition count: {count}");
int index = 0;
foreach (TextPartition partition in section.Partitions)
{
int previewLen = Math.Min(120, partition.Text.Length);
string preview = partition.Text[..previewLen];
Console.WriteLine($"\n Chunk #{index}:");
Console.WriteLine($" SplitMode: {partition.SplitMode}");
Console.WriteLine($" Length: {partition.Text.Length} chars");
Console.WriteLine($" Preview: {preview}...");
index++;
}
}
The SplitMode property on each TextPartition tells you where the split occurred:
| SplitMode | Meaning |
|---|---|
| Paragraph | Split at a paragraph break (\n\n) |
| Line | Split at a line break (\n) |
| Punctuation | Split at sentence punctuation (., ;, ?, !) |
| EndOfText | Final chunk (end of input) |
| HardLimit | Forced split at the token limit |
If you see many HardLimit splits, your MaxChunkSize may be too small for the content, or the text contains very long paragraphs without natural boundaries.
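One way to quantify this is to compute the HardLimit fraction across partitions. The helper below uses stand-in strings for brevity; with LM-Kit.NET you would read partition.SplitMode instead.

```csharp
using System;
using System.Linq;

// Illustrative diagnostic: the fraction of chunks that were force-split at
// the token limit. SplitMode values are stand-in strings here; in a real
// pipeline you would inspect partition.SplitMode.
static class ChunkDiagnostics
{
    public static double HardLimitFraction(string[] splitModes) =>
        splitModes.Length == 0
            ? 0.0
            : (double)splitModes.Count(m => m == "HardLimit") / splitModes.Length;
}
```

A fraction well above a few percent is a signal to raise MaxChunkSize or preprocess the text to introduce natural boundaries.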
Complete Example
This example builds a full pipeline: configure chunking per document type, import documents, query, and display results.
using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.TextGeneration;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load models
// ──────────────────────────────────────
Console.WriteLine("Loading embedding model...");
using LM embeddingModel = LM.LoadFromModelID(
"embeddinggemma-300m",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue)
{
double pct = (double)read / len.Value * 100;
Console.Write($"\r Downloading: {pct:F1}% ");
}
return true;
},
loadingProgress: p =>
{
Console.Write($"\r Loading: {p * 100:F0}% ");
return true;
});
Console.WriteLine(" Done.\n");
Console.WriteLine("Loading chat model...");
using LM chatModel = LM.LoadFromModelID(
"gemma3:4b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue)
{
double pct = (double)read / len.Value * 100;
Console.Write($"\r Downloading: {pct:F1}% ");
}
return true;
},
loadingProgress: p =>
{
Console.Write($"\r Loading: {p * 100:F0}% ");
return true;
});
Console.WriteLine(" Done.\n");
// ──────────────────────────────────────
// 2. Create RAG engine and data source
// ──────────────────────────────────────
var dataSource = DataSource.CreateInMemoryDataSource(
"KnowledgeBase", embeddingModel);
var rag = new RagEngine(embeddingModel);
rag.AddDataSource(dataSource);
// ──────────────────────────────────────
// 3. Import documents with tailored chunking
// ──────────────────────────────────────
var faqChunker = new TextChunking
{
MaxChunkSize = 250,
MaxOverlapSize = 0
};
var docsChunker = new MarkdownChunking
{
MaxChunkSize = 400
};
// Import FAQ (small, precise chunks)
string faqText = """
Q: How do I reset my password?
A: Go to Settings > Account > Reset Password
and follow the prompts.
Q: What file formats are supported?
A: We support PDF, DOCX, TXT, and Markdown.
Q: How do I enable GPU acceleration?
A: Install the appropriate GPU backend package
and set the backend in your configuration.
""";
rag.ImportText(
faqText, chunker: faqChunker,
"KnowledgeBase", "faq");
// Import technical docs (structure-aware chunks)
string docsText = """
# Configuration Guide
## GPU Backends
LM-Kit.NET supports multiple GPU backends
for accelerated inference.
### CUDA
For NVIDIA GPUs, install the CUDA backend.
Requires CUDA 12.0 or later.
### Vulkan
For cross-platform GPU support, use the
Vulkan backend.
## Memory Management
Monitor VRAM usage when loading multiple
models simultaneously.
""";
rag.ImportText(
docsText, chunker: docsChunker,
"KnowledgeBase", "config-guide");
// ──────────────────────────────────────
// 4. Inspect chunk quality
// ──────────────────────────────────────
Console.WriteLine("Chunk inspection:");
foreach (var section in dataSource.Sections)
{
Console.WriteLine($"\n Section: {section.Identifier}");
int i = 0;
foreach (TextPartition partition in section.Partitions)
{
string preview = partition.Text.Length > 80
? partition.Text[..80] + "..."
: partition.Text;
Console.WriteLine(
$" Chunk #{i}: [{partition.SplitMode}] " +
$"{preview}");
i++;
}
}
// ──────────────────────────────────────
// 5. Query loop
// ──────────────────────────────────────
var chat = new SingleTurnConversation(chatModel)
{
SystemPrompt =
"Answer the question using only the provided context. " +
"If the context does not contain the answer, say so.",
MaximumCompletionTokens = 256
};
Console.WriteLine("\nAsk a question (or 'quit' to exit):\n");
while (true)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.Write("Question: ");
Console.ResetColor();
string? query = Console.ReadLine();
if (string.IsNullOrWhiteSpace(query))
break;
if (query.Equals("quit", StringComparison.OrdinalIgnoreCase))
break;
var matches = rag.FindMatchingPartitions(
query, topK: 3, minScore: 0.3f);
if (matches.Count == 0)
{
Console.WriteLine("No relevant passages found.\n");
continue;
}
Console.ForegroundColor = ConsoleColor.DarkGray;
foreach (var m in matches)
{
Console.WriteLine(
$" [{m.SectionIdentifier}] score={m.Similarity:F3}");
}
Console.ResetColor();
Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write("\nAnswer: ");
Console.ResetColor();
var result = rag.QueryPartitions(
query, matches, chat);
Console.WriteLine(
$"\n [{result.GeneratedTokenCount} tokens, " +
$"{result.TokenGenerationRate:F1} tok/s]\n");
}
Example session:
Chunk inspection:
Section: faq
Chunk #0: [Paragraph] Q: How do I reset my ...
Chunk #1: [Paragraph] Q: What file formats a...
Chunk #2: [EndOfText] Q: How do I enable GPU...
Section: config-guide
Chunk #0: [Paragraph] # Configuration Guide...
Chunk #1: [Paragraph] ### Vulkan For cross-p...
Chunk #2: [EndOfText] ## Memory Management M...
Ask a question (or 'quit' to exit):
Question: How do I set up Vulkan?
[config-guide] score=0.872
[faq] score=0.431
Answer: To set up Vulkan, use the Vulkan
backend which provides cross-platform GPU
support for NVIDIA, AMD, and Intel GPUs.
[41 tokens, 35.2 tok/s]
Chunking Guidelines
| Content Type | Chunk Size (tokens) | Overlap (tokens) | Strategy |
|---|---|---|---|
| FAQ / Q&A pairs | 200 to 300 | 0 | TextChunking |
| Product manuals | 400 to 500 | 50 | TextChunking |
| Markdown docs | 300 to 500 | N/A | MarkdownChunking |
| HTML pages / articles | 400 to 500 | 40 to 50 | HtmlChunking |
| Web scraped content | 300 to 400 | 40 | HtmlChunking |
| Legal contracts | 600 to 800 | 75 to 100 | TextChunking |
| Research papers | 800 to 1000 | 100 | TextChunking |
| Code repos (MD) | 400 to 600 | N/A | MarkdownChunking |
| Social media posts | 100 to 200 | 0 | TextChunking |
Troubleshooting
Many HardLimit splits in partition inspection
MaxChunkSize is too small for the content. Increase it, or preprocess the text to add paragraph breaks.
Relevant information split across two chunks
No overlap or overlap too small. Increase MaxOverlapSize (up to MaxChunkSize / 4).
Code blocks split in half
Using TextChunking on Markdown content. Switch to MarkdownChunking, which is fence-aware.
Headings separated from their content
Using TextChunking on structured documents. Switch to MarkdownChunking for Markdown or HtmlChunking for HTML, which split along heading boundaries.
HTML chunks contain navigation and footer text
StripBoilerplate is false or the boilerplate uses non-standard markup. Set StripBoilerplate = true on HtmlChunking. For non-standard patterns, preprocess the HTML to remove unwanted elements before chunking.
Too many tiny chunks
Document has many short paragraphs. Increase MaxChunkSize so multiple paragraphs fit in one chunk. TextChunking merges tiny trailing chunks automatically.
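The trailing-merge behavior can be pictured like this (toy sketch; the quarter-size threshold is an illustrative assumption, not the library's exact rule):

```csharp
using System;
using System.Collections.Generic;

// Toy version of the trailing-merge rule: if the final chunk is much smaller
// than the rest, fold it into its predecessor. The quarter-size threshold is
// an illustrative assumption, not the library's exact value.
static class TrailingMerge
{
    public static List<string> Apply(List<string> chunks, int maxChunkSize)
    {
        if (chunks.Count >= 2 && chunks[^1].Length < maxChunkSize / 4)
        {
            chunks[^2] += " " + chunks[^1];
            chunks.RemoveAt(chunks.Count - 1);
        }
        return chunks;
    }
}
```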
Embeddings are imprecise (all similarity scores similar)
Chunks are too large, covering multiple topics. Decrease MaxChunkSize to isolate individual topics.
High storage and embedding compute cost
Overlap is too high. Reduce MaxOverlapSize. Overlap is clamped to MaxChunkSize / 4 at most.
MarkdownChunking not splitting at headings
Headings are malformed (missing space after #). Ensure Markdown follows standard syntax: # Heading, not #Heading.
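A quick pre-import lint can catch this case (illustrative helper, not part of LM-Kit.NET):

```csharp
using System;
using System.Text.RegularExpressions;

// Quick linter for the malformed-heading case: flags lines like "#Heading"
// where the required space after the hash marks is missing.
static class HeadingLint
{
    public static bool IsMalformedHeading(string line) =>
        Regex.IsMatch(line, @"^#{1,6}[^#\s]");
}
```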
Next Steps
- Build a RAG Pipeline Over Your Own Documents: full RAG pipeline with indexing, search, and answer generation.
- Boost Retrieval with Hybrid Search: combine chunking improvements with hybrid (vector + BM25) retrieval.
- Diversify and Filter RAG Results: ensure well-chunked passages are not wasted by redundant retrieval. MMR diversity pairs naturally with good chunking.
- Improve RAG Results with Reranking: add a cross-encoder reranker to boost retrieval precision.
- Build a Unified Multimodal RAG System: index audio, images, and text in one knowledge base.
- Import and Query Documents with Vision: use VLM processing mode with automatic MarkdownChunking.
- Samples: Conversational RAG: multi-turn RAG with four query generation modes.