Optimize RAG Results with Custom Chunking Strategies
Chunking is how your documents get split into passages before they are embedded and indexed. It is the single biggest lever for RAG quality after model selection. Chunks that are too large dilute the embedding signal, making retrieval imprecise. Chunks that are too small lose surrounding context, producing incomplete answers. LM-Kit.NET ships three chunking strategies out of the box (TextChunking, MarkdownChunking, and HtmlChunking) and lets you configure them per import, so you can tailor splitting to each document type in your pipeline.
TL;DR
using LMKit.Model;
using LMKit.Retrieval;
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
var rag = new RagEngine(embeddingModel);
// Use Markdown-aware chunking for structured content
rag.DefaultIChunking = new MarkdownChunking
{
MaxChunkSize = 400
};
// Or use text chunking with custom overlap for plain text
rag.DefaultIChunking = new TextChunking
{
MaxChunkSize = 300,
MaxOverlapSize = 75
};
// Or use HTML-aware chunking for web content
rag.DefaultIChunking = new HtmlChunking
{
MaxChunkSize = 400,
StripBoilerplate = true,
PreserveHeadingContext = true
};
Why This Matters
Two problems that custom chunking solves:
FAQ and support documents return vague answers. When chunk size is too large, a single chunk may contain answers to several unrelated questions. The embedding becomes an average of all those topics, reducing similarity to any one question. Smaller chunks isolate each answer, producing precise matches.
Technical documentation loses structure. Splitting Markdown or code files on a fixed token count can break a code block in half or separate a heading from its content. Structure-aware chunking keeps headings, code fences, and tables intact, preserving the semantic coherence of each chunk.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 8 GB recommended |
| VRAM | 1 GB (embedding model) |
| Disk | ~500 MB free for model download |
Step 1: Understand the Default Behavior
When you create a RagEngine without configuring chunking, it uses TextChunking with default settings:
using LMKit.Model;
using LMKit.Retrieval;
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
var rag = new RagEngine(embeddingModel);
// This is equivalent to the default:
rag.DefaultIChunking = new TextChunking
{
MaxChunkSize = 500, // 500 tokens per chunk
MaxOverlapSize = 50, // 50 tokens overlap between adjacent chunks
KeepSpacings = false // normalize whitespace
};
TextChunking splits text using a priority cascade:
- Paragraph breaks (\n\n)
- Line breaks (\n)
- Sentence punctuation (., ;, ?, !)
- Whitespace
- Hard token limit (forced split)
This priority system means the chunker always prefers to split at natural boundaries. A hard split only happens when a single paragraph or sentence exceeds MaxChunkSize.
Overlap works by prepending or appending tokens from adjacent chunks at sentence or paragraph boundaries, so context is not lost at the edges. When the final chunk in a document is very small, it is merged with the previous chunk to avoid tiny trailing fragments.
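The cascade can be pictured as a toy character-based splitter. This sketch is illustrative only: the real TextChunking operates on token counts and handles overlap and merging, but the descending boundary search is the same idea.

```csharp
using System;

// Toy character-based version of the boundary-priority cascade. The real
// TextChunking operates on token counts; this sketch only illustrates the
// descending search order.
static class CascadeSplitter
{
    // Returns the cut index so the first piece is at most `max` characters,
    // preferring the highest-priority boundary available in the window.
    public static int FindSplit(string text, int max)
    {
        if (text.Length <= max) return text.Length;
        string window = text.Substring(0, max);

        int i = window.LastIndexOf("\n\n", StringComparison.Ordinal); // paragraph break
        if (i > 0) return i + 2;

        i = window.LastIndexOf('\n');                                 // line break
        if (i > 0) return i + 1;

        i = window.LastIndexOfAny(new[] { '.', ';', '?', '!' });      // sentence punctuation
        if (i > 0) return i + 1;

        i = window.LastIndexOf(' ');                                  // whitespace
        if (i > 0) return i + 1;

        return max;                                                   // hard limit (forced)
    }
}
```

The splitter only falls through to the hard limit when no natural boundary exists inside the window, which mirrors when you would see HardLimit splits in practice.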
Step 2: Tune Chunk Size
Chunk size controls the trade-off between retrieval precision and context completeness.
using LMKit.Model;
using LMKit.Retrieval;
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
// Small chunks: precise retrieval, less context per chunk
var faqChunker = new TextChunking
{
MaxChunkSize = 200
};
// Medium chunks: balanced (the default)
var generalChunker = new TextChunking
{
MaxChunkSize = 500
};
// Large chunks: more context, less precise matching
var longFormChunker = new TextChunking
{
MaxChunkSize = 800
};
Trade-offs:
- Smaller chunks (200 to 300 tokens) produce embeddings that represent a single idea, making similarity search more precise. However, each chunk may lack enough context for the LLM to generate a complete answer. Best for FAQ databases, glossaries, and short-answer knowledge bases.
- Medium chunks (400 to 600 tokens) balance precision with context. This range works well for general documentation, product manuals, and policy documents.
- Larger chunks (800+ tokens) capture more surrounding context, which helps the LLM produce coherent answers from long-form content such as research papers, legal contracts, or book chapters. The downside is that embedding similarity becomes less discriminating when each chunk covers multiple subtopics.
MaxChunkSize is specified in tokens and is clamped between 50 and the model's embedding size.
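The clamping rule works out as below. This helper is illustrative, not an LM-Kit.NET API; `embeddingModelLimit` stands in for the model-derived upper bound.

```csharp
using System;

// Sketch of the documented clamping rule: requested sizes below 50 tokens
// are raised to 50, and sizes above the embedding model's limit are lowered
// to it. (The library applies this internally; this helper is illustrative.)
static class ChunkSizeRules
{
    public static int EffectiveChunkSize(int requested, int embeddingModelLimit)
        => Math.Clamp(requested, 50, embeddingModelLimit);
}
```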
Step 3: Tune Overlap
Overlap prevents information loss at chunk boundaries. When a sentence spans two chunks, overlap ensures that both chunks contain the full sentence.
using LMKit.Model;
using LMKit.Retrieval;
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
// No overlap: maximum storage efficiency, risk of boundary splits
var noOverlap = new TextChunking
{
MaxChunkSize = 400,
MaxOverlapSize = 0
};
// Moderate overlap (default): good boundary coverage
var moderate = new TextChunking
{
MaxChunkSize = 400,
MaxOverlapSize = 50
};
// Higher overlap: better boundary coverage, more storage and compute
var highOverlap = new TextChunking
{
MaxChunkSize = 400,
MaxOverlapSize = 100
};
MaxOverlapSize is clamped between 0 and MaxChunkSize / 4. Setting it to 100 on a 400-token chunk means up to 25% of each chunk may be shared with its neighbors.
When to increase overlap:
- Documents with long, complex sentences that are likely to land on chunk boundaries.
- Content where key information appears at the end of one paragraph and the beginning of the next.
When to decrease overlap:
- Storage and embedding compute are constrained.
- Documents are already well-structured with clear paragraph boundaries.
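The overlap clamp described above, and the resulting shared fraction, can be expressed as simple arithmetic (illustrative, not library code):

```csharp
using System;

// Illustrative arithmetic for the overlap clamp: overlap is capped at a
// quarter of the chunk size, so the fraction shared between neighboring
// chunks never exceeds 25%. (Not library code.)
static class OverlapRules
{
    public static int EffectiveOverlap(int requested, int maxChunkSize)
        => Math.Clamp(requested, 0, maxChunkSize / 4);

    public static double SharedFraction(int requested, int maxChunkSize)
        => (double)EffectiveOverlap(requested, maxChunkSize) / maxChunkSize;
}
```

Requesting more than a quarter of the chunk size is silently reduced, so there is no benefit to setting MaxOverlapSize above MaxChunkSize / 4.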
Step 4: Use Markdown Chunking for Structured Content
MarkdownChunking understands Markdown structure and splits along document boundaries instead of raw text boundaries.
using LMKit.Retrieval;
var markdownChunker = new MarkdownChunking
{
MaxChunkSize = 400
};
Splitting priority for MarkdownChunking:
- H1 headings (#)
- H2 headings (##)
- H3+ headings (###, ####, etc.)
- Code fences and thematic breaks (---, ***)
- Tables, lists, and blockquotes
- Paragraph breaks
- Newlines
- Hard token limit
Key differences from TextChunking:
- No overlap support. This is intentional. Because Markdown chunks align to structural boundaries (headings, code blocks), the content within each chunk is self-contained. Overlap would duplicate heading content across chunks without adding value.
- Fence-aware splitting. The chunker avoids splitting inside fenced code blocks. A code example stays in one chunk even if it approaches MaxChunkSize.
- Preserves formatting. Unlike TextChunking with KeepSpacings = false, Markdown chunking preserves the original whitespace and formatting so that Markdown rendering remains intact.
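Fence-awareness boils down to tracking whether a candidate split line sits inside an open fence. A minimal sketch (hypothetical helper, not the library's parser):

```csharp
using System;

// Toy fence tracker showing the idea behind fence-aware splitting: a chunker
// should refuse to cut at a line that falls inside an open code fence.
// (Illustrative only; not the library's parser.)
static class FenceTracker
{
    static readonly string Fence = new string('`', 3);

    // Returns true when the line at `index` lies inside an open fenced block.
    public static bool IsInsideFence(string[] lines, int index)
    {
        bool inFence = false;
        for (int i = 0; i < index; i++)
            if (lines[i].TrimStart().StartsWith(Fence, StringComparison.Ordinal))
                inFence = !inFence;
        return inFence;
    }
}
```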
Step 5: Use HTML Chunking for Web Content
HtmlChunking parses HTML with AngleSharp and splits along semantic DOM boundaries, making it ideal for web pages, knowledge base articles, and any HTML content ingested into a RAG pipeline.
using LMKit.Retrieval;
var htmlChunker = new HtmlChunking
{
MaxChunkSize = 400,
MaxOverlapSize = 40,
StripBoilerplate = true, // remove nav, footer, sidebar, ad containers
PreserveHeadingContext = true // prepend heading breadcrumb to each chunk
};
Key features:
- Boilerplate stripping. When StripBoilerplate is true (default), the chunker removes <nav>, <footer>, <aside>, and elements with common sidebar, cookie, or advertisement class/id patterns before extracting text.
- Heading breadcrumb. When PreserveHeadingContext is true (default), each chunk is prefixed with its heading hierarchy (e.g., "Products > Pricing > Enterprise"), giving the embedding a clear topic signal even when the chunk text alone is ambiguous.
- Table preservation. Tables are extracted as pipe-delimited text and kept as a single chunk when they fit within MaxChunkSize. Oversized tables are sub-split using the plain text partitioner.
- Preformatted block preservation. <pre> content is extracted verbatim without whitespace normalization.
- Overlap support. Like TextChunking, overlap tokens are added at chunk boundaries to preserve context continuity.
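The heading-breadcrumb idea can be sketched as a small stack-based helper. This is illustrative only; HtmlChunking performs the equivalent internally when PreserveHeadingContext is true.

```csharp
using System;
using System.Collections.Generic;

// Stack-based sketch of heading-breadcrumb construction: keep the current
// heading at each level and join them into a prefix such as
// "Products > Pricing > Enterprise". (Illustrative; not the library's code.)
static class Breadcrumb
{
    // headings: (level, text) pairs in document order up to the current chunk.
    public static string Build(IEnumerable<(int Level, string Text)> headings)
    {
        var stack = new List<(int Level, string Text)>();
        foreach (var h in headings)
        {
            // A new heading closes any open headings at the same or deeper level.
            while (stack.Count > 0 && stack[^1].Level >= h.Level)
                stack.RemoveAt(stack.Count - 1);
            stack.Add(h);
        }
        return string.Join(" > ", stack.ConvertAll(h => h.Text));
    }
}
```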
using LMKit.Model;
using LMKit.Retrieval;
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
var rag = new RagEngine(embeddingModel);
rag.DefaultIChunking = new HtmlChunking
{
MaxChunkSize = 400,
StripBoilerplate = true,
PreserveHeadingContext = true
};
string html = File.ReadAllText("knowledge-base-article.html");
rag.ImportText(html, "kb", "article-1");
Step 6: Choose the Right Strategy for Your Content
| Content Type | Strategy | Why |
|---|---|---|
| Plain text files | TextChunking | Paragraph and sentence splitting with overlap |
| Markdown documentation | MarkdownChunking | Heading-aware splitting preserves structure |
| Code in Markdown fences | MarkdownChunking | Fence-aware splitting keeps code intact |
| HTML pages and articles | HtmlChunking | DOM-aware splitting with boilerplate removal |
| Web scraped content | HtmlChunking | Strips navigation, ads; preserves heading context |
| Mixed document types | DocumentRag | Auto-selects the right chunking strategy |
| PDF documents | DocumentRag | Built-in parsing with auto strategy selection |
DocumentRag exposes its own MaxChunkSize property (default 500) and automatically selects the appropriate chunking strategy based on the processing mode.
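If you need similar auto-selection when wiring chunkers yourself, a simple extension-based picker works. The mapping below is an assumption for illustration, not DocumentRag's actual logic.

```csharp
using System;
using System.IO;

// Hedged sketch of strategy selection by file extension. The mapping is an
// illustrative assumption, not DocumentRag's internal rule set.
static class StrategyPicker
{
    public static string ForFile(string path) =>
        Path.GetExtension(path).ToLowerInvariant() switch
        {
            ".md" or ".markdown" => "MarkdownChunking",
            ".html" or ".htm"    => "HtmlChunking",
            _                    => "TextChunking",
        };
}
```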
Step 7: Apply Per-Import Chunking
You can use different chunking strategies for different document types within the same RagEngine by passing a chunker to each ImportText call.
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.Data;
using LM embeddingModel = LM.LoadFromModelID(
"embeddinggemma-300m");
var dataSource = DataSource.CreateInMemoryDataSource(
"KnowledgeBase", embeddingModel);
var rag = new RagEngine(embeddingModel);
rag.AddDataSource(dataSource);
// FAQ: small chunks, no overlap for short Q&A pairs
var faqChunker = new TextChunking
{
MaxChunkSize = 200,
MaxOverlapSize = 0
};
// Technical docs: structure-aware Markdown splitting
var docsChunker = new MarkdownChunking
{
MaxChunkSize = 400
};
// Long-form reports: large chunks with overlap
var reportChunker = new TextChunking
{
MaxChunkSize = 800,
MaxOverlapSize = 100
};
// Import each document type with its own strategy
string faqContent = File.ReadAllText("docs/faq.txt");
rag.ImportText(
faqContent, chunker: faqChunker,
"KnowledgeBase", "faq");
string docsContent = File.ReadAllText("docs/api-reference.md");
rag.ImportText(
docsContent, chunker: docsChunker,
"KnowledgeBase", "api-reference");
string reportContent = File.ReadAllText("docs/annual-report.txt");
rag.ImportText(
reportContent, chunker: reportChunker,
"KnowledgeBase", "annual-report");
The per-import chunker parameter overrides rag.DefaultIChunking for that specific call. You can also use ImportTextAsync with the same parameter.
Step 8: Inspect Chunk Quality
After importing, iterate over the partitions in your data source to verify that chunks are well-formed.
using LMKit.Data;
using LMKit.Retrieval;
// After importing documents...
foreach (var section in dataSource.Sections)
{
Console.WriteLine($"\nSection: {section.Identifier}");
int count = section.Partitions.Count();
Console.WriteLine($" Partition count: {count}");
int index = 0;
foreach (TextPartition partition in section.Partitions)
{
int previewLen = Math.Min(120, partition.Text.Length);
string preview = partition.Text[..previewLen];
Console.WriteLine($"\n Chunk #{index}:");
Console.WriteLine($" SplitMode: {partition.SplitMode}");
Console.WriteLine($" Length: {partition.Text.Length} chars");
Console.WriteLine($" Preview: {preview}...");
index++;
}
}
The SplitMode property on each TextPartition tells you where the split occurred:
| SplitMode | Meaning |
|---|---|
| Paragraph | Split at a paragraph break (\n\n) |
| Line | Split at a line break (\n) |
| Punctuation | Split at sentence punctuation (., ;, ?, !) |
| EndOfText | Final chunk (end of input) |
| HardLimit | Forced split at the token limit |
If you see many HardLimit splits, your MaxChunkSize may be too small for the content, or the text contains very long paragraphs without natural boundaries.
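One way to quantify this is to compute the HardLimit fraction across partitions. The helper below uses stand-in strings for brevity; with LM-Kit.NET you would read partition.SplitMode instead.

```csharp
using System;
using System.Linq;

// Illustrative diagnostic: the fraction of chunks that were force-split at
// the token limit. SplitMode values are stand-in strings here; in a real
// pipeline you would inspect partition.SplitMode.
static class ChunkDiagnostics
{
    public static double HardLimitFraction(string[] splitModes) =>
        splitModes.Length == 0
            ? 0.0
            : (double)splitModes.Count(m => m == "HardLimit") / splitModes.Length;
}
```

A fraction well above a few percent is a signal to raise MaxChunkSize or preprocess the text to introduce natural boundaries.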
Complete Example
This example builds a full pipeline: configure chunking per document type, import documents, query, and display results.
using System.Text;
using LMKit.Data;
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.TextGeneration;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load models
// ──────────────────────────────────────
Console.WriteLine("Loading embedding model...");
using LM embeddingModel = LM.LoadFromModelID(
"embeddinggemma-300m",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue)
{
double pct = (double)read / len.Value * 100;
Console.Write($"\r Downloading: {pct:F1}% ");
}
return true;
},
loadingProgress: p =>
{
Console.Write($"\r Loading: {p * 100:F0}% ");
return true;
});
Console.WriteLine(" Done.\n");
Console.WriteLine("Loading chat model...");
using LM chatModel = LM.LoadFromModelID(
"gemma3:4b",
downloadingProgress: (_, len, read) =>
{
if (len.HasValue)
{
double pct = (double)read / len.Value * 100;
Console.Write($"\r Downloading: {pct:F1}% ");
}
return true;
},
loadingProgress: p =>
{
Console.Write($"\r Loading: {p * 100:F0}% ");
return true;
});
Console.WriteLine(" Done.\n");
// ──────────────────────────────────────
// 2. Create RAG engine and data source
// ──────────────────────────────────────
var dataSource = DataSource.CreateInMemoryDataSource(
"KnowledgeBase", embeddingModel);
var rag = new RagEngine(embeddingModel);
rag.AddDataSource(dataSource);
// ──────────────────────────────────────
// 3. Import documents with tailored chunking
// ──────────────────────────────────────
var faqChunker = new TextChunking
{
MaxChunkSize = 250,
MaxOverlapSize = 0
};
var docsChunker = new MarkdownChunking
{
MaxChunkSize = 400
};
// Import FAQ (small, precise chunks)
string faqText = """
Q: How do I reset my password?
A: Go to Settings > Account > Reset Password
and follow the prompts.
Q: What file formats are supported?
A: We support PDF, DOCX, TXT, and Markdown.
Q: How do I enable GPU acceleration?
A: Install the appropriate GPU backend package
and set the backend in your configuration.
""";
rag.ImportText(
faqText, chunker: faqChunker,
"KnowledgeBase", "faq");
// Import technical docs (structure-aware chunks)
string docsText = """
# Configuration Guide
## GPU Backends
LM-Kit.NET supports multiple GPU backends
for accelerated inference.
### CUDA
For NVIDIA GPUs, install the CUDA backend.
Requires CUDA 12.0 or later.
### Vulkan
For cross-platform GPU support, use the
Vulkan backend.
## Memory Management
Monitor VRAM usage when loading multiple
models simultaneously.
""";
rag.ImportText(
docsText, chunker: docsChunker,
"KnowledgeBase", "config-guide");
// ──────────────────────────────────────
// 4. Inspect chunk quality
// ──────────────────────────────────────
Console.WriteLine("Chunk inspection:");
foreach (var section in dataSource.Sections)
{
Console.WriteLine($"\n Section: {section.Identifier}");
int i = 0;
foreach (TextPartition partition in section.Partitions)
{
string preview = partition.Text.Length > 80
? partition.Text[..80] + "..."
: partition.Text;
Console.WriteLine(
$" Chunk #{i}: [{partition.SplitMode}] " +
$"{preview}");
i++;
}
}
// ──────────────────────────────────────
// 5. Query loop
// ──────────────────────────────────────
var chat = new SingleTurnConversation(chatModel)
{
SystemPrompt =
"Answer the question using only the provided context. " +
"If the context does not contain the answer, say so.",
MaximumCompletionTokens = 256
};
Console.WriteLine("\nAsk a question (or 'quit' to exit):\n");
while (true)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.Write("Question: ");
Console.ResetColor();
string? query = Console.ReadLine();
if (string.IsNullOrWhiteSpace(query))
break;
if (query.Equals("quit", StringComparison.OrdinalIgnoreCase))
break;
var matches = rag.FindMatchingPartitions(
query, topK: 3, minScore: 0.3f);
if (matches.Count == 0)
{
Console.WriteLine("No relevant passages found.\n");
continue;
}
Console.ForegroundColor = ConsoleColor.DarkGray;
foreach (var m in matches)
{
Console.WriteLine(
$" [{m.SectionIdentifier}] score={m.Similarity:F3}");
}
Console.ResetColor();
Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write("\nAnswer: ");
Console.ResetColor();
var result = rag.QueryPartitions(
query, matches, chat);
Console.WriteLine(
$"\n [{result.GeneratedTokenCount} tokens, " +
$"{result.TokenGenerationRate:F1} tok/s]\n");
}
Example session:
Chunk inspection:
Section: faq
Chunk #0: [Paragraph] Q: How do I reset my ...
Chunk #1: [Paragraph] Q: What file formats a...
Chunk #2: [EndOfText] Q: How do I enable GPU...
Section: config-guide
Chunk #0: [Paragraph] # Configuration Guide...
Chunk #1: [Paragraph] ### Vulkan For cross-p...
Chunk #2: [EndOfText] ## Memory Management M...
Ask a question (or 'quit' to exit):
Question: How do I set up Vulkan?
[config-guide] score=0.872
[faq] score=0.431
Answer: To set up Vulkan, use the Vulkan
backend which provides cross-platform GPU
support for NVIDIA, AMD, and Intel GPUs.
[41 tokens, 35.2 tok/s]
Chunking Guidelines
| Content Type | Chunk Size (tokens) | Overlap (tokens) | Strategy |
|---|---|---|---|
| FAQ / Q&A pairs | 200 to 300 | 0 | TextChunking |
| Product manuals | 400 to 500 | 50 | TextChunking |
| Markdown docs | 300 to 500 | N/A | MarkdownChunking |
| HTML pages / articles | 400 to 500 | 40 to 50 | HtmlChunking |
| Web scraped content | 300 to 400 | 40 | HtmlChunking |
| Legal contracts | 600 to 800 | 75 to 100 | TextChunking |
| Research papers | 800 to 1000 | 100 | TextChunking |
| Code repos (MD) | 400 to 600 | N/A | MarkdownChunking |
| Social media posts | 100 to 200 | 0 | TextChunking |
Troubleshooting
Many HardLimit splits in partition inspection
MaxChunkSize is too small for the content. Increase it, or preprocess the text to add paragraph breaks.
Relevant information split across two chunks
No overlap or overlap too small. Increase MaxOverlapSize (up to MaxChunkSize / 4).
Code blocks split in half
Using TextChunking on Markdown content. Switch to MarkdownChunking, which is fence-aware.
Headings separated from their content
Using TextChunking on structured documents. Switch to MarkdownChunking for Markdown or HtmlChunking for HTML, which split along heading boundaries.
HTML chunks contain navigation and footer text
StripBoilerplate is false or the boilerplate uses non-standard markup. Set StripBoilerplate = true on HtmlChunking. For non-standard patterns, preprocess the HTML to remove unwanted elements before chunking.
Too many tiny chunks
Document has many short paragraphs. Increase MaxChunkSize so multiple paragraphs fit in one chunk. TextChunking merges tiny trailing chunks automatically.
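The trailing-merge behavior can be pictured like this (toy sketch; the quarter-size threshold is an illustrative assumption, not the library's exact rule):

```csharp
using System;
using System.Collections.Generic;

// Toy version of the trailing-merge rule: if the final chunk is much smaller
// than the rest, fold it into its predecessor. The quarter-size threshold is
// an illustrative assumption, not the library's exact value.
static class TrailingMerge
{
    public static List<string> Apply(List<string> chunks, int maxChunkSize)
    {
        if (chunks.Count >= 2 && chunks[^1].Length < maxChunkSize / 4)
        {
            chunks[^2] += " " + chunks[^1];
            chunks.RemoveAt(chunks.Count - 1);
        }
        return chunks;
    }
}
```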
Embeddings are imprecise (all similarity scores similar)
Chunks are too large, covering multiple topics. Decrease MaxChunkSize to isolate individual topics.
High storage and embedding compute cost
Overlap is too high. Reduce MaxOverlapSize. Overlap is clamped to MaxChunkSize / 4 at most.
MarkdownChunking not splitting at headings
Headings are malformed (missing space after #). Ensure Markdown follows standard syntax: # Heading, not #Heading.
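A quick pre-import lint can catch this case (illustrative helper, not part of LM-Kit.NET):

```csharp
using System;
using System.Text.RegularExpressions;

// Quick linter for the malformed-heading case: flags lines like "#Heading"
// where the required space after the hash marks is missing.
static class HeadingLint
{
    public static bool IsMalformedHeading(string line) =>
        Regex.IsMatch(line, @"^#{1,6}[^#\s]");
}
```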
Next Steps
- Build a RAG Pipeline Over Your Own Documents: full RAG pipeline with indexing, search, and answer generation.
- Boost Retrieval with Hybrid Search: combine chunking improvements with hybrid (vector + BM25) retrieval.
- Diversify and Filter RAG Results: ensure well-chunked passages are not wasted by redundant retrieval. MMR diversity pairs naturally with good chunking.
- Improve RAG Results with Reranking: add a cross-encoder reranker to boost retrieval precision.
- Build a Unified Multimodal RAG System: index audio, images, and text in one knowledge base.
- Import and Query Documents with Vision: use VLM processing mode with automatic MarkdownChunking.
- Samples: Conversational RAG: multi-turn RAG with four query generation modes.