Chunk HTML and Markdown Documents for RAG Pipelines
Default text chunking splits content by character count with overlap, ignoring document structure. This produces chunks that break mid-paragraph, split tables, and lose heading context. HtmlChunking and MarkdownChunking solve this by splitting at structural boundaries (headings, sections, tables, code blocks) and preserving semantic context across chunks.
For background on RAG, see the RAG glossary entry. For embeddings, see Embeddings.
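To make the failure mode concrete, the sketch below implements naive character-count chunking in plain C#. It is an illustration only and not the library's TextChunking implementation; `ChunkByCharacters` and the sample string are hypothetical names introduced here.

```csharp
using System;
using System.Collections.Generic;

// Naive fixed-size chunking: cut every `size` characters and repeat the
// last `overlap` characters of each chunk at the start of the next one.
// Document structure (headings, paragraphs, tables) is ignored entirely.
static List<string> ChunkByCharacters(string text, int size, int overlap)
{
    if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
    if (overlap < 0 || overlap >= size) throw new ArgumentOutOfRangeException(nameof(overlap));

    var chunks = new List<string>();
    int step = size - overlap;
    for (int start = 0; start < text.Length; start += step)
    {
        int length = Math.Min(size, text.Length - start);
        chunks.Add(text.Substring(start, length));
        if (start + length >= text.Length) break; // the last chunk reached the end
    }
    return chunks;
}

// Hypothetical sample: a heading followed by one sentence.
string sample = "## Configuration\nOpen the settings panel and configure your API key.";
List<string> naiveChunks = ChunkByCharacters(sample, size: 30, overlap: 5);

foreach (string chunk in naiveChunks)
{
    string visible = chunk.Replace("\n", "\\n"); // show line breaks explicitly
    Console.WriteLine("[" + visible + "]");
}
```

The first chunk ends in the middle of the sentence, and only that chunk carries the heading, which is the context loss the structure-aware strategies are meant to avoid.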
Why This Matters
Two production problems that structure-aware chunking solves:
- Noisy retrieval from web content. Web pages contain navigation bars, footers, cookie banners, and ads. Default chunking mixes this boilerplate with actual content, polluting search results. HtmlChunking strips boilerplate automatically and splits at semantic boundaries, producing clean, relevant chunks.
- Lost context in technical documentation. Markdown docs use heading hierarchies to organize information. When a chunk contains a paragraph from a subsection but loses the parent headings, the embedding captures the content without its context. MarkdownChunking and HtmlChunking.PreserveHeadingContext prepend heading breadcrumbs to each chunk, so the embedding captures both content and its place in the document structure.
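The breadcrumb idea can be sketched in a few lines of plain C#. This is a simplified illustration under stated assumptions, not the library's actual algorithm: `AddHeadingBreadcrumbs` is a hypothetical helper, it only recognizes ATX-style `#` headings, and it treats every non-heading line as its own paragraph.

```csharp
using System;
using System.Collections.Generic;

// Walk Markdown lines, track the active heading path, and prefix each
// paragraph line with its breadcrumb ("Parent > Child > Grandchild").
static List<string> AddHeadingBreadcrumbs(string markdown)
{
    var path = new List<string>();   // path[i] holds the active heading at level i + 1
    var result = new List<string>();

    foreach (string rawLine in markdown.Split('\n'))
    {
        string line = rawLine.Trim();
        if (line.Length == 0) continue;

        // Count leading '#' characters to find the heading level.
        int level = 0;
        while (level < line.Length && line[level] == '#') level++;

        bool isHeading = level > 0 && level <= 6 && level < line.Length && line[level] == ' ';
        if (isHeading)
        {
            // A new heading closes every heading at the same or a deeper level.
            if (path.Count >= level) path.RemoveRange(level - 1, path.Count - (level - 1));
            while (path.Count < level - 1) path.Add(""); // placeholder for skipped levels
            path.Add(line.Substring(level + 1).Trim());
        }
        else
        {
            string breadcrumb = string.Join(" > ", path.FindAll(h => h.Length > 0));
            result.Add(breadcrumb.Length > 0 ? breadcrumb + "\n" + line : line);
        }
    }
    return result;
}

// Hypothetical sample document.
string doc = "# Getting Started\n## Configuration\n### Advanced Settings\nEnable SSL for production deployments.";
List<string> contextualized = AddHeadingBreadcrumbs(doc);
Console.WriteLine(contextualized[0]);
```

Embedding the breadcrumb together with the paragraph gives the vector a signal about where the text sits in the document, even when the chunk itself never mentions the parent topics.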
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 8+ GB |
| VRAM | 2+ GB (for the embedding model) |
Step 1: Create the Project
dotnet new console -n ChunkingDemo
cd ChunkingDemo
dotnet add package LM-Kit.NET
Step 2: Understand the Chunking Strategies
All chunking strategies implement the IChunking interface and plug into RagEngine.ImportText() or RagEngine.DefaultIChunking.
| Strategy | Best for | Key features |
|---|---|---|
| TextChunking | Plain text, logs, transcripts | Recursive splitting, configurable overlap |
| MarkdownChunking | Documentation, README files, wiki pages | Splits at heading and paragraph boundaries |
| HtmlChunking | Web pages, HTML emails, exported docs | DOM-aware splitting, boilerplate removal, heading breadcrumbs, table preservation |
All strategies share the MaxChunkSize property (in tokens). The effective size is clamped between 50 tokens and the embedding model's maximum input size.
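The clamping rule amounts to a single `Math.Clamp` call. The sketch below is an illustration of the arithmetic only; `EffectiveChunkSize` is a hypothetical helper, and the model limit of 2048 tokens is an assumed value, so check the limit of the embedding model you actually load.

```csharp
using System;

const int MinChunkTokens = 50;          // lower bound stated above
const int AssumedModelMaxTokens = 2048; // ASSUMPTION: depends on the embedding model

// Returns the chunk size that would take effect for a requested MaxChunkSize.
int EffectiveChunkSize(int requested, int modelMaxTokens) =>
    Math.Clamp(requested, MinChunkTokens, modelMaxTokens);

Console.WriteLine(EffectiveChunkSize(10, AssumedModelMaxTokens));   // too small, raised to the minimum
Console.WriteLine(EffectiveChunkSize(400, AssumedModelMaxTokens));  // within range, unchanged
Console.WriteLine(EffectiveChunkSize(5000, AssumedModelMaxTokens)); // too large, lowered to the model limit
```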
Step 3: Index HTML Content with Boilerplate Removal
using System.Text;
using LMKit.Model;
using LMKit.Retrieval;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// Load embedding model
Console.WriteLine("Loading embedding model...");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m",
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
var ragEngine = new RagEngine(embeddingModel);
// Configure HTML chunking
var htmlChunker = new HtmlChunking
{
    MaxChunkSize = 400,            // Target chunk size in tokens
    MaxOverlapSize = 40,           // Context overlap between consecutive chunks
    StripBoilerplate = true,       // Remove nav, footer, sidebar, ads
    PreserveHeadingContext = true, // Prepend heading breadcrumbs to each chunk
    KeepSpacings = false           // Normalize whitespace
};
string htmlContent = @"
<html>
<head><title>Product Guide</title></head>
<body>
<nav><a href='/'>Home</a> | <a href='/docs'>Docs</a></nav>
<main>
<h1>Getting Started</h1>
<h2>Installation</h2>
<p>Download the installer from our website. Run the setup wizard
and follow the on-screen instructions. The installation typically
takes less than five minutes.</p>
<h2>Configuration</h2>
<p>After installation, open the settings panel. Configure your
API key and select your preferred region.</p>
<h3>Advanced Settings</h3>
<p>For production deployments, enable SSL and configure the
connection pool size based on your expected load.</p>
<table>
<tr><th>Setting</th><th>Default</th><th>Recommended</th></tr>
<tr><td>Pool Size</td><td>10</td><td>50</td></tr>
<tr><td>Timeout</td><td>30s</td><td>60s</td></tr>
</table>
</main>
<footer>Copyright 2025 Acme Corp</footer>
</body>
</html>";
// Import with HTML-aware chunking
ragEngine.ImportText(
    htmlContent,
    chunker: htmlChunker,
    dataSourceIdentifier: "product-guide",
    sectionIdentifier: "getting-started");
Console.WriteLine("HTML content indexed successfully.");
What HtmlChunking does with this content:
- Removes the <nav> and <footer> elements (boilerplate).
- Splits at <h1>, <h2>, <h3>, and <p> boundaries.
- Preserves the table as a single chunk (pipe-delimited format).
- Prepends heading context: a chunk under "Advanced Settings" gets the breadcrumb "Getting Started > Configuration > Advanced Settings".
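As a rough picture of what a pipe-delimited table chunk might look like, the sketch below flattens the sample table's rows in plain C#. The exact separator and layout that HtmlChunking emits are not specified here, so treat the output shape as an assumption; `ToPipeDelimited` is a hypothetical helper, not a library method.

```csharp
using System;
using System.Collections.Generic;

// Join the cells of each row with " | " and the rows with line breaks,
// so the whole table stays together as one block of text.
static string ToPipeDelimited(IEnumerable<string[]> rows)
{
    var lines = new List<string>();
    foreach (string[] cells in rows)
    {
        lines.Add(string.Join(" | ", cells));
    }
    return string.Join("\n", lines);
}

// Rows copied from the sample HTML table above.
var tableRows = new List<string[]>
{
    new[] { "Setting", "Default", "Recommended" },
    new[] { "Pool Size", "10", "50" },
    new[] { "Timeout", "30s", "60s" },
};

string tableChunk = ToPipeDelimited(tableRows);
Console.WriteLine(tableChunk);
```

Keeping header and data rows in the same chunk is what lets a query such as "recommended pool size" match both the column name and the value.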
Step 4: Index Markdown Documentation
var markdownChunker = new MarkdownChunking
{
    MaxChunkSize = 350
};
string markdownContent = @"
# API Reference
## Authentication
All API calls require a Bearer token in the Authorization header.
Tokens expire after 24 hours and must be refreshed.
### Obtaining a Token
Send a POST request to `/auth/token` with your client credentials.
The response includes an `access_token` and `expires_in` field.
## Endpoints
### GET /users
Returns a paginated list of users. Supports `limit` and `offset`
query parameters. Default page size is 25.
### POST /users
Creates a new user. Requires `name` and `email` in the request body.
Returns the created user with a generated `id`.
";
ragEngine.ImportText(
    markdownContent,
    chunker: markdownChunker,
    dataSourceIdentifier: "api-docs",
    sectionIdentifier: "api-reference");
Console.WriteLine("Markdown content indexed successfully.");
Step 5: Set a Default Chunking Strategy
Instead of passing a chunker to every ImportText call, set it as the default for the RagEngine instance.
// All subsequent ImportText calls will use HTML chunking
ragEngine.DefaultIChunking = new HtmlChunking
{
    MaxChunkSize = 400,
    StripBoilerplate = true,
    PreserveHeadingContext = true
};
// These calls now use the default HTML chunker.
// page1Html and page2Html are placeholders for HTML strings you have loaded elsewhere.
ragEngine.ImportText(page1Html, "website", "page1");
ragEngine.ImportText(page2Html, "website", "page2");
// Override for a specific import.
// plainTextLog is a placeholder for a plain-text string you have loaded elsewhere.
var textChunker = new TextChunking
{
    MaxChunkSize = 500,
    MaxOverlapSize = 50
};
ragEngine.ImportText(plainTextLog, textChunker, "logs", "server-log");
Step 6: Query and Verify Chunk Quality
After indexing, query the RAG engine to verify that chunks produce relevant results.
// Load a chat model for RAG Q&A
Console.WriteLine("Loading chat model...");
using LM chatModel = LM.LoadFromModelID("gemma3:4b",
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// Search for relevant chunks
var matches = ragEngine.FindMatchingPartitions(
    "What is the recommended pool size?",
    topK: 3,
    minScore: 0.3f);
Console.WriteLine($"Found {matches.Count} matching chunks:\n");
foreach (var match in matches)
{
    Console.WriteLine($"Score: {match.RawSimilarity:F3}");
    Console.WriteLine($"Content: {match.Payload}");
    Console.WriteLine();
}
With HTML chunking, the table chunk should rank highest because it was preserved intact and contains the exact answer.
Chunk Size Guidelines
| Chunk size | Trade-off |
|---|---|
| 200-300 tokens | High precision retrieval. Best for FAQ-style lookups. |
| 300-500 tokens | Balanced precision and context. Good default for most use cases. |
| 500-800 tokens | More context per chunk. Better for complex questions that need surrounding information. Slightly lower retrieval precision. |
The overlap size should generally be 10-15% of the chunk size. Maximum overlap is clamped to 25% of the chunk size internally.
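The overlap guideline can be worked through as simple arithmetic. The helper below is a hypothetical sketch, not a library API: it picks an overlap from a ratio (10% by default) and applies the 25% ceiling described above, on the assumption that the cap is computed with integer division.

```csharp
using System;

// Suggest an overlap in tokens for a given chunk size.
// `ratio` is the desired fraction of the chunk size (0.10 to 0.15 is the guideline).
int SuggestOverlap(int chunkSize, double ratio = 0.10)
{
    int requested = (int)Math.Round(chunkSize * ratio);
    int ceiling = chunkSize / 4; // 25% cap; ASSUMPTION: integer division
    return Math.Min(requested, ceiling);
}

Console.WriteLine(SuggestOverlap(400));       // 10% of a 400-token chunk
Console.WriteLine(SuggestOverlap(500, 0.10)); // 10% of a 500-token chunk
Console.WriteLine(SuggestOverlap(400, 0.50)); // an oversized request is capped at 25%
```

The first two calls reproduce the pairs used earlier in this guide (400/40 for HtmlChunking and 500/50 for TextChunking).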
What to Read Next
- Build a RAG Pipeline Over Your Own Documents: end-to-end RAG tutorial
- Improve RAG Results with Reranking: cross-encoder reranking for better precision
- Optimize RAG with Custom Chunking Strategies: advanced chunking tuning
- RAG: concept overview
- Embeddings: how vector search works