Chunk HTML and Markdown Documents for RAG Pipelines

Default text chunking splits content by character count with overlap, ignoring document structure. This produces chunks that break mid-paragraph, split tables, and lose heading context. HtmlChunking and MarkdownChunking solve this by splitting at structural boundaries (headings, sections, tables, code blocks) and preserving semantic context across chunks.
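
To make the failure mode concrete, here is a minimal sketch of fixed-size character chunking with overlap. It is plain C# written for illustration, not LM-Kit code; the NaiveChunks helper and the sample string are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Naive splitter: fixed-size character windows with overlap (illustration only).
static List<string> NaiveChunks(string text, int size, int overlap)
{
    if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
    if (overlap < 0 || overlap >= size) throw new ArgumentOutOfRangeException(nameof(overlap));

    var chunks = new List<string>();
    int step = size - overlap; // how far the window advances each time
    for (int start = 0; start < text.Length; start += step)
    {
        int length = Math.Min(size, text.Length - start);
        chunks.Add(text.Substring(start, length));
        if (start + length >= text.Length) break; // the last window reached the end
    }
    return chunks;
}

string sample = "## Configuration\nOpen the settings panel. Configure your API key and select your preferred region.";

foreach (string chunk in NaiveChunks(sample, size: 40, overlap: 8))
{
    string visible = chunk.Replace("\n", "\\n"); // show line breaks explicitly
    Console.WriteLine("[" + visible + "]");
}
```

Because the window boundaries depend only on character positions, the heading ends up glued to the start of the first chunk and later chunks begin mid-sentence with no heading at all.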

For background on RAG, see the RAG glossary entry. For embeddings, see Embeddings.


Why This Matters

Structure-aware chunking solves two production problems:

  1. Noisy retrieval from web content. Web pages contain navigation bars, footers, cookie banners, and ads. Default chunking mixes this boilerplate with actual content, polluting search results. HtmlChunking strips boilerplate automatically and splits at semantic boundaries, producing clean, relevant chunks.
  2. Lost context in technical documentation. Markdown docs use heading hierarchies to organize information. When a chunk contains a paragraph from a subsection but loses the parent headings, the embedding captures the content without its context. MarkdownChunking and HtmlChunking.PreserveHeadingContext prepend heading breadcrumbs to each chunk, so the embedding captures both content and its place in the document structure.

Prerequisites

Requirement   Minimum
.NET SDK      8.0+
RAM           8+ GB
VRAM          2+ GB (for the embedding model)

Step 1: Create the Project

dotnet new console -n ChunkingDemo
cd ChunkingDemo
dotnet add package LM-Kit.NET

Step 2: Understand the Chunking Strategies

All chunking strategies implement the IChunking interface and plug into RagEngine.ImportText() or RagEngine.DefaultIChunking.

Strategy          Best for                                  Key features
TextChunking      Plain text, logs, transcripts             Recursive splitting, configurable overlap
MarkdownChunking  Documentation, README files, wiki pages   Splits at heading and paragraph boundaries
HtmlChunking      Web pages, HTML emails, exported docs     DOM-aware splitting, boilerplate removal, heading breadcrumbs, table preservation

All strategies share the MaxChunkSize property (in tokens). The effective size is clamped between 50 tokens and the embedding model's maximum input size.
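
The clamping rule can be pictured with a short sketch. EffectiveChunkSize is a hypothetical helper that mirrors the rule as described here, not an LM-Kit API, and the 2048-token model limit is an assumed value for illustration.

```csharp
using System;

// Hypothetical helper mirroring the documented rule; not part of LM-Kit.
static int EffectiveChunkSize(int requested, int modelMaxInputTokens)
{
    const int MinChunkTokens = 50; // lower bound stated in the text
    // Math.Clamp throws if the model limit is below the minimum.
    return Math.Clamp(requested, MinChunkTokens, modelMaxInputTokens);
}

int assumedModelLimit = 2048; // assumption; check your embedding model's real limit

Console.WriteLine(EffectiveChunkSize(10, assumedModelLimit));   // 50
Console.WriteLine(EffectiveChunkSize(400, assumedModelLimit));  // 400
Console.WriteLine(EffectiveChunkSize(5000, assumedModelLimit)); // 2048
```

In practice this means a MaxChunkSize that is too small or too large does not fail; it is silently adjusted, so it is worth knowing your model's input limit when tuning.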


Step 3: Index HTML Content with Boilerplate Removal

using System.Text;
using LMKit.Model;
using LMKit.Retrieval;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// Load embedding model
Console.WriteLine("Loading embedding model...");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m",
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

var ragEngine = new RagEngine(embeddingModel);

// Configure HTML chunking
var htmlChunker = new HtmlChunking
{
    MaxChunkSize = 400,        // Target chunk size in tokens
    MaxOverlapSize = 40,       // Context overlap between consecutive chunks
    StripBoilerplate = true,   // Remove nav, footer, sidebar, ads
    PreserveHeadingContext = true,  // Prepend heading breadcrumbs to each chunk
    KeepSpacings = false       // Normalize whitespace
};

string htmlContent = @"
<html>
<head><title>Product Guide</title></head>
<body>
  <nav><a href='/'>Home</a> | <a href='/docs'>Docs</a></nav>
  <main>
    <h1>Getting Started</h1>
    <h2>Installation</h2>
    <p>Download the installer from our website. Run the setup wizard
    and follow the on-screen instructions. The installation typically
    takes less than five minutes.</p>

    <h2>Configuration</h2>
    <p>After installation, open the settings panel. Configure your
    API key and select your preferred region.</p>

    <h3>Advanced Settings</h3>
    <p>For production deployments, enable SSL and configure the
    connection pool size based on your expected load.</p>

    <table>
      <tr><th>Setting</th><th>Default</th><th>Recommended</th></tr>
      <tr><td>Pool Size</td><td>10</td><td>50</td></tr>
      <tr><td>Timeout</td><td>30s</td><td>60s</td></tr>
    </table>
  </main>
  <footer>Copyright 2025 Acme Corp</footer>
</body>
</html>";

// Import with HTML-aware chunking
ragEngine.ImportText(
    htmlContent,
    chunker: htmlChunker,
    dataSourceIdentifier: "product-guide",
    sectionIdentifier: "getting-started");

Console.WriteLine("HTML content indexed successfully.");

What HtmlChunking does with this content:

  • Removes the <nav> and <footer> elements (boilerplate).
  • Splits at <h1>, <h2>, <h3>, and <p> boundaries.
  • Preserves the table as a single chunk (pipe-delimited format).
  • Prepends heading context: a chunk under "Advanced Settings" gets the breadcrumb "Getting Started > Configuration > Advanced Settings".
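
As a rough illustration of the breadcrumb idea, the sketch below derives "A > B > C" paths from a flat list of headings. It is a standalone approximation, not HtmlChunking's actual implementation; BuildBreadcrumbs and the tuple shape are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Builds "H1 > H2 > H3" breadcrumbs from a flat sequence of (level, title) headings.
static List<string> BuildBreadcrumbs(IEnumerable<(int Level, string Title)> items)
{
    var path = new List<(int Level, string Title)>();
    var result = new List<string>();

    foreach (var heading in items)
    {
        // Pop headings at the same or a deeper level: their sections are finished.
        while (path.Count > 0 && path[path.Count - 1].Level >= heading.Level)
        {
            path.RemoveAt(path.Count - 1);
        }
        path.Add(heading);
        result.Add(string.Join(" > ", path.ConvertAll(h => h.Title)));
    }
    return result;
}

var sampleHeadings = new (int Level, string Title)[]
{
    (1, "Getting Started"),
    (2, "Installation"),
    (2, "Configuration"),
    (3, "Advanced Settings"),
};

foreach (string crumb in BuildBreadcrumbs(sampleHeadings))
{
    Console.WriteLine(crumb);
}
// Last line printed: Getting Started > Configuration > Advanced Settings
```

Prepending such a path to the chunk text before embedding is what lets a short paragraph about SSL and pool sizes still match a query that mentions "configuration".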

Step 4: Index Markdown Documentation

var markdownChunker = new MarkdownChunking
{
    MaxChunkSize = 350
};

string markdownContent = @"
# API Reference

## Authentication

All API calls require a Bearer token in the Authorization header.
Tokens expire after 24 hours and must be refreshed.

### Obtaining a Token

Send a POST request to `/auth/token` with your client credentials.
The response includes an `access_token` and `expires_in` field.

## Endpoints

### GET /users

Returns a paginated list of users. Supports `limit` and `offset`
query parameters. Default page size is 25.

### POST /users

Creates a new user. Requires `name` and `email` in the request body.
Returns the created user with a generated `id`.
";

ragEngine.ImportText(
    markdownContent,
    chunker: markdownChunker,
    dataSourceIdentifier: "api-docs",
    sectionIdentifier: "api-reference");

Console.WriteLine("Markdown content indexed successfully.");
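
For intuition about heading-boundary splitting, here is a simplified standalone sketch. It is not how MarkdownChunking is implemented: it ignores chunk size limits and would misread "#" lines inside fenced code blocks. SplitAtHeadings and IsHeading are hypothetical helpers.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// True for ATX headings: one to six '#' characters followed by a space.
static bool IsHeading(string line)
{
    int hashes = 0;
    while (hashes < line.Length && line[hashes] == '#') hashes++;
    return hashes >= 1 && hashes <= 6 && hashes < line.Length && line[hashes] == ' ';
}

// Starts a new section at every heading line and keeps the heading with its body.
static List<string> SplitAtHeadings(string markdown)
{
    var sections = new List<string>();
    var current = new StringBuilder();

    foreach (string rawLine in markdown.Split('\n'))
    {
        string line = rawLine.TrimEnd('\r');
        if (IsHeading(line) && current.ToString().Trim().Length > 0)
        {
            sections.Add(current.ToString().Trim());
            current.Clear();
        }
        current.AppendLine(line);
    }

    if (current.ToString().Trim().Length > 0)
    {
        sections.Add(current.ToString().Trim());
    }
    return sections;
}

string doc = "# API Reference\n\n## Authentication\n\nAll API calls require a Bearer token.\n\n### Obtaining a Token\n\nSend a POST request to `/auth/token`.\n";

foreach (string section in SplitAtHeadings(doc))
{
    Console.WriteLine("--- section ---");
    Console.WriteLine(section);
}
```

Each section keeps its heading together with the paragraphs beneath it, which is the property that character-count splitting cannot guarantee.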

Step 5: Set a Default Chunking Strategy

Instead of passing a chunker to every ImportText call, set it as the default for the RagEngine instance.

// All subsequent ImportText calls will use HTML chunking
ragEngine.DefaultIChunking = new HtmlChunking
{
    MaxChunkSize = 400,
    StripBoilerplate = true,
    PreserveHeadingContext = true
};

// page1Html, page2Html, and plainTextLog are placeholders for your own string content.
// These calls now use the default HTML chunker
ragEngine.ImportText(page1Html, "website", "page1");
ragEngine.ImportText(page2Html, "website", "page2");

// Override for a specific import
var textChunker = new TextChunking
{
    MaxChunkSize = 500,
    MaxOverlapSize = 50
};

ragEngine.ImportText(plainTextLog, textChunker, "logs", "server-log");

Step 6: Query and Verify Chunk Quality

After indexing, query the RAG engine to verify that chunks produce relevant results.

// Load a chat model for RAG Q&A (the retrieval check below does not reference it)
Console.WriteLine("Loading chat model...");
using LM chatModel = LM.LoadFromModelID("gemma3:4b",
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// Search for relevant chunks
var matches = ragEngine.FindMatchingPartitions(
    "What is the recommended pool size?",
    topK: 3,
    minScore: 0.3f);

Console.WriteLine($"Found {matches.Count} matching chunks:\n");

foreach (var match in matches)
{
    Console.WriteLine($"Score: {match.RawSimilarity:F3}");
    Console.WriteLine($"Content: {match.Payload}");
    Console.WriteLine();
}

With HTML chunking, the table chunk should rank highest because it was preserved intact and contains the exact answer.


Chunk Size Guidelines

Chunk size       Trade-off
200-300 tokens   High precision retrieval. Best for FAQ-style lookups.
300-500 tokens   Balanced precision and context. Good default for most use cases.
500-800 tokens   More context per chunk. Better for complex questions that need surrounding information. Slightly lower retrieval precision.

The overlap size should generally be 10-15% of the chunk size. Maximum overlap is clamped to 25% of the chunk size internally.
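
The guideline translates into a small calculation. SuggestedOverlap is a hypothetical helper that applies the percentages above; it is not an LM-Kit API, and the library's internal rounding may differ.

```csharp
using System;

// Hypothetical helper: overlap as a fraction of the chunk size, capped at 25%.
static int SuggestedOverlap(int chunkSize, double fraction = 0.10)
{
    int requested = (int)Math.Round(chunkSize * fraction);
    int cap = chunkSize / 4; // 25% ceiling described in the text
    return Math.Min(requested, cap);
}

Console.WriteLine(SuggestedOverlap(400));       // 40
Console.WriteLine(SuggestedOverlap(500, 0.15)); // 75
Console.WriteLine(SuggestedOverlap(400, 0.50)); // 100 (capped at 25%)
```

The values used earlier in this guide (40 for a 400-token chunk, 50 for a 500-token chunk) sit at the 10% end of the recommended range.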

