Class HtmlChunking
Provides HTML-aware chunking for retrieval workflows. The splitter uses AngleSharp to parse the HTML DOM and respects structural boundaries such as headings, sections, tables, and block-level elements to produce semantically coherent chunks.
public sealed class HtmlChunking : IChunking
- Inheritance
HtmlChunking
- Implements
IChunking
Examples
Example: Import an HTML page into a RAG pipeline
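using System.IO;
using LMKit.Model;
using LMKit.Retrieval;

// Load an embedding model and create the RAG engine.
LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
RagEngine ragEngine = new RagEngine(embeddingModel);

// Configure HTML-aware chunking (sizes are expressed in tokens).
var htmlChunker = new HtmlChunking
{
    MaxChunkSize = 400,
    MaxOverlapSize = 40,
    StripBoilerplate = true,
    PreserveHeadingContext = true
};

// Make this chunker the engine's default.
ragEngine.DefaultIChunking = htmlChunker;

// Read the raw HTML and import it.
string html = File.ReadAllText("page.html");
ragEngine.ImportText(html, "web-docs", "page1");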
Remarks
Key Features
- Configurable chunk size via MaxChunkSize
- Overlap between chunks to preserve context via MaxOverlapSize
- Optional boilerplate stripping (nav, footer, sidebar) via StripBoilerplate
- Heading breadcrumb context prepended to chunks via PreserveHeadingContext
- Tables and preformatted blocks kept intact when they fit in a single chunk
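To make the relationship between MaxChunkSize and MaxOverlapSize concrete, the following is a simplified sketch and not the library's algorithm. It assumes plain fixed-size windows over a token count, whereas HtmlChunking splits on DOM structure, so real chunk boundaries will differ. The TokenWindows helper is hypothetical and exists only for this illustration.

```csharp
using System;
using System.Collections.Generic;

// Simplified illustration: fixed-size token windows with overlap.
// Values match the example above (400-token chunks, 40-token overlap).
foreach (var (start, end) in TokenWindows(totalTokens: 1000, maxChunkSize: 400, maxOverlapSize: 40))
{
    Console.WriteLine($"tokens [{start}, {end})");
}

// Hypothetical helper, not part of LM-Kit.
static IEnumerable<(int Start, int End)> TokenWindows(int totalTokens, int maxChunkSize, int maxOverlapSize)
{
    if (maxChunkSize <= 0)
        throw new ArgumentOutOfRangeException(nameof(maxChunkSize));
    if (maxOverlapSize < 0 || maxOverlapSize >= maxChunkSize)
        throw new ArgumentOutOfRangeException(nameof(maxOverlapSize));

    // Each window starts (size - overlap) tokens after the previous one,
    // so consecutive windows share exactly maxOverlapSize tokens.
    int stride = maxChunkSize - maxOverlapSize;
    for (int start = 0; start < totalTokens; start += stride)
    {
        int end = Math.Min(start + maxChunkSize, totalTokens);
        yield return (start, end);
        if (end == totalTokens) yield break;
    }
}
```

In this sketch a new window begins every 360 tokens, so a larger overlap means more chunks and more repeated text for the same document.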
Usage with RagEngine
Create a configured instance, then either assign it to RagEngine.DefaultIChunking so the engine uses it by default, or pass it to an individual ImportText(string, IChunking, string, string, MetadataCollection, CancellationToken) call to use it for that import only.
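The two options might look like the sketch below. It uses only the members named on this page. The file name reference.html, the identifiers passed as strings, and the chunk sizes are placeholders. Passing null for the MetadataCollection argument is an assumption, as is the availability of both ImportText overloads on the same engine instance.

```csharp
using System.IO;
using System.Threading;
using LMKit.Model;
using LMKit.Retrieval;

LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
RagEngine ragEngine = new RagEngine(embeddingModel);

// Option 1: engine-wide default chunker.
ragEngine.DefaultIChunking = new HtmlChunking
{
    MaxChunkSize = 400,
    MaxOverlapSize = 40
};
ragEngine.ImportText(File.ReadAllText("page.html"), "web-docs", "page1");

// Option 2: a differently configured chunker for a single import.
var referenceChunker = new HtmlChunking
{
    MaxChunkSize = 800,
    MaxOverlapSize = 80,
    PreserveHeadingContext = true
};
ragEngine.ImportText(
    File.ReadAllText("reference.html"), // placeholder file name
    referenceChunker,
    "web-docs",
    "reference",
    null,                               // MetadataCollection: assumed to accept null
    CancellationToken.None);
```

A per-call chunker can be useful when one collection mixes pages with very different structure, such as short articles and long reference tables.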
Properties
- KeepSpacings
Gets or sets whether to preserve original whitespace in the extracted text.
- MaxChunkSize
Gets or sets the maximum target size of a produced chunk, expressed in tokens.
- MaxOverlapSize
Gets or sets the maximum overlap between consecutive chunks, expressed in tokens. Overlap preserves context continuity across chunk boundaries.
- PreserveHeadingContext
Gets or sets whether to prepend heading context (a breadcrumb trail of parent headings) to each chunk that falls under a heading hierarchy. This improves retrieval quality by giving each chunk a clear topic signal.
- StripBoilerplate
Gets or sets whether to strip boilerplate elements such as navigation bars, footers, sidebars, and common advertisement containers from the HTML before chunking.
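The breadcrumb idea behind PreserveHeadingContext can be illustrated with a small, self-contained sketch. This is not the library's implementation: the input shape (a list of heading levels and texts), the " > " separator, and the sample headings are all assumptions made for the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Headings in document order as (level, text); a hypothetical input shape.
var headings = new (int Level, string Text)[]
{
    (1, "Guide"), (2, "Install"), (3, "Windows"), (2, "Usage"),
};

// The trail holds the currently open headings, outermost first.
var trail = new List<(int Level, string Text)>();
foreach (var heading in headings)
{
    // A new heading closes every open heading at the same or a deeper level.
    trail.RemoveAll(h => h.Level >= heading.Level);
    trail.Add(heading);

    // Text under this heading would carry this breadcrumb as context.
    Console.WriteLine(string.Join(" > ", trail.Select(h => h.Text)));
}
```

In this sketch, content under the "Windows" heading would be labelled "Guide > Install > Windows", which tells a retriever what the chunk is about even when the chunk text never repeats those words.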