
Class HtmlChunking

Namespace
LMKit.Retrieval
Assembly
LM-Kit.NET.dll

Provides HTML-aware chunking for retrieval workflows. The splitter uses AngleSharp to parse the HTML into a DOM and splits along structural boundaries such as headings, sections, tables, and block-level elements, so that each chunk is semantically coherent.

public sealed class HtmlChunking : IChunking
Inheritance
HtmlChunking
Implements
IChunking

Examples

Example: Import an HTML page into a RAG pipeline

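using System.IO;
using LMKit.Model;
using LMKit.Retrieval;

// Load an embedding model and create the RAG engine.
LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
RagEngine ragEngine = new RagEngine(embeddingModel);

// Configure HTML-aware chunking (sizes are expressed in tokens).
var htmlChunker = new HtmlChunking
{
    MaxChunkSize = 400,
    MaxOverlapSize = 40,
    StripBoilerplate = true,
    PreserveHeadingContext = true
};

// Use the HTML chunker for subsequent imports.
ragEngine.DefaultIChunking = htmlChunker;

// Read the page from disk and import it.
string html = File.ReadAllText("page.html");
ragEngine.ImportText(html, "web-docs", "page1");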

Remarks

Key Features

  • Configurable chunk size via MaxChunkSize
  • Overlap between chunks to preserve context via MaxOverlapSize
  • Optional boilerplate stripping (nav, footer, sidebar) via StripBoilerplate
  • Heading breadcrumb context prepended to chunks via PreserveHeadingContext
  • Tables and preformatted blocks kept intact when they fit in a single chunk

Usage with RagEngine
Create a configured instance, then either assign it to DefaultIChunking so that the engine uses it by default, or pass it to an individual ImportText(string, IChunking, string, string, MetadataCollection, CancellationToken) call to use it for that import only.
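
The per-call route can be sketched as follows. This is a minimal sketch, not a verified call: the argument order follows the overload listed above, it assumes the two string parameters carry the same identifiers as the three-argument call in Examples, and it assumes that null metadata and a default CancellationToken are accepted.

```csharp
using System.IO;
using LMKit.Model;
using LMKit.Retrieval;

LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");
RagEngine ragEngine = new RagEngine(embeddingModel);

// Chunker passed to this import only; DefaultIChunking is not assigned here.
var htmlChunker = new HtmlChunking { MaxChunkSize = 400, MaxOverlapSize = 40 };

string html = File.ReadAllText("page.html");

// Assumption: null metadata and a default cancellation token are acceptable.
// Supply real values if the overload requires them.
ragEngine.ImportText(html, htmlChunker, "web-docs", "page1", null, default);
```

Passing the chunker per call is convenient when only some of the imported sources are HTML.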

Properties

KeepSpacings

Gets or sets whether to preserve original whitespace in the extracted text.

MaxChunkSize

Gets or sets the maximum target size of a produced chunk, expressed in tokens.

MaxOverlapSize

Gets or sets the maximum overlap between consecutive chunks, expressed in tokens. Overlap preserves context continuity across chunk boundaries.
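
To illustrate how chunk size and overlap interact, here is a self-contained sketch of fixed-size windows with overlap. It is not the library's algorithm: HtmlChunking also respects structural boundaries and counts model tokens, whereas this hypothetical OverlapSketch helper works on any pre-split string array.

```csharp
using System;
using System.Collections.Generic;

public static class OverlapSketch
{
    // Splits tokens into windows of at most maxChunkSize items. Each window
    // after the first starts maxOverlapSize items before the previous one ended.
    public static List<string[]> Split(string[] tokens, int maxChunkSize, int maxOverlapSize)
    {
        if (maxChunkSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxChunkSize));
        if (maxOverlapSize < 0 || maxOverlapSize >= maxChunkSize)
            throw new ArgumentOutOfRangeException(nameof(maxOverlapSize));

        var chunks = new List<string[]>();
        int stride = maxChunkSize - maxOverlapSize; // distance between window starts

        for (int start = 0; start < tokens.Length; start += stride)
        {
            int length = Math.Min(maxChunkSize, tokens.Length - start);
            var chunk = new string[length];
            Array.Copy(tokens, start, chunk, 0, length);
            chunks.Add(chunk);

            // Stop once a window has reached the end of the input.
            if (start + length >= tokens.Length) break;
        }
        return chunks;
    }
}
```

Under this simplified model, MaxChunkSize = 400 and MaxOverlapSize = 40 give a stride of 360, so a 1,000-token input produces three windows starting at tokens 0, 360, and 720.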

PreserveHeadingContext

Gets or sets whether to prepend heading context (a breadcrumb trail of parent headings) to each chunk that falls under a heading hierarchy. This improves retrieval quality by giving each chunk a clear topic signal.
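
The breadcrumb idea can be sketched in a few lines. The HeadingTrail class below is hypothetical and only illustrates the concept; the " > " separator and the newline between trail and text are assumptions, and the format HtmlChunking actually emits may differ.

```csharp
using System;
using System.Collections.Generic;

public sealed class HeadingTrail
{
    // Headings currently open, outermost first (level 1 = h1 ... 6 = h6).
    private readonly List<(int Level, string Title)> _open =
        new List<(int Level, string Title)>();

    // Registers a heading, first closing any open heading at the same or a deeper level.
    public void Enter(int level, string title)
    {
        while (_open.Count > 0 && _open[_open.Count - 1].Level >= level)
            _open.RemoveAt(_open.Count - 1);
        _open.Add((level, title));
    }

    // Returns the chunk text with the current heading trail prepended.
    public string Prefix(string chunkText)
    {
        if (_open.Count == 0) return chunkText;

        var titles = new List<string>();
        foreach (var heading in _open) titles.Add(heading.Title);
        return string.Join(" > ", titles) + "\n" + chunkText;
    }
}
```

A chunk under an h2 "Install" inside an h1 "Guide" would then begin with the line "Guide > Install", which gives the embedding model a topic signal even when the chunk body never repeats those words.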

StripBoilerplate

Gets or sets whether to strip boilerplate elements such as navigation bars, footers, sidebars, and common advertisement containers from the HTML before chunking.
