
How Do I Handle Documents Larger Than the Model's Context Window?


TL;DR

LM-Kit.NET provides multiple strategies: RAG with chunking (split documents into searchable chunks and retrieve only relevant passages), overflow policies (automatically trim or shift input when context fills up), recursive summarization (break text into segments, summarize each, merge results), and context recycling (reuse cached tokens between turns to maximize effective context). The best approach depends on whether you need to answer specific questions (use RAG) or process the entire document (use summarization).


Strategy 1: RAG with Chunking (Most Common)

Instead of fitting the entire document into context, split it into chunks, index them with embeddings, and retrieve only the relevant passages at query time:

using LMKit.Model;
using LMKit.Retrieval;
using LMKit.TextGeneration;  // MultiTurnConversation

using LM chatModel = LM.LoadFromModelID("qwen3.5:9b");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");

var ragEngine = new RagEngine(embeddingModel);
ragEngine.ImportDocument("large-manual.pdf");  // Automatically chunked and indexed

// Only relevant passages are retrieved and injected into context
var chat = new MultiTurnConversation(chatModel);
string answer = ragEngine.QueryWithContext(chat, "What are the safety requirements?");

Chunking Strategies

| Strategy | Class | Best For |
|---|---|---|
| Text chunking | TextChunking | General text. Recursive splitting with configurable overlap. Default: 500 tokens per chunk, 50-token overlap. |
| Markdown chunking | MarkdownChunking | Markdown documents. Respects heading boundaries and code blocks. |
| HTML chunking | HtmlChunking | Web content. Splits at block boundaries (sections, paragraphs, tables). Can strip boilerplate and preserve heading context. |

// Customize chunking for your content
ragEngine.DefaultChunking = new TextChunking
{
    MaxChunkSize = 300,       // Tokens per chunk (200-300 for precise retrieval)
    MaxOverlapSize = 50       // Overlap tokens for context preservation
};
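
To make the two settings concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not LM-Kit.NET's implementation: TextChunking splits recursively and counts model tokens, whereas the hypothetical `OverlapChunker` below simply slides a window over an array of pre-split tokens.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of LM-Kit.NET.
public static class OverlapChunker
{
    public static List<string[]> Chunk(string[] tokens, int maxChunkSize, int maxOverlapSize)
    {
        if (maxChunkSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxChunkSize));
        if (maxOverlapSize < 0 || maxOverlapSize >= maxChunkSize)
            throw new ArgumentOutOfRangeException(nameof(maxOverlapSize));

        var chunks = new List<string[]>();
        int step = maxChunkSize - maxOverlapSize;  // how far each window advances

        for (int start = 0; start < tokens.Length; start += step)
        {
            int length = Math.Min(maxChunkSize, tokens.Length - start);
            var chunk = new string[length];
            Array.Copy(tokens, start, chunk, 0, length);
            chunks.Add(chunk);

            // Stop once a window has reached the end of the input.
            if (start + length >= tokens.Length) break;
        }
        return chunks;
    }
}
```

With the settings shown above (300 and 50), each window in this sketch would start 250 tokens after the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk, provided it is shorter than the overlap.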

Strategy 2: PdfChat (Automatic Size-Based Routing)

PdfChat automatically chooses the best strategy based on document size:

  • Small documents (under 4096 tokens by default): The full document is included in context.
  • Large documents: Switches to passage retrieval, injecting only relevant excerpts per question.

using LMKit.Retrieval;

var pdfChat = new PdfChat(chatModel, embeddingModel);
pdfChat.ImportDocument("report.pdf");

// The SDK decides whether to use full-document or passage retrieval
string answer = pdfChat.Submit("What was the Q3 revenue?");

Strategy 3: Overflow Policies

For conversations that gradually fill the context window, LM-Kit.NET provides automatic overflow handling:

Input Length Overflow

When the input prompt exceeds the available context:

| Policy | Behavior |
|---|---|
| TrimAuto (default) | Automatically trims input using the best method |
| TrimStart | Removes the earliest tokens first |
| TrimEnd | Removes the latest tokens first |
| KVCacheShifting | Shifts the KV cache without directly trimming input |
| Throw | Raises an exception so you can handle it manually |
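
The difference between the trimming policies is easiest to see on a small token list. The sketch below is an illustration under assumptions, not the SDK's code: `InputTrimSketch` and its `budget` parameter are hypothetical names, and tokens are represented as plain integers.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration of the trimming policies; not LM-Kit.NET code.
public static class InputTrimSketch
{
    // TrimStart-like: drop the earliest tokens, keep the most recent ones.
    public static List<int> TrimStart(IReadOnlyList<int> tokens, int budget) =>
        tokens.Skip(Math.Max(0, tokens.Count - budget)).ToList();

    // TrimEnd-like: drop the latest tokens, keep the beginning of the input.
    public static List<int> TrimEnd(IReadOnlyList<int> tokens, int budget) =>
        tokens.Take(budget).ToList();

    // Throw-like: refuse oversized input so the caller can decide what to do.
    public static List<int> ThrowIfTooLong(IReadOnlyList<int> tokens, int budget)
    {
        if (tokens.Count > budget)
            throw new InvalidOperationException(
                $"Input has {tokens.Count} tokens but only {budget} fit.");
        return tokens.ToList();
    }
}
```

Trimming from the start tends to suit chat-style input where the latest message matters most, while trimming from the end preserves instructions placed at the top of a prompt.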

Context Overflow During Generation

When the context fills up during token generation:

| Policy | Behavior |
|---|---|
| KVCacheShifting (default) | Dynamically shifts the KV cache to make room |
| StopGeneration | Stops generation and returns a context-size-exceeded reason |
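
As a rough mental model of cache shifting, the sketch below follows one common scheme: keep a fixed number of leading tokens (for example, a system prompt), discard the older half of the remainder, and slide the newer half down. Whether LM-Kit.NET uses these exact proportions is an assumption; `ContextShiftSketch` and its `keep` parameter are hypothetical, and a real KV cache holds attention state rather than token IDs.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of context shifting; not LM-Kit.NET code.
public static class ContextShiftSketch
{
    public static List<int> Shift(IReadOnlyList<int> cache, int keep)
    {
        if (keep < 0 || keep > cache.Count)
            throw new ArgumentOutOfRangeException(nameof(keep));

        // Discard the older half of everything after the protected prefix.
        int discard = (cache.Count - keep) / 2;

        var shifted = new List<int>(cache.Count - discard);
        for (int i = 0; i < keep; i++)
            shifted.Add(cache[i]);                    // protected prefix
        for (int i = keep + discard; i < cache.Count; i++)
            shifted.Add(cache[i]);                    // most recent tokens
        return shifted;
    }
}
```

Generation can then continue in the freed slots, at the cost of the model no longer seeing the discarded middle of the conversation.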

Strategy 4: Recursive Summarization

For summarizing documents that exceed the context window, the Summarizer class offers the following overflow strategies, including recursive summarization:

| Strategy | Behavior |
|---|---|
| Truncate | Removes content from the end until it fits |
| RecursiveSummarize | Breaks input into segments, summarizes each, merges summaries iteratively until the result fits |
| Reject | Halts if input exceeds the configured maximum |
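
The RecursiveSummarize behavior can be sketched as a loop. This is an illustration under assumptions rather than the Summarizer's actual code: `RecursiveSummarySketch` is hypothetical, size is measured in characters instead of tokens, and the `summarizeSegment` delegate stands in for a model call.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of recursive summarization; not LM-Kit.NET code.
public static class RecursiveSummarySketch
{
    public static string Summarize(string text, int maxChars, Func<string, string> summarizeSegment)
    {
        if (maxChars <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxChars));

        string current = text;
        while (current.Length > maxChars)
        {
            // 1. Break the input into segments that each fit the budget.
            var summaries = new List<string>();
            for (int start = 0; start < current.Length; start += maxChars)
            {
                int length = Math.Min(maxChars, current.Length - start);
                // 2. Summarize each segment independently.
                summaries.Add(summarizeSegment(current.Substring(start, length)));
            }

            // 3. Merge the partial summaries and repeat if still too long.
            string merged = string.Join(" ", summaries);
            if (merged.Length >= current.Length)
                throw new InvalidOperationException("Summaries did not shrink the text.");
            current = merged;
        }
        return current;
    }
}
```

The guard against non-shrinking output matters in practice: if a summarizer returns text as long as its input, the loop would otherwise never terminate.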

Strategy 5: Context Recycling

KV cache recycling (enabled by default) detects token overlap between conversation turns and reuses cached computations. This means the model does not re-process tokens it has already seen, effectively extending the useful context for multi-turn conversations.
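
One simple way to picture overlap detection is a longest-common-prefix check between the tokens already cached and the tokens of the next request. How LM-Kit.NET detects overlap internally is not stated here, so treat the hypothetical `PrefixReuseSketch` below as an assumption for illustration only.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of prefix reuse; not LM-Kit.NET code.
public static class PrefixReuseSketch
{
    // Number of leading tokens shared by the cached and the incoming sequence.
    public static int CommonPrefixLength(IReadOnlyList<int> cached, IReadOnlyList<int> incoming)
    {
        int limit = Math.Min(cached.Count, incoming.Count);
        int i = 0;
        while (i < limit && cached[i] == incoming[i]) i++;
        return i;
    }

    // Tokens that still need to be processed after reusing the shared prefix.
    public static int TokensToProcess(IReadOnlyList<int> cached, IReadOnlyList<int> incoming) =>
        incoming.Count - CommonPrefixLength(cached, incoming);
}
```

In a typical multi-turn exchange the new prompt is the previous conversation plus one message, so under this model only the appended tokens would need to be evaluated.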


Which Strategy Should I Use?

| Goal | Best Strategy |
|---|---|
| Answer specific questions about a large document | RAG with chunking or PdfChat |
| Summarize an entire long document | Recursive summarization |
| Multi-turn conversation that grows over time | Overflow policies with KV cache shifting |
| Multiple documents in a knowledge base | RAG with chunking |
| Process everything in a single pass | Reduce document size or use a model with a larger context window |
