How Do I Handle Documents Larger Than the Model's Context Window?
TL;DR
LM-Kit.NET provides multiple strategies: RAG with chunking (split documents into searchable chunks and retrieve only relevant passages), overflow policies (automatically trim or shift input when context fills up), recursive summarization (break text into segments, summarize each, merge results), and context recycling (reuse cached tokens between turns to maximize effective context). The best approach depends on whether you need to answer specific questions (use RAG) or process the entire document (use summarization).
Strategy 1: RAG with Chunking (Most Common)
Instead of fitting the entire document into context, split it into chunks, index them with embeddings, and retrieve only the relevant passages at query time:
using LMKit.Model;
using LMKit.Retrieval;
using LMKit.TextGeneration; // MultiTurnConversation

// Load one model for answering and one for computing embeddings
using LM chatModel = LM.LoadFromModelID("qwen3.5:9b");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");

var ragEngine = new RagEngine(embeddingModel);
ragEngine.ImportDocument("large-manual.pdf"); // Automatically chunked and indexed

// Only relevant passages are retrieved and injected into context
var chat = new MultiTurnConversation(chatModel);
string answer = ragEngine.QueryWithContext(chat, "What are the safety requirements?");
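To make the retrieval step concrete, here is a minimal sketch of nearest-neighbor search over chunk vectors. It is not LM-Kit.NET's implementation: the three-dimensional vectors are invented stand-ins for real embeddings, and `CosineSimilarity` and `TopK` are hypothetical helper names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy "index": chunk text paired with a made-up 3-dimensional embedding.
// Real embeddings come from an embedding model and have many more dimensions.
var index = new List<(string Text, double[] Vector)>
{
    ("Wear gloves and goggles when servicing the unit.", new[] { 0.9, 0.1, 0.0 }),
    ("The warranty covers parts for two years.",         new[] { 0.0, 0.2, 0.9 }),
    ("Disconnect power before opening the housing.",     new[] { 0.8, 0.3, 0.1 }),
};

// Cosine similarity: dot product divided by the product of the vector lengths.
double CosineSimilarity(double[] a, double[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Return the k chunks whose vectors are most similar to the query vector.
List<string> TopK(List<(string Text, double[] Vector)> entries, double[] query, int k) =>
    entries.OrderByDescending(e => CosineSimilarity(e.Vector, query))
           .Take(k)
           .Select(e => e.Text)
           .ToList();

// Pretend this is the embedding of "What are the safety requirements?"
double[] queryVector = { 1.0, 0.2, 0.0 };
List<string> passages = TopK(index, queryVector, 2);

foreach (string passage in passages)
    Console.WriteLine(passage);
```

Only the two safety-related chunks are selected, so the warranty text never consumes context tokens. That is the property that lets RAG work on documents far larger than the window.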
Chunking Strategies
| Strategy | Class | Best For |
|---|---|---|
| Text chunking | TextChunking | General text. Recursive splitting with configurable overlap. Default: 500 tokens per chunk, 50 token overlap. |
| Markdown chunking | MarkdownChunking | Markdown documents. Respects heading boundaries and code blocks. |
| HTML chunking | HtmlChunking | Web content. Splits at block boundaries (sections, paragraphs, tables). Can strip boilerplate and preserve heading context. |
// Customize chunking for your content
ragEngine.DefaultChunking = new TextChunking
{
    MaxChunkSize = 300,  // Tokens per chunk (200-300 for precise retrieval)
    MaxOverlapSize = 50  // Overlap tokens for context preservation
};
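The effect of chunk size and overlap can be seen in a small sliding-window sketch. This is a simplification, not how `TextChunking` works internally: it uses fixed windows rather than recursive splitting, and it approximates tokens with whitespace-separated words. `ChunkWithOverlap` is a hypothetical helper whose parameter names only mirror the properties above.

```csharp
using System;
using System.Collections.Generic;

// Split text into windows of at most maxChunkSize words, where each window
// repeats the last maxOverlapSize words of the previous one.
List<string> ChunkWithOverlap(string text, int maxChunkSize, int maxOverlapSize)
{
    if (maxOverlapSize < 0 || maxOverlapSize >= maxChunkSize)
        throw new ArgumentException("Overlap must be non-negative and smaller than the chunk size.");

    string[] words = text.Split(new[] { ' ', '\t', '\r', '\n' },
                                StringSplitOptions.RemoveEmptyEntries);
    var result = new List<string>();
    int step = maxChunkSize - maxOverlapSize; // how far the window advances each time

    for (int start = 0; start < words.Length; start += step)
    {
        int length = Math.Min(maxChunkSize, words.Length - start);
        result.Add(string.Join(" ", words, start, length));
        if (start + length >= words.Length)
            break; // this window already reached the end of the text
    }
    return result;
}

var pieces = ChunkWithOverlap(
    "one two three four five six seven eight nine ten",
    maxChunkSize: 4,
    maxOverlapSize: 1);

foreach (string piece in pieces)
    Console.WriteLine(piece);
// one two three four
// four five six seven
// seven eight nine ten
```

The repeated word at each boundary is the overlap. With real documents it keeps a sentence that straddles two chunks retrievable from either side.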
Strategy 2: PdfChat (Automatic Size-Based Routing)
PdfChat automatically chooses the best strategy based on document size:
- Small documents (under 4096 tokens by default): The full document is included in context.
- Large documents: Switches to passage retrieval, injecting only relevant excerpts per question.
using LMKit.Retrieval;

// Reuses chatModel and embeddingModel loaded in the Strategy 1 example
var pdfChat = new PdfChat(chatModel, embeddingModel);
pdfChat.ImportDocument("report.pdf");

// The SDK decides whether to use full-document or passage retrieval
string answer = pdfChat.Submit("What was the Q3 revenue?");
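The routing rule itself is simple enough to sketch. The snippet below only illustrates the decision described above; `ChooseMode` and `FullDocumentTokenLimit` are hypothetical names, not part of the PdfChat API.

```csharp
using System;

// Hypothetical constant mirroring the 4096-token default mentioned above.
const int FullDocumentTokenLimit = 4096;

// Decide how a document should reach the model, based only on its token count.
string ChooseMode(int documentTokens, int limit) =>
    documentTokens < limit ? "full-document" : "passage-retrieval";

Console.WriteLine(ChooseMode(1200, FullDocumentTokenLimit));  // full-document
Console.WriteLine(ChooseMode(85000, FullDocumentTokenLimit)); // passage-retrieval
```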
Strategy 3: Overflow Policies
For conversations that gradually fill the context window, LM-Kit.NET provides automatic overflow handling:
Input Length Overflow
When the input prompt exceeds the available context:
| Policy | Behavior |
|---|---|
| TrimAuto (default) | Automatically trims input using the best method |
| TrimStart | Removes the earliest tokens first |
| TrimEnd | Removes the latest tokens first |
| KVCacheShifting | Shifts the KV cache without directly trimming input |
| Throw | Raises an exception so you can handle it manually |
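The difference between the trimming policies is easiest to see on a list of token ids. The helpers below are hypothetical illustrations of the TrimStart, TrimEnd, and Throw behaviors from the table. They do not call LM-Kit.NET and say nothing about how TrimAuto picks a method.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// TrimStart-style: keep at most `budget` tokens by dropping the earliest ones.
List<int> TrimFromStart(List<int> tokens, int budget) =>
    tokens.Skip(Math.Max(0, tokens.Count - budget)).ToList();

// TrimEnd-style: keep at most `budget` tokens by dropping the latest ones.
List<int> TrimFromEnd(List<int> tokens, int budget) =>
    tokens.Take(budget).ToList();

// Throw-style: refuse oversized input instead of silently altering it.
List<int> RejectIfTooLong(List<int> tokens, int budget) =>
    tokens.Count <= budget
        ? tokens
        : throw new InvalidOperationException(
              $"Input has {tokens.Count} tokens; budget is {budget}.");

var prompt = Enumerable.Range(1, 10).ToList(); // token ids 1..10

var keptTail = TrimFromStart(prompt, 4); // 7, 8, 9, 10
var keptHead = TrimFromEnd(prompt, 4);   // 1, 2, 3, 4

Console.WriteLine(string.Join(", ", keptTail));
Console.WriteLine(string.Join(", ", keptHead));
```

Dropping from the start preserves the most recent text, which usually suits chat. Dropping from the end preserves the opening instructions, and throwing leaves the decision to the caller.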
Context Overflow During Generation
When the context fills up during token generation:
| Policy | Behavior |
|---|---|
| KVCacheShifting (default) | Dynamically shifts the KV cache to make room |
| StopGeneration | Stops generation and returns a context-size-exceeded reason |
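As a rough mental model of cache shifting, the sketch below simulates a full window that makes room by discarding old tokens while generation continues. The eviction rule is an assumption made for illustration (keep a fixed prefix, drop the oldest half of the rest, a rule some inference runtimes use). LM-Kit.NET's actual rule is not specified here, and `ShiftWindow` is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;

// Make room in a full window: keep the first `keep` tokens (for example a
// system prompt) and discard the oldest half of the remaining tokens.
void ShiftWindow(List<int> cache, int keep)
{
    int discard = (cache.Count - keep) / 2;
    cache.RemoveRange(keep, discard);
}

const int ContextSize = 8;
var window = new List<int>();

for (int token = 1; token <= 12; token++) // pretend the model emits 12 tokens
{
    if (window.Count == ContextSize)
        ShiftWindow(window, keep: 2);
    window.Add(token);
}

Console.WriteLine(string.Join(", ", window)); // 1, 2, 9, 10, 11, 12
```

Generation never stops, but tokens 3 through 8 are gone from the model's view. Shifting trades completeness of history for uninterrupted output, and StopGeneration makes the opposite trade.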
Strategy 4: Recursive Summarization
For summarizing documents that exceed context, the Summarizer class supports recursive splitting:
| Strategy | Behavior |
|---|---|
| Truncate | Removes content from the end until it fits |
| RecursiveSummarize | Breaks input into segments, summarizes each, merges summaries iteratively until the result fits |
| Reject | Halts if input exceeds the configured maximum |
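The RecursiveSummarize behavior can be sketched as a loop that summarizes segments until the result fits. This is not the Summarizer class's implementation: `FakeSummarize` is a stand-in that keeps the first few words instead of calling a model, sizes are counted in words rather than tokens, and all names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for a model call: keeps the first `maxWords` words of the input.
// A real summarizer would ask the language model for a summary here.
string FakeSummarize(string text, int maxWords) =>
    string.Join(" ", text.Split(' ', StringSplitOptions.RemoveEmptyEntries).Take(maxWords));

int WordCount(string text) =>
    text.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;

// Split into segments that fit `limit`, summarize each to half the limit,
// join the partial summaries, and repeat until the whole result fits.
string RecursiveSummarize(string text, int limit, Func<string, int, string> summarize)
{
    if (limit < 2)
        throw new ArgumentOutOfRangeException(nameof(limit), "Limit must be at least 2.");

    while (WordCount(text) > limit)
    {
        string[] words = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        var partials = new List<string>();
        for (int i = 0; i < words.Length; i += limit)
        {
            string segment = string.Join(" ", words.Skip(i).Take(limit));
            partials.Add(summarize(segment, limit / 2));
        }
        text = string.Join(" ", partials);
    }
    return text;
}

// A 20-word "document" (w1 ... w20) summarized into a window of 8 words.
string longText = string.Join(" ", Enumerable.Range(1, 20).Select(n => $"w{n}"));
string summary = RecursiveSummarize(longText, 8, FakeSummarize);

Console.WriteLine(summary); // w1 w2 w3 w4 w17 w18 w19 w20
```

The example needs two passes: 20 words shrink to 12, then to 8. Each pass costs one model call per segment, so very long inputs take noticeably longer than a single summarization call.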
Strategy 5: Context Recycling
KV cache recycling (enabled by default) detects token overlap between conversation turns and reuses cached computations. The model does not re-process tokens it has already seen, so each turn only pays for its new tokens. The window size itself is unchanged; the gain is that a long conversation history stays practical to carry from turn to turn.
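A minimal way to picture overlap detection is a shared-prefix check between the cached tokens and the next prompt. The sketch assumes the overlap is a common prefix, which is the simplest case; LM-Kit.NET's detection may be more general. `SharedPrefixLength` and the token ids are invented for illustration.

```csharp
using System;
using System.Collections.Generic;

// Count how many leading tokens of the new prompt are already in the cache.
int SharedPrefixLength(IReadOnlyList<int> cached, IReadOnlyList<int> incoming)
{
    int n = Math.Min(cached.Count, incoming.Count);
    int i = 0;
    while (i < n && cached[i] == incoming[i])
        i++;
    return i;
}

// Turn 1 left these token ids in the cache (system prompt plus first exchange).
var cachedTokens = new List<int> { 101, 7, 8, 9, 42, 43 };
// Turn 2 sends the same history followed by a new user message.
var nextPrompt = new List<int> { 101, 7, 8, 9, 42, 43, 55, 56, 57 };

int reused = SharedPrefixLength(cachedTokens, nextPrompt);
int toProcess = nextPrompt.Count - reused;

Console.WriteLine($"reused={reused}, toProcess={toProcess}"); // reused=6, toProcess=3
```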
Which Strategy Should I Use?
| Goal | Best Strategy |
|---|---|
| Answer specific questions about a large document | RAG with chunking or PdfChat |
| Summarize an entire long document | Recursive summarization |
| Multi-turn conversation that grows over time | Overflow policies with KV cache shifting |
| Multiple documents in a knowledge base | RAG with chunking |
| Process everything in a single pass | Reduce document size or use a model with a larger context window |
📚 Related Content
- What is the maximum context length I can use?: Context window limits by model and hardware.
- Should I use RAG or fine-tuning for my use case?: When RAG is the right approach.
- What happens when a model does not fit in my GPU memory?: Context size vs GPU layers trade-off.
- Build a RAG Pipeline Over Your Own Documents: Step-by-step RAG implementation guide.