How Do I Handle Documents Larger Than the Model's Context Window?

TL;DR

LM-Kit.NET provides multiple strategies: RAG with chunking (split documents into searchable chunks and retrieve only relevant passages), overflow policies (automatically trim or shift input when context fills up), recursive summarization (break text into segments, summarize each, merge results), and context recycling (reuse cached tokens between turns so earlier input is not re-processed). The best approach depends on whether you need to answer specific questions (use RAG) or process the entire document (use summarization).

Strategy 1: RAG with Chunking (Most Common)

Instead of fitting the entire document into context, split it into chunks, index them with embeddings, and retrieve only the relevant passages at query time:

using LMKit.Model;
using LMKit.Retrieval;
using LMKit.TextGeneration;  // MultiTurnConversation

using LM chatModel = LM.LoadFromModelID("qwen3.5:9b");
using LM embeddingModel = LM.LoadFromModelID("embeddinggemma-300m");

var ragEngine = new RagEngine(embeddingModel);
ragEngine.ImportDocument("large-manual.pdf");  // Automatically chunked and indexed

// Only relevant passages are retrieved and injected into context
var chat = new MultiTurnConversation(chatModel);
string answer = ragEngine.QueryWithContext(chat, "What are the safety requirements?");
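
To make the retrieval step more concrete, here is a minimal sketch of what "retrieve only the relevant passages" usually means: rank chunk embeddings by cosine similarity to the query embedding and keep the top few. It does not use LM-Kit.NET at all. The `Cosine` helper, the three-dimensional toy vectors, and the sample passages are illustrative assumptions, and the SDK's actual ranking may differ.

```csharp
using System;
using System.Linq;

// Hypothetical helper: cosine similarity between two vectors of equal length.
static double Cosine(double[] a, double[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Toy 3-dimensional "embeddings"; a real embedding model produces far larger vectors.
var chunks = new (string Text, double[] Vector)[]
{
    ("Wear gloves and goggles.",           new[] { 0.9, 0.1, 0.0 }),
    ("Warranty lasts two years.",          new[] { 0.0, 0.2, 0.9 }),
    ("Disconnect power before servicing.", new[] { 0.8, 0.3, 0.1 }),
};
double[] query = { 1.0, 0.2, 0.0 }; // stands in for the embedded safety question

// Keep only the two most similar chunks; only these would be injected into context.
string[] top = chunks
    .OrderByDescending(c => Cosine(c.Vector, query))
    .Take(2)
    .Select(c => c.Text)
    .ToArray();

foreach (string passage in top)
    Console.WriteLine(passage);
```

The warranty passage never reaches the model, which is why the context cost depends on the number of retrieved chunks rather than on the size of the document.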

Chunking Strategies

Strategy	Class	Best For
Text chunking	`TextChunking`	General text. Recursive splitting with configurable overlap. Default: 500 tokens per chunk, 50 token overlap.
Markdown chunking	`MarkdownChunking`	Markdown documents. Respects heading boundaries and code blocks.
HTML chunking	`HtmlChunking`	Web content. Splits at block boundaries (sections, paragraphs, tables). Can strip boilerplate and preserve heading context.

// Customize chunking for your content
ragEngine.DefaultChunking = new TextChunking
{
    MaxChunkSize = 300,       // Tokens per chunk (200-300 for precise retrieval)
    MaxOverlapSize = 50       // Overlap tokens for context preservation
};
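
To see how chunk size and overlap interact, the following standalone sketch splits a token sequence into fixed windows where each window repeats the tail of the previous one. It is a simplified sliding window, not the recursive splitter that `TextChunking` is described as using. `ChunkWithOverlap` is a hypothetical helper, and plain strings stand in for model tokens.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: windows of maxChunkSize tokens, each repeating the
// last maxOverlapSize tokens of the previous window.
static List<string[]> ChunkWithOverlap(string[] tokens, int maxChunkSize, int maxOverlapSize)
{
    if (maxChunkSize <= 0)
        throw new ArgumentOutOfRangeException(nameof(maxChunkSize));
    if (maxOverlapSize < 0 || maxOverlapSize >= maxChunkSize)
        throw new ArgumentOutOfRangeException(nameof(maxOverlapSize));

    var chunks = new List<string[]>();
    int step = maxChunkSize - maxOverlapSize; // how far each window advances
    for (int start = 0; start < tokens.Length; start += step)
    {
        int length = Math.Min(maxChunkSize, tokens.Length - start);
        chunks.Add(tokens.Skip(start).Take(length).ToArray());
        if (start + length >= tokens.Length) break; // last window reached the end
    }
    return chunks;
}

// 1200 placeholder tokens, chunked with the defaults quoted above (500 / 50).
string[] words = Enumerable.Range(1, 1200).Select(i => $"w{i}").ToArray();
List<string[]> result = ChunkWithOverlap(words, 500, 50);

Console.WriteLine($"{result.Count} chunks");                  // 3 chunks
Console.WriteLine($"chunk 2 starts at {result[1][0]}");       // chunk 2 starts at w451
```

Smaller chunks give more precise matches but carry less surrounding context per hit; the overlap keeps a sentence that straddles a boundary retrievable from either side.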

Strategy 2: PdfChat (Automatic Size-Based Routing)

PdfChat automatically chooses the best strategy based on document size:

Small documents (under 4096 tokens by default): The full document is included in context.
Large documents: Switches to passage retrieval, injecting only relevant excerpts per question.

using LMKit.Retrieval;

// Reuses chatModel and embeddingModel loaded in Strategy 1
var pdfChat = new PdfChat(chatModel, embeddingModel);
pdfChat.ImportDocument("report.pdf");

// The SDK decides whether to use full-document or passage retrieval
string answer = pdfChat.Submit("What was the Q3 revenue?");
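
The routing rule described above can be pictured as a single threshold check. This sketch is only an illustration of that idea: `ChooseMode` and the mode names are hypothetical, and the SDK's internal decision may take other factors into account.

```csharp
using System;

const int FullDocumentTokenLimit = 4096; // mirrors the default threshold quoted above

// Hypothetical router: small documents go into context whole,
// larger ones fall back to passage retrieval.
static string ChooseMode(int documentTokenCount, int fullDocumentTokenLimit)
    => documentTokenCount < fullDocumentTokenLimit ? "full-document" : "passage-retrieval";

Console.WriteLine(ChooseMode(1200, FullDocumentTokenLimit));   // full-document
Console.WriteLine(ChooseMode(90000, FullDocumentTokenLimit));  // passage-retrieval
```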

Strategy 3: Overflow Policies

For conversations that gradually fill the context window, LM-Kit.NET provides automatic overflow handling:

Input Length Overflow

When the input prompt exceeds the available context:

Policy	Behavior
TrimAuto (default)	Automatically trims input using the best method
TrimStart	Removes the earliest tokens first
TrimEnd	Removes the latest tokens first
KVCacheShifting	Shifts the KV cache without directly trimming input
Throw	Raises an exception so you can handle it manually
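
The difference between trimming from the start and trimming from the end is easiest to see on a small token array. The sketch below is library-independent: `TrimStart` and `TrimEnd` are hypothetical helpers that mimic the behavior described in the table, and integers stand in for token ids.

```csharp
using System;
using System.Linq;

// Drops the earliest tokens, keeping the most recent `budget` tokens.
static int[] TrimStart(int[] tokens, int budget)
    => tokens.Length <= budget ? tokens : tokens.Skip(tokens.Length - budget).ToArray();

// Drops the latest tokens, keeping the earliest `budget` tokens.
static int[] TrimEnd(int[] tokens, int budget)
    => tokens.Length <= budget ? tokens : tokens.Take(budget).ToArray();

int[] prompt = Enumerable.Range(0, 10).ToArray(); // token ids 0..9
int[] keptTail = TrimStart(prompt, 4);
int[] keptHead = TrimEnd(prompt, 4);

Console.WriteLine(string.Join(",", keptTail)); // 6,7,8,9
Console.WriteLine(string.Join(",", keptHead)); // 0,1,2,3
```

Trimming the start tends to suit chat-style input where the latest message matters most, while trimming the end preserves instructions placed at the top of the prompt.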

Context Overflow During Generation

When the context fills up during token generation:

Policy	Behavior
KVCacheShifting (default)	Dynamically shifts the KV cache to make room
StopGeneration	Stops generation and returns a context-size-exceeded reason
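
As a rough mental model of cache shifting, the sketch below keeps a protected prefix (for example a system prompt), discards a block of the oldest remaining tokens, and lets generation continue in the freed space. This reflects how context shifting is commonly implemented in local inference runtimes; whether LM-Kit.NET uses the same scheme is an assumption. `ShiftWindow`, `keep`, and `discard` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: removes up to `discard` tokens that follow the first
// `keep` tokens, so the protected prefix and the newest tokens survive.
static void ShiftWindow(List<int> window, int keep, int discard)
{
    if (keep >= window.Count) return; // nothing beyond the protected prefix
    int removable = window.Count - keep;
    window.RemoveRange(keep, Math.Min(discard, removable));
}

var window = Enumerable.Range(0, 8).ToList(); // a "full" context of 8 tokens: 0..7
ShiftWindow(window, 2, 3);                    // protect tokens 0-1, drop tokens 2-4
window.Add(8);                                // generation can now append a new token

Console.WriteLine(string.Join(",", window));  // 0,1,5,6,7,8
```

The model keeps generating, but anything in the discarded block is no longer visible to it, which is the trade-off compared with stopping generation.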

Strategy 4: Recursive Summarization

When the text to summarize exceeds the context, the Summarizer class applies one of the following overflow strategies:

Strategy	Behavior
Truncate	Removes content from the end until it fits
RecursiveSummarize	Breaks input into segments, summarizes each, merges summaries iteratively until the result fits
Reject	Halts if input exceeds the configured maximum
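
The recursive strategy can be sketched independently of any model: split the input into segments that fit, summarize each, join the partial summaries, and repeat until the joined text fits in one pass. In this sketch `RecursiveSummarize` is a hypothetical helper, words stand in for tokens, and the `summarize` delegate is a stub that keeps the first ten words of its input in place of a real model call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: summarizes text of any length using a summarizer that
// can only handle maxWords words at a time.
static string RecursiveSummarize(string text, int maxWords, Func<string, string> summarize)
{
    string[] words = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
    if (words.Length <= maxWords)
        return summarize(text); // fits in one pass

    // Summarize each segment separately, then merge the partial summaries.
    var partials = new List<string>();
    for (int i = 0; i < words.Length; i += maxWords)
        partials.Add(summarize(string.Join(" ", words.Skip(i).Take(maxWords))));
    string merged = string.Join(" ", partials);

    // Guard against a summarizer that does not shrink its input.
    int mergedLength = merged.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Length;
    if (mergedLength >= words.Length)
        throw new InvalidOperationException("Summarizer did not reduce the text.");

    return RecursiveSummarize(merged, maxWords, summarize);
}

// Stub summarizer: keeps the first 10 words. A real one would call the model.
Func<string, string> firstTenWords = s =>
    string.Join(" ", s.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries).Take(10));

string document = string.Join(" ", Enumerable.Range(1, 1000).Select(i => $"w{i}"));
string summary = RecursiveSummarize(document, 100, firstTenWords);

Console.WriteLine(summary); // w1 w2 w3 w4 w5 w6 w7 w8 w9 w10
```

Each round costs one summarization call per segment, so very long inputs trade extra latency for complete coverage, whereas truncation is fast but ignores everything past the cut.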

Strategy 5: Context Recycling

KV cache recycling (enabled by default) detects token overlap between conversation turns and reuses the cached computations, so the model does not re-process tokens it has already seen. The context window itself does not grow, but long multi-turn conversations stay responsive because only the new tokens of each turn need to be evaluated.
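
One simple way to picture overlap detection is as a shared-prefix check between the tokens already in the cache and the tokens of the next request. The sketch below shows that idea only; `CommonPrefixLength` is a hypothetical helper, integers stand in for token ids, and the SDK's actual matching logic is not documented here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: number of leading tokens two sequences have in common.
static int CommonPrefixLength(IReadOnlyList<int> cached, IReadOnlyList<int> next)
{
    int limit = Math.Min(cached.Count, next.Count);
    int i = 0;
    while (i < limit && cached[i] == next[i]) i++;
    return i;
}

int[] previousTurn = { 1, 2, 3, 4, 5 };          // tokens already in the cache
int[] currentTurn  = { 1, 2, 3, 4, 5, 6, 7, 8 }; // same history plus a new message

int reused = CommonPrefixLength(previousTurn, currentTurn);
int toEvaluate = currentTurn.Length - reused;

Console.WriteLine($"reused={reused}, new={toEvaluate}"); // reused=5, new=3
```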

Which Strategy Should I Use?

Goal	Best Strategy
Answer specific questions about a large document	RAG with chunking or PdfChat
Summarize an entire long document	Recursive summarization
Multi-turn conversation that grows over time	Overflow policies with KV cache shifting
Multiple documents in a knowledge base	RAG with chunking
Process everything in a single pass	Reduce document size or use a model with a larger context window

Related Topics

What is the maximum context length I can use?: Context window limits by model and hardware.
Should I use RAG or fine-tuning for my use case?: When RAG is the right approach.
What happens when a model does not fit in my GPU memory?: Context size vs GPU layers trade-off.
Build a RAG Pipeline Over Your Own Documents: Step-by-step RAG implementation guide.
