What Is the Maximum Context Length I Can Use?
TL;DR
There is no fixed SDK-wide limit. The maximum context length depends on two factors: the model's trained context window and your available memory. Most models support 8K to 128K tokens natively, but the practical limit is whatever your hardware can fit. LM-Kit.NET provides the MemoryEstimation.FitParameters() API to calculate the exact maximum context your hardware supports for a given model.
Context Length by Model
Each model has a maximum context window defined by its training. Here are the context lengths for popular models:
| Model | Max Context | Notes |
|---|---|---|
| qwen3.5:0.8b to qwen3.5:27b | 32,768 tokens | Qwen 3.5 family |
| qwen3.5:27b | 131,072 tokens | Extended context |
| gemma4:e4b to gemma4:26b-a4b | 8,192 tokens | Gemma 4 family |
| gptoss:20b | 131,072 tokens | Long-context reasoning |
| glm4.7-flash | 131,072 tokens | Long-context MoE |
| phi4-mini:3.8b | 16,384 tokens | Compact Phi-4 |
| phi4:14.7b | 16,384 tokens | Full Phi-4 |
| llama3.1:8b | 131,072 tokens | Llama 3.1 |
| mistral-small | 32,768 tokens | Mistral Small |
You can check the context length of any loaded model programmatically:

```csharp
using LMKit.Model;

// Load the model and read its trained context window
using LM model = LM.LoadFromModelID("qwen3.5:9b");
Console.WriteLine($"Max context: {model.ContextLength} tokens");
```
Memory Limits the Practical Context
The model's trained context is an upper bound, but your actual usable context depends on available memory. Context consumes memory through the KV cache, which grows linearly with the number of tokens:
- Small context (2K to 4K tokens): Minimal KV cache overhead. Leaves maximum memory for model layers.
- Medium context (8K to 16K tokens): Moderate KV cache. Good balance for most chat and RAG applications.
- Large context (32K to 128K tokens): Significant memory consumption. May require reducing GPU layers or using a smaller model.
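The linear growth can be illustrated with a back-of-the-envelope estimate: KV cache bytes ≈ 2 (K and V) × layers × KV heads × head dimension × context tokens × bytes per element. The figures below (a 32-layer model with 8 KV heads of dimension 128 and an FP16 cache) are assumptions chosen for the sketch, not values reported by LM-Kit.NET:

```csharp
// Rough KV-cache size estimate. All model figures here are illustrative
// assumptions; real models vary in layer count, head layout, and cache precision.
int layers = 32, kvHeads = 8, headDim = 128;
int bytesPerElement = 2; // FP16 cache

foreach (int context in new[] { 4_096, 16_384, 131_072 })
{
    // 2 accounts for storing both keys and values per layer
    long bytes = 2L * layers * kvHeads * headDim * context * bytesPerElement;
    Console.WriteLine($"{context,7} tokens -> {bytes / (1024.0 * 1024 * 1024):F1} GiB KV cache");
}
```

With these assumed figures, the cache costs about 128 KiB per token, so 4K tokens take roughly 0.5 GiB while 128K tokens take roughly 16 GiB, which is why very large contexts can crowd model layers out of GPU memory.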
Use the MemoryEstimation API to find the maximum context your hardware supports:
```csharp
using LMKit.Model;

// Auto-detect the maximum context for the current hardware
var fit = MemoryEstimation.FitParameters(
    modelPath: "path/to/qwen3.5-9b-Q4_K_M.lmk",
    contextSize: 0 // 0 = auto-detect
);

if (fit.Success)
{
    Console.WriteLine($"Max context: {fit.ContextSize} tokens");
    Console.WriteLine($"GPU layers: {fit.GpuLayerCount}");
}
```
Context vs GPU Layers Trade-Off
Context size and GPU offloading compete for the same memory pool. Requesting a larger context reduces the number of model layers that fit on GPU (and vice versa):
| Configuration | Context | GPU Layers | Speed |
|---|---|---|---|
| Maximum GPU layers, minimal context | 2,048 | All layers | Fastest generation |
| Balanced | 8,192 | Most layers | Good speed with useful context |
| Maximum context, fewer GPU layers | 32,768+ | Fewer layers | Slower, but handles long documents |
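The competition for the shared memory pool can be sketched with simple arithmetic. The budget and per-layer sizes below are illustrative assumptions (roughly a quantized mid-size model on an 8 GiB GPU), not numbers LM-Kit.NET reports; use FitParameters for real measurements:

```csharp
// Whatever the KV cache consumes is unavailable for model layers.
double vramGiB = 8.0;          // assumed GPU memory budget
double perLayerGiB = 0.14;     // assumed size of one quantized layer
double kvPerTokenMiB = 0.128;  // assumed KV-cache cost per token (128 KiB)

foreach (int context in new[] { 2_048, 8_192, 32_768 })
{
    double kvGiB = context * kvPerTokenMiB / 1024.0;
    int layersOnGpu = (int)((vramGiB - kvGiB) / perLayerGiB);
    Console.WriteLine($"{context,6} tokens -> {kvGiB:F2} GiB KV cache, ~{layersOnGpu} layers on GPU");
}
```

Under these assumptions, going from 2K to 32K tokens of context shifts about 4 GiB from model layers to cache, cutting the number of GPU-resident layers roughly in half.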
For most applications, 8K to 16K tokens provides enough context for multi-turn conversations and RAG queries while keeping inference fast.
When You Need Long Context
Certain use cases benefit from large context windows:
- Document Q&A over long PDFs: Insert entire document sections into the prompt.
- Multi-turn conversations with memory: Accumulate conversation history without truncation.
- Code analysis: Process entire source files for review, refactoring, or documentation.
- Meeting transcripts: Process full meeting recordings for summarization and action items.
For these scenarios, choose a model with 32K+ context (Qwen 3.5, GPT-OSS, GLM 4.7) and ensure your hardware has enough memory. The minimum context floor in LM-Kit.NET is 2,048 tokens.
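Before committing to a long-context configuration, it can help to estimate whether a document fits at all. The helper below is hypothetical (not part of LM-Kit.NET) and uses the common rule of thumb of roughly 4 characters per token for English text; the model's actual tokenizer may count differently:

```csharp
// Heuristic fit check: ~4 characters per token is a rough rule of thumb,
// not a substitute for tokenizing with the actual model.
static bool FitsInContext(string text, int contextLength, int reservedForOutput = 1024)
{
    int estimatedTokens = text.Length / 4;
    return estimatedTokens + reservedForOutput <= contextLength;
}

string document = File.ReadAllText("report.txt");
Console.WriteLine(FitsInContext(document, contextLength: 32_768)
    ? "Fits: send the whole document"
    : "Too long: chunk the document or retrieve relevant sections (RAG)");
```

Reserving some headroom for the model's output tokens, as the `reservedForOutput` parameter does here, avoids truncation when the prompt nearly fills the window.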
📚 Related Content
- How do I choose the right model size for my hardware?: Understand how model size and context interact with available memory.
- What happens when a model does not fit in my GPU memory?: Options for reducing context to free memory for model layers.
- Estimating Memory and Context Size: Full walkthrough of the MemoryEstimation API.
- Glossary: Context Windows: How context windows work in transformer models.
- Glossary: KV-Cache: Why context size affects memory consumption.