What Is the Maximum Context Length I Can Use?


TL;DR

There is no fixed SDK-wide limit. The maximum context length depends on two factors: the model's trained context window and your available memory. Most models support 8K to 128K tokens natively, but the practical limit is whatever your hardware can fit. LM-Kit.NET provides the MemoryEstimation.FitParameters() API to calculate the exact maximum context your hardware supports for a given model.


Context Length by Model

Each model has a maximum context window defined by its training. Here are the context lengths for popular models:

| Model | Max Context | Notes |
| --- | --- | --- |
| qwen3.5:0.8b to qwen3.5:27b | 32,768 tokens | Qwen 3.5 family |
| qwen3.5:27b | 131,072 tokens | Extended context |
| gemma4:e4b to gemma4:26b-a4b | 8,192 tokens | Gemma 4 family |
| gptoss:20b | 131,072 tokens | Long-context reasoning |
| glm4.7-flash | 131,072 tokens | Long-context MoE |
| phi4-mini:3.8b | 16,384 tokens | Compact Phi-4 |
| phi4:14.7b | 16,384 tokens | Full Phi-4 |
| llama3.1:8b | 131,072 tokens | Llama 3.1 |
| mistral-small | 32,768 tokens | Mistral Small |

You can check the context length of any loaded model programmatically:

using LMKit.Model;

using LM model = LM.LoadFromModelID("qwen3.5:9b");
Console.WriteLine($"Max context: {model.ContextLength} tokens");

Memory Limits the Practical Context

The model's trained context is an upper bound, but your actual usable context depends on available memory. Context consumes memory through the KV cache, which grows linearly with the number of tokens:

  • Small context (2K to 4K tokens): Minimal KV cache overhead. Leaves maximum memory for model layers.
  • Medium context (8K to 16K tokens): Moderate KV cache. Good balance for most chat and RAG applications.
  • Large context (32K to 128K tokens): Significant memory consumption. May require reducing GPU layers or using a smaller model.
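To make the linear growth concrete, here is a back-of-the-envelope KV cache estimate, sketched in Python for illustration. The formula is the standard transformer KV cache size (two tensors, K and V, per layer); the architecture numbers in the example (40 layers, 8 KV heads, head dimension 128, fp16 cache) are hypothetical placeholders for a mid-size model, not values reported by LM-Kit.NET:

```python
def kv_cache_bytes(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size for a standard transformer:
    2 tensors (K and V) per layer, each context_tokens x n_kv_heads x head_dim."""
    return 2 * n_layers * context_tokens * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical 9B-class model: 40 layers, 8 KV heads (GQA), head_dim 128, fp16 cache
for ctx in (4_096, 16_384, 131_072):
    gib = kv_cache_bytes(ctx, 40, 8, 128) / 2**30
    print(f"{ctx:>7} tokens -> ~{gib:.2f} GiB KV cache")
```

Doubling the context exactly doubles the cache, which is why a 128K window can cost an order of magnitude more memory than a typical 8K to 16K configuration.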

Use the MemoryEstimation API to find the maximum context your hardware supports:

using LMKit.Model;

// Auto-detect maximum context for your current hardware
var fit = MemoryEstimation.FitParameters(
    modelPath: "path/to/qwen3.5-9b-Q4_K_M.lmk",
    contextSize: 0  // 0 = auto-detect
);

if (fit.Success)
{
    Console.WriteLine($"Max context: {fit.ContextSize} tokens");
    Console.WriteLine($"GPU layers: {fit.GpuLayerCount}");
}

Context vs GPU Layers Trade-Off

Context size and GPU offloading compete for the same memory pool. Requesting a larger context reduces the number of model layers that fit on GPU (and vice versa):

| Configuration | Context | GPU Layers | Speed |
| --- | --- | --- | --- |
| Maximum GPU layers, minimal context | 2,048 | All layers | Fastest generation |
| Balanced | 8,192 | Most layers | Good speed with useful context |
| Maximum context, fewer GPU layers | 32,768+ | Fewer layers | Slower, but handles long documents |
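The trade-off above can be sketched with simple arithmetic (Python used here for illustration). All the numbers are hypothetical — an 8 GiB budget, a fixed per-layer weight cost, and a per-token KV cost — and this is exactly the measurement FitParameters performs properly against your real hardware:

```python
def layers_that_fit(budget_gib, context_tokens, layer_gib=0.22,
                    kv_gib_per_1k_tokens=0.16, total_layers=40):
    """How many model layers still fit on GPU after reserving
    KV cache memory for the requested context."""
    kv_gib = context_tokens / 1000 * kv_gib_per_1k_tokens
    remaining = budget_gib - kv_gib
    return max(0, min(total_layers, int(remaining / layer_gib)))

# Same 8 GiB budget, three context sizes: larger context -> fewer GPU layers
for ctx in (2_048, 8_192, 32_768):
    print(f"context {ctx:>6}: {layers_that_fit(8.0, ctx)} GPU layers")
```

The exact numbers are made up, but the shape of the result is the point: every gibibyte handed to the KV cache is a gibibyte taken from layer offloading.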

For most applications, 8K to 16K tokens provides enough context for multi-turn conversations and RAG queries while keeping inference fast.


When You Need Long Context

Certain use cases benefit from large context windows:

  • Document Q&A over long PDFs: Insert entire document sections into the prompt.
  • Multi-turn conversations with memory: Accumulate conversation history without truncation.
  • Code analysis: Process entire source files for review, refactoring, or documentation.
  • Meeting transcripts: Process full meeting recordings for summarization and action items.

For these scenarios, choose a model with 32K+ context (Qwen 3.5, GPT-OSS, GLM 4.7) and ensure your hardware has enough memory. The minimum context floor in LM-Kit.NET is 2,048 tokens.
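Putting the bounds together: a requested context is capped above by both the model's trained window and the hardware maximum, and raised to the SDK's 2,048-token floor. A minimal clamping sketch (Python for illustration; the 2,048 floor comes from the text above, while `model_max` and `hardware_max` stand in for whatever your model metadata and FitParameters report):

```python
SDK_MIN_CONTEXT = 2_048  # LM-Kit.NET minimum context floor

def clamp_context(requested, model_max, hardware_max):
    """Clamp a requested context length to the valid range."""
    upper = min(model_max, hardware_max)
    return max(SDK_MIN_CONTEXT, min(requested, upper))

print(clamp_context(200_000, model_max=131_072, hardware_max=49_152))  # 49152: hardware-bound
print(clamp_context(1_000,   model_max=32_768,  hardware_max=32_768))  # 2048: raised to the floor
```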

