What Is the Maximum Context Length I Can Use?


TL;DR

There is no fixed SDK-wide limit. The maximum context length depends on two factors: the model's trained context window and your available memory. Most models support 8K to 128K tokens natively, but the practical limit is whatever your hardware can fit. LM-Kit.NET provides the MemoryEstimation.FitParameters() API to calculate the exact maximum context your hardware supports for a given model.


Context Length by Model

Each model has a maximum context window defined by its training. Here are the context lengths for popular models:

| Model | Max Context | Notes |
| --- | --- | --- |
| qwen3.5:0.8b to qwen3.5:27b | 32,768 tokens | Qwen 3.5 family |
| qwen3.5:27b | 131,072 tokens | Extended context |
| gemma4:e4b to gemma4:26b-a4b | 8,192 tokens | Gemma 4 family |
| gptoss:20b | 131,072 tokens | Long-context reasoning |
| glm4.7-flash | 131,072 tokens | Long-context MoE |
| phi4-mini:3.8b | 16,384 tokens | Compact Phi-4 |
| phi4:14.7b | 16,384 tokens | Full Phi-4 |
| llama3.1:8b | 131,072 tokens | Llama 3.1 |
| mistral-small | 32,768 tokens | Mistral Small |

You can check the context length of any loaded model programmatically:

using LMKit.Model;

using LM model = LM.LoadFromModelID("qwen3.5:9b");
Console.WriteLine($"Max context: {model.ContextLength} tokens");

Memory Limits the Practical Context

The model's trained context is an upper bound, but your actual usable context depends on available memory. Context consumes memory through the KV cache, which grows linearly with the number of tokens:

  • Small context (2K to 4K tokens): Minimal KV cache overhead. Leaves maximum memory for model layers.
  • Medium context (8K to 16K tokens): Moderate KV cache. Good balance for most chat and RAG applications.
  • Large context (32K to 128K tokens): Significant memory consumption. May require reducing GPU layers or using a smaller model.
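To make the linear growth concrete, here is a back-of-the-envelope KV cache estimate, sketched in Python for illustration. The formula is the standard transformer KV cache size (two tensors, K and V, per layer); the architecture numbers in the example (40 layers, 8 KV heads, head dimension 128, fp16 cache) are hypothetical placeholders for a mid-size model, not values reported by LM-Kit.NET:

```python
def kv_cache_bytes(context_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV cache size for a standard transformer:
    2 tensors (K and V) per layer, each context_tokens x n_kv_heads x head_dim."""
    return 2 * n_layers * context_tokens * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical 9B-class model: 40 layers, 8 KV heads (GQA), head_dim 128, fp16 cache
for ctx in (4_096, 16_384, 131_072):
    gib = kv_cache_bytes(ctx, 40, 8, 128) / 2**30
    print(f"{ctx:>7} tokens -> ~{gib:.2f} GiB KV cache")
```

Doubling the context exactly doubles the cache, which is why a 128K window can cost an order of magnitude more memory than a typical 8K to 16K configuration.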

Use the MemoryEstimation API to find the maximum context your hardware supports:

using LMKit.Model;

// Auto-detect maximum context for your current hardware
var fit = MemoryEstimation.FitParameters(
    modelPath: "path/to/qwen3.5-9b-Q4_K_M.lmk",
    contextSize: 0  // 0 = auto-detect
);

if (fit.Success)
{
    Console.WriteLine($"Max context: {fit.ContextSize} tokens");
    Console.WriteLine($"GPU layers: {fit.GpuLayerCount}");
}

Context vs GPU Layers Trade-Off

Context size and GPU offloading compete for the same memory pool. Requesting a larger context reduces the number of model layers that fit on GPU (and vice versa):

| Configuration | Context | GPU Layers | Speed |
| --- | --- | --- | --- |
| Maximum GPU layers, minimal context | 2,048 | All layers | Fastest generation |
| Balanced | 8,192 | Most layers | Good speed with useful context |
| Maximum context, fewer GPU layers | 32,768+ | Fewer layers | Slower, but handles long documents |
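The trade-off above can be sketched with simple arithmetic (Python used here for illustration). All the numbers are hypothetical — an 8 GiB budget, a fixed per-layer weight cost, and a per-token KV cost — and this is exactly the measurement FitParameters performs properly against your real hardware:

```python
def layers_that_fit(budget_gib, context_tokens, layer_gib=0.22,
                    kv_gib_per_1k_tokens=0.16, total_layers=40):
    """How many model layers still fit on GPU after reserving
    KV cache memory for the requested context."""
    kv_gib = context_tokens / 1000 * kv_gib_per_1k_tokens
    remaining = budget_gib - kv_gib
    return max(0, min(total_layers, int(remaining / layer_gib)))

# Same 8 GiB budget, three context sizes: larger context -> fewer GPU layers
for ctx in (2_048, 8_192, 32_768):
    print(f"context {ctx:>6}: {layers_that_fit(8.0, ctx)} GPU layers")
```

The exact numbers are made up, but the shape of the result is the point: every gibibyte handed to the KV cache is a gibibyte taken from layer offloading.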

For most applications, 8K to 16K tokens provides enough context for multi-turn conversations and RAG queries while keeping inference fast.


When You Need Long Context

Certain use cases benefit from large context windows:

  • Document Q&A over long PDFs: Insert entire document sections into the prompt.
  • Multi-turn conversations with memory: Accumulate conversation history without truncation.
  • Code analysis: Process entire source files for review, refactoring, or documentation.
  • Meeting transcripts: Process full meeting recordings for summarization and action items.

For these scenarios, choose a model with 32K+ context (Qwen 3.5, GPT-OSS, GLM 4.7) and ensure your hardware has enough memory. The minimum context floor in LM-Kit.NET is 2,048 tokens.
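Putting the bounds together: a requested context is capped above by both the model's trained window and the hardware maximum, and raised to the SDK's 2,048-token floor. A minimal clamping sketch (Python for illustration; the 2,048 floor comes from the text above, while `model_max` and `hardware_max` stand in for whatever your model metadata and FitParameters report):

```python
SDK_MIN_CONTEXT = 2_048  # LM-Kit.NET minimum context floor

def clamp_context(requested, model_max, hardware_max):
    """Clamp a requested context length to the valid range."""
    upper = min(model_max, hardware_max)
    return max(SDK_MIN_CONTEXT, min(requested, upper))

print(clamp_context(200_000, model_max=131_072, hardware_max=49_152))  # 49152: hardware-bound
print(clamp_context(1_000,   model_max=32_768,  hardware_max=32_768))  # 2048: raised to the floor
```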

