Optimize Memory with Context Recycling and KV-Cache Configuration
Every LLM inference allocates a context window that holds the KV-cache (key-value pairs from the attention mechanism). Creating and destroying these contexts for each request is expensive. LM-Kit.NET provides context recycling and KV-cache recycling that pool and reuse contexts across requests, dramatically reducing memory allocation overhead and improving throughput in server and batch scenarios.
This tutorial shows how to configure the caching system, tune its parameters for your workload, and monitor cache behavior.
Why Context Recycling Matters
Two real-world problems that context recycling solves:
- High-throughput API servers handling concurrent requests. Without recycling, each request allocates a new GPU context. On a server handling hundreds of requests per minute, allocation and deallocation become the bottleneck. Recycling reuses contexts, keeping latency low.
- Repetitive workloads with shared prefixes. Applications like document Q&A or chatbots process the same system prompt and document context repeatedly. KV-cache recycling detects the shared token prefix and skips re-computation, jumping straight to the new tokens.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 8 GB (16 GB recommended for large context sizes) |
| VRAM | 4+ GB |
Step 1: Create the Project
dotnet new console -n CacheOptimization
cd CacheOptimization
dotnet add package LM-Kit.NET
Step 2: Understand the Caching Architecture
Request arrives
│
▼
┌───────────────────┐
│ Context Pool │
│ (cached contexts)│
└────────┬──────────┘
│
┌───────────┴───────────┐
│ Exact size match? │
└───┬───────────────┬───┘
Yes No
│ │
▼ ▼
┌──────────────┐ ┌───────────────────┐
│ Reuse context│ │ Approximate match │
│ │ │ (within 1.2x) │
└──────┬───────┘ └────────┬──────────┘
│ │
└────────┬──────────┘
│
▼
┌──────────────────┐
│ KV-Cache prefix │
│ reuse (if match) │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Inference runs │
│ (only new tokens)│
└──────────────────┘
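As a rough illustration of the size-matching step, the check below mirrors the diagram's rule. It is an illustration only: the real pool-selection logic is internal to LM-Kit.NET, the 1.2x factor is taken from the diagram, and the helper name is hypothetical.

// Illustration only: approximates the pool's size-matching rule from the diagram.
// A cached context is reused when it is large enough for the request but no more
// than about 1.2x larger, so memory is not wasted on oversized contexts.
static bool CanReuseCachedContext(int cachedSize, int requestedSize) =>
    cachedSize >= requestedSize && cachedSize <= (int)(requestedSize * 1.2);

Console.WriteLine(CanReuseCachedContext(cachedSize: 4096, requestedSize: 3500)); // True:  4096 <= 4200
Console.WriteLine(CanReuseCachedContext(cachedSize: 8192, requestedSize: 3500)); // False: 8192 >  4200

The caching behavior itself is controlled by these Configuration properties: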
| Configuration | Purpose | Default |
|---|---|---|
| EnableContextRecycling | Pool contexts for reuse across requests | true |
| EnableKVCacheRecycling | Reuse KV-cache token prefixes between turns | true |
| MaxCachedContextLength | Maximum context token length to cache | 4096 |
| MinContextSize | Smallest context size to allocate (floor: 256) | 256 |
| EnableContextTokenHealing | Correct tokenization mismatches at context boundaries | true |
Step 3: Write the Program
using System.Diagnostics;
using System.Text;
using LMKit.Global;
using LMKit.Model;
using LMKit.TextGeneration;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Configure the caching system
// ──────────────────────────────────────
// Context recycling: reuse allocated contexts across requests
Configuration.EnableContextRecycling = true;
// KV-cache recycling: detect shared token prefixes and skip recomputation
Configuration.EnableKVCacheRecycling = true;
// Maximum context length to cache (increase for long-context workloads)
Configuration.MaxCachedContextLength = 8192;
// Minimum context size (smaller values save memory for short requests)
Configuration.MinContextSize = 512;
// Token healing corrects boundary artifacts when contexts are reused
Configuration.EnableContextTokenHealing = true;
Console.WriteLine("Cache configuration:");
Console.WriteLine($" Context recycling: {Configuration.EnableContextRecycling}");
Console.WriteLine($" KV-cache recycling: {Configuration.EnableKVCacheRecycling}");
Console.WriteLine($" Max cached context: {Configuration.MaxCachedContextLength} tokens");
Console.WriteLine($" Min context size: {Configuration.MinContextSize} tokens");
Console.WriteLine($" Token healing: {Configuration.EnableContextTokenHealing}");
Console.WriteLine();
// ──────────────────────────────────────
// 2. Load a model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
downloadingProgress: (path, contentLength, bytesRead) =>
{
if (contentLength.HasValue)
Console.Write($"\r Downloading: {(double)bytesRead / contentLength.Value * 100:F1}% ");
return true;
},
loadingProgress: progress =>
{
Console.Write($"\r Loading: {progress * 100:F0}% ");
return true;
});
Console.WriteLine($"\n Model loaded. Context length: {model.ContextLength}\n");
// ──────────────────────────────────────
// 3. Demonstrate KV-cache reuse with shared prefixes
// ──────────────────────────────────────
Console.WriteLine("=== KV-Cache Reuse Benchmark ===\n");
string sharedSystemPrompt = "You are a product support agent for Acme Corp. " +
"Answer questions about Acme products using the following knowledge base:\n\n" +
"The Acme Widget X100 is a portable device that weighs 250g. " +
"It has a battery life of 12 hours and charges via USB-C. " +
"The warranty covers manufacturing defects for 2 years. " +
"Common issues include firmware update failures (solved by holding reset for 5 seconds) " +
"and Bluetooth pairing problems (solved by clearing paired devices list).";
string[] questions = {
"How long does the battery last?",
"How do I fix a firmware update failure?",
"What does the warranty cover?",
"How do I fix Bluetooth issues?"
};
var chat = new SingleTurnConversation(model)
{
SystemPrompt = sharedSystemPrompt,
MaximumCompletionTokens = 128
};
// Time each request with a stopwatch; no streaming handler is attached, so responses print after completion
var sw = new Stopwatch();
foreach (string question in questions)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.Write($"Q: {question}");
Console.ResetColor();
sw.Restart();
var result = chat.Submit(question);
sw.Stop();
Console.WriteLine($" ({sw.ElapsedMilliseconds} ms, {result.TokenGenerationRate:F1} tok/s)");
Console.ForegroundColor = ConsoleColor.Cyan;
Console.WriteLine($"A: {result.Completion}\n");
Console.ResetColor();
}
// ──────────────────────────────────────
// 4. Compare with recycling disabled
// ──────────────────────────────────────
Console.WriteLine("=== Without KV-Cache Recycling ===\n");
Configuration.EnableKVCacheRecycling = false;
// Clear cached data so we start fresh
model.ClearCache();
foreach (string question in questions)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.Write($"Q: {question}");
Console.ResetColor();
sw.Restart();
var result = chat.Submit(question);
sw.Stop();
Console.WriteLine($" ({sw.ElapsedMilliseconds} ms, {result.TokenGenerationRate:F1} tok/s)");
Console.ForegroundColor = ConsoleColor.DarkGray;
Console.WriteLine($"A: {result.Completion}\n");
Console.ResetColor();
}
// Restore defaults
Configuration.EnableKVCacheRecycling = true;
Console.WriteLine("Done. Compare timings above: subsequent requests with shared " +
"prefixes should be noticeably faster when KV-cache recycling is enabled.");
Step 4: Run the Benchmark
dotnet run
Expected output pattern:
Q: How long does the battery last? (450 ms, 42.3 tok/s)
Q: How do I fix a firmware update failure? (180 ms, 45.1 tok/s) ← faster: prefix reused
Q: What does the warranty cover? (165 ms, 44.8 tok/s) ← faster: prefix reused
Q: How do I fix Bluetooth issues? (170 ms, 44.5 tok/s) ← faster: prefix reused
The first request computes the full KV-cache. Subsequent requests with the same system prompt reuse the cached prefix, so only the new question tokens are processed.
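Prefix reuse only applies when the cached tokens match exactly, so keep the shared portion of the prompt byte-identical across requests. The helper below is a minimal sketch of that pattern; BuildSupportChat is a hypothetical name, not an LM-Kit.NET API.

// Sketch: assemble the shared prefix once and reuse it verbatim, so every request
// presents the same token prefix to the KV-cache. Avoid embedding per-request data
// (timestamps, user names) in the system prompt; even a whitespace change forces a
// full recomputation.
using LMKit.Model;
using LMKit.TextGeneration;

static SingleTurnConversation BuildSupportChat(LM model, string knowledgeBase)
{
    return new SingleTurnConversation(model)
    {
        SystemPrompt = "You are a product support agent for Acme Corp. " +
                       "Answer questions using the following knowledge base:\n\n" +
                       knowledgeBase,
        MaximumCompletionTokens = 128
    };
}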
Tuning Cache Parameters
MaxCachedContextLength
Controls the largest context that will be pooled for reuse:
| Value | Use Case |
|---|---|
| 2048 | Chatbots with short conversations |
| 4096 (default) | General purpose |
| 8192 | Document Q&A with moderate context |
| 16384+ | Long-context RAG workloads |
Higher values use more memory but enable recycling for longer conversations.
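How much memory that costs depends on the model. The back-of-the-envelope estimate below is purely illustrative; the layer count, KV width, and precision are assumptions, not LM-Kit.NET internals.

// Rough KV-cache footprint estimate (all figures are illustrative assumptions):
int layers = 32;          // assumed transformer layer count
int kvWidth = 1024;       // assumed per-layer key/value width (after grouped-query attention)
int bytesPerValue = 2;    // fp16 cache entries
int contextTokens = 8192;

long kvBytes = 2L * layers * kvWidth * bytesPerValue * contextTokens; // keys + values
Console.WriteLine($"~{kvBytes / (1024.0 * 1024 * 1024):F1} GiB for a {contextTokens}-token cache");
// Prints ~1.0 GiB under these assumptions; doubling contextTokens doubles the footprint.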
MinContextSize
The smallest context the system will allocate. Values below 256 are clamped to 256:
// For a server handling many short queries
Configuration.MinContextSize = 256;
// For workloads where every request needs substantial context
Configuration.MinContextSize = 1024;
Smaller values reduce memory waste for short requests. Larger values avoid frequent re-allocation when requests grow.
When to Disable Caching
In most scenarios, keep both settings enabled. Consider disabling in these cases:
| Scenario | Action |
|---|---|
| Extremely memory-constrained devices (< 4 GB RAM) | Set MaxCachedContextLength to a small value like 1024 |
| One-shot batch processing (no request reuse) | Set EnableContextRecycling = false to skip pool overhead (see the sketch after this table) |
| Every request has unique content (no shared prefixes) | Prefix reuse won't apply, but context recycling still avoids re-allocation, so keep both settings enabled |
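The sketch below applies the two configuration changes from the table. The values are illustrative, and the properties are the ones documented earlier in this tutorial.

// Sketch: one-shot batch job with no request reuse (illustrative values).
using LMKit.Global;

Configuration.EnableContextRecycling = false;   // skip the context pool entirely

// For the memory-constrained row, keep recycling on but cap the pooled context size:
Configuration.MaxCachedContextLength = 1024;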
Clearing the Cache Manually
When you need to reclaim memory immediately:
// Release all cached contexts for a specific model
model.ClearCache();
This is useful when switching between different workloads or before loading a second model.
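For example, when swapping models, releasing the first model's cached contexts before loading the second frees memory up front. The model IDs below are placeholders, and the progress callbacks from Step 3 are omitted on the assumption that they are optional parameters.

// Sketch: switch models and reclaim cached contexts first (placeholder model IDs).
using LMKit.Model;

using (LM firstModel = LM.LoadFromModelID("first-model-id"))
{
    // ... run the first workload ...
    firstModel.ClearCache();  // release pooled contexts and cached KV data now,
                              // rather than waiting for disposal
}

using LM secondModel = LM.LoadFromModelID("second-model-id");
// ... run the second workload ...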
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| No speedup on subsequent requests | Different system prompts across requests | KV-cache reuse requires shared token prefixes. Use consistent system prompts |
| High memory usage | MaxCachedContextLength too large | Reduce it to match your typical context size |
| Garbled output at context boundaries | Token healing disabled | Set Configuration.EnableContextTokenHealing = true |
| First request is slow, rest are fast | Expected behavior | The first request builds the KV-cache. Recycling benefits subsequent calls |
Next Steps
- Handle Long Inputs with Overflow Policies: managing inputs that exceed context length.
- Distribute Large Models Across Multiple GPUs: split large models when VRAM is tight.
- Save and Restore Conversation Sessions: persist conversation state across app restarts.