Optimize Memory with Context Recycling and KV-Cache Configuration
Every LLM inference allocates a context window that holds the KV-cache (key-value pairs from the attention mechanism). Creating and destroying these contexts for each request is expensive. LM-Kit.NET provides context recycling and KV-cache recycling that pool and reuse contexts across requests, dramatically reducing memory allocation overhead and improving throughput in server and batch scenarios.
This tutorial shows how to configure the caching system, tune its parameters for your workload, and monitor cache behavior.
Why Context Recycling Matters
Two real-world problems that context recycling solves:
- High-throughput API servers handling concurrent requests. Without recycling, each request allocates a new GPU context. On a server handling hundreds of requests per minute, allocation and deallocation become the bottleneck. Recycling reuses contexts, keeping latency low.
- Repetitive workloads with shared prefixes. Applications like document Q&A or chatbots process the same system prompt and document context repeatedly. KV-cache recycling detects the shared token prefix and skips re-computation, jumping straight to the new tokens.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 8 GB (16 GB recommended for large context sizes) |
| VRAM | 4+ GB |
Step 1: Create the Project
dotnet new console -n CacheOptimization
cd CacheOptimization
dotnet add package LM-Kit.NET
Step 2: Understand the Caching Architecture
Request arrives
│
▼
┌───────────────────┐
│ Context Pool │
│ (cached contexts)│
└────────┬──────────┘
│
┌───────────┴───────────┐
│ Exact size match? │
└───┬───────────────┬───┘
Yes No
│ │
▼ ▼
┌──────────────┐ ┌───────────────────┐
│ Reuse context│ │ Approximate match │
│ │ │ (within 1.2x) │
└──────┬───────┘ └────────┬──────────┘
│ │
└────────┬──────────┘
│
▼
┌──────────────────┐
│ KV-Cache prefix │
│ reuse (if match) │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Inference runs │
│ (only new tokens)│
└──────────────────┘
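As a rough illustration of the size-matching step, the check below mirrors the diagram's rule. It is an illustration only: the real pool-selection logic is internal to LM-Kit.NET, the 1.2x factor is taken from the diagram, and the helper name is hypothetical.

// Illustration only: approximates the pool's size-matching rule from the diagram.
// A cached context is reused when it is large enough for the request but no more
// than about 1.2x larger, so memory is not wasted on oversized contexts.
static bool CanReuseCachedContext(int cachedSize, int requestedSize) =>
    cachedSize >= requestedSize && cachedSize <= (int)(requestedSize * 1.2);

Console.WriteLine(CanReuseCachedContext(cachedSize: 4096, requestedSize: 3500)); // True:  4096 <= 4200
Console.WriteLine(CanReuseCachedContext(cachedSize: 8192, requestedSize: 3500)); // False: 8192 >  4200

The caching behavior itself is controlled by these Configuration properties: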
| Configuration | Purpose | Default |
|---|---|---|
| EnableContextRecycling | Pool contexts for reuse across requests | true |
| EnableKVCacheRecycling | Reuse KV-cache token prefixes between turns | true |
| MaxCachedContextLength | Maximum context token length to cache | 4096 |
| MinContextSize | Smallest context size to allocate (floor: 256) | 256 |
| EnableContextTokenHealing | Correct tokenization mismatches at context boundaries | true |
Step 3: Write the Program
using System.Diagnostics;
using System.Text;
using LMKit.Global;
using LMKit.Model;
using LMKit.TextGeneration;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Configure the caching system
// ──────────────────────────────────────
// Context recycling: reuse allocated contexts across requests
Configuration.EnableContextRecycling = true;
// KV-cache recycling: detect shared token prefixes and skip recomputation
Configuration.EnableKVCacheRecycling = true;
// Maximum context length to cache (increase for long-context workloads)
Configuration.MaxCachedContextLength = 8192;
// Minimum context size (smaller values save memory for short requests)
Configuration.MinContextSize = 512;
// Token healing corrects boundary artifacts when contexts are reused
Configuration.EnableContextTokenHealing = true;
Console.WriteLine("Cache configuration:");
Console.WriteLine($" Context recycling: {Configuration.EnableContextRecycling}");
Console.WriteLine($" KV-cache recycling: {Configuration.EnableKVCacheRecycling}");
Console.WriteLine($" Max cached context: {Configuration.MaxCachedContextLength} tokens");
Console.WriteLine($" Min context size: {Configuration.MinContextSize} tokens");
Console.WriteLine($" Token healing: {Configuration.EnableContextTokenHealing}");
Console.WriteLine();
// ──────────────────────────────────────
// 2. Load a model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
downloadingProgress: (path, contentLength, bytesRead) =>
{
if (contentLength.HasValue)
Console.Write($"\r Downloading: {(double)bytesRead / contentLength.Value * 100:F1}% ");
return true;
},
loadingProgress: progress =>
{
Console.Write($"\r Loading: {progress * 100:F0}% ");
return true;
});
Console.WriteLine($"\n Model loaded. Context length: {model.ContextLength}\n");
// ──────────────────────────────────────
// 3. Demonstrate KV-cache reuse with shared prefixes
// ──────────────────────────────────────
Console.WriteLine("=== KV-Cache Reuse Benchmark ===\n");
string sharedSystemPrompt = "You are a product support agent for Acme Corp. " +
"Answer questions about Acme products using the following knowledge base:\n\n" +
"The Acme Widget X100 is a portable device that weighs 250g. " +
"It has a battery life of 12 hours and charges via USB-C. " +
"The warranty covers manufacturing defects for 2 years. " +
"Common issues include firmware update failures (solved by holding reset for 5 seconds) " +
"and Bluetooth pairing problems (solved by clearing paired devices list).";
string[] questions = {
"How long does the battery last?",
"How do I fix a firmware update failure?",
"What does the warranty cover?",
"How do I fix Bluetooth issues?"
};
var chat = new SingleTurnConversation(model)
{
SystemPrompt = sharedSystemPrompt,
MaximumCompletionTokens = 128
};
// Time each request with a stopwatch; no streaming handler is attached, so responses print after completion
var sw = new Stopwatch();
foreach (string question in questions)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.Write($"Q: {question}");
Console.ResetColor();
sw.Restart();
var result = chat.Submit(question);
sw.Stop();
Console.WriteLine($" ({sw.ElapsedMilliseconds} ms, {result.TokenGenerationRate:F1} tok/s)");
Console.ForegroundColor = ConsoleColor.Cyan;
Console.WriteLine($"A: {result.Completion}\n");
Console.ResetColor();
}
// ──────────────────────────────────────
// 4. Compare with recycling disabled
// ──────────────────────────────────────
Console.WriteLine("=== Without KV-Cache Recycling ===\n");
Configuration.EnableKVCacheRecycling = false;
// Clear cached data so we start fresh
model.ClearCache();
foreach (string question in questions)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.Write($"Q: {question}");
Console.ResetColor();
sw.Restart();
var result = chat.Submit(question);
sw.Stop();
Console.WriteLine($" ({sw.ElapsedMilliseconds} ms, {result.TokenGenerationRate:F1} tok/s)");
Console.ForegroundColor = ConsoleColor.DarkGray;
Console.WriteLine($"A: {result.Completion}\n");
Console.ResetColor();
}
// Restore defaults
Configuration.EnableKVCacheRecycling = true;
Console.WriteLine("Done. Compare timings above: subsequent requests with shared " +
"prefixes should be noticeably faster when KV-cache recycling is enabled.");
Step 4: Run the Benchmark
dotnet run
Expected output pattern:
Q: How long does the battery last? (450 ms, 42.3 tok/s)
Q: How do I fix a firmware update failure? (180 ms, 45.1 tok/s) ← faster: prefix reused
Q: What does the warranty cover? (165 ms, 44.8 tok/s) ← faster: prefix reused
Q: How do I fix Bluetooth issues? (170 ms, 44.5 tok/s) ← faster: prefix reused
The first request computes the full KV-cache. Subsequent requests with the same system prompt reuse the cached prefix, so only the new question tokens are processed.
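Prefix reuse only applies when the cached tokens match exactly, so keep the shared portion of the prompt byte-identical across requests. The helper below is a minimal sketch of that pattern; BuildSupportChat is a hypothetical name, not an LM-Kit.NET API.

// Sketch: assemble the shared prefix once and reuse it verbatim, so every request
// presents the same token prefix to the KV-cache. Avoid embedding per-request data
// (timestamps, user names) in the system prompt; even a whitespace change forces a
// full recomputation.
using LMKit.Model;
using LMKit.TextGeneration;

static SingleTurnConversation BuildSupportChat(LM model, string knowledgeBase)
{
    return new SingleTurnConversation(model)
    {
        SystemPrompt = "You are a product support agent for Acme Corp. " +
                       "Answer questions using the following knowledge base:\n\n" +
                       knowledgeBase,
        MaximumCompletionTokens = 128
    };
}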
Tuning Cache Parameters
MaxCachedContextLength
Controls the largest context that will be pooled for reuse:
| Value | Use Case |
|---|---|
| 2048 | Chatbots with short conversations |
| 4096 (default) | General purpose |
| 8192 | Document Q&A with moderate context |
| 16384+ | Long-context RAG workloads |
Higher values use more memory but enable recycling for longer conversations.
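How much memory that costs depends on the model. The back-of-the-envelope estimate below is purely illustrative; the layer count, KV width, and precision are assumptions, not LM-Kit.NET internals.

// Rough KV-cache footprint estimate (all figures are illustrative assumptions):
int layers = 32;          // assumed transformer layer count
int kvWidth = 1024;       // assumed per-layer key/value width (after grouped-query attention)
int bytesPerValue = 2;    // fp16 cache entries
int contextTokens = 8192;

long kvBytes = 2L * layers * kvWidth * bytesPerValue * contextTokens; // keys + values
Console.WriteLine($"~{kvBytes / (1024.0 * 1024 * 1024):F1} GiB for a {contextTokens}-token cache");
// Prints ~1.0 GiB under these assumptions; doubling contextTokens doubles the footprint.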
MinContextSize
The smallest context the system will allocate. Values below 256 are clamped to 256:
// For a server handling many short queries
Configuration.MinContextSize = 256;
// For workloads where every request needs substantial context
Configuration.MinContextSize = 1024;
Smaller values reduce memory waste for short requests. Larger values avoid frequent re-allocation when requests grow.
When to Disable Caching
In most scenarios, keep both settings enabled. Consider disabling in these cases:
| Scenario | Action |
|---|---|
| Extremely memory-constrained devices (< 4 GB RAM) | Set MaxCachedContextLength to a small value like 1024 |
| One-shot batch processing (no request reuse) | Set EnableContextRecycling = false to skip pool overhead (see the sketch after this table) |
| Every request has unique content (no shared prefixes) | Prefix reuse won't apply, but context recycling still avoids re-allocation, so keep both settings enabled |
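The sketch below applies the two configuration changes from the table. The values are illustrative, and the properties are the ones documented earlier in this tutorial.

// Sketch: one-shot batch job with no request reuse (illustrative values).
using LMKit.Global;

Configuration.EnableContextRecycling = false;   // skip the context pool entirely

// For the memory-constrained row, keep recycling on but cap the pooled context size:
Configuration.MaxCachedContextLength = 1024;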
Clearing the Cache Manually
When you need to reclaim memory immediately:
// Release all cached contexts for a specific model
model.ClearCache();
This is useful when switching between different workloads or before loading a second model.
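For example, when swapping models, releasing the first model's cached contexts before loading the second frees memory up front. The model IDs below are placeholders, and the progress callbacks from Step 3 are omitted on the assumption that they are optional parameters.

// Sketch: switch models and reclaim cached contexts first (placeholder model IDs).
using LMKit.Model;

using (LM firstModel = LM.LoadFromModelID("first-model-id"))
{
    // ... run the first workload ...
    firstModel.ClearCache();  // release pooled contexts and cached KV data now,
                              // rather than waiting for disposal
}

using LM secondModel = LM.LoadFromModelID("second-model-id");
// ... run the second workload ...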
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| No speedup on subsequent requests | Different system prompts across requests | KV-cache reuse requires shared token prefixes. Use consistent system prompts |
| High memory usage | MaxCachedContextLength too large | Reduce it to match your typical context size |
| Garbled output at context boundaries | Token healing disabled | Set Configuration.EnableContextTokenHealing = true |
| First request is slow, rest are fast | Expected behavior | The first request builds the KV-cache. Recycling benefits subsequent calls |
Next Steps
- Handle Long Inputs with Overflow Policies: managing inputs that exceed context length.
- Distribute Large Models Across Multiple GPUs: split large models when VRAM is tight.
- Save and Restore Conversation Sessions: persist conversation state across app restarts.