
Optimize Memory with Context Recycling and KV-Cache Configuration

Every LLM inference allocates a context window that holds the KV-cache (the key-value pairs produced by the attention mechanism). Creating and destroying these contexts for each request is expensive. LM-Kit.NET provides context recycling and KV-cache recycling, which pool and reuse contexts across requests, dramatically reducing memory allocation overhead and improving throughput in server and batch scenarios.

This tutorial shows how to configure the caching system, tune its parameters for your workload, and monitor cache behavior.


Why Context Recycling Matters

Two real-world problems that context recycling solves:

  1. High-throughput API servers handling concurrent requests. Without recycling, each request allocates a new GPU context. On a server handling hundreds of requests per minute, allocation and deallocation become the bottleneck. Recycling reuses contexts, keeping latency low.
  2. Repetitive workloads with shared prefixes. Applications like document Q&A or chatbots process the same system prompt and document context repeatedly. KV-cache recycling detects the shared token prefix and skips re-computation, jumping straight to the new tokens.
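
The prefix detection in problem 2 amounts to a token-by-token comparison. Here is a minimal conceptual sketch, not LM-Kit.NET's actual implementation, of how a shared prefix is found:

// Conceptual sketch only: NOT LM-Kit.NET's internal code. It illustrates the
// principle: walk both token sequences until they diverge; everything before
// that point can be served from cached KV entries.
static int SharedPrefixLength(int[] cachedTokens, int[] newTokens)
{
    int limit = Math.Min(cachedTokens.Length, newTokens.Length);
    int i = 0;
    while (i < limit && cachedTokens[i] == newTokens[i])
        i++;
    return i; // Tokens [0, i) reuse the cache; only tokens from i onward are recomputed.
}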

Prerequisites

Requirement   Minimum
-----------   -------
.NET SDK      8.0+
RAM           8 GB (16 GB recommended for large context sizes)
VRAM          4+ GB

Step 1: Create the Project

dotnet new console -n CacheOptimization
cd CacheOptimization
dotnet add package LM-Kit.NET

Step 2: Understand the Caching Architecture

                    Request arrives
                         │
                         ▼
               ┌───────────────────┐
               │  Context Pool     │
               │  (cached contexts)│
               └────────┬──────────┘
                        │
            ┌───────────┴───────────┐
            │ Exact size match?     │
            └───┬───────────────┬───┘
               Yes              No
                │                │
                ▼                ▼
        ┌──────────────┐  ┌───────────────────┐
        │ Reuse context│  │ Approximate match │
        │              │  │ (within 1.2x)     │
        └──────┬───────┘  └────────┬──────────┘
               │                   │
               └────────┬──────────┘
                        │
                        ▼
              ┌──────────────────┐
              │ KV-Cache prefix  │
              │ reuse (if match) │
              └────────┬─────────┘
                       │
                       ▼
              ┌──────────────────┐
              │ Inference runs   │
              │ (only new tokens)│
              └──────────────────┘
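
The selection logic in the diagram can be sketched as follows. This is illustrative pseudologic, not the library's internal code; the CachedContext type and SelectContext helper are hypothetical names:

// Hypothetical sketch of the pool-selection flow shown in the diagram above.
record CachedContext(int Size);

static CachedContext? SelectContext(List<CachedContext> pool, int requestedSize)
{
    // 1. Prefer an exact size match.
    foreach (var ctx in pool)
        if (ctx.Size == requestedSize)
            return ctx;

    // 2. Otherwise accept an approximate match within 1.2x of the request.
    foreach (var ctx in pool)
        if (ctx.Size >= requestedSize && ctx.Size <= requestedSize * 1.2)
            return ctx;

    // 3. No usable match: the caller allocates a fresh context.
    return null;
}
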
These settings on the LMKit.Global.Configuration class control the caching system:

Configuration               Purpose                                                  Default
-------------               -------                                                  -------
EnableContextRecycling      Pool contexts for reuse across requests                  true
EnableKVCacheRecycling      Reuse KV-cache token prefixes between turns              true
MaxCachedContextLength      Maximum context token length to cache                    4096
MinContextSize              Smallest context size to allocate (floor: 256)           256
EnableContextTokenHealing   Correct tokenization mismatches at context boundaries    true

Step 3: Write the Program

using System.Diagnostics;
using System.Text;
using LMKit.Global;
using LMKit.Model;
using LMKit.TextGeneration;

// Set your LM-Kit license key here (left empty in this sample)
LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Configure the caching system
// ──────────────────────────────────────

// Context recycling: reuse allocated contexts across requests
Configuration.EnableContextRecycling = true;

// KV-cache recycling: detect shared token prefixes and skip recomputation
Configuration.EnableKVCacheRecycling = true;

// Maximum context length to cache (increase for long-context workloads)
Configuration.MaxCachedContextLength = 8192;

// Minimum context size (smaller values save memory for short requests)
Configuration.MinContextSize = 512;

// Token healing corrects boundary artifacts when contexts are reused
Configuration.EnableContextTokenHealing = true;

Console.WriteLine("Cache configuration:");
Console.WriteLine($"  Context recycling:     {Configuration.EnableContextRecycling}");
Console.WriteLine($"  KV-cache recycling:    {Configuration.EnableKVCacheRecycling}");
Console.WriteLine($"  Max cached context:    {Configuration.MaxCachedContextLength} tokens");
Console.WriteLine($"  Min context size:      {Configuration.MinContextSize} tokens");
Console.WriteLine($"  Token healing:         {Configuration.EnableContextTokenHealing}");
Console.WriteLine();

// ──────────────────────────────────────
// 2. Load a model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");

using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (path, contentLength, bytesRead) =>
    {
        if (contentLength.HasValue)
            Console.Write($"\r  Downloading: {(double)bytesRead / contentLength.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: progress =>
    {
        Console.Write($"\r  Loading: {progress * 100:F0}%   ");
        return true;
    });

Console.WriteLine($"\n  Model loaded. Context length: {model.ContextLength}\n");

// ──────────────────────────────────────
// 3. Demonstrate KV-cache reuse with shared prefixes
// ──────────────────────────────────────
Console.WriteLine("=== KV-Cache Reuse Benchmark ===\n");

string sharedSystemPrompt = "You are a product support agent for Acme Corp. " +
    "Answer questions about Acme products using the following knowledge base:\n\n" +
    "The Acme Widget X100 is a portable device that weighs 250g. " +
    "It has a battery life of 12 hours and charges via USB-C. " +
    "The warranty covers manufacturing defects for 2 years. " +
    "Common issues include firmware update failures (solved by holding reset for 5 seconds) " +
    "and Bluetooth pairing problems (solved by clearing paired devices list).";

string[] questions = {
    "How long does the battery last?",
    "How do I fix a firmware update failure?",
    "What does the warranty cover?",
    "How do I fix Bluetooth issues?"
};

var chat = new SingleTurnConversation(model)
{
    SystemPrompt = sharedSystemPrompt,
    MaximumCompletionTokens = 128
};

// No streaming callback is attached; each request is timed as a whole
var sw = new Stopwatch();

foreach (string question in questions)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write($"Q: {question}");
    Console.ResetColor();

    sw.Restart();
    var result = chat.Submit(question);
    sw.Stop();

    Console.WriteLine($"  ({sw.ElapsedMilliseconds} ms, {result.TokenGenerationRate:F1} tok/s)");

    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.WriteLine($"A: {result.Completion}\n");
    Console.ResetColor();
}

// ──────────────────────────────────────
// 4. Compare with recycling disabled
// ──────────────────────────────────────
Console.WriteLine("=== Without KV-Cache Recycling ===\n");

Configuration.EnableKVCacheRecycling = false;

// Clear cached data so we start fresh
model.ClearCache();

foreach (string question in questions)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write($"Q: {question}");
    Console.ResetColor();

    sw.Restart();
    var result = chat.Submit(question);
    sw.Stop();

    Console.WriteLine($"  ({sw.ElapsedMilliseconds} ms, {result.TokenGenerationRate:F1} tok/s)");
    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.WriteLine($"A: {result.Completion}\n");
    Console.ResetColor();
}

// Restore defaults
Configuration.EnableKVCacheRecycling = true;

Console.WriteLine("Done. Compare timings above: subsequent requests with shared " +
    "prefixes should be noticeably faster when KV-cache recycling is enabled.");

Step 4: Run the Benchmark

dotnet run

Expected output pattern:

Q: How long does the battery last?  (450 ms, 42.3 tok/s)
Q: How do I fix a firmware update failure?  (180 ms, 45.1 tok/s)   ← faster: prefix reused
Q: What does the warranty cover?  (165 ms, 44.8 tok/s)             ← faster: prefix reused
Q: How do I fix Bluetooth issues?  (170 ms, 44.5 tok/s)            ← faster: prefix reused

The first request computes the full KV-cache. Subsequent requests with the same system prompt reuse the cached prefix, so only the new question tokens are processed.
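
As a rough sanity check on those timings, you can estimate the saving. All figures below are assumptions for illustration, not measurements:

// Back-of-envelope estimate with assumed numbers (not measured):
// a shared system prompt of ~120 tokens at an assumed prefill throughput
// of ~500 tok/s means prefix reuse skips roughly 240 ms of prompt
// processing, broadly consistent with the drop from ~450 ms to ~180 ms above.
const int sharedPrefixTokens = 120;
const double prefillTokensPerSecond = 500;
double savedMs = sharedPrefixTokens / prefillTokensPerSecond * 1000;
Console.WriteLine($"Estimated prefill time saved: {savedMs:F0} ms");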


Tuning Cache Parameters

MaxCachedContextLength

Controls the largest context that will be pooled for reuse:

Value            Use Case
-----            --------
2048             Chatbots with short conversations
4096 (default)   General purpose
8192             Document Q&A with moderate context
16384+           Long-context RAG workloads

Higher values use more memory but enable recycling for longer conversations.
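
To put a number on "more memory": a per-context KV-cache footprint is roughly 2 (keys and values) x layers x KV heads x head dimension x bytes per element x context length. A hedged sketch; the model dimensions below are illustrative assumptions, not the parameters of any particular model:

// Rough KV-cache size estimate. Layer/head/dim defaults are assumptions
// typical of a small GQA model, not real model specs.
static long EstimateKvCacheBytes(int contextLength,
    int layers = 32, int kvHeads = 8, int headDim = 128, int bytesPerElement = 2)
{
    // The factor of 2 covers storing both keys and values at every layer.
    return 2L * layers * kvHeads * headDim * bytesPerElement * contextLength;
}

// With these assumptions: 4096 tokens -> ~0.5 GB, 8192 tokens -> ~1 GB per context.
Console.WriteLine($"{EstimateKvCacheBytes(8192) / (1024.0 * 1024 * 1024):F1} GB");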

MinContextSize

The smallest context the system will allocate. Values below 256 are clamped to 256:

// For a server handling many short queries
Configuration.MinContextSize = 256;

// For workloads where every request needs substantial context
Configuration.MinContextSize = 1024;

Smaller values reduce memory waste for short requests. Larger values avoid frequent re-allocation when requests grow.


When to Disable Caching

In most scenarios, keep both settings enabled. Consider disabling in these cases:

Scenario                                                 Action
--------                                                 ------
Extremely memory-constrained devices (< 4 GB RAM)        Set MaxCachedContextLength to a small value such as 1024
One-shot batch processing (no request reuse)             Set EnableContextRecycling = false to skip pooling overhead
Every request has unique content (no shared prefixes)    Keep both settings on; prefix reuse won't trigger, but context pooling still helps
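
For example, a one-shot batch job where no request ever reuses a context might configure the system like this (a sketch of the guidance above):

// One-shot batch processing: contexts are never reused, so skip pooling.
Configuration.EnableContextRecycling = false;

// Per the table above, KV-cache recycling can stay at its default (enabled);
// without shared prefixes it simply finds nothing to reuse.
Configuration.EnableKVCacheRecycling = true;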

Clearing the Cache Manually

When you need to reclaim memory immediately:

// Release all cached contexts for a specific model
model.ClearCache();

This is useful when switching between different workloads or before loading a second model.
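
A sketch of the model-switching case. "second-model-id" is a placeholder, and the call assumes the progress callbacks from Step 3 are optional parameters:

// Reclaim cached contexts from the first model before bringing up another,
// so both models' caches never occupy memory at the same time.
model.ClearCache();

// "second-model-id" is a placeholder, not a specific recommended model.
using LM secondModel = LM.LoadFromModelID("second-model-id");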


Common Issues

Problem                                Cause                                       Fix
-------                                -----                                       ---
No speedup on subsequent requests      Different system prompts across requests    KV-cache reuse requires shared token prefixes; use consistent system prompts
High memory usage                      MaxCachedContextLength too large            Reduce it to match your typical context size
Garbled output at context boundaries   Token healing disabled                      Set Configuration.EnableContextTokenHealing = true
First request slow, rest fast          Expected behavior                           The first request builds the KV-cache; recycling benefits subsequent calls

Next Steps