🚀 Understanding Context Windows (Context Length) for LLMs


📄 TL;DR

A context window is the maximum number of tokens a model can “see” at once (prompt + chat history + recalled memory + tool outputs + the tokens it is currently generating).

In LM-Kit.NET, MultiTurnConversation exposes:

  • ContextSize: the total token budget for the conversation window. (docs.lm-kit.com)
  • ContextRemainingSpace: how many tokens are still available right now. (docs.lm-kit.com)

You can also:

  • Override the context size at construction time (or let LM-Kit pick an optimal size). (docs.lm-kit.com)
  • Cap per-turn output with MaximumCompletionTokens (default 2048). (docs.lm-kit.com)
  • Inject long-term memory via Memory and cap it with MaximumRecallTokens. (docs.lm-kit.com)
  • Choose overflow behaviors using InferencePolicies (trim, shift KV cache, throw, stop generation). (docs.lm-kit.com)
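
In code, those knobs map onto a handful of members (a compact sketch; lm is assumed to be an already loaded LM, and the full commented example appears later in this article):

using var chat = new MultiTurnConversation(lm, contextSize: 8192); // or -1 to let LM-Kit pick an optimal size
chat.MaximumCompletionTokens = 512;                                // per-turn output cap (default 2048)
chat.MaximumRecallTokens = 1024;                                   // budget for recalled long-term memory
chat.InferencePolicies.InputLengthOverflowPolicy = InputLengthOverflowPolicy.TrimStart;
Console.WriteLine($"{chat.ContextRemainingSpace}/{chat.ContextSize} tokens free");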

🧠 What Exactly is a Context Window?

A model does not “read” your whole application or retain the entire conversation indefinitely. It processes a bounded window of tokens.

Think of it like a desk:

  • Everything on the desk is visible to the model.
  • Anything that falls off the desk might be forgotten unless you re-inject it (summary, memory recall, retrieval, etc.).

A practical token accounting view:

ContextUsed = System + History + Tools + RecalledMemory + CurrentUser + GeneratedTokens

If ContextUsed > ContextSize, you need a strategy.

LM-Kit explicitly models this in MultiTurnConversation with ContextSize and ContextRemainingSpace. (docs.lm-kit.com)
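
To make the accounting concrete, you can watch the two properties around a single turn (a minimal sketch, assuming chat is an already constructed MultiTurnConversation like the one built in the full example further down):

int before = chat.ContextRemainingSpace;
var answer = chat.Submit("Recap the key decisions so far.");
int after = chat.ContextRemainingSpace;

// The difference is roughly your prompt plus the generated answer (plus any recalled memory).
Console.WriteLine($"Used this turn: {before - after} of {chat.ContextSize} tokens; {after} remaining.");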


🛠️ Why Context Windows Matter

  1. Quality and coherence

    • Longer context helps keep instructions and earlier details consistent.
    • But if the window is full, important instructions may get trimmed out (which looks like “the model forgot”).
  2. Performance and memory

    • Larger context usually means more RAM usage and slower inference (KV cache grows with context).
    • LM-Kit exposes configuration like KV cache recycling for efficiency in .NET apps. (docs.lm-kit.com)
  3. Agent behavior

    • Planning, reflection, tool-calling, and memory recall all consume tokens.
    • If you do not budget for these, the “agentic” parts are the first to degrade.

🚧 What Happens When You Hit the Limit?

There are two common overflow problems:

1) Prompt too long (input overflow)

In LM-Kit, this is configurable via InferencePolicies.InputLengthOverflowPolicy. (docs.lm-kit.com)

Options include: (docs.lm-kit.com)

  • TrimAuto: automatically trims to fit
  • TrimStart: drop earliest tokens (often best for chat, keeps the most recent turns)
  • TrimEnd: drop the latest tokens
  • KVCacheShifting: shift KV cache to accommodate overflow
  • Throw: raise NotEnoughContextSizeException

2) Context fills up during generation (context overflow)

Controlled by InferencePolicies.ContextOverflowPolicy. (docs.lm-kit.com)

Options include: (docs.lm-kit.com)

  • KVCacheShifting: shifts cache contents to keep going
  • StopGeneration: stops and returns ContextSizeLimitExceeded

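If you would rather fail fast than silently lose context, you can opt into the Throw policy and handle the exception yourself (a minimal sketch, assuming chat is a MultiTurnConversation like the one in the full example below; oversizedPrompt is a placeholder, and the namespace that exposes NotEnoughContextSizeException may differ in your LM-Kit version):

chat.InferencePolicies.InputLengthOverflowPolicy = InputLengthOverflowPolicy.Throw;

try
{
    var result = chat.Submit(oversizedPrompt);
    Console.WriteLine(result.Content);
}
catch (NotEnoughContextSizeException)
{
    // The input alone does not fit: summarize, trim history, or raise ContextSize.
    Console.WriteLine("Prompt too long for the current context window.");
}
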
🔍 Practical Strategies That Actually Work

✅ Budget your output

In MultiTurnConversation, use MaximumCompletionTokens to cap how many tokens the assistant can generate per turn (default 2048). (docs.lm-kit.com) If you intentionally cap output, you can continue the response later with ContinueLastAssistantResponseAsync. (docs.lm-kit.com)
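
A minimal sketch of that pattern (assuming chat is an existing MultiTurnConversation and the code runs inside an async method):

chat.MaximumCompletionTokens = 256; // hard cap on each assistant turn

var first = chat.Submit("Explain KV cache shifting in detail.");
Console.WriteLine(first.Content);

// If the cap cut the answer short, explicitly ask the assistant to continue:
var rest = await chat.ContinueLastAssistantResponseAsync();
Console.WriteLine(rest.Content);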

✅ Use memory recall instead of stuffing the prompt

LM-Kit supports a long-term memory store:

  • Assign Memory (an AgentMemory implementation). (docs.lm-kit.com)
  • Control its budget with MaximumRecallTokens (defaults to ContextSize / 4, capped at ContextSize / 2). (docs.lm-kit.com)
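
A minimal sketch of wiring that up (chat is an existing MultiTurnConversation; CreateOrLoadAgentMemory is a hypothetical helper standing in for however you build your AgentMemory implementation):

AgentMemory memory = CreateOrLoadAgentMemory();   // hypothetical helper; returns your AgentMemory implementation

chat.Memory = memory;                             // enable long-term recall for this conversation
chat.MaximumRecallTokens = chat.ContextSize / 8;  // tighten the recall budget (default is ContextSize / 4)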

✅ Summarize when the chat grows

LM-Kit includes a Summarizer that can summarize text and also attachments (including PDFs and Microsoft Office formats via Attachment). (docs.lm-kit.com) It also supports an overflow strategy for overly long inputs, with default RecursiveSummarize. (docs.lm-kit.com)

✅ Persist and resume sessions

MultiTurnConversation can persist and restore full chat sessions (save session to file/bytes, restore from them). (docs.lm-kit.com)
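
The member names in the sketch below are assumptions for illustration (check the MultiTurnConversation API reference for the exact signatures); the workflow is what matters: serialize the session when the window gets tight or the app shuts down, then rebuild the conversation from it later.

// Assumed API shape, for illustration only: persist the current session to disk...
chat.SaveSession("sessions/user-42.bin");

// ...and later rebuild the conversation, history included, from that file.
using var resumed = new MultiTurnConversation(lm, "sessions/user-42.bin");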


🛠️ Context Window Management in LM-Kit.NET

Example: set context size, budgets, and overflow policies

using LMKit;
using LMKit.TextGeneration;
using LMKit.Inference;

// Optional: performance-oriented global configuration
LMKit.Global.Configuration.EnableKVCacheRecycling = true;

// Load a local model (ensure tensors/weights are loaded).
var lm = new LM("path/to/model.gguf", new LM.LoadingOptions { LoadTensors = true });

// Option A: let LM-Kit pick an optimal context size (contextSize = -1)
// Option B: force a context size (bounded by model limits)
using var chat = new MultiTurnConversation(lm, contextSize: 4096);

// Set system prompt BEFORE the first user message
chat.SystemPrompt = "You are a concise technical assistant.";

// Per-turn output budget
chat.MaximumCompletionTokens = 512;

// Overflow handling policies
chat.InferencePolicies.InputLengthOverflowPolicy = InputLengthOverflowPolicy.TrimStart;
chat.InferencePolicies.ContextOverflowPolicy = ContextOverflowPolicy.KVCacheShifting;

Console.WriteLine($"ContextSize: {chat.ContextSize} tokens");
Console.WriteLine($"ContextRemainingSpace: {chat.ContextRemainingSpace}");

while (true)
{
    Console.Write("> ");
    var prompt = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(prompt)) break;

    if (chat.ContextRemainingSpace < 256)
        Console.WriteLine("[Heads up] Context is getting tight. Consider summarizing or saving the session.");

    var result = chat.Submit(prompt);
    Console.WriteLine(result.Content);

    // If you capped MaximumCompletionTokens and want more, you can explicitly continue:
    // var more = await chat.ContinueLastAssistantResponseAsync();
    // Console.WriteLine(more.Content);
}

Example: summarizing older content to free space

using LMKit;
using LMKit.TextGeneration;

// Use a summarizer model (example from LM-Kit docs)
var model = LM.LoadFromModelID("lmkit-tasks:4b-preview");
var summarizer = new Summarizer(model)
{
    GenerateTitle = true,
    GenerateContent = true
};

// Older conversation turns you want to compress (placeholder text).
string longText = "...the older turns of the conversation go here...";

var summary = summarizer.Summarize(longText);
Console.WriteLine("Title: " + summary.Title);
Console.WriteLine("Summary: " + summary.Content); // :contentReference[oaicite:31]{index=31}

You can then inject that summary back into your conversation as a “compressed memory” message (or store it in your AgentMemory implementation), instead of keeping the entire raw history.
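
One simple way to do that with only the pieces already shown is to start a fresh conversation seeded with the compressed history (a sketch; lm is the chat model from the first example, and storing the summary in your AgentMemory implementation instead would work just as well):

// Start a lean conversation whose system prompt carries the compressed history.
using var freshChat = new MultiTurnConversation(lm, contextSize: 4096)
{
    SystemPrompt = "You are a concise technical assistant.\n" +
                   "Summary of the conversation so far:\n" + summary.Content
};

// Continue chatting; the raw older turns no longer occupy the window.
var reply = freshChat.Submit("Given the summary, what should we tackle next?");
Console.WriteLine(reply.Content);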


📖 Key Terms

  • Context Window / Context Length: max tokens visible at once.
  • Token Budget: the slice of ContextSize reserved for each component (history, tools, memory recall, output).
  • KV Cache: internal attention cache that grows with context; LM-Kit supports overflow policies like KV cache shifting. (docs.lm-kit.com)
  • Recall Tokens: tokens injected from long-term memory, capped by MaximumRecallTokens. (docs.lm-kit.com)
  • Overflow Policy: what happens when input or generation exceeds the limit. (docs.lm-kit.com)


🌐 External Resources