Handle Long Inputs with Overflow Policies

Real-world inputs are unpredictable. A customer support ticket might be 50 tokens or 50,000. A RAG system might inject more context than the model's window allows. LM-Kit.NET provides two policy mechanisms to handle these situations gracefully: InputLengthOverflowPolicy (what happens when the input exceeds the context window before generation starts) and ContextOverflowPolicy (what happens when the context fills up during generation). This tutorial shows how to configure both policies for different scenarios.


Why Overflow Policies Matter

Two enterprise problems that overflow policies solve:

  1. Preventing crashes in production pipelines. Without overflow policies, oversized inputs throw exceptions that crash batch processing jobs. A single malformed document or unexpectedly long ticket can halt an entire queue. Overflow policies let you handle these cases gracefully without losing the entire batch.
  2. Preserving the most relevant context. Trimming from the start keeps the most recent conversation turns, while trimming from the end preserves the beginning of the input, such as a document's opening sections or the original question. The right policy depends on the use case, and choosing the wrong one means the model answers from the wrong part of the input.

Prerequisites

Requirement | Minimum
.NET SDK | 8.0+
VRAM | 4+ GB
Disk | ~3 GB free for model download

Step 1: Create the Project

dotnet new console -n OverflowPolicyQuickstart
cd OverflowPolicyQuickstart
dotnet add package LM-Kit.NET

Step 2: Understanding the Default Behavior

using System.Text;
using LMKit.Model;
using LMKit.Inference;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Inspect default policies
// ──────────────────────────────────────
var chat = new MultiTurnConversation(model)
{
    SystemPrompt = "You are a helpful assistant.",
    MaximumCompletionTokens = 256
};

Console.WriteLine("Default policies:");
Console.WriteLine($"  Input overflow:   {chat.InferencePolicies.InputLengthOverflowPolicy}");
Console.WriteLine($"  Context overflow: {chat.InferencePolicies.ContextOverflowPolicy}");

Expected output:

Default policies:
  Input overflow:   TrimAuto
  Context overflow: KVCacheShifting

Step 3: Configuring Policies for Different Scenarios

Each scenario calls for a different combination of policies. Here are three common patterns.

Scenario A: Long Document Summarization (Keep the Start)

When summarizing documents, the most important content is typically at the beginning. Trim from the end to preserve the introduction, abstract, or opening sections:

var summarizer = new SingleTurnConversation(model)
{
    SystemPrompt = "Summarize the following document concisely.",
    MaximumCompletionTokens = 512
};

// If the document is too long, trim from the end (keep the beginning)
summarizer.InferencePolicies.InputLengthOverflowPolicy = InputLengthOverflowPolicy.TrimEnd;
summarizer.InferencePolicies.ContextOverflowPolicy = ContextOverflowPolicy.StopGeneration;
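
For example, you might read a long document from disk and hand it straight to the summarizer; with TrimEnd, the opening sections survive even when the file exceeds the context window. This is a minimal sketch, and the file name is a placeholder:

// Hypothetical input file; any long document works here
string document = File.ReadAllText("annual-report.txt");

// TrimEnd keeps the opening sections if the document exceeds the context window
var summary = summarizer.Submit($"Document:\n{document}");
Console.WriteLine(summary.Completion);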

Scenario B: Conversational Assistant (Keep Recent Context)

In multi-turn conversations, recent messages are usually more relevant than earlier ones. Trim from the start to keep the most recent exchanges:

var assistant = new MultiTurnConversation(model)
{
    SystemPrompt = "You are a helpful assistant.",
    MaximumCompletionTokens = 1024
};

// If conversation history exceeds context, trim from the start (keep recent messages)
assistant.InferencePolicies.InputLengthOverflowPolicy = InputLengthOverflowPolicy.TrimStart;
assistant.InferencePolicies.ContextOverflowPolicy = ContextOverflowPolicy.KVCacheShifting;
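
To see the trimming in action, you can replay a scripted exchange; as the accumulated history outgrows the context window, the oldest turns are dropped first. The questions below are placeholders, and the sketch assumes Submit returns the same result type used in Scenario C:

string[] turns =
{
    "Summarize the plot of Hamlet.",
    "Who betrays whom, and why?",
    "How does that compare to Macbeth?"
};

foreach (string turn in turns)
{
    // Each Submit appends to the conversation history; TrimStart removes the
    // oldest turns once the history no longer fits in the context window.
    var reply = assistant.Submit(turn);
    Console.WriteLine($"Q: {turn}\nA: {reply.Completion}\n");
}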

Scenario C: Strict Validation (Throw on Overflow)

For classification or validation tasks where partial input would produce wrong results, throw an exception and let the caller decide how to handle it:

var validator = new SingleTurnConversation(model)
{
    SystemPrompt = "Classify the following text.",
    MaximumCompletionTokens = 64
};

// Throw an exception if input is too long (let the caller handle it)
validator.InferencePolicies.InputLengthOverflowPolicy = InputLengthOverflowPolicy.Throw;

// Example oversized input (placeholder); any text longer than the context window will do
string veryLongText = string.Concat(Enumerable.Repeat("This is one sentence of a very long input. ", 50_000));

try
{
    var result = validator.Submit(veryLongText);
    Console.WriteLine($"Classification: {result.Completion}");
}
catch (LMKit.Exceptions.NotEnoughContextSizeException ex)
{
    Console.WriteLine($"Input too long: {ex.Message}");
    Console.WriteLine("Consider splitting the input or using a model with a larger context window.");
}
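
If you hit that exception regularly, one fallback (not an LM-Kit feature, just a sketch) is to split the oversized text into word-based chunks and classify each piece separately. The chunk size here is an arbitrary assumption you would tune to your model's context window:

// Hypothetical fallback: classify fixed-size chunks of the oversized input.
static IEnumerable<string> SplitIntoChunks(string text, int wordsPerChunk = 1000)
{
    string[] words = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);
    for (int i = 0; i < words.Length; i += wordsPerChunk)
        yield return string.Join(' ', words[i..Math.Min(i + wordsPerChunk, words.Length)]);
}

foreach (string chunk in SplitIntoChunks(veryLongText))
{
    var chunkResult = validator.Submit(chunk);
    Console.WriteLine($"Chunk classification: {chunkResult.Completion}");
}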

Step 4: Monitoring Context Usage

Track how much of the context window is consumed during a conversation so you can take action before overflow occurs:

var chat = new MultiTurnConversation(model)
{
    SystemPrompt = "You are a helpful assistant.",
    MaximumCompletionTokens = 512
};

chat.InferencePolicies.InputLengthOverflowPolicy = InputLengthOverflowPolicy.TrimStart;

chat.AfterTextCompletion += (_, e) =>
{
    if (e.SegmentType == TextSegmentType.UserVisible)
        Console.Write(e.Text);
};

while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("You: ");
    Console.ResetColor();

    string? input = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(input) || input.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;

    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.Write("Assistant: ");
    Console.ResetColor();

    var result = chat.Submit(input);

    int remaining = chat.ContextRemainingSpace;
    int total = model.ContextLength;
    double usage = (double)(total - remaining) / total * 100;

    Console.WriteLine($"\n  [context: {usage:F0}% used, {remaining} tokens remaining]\n");
}

Input Length Overflow Policies

Policy | When input is too long | Best for
TrimAuto | System chooses the best trim strategy | General purpose (default)
TrimStart | Removes oldest tokens | Conversations (keep recent context)
TrimEnd | Removes newest tokens | Document processing (keep the beginning)
KVCacheShifting | Shifts the KV cache window | Long-running generation tasks
Throw | Raises NotEnoughContextSizeException | Strict validation, custom handling

Context Overflow Policies

Policy | When context fills during generation | Best for
KVCacheShifting | Dynamically shifts the cache | Long responses (default)
StopGeneration | Stops and returns what was generated | Bounded output, predictable behavior

Common Issues

Problem | Cause | Fix
NotEnoughContextSizeException | Input exceeds context with Throw policy | Switch to TrimAuto or split input into smaller chunks
Response cuts off mid-sentence | StopGeneration policy triggered | Use KVCacheShifting or increase context size
Old messages forgotten in chat | TrimStart removed early conversation | Use AgentMemory for long-term recall across sessions
Garbled output after long conversation | Context shifting artifacts | Clear history periodically with chat.ClearHistory()

Next Steps