Handle Long Inputs with Overflow Policies

Real-world inputs are unpredictable. A customer support ticket might be 50 tokens or 50,000. A RAG system might inject more context than the model's window allows. LM-Kit.NET provides two policy mechanisms to handle these situations gracefully: InputLengthOverflowPolicy (what happens when the input exceeds the context window before generation starts) and ContextOverflowPolicy (what happens when the context fills up during generation). This tutorial shows how to configure both policies for different scenarios.


Why Overflow Policies Matter

Two enterprise problems that overflow policies solve:

  1. Preventing crashes in production pipelines. Without overflow policies, oversized inputs throw exceptions that crash batch processing jobs. A single malformed document or unexpectedly long ticket can halt an entire queue. Overflow policies let you handle these cases gracefully without losing the entire batch.
  2. Preserving the most relevant context. Trimming from the start keeps the most recent conversation turns, while trimming from the end preserves the beginning of the input, such as a document's opening sections or the original question. The right policy depends on the use case, and choosing the wrong one means the model answers from the wrong part of the input.

Prerequisites

Requirement | Minimum
.NET SDK | 8.0+
VRAM | 4+ GB
Disk | ~3 GB free for model download

Step 1: Create the Project

dotnet new console -n OverflowPolicyQuickstart
cd OverflowPolicyQuickstart
dotnet add package LM-Kit.NET

Step 2: Understanding the Default Behavior

using System.Text;
using LMKit.Model;
using LMKit.Inference;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Inspect default policies
// ──────────────────────────────────────
var chat = new MultiTurnConversation(model)
{
    SystemPrompt = "You are a helpful assistant.",
    MaximumCompletionTokens = 256
};

Console.WriteLine("Default policies:");
Console.WriteLine($"  Input overflow:   {chat.InferencePolicies.InputLengthOverflowPolicy}");
Console.WriteLine($"  Context overflow: {chat.InferencePolicies.ContextOverflowPolicy}");

Expected output:

Default policies:
  Input overflow:   TrimAuto
  Context overflow: KVCacheShifting

Step 3: Configuring Policies for Different Scenarios

Each scenario calls for a different combination of policies. Here are three common patterns.

Scenario A: Long Document Summarization (Keep the Start)

When summarizing documents, the most important content is typically at the beginning. Trim from the end to preserve the introduction, abstract, or opening sections:

var summarizer = new SingleTurnConversation(model)
{
    SystemPrompt = "Summarize the following document concisely.",
    MaximumCompletionTokens = 512
};

// If the document is too long, trim from the end (keep the beginning)
summarizer.InferencePolicies.InputLengthOverflowPolicy = InputLengthOverflowPolicy.TrimEnd;
summarizer.InferencePolicies.ContextOverflowPolicy = ContextOverflowPolicy.StopGeneration;
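
For example, you might read a long document from disk and hand it straight to the summarizer; with TrimEnd, the opening sections survive even when the file exceeds the context window. This is a minimal sketch, and the file name is a placeholder:

// Hypothetical input file; any long document works here
string document = File.ReadAllText("annual-report.txt");

// TrimEnd keeps the opening sections if the document exceeds the context window
var summary = summarizer.Submit($"Document:\n{document}");
Console.WriteLine(summary.Completion);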

Scenario B: Conversational Assistant (Keep Recent Context)

In multi-turn conversations, recent messages are usually more relevant than earlier ones. Trim from the start to keep the most recent exchanges:

var assistant = new MultiTurnConversation(model)
{
    SystemPrompt = "You are a helpful assistant.",
    MaximumCompletionTokens = 1024
};

// If conversation history exceeds context, trim from the start (keep recent messages)
assistant.InferencePolicies.InputLengthOverflowPolicy = InputLengthOverflowPolicy.TrimStart;
assistant.InferencePolicies.ContextOverflowPolicy = ContextOverflowPolicy.KVCacheShifting;
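
To see the trimming in action, you can replay a scripted exchange; as the accumulated history outgrows the context window, the oldest turns are dropped first. The questions below are placeholders, and the sketch assumes Submit returns the same result type used in Scenario C:

string[] turns =
{
    "Summarize the plot of Hamlet.",
    "Who betrays whom, and why?",
    "How does that compare to Macbeth?"
};

foreach (string turn in turns)
{
    // Each Submit appends to the conversation history; TrimStart removes the
    // oldest turns once the history no longer fits in the context window.
    var reply = assistant.Submit(turn);
    Console.WriteLine($"Q: {turn}\nA: {reply.Completion}\n");
}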

Scenario C: Strict Validation (Throw on Overflow)

For classification or validation tasks where partial input would produce wrong results, throw an exception and let the caller decide how to handle it:

var validator = new SingleTurnConversation(model)
{
    SystemPrompt = "Classify the following text.",
    MaximumCompletionTokens = 64
};

// Throw an exception if input is too long (let the caller handle it)
validator.InferencePolicies.InputLengthOverflowPolicy = InputLengthOverflowPolicy.Throw;

// Example oversized input (placeholder); any text longer than the context window will do
string veryLongText = string.Concat(Enumerable.Repeat("This is one sentence of a very long input. ", 50_000));

try
{
    var result = validator.Submit(veryLongText);
    Console.WriteLine($"Classification: {result.Completion}");
}
catch (LMKit.Exceptions.NotEnoughContextSizeException ex)
{
    Console.WriteLine($"Input too long: {ex.Message}");
    Console.WriteLine("Consider splitting the input or using a model with a larger context window.");
}
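
If you hit that exception regularly, one fallback (not an LM-Kit feature, just a sketch) is to split the oversized text into word-based chunks and classify each piece separately. The chunk size here is an arbitrary assumption you would tune to your model's context window:

// Hypothetical fallback: classify fixed-size chunks of the oversized input.
static IEnumerable<string> SplitIntoChunks(string text, int wordsPerChunk = 1000)
{
    string[] words = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);
    for (int i = 0; i < words.Length; i += wordsPerChunk)
        yield return string.Join(' ', words[i..Math.Min(i + wordsPerChunk, words.Length)]);
}

foreach (string chunk in SplitIntoChunks(veryLongText))
{
    var chunkResult = validator.Submit(chunk);
    Console.WriteLine($"Chunk classification: {chunkResult.Completion}");
}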

Step 4: Monitoring Context Usage

Track how much of the context window is consumed during a conversation so you can take action before overflow occurs:

var chat = new MultiTurnConversation(model)
{
    SystemPrompt = "You are a helpful assistant.",
    MaximumCompletionTokens = 512
};

chat.InferencePolicies.InputLengthOverflowPolicy = InputLengthOverflowPolicy.TrimStart;

chat.AfterTextCompletion += (_, e) =>
{
    if (e.SegmentType == TextSegmentType.UserVisible)
        Console.Write(e.Text);
};

while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("You: ");
    Console.ResetColor();

    string? input = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(input) || input.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;

    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.Write("Assistant: ");
    Console.ResetColor();

    var result = chat.Submit(input);

    int remaining = chat.ContextRemainingSpace;
    int total = model.ContextLength;
    double usage = (double)(total - remaining) / total * 100;

    Console.WriteLine($"\n  [context: {usage:F0}% used, {remaining} tokens remaining]\n");
}

Input Length Overflow Policies

Policy | When input is too long | Best for
TrimAuto | System chooses the best trim strategy | General purpose (default)
TrimStart | Removes oldest tokens | Conversations (keep recent context)
TrimEnd | Removes newest tokens | Document processing (keep the beginning)
KVCacheShifting | Shifts the KV cache window | Long-running generation tasks
Throw | Raises NotEnoughContextSizeException | Strict validation, custom handling

Context Overflow Policies

Policy | When context fills during generation | Best for
KVCacheShifting | Dynamically shifts the cache | Long responses (default)
StopGeneration | Stops and returns what was generated | Bounded output, predictable behavior

Common Issues

Problem | Cause | Fix
NotEnoughContextSizeException | Input exceeds context with Throw policy | Switch to TrimAuto or split input into smaller chunks
Response cuts off mid-sentence | StopGeneration policy triggered | Use KVCacheShifting or increase context size
Old messages forgotten in chat | TrimStart removed early conversation | Use AgentMemory for long-term recall across sessions
Garbled output after long conversation | Context shifting artifacts | Clear history periodically with chat.ClearHistory()

Next Steps