Add Resilience to Agents with Retry, Circuit Breaker, and Fallback Policies
Production agents face transient failures: out-of-memory conditions during inference, context overflow on unexpectedly long inputs, or hardware glitches on GPU workloads. The LMKit.Agents.Resilience namespace provides composable policies that wrap agent execution with retry logic, circuit breakers, timeouts, rate limiting, and fallback chains, so your application recovers gracefully instead of crashing.
For background on agent execution, see the AI Agent Execution glossary entry.
Why This Matters
Two production problems that resilience policies solve:
- Transient GPU errors. CUDA out-of-memory or Vulkan device-lost errors happen under load spikes. A retry policy with exponential backoff gives the system time to release resources, turning a hard crash into a transparent recovery.
- Cascading failures in multi-agent systems. When one agent in an orchestrated workflow fails repeatedly, it can block the entire pipeline. A circuit breaker detects the pattern and fails fast, letting the orchestrator route work to healthy agents or return a degraded response instead of timing out.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 6+ GB |
Step 1: Create the Project
```bash
dotnet new console -n ResilientAgent
cd ResilientAgent
dotnet add package LM-Kit.NET
```
Step 2: Understand the Policy Types
Every policy implements IResiliencePolicy and can be used standalone or composed together.
| Policy | Purpose | When to use |
|---|---|---|
| `RetryPolicy` | Re-execute on failure with configurable delay and backoff | Transient errors (GPU hiccups, temporary resource exhaustion) |
| `CircuitBreakerPolicy` | Fail fast after repeated failures, then test recovery | Protect downstream resources from repeated failing calls |
| `TimeoutPolicy` | Cancel execution that exceeds a time budget | Prevent runaway inference on adversarial inputs |
| `RateLimitPolicy` | Throttle execution rate (token bucket) | Shared GPU environments, API rate limits |
| `BulkheadPolicy` | Limit concurrent executions | Prevent resource exhaustion from parallel requests |
| `FallbackPolicy<T>` | Return a fallback value on failure | Graceful degradation |
| `CompositePolicy` | Stack multiple policies (outermost executes first) | Production agents that need several protections |
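Most of these policies appear in the steps below, but RateLimitPolicy does not, so here is a rough, self-contained sketch of the token-bucket mechanism it is built on. This is an illustration of the general technique only; the variable and function names are invented for the example and are not LM-Kit API:

```csharp
using System;

// Token-bucket sketch: a bucket holds up to `capacity` tokens and refills at a
// steady rate; each execution spends one token, and an empty bucket means the
// caller is over the limit and should wait or reject.
double capacity = 2.0, refillPerSec = 1.0;
double tokens = capacity; // start full so an initial burst is allowed

// Caller reports how much wall-clock time passed since the previous call.
bool TryAcquire(double elapsedSeconds)
{
    tokens = Math.Min(capacity, tokens + elapsedSeconds * refillPerSec);
    if (tokens < 1.0)
        return false; // over the limit: delay or reject this execution
    tokens -= 1.0;
    return true;
}

Console.WriteLine(TryAcquire(0));   // True  (burst token 1)
Console.WriteLine(TryAcquire(0));   // True  (burst token 2)
Console.WriteLine(TryAcquire(0));   // False (bucket empty)
Console.WriteLine(TryAcquire(1.0)); // True  (1 s at 1 token/s refills one token)
```

A rate-limit policy built on this idea would wrap the agent call and delay until an acquire succeeds.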
Step 3: Add Retry with Exponential Backoff
The simplest resilience pattern: retry a few times with increasing delays.
```csharp
using System.Text;
using LMKit.Model;
using LMKit.Agents;
using LMKit.Agents.Resilience;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3.5:9b",
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");

var agent = Agent.CreateBuilder(model)
    .WithPersona("Assistant")
    .WithInstruction("You are a helpful assistant.")
    .Build();

// Retry up to 3 times with exponential backoff starting at 500 ms
var retryPolicy = new RetryPolicy(maxRetries: 3)
    .WithExponentialBackoff(
        initialDelay: TimeSpan.FromMilliseconds(500),
        multiplier: 2.0,
        maxDelay: TimeSpan.FromSeconds(10))
    .WithJitter(0.1)
    .OnRetry((exception, attempt, delay) =>
    {
        Console.WriteLine($"[Retry {attempt}] {exception.GetType().Name}: {exception.Message}");
        Console.WriteLine($"  Waiting {delay.TotalMilliseconds:F0}ms before next attempt...");
    });

// Execute with the policy
AgentExecutionResult result = await retryPolicy.ExecuteAsync(
    ct => new AgentExecutor().ExecuteAsync(agent, "Summarize the benefits of local AI.", ct));

Console.WriteLine(result.Content);
```
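As a sanity check on the configuration above, the nominal delays (before the ±10% jitter) follow initialDelay × multiplier^(attempt − 1), capped at maxDelay. The helper below is a hypothetical illustration of that arithmetic, not part of the LM-Kit API:

```csharp
using System;

// Nominal backoff delay before jitter: initial * multiplier^(attempt - 1),
// capped at max. Attempt numbers are 1-based.
TimeSpan DelayFor(int attempt, TimeSpan initial, double multiplier, TimeSpan max)
{
    double ms = initial.TotalMilliseconds * Math.Pow(multiplier, attempt - 1);
    return TimeSpan.FromMilliseconds(Math.Min(ms, max.TotalMilliseconds));
}

var initial = TimeSpan.FromMilliseconds(500);
var cap = TimeSpan.FromSeconds(10);
for (int attempt = 1; attempt <= 3; attempt++)
    Console.WriteLine($"attempt {attempt}: {DelayFor(attempt, initial, 2.0, cap).TotalMilliseconds} ms");
// attempt 1: 500 ms, attempt 2: 1000 ms, attempt 3: 2000 ms
```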
Step 4: Add a Circuit Breaker
A circuit breaker tracks failures and trips open after a threshold, preventing repeated attempts against a failing component.
```csharp
// Trip after 3 failures, then probe for recovery after 30 seconds
var circuitBreaker = new CircuitBreakerPolicy(
    failureThreshold: 3,
    recoveryTime: TimeSpan.FromSeconds(30))
    .OnStateChange((oldState, newState) =>
    {
        Console.WriteLine($"[Circuit] {oldState} -> {newState}");
    });

try
{
    AgentExecutionResult result = await circuitBreaker.ExecuteAsync(
        ct => new AgentExecutor().ExecuteAsync(agent, "Hello!", ct));
    Console.WriteLine(result.Content);
}
catch (CircuitBreakerOpenException)
{
    Console.WriteLine("Circuit is open. The agent is temporarily unavailable.");
}
```
The three circuit states:
| State | Behavior |
|---|---|
| Closed | Normal operation. Failures are counted. |
| Open | All calls fail immediately with CircuitBreakerOpenException. |
| HalfOpen | After recoveryTime, one test call is allowed. Success closes the circuit; failure reopens it. |
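The transitions in the table can be sketched as a tiny state machine. This is illustrative only; LM-Kit's CircuitBreakerPolicy adds thread safety, events, and agent integration, and the helper names below are invented for the example:

```csharp
using System;

// Minimal Closed/Open/HalfOpen state machine, driven by explicit timestamps
// so the recovery window is easy to follow.
int threshold = 3;
TimeSpan recovery = TimeSpan.FromSeconds(30);
string state = "Closed";
int failures = 0;
DateTime openedAt = default;

bool AllowCall(DateTime now)
{
    if (state == "Open" && now - openedAt >= recovery)
        state = "HalfOpen"; // recovery window elapsed: permit one probe call
    return state != "Open";
}

void RecordSuccess()
{
    failures = 0;
    state = "Closed"; // probe succeeded, or a normal call completed
}

void RecordFailure(DateTime now)
{
    if (state == "HalfOpen" || ++failures >= threshold)
    {
        state = "Open"; // trip, or re-trip after a failed probe
        openedAt = now;
        failures = 0;
    }
}

var t0 = new DateTime(2024, 1, 1);
RecordFailure(t0); RecordFailure(t0); RecordFailure(t0);
Console.WriteLine(state);                        // Open
Console.WriteLine(AllowCall(t0.AddSeconds(10))); // False (still open)
Console.WriteLine(AllowCall(t0.AddSeconds(30))); // True  (half-open probe)
RecordSuccess();
Console.WriteLine(state);                        // Closed
```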
Step 5: Compose Multiple Policies
Use CompositePolicy to stack policies. The outermost policy wraps the others.
```csharp
// Compose: Timeout -> Retry -> Circuit Breaker
// Execution order: timeout wraps retry, which wraps circuit breaker, which wraps the action
var compositePolicy = new CompositePolicy()
    .Wrap(new TimeoutPolicy(TimeSpan.FromMinutes(2)))
    .Wrap(new RetryPolicy(maxRetries: 2)
        .WithExponentialBackoff(TimeSpan.FromMilliseconds(500)))
    .Wrap(circuitBreaker);

AgentExecutionResult result = await compositePolicy.ExecuteAsync(
    ct => new AgentExecutor().ExecuteAsync(agent, "Explain circuit breakers.", ct));

Console.WriteLine(result.Content);
```
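To see why the outermost policy runs first, here is a tiny delegate-wrapping sketch of the same nesting, in plain C# with no LM-Kit types:

```csharp
using System;
using System.Collections.Generic;

// Each layer wraps the one added after it, so the first layer added ends up
// outermost: it starts first and finishes last.
var log = new List<string>();
Func<Action, string, Action> layer = (inner, name) => () =>
{
    log.Add($"{name}:enter");
    inner();
    log.Add($"{name}:exit");
};

Action core = () => log.Add("action");
Action pipeline = layer(layer(layer(core, "circuit-breaker"), "retry"), "timeout");
pipeline();

Console.WriteLine(string.Join(", ", log));
// timeout:enter, retry:enter, circuit-breaker:enter, action,
// circuit-breaker:exit, retry:exit, timeout:exit
```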
You can also compose using extension methods:
```csharp
var policy = new RetryPolicy(3)
    .WithTimeout(TimeSpan.FromMinutes(2))
    .WithCircuitBreaker(failureThreshold: 5, recoveryTime: TimeSpan.FromSeconds(30));
```
Step 6: Use ResilientAgentExecutor for Convenience
ResilientAgentExecutor wraps a standard AgentExecutor with a fluent resilience configuration.
```csharp
using var resilientExecutor = new ResilientAgentExecutor()
    .WithRetry(maxRetries: 3, initialDelay: TimeSpan.FromMilliseconds(500), useExponentialBackoff: true)
    .WithTimeout(TimeSpan.FromMinutes(2))
    .WithCircuitBreaker(failureThreshold: 5, recoveryTime: TimeSpan.FromSeconds(30))
    .OnRetry((exception, attempt) =>
    {
        Console.WriteLine($"[Retry {attempt}] {exception.Message}");
    })
    .OnCircuitStateChange((oldState, newState) =>
    {
        Console.WriteLine($"[Circuit] {oldState} -> {newState}");
    });

AgentExecutionResult result = await resilientExecutor.ExecuteAsync(agent, "What is resilience?");
Console.WriteLine(result.Content);
```
Step 7: Build a Fallback Agent Chain
FallbackAgentExecutor tries agents in sequence. If the primary agent fails, it falls back to the next one.
```csharp
// Primary: large, capable model
var primaryAgent = Agent.CreateBuilder(model)
    .WithPersona("Expert Assistant")
    .WithInstruction("Provide detailed, comprehensive answers.")
    .Build();

// Fallback: same model but a simpler persona (could also use a smaller model)
var fallbackAgent = Agent.CreateBuilder(model)
    .WithPersona("Simple Assistant")
    .WithInstruction("Provide brief, direct answers.")
    .Build();

using var fallbackExecutor = new FallbackAgentExecutor()
    .AddAgent(primaryAgent)
    .AddAgent(fallbackAgent)
    .OnFallback((failedAgent, exception, attemptIndex) =>
    {
        Console.WriteLine($"[Fallback] Agent #{attemptIndex} failed: {exception.Message}");
        Console.WriteLine("  Trying next agent...");
    });

AgentExecutionResult result = await fallbackExecutor.ExecuteAsync("Explain quantum computing.");
Console.WriteLine(result.Content);
```
Step 8: Monitor Agent Health
AgentHealthCheck tracks success rates and latency to determine whether an agent is performing well.
```csharp
var healthCheck = new AgentHealthCheck(agent)
    .WithSuccessRateThreshold(0.9)                  // Degraded below 90% success
    .WithLatencyThreshold(TimeSpan.FromSeconds(10)) // Degraded above 10s average
    .WithSampleWindow(TimeSpan.FromMinutes(5))
    .WithMaxSamples(100);

// Record outcomes as executions happen
var sw = System.Diagnostics.Stopwatch.StartNew();
try
{
    var result = await new AgentExecutor().ExecuteAsync(agent, "Hello!");
    sw.Stop();
    healthCheck.RecordSuccess(sw.Elapsed);
}
catch (Exception ex)
{
    sw.Stop();
    healthCheck.RecordFailure(sw.Elapsed, ex);
}

// Query health status
HealthStatus status = healthCheck.GetStatus();
Console.WriteLine($"Health: {status.Status}");
Console.WriteLine($"Success rate: {status.SuccessRate:P0}");
Console.WriteLine($"Avg latency: {status.AverageLatency.TotalMilliseconds:F0}ms");
Console.WriteLine($"Healthy: {status.IsHealthy}");
```
The four health states:
| State | Meaning |
|---|---|
| Unknown | Not enough samples to determine |
| Healthy | Meets all thresholds |
| Degraded | Below success rate or above latency threshold |
| Unhealthy | Significantly below thresholds or circuit breaker is open |
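The mapping from metrics to states can be sketched as a simple classification. Only the 0.9 success-rate and 10-second latency thresholds come from the configuration above; the minimum sample count and the Unhealthy cut-off below are assumptions for illustration, not documented LM-Kit values:

```csharp
using System;

// Hypothetical health classification mirroring the table above.
string Classify(int samples, double successRate, double avgLatencySec)
{
    if (samples < 10) return "Unknown";                        // not enough data yet
    if (successRate >= 0.9 && avgLatencySec <= 10) return "Healthy";
    if (successRate >= 0.5) return "Degraded";                 // misses a threshold
    return "Unhealthy";                                        // far below thresholds
}

Console.WriteLine(Classify(5, 1.0, 1.0));    // Unknown
Console.WriteLine(Classify(50, 0.95, 2.0));  // Healthy
Console.WriteLine(Classify(50, 0.95, 20.0)); // Degraded
Console.WriteLine(Classify(50, 0.3, 2.0));   // Unhealthy
```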
What to Read Next
- Build a Resilient Production Agent: end-to-end production agent patterns
- Orchestrate Multi-Agent Workflows with Patterns: combine resilience with orchestration
- Monitor Agent Execution with Tracing: observe retry and circuit breaker events
- AI Agent Execution: execution model fundamentals