Add Resilience to Agents with Retry, Circuit Breaker, and Fallback Policies
Production agents face transient failures: out-of-memory conditions during inference, context overflow on unexpectedly long inputs, or hardware glitches on GPU workloads. The LMKit.Agents.Resilience namespace provides composable policies that wrap agent execution with retry logic, circuit breakers, timeouts, rate limiting, and fallback chains, so your application recovers gracefully instead of crashing.
For background on agent execution, see the AI Agent Execution glossary entry.
Why This Matters
Two production problems that resilience policies solve:
- Transient GPU errors. CUDA out-of-memory or Vulkan device-lost errors happen under load spikes. A retry policy with exponential backoff gives the system time to release resources, turning a hard crash into a transparent recovery.
- Cascading failures in multi-agent systems. When one agent in an orchestrated workflow fails repeatedly, it can block the entire pipeline. A circuit breaker detects the pattern and fails fast, letting the orchestrator route work to healthy agents or return a degraded response instead of timing out.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 6+ GB |
Step 1: Create the Project
```bash
dotnet new console -n ResilientAgent
cd ResilientAgent
dotnet add package LM-Kit.NET
```
Step 2: Understand the Policy Types
Every policy implements IResiliencePolicy and can be used standalone or composed together.
| Policy | Purpose | When to use |
|---|---|---|
| `RetryPolicy` | Re-execute on failure with configurable delay and backoff | Transient errors (GPU hiccups, temporary resource exhaustion) |
| `CircuitBreakerPolicy` | Fail fast after repeated failures, then test recovery | Protect downstream resources from repeated failing calls |
| `TimeoutPolicy` | Cancel execution that exceeds a time budget | Prevent runaway inference on adversarial inputs |
| `RateLimitPolicy` | Throttle execution rate (token bucket) | Shared GPU environments, API rate limits |
| `BulkheadPolicy` | Limit concurrent executions | Prevent resource exhaustion from parallel requests |
| `FallbackPolicy<T>` | Return a fallback value on failure | Graceful degradation |
| `CompositePolicy` | Stack multiple policies (outermost executes first) | Production agents that need several protections |
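Most of these policies appear in the steps below, but RateLimitPolicy does not, so here is a rough, self-contained sketch of the token-bucket mechanism it is built on. This is an illustration of the general technique only; the variable and function names are invented for the example and are not LM-Kit API:

```csharp
using System;

// Token-bucket sketch: a bucket holds up to `capacity` tokens and refills at a
// steady rate; each execution spends one token, and an empty bucket means the
// caller is over the limit and should wait or reject.
double capacity = 2.0, refillPerSec = 1.0;
double tokens = capacity; // start full so an initial burst is allowed

// Caller reports how much wall-clock time passed since the previous call.
bool TryAcquire(double elapsedSeconds)
{
    tokens = Math.Min(capacity, tokens + elapsedSeconds * refillPerSec);
    if (tokens < 1.0)
        return false; // over the limit: delay or reject this execution
    tokens -= 1.0;
    return true;
}

Console.WriteLine(TryAcquire(0));   // True  (burst token 1)
Console.WriteLine(TryAcquire(0));   // True  (burst token 2)
Console.WriteLine(TryAcquire(0));   // False (bucket empty)
Console.WriteLine(TryAcquire(1.0)); // True  (1 s at 1 token/s refills one token)
```

A rate-limit policy built on this idea would wrap the agent call and delay until an acquire succeeds.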
Step 3: Add Retry with Exponential Backoff
The simplest resilience pattern: retry a few times with increasing delays.
```csharp
using System.Text;
using LMKit.Model;
using LMKit.Agents;
using LMKit.Agents.Resilience;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("qwen3.5:9b",
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");

var agent = Agent.CreateBuilder(model)
    .WithPersona("Assistant")
    .WithInstruction("You are a helpful assistant.")
    .Build();

// Retry up to 3 times with exponential backoff starting at 500 ms
var retryPolicy = new RetryPolicy(maxRetries: 3)
    .WithExponentialBackoff(
        initialDelay: TimeSpan.FromMilliseconds(500),
        multiplier: 2.0,
        maxDelay: TimeSpan.FromSeconds(10))
    .WithJitter(0.1)
    .OnRetry((exception, attempt, delay) =>
    {
        Console.WriteLine($"[Retry {attempt}] {exception.GetType().Name}: {exception.Message}");
        Console.WriteLine($"  Waiting {delay.TotalMilliseconds:F0}ms before next attempt...");
    });

// Execute with the policy
AgentExecutionResult result = await retryPolicy.ExecuteAsync(
    ct => new AgentExecutor().ExecuteAsync(agent, "Summarize the benefits of local AI.", ct));

Console.WriteLine(result.Content);
```
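As a sanity check on the configuration above, the nominal delays (before the ±10% jitter) follow initialDelay × multiplier^(attempt − 1), capped at maxDelay. The helper below is a hypothetical illustration of that arithmetic, not part of the LM-Kit API:

```csharp
using System;

// Nominal backoff delay before jitter: initial * multiplier^(attempt - 1),
// capped at max. Attempt numbers are 1-based.
TimeSpan DelayFor(int attempt, TimeSpan initial, double multiplier, TimeSpan max)
{
    double ms = initial.TotalMilliseconds * Math.Pow(multiplier, attempt - 1);
    return TimeSpan.FromMilliseconds(Math.Min(ms, max.TotalMilliseconds));
}

var initial = TimeSpan.FromMilliseconds(500);
var cap = TimeSpan.FromSeconds(10);
for (int attempt = 1; attempt <= 3; attempt++)
    Console.WriteLine($"attempt {attempt}: {DelayFor(attempt, initial, 2.0, cap).TotalMilliseconds} ms");
// attempt 1: 500 ms, attempt 2: 1000 ms, attempt 3: 2000 ms
```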
Step 4: Add a Circuit Breaker
A circuit breaker tracks failures and trips open after a threshold, preventing repeated attempts against a failing component.
```csharp
// Trip after 3 failures, then probe for recovery after 30 seconds
var circuitBreaker = new CircuitBreakerPolicy(
    failureThreshold: 3,
    recoveryTime: TimeSpan.FromSeconds(30))
    .OnStateChange((oldState, newState) =>
    {
        Console.WriteLine($"[Circuit] {oldState} -> {newState}");
    });

try
{
    AgentExecutionResult result = await circuitBreaker.ExecuteAsync(
        ct => new AgentExecutor().ExecuteAsync(agent, "Hello!", ct));
    Console.WriteLine(result.Content);
}
catch (CircuitBreakerOpenException)
{
    Console.WriteLine("Circuit is open. The agent is temporarily unavailable.");
}
```
The three circuit states:
| State | Behavior |
|---|---|
| Closed | Normal operation. Failures are counted. |
| Open | All calls fail immediately with CircuitBreakerOpenException. |
| HalfOpen | After recoveryTime, one test call is allowed. Success closes the circuit; failure reopens it. |
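The transitions in the table can be sketched as a tiny state machine. This is illustrative only; LM-Kit's CircuitBreakerPolicy adds thread safety, events, and agent integration, and the helper names below are invented for the example:

```csharp
using System;

// Minimal Closed/Open/HalfOpen state machine, driven by explicit timestamps
// so the recovery window is easy to follow.
int threshold = 3;
TimeSpan recovery = TimeSpan.FromSeconds(30);
string state = "Closed";
int failures = 0;
DateTime openedAt = default;

bool AllowCall(DateTime now)
{
    if (state == "Open" && now - openedAt >= recovery)
        state = "HalfOpen"; // recovery window elapsed: permit one probe call
    return state != "Open";
}

void RecordSuccess()
{
    failures = 0;
    state = "Closed"; // probe succeeded, or a normal call completed
}

void RecordFailure(DateTime now)
{
    if (state == "HalfOpen" || ++failures >= threshold)
    {
        state = "Open"; // trip, or re-trip after a failed probe
        openedAt = now;
        failures = 0;
    }
}

var t0 = new DateTime(2024, 1, 1);
RecordFailure(t0); RecordFailure(t0); RecordFailure(t0);
Console.WriteLine(state);                        // Open
Console.WriteLine(AllowCall(t0.AddSeconds(10))); // False (still open)
Console.WriteLine(AllowCall(t0.AddSeconds(30))); // True  (half-open probe)
RecordSuccess();
Console.WriteLine(state);                        // Closed
```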
Step 5: Compose Multiple Policies
Use CompositePolicy to stack policies. The outermost policy wraps the others.
```csharp
// Compose: Timeout -> Retry -> Circuit Breaker
// Execution order: timeout wraps retry, which wraps circuit breaker, which wraps the action
var compositePolicy = new CompositePolicy()
    .Wrap(new TimeoutPolicy(TimeSpan.FromMinutes(2)))
    .Wrap(new RetryPolicy(maxRetries: 2)
        .WithExponentialBackoff(TimeSpan.FromMilliseconds(500)))
    .Wrap(circuitBreaker);

AgentExecutionResult result = await compositePolicy.ExecuteAsync(
    ct => new AgentExecutor().ExecuteAsync(agent, "Explain circuit breakers.", ct));

Console.WriteLine(result.Content);
```
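To see why the outermost policy runs first, here is a tiny delegate-wrapping sketch of the same nesting, in plain C# with no LM-Kit types:

```csharp
using System;
using System.Collections.Generic;

// Each layer wraps the one added after it, so the first layer added ends up
// outermost: it starts first and finishes last.
var log = new List<string>();
Func<Action, string, Action> layer = (inner, name) => () =>
{
    log.Add($"{name}:enter");
    inner();
    log.Add($"{name}:exit");
};

Action core = () => log.Add("action");
Action pipeline = layer(layer(layer(core, "circuit-breaker"), "retry"), "timeout");
pipeline();

Console.WriteLine(string.Join(", ", log));
// timeout:enter, retry:enter, circuit-breaker:enter, action,
// circuit-breaker:exit, retry:exit, timeout:exit
```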
You can also compose using extension methods:
```csharp
var policy = new RetryPolicy(3)
    .WithTimeout(TimeSpan.FromMinutes(2))
    .WithCircuitBreaker(failureThreshold: 5, recoveryTime: TimeSpan.FromSeconds(30));
```
Step 6: Use ResilientAgentExecutor for Convenience
ResilientAgentExecutor wraps a standard AgentExecutor with a fluent resilience configuration.
```csharp
using var resilientExecutor = new ResilientAgentExecutor()
    .WithRetry(maxRetries: 3, initialDelay: TimeSpan.FromMilliseconds(500), useExponentialBackoff: true)
    .WithTimeout(TimeSpan.FromMinutes(2))
    .WithCircuitBreaker(failureThreshold: 5, recoveryTime: TimeSpan.FromSeconds(30))
    .OnRetry((exception, attempt) =>
    {
        Console.WriteLine($"[Retry {attempt}] {exception.Message}");
    })
    .OnCircuitStateChange((oldState, newState) =>
    {
        Console.WriteLine($"[Circuit] {oldState} -> {newState}");
    });

AgentExecutionResult result = await resilientExecutor.ExecuteAsync(agent, "What is resilience?");
Console.WriteLine(result.Content);
```
Step 7: Build a Fallback Agent Chain
FallbackAgentExecutor tries agents in sequence. If the primary agent fails, it falls back to the next one.
```csharp
// Primary: large, capable model
var primaryAgent = Agent.CreateBuilder(model)
    .WithPersona("Expert Assistant")
    .WithInstruction("Provide detailed, comprehensive answers.")
    .Build();

// Fallback: same model but a simpler persona (could also use a smaller model)
var fallbackAgent = Agent.CreateBuilder(model)
    .WithPersona("Simple Assistant")
    .WithInstruction("Provide brief, direct answers.")
    .Build();

using var fallbackExecutor = new FallbackAgentExecutor()
    .AddAgent(primaryAgent)
    .AddAgent(fallbackAgent)
    .OnFallback((failedAgent, exception, attemptIndex) =>
    {
        Console.WriteLine($"[Fallback] Agent #{attemptIndex} failed: {exception.Message}");
        Console.WriteLine("  Trying next agent...");
    });

AgentExecutionResult result = await fallbackExecutor.ExecuteAsync("Explain quantum computing.");
Console.WriteLine(result.Content);
```
Step 8: Monitor Agent Health
AgentHealthCheck tracks success rates and latency to determine whether an agent is performing well.
```csharp
var healthCheck = new AgentHealthCheck(agent)
    .WithSuccessRateThreshold(0.9)                  // Degraded below 90% success
    .WithLatencyThreshold(TimeSpan.FromSeconds(10)) // Degraded above 10s average
    .WithSampleWindow(TimeSpan.FromMinutes(5))
    .WithMaxSamples(100);

// Record outcomes as executions happen
var sw = System.Diagnostics.Stopwatch.StartNew();
try
{
    var result = await new AgentExecutor().ExecuteAsync(agent, "Hello!");
    sw.Stop();
    healthCheck.RecordSuccess(sw.Elapsed);
}
catch (Exception ex)
{
    sw.Stop();
    healthCheck.RecordFailure(sw.Elapsed, ex);
}

// Query health status
HealthStatus status = healthCheck.GetStatus();
Console.WriteLine($"Health: {status.Status}");
Console.WriteLine($"Success rate: {status.SuccessRate:P0}");
Console.WriteLine($"Avg latency: {status.AverageLatency.TotalMilliseconds:F0}ms");
Console.WriteLine($"Healthy: {status.IsHealthy}");
```
The four health states:
| State | Meaning |
|---|---|
| Unknown | Not enough samples to determine |
| Healthy | Meets all thresholds |
| Degraded | Below success rate or above latency threshold |
| Unhealthy | Significantly below thresholds or circuit breaker is open |
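The mapping from metrics to states can be sketched as a simple classification. Only the 0.9 success-rate and 10-second latency thresholds come from the configuration above; the minimum sample count and the Unhealthy cut-off below are assumptions for illustration, not documented LM-Kit values:

```csharp
using System;

// Hypothetical health classification mirroring the table above.
string Classify(int samples, double successRate, double avgLatencySec)
{
    if (samples < 10) return "Unknown";                        // not enough data yet
    if (successRate >= 0.9 && avgLatencySec <= 10) return "Healthy";
    if (successRate >= 0.5) return "Degraded";                 // misses a threshold
    return "Unhealthy";                                        // far below thresholds
}

Console.WriteLine(Classify(5, 1.0, 1.0));    // Unknown
Console.WriteLine(Classify(50, 0.95, 2.0));  // Healthy
Console.WriteLine(Classify(50, 0.95, 20.0)); // Degraded
Console.WriteLine(Classify(50, 0.3, 2.0));   // Unhealthy
```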
What to Read Next
- Build a Resilient Production Agent: end-to-end production agent patterns
- Orchestrate Multi-Agent Workflows with Patterns: combine resilience with orchestration
- Monitor Agent Execution with Tracing: observe retry and circuit breaker events
- AI Agent Execution: execution model fundamentals