Add Telemetry and Observability to Your LM-Kit.NET Application
Running AI inference in production without visibility is like driving without a dashboard. You need to know how fast tokens are generated, how much latency each request incurs, and whether your models are performing within acceptable bounds. This tutorial builds a telemetry layer that tracks token usage, generation speed, latency, and session-level statistics for your LM-Kit.NET application.
Why This Matters
Two production problems that observability solves:
- Detecting performance degradation before users notice. Token generation rates can drop when VRAM pressure increases, when the KV-cache fills up, or when concurrent requests exceed capacity. Real-time metrics let you set alerts and scale before response times become unacceptable.
- Tracking resource consumption for capacity planning. Knowing your average prompt token count, completion token count, and generation rate per request lets you forecast hardware needs, estimate costs per query, and identify which prompts or workflows consume the most resources.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| VRAM | 4+ GB (for a 4B model) |
| Disk | ~3 GB free for model download |
Step 1: Create the Project
dotnet new console -n TelemetryQuickstart
cd TelemetryQuickstart
dotnet add package LM-Kit.NET
Step 2: Run Inference and Collect Metrics
LM-Kit.NET's TextGenerationResult provides built-in metrics after every inference call. Start by capturing these:
using System.Text;
using System.Diagnostics;
using LMKit.Model;
using LMKit.TextGeneration;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}% "); return true; });
Console.WriteLine("\n");
// ──────────────────────────────────────
// 2. Run inference and collect metrics
// ──────────────────────────────────────
var chat = new MultiTurnConversation(model)
{
SystemPrompt = "You are a helpful assistant.",
MaximumCompletionTokens = 256
};
// Track performance with Stopwatch
var stopwatch = Stopwatch.StartNew();
var result = chat.Submit("Explain the benefits of local AI inference in three sentences.");
stopwatch.Stop();
// ──────────────────────────────────────
// 3. Report metrics
// ──────────────────────────────────────
Console.WriteLine($"Response: {result.Completion}\n");
Console.WriteLine("── Inference Metrics ──");
Console.WriteLine($" Prompt tokens: {result.PromptTokenCount}");
Console.WriteLine($" Completion tokens: {result.GeneratedTokenCount}");
Console.WriteLine($" Total tokens: {result.PromptTokenCount + result.GeneratedTokenCount}");
Console.WriteLine($" Generation rate: {result.TokenGenerationRate:F1} tokens/sec");
Console.WriteLine($" Wall-clock time: {stopwatch.ElapsedMilliseconds} ms");
Console.WriteLine($" Time to first tok: {result.TimeToFirstToken.TotalMilliseconds:F0} ms");
Step 3: Build a Custom Metrics Tracker
For production systems, accumulate metrics across multiple requests to compute session-level statistics:
public class InferenceMetricsTracker
{
private int _totalRequests;
private int _totalPromptTokens;
private int _totalCompletionTokens;
private double _totalLatencyMs;
public void Record(TextGenerationResult result, double latencyMs)
{
Interlocked.Increment(ref _totalRequests);
Interlocked.Add(ref _totalPromptTokens, result.PromptTokenCount);
Interlocked.Add(ref _totalCompletionTokens, result.GeneratedTokenCount);
        // Interlocked has no atomic add for double, so accumulate latency with a
        // CompareExchange loop to avoid losing updates under concurrent calls.
        double current, updated;
        do
        {
            current = Volatile.Read(ref _totalLatencyMs);
            updated = current + latencyMs;
        } while (Interlocked.CompareExchange(ref _totalLatencyMs, updated, current) != current);
}
public void PrintSummary()
{
Console.WriteLine("\n── Session Summary ──");
Console.WriteLine($" Total requests: {_totalRequests}");
Console.WriteLine($" Total prompt tokens: {_totalPromptTokens}");
Console.WriteLine($" Total gen tokens: {_totalCompletionTokens}");
Console.WriteLine($" Avg latency: {(_totalRequests > 0 ? _totalLatencyMs / _totalRequests : 0):F0} ms");
}
}
Wire the tracker into your inference loop:
var tracker = new InferenceMetricsTracker();
string[] prompts =
{
"What is retrieval-augmented generation?",
"Explain the difference between TCP and UDP.",
"List three benefits of running AI models locally."
};
foreach (string prompt in prompts)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.WriteLine($"\nPrompt: {prompt}");
Console.ResetColor();
var sw = Stopwatch.StartNew();
var result = chat.Submit(prompt);
sw.Stop();
tracker.Record(result, sw.ElapsedMilliseconds);
Console.WriteLine($" Response: {(result.Completion.Length > 100 ? result.Completion.Substring(0, 100) + "..." : result.Completion)}");
Console.ForegroundColor = ConsoleColor.DarkGray;
Console.WriteLine($" [{result.GeneratedTokenCount} tokens, {result.TokenGenerationRate:F1} tok/s, {sw.ElapsedMilliseconds} ms]");
Console.ResetColor();
}
tracker.PrintSummary();
Step 4: Monitor Inference Performance Over Time
For long-running services, track metrics with timestamps so you can detect trends and regressions:
public class TimestampedMetric
{
public DateTime Timestamp { get; init; }
public int PromptTokens { get; init; }
public int CompletionTokens { get; init; }
public double TokensPerSecond { get; init; }
public double LatencyMs { get; init; }
public double TimeToFirstTokenMs { get; init; }
}
public class PerformanceMonitor
{
private readonly List<TimestampedMetric> _metrics = new();
private readonly object _lock = new();
public void Record(TextGenerationResult result, double latencyMs)
{
var metric = new TimestampedMetric
{
Timestamp = DateTime.UtcNow,
PromptTokens = result.PromptTokenCount,
CompletionTokens = result.GeneratedTokenCount,
TokensPerSecond = result.TokenGenerationRate,
LatencyMs = latencyMs,
TimeToFirstTokenMs = result.TimeToFirstToken.TotalMilliseconds
};
lock (_lock)
{
_metrics.Add(metric);
}
}
public void PrintReport()
{
List<TimestampedMetric> snapshot;
lock (_lock)
{
snapshot = new List<TimestampedMetric>(_metrics);
}
if (snapshot.Count == 0)
{
Console.WriteLine("No metrics recorded.");
return;
}
Console.WriteLine("\n── Performance Report ──");
Console.WriteLine($" Total requests: {snapshot.Count}");
Console.WriteLine($" Avg tokens/sec: {snapshot.Average(m => m.TokensPerSecond):F1}");
Console.WriteLine($" Avg latency: {snapshot.Average(m => m.LatencyMs):F0} ms");
Console.WriteLine($" P95 latency: {Percentile(snapshot.Select(m => m.LatencyMs), 0.95):F0} ms");
Console.WriteLine($" Avg time to first token: {snapshot.Average(m => m.TimeToFirstTokenMs):F0} ms");
Console.WriteLine($" Total prompt tokens: {snapshot.Sum(m => m.PromptTokens)}");
Console.WriteLine($" Total completion tokens: {snapshot.Sum(m => m.CompletionTokens)}");
// Check for performance degradation
if (snapshot.Count >= 10)
{
var recentAvg = snapshot.TakeLast(5).Average(m => m.TokensPerSecond);
var overallAvg = snapshot.Average(m => m.TokensPerSecond);
if (recentAvg < overallAvg * 0.8)
{
Console.ForegroundColor = ConsoleColor.Yellow;
Console.WriteLine($"\n WARNING: Recent throughput ({recentAvg:F1} tok/s) is 20%+ below average ({overallAvg:F1} tok/s)");
Console.ResetColor();
}
}
}
private static double Percentile(IEnumerable<double> values, double percentile)
{
var sorted = values.OrderBy(v => v).ToList();
int index = (int)Math.Ceiling(percentile * sorted.Count) - 1;
return sorted[Math.Max(0, index)];
}
}
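PerformanceMonitor plugs into the same loop as the tracker from Step 3. One way to wire it in, reusing the chat instance and prompts array from above:
var monitor = new PerformanceMonitor();

foreach (string prompt in prompts)
{
    var sw = Stopwatch.StartNew();
    var result = chat.Submit(prompt);
    sw.Stop();

    // Record a timestamped sample for trend analysis
    monitor.Record(result, sw.ElapsedMilliseconds);
}

// Includes the throughput-degradation warning once 10+ samples are recorded
monitor.PrintReport();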
Step 5: Export Metrics to Console or Log Files
For integration with external monitoring systems, format metrics as structured output:
public static class MetricsExporter
{
/// <summary>
/// Writes a metric line in a structured format suitable for log aggregation tools.
/// </summary>
public static void LogMetric(TextGenerationResult result, double latencyMs, string requestId)
{
string logLine = string.Format(
"[{0:O}] request_id={1} prompt_tokens={2} completion_tokens={3} " +
"total_tokens={4} tokens_per_sec={5:F1} latency_ms={6:F0} ttft_ms={7:F0}",
DateTime.UtcNow,
requestId,
result.PromptTokenCount,
result.GeneratedTokenCount,
result.PromptTokenCount + result.GeneratedTokenCount,
result.TokenGenerationRate,
latencyMs,
result.TimeToFirstToken.TotalMilliseconds
);
Console.WriteLine(logLine);
// Optionally append to a file for external collection
// File.AppendAllText("inference_metrics.log", logLine + Environment.NewLine);
}
}
Usage in your inference loop:
string requestId = Guid.NewGuid().ToString("N")[..8];
var sw = Stopwatch.StartNew();
var result = chat.Submit("Summarize the key points of this document.");
sw.Stop();
MetricsExporter.LogMetric(result, sw.ElapsedMilliseconds, requestId);
Example output:
[2025-01-15T14:30:22.1234567Z] request_id=a1b2c3d4 prompt_tokens=42 completion_tokens=87 total_tokens=129 tokens_per_sec=38.4 latency_ms=2265 ttft_ms=312
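If your log pipeline prefers structured JSON, the same fields can be emitted as JSON lines. The following is a sketch using System.Text.Json; the field names mirror the log format above and are illustrative, not an LM-Kit.NET convention:
public static class JsonMetricsExporter
{
    public static void LogMetric(TextGenerationResult result, double latencyMs, string requestId)
    {
        // One JSON object per line ("JSON lines"); most log collectors ingest this directly.
        string json = System.Text.Json.JsonSerializer.Serialize(new
        {
            timestamp = DateTime.UtcNow,
            request_id = requestId,
            prompt_tokens = result.PromptTokenCount,
            completion_tokens = result.GeneratedTokenCount,
            total_tokens = result.PromptTokenCount + result.GeneratedTokenCount,
            tokens_per_sec = result.TokenGenerationRate,
            latency_ms = latencyMs,
            ttft_ms = result.TimeToFirstToken.TotalMilliseconds
        });

        Console.WriteLine(json);
        // Optionally append to a .jsonl file for external collection
        // File.AppendAllText("inference_metrics.jsonl", json + Environment.NewLine);
    }
}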
Key Metrics Reference
| Metric | Source | What It Tells You |
|---|---|---|
| `PromptTokenCount` | `TextGenerationResult` | Input size; affects latency and KV-cache usage |
| `GeneratedTokenCount` | `TextGenerationResult` | Output size; correlates with generation time |
| `TokenGenerationRate` | `TextGenerationResult` | Tokens per second; primary throughput indicator |
| `TimeToFirstToken` | `TextGenerationResult` | Time until the first token appears; perceived responsiveness |
| Wall-clock latency | `Stopwatch` | End-to-end request duration, including overhead |
| `GpuLayerCount` | `LM` | Number of layers offloaded to the GPU; affects speed |
| `ContextLength` | `LM` | Maximum context window; limits prompt + completion size |
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| `TokenGenerationRate` drops over time | KV-cache filling up in long conversations | Call `chat.ClearHistory()` periodically, or enable KV-cache recycling via `EnableKVCacheRecycling` |
| `TimeToFirstToken` is very high | Large prompt or cold model start | Warm up the model with a short prompt at startup (see the sketch below); reduce prompt length |
| Metrics vary widely between runs | System resource contention | Run benchmarks in isolation; average over multiple runs |
| Wall-clock time much higher than generation time | Overhead from tokenization or context assembly | Profile with `Stopwatch` around individual steps to isolate bottlenecks |
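For the cold-start case above, a short warm-up request at startup is usually enough to move the first-token cost out of the first user-facing request. A minimal sketch using the chat instance from Step 2:
// Pay the cold-start cost before serving real traffic.
var warmup = chat.Submit("Hello");
Console.WriteLine($"Warm-up done: TTFT {warmup.TimeToFirstToken.TotalMilliseconds:F0} ms");

// Optional: drop the warm-up exchange from the conversation history
chat.ClearHistory();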
Next Steps
- Configure GPU Backends and Optimize Performance: maximize inference speed with proper backend selection.
- Build a Resilient Production Agent: add retry logic, timeouts, and circuit breakers to production agents.
- Load a Model and Generate Your First Response: get started with the basics.