
Add Telemetry and Observability to Your LM-Kit.NET Application

Running AI inference in production without visibility is like driving without a dashboard. You need to know how fast tokens are generated, how much latency each request incurs, and whether your models are performing within acceptable bounds. This tutorial builds a telemetry layer that tracks token usage, generation speed, latency, and session-level statistics for your LM-Kit.NET application.


Why This Matters

Two production problems that observability solves:

  1. Detecting performance degradation before users notice. Token generation rates can drop when VRAM pressure increases, when the KV-cache fills up, or when concurrent requests exceed capacity. Real-time metrics let you set alerts and scale before response times become unacceptable.
  2. Tracking resource consumption for capacity planning. Knowing your average prompt token count, completion token count, and generation rate per request lets you forecast hardware needs, estimate costs per query, and identify which prompts or workflows consume the most resources (see the back-of-the-envelope sketch after this list).
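
As a concrete example of the second point, the averages your telemetry collects can be turned into a rough capacity estimate. Every number below is a placeholder, and the math ignores prompt processing, batching, and queuing, so treat it as a back-of-the-envelope sketch rather than a sizing formula:

// Back-of-the-envelope capacity estimate built from averaged telemetry.
// Substitute the averages your own tracker reports for these placeholders.
double avgCompletionTokens = 200;     // average GeneratedTokenCount per request
double avgTokensPerSecond  = 35;      // average TokenGenerationRate
int    requestsPerDay      = 10_000;  // expected daily request volume

double generationSecondsPerDay = requestsPerDay * avgCompletionTokens / avgTokensPerSecond;
double singleGpuUtilization    = generationSecondsPerDay / 86_400;  // fraction of one GPU-day

Console.WriteLine($"Completion tokens/day:  {requestsPerDay * avgCompletionTokens:N0}");
Console.WriteLine($"GPU-busy time/day:      {generationSecondsPerDay / 3600:F1} h");
Console.WriteLine($"Single-GPU utilization: {singleGpuUtilization:P0}");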

Prerequisites

| Requirement | Minimum |
| --- | --- |
| .NET SDK | 8.0+ |
| VRAM | 4+ GB (for a 4B model) |
| Disk | ~3 GB free for model download |

Step 1: Create the Project

dotnet new console -n TelemetryQuickstart
cd TelemetryQuickstart
dotnet add package LM-Kit.NET

Step 2: Run Inference and Collect Metrics

LM-Kit.NET's TextGenerationResult provides built-in metrics after every inference call. Start by capturing these:

using System.Text;
using System.Diagnostics;
using LMKit.Model;
using LMKit.TextGeneration;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Load model
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}%   "); return true; });
Console.WriteLine("\n");

// ──────────────────────────────────────
// 2. Run inference and collect metrics
// ──────────────────────────────────────
var chat = new MultiTurnConversation(model)
{
    SystemPrompt = "You are a helpful assistant.",
    MaximumCompletionTokens = 256
};

// Track performance with Stopwatch
var stopwatch = Stopwatch.StartNew();

var result = chat.Submit("Explain the benefits of local AI inference in three sentences.");

stopwatch.Stop();

// ──────────────────────────────────────
// 3. Report metrics
// ──────────────────────────────────────
Console.WriteLine($"Response: {result.Completion}\n");
Console.WriteLine("── Inference Metrics ──");
Console.WriteLine($"  Prompt tokens:     {result.PromptTokenCount}");
Console.WriteLine($"  Completion tokens: {result.GeneratedTokenCount}");
Console.WriteLine($"  Total tokens:      {result.PromptTokenCount + result.GeneratedTokenCount}");
Console.WriteLine($"  Generation rate:   {result.TokenGenerationRate:F1} tokens/sec");
Console.WriteLine($"  Wall-clock time:   {stopwatch.ElapsedMilliseconds} ms");
Console.WriteLine($"  Time to first tok: {result.TimeToFirstToken.TotalMilliseconds:F0} ms");

Step 3: Build a Custom Metrics Tracker

For production systems, accumulate metrics across multiple requests to compute session-level statistics:

public class InferenceMetricsTracker
{
    private int _totalRequests;
    private int _totalPromptTokens;
    private int _totalCompletionTokens;
    private long _totalLatencyMs;   // stored as a long so Interlocked.Add can update it atomically

    public void Record(TextGenerationResult result, double latencyMs)
    {
        Interlocked.Increment(ref _totalRequests);
        Interlocked.Add(ref _totalPromptTokens, result.PromptTokenCount);
        Interlocked.Add(ref _totalCompletionTokens, result.GeneratedTokenCount);
        Interlocked.Add(ref _totalLatencyMs, (long)latencyMs);
    }

    public void PrintSummary()
    {
        Console.WriteLine("\n── Session Summary ──");
        Console.WriteLine($"  Total requests:          {_totalRequests}");
        Console.WriteLine($"  Total prompt tokens:     {_totalPromptTokens}");
        Console.WriteLine($"  Total completion tokens: {_totalCompletionTokens}");
        Console.WriteLine($"  Avg latency:             {(_totalRequests > 0 ? (double)_totalLatencyMs / _totalRequests : 0):F0} ms");
    }
}

Wire the tracker into your inference loop:

var tracker = new InferenceMetricsTracker();

string[] prompts =
{
    "What is retrieval-augmented generation?",
    "Explain the difference between TCP and UDP.",
    "List three benefits of running AI models locally."
};

foreach (string prompt in prompts)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.WriteLine($"\nPrompt: {prompt}");
    Console.ResetColor();

    var sw = Stopwatch.StartNew();
    var result = chat.Submit(prompt);
    sw.Stop();

    tracker.Record(result, sw.ElapsedMilliseconds);

    Console.WriteLine($"  Response: {(result.Completion.Length > 100 ? result.Completion.Substring(0, 100) + "..." : result.Completion)}");
    Console.ForegroundColor = ConsoleColor.DarkGray;
    Console.WriteLine($"  [{result.GeneratedTokenCount} tokens, {result.TokenGenerationRate:F1} tok/s, {sw.ElapsedMilliseconds} ms]");
    Console.ResetColor();
}

tracker.PrintSummary();

Step 4: Monitor Inference Performance Over Time

For long-running services, track metrics with timestamps so you can detect trends and regressions:

public class TimestampedMetric
{
    public DateTime Timestamp { get; init; }
    public int PromptTokens { get; init; }
    public int CompletionTokens { get; init; }
    public double TokensPerSecond { get; init; }
    public double LatencyMs { get; init; }
    public double TimeToFirstTokenMs { get; init; }
}

public class PerformanceMonitor
{
    private readonly List<TimestampedMetric> _metrics = new();
    private readonly object _lock = new();

    public void Record(TextGenerationResult result, double latencyMs)
    {
        var metric = new TimestampedMetric
        {
            Timestamp = DateTime.UtcNow,
            PromptTokens = result.PromptTokenCount,
            CompletionTokens = result.GeneratedTokenCount,
            TokensPerSecond = result.TokenGenerationRate,
            LatencyMs = latencyMs,
            TimeToFirstTokenMs = result.TimeToFirstToken.TotalMilliseconds
        };

        lock (_lock)
        {
            _metrics.Add(metric);
        }
    }

    public void PrintReport()
    {
        List<TimestampedMetric> snapshot;
        lock (_lock)
        {
            snapshot = new List<TimestampedMetric>(_metrics);
        }

        if (snapshot.Count == 0)
        {
            Console.WriteLine("No metrics recorded.");
            return;
        }

        Console.WriteLine("\n── Performance Report ──");
        Console.WriteLine($"  Total requests:          {snapshot.Count}");
        Console.WriteLine($"  Avg tokens/sec:          {snapshot.Average(m => m.TokensPerSecond):F1}");
        Console.WriteLine($"  Avg latency:             {snapshot.Average(m => m.LatencyMs):F0} ms");
        Console.WriteLine($"  P95 latency:             {Percentile(snapshot.Select(m => m.LatencyMs), 0.95):F0} ms");
        Console.WriteLine($"  Avg time to first token: {snapshot.Average(m => m.TimeToFirstTokenMs):F0} ms");
        Console.WriteLine($"  Total prompt tokens:     {snapshot.Sum(m => m.PromptTokens)}");
        Console.WriteLine($"  Total completion tokens: {snapshot.Sum(m => m.CompletionTokens)}");

        // Check for performance degradation
        if (snapshot.Count >= 10)
        {
            var recentAvg = snapshot.TakeLast(5).Average(m => m.TokensPerSecond);
            var overallAvg = snapshot.Average(m => m.TokensPerSecond);

            if (recentAvg < overallAvg * 0.8)
            {
                Console.ForegroundColor = ConsoleColor.Yellow;
                Console.WriteLine($"\n  WARNING: Recent throughput ({recentAvg:F1} tok/s) is 20%+ below average ({overallAvg:F1} tok/s)");
                Console.ResetColor();
            }
        }
    }

    private static double Percentile(IEnumerable<double> values, double percentile)
    {
        var sorted = values.OrderBy(v => v).ToList();
        int index = (int)Math.Ceiling(percentile * sorted.Count) - 1;
        return sorted[Math.Max(0, index)];
    }
}
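
Wiring the monitor into an inference loop mirrors Step 3. The sketch below assumes the chat and prompts variables from that step are still in scope:

var monitor = new PerformanceMonitor();

foreach (string prompt in prompts)
{
    var sw = Stopwatch.StartNew();
    var result = chat.Submit(prompt);
    sw.Stop();

    // Record every request; the monitor locks internally, so concurrent callers are fine too.
    monitor.Record(result, sw.ElapsedMilliseconds);
}

// Print on shutdown, on a timer, or behind an admin endpoint.
monitor.PrintReport();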

Step 5: Export Metrics to Console or Log Files

For integration with external monitoring systems, format metrics as structured output:

public static class MetricsExporter
{
    /// <summary>
    /// Writes a metric line in a structured format suitable for log aggregation tools.
    /// </summary>
    public static void LogMetric(TextGenerationResult result, double latencyMs, string requestId)
    {
        string logLine = string.Format(
            "[{0:O}] request_id={1} prompt_tokens={2} completion_tokens={3} " +
            "total_tokens={4} tokens_per_sec={5:F1} latency_ms={6:F0} ttft_ms={7:F0}",
            DateTime.UtcNow,
            requestId,
            result.PromptTokenCount,
            result.GeneratedTokenCount,
            result.PromptTokenCount + result.GeneratedTokenCount,
            result.TokenGenerationRate,
            latencyMs,
            result.TimeToFirstToken.TotalMilliseconds
        );

        Console.WriteLine(logLine);

        // Optionally append to a file for external collection
        // File.AppendAllText("inference_metrics.log", logLine + Environment.NewLine);
    }
}

Usage in your inference loop:

string requestId = Guid.NewGuid().ToString("N")[..8];

var sw = Stopwatch.StartNew();
var result = chat.Submit("Summarize the key points of this document.");
sw.Stop();

MetricsExporter.LogMetric(result, sw.ElapsedMilliseconds, requestId);

Example output:

[2025-01-15T14:30:22.1234567Z] request_id=a1b2c3d4 prompt_tokens=42 completion_tokens=87 total_tokens=129 tokens_per_sec=38.4 latency_ms=2265 ttft_ms=312
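
If your log pipeline expects JSON lines rather than key=value pairs, the same data can be serialized with System.Text.Json. This is a sketch, not part of LM-Kit.NET; the class and field names are illustrative and should be adapted to whatever your collector expects:

// In its own file (e.g. JsonMetricsExporter.cs); names below are illustrative.
using System.Text.Json;
using LMKit.TextGeneration;

public static class JsonMetricsExporter
{
    public static void LogMetric(TextGenerationResult result, double latencyMs, string requestId)
    {
        // An anonymous object keeps the sketch short; rename fields to match your pipeline.
        string json = JsonSerializer.Serialize(new
        {
            timestamp = DateTime.UtcNow,
            request_id = requestId,
            prompt_tokens = result.PromptTokenCount,
            completion_tokens = result.GeneratedTokenCount,
            tokens_per_sec = result.TokenGenerationRate,
            latency_ms = latencyMs,
            ttft_ms = result.TimeToFirstToken.TotalMilliseconds
        });

        Console.WriteLine(json);

        // Optionally append to a file for external collection
        // File.AppendAllText("inference_metrics.jsonl", json + Environment.NewLine);
    }
}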

Key Metrics Reference

| Metric | Source | What It Tells You |
| --- | --- | --- |
| PromptTokenCount | TextGenerationResult | Input size; affects latency and KV-cache usage |
| GeneratedTokenCount | TextGenerationResult | Output size; correlates with generation time |
| TokenGenerationRate | TextGenerationResult | Tokens per second; primary throughput indicator |
| TimeToFirstToken | TextGenerationResult | Time until the first token appears; perceived responsiveness |
| Wall-clock latency | Stopwatch | End-to-end request duration, including overhead |
| GpuLayerCount | LM | Number of layers on the GPU; affects speed |
| ContextLength | LM | Maximum context window; limits prompt + completion size |
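
The two model-level values are worth logging once at startup so per-request metrics can be interpreted against the deployment configuration. A minimal sketch, assuming the GpuLayerCount and ContextLength properties are exposed on the loaded LM instance as listed above:

// Log static model configuration once, right after LM.LoadFromModelID(...).
// Property names follow the reference table above; confirm them against the
// LM-Kit.NET version you are running.
Console.WriteLine("── Model Configuration ──");
Console.WriteLine($"  GPU layers:     {model.GpuLayerCount}");
Console.WriteLine($"  Context length: {model.ContextLength}");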

Common Issues

| Problem | Cause | Fix |
| --- | --- | --- |
| TokenGenerationRate drops over time | KV-cache filling up in long conversations | Call chat.ClearHistory() periodically, or enable EnableKVCacheRecycling |
| TimeToFirstToken is very high | Large prompt or cold model start | Warm up the model with a short prompt at startup; reduce prompt length |
| Metrics vary widely between runs | System resource contention | Run benchmarks in isolation; average over multiple runs |
| Wall-clock time much higher than generation time | Overhead from tokenization or context assembly | Profile with Stopwatch around individual steps to isolate bottlenecks |
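
For the cold-start case, the warm-up fix can be as small as one throwaway request issued right after the model loads. A sketch, assuming the chat instance from Step 2 and the ClearHistory method mentioned in the table above:

// Warm-up: pay the cold-start cost once at startup instead of on the first user request.
var warmupWatch = Stopwatch.StartNew();
chat.Submit("Hello");
warmupWatch.Stop();

// Discard the throwaway turn so it does not occupy context in later requests.
chat.ClearHistory();

Console.WriteLine($"Warm-up took {warmupWatch.ElapsedMilliseconds} ms");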

Next Steps