Configure GPU Backends and Optimize Performance

Running LLMs efficiently requires matching your hardware to the right backend and tuning how the model is distributed across CPU and GPU. This tutorial builds a program that detects available hardware, configures the optimal backend, tunes GPU layer offloading, and benchmarks performance across your model catalog.


Why This Matters

Proper backend configuration solves two enterprise deployment problems:

  1. Maximizing throughput on heterogeneous hardware. Production environments often include a mix of NVIDIA, AMD, and CPU-only machines. Selecting the correct backend per machine ensures every node runs at peak efficiency instead of falling back to slow CPU inference.
  2. Fitting larger models into limited VRAM. Partial GPU offloading lets you run a 12B model on an 8 GB GPU by keeping some layers on the CPU. Without tuning, the model either fails to load or runs entirely on CPU, wasting available GPU capacity.

Prerequisites

| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 8 GB |
| VRAM (optional) | 4+ GB for GPU acceleration |
| Disk | ~3 GB free for model download |

Step 1: Create the Project

dotnet new console -n GpuBackendOptimizer
cd GpuBackendOptimizer
dotnet add package LM-Kit.NET

For NVIDIA GPU acceleration, install a CUDA backend package. Vulkan is already included in the base LM-Kit.NET package.

# NVIDIA CUDA 12 (Windows)
dotnet add package LM-Kit.NET.Backend.Cuda12.Windows

# NVIDIA CUDA 12 (Linux x64)
dotnet add package LM-Kit.NET.Backend.Cuda12.Linux

# NVIDIA CUDA 12 (Linux ARM64)
dotnet add package LM-Kit.NET.Backend.Cuda12.linux-arm64

# NVIDIA CUDA 13 (Windows only)
dotnet add package LM-Kit.NET.Backend.Cuda13.Windows

Step 2: Detect Available Backends and GPU Information

Before configuring a backend, inspect what hardware is available on the current machine:

using System.Text;
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Query hardware capabilities
// ──────────────────────────────────────
Console.WriteLine("=== Hardware Detection ===\n");

var deviceInfo = LM.DeviceConfiguration.GetAvailableDevices();

foreach (var device in deviceInfo)
{
    Console.WriteLine($"  Device: {device.Name}");
    Console.WriteLine($"  Type:   {device.Type}");
    Console.WriteLine($"  VRAM:   {device.TotalMemory / (1024.0 * 1024 * 1024):F1} GB");
    Console.WriteLine();
}

Console.WriteLine($"  CUDA available:   {LMKit.Global.Runtime.IsCudaAvailable}");
Console.WriteLine($"  Vulkan available: {LMKit.Global.Runtime.IsVulkanAvailable}");

This lets you make informed decisions about which backend to enable and how many layers to offload.
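
As a concrete example, you can fold a detected VRAM figure into the per-layer estimate described in Step 4 (roughly the model file size divided by the total layer count) to pick a starting GpuLayerCount. The following is a minimal sketch with illustrative numbers; the helper name, the 1 GB headroom, and the 9 GB / 48-layer figures are assumptions for illustration, not values from the SDK:

static int EstimateGpuLayerCount(double vramGb, double modelFileGb, int totalLayers)
{
    double usableGb = Math.Max(vramGb - 1.0, 0);       // leave ~1 GB headroom for KV-cache and scratch buffers
    double perLayerGb = modelFileGb / totalLayers;     // rough per-layer footprint (see Step 4)
    return Math.Clamp((int)(usableGb / perLayerGb), 0, totalLayers);
}

// Illustrative: an 8 GB GPU and a ~9 GB model file split across 48 layers
// yields roughly 37 layers on the GPU; the remaining layers stay on the CPU.
Console.WriteLine(EstimateGpuLayerCount(vramGb: 8.0, modelFileGb: 9.0, totalLayers: 48));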


Step 3: Configure Backend Selection

LM-Kit.NET supports multiple backends. Enable the one that matches your hardware before initializing the runtime:

using System.Text;
using LMKit.Model;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Configure the runtime backend
// ──────────────────────────────────────

// Enable CUDA acceleration (requires CUDA backend NuGet package)
LMKit.Global.Runtime.EnableCuda = true;

// Optional: enable Vulkan instead for cross-platform GPU support
// LMKit.Global.Runtime.EnableVulkan = true;

LMKit.Global.Runtime.Initialize();

// ──────────────────────────────────────
// 2. Load model with custom device configuration
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    deviceConfiguration: new LM.DeviceConfiguration
    {
        GpuLayerCount = 28  // partial offload: keep some layers on CPU
    },
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  Downloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });

Console.WriteLine($"\n\nModel loaded: {model.Name}");
Console.WriteLine($"  GPU layers offloaded: {model.GpuLayerCount}");
Console.WriteLine($"  Context length: {model.ContextLength} tokens");

Only enable one GPU backend at a time. If both CUDA and Vulkan are enabled, CUDA takes priority on NVIDIA hardware.
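
If the same binary runs on machines with different GPUs, you can gate the flags on the availability checks from Step 2 so that exactly one backend ends up enabled. A minimal sketch using only the runtime properties shown above:

// Enable exactly one GPU backend, preferring CUDA when present.
if (LMKit.Global.Runtime.IsCudaAvailable)
{
    LMKit.Global.Runtime.EnableCuda = true;
    LMKit.Global.Runtime.EnableVulkan = false;
}
else if (LMKit.Global.Runtime.IsVulkanAvailable)
{
    LMKit.Global.Runtime.EnableCuda = false;
    LMKit.Global.Runtime.EnableVulkan = true;
}
// Otherwise leave both disabled and fall back to CPU inference.

LMKit.Global.Runtime.Initialize();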


Step 4: Tune GPU Layer Offloading

Partial GPU offloading is the key to fitting larger models into limited VRAM. Instead of loading the entire model onto the GPU (which may fail if VRAM is insufficient), you specify how many transformer layers to place on the GPU. The remaining layers stay on the CPU.

// Full GPU offload (default behavior, uses all available VRAM)
var fullGpu = new LM.DeviceConfiguration
{
    GpuLayerCount = int.MaxValue
};

// Partial offload: place 20 layers on GPU, rest on CPU
var partialGpu = new LM.DeviceConfiguration
{
    GpuLayerCount = 20
};

// CPU only: no GPU offloading
var cpuOnly = new LM.DeviceConfiguration
{
    GpuLayerCount = 0
};

// Load with partial offload
Console.WriteLine("Loading with partial GPU offload...");
using LM model = LM.LoadFromModelID("gemma3:12b",
    deviceConfiguration: partialGpu,
    loadingProgress: p => { Console.Write($"\r  Loading: {p * 100:F0}%   "); return true; });

Console.WriteLine($"\n\nModel loaded: {model.Name}");
Console.WriteLine($"  GPU layers: {model.GpuLayerCount}");
Console.WriteLine($"  Total layers: {model.LayerCount}");

Tuning strategy: start with GpuLayerCount = int.MaxValue. If loading fails with an out-of-memory error, reduce the count by 5 and retry. Continue until the model loads successfully. Each offloaded layer adds roughly (model file size) / (total layers) of VRAM usage.
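
That back-off loop is straightforward to automate. Below is a sketch of the strategy, assuming the failure surfaces as the OutOfMemoryException listed under Common Issues; the initial concrete count of 40 is an illustrative guess, not an SDK default:

// Start with full offload and reduce the layer count by 5 after each out-of-memory failure.
int gpuLayers = int.MaxValue;
LM? model = null;

while (model == null)
{
    try
    {
        model = LM.LoadFromModelID("gemma3:12b",
            deviceConfiguration: new LM.DeviceConfiguration { GpuLayerCount = gpuLayers });
    }
    catch (OutOfMemoryException) when (gpuLayers > 0)
    {
        // First failure: drop from "all layers" to a concrete count, then keep reducing by 5.
        gpuLayers = gpuLayers == int.MaxValue ? 40 : Math.Max(gpuLayers - 5, 0);
        Console.WriteLine($"  VRAM exhausted, retrying with {gpuLayers} GPU layers...");
    }
}

Console.WriteLine($"Loaded with {model.GpuLayerCount} of {model.LayerCount} layers on the GPU.");
// Dispose the model when finished (e.g., wrap the final instance in a using statement).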


Step 5: Memory Optimization with KV-Cache and Model Caching

Beyond backend selection, LM-Kit.NET provides several global configuration options that improve inference performance:

// Enable KV-cache recycling for faster subsequent inferences
LMKit.Global.Configuration.EnableKVCacheRecycling = true;

// Enable model caching to reuse loaded models
LMKit.Global.Configuration.EnableModelCache = true;

// Enable token healing for improved generation quality
LMKit.Global.Configuration.EnableTokenHealing = true;

| Setting | Effect | When to Enable |
|---|---|---|
| EnableKVCacheRecycling | Reuses the key-value attention cache across inferences with shared prompt prefixes | Multi-turn conversations, batch processing with similar prompts |
| EnableModelCache | Keeps model weights in memory after disposal for faster reload | Applications that load and unload the same model repeatedly |
| EnableTokenHealing | Corrects tokenization artifacts at generation boundaries | Always (improves output quality with negligible overhead) |

Performance Scoring

Use the built-in performance scorer to evaluate which models will run well on your hardware:

var cards = ModelCard.GetPredefinedModelCards();
foreach (var card in cards.Take(10))
{
    float score = LM.DeviceConfiguration.GetPerformanceScore(card);
    Console.WriteLine($"{card.ModelID,-25} Score: {score:F2}  ({(score > 0.7f ? "Good" : score > 0.4f ? "Acceptable" : "Too slow")})");
}

A score above 0.7 indicates the model will run comfortably on your current hardware. Scores between 0.4 and 0.7 mean the model will work but may be slow. Below 0.4, consider a smaller model or upgrading your hardware.
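
You can also turn that threshold into an automatic model picker, for example selecting the highest-scoring predefined card that clears 0.7. A small sketch built on the same calls as above (the 0.7 cutoff mirrors the "Good" threshold described in this step):

// Pick the best predefined model that scores above the "Good" threshold on this machine.
var best = ModelCard.GetPredefinedModelCards()
    .Select(card => (card, score: LM.DeviceConfiguration.GetPerformanceScore(card)))
    .Where(x => x.score > 0.7f)
    .OrderByDescending(x => x.score)
    .FirstOrDefault();

Console.WriteLine(best.card != null
    ? $"Best fit for this hardware: {best.card.ModelID} (score {best.score:F2})"
    : "No predefined model scores above 0.7 on this hardware.");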


Backend Comparison

| Backend | NuGet Package | GPU Required | Best For |
|---|---|---|---|
| CPU (SSE4) | Included in LM-Kit.NET | No | Basic testing, low-end hardware |
| AVX2 | LM-Kit.NET.Backend.Avx2.Windows | No | Modern CPUs with AVX2 support |
| CUDA 12 | LM-Kit.NET.Backend.Cuda12.Windows / .Linux / .linux-arm64 | NVIDIA GPU | Production with NVIDIA GPUs |
| CUDA 13 | LM-Kit.NET.Backend.Cuda13.Windows | NVIDIA GPU | Latest NVIDIA GPUs (Windows only, Linux coming soon) |
| Vulkan | Included in LM-Kit.NET | Any GPU | Cross-platform GPU acceleration |
| Metal | Built into macOS package | Apple Silicon | macOS native GPU |

How to choose:

  • NVIDIA GPU available: Use CUDA 12 (broadest compatibility) or CUDA 13 (latest GPUs, Windows only).
  • AMD or Intel GPU: Vulkan is included in the base package and works automatically.
  • No GPU, modern CPU: Use AVX2 for 2-3x speedup over the default SSE4 backend.
  • macOS with Apple Silicon: Metal is included automatically; no extra package needed.
  • Mixed fleet: Install a CUDA package. The SDK automatically falls back to Vulkan on machines without NVIDIA GPUs (CUDA 13 → CUDA 12 → Vulkan → CPU).

Common Issues

| Problem | Cause | Fix |
|---|---|---|
| OutOfMemoryException on load | Model exceeds available VRAM | Reduce GpuLayerCount or use a smaller model |
| Slow generation (~1 tok/s) | Running on CPU without a GPU backend | Install the appropriate CUDA or Vulkan NuGet package |
| CUDA not detected | Driver version mismatch | Update NVIDIA drivers; CUDA 12 requires driver 525.60+ |
| Vulkan crashes on startup | Outdated GPU drivers | Update GPU drivers to the latest version |
| Model loads but generation is slow | Too few layers offloaded to GPU | Increase GpuLayerCount until VRAM is nearly full |
| Performance score is 0 | Model requires capabilities the hardware lacks | Choose a model that fits within your VRAM budget |

Next Steps