Configure GPU Backends and Optimize Performance
Running LLMs efficiently requires matching your hardware to the right backend and tuning how the model is distributed across CPU and GPU. This tutorial builds a program that detects available hardware, configures the optimal backend, tunes GPU layer offloading, and benchmarks performance across your model catalog.
Why This Matters
Two enterprise deployment problems that proper backend configuration solves:
- Maximizing throughput on heterogeneous hardware. Production environments often include a mix of NVIDIA, AMD, and CPU-only machines. Selecting the correct backend per machine ensures every node runs at peak efficiency instead of falling back to slow CPU inference.
- Fitting larger models into limited VRAM. Partial GPU offloading lets you run a 12B model on an 8 GB GPU by keeping some layers on the CPU. Without tuning, the model either fails to load or runs entirely on CPU, wasting available GPU capacity.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 8 GB |
| VRAM (optional) | 4+ GB for GPU acceleration |
| Disk | ~3 GB free for model download |
Step 1: Create the Project
dotnet new console -n GpuBackendOptimizer
cd GpuBackendOptimizer
dotnet add package LM-Kit.NET
For NVIDIA GPU acceleration, install a CUDA backend package. Vulkan is already included in the base LM-Kit.NET package.
# NVIDIA CUDA 12 (Windows)
dotnet add package LM-Kit.NET.Backend.Cuda12.Windows
# NVIDIA CUDA 12 (Linux x64)
dotnet add package LM-Kit.NET.Backend.Cuda12.Linux
# NVIDIA CUDA 12 (Linux ARM64)
dotnet add package LM-Kit.NET.Backend.Cuda12.linux-arm64
# NVIDIA CUDA 13 (Windows only)
dotnet add package LM-Kit.NET.Backend.Cuda13.Windows
Step 2: Detect Available Backends and GPU Information
Before configuring a backend, inspect what hardware is available on the current machine:
using System.Text;
using LMKit.Model;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Query hardware capabilities
// ──────────────────────────────────────
Console.WriteLine("=== Hardware Detection ===\n");
var deviceInfo = LM.DeviceConfiguration.GetAvailableDevices();
foreach (var device in deviceInfo)
{
    Console.WriteLine($" Device: {device.Name}");
    Console.WriteLine($" Type: {device.Type}");
    Console.WriteLine($" VRAM: {device.TotalMemory / (1024.0 * 1024 * 1024):F1} GB");
    Console.WriteLine();
}
Console.WriteLine($" CUDA available: {LMKit.Global.Runtime.IsCudaAvailable}");
Console.WriteLine($" Vulkan available: {LMKit.Global.Runtime.IsVulkanAvailable}");
This lets you make informed decisions about which backend to enable and how many layers to offload.
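For example, the detection results can drive the backend choice automatically. The sketch below is a minimal example that uses only the properties shown above; whether Vulkan needs to be enabled explicitly depends on your package setup (see Step 3), so treat it as a starting point rather than the definitive selection logic:
// Enable the best available GPU backend based on the detection results.
// Minimal sketch: run this before loading any model.
if (LMKit.Global.Runtime.IsCudaAvailable)
{
    LMKit.Global.Runtime.EnableCuda = true;     // NVIDIA GPU detected
}
else if (LMKit.Global.Runtime.IsVulkanAvailable)
{
    LMKit.Global.Runtime.EnableVulkan = true;   // other GPU with Vulkan support
}
// If neither is available, leave both disabled and inference runs on the CPU backend.
LMKit.Global.Runtime.Initialize();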
Step 3: Configure Backend Selection
LM-Kit.NET supports multiple backends. Enable the one that matches your hardware before initializing the runtime:
using System.Text;
using LMKit.Model;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Configure the runtime backend
// ──────────────────────────────────────
// Enable CUDA acceleration (requires CUDA backend NuGet package)
LMKit.Global.Runtime.EnableCuda = true;
// Optional: enable Vulkan instead for cross-platform GPU support
// LMKit.Global.Runtime.EnableVulkan = true;
LMKit.Global.Runtime.Initialize();
// ──────────────────────────────────────
// 2. Load model with custom device configuration
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    deviceConfiguration: new LM.DeviceConfiguration
    {
        GpuLayerCount = 28 // partial offload: keep some layers on CPU
    },
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine($"\n\nModel loaded: {model.Name}");
Console.WriteLine($" GPU layers offloaded: {model.GpuLayerCount}");
Console.WriteLine($" Context length: {model.ContextLength} tokens");
Only enable one GPU backend at a time. If both CUDA and Vulkan are enabled, CUDA takes priority on NVIDIA hardware.
Step 4: Tune GPU Layer Offloading
Partial GPU offloading is the key to fitting larger models into limited VRAM. Instead of loading the entire model onto the GPU (which may fail if VRAM is insufficient), you specify how many transformer layers to place on the GPU. The remaining layers stay on the CPU.
// Full GPU offload (default behavior, uses all available VRAM)
var fullGpu = new LM.DeviceConfiguration
{
    GpuLayerCount = int.MaxValue
};
// Partial offload: place 20 layers on GPU, rest on CPU
var partialGpu = new LM.DeviceConfiguration
{
    GpuLayerCount = 20
};
// CPU only: no GPU offloading
var cpuOnly = new LM.DeviceConfiguration
{
    GpuLayerCount = 0
};
// Load with partial offload
Console.WriteLine("Loading with partial GPU offload...");
using LM model = LM.LoadFromModelID("gemma3:12b",
    deviceConfiguration: partialGpu,
    loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine($"\n\nModel loaded: {model.Name}");
Console.WriteLine($" GPU layers: {model.GpuLayerCount}");
Console.WriteLine($" Total layers: {model.LayerCount}");
Tuning strategy: start with GpuLayerCount = int.MaxValue (full offload). If loading fails with an out-of-memory error, switch to an explicit layer count and reduce it in steps of 5 until the model loads successfully. Each offloaded layer adds roughly (model file size) / (total layers) of VRAM usage, which gives you a reasonable starting estimate for your GPU.
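The loop below sketches that strategy. It assumes an out-of-memory failure during loading surfaces as an OutOfMemoryException (see Common Issues below); the candidate layer counts and the exception type are assumptions you may need to adjust for your environment:
// Try full offload first, then back off in steps of 5 layers until the model fits.
// GpuLayerCount = 0 is the CPU-only fallback and should always load (given enough RAM).
int[] candidates = { int.MaxValue, 40, 35, 30, 25, 20, 15, 10, 5, 0 };
LM? model = null;
foreach (int layers in candidates)
{
    try
    {
        model = LM.LoadFromModelID("gemma3:12b",
            deviceConfiguration: new LM.DeviceConfiguration { GpuLayerCount = layers });
        Console.WriteLine($"Loaded with GpuLayerCount = {model.GpuLayerCount}.");
        break;
    }
    catch (OutOfMemoryException)
    {
        Console.WriteLine($"GpuLayerCount = {layers} exceeded available VRAM, retrying with fewer layers...");
    }
}
Dispose the model when you are done with it; the using declarations in the earlier snippets handle that automatically.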
Step 5: Memory Optimization with KV-Cache and Model Caching
Beyond backend selection, LM-Kit.NET provides several global configuration options that improve inference performance:
// Enable KV-cache recycling for faster subsequent inferences
LMKit.Global.Configuration.EnableKVCacheRecycling = true;
// Enable model caching to reuse loaded models
LMKit.Global.Configuration.EnableModelCache = true;
// Enable token healing for improved generation quality
LMKit.Global.Configuration.EnableTokenHealing = true;
| Setting | Effect | When to Enable |
|---|---|---|
| EnableKVCacheRecycling | Reuses the key-value attention cache across inferences with shared prompt prefixes | Multi-turn conversations, batch processing with similar prompts |
| EnableModelCache | Keeps model weights in memory after disposal for faster reload | Applications that load and unload the same model repeatedly |
| EnableTokenHealing | Corrects tokenization artifacts at generation boundaries | Always (improves output quality with negligible overhead) |
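These flags are global and typically set once at process start, alongside the backend configuration from Step 3. The consolidated sketch below only regroups calls that already appear in the snippets above; the ordering is one reasonable arrangement, not a requirement documented here:
// One-time runtime setup, before any model is loaded.
LMKit.Licensing.LicenseManager.SetLicenseKey("");
// Performance settings (Step 5)
LMKit.Global.Configuration.EnableKVCacheRecycling = true;
LMKit.Global.Configuration.EnableModelCache = true;
LMKit.Global.Configuration.EnableTokenHealing = true;
// Backend selection (Step 3)
LMKit.Global.Runtime.EnableCuda = true;   // or EnableVulkan on non-NVIDIA GPUs
LMKit.Global.Runtime.Initialize();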
Performance Scoring
Use the built-in performance scorer to evaluate which models will run well on your hardware:
var cards = ModelCard.GetPredefinedModelCards();
foreach (var card in cards.Take(10))
{
    float score = LM.DeviceConfiguration.GetPerformanceScore(card);
    Console.WriteLine($"{card.ModelID,-25} Score: {score:F2} ({(score > 0.7f ? "Good" : score > 0.4f ? "Acceptable" : "Too slow")})");
}
A score above 0.7 indicates the model will run comfortably on your current hardware. Scores between 0.4 and 0.7 mean the model will work but may be slow. Below 0.4, consider a smaller model or upgrading your hardware.
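You can also use the score to pick a model programmatically. The sketch below keeps only cards scoring above the 0.7 threshold from the guidance above and loads the best one; the API calls are the ones already shown, while the selection logic itself is our own illustration:
// Pick the highest-scoring predefined model that should run comfortably on this machine.
var bestCard = ModelCard.GetPredefinedModelCards()
    .Select(card => (Card: card, Score: LM.DeviceConfiguration.GetPerformanceScore(card)))
    .Where(x => x.Score > 0.7f)
    .OrderByDescending(x => x.Score)
    .Select(x => x.Card)
    .FirstOrDefault();
if (bestCard != null)
{
    Console.WriteLine($"Best match for this machine: {bestCard.ModelID}");
    using LM model = LM.LoadFromModelID(bestCard.ModelID);
    Console.WriteLine($"Loaded {model.Name} ({model.GpuLayerCount} GPU layers).");
}
else
{
    Console.WriteLine("No predefined model scores above 0.7 on this hardware; consider a smaller model.");
}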
Backend Comparison
| Backend | NuGet Package | GPU Required | Best For |
|---|---|---|---|
| CPU (SSE4) | Included in LM-Kit.NET | No | Basic testing, low-end hardware |
| AVX2 | LM-Kit.NET.Backend.Avx2.Windows | No | Modern CPUs with AVX2 support |
| CUDA 12 | LM-Kit.NET.Backend.Cuda12.Windows / .Linux / .linux-arm64 | NVIDIA GPU | Production with NVIDIA GPUs |
| CUDA 13 | LM-Kit.NET.Backend.Cuda13.Windows | NVIDIA GPU | Latest NVIDIA GPUs (Windows only, Linux coming soon) |
| Vulkan | Included in LM-Kit.NET | Any GPU | Cross-platform GPU acceleration |
| Metal | Built into macOS package | Apple Silicon | macOS native GPU |
How to choose:
- NVIDIA GPU available: Use CUDA 12 (broadest compatibility) or CUDA 13 (latest GPUs, Windows only).
- AMD or Intel GPU: Vulkan is included in the base package and works automatically.
- No GPU, modern CPU: Use AVX2 for 2-3x speedup over the default SSE4 backend.
- macOS with Apple Silicon: Metal is included automatically; no extra package needed.
- Mixed fleet: Install a CUDA package. The SDK automatically falls back to Vulkan on machines without NVIDIA GPUs (CUDA 13 → CUDA 12 → Vulkan → CPU).
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| OutOfMemoryException on load | Model exceeds available VRAM | Reduce GpuLayerCount or use a smaller model |
| Slow generation (~1 tok/s) | Running on CPU without GPU backend | Install the appropriate CUDA or Vulkan NuGet package |
| CUDA not detected | Driver version mismatch | Update NVIDIA drivers; CUDA 12 requires driver 525.60+ |
| Vulkan crashes on startup | Outdated GPU drivers | Update GPU drivers to the latest version |
| Model loads but generation is slow | Too few layers offloaded to GPU | Increase GpuLayerCount until VRAM is nearly full |
| Performance score is 0 | Model requires capabilities the hardware lacks | Choose a model that fits within your VRAM budget |
Next Steps
- Load a Model and Generate Your First Response: get started with basic model loading and generation.
- Build a RAG Pipeline Over Your Own Documents: ground model responses in your own data.
- Add Telemetry and Observability: monitor inference performance in production.