Configure GPU Backends and Optimize Performance
Running LLMs efficiently requires matching your hardware to the right backend and tuning how the model is distributed across CPU and GPU. This tutorial builds a program that detects available hardware, configures the optimal backend, tunes GPU layer offloading, and benchmarks performance across your model catalog.
Why This Matters
Two enterprise deployment problems that proper backend configuration solves:
- Maximizing throughput on heterogeneous hardware. Production environments often include a mix of NVIDIA, AMD, and CPU-only machines. Selecting the correct backend per machine ensures every node runs at peak efficiency instead of falling back to slow CPU inference.
- Fitting larger models into limited VRAM. Partial GPU offloading lets you run a 12B model on an 8 GB GPU by keeping some layers on the CPU. Without tuning, the model either fails to load or runs entirely on CPU, wasting available GPU capacity.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 8 GB |
| VRAM (optional) | 4+ GB for GPU acceleration |
| Disk | ~3 GB free for model download |
Step 1: Create the Project
dotnet new console -n GpuBackendOptimizer
cd GpuBackendOptimizer
dotnet add package LM-Kit.NET
For NVIDIA GPU acceleration, install a CUDA backend package. Vulkan is already included in the base LM-Kit.NET package.
# NVIDIA CUDA 12 (Windows)
dotnet add package LM-Kit.NET.Backend.Cuda12.Windows
# NVIDIA CUDA 12 (Linux x64)
dotnet add package LM-Kit.NET.Backend.Cuda12.Linux
# NVIDIA CUDA 12 (Linux ARM64)
dotnet add package LM-Kit.NET.Backend.Cuda12.linux-arm64
# NVIDIA CUDA 13 (Windows only)
dotnet add package LM-Kit.NET.Backend.Cuda13.Windows
Step 2: Detect Available Backends and GPU Information
Before configuring a backend, inspect what hardware is available on the current machine:
using System.Text;
using LMKit.Model;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Query hardware capabilities
// ──────────────────────────────────────
Console.WriteLine("=== Hardware Detection ===\n");
var deviceInfo = LM.DeviceConfiguration.GetAvailableDevices();
foreach (var device in deviceInfo)
{
    Console.WriteLine($" Device: {device.Name}");
    Console.WriteLine($" Type: {device.Type}");
    Console.WriteLine($" VRAM: {device.TotalMemory / (1024.0 * 1024 * 1024):F1} GB");
    Console.WriteLine();
}
Console.WriteLine($" CUDA available: {LMKit.Global.Runtime.IsCudaAvailable}");
Console.WriteLine($" Vulkan available: {LMKit.Global.Runtime.IsVulkanAvailable}");
This lets you make informed decisions about which backend to enable and how many layers to offload.
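For example, the detection results can drive the backend choice automatically. The sketch below is a minimal example that uses only the properties shown above; whether Vulkan needs to be enabled explicitly depends on your package setup (see Step 3), so treat it as a starting point rather than the definitive selection logic:
// Enable the best available GPU backend based on the detection results.
// Minimal sketch: run this before loading any model.
if (LMKit.Global.Runtime.IsCudaAvailable)
{
    LMKit.Global.Runtime.EnableCuda = true;     // NVIDIA GPU detected
}
else if (LMKit.Global.Runtime.IsVulkanAvailable)
{
    LMKit.Global.Runtime.EnableVulkan = true;   // other GPU with Vulkan support
}
// If neither is available, leave both disabled and inference runs on the CPU backend.
LMKit.Global.Runtime.Initialize();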
Step 3: Configure Backend Selection
LM-Kit.NET supports multiple backends. Enable the one that matches your hardware before initializing the runtime:
using System.Text;
using LMKit.Model;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Configure the runtime backend
// ──────────────────────────────────────
// Enable CUDA acceleration (requires CUDA backend NuGet package)
LMKit.Global.Runtime.EnableCuda = true;
// Optional: enable Vulkan instead for cross-platform GPU support
// LMKit.Global.Runtime.EnableVulkan = true;
LMKit.Global.Runtime.Initialize();
// ──────────────────────────────────────
// 2. Load model with custom device configuration
// ──────────────────────────────────────
Console.WriteLine("Loading model...");
using LM model = LM.LoadFromModelID("gemma3:4b",
    deviceConfiguration: new LM.DeviceConfiguration
    {
        GpuLayerCount = 28 // partial offload: keep some layers on CPU
    },
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r Downloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine($"\n\nModel loaded: {model.Name}");
Console.WriteLine($" GPU layers offloaded: {model.GpuLayerCount}");
Console.WriteLine($" Context length: {model.ContextLength} tokens");
Only enable one GPU backend at a time. If both CUDA and Vulkan are enabled, CUDA takes priority on NVIDIA hardware.
Step 4: Tune GPU Layer Offloading
Partial GPU offloading is the key to fitting larger models into limited VRAM. Instead of loading the entire model onto the GPU (which may fail if VRAM is insufficient), you specify how many transformer layers to place on the GPU. The remaining layers stay on the CPU.
// Full GPU offload (default behavior, uses all available VRAM)
var fullGpu = new LM.DeviceConfiguration
{
    GpuLayerCount = int.MaxValue
};
// Partial offload: place 20 layers on GPU, rest on CPU
var partialGpu = new LM.DeviceConfiguration
{
    GpuLayerCount = 20
};
// CPU only: no GPU offloading
var cpuOnly = new LM.DeviceConfiguration
{
    GpuLayerCount = 0
};
// Load with partial offload
Console.WriteLine("Loading with partial GPU offload...");
using LM model = LM.LoadFromModelID("gemma3:12b",
    deviceConfiguration: partialGpu,
    loadingProgress: p => { Console.Write($"\r Loading: {p * 100:F0}% "); return true; });
Console.WriteLine($"\n\nModel loaded: {model.Name}");
Console.WriteLine($" GPU layers: {model.GpuLayerCount}");
Console.WriteLine($" Total layers: {model.LayerCount}");
Tuning strategy: start with GpuLayerCount = int.MaxValue (full offload). If loading fails with an out-of-memory error, switch to an explicit layer count and reduce it in steps of 5 until the model loads successfully. Each offloaded layer adds roughly (model file size) / (total layers) of VRAM usage, which gives you a reasonable starting estimate for your GPU.
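The loop below sketches that strategy. It assumes an out-of-memory failure during loading surfaces as an OutOfMemoryException (see Common Issues below); the candidate layer counts and the exception type are assumptions you may need to adjust for your environment:
// Try full offload first, then back off in steps of 5 layers until the model fits.
// GpuLayerCount = 0 is the CPU-only fallback and should always load (given enough RAM).
int[] candidates = { int.MaxValue, 40, 35, 30, 25, 20, 15, 10, 5, 0 };
LM? model = null;
foreach (int layers in candidates)
{
    try
    {
        model = LM.LoadFromModelID("gemma3:12b",
            deviceConfiguration: new LM.DeviceConfiguration { GpuLayerCount = layers });
        Console.WriteLine($"Loaded with GpuLayerCount = {model.GpuLayerCount}.");
        break;
    }
    catch (OutOfMemoryException)
    {
        Console.WriteLine($"GpuLayerCount = {layers} exceeded available VRAM, retrying with fewer layers...");
    }
}
Dispose the model when you are done with it; the using declarations in the earlier snippets handle that automatically.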
Step 5: Memory Optimization with KV-Cache and Model Caching
Beyond backend selection, LM-Kit.NET provides several global configuration options that improve inference performance:
// Enable KV-cache recycling for faster subsequent inferences
LMKit.Global.Configuration.EnableKVCacheRecycling = true;
// Enable model caching to reuse loaded models
LMKit.Global.Configuration.EnableModelCache = true;
// Enable token healing for improved generation quality
LMKit.Global.Configuration.EnableTokenHealing = true;
| Setting | Effect | When to Enable |
|---|---|---|
| EnableKVCacheRecycling | Reuses the key-value attention cache across inferences with shared prompt prefixes | Multi-turn conversations, batch processing with similar prompts |
| EnableModelCache | Keeps model weights in memory after disposal for faster reload | Applications that load and unload the same model repeatedly |
| EnableTokenHealing | Corrects tokenization artifacts at generation boundaries | Always (improves output quality with negligible overhead) |
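These flags are global and typically set once at process start, alongside the backend configuration from Step 3. The consolidated sketch below only regroups calls that already appear in the snippets above; the ordering is one reasonable arrangement, not a requirement documented here:
// One-time runtime setup, before any model is loaded.
LMKit.Licensing.LicenseManager.SetLicenseKey("");
// Performance settings (Step 5)
LMKit.Global.Configuration.EnableKVCacheRecycling = true;
LMKit.Global.Configuration.EnableModelCache = true;
LMKit.Global.Configuration.EnableTokenHealing = true;
// Backend selection (Step 3)
LMKit.Global.Runtime.EnableCuda = true;   // or EnableVulkan on non-NVIDIA GPUs
LMKit.Global.Runtime.Initialize();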
Performance Scoring
Use the built-in performance scorer to evaluate which models will run well on your hardware:
var cards = ModelCard.GetPredefinedModelCards();
foreach (var card in cards.Take(10))
{
    float score = LM.DeviceConfiguration.GetPerformanceScore(card);
    Console.WriteLine($"{card.ModelID,-25} Score: {score:F2} ({(score > 0.7f ? "Good" : score > 0.4f ? "Acceptable" : "Too slow")})");
}
A score above 0.7 indicates the model will run comfortably on your current hardware. Scores between 0.4 and 0.7 mean the model will work but may be slow. Below 0.4, consider a smaller model or upgrading your hardware.
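You can also use the score to pick a model programmatically. The sketch below keeps only cards scoring above the 0.7 threshold from the guidance above and loads the best one; the API calls are the ones already shown, while the selection logic itself is our own illustration:
// Pick the highest-scoring predefined model that should run comfortably on this machine.
var bestCard = ModelCard.GetPredefinedModelCards()
    .Select(card => (Card: card, Score: LM.DeviceConfiguration.GetPerformanceScore(card)))
    .Where(x => x.Score > 0.7f)
    .OrderByDescending(x => x.Score)
    .Select(x => x.Card)
    .FirstOrDefault();
if (bestCard != null)
{
    Console.WriteLine($"Best match for this machine: {bestCard.ModelID}");
    using LM model = LM.LoadFromModelID(bestCard.ModelID);
    Console.WriteLine($"Loaded {model.Name} ({model.GpuLayerCount} GPU layers).");
}
else
{
    Console.WriteLine("No predefined model scores above 0.7 on this hardware; consider a smaller model.");
}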
Backend Comparison
| Backend | NuGet Package | GPU Required | Best For |
|---|---|---|---|
| CPU (SSE4) | Included in LM-Kit.NET | No | Basic testing, low-end hardware |
| AVX2 | LM-Kit.NET.Backend.Avx2.Windows | No | Modern CPUs with AVX2 support |
| CUDA 12 | LM-Kit.NET.Backend.Cuda12.Windows / .Linux / .linux-arm64 | NVIDIA GPU | Production with NVIDIA GPUs |
| CUDA 13 | LM-Kit.NET.Backend.Cuda13.Windows | NVIDIA GPU | Latest NVIDIA GPUs (Windows only, Linux coming soon) |
| Vulkan | Included in LM-Kit.NET | Any GPU | Cross-platform GPU acceleration |
| Metal | Built into macOS package | Apple Silicon | macOS native GPU |
How to choose:
- NVIDIA GPU available: Use CUDA 12 (broadest compatibility) or CUDA 13 (latest GPUs, Windows only).
- AMD or Intel GPU: Vulkan is included in the base package and works automatically.
- No GPU, modern CPU: Use AVX2 for 2-3x speedup over the default SSE4 backend.
- macOS with Apple Silicon: Metal is included automatically; no extra package needed.
- Mixed fleet: Install a CUDA package. The SDK automatically falls back to Vulkan on machines without NVIDIA GPUs (CUDA 13 → CUDA 12 → Vulkan → CPU).
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| OutOfMemoryException on load | Model exceeds available VRAM | Reduce GpuLayerCount or use a smaller model |
| Slow generation (~1 tok/s) | Running on CPU without GPU backend | Install the appropriate CUDA or Vulkan NuGet package |
| CUDA not detected | Driver version mismatch | Update NVIDIA drivers; CUDA 12 requires driver 525.60+ |
| Vulkan crashes on startup | Outdated GPU drivers | Update GPU drivers to the latest version |
| Model loads but generation is slow | Too few layers offloaded to GPU | Increase GpuLayerCount until VRAM is nearly full |
| Performance score is 0 | Model requires capabilities the hardware lacks | Choose a model that fits within your VRAM budget |
Next Steps
- Load a Model and Generate Your First Response: get started with basic model loading and generation.
- Build a RAG Pipeline Over Your Own Documents: ground model responses in your own data.
- Add Telemetry and Observability: monitor inference performance in production.