# Configure GPU Backends
GPU acceleration is the single most impactful setting for inference performance. A model that takes several seconds per response on CPU can generate tokens in milliseconds when offloaded to a supported GPU. This guide explains how to install, enable, and verify each backend so you can get the best throughput from your hardware.
## TL;DR

```bash
# Install the base SDK (includes CPU, AVX, and Vulkan backends)
dotnet add package LM-Kit.NET

# For NVIDIA GPUs, add a CUDA package
dotnet add package LM-Kit.NET.Backend.Cuda12.Windows   # or .Linux / .linux-arm64
```

```csharp
using LMKit.Global;
using LMKit.Model;

// The SDK auto-selects the best backend: CUDA 13 → CUDA 12 → Vulkan → CPU
Runtime.Initialize();
Console.WriteLine($"Backend: {Runtime.Backend}"); // e.g. "Cuda12", "Vulkan", "Avx2"

using LM model = LM.LoadFromModelID("gemma3:4b");
```

- Vulkan is included in the base `LM-Kit.NET` package. No extra install needed.
- CUDA requires a separate backend package (see table below).
- Metal (macOS) is included automatically.
- If CUDA is installed but no NVIDIA GPU is found, the SDK falls back to Vulkan automatically.
## Prerequisites
| Requirement | Details |
|---|---|
| LM-Kit.NET | Installed via NuGet (LM-Kit.NET package) |
| .NET | .NET 8.0 or later (or .NET Standard 2.0 compatible) |
| GPU (optional) | NVIDIA GPU with CUDA 12+ drivers, or any GPU with Vulkan 1.2+ support |
| macOS (optional) | Apple Silicon or AMD GPU for Metal acceleration |
## Platform Support
| Platform | Architectures | Status |
|---|---|---|
| Windows | x64 | Fully supported |
| Windows | ARM64 | Coming soon |
| Linux | x64, ARM64 | Fully supported |
| macOS | Universal (Apple Silicon + Intel) | Fully supported |
## Available Backends
LM-Kit.NET selects the optimal backend automatically at startup and keeps it for the lifetime of the process. The table below lists every supported backend.
| Backend | Description | Platform | NuGet Package |
|---|---|---|---|
| CPU (SSE4) | Default fallback. Works on any modern x64 processor. | Windows, Linux | Included in LM-Kit.NET |
| AVX / AVX2 | Leverages wider SIMD registers for faster CPU inference. | Windows, Linux | Included in LM-Kit.NET |
| CUDA 12 | NVIDIA GPU acceleration using CUDA 12.x drivers. | Windows, Linux (x64 and ARM64) | LM-Kit.NET.Backend.Cuda12.Windows / LM-Kit.NET.Backend.Cuda12.Linux / LM-Kit.NET.Backend.Cuda12.linux-arm64 |
| CUDA 13 | NVIDIA GPU acceleration using CUDA 13.x drivers. | Windows only | LM-Kit.NET.Backend.Cuda13.Windows |
| Vulkan | Cross-platform GPU acceleration (NVIDIA, AMD, Intel). | Windows, Linux | Included in LM-Kit.NET |
| Metal | Apple GPU acceleration via the Metal API. | macOS | Included in LM-Kit.NET |
> Note: On macOS, Metal is enabled automatically when a compatible GPU is present. The `EnableCuda` and `EnableVulkan` properties have no effect on macOS.

> Note: CUDA 13 support for Linux is coming soon.
## Automatic Backend Selection
LM-Kit.NET automatically selects the best available backend at startup. You do not need to pick one manually. The SDK evaluates backends using three conditions:

- **Runtime files are available.** The matching NuGet backend package is installed.
- **The backend is not disabled in code.** `EnableCuda` and `EnableVulkan` have not been set to `false`.
- **A compatible device exists on the system.** At least one GPU can be loaded by the backend.
When all three conditions are met, the backend is selected. When multiple GPU backends qualify, the SDK follows this priority order:
CUDA 13 → CUDA 12 → Vulkan → CPU (AVX2 / AVX / SSE4)
Key behavior: If a CUDA backend package is installed but no NVIDIA GPU is found on the system, LM-Kit.NET automatically falls back to Vulkan (provided Vulkan is not explicitly disabled in code). This means you can safely install a CUDA backend package even on machines that may not have an NVIDIA GPU. The SDK will select the next best option.
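The selection outcome can be inspected at runtime. The following sketch uses only the `Runtime` members shown in this guide to log whether a GPU backend won the priority race or the SDK fell through to the CPU:

```csharp
using System;
using LMKit.Global;

// Inspect the result of automatic backend selection. If a CUDA package is
// installed but no NVIDIA GPU was found, Runtime.Backend will report the
// fallback (e.g. "Vulkan") rather than "Cuda12".
Runtime.Initialize();

if (Runtime.HasGpuSupport)
    Console.WriteLine($"GPU backend selected: {Runtime.Backend}");
else
    Console.WriteLine($"No GPU available, CPU backend selected: {Runtime.Backend}");
```

Because the backend is fixed for the lifetime of the process, logging this once at startup is usually enough for diagnostics.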
## Step 1: Install the Backend NuGet Package
The base LM-Kit.NET package includes CPU, AVX, and Vulkan support. For CUDA GPU acceleration you must add the matching CUDA backend package.
### CUDA 12 (NVIDIA)

```powershell
# Windows
Install-Package LM-Kit.NET.Backend.Cuda12.Windows

# Linux x64
Install-Package LM-Kit.NET.Backend.Cuda12.Linux

# Linux ARM64
Install-Package LM-Kit.NET.Backend.Cuda12.linux-arm64
```

### CUDA 13 (NVIDIA, Windows only)

```powershell
# Windows
Install-Package LM-Kit.NET.Backend.Cuda13.Windows
```
> Tip: You can install multiple CUDA backend packages in the same project. LM-Kit.NET will automatically select the highest-priority backend that matches your hardware. If no NVIDIA GPU is detected, the SDK falls back to Vulkan, then CPU.
## Step 2: Enable the Backend in Code
Backend flags must be set before the runtime is initialized. Once `Initialize()` has been called (or any operation triggers automatic initialization), these properties become immutable.
```csharp
using LMKit.Global;

// Enable CUDA (default is true, shown here for clarity)
Runtime.EnableCuda = true;

// Enable Vulkan (default is true)
Runtime.EnableVulkan = true;

// Initialize the runtime explicitly
Runtime.Initialize();

// Check which backend was selected
Console.WriteLine($"Active backend: {Runtime.Backend}");
```
To force CPU-only inference, disable both GPU backends before initialization:
```csharp
Runtime.EnableCuda = false;
Runtime.EnableVulkan = false;
Runtime.Initialize();

// Runtime.Backend will be CPU, Avx, or Avx2
Console.WriteLine($"Backend: {Runtime.Backend}");
```
## Step 3: Verify GPU Availability
After initialization you can query whether GPU offload is available and enumerate the detected devices.
```csharp
using LMKit.Global;
using LMKit.Hardware.Gpu;

Runtime.Initialize();

Console.WriteLine($"Backend: {Runtime.Backend}");
Console.WriteLine($"GPU support: {Runtime.HasGpuSupport}");
Console.WriteLine($"GPU devices: {GpuDeviceInfo.Devices.Count}");

foreach (var device in GpuDeviceInfo.Devices)
{
    double totalGB = device.TotalMemorySize / (1024.0 * 1024.0 * 1024.0);
    double freeGB = device.FreeMemorySize / (1024.0 * 1024.0 * 1024.0);

    Console.WriteLine($"  [{device.DeviceNumber}] {device.DeviceName}");
    Console.WriteLine($"      Type: {device.DeviceType}");
    Console.WriteLine($"      Total VRAM: {totalGB:F1} GB");
    Console.WriteLine($"      Free VRAM: {freeGB:F1} GB");
}
```
## Step 4: Control GPU Layer Offloading

By default, LM-Kit.NET offloads all model layers to the GPU when GPU support is available. You can limit the number of offloaded layers through `LM.DeviceConfiguration`. This is useful when a model is too large to fit entirely in VRAM.
```csharp
using LMKit.Model;

// Offload all layers (default behavior)
using LM model = LM.LoadFromModelID("gemma3:4b");
Console.WriteLine($"GPU layers: {model.GpuLayerCount}");

// Offload only 20 layers to save VRAM
using LM partialGpu = LM.LoadFromModelID("gemma3:12b",
    deviceConfiguration: new LM.DeviceConfiguration
    {
        GpuLayerCount = 20
    });
Console.WriteLine($"GPU layers (partial): {partialGpu.GpuLayerCount}");
```
### Choosing the Right Layer Count

- `int.MaxValue` (default): Offload everything. LM-Kit.NET will automatically reduce the count if VRAM is insufficient.
- A specific number (e.g., 20): Useful when you want to share VRAM with other processes or run multiple models concurrently.
- `0`: Force all computation to the CPU, even when a GPU backend is active.
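The guidance above can be sketched as a small heuristic. Note that `EstimateGpuLayerCount` and its proportional-split assumption are hypothetical illustrations, not LM-Kit.NET APIs; the SDK already trims the count automatically when VRAM runs short, so this is only useful when you want to reserve headroom yourself:

```csharp
// Hypothetical helper (not an LM-Kit.NET API): pick a GpuLayerCount from the
// free VRAM reported by GpuDeviceInfo and a rough model size in bytes.
static int EstimateGpuLayerCount(long freeVramBytes, int totalLayers, long modelBytes)
{
    if (freeVramBytes <= 0)
        return 0;                 // no usable VRAM: force CPU
    if (modelBytes <= freeVramBytes)
        return int.MaxValue;      // whole model fits: request full offload
    // Assume layers are roughly equal in size and offload the fraction that fits.
    return (int)(totalLayers * ((double)freeVramBytes / modelBytes));
}
```

The result can then be passed as `GpuLayerCount` in an `LM.DeviceConfiguration`, as shown in the Step 4 code above.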
## Step 5: Multi-GPU Configuration

When multiple GPUs are detected, LM-Kit.NET selects the device with the most available memory as the main GPU. You can override this by specifying a `GpuDeviceInfo` instance.
```csharp
using LMKit.Hardware.Gpu;
using LMKit.Model;

// Pick a specific GPU by device number
var device = GpuDeviceInfo.GetDeviceFromNumber(1);

using LM model = LM.LoadFromModelID("gemma3:12b",
    deviceConfiguration: new LM.DeviceConfiguration(device));
Console.WriteLine($"Running on: {device.DeviceName}");
```
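If you prefer to apply the default "most free VRAM wins" policy explicitly, for example to log which device was chosen, the device list from Step 3 can be sorted by hand. A sketch, assuming at least one GPU was detected:

```csharp
using System;
using System.Linq;
using LMKit.Hardware.Gpu;
using LMKit.Model;

// Reproduce the default selection policy explicitly: pick the detected
// device with the most free memory, using the GpuDeviceInfo members
// shown in Step 3. Throws if no GPU device is present.
var best = GpuDeviceInfo.Devices
    .OrderByDescending(d => d.FreeMemorySize)
    .First();
Console.WriteLine($"Choosing [{best.DeviceNumber}] {best.DeviceName}");

using LM model = LM.LoadFromModelID("gemma3:12b",
    deviceConfiguration: new LM.DeviceConfiguration(best));
```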
## When to Use CPU vs. GPU
| Scenario | Recommendation |
|---|---|
| Small models (under 3B parameters) | CPU or AVX2 is often fast enough. |
| Medium models (4B to 8B parameters) | GPU recommended. A card with 6+ GB VRAM provides smooth inference. |
| Large models (12B+ parameters) | GPU strongly recommended. 10+ GB VRAM for full offload. |
| No dedicated GPU available | CPU with AVX2 gives the best software-only performance. |
| Shared GPU environment | Use partial layer offloading to limit VRAM consumption. |
| Batch processing or server workloads | CUDA typically provides the highest throughput for NVIDIA hardware. |
| Cross-vendor GPU support | Vulkan works across NVIDIA, AMD, and Intel GPUs. No extra package needed. |
| Mixed deployment (some machines with NVIDIA GPUs, some without) | Install a CUDA package. The SDK falls back to Vulkan automatically on machines without NVIDIA GPUs. |
## Troubleshooting

| Problem | Solution |
|---|---|
| `Runtime.Backend` shows CPU despite installing a GPU package | Verify that the correct NuGet backend package is installed and that your GPU drivers meet the minimum version. Check that `EnableCuda` and `EnableVulkan` are not set to `false`. |
| `HasGpuSupport` is `false` | The GPU driver may be outdated, or `EnableCuda` / `EnableVulkan` was set to `false` before initialization. |
| Out-of-memory when loading a large model | Lower `GpuLayerCount` to offload fewer layers, or switch to a smaller quantization variant. |
| `InvalidOperationException` when setting `EnableCuda` | The runtime has already been initialized. Backend flags must be set before any LM-Kit.NET operation. |
| Slow inference despite GPU backend | Ensure the model actually has GPU layers. Check `model.GpuLayerCount` after loading. |
| CUDA installed but Vulkan is selected | No NVIDIA GPU was detected. The SDK automatically fell back to Vulkan. Install NVIDIA drivers or verify that the GPU is recognized by the system. |
## Quick Reference
```csharp
using LMKit.Global;
using LMKit.Hardware.Gpu;
using LMKit.Model;

// 1. Configure backend (before initialization)
Runtime.EnableCuda = true;
Runtime.Initialize();

// 2. Verify
Console.WriteLine($"Backend: {Runtime.Backend}");
Console.WriteLine($"GPU support: {Runtime.HasGpuSupport}");

// 3. Load model with GPU acceleration
using LM model = LM.LoadFromModelID("gemma3:4b",
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}%"); return true; });
Console.WriteLine($"\nGPU layers: {model.GpuLayerCount}");
```
## Next Steps
- Distributed Inference Across Multiple GPUs: split large models across multiple GPUs for better throughput.
- Understanding Model Loading and Caching: learn about download behavior, caching, and model properties.
- Your First AI Agent: build a working agent with tools.
- Choosing the Right Model: select the best model for your hardware and use case.
- Configure GPU Backends and Optimize Performance (How-To): advanced tuning and performance optimization.