Distributed Inference Across Multiple GPUs

When a model is too large for a single GPU, or when you want to maximize throughput on a multi-GPU system, LM-Kit.NET can split the model across all available GPUs automatically. This guide explains how distributed inference works, when to enable it, and how to configure it.


TL;DR

using LMKit.Global;
using LMKit.Model;

// Enable multi-GPU distribution (set before loading any model)
Configuration.FavorDistributedInference = true;

// Load a large model. Layers are split across all detected GPUs.
using LM model = LM.LoadFromModelID("gemma3:27b");

  • Default is off. Single-GPU avoids inter-GPU communication overhead.
  • Enable it when your model exceeds the VRAM of a single GPU, or you have multiple GPUs and want to maximize throughput.
  • LM-Kit Server enables it by default via "EnableDistributedInference": true in appsettings.json.

What Is Distributed Inference?

Distributed inference splits a model's tensor computations across multiple GPUs instead of running everything on a single device. The underlying engine divides the model layers (or individual tensor rows) among the available GPUs, so each device processes a portion of every inference pass.

Single GPU (default): All model layers run on one GPU. Inter-GPU communication overhead is zero, which is optimal for models that fit comfortably in a single GPU's VRAM.

Distributed across multiple GPUs: Model layers are spread across two or more GPUs. This lets you load models that exceed the VRAM of any single card, and can improve throughput on large models by parallelizing computation.


When to Use Distributed Inference

  • Model fits in a single GPU: Keep distributed inference disabled (default). Single-GPU avoids inter-GPU communication overhead.
  • Model exceeds single GPU VRAM: Enable distributed inference to split the model across multiple GPUs.
  • Server with multiple GPUs running large models: Enable distributed inference for maximum utilization.
  • GPUs connected via a slow bus (e.g., PCIe without NVLink): Test both modes. The communication overhead may negate the parallelism benefit.
  • Small models (under 8B parameters): Keep disabled. Small models rarely benefit from multi-GPU distribution.
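The guidance above can be reduced to a simple rule of thumb: distribute only when the model's weights exceed the free VRAM of the largest single GPU. The sketch below is illustrative, not part of the LM-Kit.NET API; in a real application you would take the per-GPU free memory from GpuDeviceInfo.Devices rather than hard-coding it, and estimate the model size from its quantized file size.

```csharp
using System;
using System.Linq;

const long GB = 1024L * 1024 * 1024;

// Hypothetical heuristic: enable distribution only when the model does not
// fit on the single GPU with the most free VRAM (and at least two GPUs exist).
bool ShouldDistribute(long modelBytes, long[] freeVramPerGpu)
{
    if (freeVramPerGpu.Length < 2)
        return false;                      // nothing to distribute across
    return modelBytes > freeVramPerGpu.Max();
}

// A ~16 GB quantized model on two 24 GB cards fits on one card: stay single-GPU.
Console.WriteLine(ShouldDistribute(16 * GB, new[] { 24 * GB, 24 * GB })); // False
// A ~30 GB model on the same pair exceeds any single card: distribute.
Console.WriteLine(ShouldDistribute(30 * GB, new[] { 24 * GB, 24 * GB })); // True
```

The result would then feed Configuration.FavorDistributedInference before the model is loaded.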

Prerequisites

  • LM-Kit.NET: Installed via NuGet (LM-Kit.NET package)
  • .NET: .NET 8.0 or later
  • Multiple GPUs: Two or more GPUs detected by the active backend (CUDA, Vulkan, or Metal)

Step 1: Enable Distributed Inference

Set Configuration.FavorDistributedInference to true before loading any model. This is a global setting that affects all subsequent model loads.

using LMKit.Global;
using LMKit.Model;

// Enable distributed inference across all available GPUs
Configuration.FavorDistributedInference = true;

// Load a large model. The SDK will split it across available GPUs.
using LM model = LM.LoadFromModelID("gemma3:27b",
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}%"); return true; });

Console.WriteLine($"\nModel loaded: {model.Name}");
Console.WriteLine($"GPU layers: {model.GpuLayerCount}");

When FavorDistributedInference is true, the SDK distributes model layers evenly across all detected GPUs using the backend's default splitting strategy.

When FavorDistributedInference is false (the default), all computation runs on a single GPU, which is the one with the most available memory.


Step 2: Verify Multi-GPU Detection

Before enabling distributed inference, verify that multiple GPUs are detected:

using LMKit.Global;
using LMKit.Hardware.Gpu;

Runtime.Initialize();

Console.WriteLine($"Backend:     {Runtime.Backend}");
Console.WriteLine($"GPU support: {Runtime.HasGpuSupport}");
Console.WriteLine($"GPU devices: {GpuDeviceInfo.Devices.Count}");

foreach (var device in GpuDeviceInfo.Devices)
{
    double totalGB = device.TotalMemorySize / (1024.0 * 1024.0 * 1024.0);
    double freeGB  = device.FreeMemorySize  / (1024.0 * 1024.0 * 1024.0);

    Console.WriteLine($"  [{device.DeviceNumber}] {device.DeviceName}");
    Console.WriteLine($"      Total VRAM:  {totalGB:F1} GB");
    Console.WriteLine($"      Free VRAM:   {freeGB:F1} GB");
}

if (GpuDeviceInfo.Devices.Count < 2)
{
    Console.WriteLine("\nOnly one GPU detected. Distributed inference has no effect with a single GPU.");
}

Distributed inference requires at least two GPUs. With a single GPU, the setting has no observable effect.


Step 3: Combine with Device Configuration

You can combine distributed inference with other device configuration options such as GpuLayerCount and MainGpu:

using LMKit.Global;
using LMKit.Hardware.Gpu;
using LMKit.Model;

// Enable distributed inference
Configuration.FavorDistributedInference = true;

// Choose a specific GPU as the main device (handles scratch/small tensors)
var mainDevice = GpuDeviceInfo.GetDeviceFromNumber(0);

using LM model = LM.LoadFromModelID("gemma3:27b",
    deviceConfiguration: new LM.DeviceConfiguration(mainDevice)
    {
        GpuLayerCount = int.MaxValue  // Offload all layers (default)
    },
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}%"); return true; });

Console.WriteLine($"\nMain GPU: {mainDevice.DeviceName}");
Console.WriteLine($"GPU layers: {model.GpuLayerCount}");

  • MainGpu: Designates which GPU handles scratch buffers and small tensors. By default, the GPU with the most free memory is selected.
  • GpuLayerCount: Controls the total number of layers offloaded to GPU(s). With distributed inference, these layers are split across all GPUs rather than placed on a single device.

How It Works Internally

LM-Kit.NET uses a tensor distribution mechanism built on top of the llama.cpp inference engine. Here is what happens when a model is loaded with distributed inference enabled:

  1. Device detection. The SDK enumerates all GPUs available through the active backend (CUDA, Vulkan, or Metal).

  2. Split computation. The model layers are divided across the detected GPUs proportionally. Each GPU receives a subset of the model's transformer layers and the corresponding portion of the KV cache.

  3. Inference execution. During each forward pass, each GPU processes its assigned layers. Intermediate results are transferred between GPUs as needed.

The SDK supports up to 16 GPUs simultaneously.
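The proportional split in step 2 can be sketched as follows. This is an illustration of the idea only, not the engine's actual algorithm: it assigns a hypothetical 32 transformer layers across two GPUs in proportion to each GPU's free VRAM, with the last GPU taking the remainder so every layer is placed exactly once.

```csharp
using System;

// Illustrative layer assignment: split layers proportionally to free VRAM.
int totalLayers = 32;
long[] freeVram = { 24L << 30, 8L << 30 };   // GPU 0: 24 GB, GPU 1: 8 GB

long vramSum = 0;
foreach (long v in freeVram) vramSum += v;

int[] layersPerGpu = new int[freeVram.Length];
int assigned = 0;
for (int i = 0; i < freeVram.Length - 1; i++)
{
    layersPerGpu[i] = (int)(totalLayers * freeVram[i] / vramSum);
    assigned += layersPerGpu[i];
}
layersPerGpu[^1] = totalLayers - assigned;   // last GPU absorbs rounding

Console.WriteLine(string.Join(", ", layersPerGpu));   // 24, 8
```

Each GPU would also hold the slice of the KV cache that belongs to its assigned layers, which is why the cache grows on every device involved in the split.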

Layer Splitting vs. Row Splitting

The distribution engine supports two modes internally:

  • Layer splitting (default): Different transformer layers are assigned to different GPUs. GPU 0 might process layers 0 through 15, while GPU 1 processes layers 16 through 31. This is conceptually similar to pipeline parallelism.

  • Row splitting: Individual tensor rows within each layer are divided across GPUs, so every GPU participates in computing every layer. This is conceptually similar to tensor parallelism.

The SDK selects the appropriate mode automatically based on the hardware configuration.
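The difference between the two modes can be made concrete with a toy ownership function for each. The numbers below (32 layers, 4096 rows, 2 GPUs) are hypothetical and the even contiguous split is a simplification; the real engine's assignment may differ.

```csharp
using System;

int gpuCount = 2, layers = 32, rowsPerLayer = 4096;

// Layer splitting: each GPU owns a contiguous range of whole layers
// (pipeline-style), so a given layer lives on exactly one GPU.
int GpuForLayer(int layer) => layer * gpuCount / layers;

// Row splitting: each GPU owns a slice of rows inside *every* layer
// (tensor-style), so all GPUs participate in computing each layer.
int GpuForRow(int row) => row * gpuCount / rowsPerLayer;

Console.WriteLine(GpuForLayer(10));   // 0 — layers 0–15 sit on GPU 0
Console.WriteLine(GpuForLayer(20));   // 1 — layers 16–31 sit on GPU 1
Console.WriteLine(GpuForRow(3000));   // 1 — rows 2048–4095 sit on GPU 1
```

Layer splitting moves activations between GPUs only at layer boundaries, while row splitting exchanges partial results within each layer, which is why row splitting is more sensitive to interconnect bandwidth.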


LM-Kit Server Configuration

LM-Kit Server enables distributed inference by default. You can control this through the appsettings.json configuration file:

{
  "Hardware": {
    "EnableCuda": true,
    "EnableDistributedInference": true,
    "MainGpu": -1
  }
}

  • EnableDistributedInference (bool, default true): Enables distributed inference for all models loaded by the server.
  • MainGpu (int, default -1): GPU device index for the main GPU. -1 means auto-select (the GPU with the most free memory).

Quick Reference

using LMKit.Global;
using LMKit.Hardware.Gpu;
using LMKit.Model;

// 1. Enable distributed inference (before loading models)
Configuration.FavorDistributedInference = true;

// 2. Initialize the runtime
Runtime.Initialize();

// 3. Verify multiple GPUs are detected
Console.WriteLine($"Backend: {Runtime.Backend}");
Console.WriteLine($"GPUs:    {GpuDeviceInfo.Devices.Count}");

// 4. Load a large model (layers are split across GPUs)
using LM model = LM.LoadFromModelID("gemma3:27b",
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}%"); return true; });

Console.WriteLine($"\nGPU layers: {model.GpuLayerCount}");

Troubleshooting

  • Model still loads on a single GPU: Verify that Configuration.FavorDistributedInference = true is set before any model load. Check that multiple GPUs are detected with GpuDeviceInfo.Devices.Count.
  • Out-of-memory with distributed inference: The combined VRAM across all GPUs may still be insufficient. Lower GpuLayerCount to keep some layers on the CPU.
  • Slower performance with distributed inference: Inter-GPU communication adds overhead. For small models that fit on a single GPU, disable distributed inference.
  • Only one GPU detected: Verify that all GPUs have up-to-date drivers and are visible to the active backend. On CUDA, run nvidia-smi to check.
