Distributed Inference Across Multiple GPUs

When a model is too large for a single GPU, or when you want to maximize throughput on a multi-GPU system, LM-Kit.NET can split the model across all available GPUs automatically. This guide explains how distributed inference works, when to enable it, and how to configure it.


TL;DR

using LMKit.Global;
using LMKit.Model;

// Enable multi-GPU distribution (set before loading any model)
Configuration.FavorDistributedInference = true;

// Load a large model. Layers are split across all detected GPUs.
using LM model = LM.LoadFromModelID("gemma3:27b");

  • Default is off. Single-GPU avoids inter-GPU communication overhead.
  • Enable it when your model exceeds the VRAM of a single GPU, or you have multiple GPUs and want to maximize throughput.
  • LM-Kit Server enables it by default via "EnableDistributedInference": true in appsettings.json.

What Is Distributed Inference?

Distributed inference splits a model's tensor computations across multiple GPUs instead of running everything on a single device. The underlying engine divides the model layers (or individual tensor rows) among the available GPUs, so each device processes a portion of every inference pass.

Single GPU (default): All model layers run on one GPU. Inter-GPU communication overhead is zero, which is optimal for models that fit comfortably in a single GPU's VRAM.

Distributed across multiple GPUs: Model layers are spread across two or more GPUs. This lets you load models that exceed the VRAM of any single card, and can improve throughput on large models by parallelizing computation.


When to Use Distributed Inference

  • Model fits in a single GPU: Keep distributed inference disabled (default). Single-GPU avoids inter-GPU communication overhead.
  • Model exceeds single GPU VRAM: Enable distributed inference to split the model across multiple GPUs.
  • Server with multiple GPUs running large models: Enable distributed inference for maximum utilization.
  • GPUs connected via a slow bus (e.g., PCIe without NVLink): Test both modes. The communication overhead may negate the parallelism benefit.
  • Small models (under 8B parameters): Keep disabled. Small models rarely benefit from multi-GPU distribution.
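The guidance above can be reduced to a simple rule of thumb: distribute only when the model's weights exceed the free VRAM of the largest single GPU. The sketch below is illustrative, not part of the LM-Kit.NET API; in a real application you would take the per-GPU free memory from GpuDeviceInfo.Devices rather than hard-coding it, and estimate the model size from its quantized file size.

```csharp
using System;
using System.Linq;

const long GB = 1024L * 1024 * 1024;

// Hypothetical heuristic: enable distribution only when the model does not
// fit on the single GPU with the most free VRAM (and at least two GPUs exist).
bool ShouldDistribute(long modelBytes, long[] freeVramPerGpu)
{
    if (freeVramPerGpu.Length < 2)
        return false;                      // nothing to distribute across
    return modelBytes > freeVramPerGpu.Max();
}

// A ~16 GB quantized model on two 24 GB cards fits on one card: stay single-GPU.
Console.WriteLine(ShouldDistribute(16 * GB, new[] { 24 * GB, 24 * GB })); // False
// A ~30 GB model on the same pair exceeds any single card: distribute.
Console.WriteLine(ShouldDistribute(30 * GB, new[] { 24 * GB, 24 * GB })); // True
```

The result would then feed Configuration.FavorDistributedInference before the model is loaded.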

Prerequisites

  • LM-Kit.NET: Installed via NuGet (LM-Kit.NET package)
  • .NET: .NET 8.0 or later
  • Multiple GPUs: Two or more GPUs detected by the active backend (CUDA, Vulkan, or Metal)

Step 1: Enable Distributed Inference

Set Configuration.FavorDistributedInference to true before loading any model. This is a global setting that affects all subsequent model loads.

using LMKit.Global;
using LMKit.Model;

// Enable distributed inference across all available GPUs
Configuration.FavorDistributedInference = true;

// Load a large model. The SDK will split it across available GPUs.
using LM model = LM.LoadFromModelID("gemma3:27b",
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}%"); return true; });

Console.WriteLine($"\nModel loaded: {model.Name}");
Console.WriteLine($"GPU layers: {model.GpuLayerCount}");

When FavorDistributedInference is true, the SDK distributes model layers evenly across all detected GPUs using the backend's default splitting strategy.

When FavorDistributedInference is false (the default), all computation runs on a single GPU, which is the one with the most available memory.


Step 2: Verify Multi-GPU Detection

Before enabling distributed inference, verify that multiple GPUs are detected:

using LMKit.Global;
using LMKit.Hardware.Gpu;

Runtime.Initialize();

Console.WriteLine($"Backend:     {Runtime.Backend}");
Console.WriteLine($"GPU support: {Runtime.HasGpuSupport}");
Console.WriteLine($"GPU devices: {GpuDeviceInfo.Devices.Count}");

foreach (var device in GpuDeviceInfo.Devices)
{
    double totalGB = device.TotalMemorySize / (1024.0 * 1024.0 * 1024.0);
    double freeGB  = device.FreeMemorySize  / (1024.0 * 1024.0 * 1024.0);

    Console.WriteLine($"  [{device.DeviceNumber}] {device.DeviceName}");
    Console.WriteLine($"      Total VRAM:  {totalGB:F1} GB");
    Console.WriteLine($"      Free VRAM:   {freeGB:F1} GB");
}

if (GpuDeviceInfo.Devices.Count < 2)
{
    Console.WriteLine("\nOnly one GPU detected. Distributed inference has no effect with a single GPU.");
}

Distributed inference requires at least two GPUs. With a single GPU, the setting has no observable effect.


Step 3: Combine with Device Configuration

You can combine distributed inference with other device configuration options such as GpuLayerCount and MainGpu:

using LMKit.Global;
using LMKit.Hardware.Gpu;
using LMKit.Model;

// Enable distributed inference
Configuration.FavorDistributedInference = true;

// Choose a specific GPU as the main device (handles scratch/small tensors)
var mainDevice = GpuDeviceInfo.GetDeviceFromNumber(0);

using LM model = LM.LoadFromModelID("gemma3:27b",
    deviceConfiguration: new LM.DeviceConfiguration(mainDevice)
    {
        GpuLayerCount = int.MaxValue  // Offload all layers (default)
    },
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}%"); return true; });

Console.WriteLine($"\nMain GPU: {mainDevice.DeviceName}");
Console.WriteLine($"GPU layers: {model.GpuLayerCount}");

  • MainGpu: Designates which GPU handles scratch buffers and small tensors. By default, the GPU with the most free memory is selected.
  • GpuLayerCount: Controls the total number of layers offloaded to GPU(s). With distributed inference, these layers are split across all GPUs rather than placed on a single device.

How It Works Internally

LM-Kit.NET uses a tensor distribution mechanism built on top of the llama.cpp inference engine. Here is what happens when a model is loaded with distributed inference enabled:

  1. Device detection. The SDK enumerates all GPUs available through the active backend (CUDA, Vulkan, or Metal).

  2. Split computation. The model layers are divided across the detected GPUs proportionally. Each GPU receives a subset of the model's transformer layers and the corresponding portion of the KV cache.

  3. Inference execution. During each forward pass, each GPU processes its assigned layers. Intermediate results are transferred between GPUs as needed.

The SDK supports up to 16 GPUs simultaneously.
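The proportional split in step 2 can be sketched as follows. This is an illustration of the idea only, not the engine's actual algorithm: it assigns a hypothetical 32 transformer layers across two GPUs in proportion to each GPU's free VRAM, with the last GPU taking the remainder so every layer is placed exactly once.

```csharp
using System;

// Illustrative layer assignment: split layers proportionally to free VRAM.
int totalLayers = 32;
long[] freeVram = { 24L << 30, 8L << 30 };   // GPU 0: 24 GB, GPU 1: 8 GB

long vramSum = 0;
foreach (long v in freeVram) vramSum += v;

int[] layersPerGpu = new int[freeVram.Length];
int assigned = 0;
for (int i = 0; i < freeVram.Length - 1; i++)
{
    layersPerGpu[i] = (int)(totalLayers * freeVram[i] / vramSum);
    assigned += layersPerGpu[i];
}
layersPerGpu[^1] = totalLayers - assigned;   // last GPU absorbs rounding

Console.WriteLine(string.Join(", ", layersPerGpu));   // 24, 8
```

Each GPU would also hold the slice of the KV cache that belongs to its assigned layers, which is why the cache grows on every device involved in the split.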

Layer Splitting vs. Row Splitting

The distribution engine supports two modes internally:

  • Layer splitting (default): Different transformer layers are assigned to different GPUs. GPU 0 might process layers 0 through 15, while GPU 1 processes layers 16 through 31. This is conceptually similar to pipeline parallelism.

  • Row splitting: Individual tensor rows within each layer are divided across GPUs, so every GPU participates in computing every layer. This is conceptually similar to tensor parallelism.

The SDK selects the appropriate mode automatically based on the hardware configuration.
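The difference between the two modes can be made concrete with a toy ownership function for each. The numbers below (32 layers, 4096 rows, 2 GPUs) are hypothetical and the even contiguous split is a simplification; the real engine's assignment may differ.

```csharp
using System;

int gpuCount = 2, layers = 32, rowsPerLayer = 4096;

// Layer splitting: each GPU owns a contiguous range of whole layers
// (pipeline-style), so a given layer lives on exactly one GPU.
int GpuForLayer(int layer) => layer * gpuCount / layers;

// Row splitting: each GPU owns a slice of rows inside *every* layer
// (tensor-style), so all GPUs participate in computing each layer.
int GpuForRow(int row) => row * gpuCount / rowsPerLayer;

Console.WriteLine(GpuForLayer(10));   // 0 — layers 0–15 sit on GPU 0
Console.WriteLine(GpuForLayer(20));   // 1 — layers 16–31 sit on GPU 1
Console.WriteLine(GpuForRow(3000));   // 1 — rows 2048–4095 sit on GPU 1
```

Layer splitting moves activations between GPUs only at layer boundaries, while row splitting exchanges partial results within each layer, which is why row splitting is more sensitive to interconnect bandwidth.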


LM-Kit Server Configuration

LM-Kit Server enables distributed inference by default. You can control this through the appsettings.json configuration file:

{
  "Hardware": {
    "EnableCuda": true,
    "EnableDistributedInference": true,
    "MainGpu": -1
  }
}

  • EnableDistributedInference (bool, default true): Enables distributed inference for all models loaded by the server.
  • MainGpu (int, default -1): GPU device index for the main GPU. -1 means auto-select (the GPU with the most free memory).

Quick Reference

using LMKit.Global;
using LMKit.Hardware.Gpu;
using LMKit.Model;

// 1. Enable distributed inference (before loading models)
Configuration.FavorDistributedInference = true;

// 2. Initialize the runtime
Runtime.Initialize();

// 3. Verify multiple GPUs are detected
Console.WriteLine($"Backend: {Runtime.Backend}");
Console.WriteLine($"GPUs:    {GpuDeviceInfo.Devices.Count}");

// 4. Load a large model (layers are split across GPUs)
using LM model = LM.LoadFromModelID("gemma3:27b",
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}%"); return true; });

Console.WriteLine($"\nGPU layers: {model.GpuLayerCount}");

Troubleshooting

  • Model still loads on a single GPU: Verify that Configuration.FavorDistributedInference = true is set before any model load. Check that multiple GPUs are detected with GpuDeviceInfo.Devices.Count.
  • Out-of-memory with distributed inference: The combined VRAM across all GPUs may still be insufficient. Lower GpuLayerCount to keep some layers on the CPU.
  • Slower performance with distributed inference: Inter-GPU communication adds overhead. For small models that fit on a single GPU, disable distributed inference.
  • Only one GPU detected: Verify that all GPUs have up-to-date drivers and are visible to the active backend. On CUDA, run nvidia-smi to check.
