Distribute Large Models Across Multiple GPUs
When a model is too large for a single GPU's VRAM, you can split it across multiple GPUs so that each device handles a portion of the computation. LM-Kit.NET provides automatic multi-GPU distribution through global configuration and per-model device settings. This tutorial shows how to enumerate your GPUs, enable distributed inference, and control which GPU handles the primary workload.
Why Multi-GPU Distribution Matters
Two real-world problems that multi-GPU inference solves:
- Running large models that exceed single-GPU memory. A 27B-parameter model at Q4 quantization needs roughly 16 GB of VRAM. If your workstation has two 12 GB GPUs, splitting the model across both lets you run it without falling back to CPU layers (see the rough estimate after this list).
- Maximizing throughput on multi-GPU workstations. Data centers and workstations with multiple GPUs can parallelize layer computation, reducing time-to-first-token for large models.
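The ~16 GB figure above is a rule of thumb rather than an exact number. A back-of-the-envelope sketch, assuming roughly 4.5 bits per weight for Q4-class quantization and about 1 GB of headroom for the KV cache and scratch buffers:
// Rough VRAM estimate for a quantized model (rule of thumb, not an exact figure).
double parameters = 27e9;     // 27B parameters
double bitsPerWeight = 4.5;   // assumed average for Q4-class quantization
double overheadGb = 1.0;      // assumed headroom for KV cache and scratch buffers
double weightsGb = parameters * bitsPerWeight / 8 / 1e9;    // ≈ 15.2 GB
Console.WriteLine($"Estimated VRAM: {weightsGb + overheadGb:F1} GB");   // ≈ 16.2 GB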
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| GPUs | 2+ CUDA or Vulkan GPUs |
| VRAM | Combined VRAM must exceed model size |
| LM-Kit.NET backend | CUDA or Vulkan (CPU-only builds do not support multi-GPU) |
Step 1: Create the Project
dotnet new console -n MultiGpuDemo
cd MultiGpuDemo
dotnet add package LM-Kit.NET
Step 2: Understand the Multi-GPU Architecture
┌─────────────────┐      ┌─────────────────┐
│      GPU 0      │      │      GPU 1      │
│  Layers 0..15   │      │  Layers 16..31  │
│  (scratch GPU)  │      │                 │
└────────┬────────┘      └────────┬────────┘
         │                        │
         └───────────┬────────────┘
                      │
               ┌──────┴──────┐
               │  LM Engine  │
               │  (unified)  │
               └─────────────┘
| Component | Purpose |
|---|---|
| `GpuDeviceInfo.Devices` | Lists all available GPUs with memory information |
| `LM.DeviceConfiguration` | Per-model GPU settings (main GPU, layer count) |
| `Configuration.FavorDistributedInference` | Global toggle for automatic multi-GPU splitting |
| `LM.GpuLayerCount` | Read-only count of layers loaded into VRAM |
| `LM.MainGpu` | Read-only ID of the primary GPU for scratch operations |
Step 3: Write the Program
using System.Text;
using LMKit.Global;
using LMKit.Hardware.Gpu;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.TextGeneration.Chat;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Enumerate available GPUs
// ──────────────────────────────────────
Console.WriteLine("Available GPU devices:\n");
foreach (GpuDeviceInfo gpu in GpuDeviceInfo.Devices)
{
double totalGb = gpu.TotalMemorySize / (1024.0 * 1024 * 1024);
double freeGb = gpu.FreeMemorySize / (1024.0 * 1024 * 1024);
Console.WriteLine($" GPU {gpu.DeviceNumber}: {gpu.DeviceName}");
Console.WriteLine($" Type: {gpu.DeviceType}");
Console.WriteLine($" VRAM: {freeGb:F1} GB free / {totalGb:F1} GB total");
Console.WriteLine();
}
if (GpuDeviceInfo.Devices.Count < 2)
{
Console.WriteLine("This demo requires at least 2 GPUs. Exiting.");
return;
}
// ──────────────────────────────────────
// 2. Enable distributed inference
// ──────────────────────────────────────
// When enabled, the runtime automatically distributes model layers
// across all available GPUs instead of loading onto a single device.
Configuration.FavorDistributedInference = true;
Console.WriteLine("Distributed inference: ENABLED\n");
// ──────────────────────────────────────
// 3. Configure the primary GPU
// ──────────────────────────────────────
// The main GPU handles scratch buffers and small tensors.
// By default, LM-Kit selects the GPU with the most free memory.
// You can override this if you want a specific device.
var deviceConfig = new LM.DeviceConfiguration
{
MainGpu = GpuDeviceInfo.Devices[0].DeviceNumber,
GpuLayerCount = int.MaxValue // offload as many layers as VRAM allows
};
Console.WriteLine($"Main GPU set to device {deviceConfig.MainGpu}\n");
// ──────────────────────────────────────
// 4. Load a large model with multi-GPU distribution
// ──────────────────────────────────────
Console.WriteLine("Loading model (distributed across GPUs)...");
using LM model = LM.LoadFromModelID("gemma3:27b",
deviceConfiguration: deviceConfig,
downloadingProgress: (path, contentLength, bytesRead) =>
{
if (contentLength.HasValue)
Console.Write($"\r Downloading: {(double)bytesRead / contentLength.Value * 100:F1}% ");
return true;
},
loadingProgress: progress =>
{
Console.Write($"\r Loading: {progress * 100:F0}% ");
return true;
});
Console.WriteLine($"\n Model loaded successfully.");
Console.WriteLine($" GPU layers: {model.GpuLayerCount}");
Console.WriteLine($" Main GPU: {model.MainGpu}");
Console.WriteLine($" Context: {model.ContextLength} tokens\n");
// ──────────────────────────────────────
// 5. Run inference
// ──────────────────────────────────────
var chat = new SingleTurnConversation(model)
{
SystemPrompt = "You are a helpful assistant. Keep answers concise.",
MaximumCompletionTokens = 256
};
Console.WriteLine("Multi-GPU inference ready. Type a prompt (or 'quit' to exit):\n");
while (true)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.Write("You: ");
Console.ResetColor();
string? input = Console.ReadLine();
if (string.IsNullOrWhiteSpace(input) || input.Equals("quit", StringComparison.OrdinalIgnoreCase))
break;
chat.AfterTextCompletion += OnTextGenerated;
Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write("Assistant: ");
Console.ResetColor();
var result = chat.Submit(input);
Console.WriteLine($"\n [{result.GeneratedTokenCount} tokens, {result.TokenGenerationRate:F1} tok/s]\n");
chat.AfterTextCompletion -= OnTextGenerated;
}
static void OnTextGenerated(object? sender, LMKit.TextGeneration.Events.AfterTextCompletionEventArgs e)
{
if (e.SegmentType == TextSegmentType.UserVisible)
Console.Write(e.Text);
}
Step 4: Run the Demo
dotnet run
Expected output on a dual-GPU system:
Available GPU devices:
GPU 0: NVIDIA RTX 4090
Type: Cuda
VRAM: 22.1 GB free / 24.0 GB total
GPU 1: NVIDIA RTX 3090
Type: Cuda
VRAM: 22.8 GB free / 24.0 GB total
Distributed inference: ENABLED
Main GPU set to device 0
Loading model (distributed across GPUs)...
Loading: 100%
Model loaded successfully.
GPU layers: 50
Main GPU: 0
Context: 8192 tokens
Choosing Which GPU Is the Main GPU
The main GPU handles scratch memory and small tensors. Choose the GPU with the highest bandwidth or the most free memory:
// Automatically pick the GPU with the most free memory (default behavior)
var config1 = new LM.DeviceConfiguration();
// Or pick a specific GPU by device number
var config2 = new LM.DeviceConfiguration(GpuDeviceInfo.GetDeviceFromNumber(1));
// Or set it directly
var config3 = new LM.DeviceConfiguration { MainGpu = 1 };
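If you want to apply the "most free memory" rule yourself, for example to log which device was chosen, a minimal sketch using only the GpuDeviceInfo members shown earlier (DeviceNumber, DeviceName, FreeMemorySize) could look like this:
using System.Linq;
using LMKit.Hardware.Gpu;
using LMKit.Model;

// Pick the GPU that currently reports the most free VRAM and make it the main GPU.
GpuDeviceInfo best = GpuDeviceInfo.Devices.OrderByDescending(g => g.FreeMemorySize).First();
Console.WriteLine($"Selected GPU {best.DeviceNumber} ({best.DeviceName}) as the main GPU.");

var manualConfig = new LM.DeviceConfiguration
{
    MainGpu = best.DeviceNumber,
    GpuLayerCount = int.MaxValue
};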
Controlling GPU Layer Offloading
GpuLayerCount controls how many model layers are placed on the GPU(s); the remaining layers run on the CPU:
| Value | Behavior |
|---|---|
| `int.MaxValue` (default) | Offload all layers that fit in combined VRAM |
| `0` | Force CPU-only inference |
| `20` | Offload exactly 20 layers to GPU(s) |
// Partial offload: keep some layers on CPU to save VRAM for context
var config = new LM.DeviceConfiguration { GpuLayerCount = 28 };
When FavorDistributedInference is enabled, the offloaded layers are split across all available GPUs automatically.
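Putting the two settings together, here is a sketch that caps the offload and then verifies how many layers were actually loaded. It assumes the progress callbacks shown in Step 3 are optional and can be omitted, and the cap of 28 layers is only an illustrative value:
using LMKit.Global;
using LMKit.Model;

// Split the offloaded layers across all GPUs, but cap how many layers leave the CPU.
Configuration.FavorDistributedInference = true;
var partialConfig = new LM.DeviceConfiguration { GpuLayerCount = 28 }; // illustrative cap

using LM model = LM.LoadFromModelID("gemma3:27b", deviceConfiguration: partialConfig);

// Confirm how many layers actually landed in VRAM and which device is the main GPU.
Console.WriteLine($"GPU layers loaded: {model.GpuLayerCount}, main GPU: {model.MainGpu}");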
Performance Considerations
| Factor | Impact |
|---|---|
| GPU interconnect | NVLink gives the best multi-GPU performance; PCIe has higher transfer latency |
| Asymmetric GPUs | The smaller GPU can become a bottleneck. Reduce GpuLayerCount to avoid overloading it |
| Context size | Larger contexts consume more VRAM per GPU. Monitor free memory |
| Small models | Multi-GPU overhead outweighs benefits for models under 7B parameters |
Tip: For models that fit on a single GPU, keep FavorDistributedInference = false (the default) to avoid inter-GPU communication overhead.
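One way to apply this tip programmatically is to enable distribution only when no single GPU can hold the model. The sketch below assumes you already have a rough size estimate for the model; the 16 GB value is a placeholder to replace with your own figure:
using System.Linq;
using LMKit.Global;
using LMKit.Hardware.Gpu;

// Placeholder estimate of the VRAM the model needs, in GB (substitute your own figure).
double estimatedModelGb = 16.0;

// Free VRAM on the single GPU that currently has the most headroom.
double largestFreeGb = GpuDeviceInfo.Devices.Max(g => g.FreeMemorySize / (1024.0 * 1024 * 1024));

// Split across GPUs only when no single device can hold the model on its own.
Configuration.FavorDistributedInference = estimatedModelGb > largestFreeGb;
Console.WriteLine($"Distributed inference: {(Configuration.FavorDistributedInference ? "ENABLED" : "disabled")}");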
Checking Hardware Performance Score
Use DeviceConfiguration.GetPerformanceScore to evaluate whether a model is a good fit for your hardware:
using LMKit.Hardware;
using LMKit.Model;
// Check performance score for a model card (before downloading)
var card = ModelCard.GetPredefinedModelCards().First(c => c.ModelID == "gemma3:27b");
float score = DeviceConfiguration.GetPerformanceScore(card);
Console.WriteLine($"Performance score: {score:F2}");
// 1.0 = model fits entirely in GPU with room to spare
// 0.5 = model fits but VRAM is tight
// 0.0 = model cannot fit in available VRAM
You can also look up the model card directly by its ID and query the optimal context size for the current device configuration:
float score = DeviceConfiguration.GetPerformanceScore(ModelCard.GetPredefinedModelCardByModelID("gemma3:27b"));
int optimalCtx = DeviceConfiguration.GetOptimalContextSize();
Console.WriteLine($"Score: {score:F2}, Optimal context: {optimalCtx} tokens");
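How you act on these numbers is up to you. As one example, the sketch below skips a multi-gigabyte download when the score looks poor; the 0.3 threshold is arbitrary rather than an LM-Kit recommendation, and the load call assumes the optional parameters of LM.LoadFromModelID shown earlier can be omitted:
using System.Linq;
using LMKit.Hardware;
using LMKit.Model;

var card = ModelCard.GetPredefinedModelCards().First(c => c.ModelID == "gemma3:27b");
float score = DeviceConfiguration.GetPerformanceScore(card);

// Arbitrary threshold for illustration: skip the download when the hardware fit looks poor.
if (score < 0.3f)
{
    Console.WriteLine($"Score {score:F2} is low; consider a smaller model or tighter quantization.");
    return;
}

using LM model = LM.LoadFromModelID("gemma3:27b");
Console.WriteLine($"Loaded with {model.GpuLayerCount} GPU layers.");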
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| Only one GPU is used | `FavorDistributedInference` not enabled | Set `Configuration.FavorDistributedInference = true` before loading |
| Out of memory on load | Combined VRAM insufficient | Reduce `GpuLayerCount` to offload fewer layers, or use a smaller quantization |
| Slow token generation | PCIe bottleneck between GPUs | Consider using NVLink or reducing model size to fit one GPU |
| `GpuDeviceInfo.Devices` is empty | No GPU backend loaded | Ensure you are using the CUDA or Vulkan build of LM-Kit.NET |
Next Steps
- Configure GPU Backends and Optimize Performance: backend selection and GPU tuning.
- Quantize a Model for Edge Deployment: reduce model size to fit in fewer GPUs.
- Load a Model and Generate Your First Response: single-GPU model loading basics.