Distribute Large Models Across Multiple GPUs

When a model is too large for a single GPU's VRAM, you can split it across multiple GPUs so that each device handles a portion of the computation. LM-Kit.NET provides automatic multi-GPU distribution through global configuration and per-model device settings. This tutorial shows how to enumerate your GPUs, enable distributed inference, and control which GPU handles the primary workload.


Why Multi-GPU Distribution Matters

Two real-world problems that multi-GPU inference solves:

  1. Running large models that exceed single-GPU memory. A 27B parameter model at Q4 quantization requires ~16 GB of VRAM. If your workstation has two 12 GB GPUs, splitting the model across both lets you run it without falling back to CPU layers.
  2. Maximizing throughput on multi-GPU workstations. Data centers and workstations with multiple GPUs can parallelize layer computation, reducing time-to-first-token for large models.

Prerequisites

Requirement | Minimum
.NET SDK | 8.0+
GPUs | 2+ CUDA or Vulkan GPUs
VRAM | Combined VRAM must exceed the model size
LM-Kit.NET backend | CUDA or Vulkan (CPU-only builds do not support multi-GPU)

Step 1: Create the Project

dotnet new console -n MultiGpuDemo
cd MultiGpuDemo
dotnet add package LM-Kit.NET

Step 2: Understand the Multi-GPU Architecture

┌─────────────────┐     ┌─────────────────┐
│    GPU 0        │     │    GPU 1        │
│  Layers 0..15   │     │  Layers 16..31  │
│  (scratch GPU)  │     │                 │
└────────┬────────┘     └────────┬────────┘
         │                       │
         └───────────┬───────────┘
                     │
              ┌──────┴──────┐
              │  LM Engine  │
              │  (unified)  │
              └─────────────┘
Component | Purpose
GpuDeviceInfo.Devices | Lists all available GPUs with memory information
LM.DeviceConfiguration | Per-model GPU settings (main GPU, layer count)
Configuration.FavorDistributedInference | Global toggle for automatic multi-GPU splitting
LM.GpuLayerCount | Read-only count of layers loaded into VRAM
LM.MainGpu | Read-only ID of the primary GPU for scratch operations
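
Before loading anything, you can sanity-check that the combined free VRAM across your GPUs roughly covers the model you intend to run. The sketch below is illustrative only: the ~16 GB figure is the approximate Q4 footprint of a 27B model mentioned earlier, not a value returned by the API.

// Rough footprint of the model you plan to load (~16 GB for a 27B Q4 model, plus context overhead).
double requiredGb = 16.0;

// Sum the free VRAM reported by every detected GPU.
double combinedFreeGb = GpuDeviceInfo.Devices.Sum(g => (double)g.FreeMemorySize) / (1024.0 * 1024 * 1024);

Console.WriteLine($"Combined free VRAM: {combinedFreeGb:F1} GB (need roughly {requiredGb:F1} GB)");

if (combinedFreeGb < requiredGb)
    Console.WriteLine("Warning: some layers are likely to stay on the CPU.");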

Step 3: Write the Program

using System.Text;
using LMKit.Global;
using LMKit.Hardware.Gpu;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.TextGeneration.Chat;

LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// ──────────────────────────────────────
// 1. Enumerate available GPUs
// ──────────────────────────────────────
Console.WriteLine("Available GPU devices:\n");

foreach (GpuDeviceInfo gpu in GpuDeviceInfo.Devices)
{
    double totalGb = gpu.TotalMemorySize / (1024.0 * 1024 * 1024);
    double freeGb = gpu.FreeMemorySize / (1024.0 * 1024 * 1024);

    Console.WriteLine($"  GPU {gpu.DeviceNumber}: {gpu.DeviceName}");
    Console.WriteLine($"    Type:   {gpu.DeviceType}");
    Console.WriteLine($"    VRAM:   {freeGb:F1} GB free / {totalGb:F1} GB total");
    Console.WriteLine();
}

if (GpuDeviceInfo.Devices.Count < 2)
{
    Console.WriteLine("This demo requires at least 2 GPUs. Exiting.");
    return;
}

// ──────────────────────────────────────
// 2. Enable distributed inference
// ──────────────────────────────────────
// When enabled, the runtime automatically distributes model layers
// across all available GPUs instead of loading onto a single device.
Configuration.FavorDistributedInference = true;

Console.WriteLine("Distributed inference: ENABLED\n");

// ──────────────────────────────────────
// 3. Configure the primary GPU
// ──────────────────────────────────────
// The main GPU handles scratch buffers and small tensors.
// By default, LM-Kit selects the GPU with the most free memory.
// You can override this if you want a specific device.
var deviceConfig = new LM.DeviceConfiguration
{
    MainGpu = GpuDeviceInfo.Devices[0].DeviceNumber,
    GpuLayerCount = int.MaxValue  // offload as many layers as VRAM allows
};

Console.WriteLine($"Main GPU set to device {deviceConfig.MainGpu}\n");

// ──────────────────────────────────────
// 4. Load a large model with multi-GPU distribution
// ──────────────────────────────────────
Console.WriteLine("Loading model (distributed across GPUs)...");

using LM model = LM.LoadFromModelID("gemma3:27b",
    deviceConfiguration: deviceConfig,
    downloadingProgress: (path, contentLength, bytesRead) =>
    {
        if (contentLength.HasValue)
            Console.Write($"\r  Downloading: {(double)bytesRead / contentLength.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: progress =>
    {
        Console.Write($"\r  Loading: {progress * 100:F0}%   ");
        return true;
    });

Console.WriteLine($"\n  Model loaded successfully.");
Console.WriteLine($"  GPU layers: {model.GpuLayerCount}");
Console.WriteLine($"  Main GPU:   {model.MainGpu}");
Console.WriteLine($"  Context:    {model.ContextLength} tokens\n");

// ──────────────────────────────────────
// 5. Run inference
// ──────────────────────────────────────
var chat = new SingleTurnConversation(model)
{
    SystemPrompt = "You are a helpful assistant. Keep answers concise.",
    MaximumCompletionTokens = 256
};

Console.WriteLine("Multi-GPU inference ready. Type a prompt (or 'quit' to exit):\n");

while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("You: ");
    Console.ResetColor();

    string? input = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(input) || input.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;

    chat.AfterTextCompletion += OnTextGenerated;

    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.Write("Assistant: ");
    Console.ResetColor();

    var result = chat.Submit(input);

    Console.WriteLine($"\n  [{result.GeneratedTokenCount} tokens, {result.TokenGenerationRate:F1} tok/s]\n");

    chat.AfterTextCompletion -= OnTextGenerated;
}

static void OnTextGenerated(object? sender, LMKit.TextGeneration.Events.AfterTextCompletionEventArgs e)
{
    if (e.SegmentType == TextSegmentType.UserVisible)
        Console.Write(e.Text);
}

Step 4: Run the Demo

dotnet run

Expected output on a dual-GPU system:

Available GPU devices:

  GPU 0: NVIDIA RTX 4090
    Type:   Cuda
    VRAM:   22.1 GB free / 24.0 GB total

  GPU 1: NVIDIA RTX 3090
    Type:   Cuda
    VRAM:   22.8 GB free / 24.0 GB total

Distributed inference: ENABLED

Main GPU set to device 0

Loading model (distributed across GPUs)...
  Loading: 100%
  Model loaded successfully.
  GPU layers: 50
  Main GPU:   0
  Context:    8192 tokens

Choosing Which GPU is the Main GPU

The main GPU handles scratch memory and small tensors. Choose the GPU with the highest bandwidth or the most free memory:

// Automatically pick the GPU with the most free memory (default behavior)
var config1 = new LM.DeviceConfiguration();

// Or pick a specific GPU by device number
var config2 = new LM.DeviceConfiguration(GpuDeviceInfo.GetDeviceFromNumber(1));

// Or set it directly
var config3 = new LM.DeviceConfiguration { MainGpu = 1 };
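
To make the default selection explicit, you can also pick the device with the most free memory yourself. This mirrors what LM-Kit already does by default; the sketch below only shows how to do it manually with the GpuDeviceInfo properties used earlier.

// Choose the GPU reporting the most free VRAM as the main device
GpuDeviceInfo bestGpu = GpuDeviceInfo.Devices
    .OrderByDescending(g => (double)g.FreeMemorySize)
    .First();

var config4 = new LM.DeviceConfiguration { MainGpu = bestGpu.DeviceNumber };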

Controlling GPU Layer Offloading

GpuLayerCount controls how many model layers are placed on the GPU(s). The rest fall back to CPU:

Value | Behavior
int.MaxValue (default) | Offload all layers that fit in combined VRAM
0 | Force CPU-only inference
20 | Offload exactly 20 layers to the GPU(s)

// Partial offload: keep some layers on CPU to save VRAM for context
var config = new LM.DeviceConfiguration { GpuLayerCount = 28 };

When FavorDistributedInference is enabled, the offloaded layers are split across all available GPUs automatically.
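
If a full offload still runs out of memory (see Common Issues below), one option is to retry the load with progressively fewer GPU layers. This is a sketch of that pattern, assuming the failed load surfaces as an exception; the specific layer counts are arbitrary illustrative values.

// Try a full offload first, then fall back to smaller GPU layer counts.
LM? model = null;

foreach (int layerCount in new[] { int.MaxValue, 40, 28 })
{
    try
    {
        var cfg = new LM.DeviceConfiguration { GpuLayerCount = layerCount };
        model = LM.LoadFromModelID("gemma3:27b", deviceConfiguration: cfg);
        Console.WriteLine($"Loaded with GpuLayerCount = {layerCount}");
        break;
    }
    catch (Exception ex)
    {
        Console.WriteLine($"Load failed with {layerCount} GPU layers: {ex.Message}");
    }
}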


Performance Considerations

Factor | Impact
GPU interconnect | NVLink gives the best multi-GPU performance; PCIe has higher transfer latency
Asymmetric GPUs | The smaller GPU may bottleneck; reduce GpuLayerCount to avoid overloading it
Context size | Larger contexts consume more VRAM per GPU; monitor free memory
Small models | Multi-GPU overhead outweighs the benefits for models under 7B parameters

Tip: For models that fit on a single GPU, keep FavorDistributedInference = false (the default) to avoid inter-GPU communication overhead.
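
One way to apply that tip programmatically is to enable distribution only when no single GPU can hold the whole model. A minimal sketch; approxModelBytes is a hypothetical estimate you supply yourself (for example, the GGUF file size plus some context overhead).

// Hypothetical memory estimate for the model you plan to load.
long approxModelBytes = 16L * 1024 * 1024 * 1024;  // ~16 GB

// Free VRAM on the single largest device.
double largestFreeBytes = GpuDeviceInfo.Devices.Max(g => (double)g.FreeMemorySize);

// Split across GPUs only when the model cannot fit on one device.
Configuration.FavorDistributedInference = approxModelBytes > largestFreeBytes;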


Checking Hardware Performance Score

Use DeviceConfiguration.GetPerformanceScore to evaluate whether a model is a good fit for your hardware:

using LMKit.Hardware;
using LMKit.Model;

// Check performance score for a model card (before downloading)
var card = ModelCard.GetPredefinedModelCards().First(c => c.ModelID == "gemma3:27b");
float score = DeviceConfiguration.GetPerformanceScore(card);
Console.WriteLine($"Performance score: {score:F2}");
// 1.0 = model fits entirely in GPU with room to spare
// 0.5 = model fits but VRAM is tight
// 0.0 = model cannot fit in available VRAM

You can also query the optimal context size for the current hardware:

// Check the performance score and the optimal context size for the current hardware
float score = DeviceConfiguration.GetPerformanceScore(ModelCard.GetPredefinedModelCardByModelID("gemma3:27b"));
int optimalCtx = DeviceConfiguration.GetOptimalContextSize();
Console.WriteLine($"Score: {score:F2}, Optimal context: {optimalCtx} tokens");

Common Issues

Problem | Cause | Fix
Only one GPU is used | FavorDistributedInference not enabled | Set Configuration.FavorDistributedInference = true before loading
Out of memory on load | Combined VRAM insufficient | Reduce GpuLayerCount to offload fewer layers, or use a smaller quantization
Slow token generation | PCIe bottleneck between GPUs | Consider using NVLink or reducing the model size to fit one GPU
GpuDeviceInfo.Devices is empty | No GPU backend loaded | Ensure you are using the CUDA or Vulkan build of LM-Kit.NET
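
A quick preflight check covering the first and last issues, using only the calls shown earlier in this tutorial, might look like this.

// Verify that a GPU backend is active and decide whether to split across devices.
if (GpuDeviceInfo.Devices.Count == 0)
{
    Console.WriteLine("No GPU detected. Check that the CUDA or Vulkan build of LM-Kit.NET is referenced.");
    return;
}

// Must be set before the model is loaded, otherwise only one GPU is used.
Configuration.FavorDistributedInference = GpuDeviceInfo.Devices.Count > 1;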

Next Steps