Distribute Large Models Across Multiple GPUs
When a model is too large for a single GPU's VRAM, you can split it across multiple GPUs so that each device handles a portion of the computation. LM-Kit.NET provides automatic multi-GPU distribution through global configuration and per-model device settings. This tutorial shows how to enumerate your GPUs, enable distributed inference, and control which GPU handles the primary workload.
Why Multi-GPU Distribution Matters
Two real-world problems that multi-GPU inference solves:
- Running large models that exceed single-GPU memory. A 27B-parameter model at Q4 quantization needs roughly 16 GB of VRAM. If your workstation has two 12 GB GPUs, splitting the model across both lets you run it without falling back to CPU layers (see the rough estimate after this list).
- Maximizing throughput on multi-GPU workstations. Data centers and workstations with multiple GPUs can parallelize layer computation, reducing time-to-first-token for large models.
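The ~16 GB figure above is a rule of thumb rather than an exact number. A back-of-the-envelope sketch, assuming roughly 4.5 bits per weight for Q4-class quantization and about 1 GB of headroom for the KV cache and scratch buffers:
// Rough VRAM estimate for a quantized model (rule of thumb, not an exact figure).
double parameters = 27e9;     // 27B parameters
double bitsPerWeight = 4.5;   // assumed average for Q4-class quantization
double overheadGb = 1.0;      // assumed headroom for KV cache and scratch buffers
double weightsGb = parameters * bitsPerWeight / 8 / 1e9;    // ≈ 15.2 GB
Console.WriteLine($"Estimated VRAM: {weightsGb + overheadGb:F1} GB");   // ≈ 16.2 GB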
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| GPUs | 2+ CUDA or Vulkan GPUs |
| VRAM | Combined VRAM must exceed model size |
| LM-Kit.NET backend | CUDA or Vulkan (CPU-only builds do not support multi-GPU) |
Step 1: Create the Project
dotnet new console -n MultiGpuDemo
cd MultiGpuDemo
dotnet add package LM-Kit.NET
Step 2: Understand the Multi-GPU Architecture
┌─────────────────┐      ┌─────────────────┐
│      GPU 0      │      │      GPU 1      │
│  Layers 0..15   │      │  Layers 16..31  │
│  (scratch GPU)  │      │                 │
└────────┬────────┘      └────────┬────────┘
         │                        │
         └───────────┬────────────┘
                      │
               ┌──────┴──────┐
               │  LM Engine  │
               │  (unified)  │
               └─────────────┘
| Component | Purpose |
|---|---|
| `GpuDeviceInfo.Devices` | Lists all available GPUs with memory information |
| `LM.DeviceConfiguration` | Per-model GPU settings (main GPU, layer count) |
| `Configuration.FavorDistributedInference` | Global toggle for automatic multi-GPU splitting |
| `LM.GpuLayerCount` | Read-only count of layers loaded into VRAM |
| `LM.MainGpu` | Read-only ID of the primary GPU for scratch operations |
Step 3: Write the Program
using System.Text;
using LMKit.Global;
using LMKit.Hardware.Gpu;
using LMKit.Model;
using LMKit.TextGeneration;
using LMKit.TextGeneration.Chat;
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// ──────────────────────────────────────
// 1. Enumerate available GPUs
// ──────────────────────────────────────
Console.WriteLine("Available GPU devices:\n");
foreach (GpuDeviceInfo gpu in GpuDeviceInfo.Devices)
{
double totalGb = gpu.TotalMemorySize / (1024.0 * 1024 * 1024);
double freeGb = gpu.FreeMemorySize / (1024.0 * 1024 * 1024);
Console.WriteLine($" GPU {gpu.DeviceNumber}: {gpu.DeviceName}");
Console.WriteLine($" Type: {gpu.DeviceType}");
Console.WriteLine($" VRAM: {freeGb:F1} GB free / {totalGb:F1} GB total");
Console.WriteLine();
}
if (GpuDeviceInfo.Devices.Count < 2)
{
Console.WriteLine("This demo requires at least 2 GPUs. Exiting.");
return;
}
// ──────────────────────────────────────
// 2. Enable distributed inference
// ──────────────────────────────────────
// When enabled, the runtime automatically distributes model layers
// across all available GPUs instead of loading onto a single device.
Configuration.FavorDistributedInference = true;
Console.WriteLine("Distributed inference: ENABLED\n");
// ──────────────────────────────────────
// 3. Configure the primary GPU
// ──────────────────────────────────────
// The main GPU handles scratch buffers and small tensors.
// By default, LM-Kit selects the GPU with the most free memory.
// You can override this if you want a specific device.
var deviceConfig = new LM.DeviceConfiguration
{
MainGpu = GpuDeviceInfo.Devices[0].DeviceNumber,
GpuLayerCount = int.MaxValue // offload as many layers as VRAM allows
};
Console.WriteLine($"Main GPU set to device {deviceConfig.MainGpu}\n");
// ──────────────────────────────────────
// 4. Load a large model with multi-GPU distribution
// ──────────────────────────────────────
Console.WriteLine("Loading model (distributed across GPUs)...");
using LM model = LM.LoadFromModelID("gemma3:27b",
deviceConfiguration: deviceConfig,
downloadingProgress: (path, contentLength, bytesRead) =>
{
if (contentLength.HasValue)
Console.Write($"\r Downloading: {(double)bytesRead / contentLength.Value * 100:F1}% ");
return true;
},
loadingProgress: progress =>
{
Console.Write($"\r Loading: {progress * 100:F0}% ");
return true;
});
Console.WriteLine($"\n Model loaded successfully.");
Console.WriteLine($" GPU layers: {model.GpuLayerCount}");
Console.WriteLine($" Main GPU: {model.MainGpu}");
Console.WriteLine($" Context: {model.ContextLength} tokens\n");
// ──────────────────────────────────────
// 5. Run inference
// ──────────────────────────────────────
var chat = new SingleTurnConversation(model)
{
SystemPrompt = "You are a helpful assistant. Keep answers concise.",
MaximumCompletionTokens = 256
};
Console.WriteLine("Multi-GPU inference ready. Type a prompt (or 'quit' to exit):\n");
while (true)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.Write("You: ");
Console.ResetColor();
string? input = Console.ReadLine();
if (string.IsNullOrWhiteSpace(input) || input.Equals("quit", StringComparison.OrdinalIgnoreCase))
break;
chat.AfterTextCompletion += OnTextGenerated;
Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write("Assistant: ");
Console.ResetColor();
var result = chat.Submit(input);
Console.WriteLine($"\n [{result.GeneratedTokenCount} tokens, {result.TokenGenerationRate:F1} tok/s]\n");
chat.AfterTextCompletion -= OnTextGenerated;
}
static void OnTextGenerated(object? sender, LMKit.TextGeneration.Events.AfterTextCompletionEventArgs e)
{
if (e.SegmentType == TextSegmentType.UserVisible)
Console.Write(e.Text);
}
Step 4: Run the Demo
dotnet run
Expected output on a dual-GPU system:
Available GPU devices:
GPU 0: NVIDIA RTX 4090
Type: Cuda
VRAM: 22.1 GB free / 24.0 GB total
GPU 1: NVIDIA RTX 3090
Type: Cuda
VRAM: 22.8 GB free / 24.0 GB total
Distributed inference: ENABLED
Main GPU set to device 0
Loading model (distributed across GPUs)...
Loading: 100%
Model loaded successfully.
GPU layers: 50
Main GPU: 0
Context: 8192 tokens
Choosing Which GPU Is the Main GPU
The main GPU handles scratch memory and small tensors. Choose the GPU with the highest bandwidth or the most free memory:
// Automatically pick the GPU with the most free memory (default behavior)
var config1 = new LM.DeviceConfiguration();
// Or pick a specific GPU by device number
var config2 = new LM.DeviceConfiguration(GpuDeviceInfo.GetDeviceFromNumber(1));
// Or set it directly
var config3 = new LM.DeviceConfiguration { MainGpu = 1 };
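If you want to apply the "most free memory" rule yourself, for example to log which device was chosen, a minimal sketch using only the GpuDeviceInfo members shown earlier (DeviceNumber, DeviceName, FreeMemorySize) could look like this:
using System.Linq;
using LMKit.Hardware.Gpu;
using LMKit.Model;

// Pick the GPU that currently reports the most free VRAM and make it the main GPU.
GpuDeviceInfo best = GpuDeviceInfo.Devices.OrderByDescending(g => g.FreeMemorySize).First();
Console.WriteLine($"Selected GPU {best.DeviceNumber} ({best.DeviceName}) as the main GPU.");

var manualConfig = new LM.DeviceConfiguration
{
    MainGpu = best.DeviceNumber,
    GpuLayerCount = int.MaxValue
};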
Controlling GPU Layer Offloading
GpuLayerCount controls how many model layers are placed on the GPU(s); the remaining layers run on the CPU:
| Value | Behavior |
|---|---|
| `int.MaxValue` (default) | Offload all layers that fit in combined VRAM |
| `0` | Force CPU-only inference |
| `20` | Offload exactly 20 layers to GPU(s) |
// Partial offload: keep some layers on CPU to save VRAM for context
var config = new LM.DeviceConfiguration { GpuLayerCount = 28 };
When FavorDistributedInference is enabled, the offloaded layers are split across all available GPUs automatically.
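Putting the two settings together, here is a sketch that caps the offload and then verifies how many layers were actually loaded. It assumes the progress callbacks shown in Step 3 are optional and can be omitted, and the cap of 28 layers is only an illustrative value:
using LMKit.Global;
using LMKit.Model;

// Split the offloaded layers across all GPUs, but cap how many layers leave the CPU.
Configuration.FavorDistributedInference = true;
var partialConfig = new LM.DeviceConfiguration { GpuLayerCount = 28 }; // illustrative cap

using LM model = LM.LoadFromModelID("gemma3:27b", deviceConfiguration: partialConfig);

// Confirm how many layers actually landed in VRAM and which device is the main GPU.
Console.WriteLine($"GPU layers loaded: {model.GpuLayerCount}, main GPU: {model.MainGpu}");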
Performance Considerations
| Factor | Impact |
|---|---|
| GPU interconnect | NVLink gives the best multi-GPU performance; PCIe has higher transfer latency |
| Asymmetric GPUs | The smaller GPU can become a bottleneck. Reduce GpuLayerCount to avoid overloading it |
| Context size | Larger contexts consume more VRAM per GPU. Monitor free memory |
| Small models | Multi-GPU overhead outweighs benefits for models under 7B parameters |
Tip: For models that fit on a single GPU, keep FavorDistributedInference = false (the default) to avoid inter-GPU communication overhead.
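One way to apply this tip programmatically is to enable distribution only when no single GPU can hold the model. The sketch below assumes you already have a rough size estimate for the model; the 16 GB value is a placeholder to replace with your own figure:
using System.Linq;
using LMKit.Global;
using LMKit.Hardware.Gpu;

// Placeholder estimate of the VRAM the model needs, in GB (substitute your own figure).
double estimatedModelGb = 16.0;

// Free VRAM on the single GPU that currently has the most headroom.
double largestFreeGb = GpuDeviceInfo.Devices.Max(g => g.FreeMemorySize / (1024.0 * 1024 * 1024));

// Split across GPUs only when no single device can hold the model on its own.
Configuration.FavorDistributedInference = estimatedModelGb > largestFreeGb;
Console.WriteLine($"Distributed inference: {(Configuration.FavorDistributedInference ? "ENABLED" : "disabled")}");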
Checking Hardware Performance Score
Use DeviceConfiguration.GetPerformanceScore to evaluate whether a model is a good fit for your hardware:
using LMKit.Hardware;
using LMKit.Model;
// Check performance score for a model card (before downloading)
var card = ModelCard.GetPredefinedModelCards().First(c => c.ModelID == "gemma3:27b");
float score = DeviceConfiguration.GetPerformanceScore(card);
Console.WriteLine($"Performance score: {score:F2}");
// 1.0 = model fits entirely in GPU with room to spare
// 0.5 = model fits but VRAM is tight
// 0.0 = model cannot fit in available VRAM
You can also look up the model card directly by its ID and query the optimal context size for the current device configuration:
float score = DeviceConfiguration.GetPerformanceScore(ModelCard.GetPredefinedModelCardByModelID("gemma3:27b"));
int optimalCtx = DeviceConfiguration.GetOptimalContextSize();
Console.WriteLine($"Score: {score:F2}, Optimal context: {optimalCtx} tokens");
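How you act on these numbers is up to you. As one example, the sketch below skips a multi-gigabyte download when the score looks poor; the 0.3 threshold is arbitrary rather than an LM-Kit recommendation, and the load call assumes the optional parameters of LM.LoadFromModelID shown earlier can be omitted:
using System.Linq;
using LMKit.Hardware;
using LMKit.Model;

var card = ModelCard.GetPredefinedModelCards().First(c => c.ModelID == "gemma3:27b");
float score = DeviceConfiguration.GetPerformanceScore(card);

// Arbitrary threshold for illustration: skip the download when the hardware fit looks poor.
if (score < 0.3f)
{
    Console.WriteLine($"Score {score:F2} is low; consider a smaller model or tighter quantization.");
    return;
}

using LM model = LM.LoadFromModelID("gemma3:27b");
Console.WriteLine($"Loaded with {model.GpuLayerCount} GPU layers.");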
Common Issues
| Problem | Cause | Fix |
|---|---|---|
| Only one GPU is used | `FavorDistributedInference` not enabled | Set `Configuration.FavorDistributedInference = true` before loading |
| Out of memory on load | Combined VRAM insufficient | Reduce `GpuLayerCount` to offload fewer layers, or use a smaller quantization |
| Slow token generation | PCIe bottleneck between GPUs | Consider using NVLink or reducing model size to fit one GPU |
| `GpuDeviceInfo.Devices` is empty | No GPU backend loaded | Ensure you are using the CUDA or Vulkan build of LM-Kit.NET |
Next Steps
- Configure GPU Backends and Optimize Performance: backend selection and GPU tuning.
- Quantize a Model for Edge Deployment: reduce model size to fit in fewer GPUs.
- Load a Model and Generate Your First Response: single-GPU model loading basics.