Understanding Distributed Inference in LM-Kit.NET


TL;DR

Distributed inference is the technique of splitting a large language model across multiple GPUs so that the combined memory and compute power of all devices is used for a single inference pass. This allows running models that exceed the memory of any single GPU. In LM-Kit.NET, the LM.TensorDistribution class controls how model layers (or rows) are distributed across devices, LM.DeviceConfiguration configures GPU selection and offloading, and the Configuration.FavorDistributedInference global flag enables automatic multi-GPU spreading. Combined with the GpuDeviceInfo and DeviceConfiguration utility classes, LM-Kit.NET provides full control over GPU-aware model loading for production deployments.


What is Distributed Inference?

Definition: Distributed inference refers to running a single model across multiple processing devices (typically GPUs) during the inference phase. Unlike distributed training, which splits data batches across devices to learn weights faster, distributed inference splits the model itself so that each device holds a portion of the model's layers or rows and processes its share of the computation for every token generated.

Why Distributed Inference Matters

  1. Run Larger Models: A 27B-parameter model in Q4 quantization may require 16+ GB of VRAM. If your best GPU has only 12 GB, distributing the model across two GPUs lets you run a model that would otherwise not fit.
  2. Improve Throughput: Splitting model layers across GPUs can reduce per-token latency when the bottleneck is memory bandwidth rather than computation.
  3. Maximize Hardware Investment: Use all available GPUs rather than leaving secondary cards idle.
  4. Production Scalability: Deploy on multi-GPU workstations or servers with mixed GPU configurations (e.g., one high-end and one mid-range card).
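The arithmetic behind point 1 can be sketched in a few lines. This is a back-of-the-envelope estimate only; the ~4.8 bits/parameter effective rate for Q4_K_M-style quantization is an assumption, and real usage also needs KV-cache and scratch buffers on top of the weights:

```csharp
using System;

// Rough weight-size estimate: parameters x bits-per-parameter / 8.
// The 4.8 bits/param rate is an assumed effective rate for Q4_K_M-style
// quantization; it is not an exact figure.
double parameters = 27e9;   // 27B parameters
double bitsPerParam = 4.8;
double weightGiB = parameters * bitsPerParam / 8 / (1024.0 * 1024 * 1024);
Console.WriteLine($"Weights alone: ~{weightGiB:F1} GiB"); // ~15.1 GiB: too big for one 12 GB GPU
```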

Distribution Strategies

Layer Splitting (Default)

The model's transformer layers are partitioned across GPUs sequentially. GPU 0 runs layers 0 through N, GPU 1 runs layers N+1 through M, and so on. Each token passes through GPUs in sequence, with intermediate activations transferred between devices.

Model: 32 transformer layers
GPU 0 (12 GB): Layers 0-15   ──activations──>  GPU 1 (12 GB): Layers 16-31

Token flow:
  Input --> [GPU 0: layers 0-15] --> transfer --> [GPU 1: layers 16-31] --> Output

Characteristics:

  • Simple to configure (just set proportions per GPU)
  • Sequential dependency: GPUs cannot work in parallel on the same token
  • Inter-GPU transfer overhead is proportional to the hidden dimension size
  • Best when GPUs are connected via high-bandwidth links (NVLink, PCIe Gen4/Gen5)
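The proportion-to-layer-range mapping can be illustrated with a small sketch. This is illustrative math only, not the SDK's internal partitioning code; the function name and rounding behavior are assumptions:

```csharp
using System;
using System.Linq;

// Sketch: map per-GPU proportions to contiguous layer ranges, the way a
// layer-split distribution partitions a model. Illustrative only.
static (int start, int end)[] PartitionLayers(int layerCount, float[] splits)
{
    var ranges = new (int, int)[splits.Length];
    float total = splits.Sum();
    int next = 0;
    float acc = 0f;
    for (int i = 0; i < splits.Length; i++)
    {
        acc += splits[i];
        // Last GPU takes whatever remains so every layer is assigned exactly once.
        int end = (i == splits.Length - 1) ? layerCount : (int)Math.Round(layerCount * acc / total);
        ranges[i] = (next, end - 1); // inclusive layer range for GPU i
        next = end;
    }
    return ranges;
}

// 32 layers split 60/40: GPU 0 gets layers 0-18, GPU 1 gets layers 19-31
var ranges = PartitionLayers(32, new[] { 0.6f, 0.4f });
```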

Row Splitting

Each transformer layer is split by rows across GPUs, so every GPU participates in every layer. This is closer to tensor parallelism and enables more balanced compute distribution.

Model: 32 transformer layers, each split across 2 GPUs

Every layer:
  GPU 0: rows 0-N/2    ──sync──>  Combined result
  GPU 1: rows N/2-N    ──sync──>

Characteristics:

  • Higher inter-GPU communication (synchronization at every layer boundary)
  • More balanced compute distribution across GPUs
  • Better suited for GPUs with similar capabilities
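The row-split idea itself is just a partitioned matrix-vector product: each device computes the output rows it owns for y = W·x, and the per-layer synchronization step concatenates the partial results. A minimal CPU sketch of that arithmetic (the two calls stand in for two GPUs):

```csharp
using System;
using System.Linq;

// Each "GPU" computes only the output rows it owns for y = W * x.
static double[] MatVecRows(double[,] W, double[] x, int rowStart, int rowEnd)
{
    var y = new double[rowEnd - rowStart];
    for (int r = rowStart; r < rowEnd; r++)
        for (int c = 0; c < x.Length; c++)
            y[r - rowStart] += W[r, c] * x[c];
    return y;
}

var W = new double[4, 2] { { 1, 0 }, { 0, 1 }, { 1, 1 }, { 2, 0 } };
var x = new double[] { 3, 4 };
var top    = MatVecRows(W, x, 0, 2); // "GPU 0": rows 0-1
var bottom = MatVecRows(W, x, 2, 4); // "GPU 1": rows 2-3
var y = top.Concat(bottom).ToArray(); // the sync step: combined result { 3, 4, 7, 6 }
```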

Practical Application in LM-Kit.NET SDK

LM.DeviceConfiguration

The LM.DeviceConfiguration class is passed to model constructors and LM.LoadFromModelID() to control GPU behavior:

  • MainGpu: The primary GPU for scratch operations and small tensors. Auto-selected using GpuDeviceInfo.GetBestGpuDeviceId() by default.
  • GpuLayerCount: How many model layers to offload to GPU. Default: int.MaxValue (all layers on GPU). Set to 0 for CPU-only inference.
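A minimal sketch of these two members for partial offload, when the model slightly exceeds VRAM (the layer count here is illustrative; tune it to your GPU's free memory):

```csharp
using LMKit.Model;

var config = new LM.DeviceConfiguration
{
    MainGpu = 0,        // primary GPU for scratch operations and small tensors
    GpuLayerCount = 24  // put 24 layers on the GPU, run the remainder on CPU
};
// GpuLayerCount = 0 would force CPU-only inference.
```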

LM.TensorDistribution

The LM.TensorDistribution class defines how model layers or rows map to multiple GPUs:

  • Constructor: TensorDistribution(IEnumerable<float> splits, bool rowMode = false). Each splits[i] is the proportion of the model to place on GPU i.
  • RowMode: When false (default), the model is split by layers. When true, each layer is split by rows (tensor parallelism).
  • Length: Number of GPU slots (equals the maximum device count from the backend).
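One plausible way to choose the proportions is to make them track each GPU's free VRAM, using the GpuDeviceInfo class described below. This is an illustrative sketch, not a prescribed pattern:

```csharp
using System.Linq;
using LMKit.Hardware.Gpu;
using LMKit.Model;

// Derive split proportions from each GPU's free VRAM, then build a
// layer-split TensorDistribution from them.
var free = GpuDeviceInfo.Devices.Select(d => (float)d.FreeMemorySize).ToArray();
float total = free.Sum();
var splits = free.Select(f => f / total); // e.g., 24 GB + 8 GB free -> { 0.75, 0.25 }
var distribution = new LM.TensorDistribution(splits, rowMode: false);
```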

Configuration.FavorDistributedInference

A global static flag (LMKit.Global.Configuration.FavorDistributedInference). When set to true, the default tensor distribution automatically spreads the model across all available GPUs. When false (default), inference runs on a single GPU, avoiding inter-GPU communication overhead.

GpuDeviceInfo

The GpuDeviceInfo class in LMKit.Hardware.Gpu provides runtime GPU discovery:

  • Devices: All detected GPU devices on the system.
  • DeviceName / DeviceDescription: Human-readable GPU identification.
  • TotalMemorySize / FreeMemorySize: Memory capacity and current availability.
  • DeviceType: The GPU backend type (CUDA, Vulkan, Metal).

DeviceConfiguration (Static Utility)

The DeviceConfiguration static class in LMKit.Hardware provides GPU-aware recommendations:

  • GetOptimalContextSize(LM model): Recommends a context window size based on the GPU memory available to the given model.
  • GetPerformanceScore(LM model): Returns a score from 0 to 1 indicating how well the model fits the available GPU resources.

Code Example

Automatic Multi-GPU Distribution

using LMKit.Model;
using LMKit.Global;

// Enable automatic multi-GPU distribution
Configuration.FavorDistributedInference = true;

// Load a large model; it will spread across all available GPUs
var model = LM.LoadFromModelID("gemma3:27b");

Manual Layer-Split Distribution

using LMKit.Model;
using LMKit.Hardware.Gpu;

// Inspect available GPUs
foreach (var gpu in GpuDeviceInfo.Devices)
{
    Console.WriteLine($"GPU {gpu.DeviceNumber}: {gpu.DeviceName}");
    Console.WriteLine($"  VRAM: {gpu.TotalMemorySize / (1024 * 1024 * 1024.0):F1} GB");
    Console.WriteLine($"  Free: {gpu.FreeMemorySize / (1024 * 1024 * 1024.0):F1} GB");
}

// Define custom distribution: 60% on GPU 0, 40% on GPU 1
var distribution = new LM.TensorDistribution(
    new float[] { 0.6f, 0.4f },
    rowMode: false // Layer splitting
);

var deviceConfig = new LM.DeviceConfiguration
{
    MainGpu = 0,
    // Assumption: the device configuration carries the tensor distribution;
    // verify the exact property name against the DeviceConfiguration reference.
    TensorDistribution = distribution
};

// Load model with the explicit distribution
var model = new LM(
    new Uri("https://huggingface.co/lm-kit/gemma-3-27b-it-lmk/resolve/main/gemma-3-27b-it-Q4_K_M.lmk"),
    deviceConfiguration: deviceConfig
);

Row-Split (Tensor Parallelism) Distribution

using LMKit.Model;

// Split each layer's rows across two equal GPUs
var distribution = new LM.TensorDistribution(
    new float[] { 0.5f, 0.5f },
    rowMode: true // Row splitting (tensor parallelism)
);

var deviceConfig = new LM.DeviceConfiguration
{
    MainGpu = 0,
    TensorDistribution = distribution // assumed property; check the DeviceConfiguration reference
};

var modelUri = new Uri("https://huggingface.co/lm-kit/gemma-3-27b-it-lmk/resolve/main/gemma-3-27b-it-Q4_K_M.lmk");
var model = new LM(modelUri, deviceConfiguration: deviceConfig);

GPU-Aware Context Size Selection

using LMKit.Model;
using LMKit.Hardware;
using LMKit.TextGeneration;

var model = LM.LoadFromModelID("gemma3:12b");

// Get optimal context size based on available GPU memory
int contextSize = DeviceConfiguration.GetOptimalContextSize(model);
Console.WriteLine($"Optimal context size: {contextSize} tokens");

// Check how well the model fits the GPU
float score = DeviceConfiguration.GetPerformanceScore(model);
Console.WriteLine($"Performance score: {score:P0}");

// Use the recommended context size
using var chat = new MultiTurnConversation(model)
{
    MaximumContextLength = contextSize
};

When to Use Distributed Inference

Scenario                                            Recommendation
--------------------------------------------------  -----------------------------------------------------
Model fits in a single GPU                          Use single GPU (no distribution overhead)
Model exceeds single GPU by a small margin          Offload some layers to CPU (GpuLayerCount)
Model significantly exceeds single GPU              Distribute across multiple GPUs (layer split)
Two GPUs with similar specs and fast interconnect   Consider row splitting for balanced compute
Mixed GPU sizes (e.g., 24 GB + 8 GB)                Layer split with proportional allocation (0.75, 0.25)
CPU-only deployment                                 Set GpuLayerCount = 0
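The decision logic above can be read as a rough helper function. The 1.2× "small margin" threshold and the returned strings are illustrative assumptions, not SDK behavior:

```csharp
using System;
using System.Linq;

// Rough decision helper mirroring the scenario table. Thresholds are assumed.
static string Recommend(double modelGiB, double[] gpuGiB)
{
    double best = gpuGiB.Max();
    if (modelGiB <= best) return "Single GPU (no distribution overhead)";
    if (modelGiB <= best * 1.2 || gpuGiB.Length == 1) return "Offload some layers to CPU (GpuLayerCount)";
    return "Distribute across multiple GPUs (layer split)";
}

Console.WriteLine(Recommend(16.0, new[] { 12.0, 12.0 })); // layer split
Console.WriteLine(Recommend(13.0, new[] { 12.0 }));       // CPU offload
```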

Performance Considerations

  • Inter-GPU bandwidth: Layer splitting transfers activations between GPUs at each split point. PCIe Gen4 x16 provides ~32 GB/s; NVLink provides ~600 GB/s. High bandwidth reduces the overhead significantly.
  • Memory overhead: Each GPU needs memory for its portion of the model plus KV-cache and scratch buffers.
  • Diminishing returns: Each additional GPU adds communication overhead. A two-GPU setup is the most common and effective configuration.
  • CPU fallback: If only one GPU is available and the model does not fit, use GpuLayerCount to run some layers on CPU. This is slower but avoids out-of-memory errors.
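The bandwidth point can be made concrete with a back-of-the-envelope estimate of the per-token activation transfer at one split point. The hidden size and fp16 activation width below are illustrative assumptions:

```csharp
using System;

// Per-token transfer cost at one layer-split boundary. Hidden size and
// fp16 activation width are assumed example values.
double hiddenDim = 5376;
double bytesPerToken = hiddenDim * 2;   // fp16 = 2 bytes per activation value
double pcieGen4 = 32e9;                 // ~32 GB/s (PCIe Gen4 x16)
double micros = bytesPerToken / pcieGen4 * 1e6;
Console.WriteLine($"~{micros:F2} us per token per split point"); // tiny next to typical per-token compute
```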

Key Terms

  • Distributed Inference: Running a single model across multiple GPUs during the inference phase.
  • Layer Splitting: Partitioning the model by transformer layers across GPUs, with sequential token flow.
  • Row Splitting (Tensor Parallelism): Partitioning each layer by rows across GPUs, with synchronization at every layer boundary.
  • Tensor Distribution: The mapping (LM.TensorDistribution) that specifies what proportion of the model each GPU holds.
  • GPU Offloading: Moving some or all model layers from CPU to GPU memory for faster computation.
  • Main GPU: The primary device handling scratch operations and small tensor allocations.
  • VRAM (Video RAM): GPU memory that must accommodate model weights, KV-cache, and computation buffers.


Related Concepts

  • Inference: The token generation process that distributed inference accelerates
  • Quantization: Reducing model size to fit more layers per GPU
  • KV-Cache: Per-token memory that grows with context length and consumes GPU VRAM
  • Context Windows: Larger context windows require more GPU memory
  • Large Language Model (LLM): The models that benefit most from multi-GPU distribution
  • Weights: The model parameters distributed across GPUs

Summary

Distributed inference enables running large language models across multiple GPUs when a single device lacks sufficient memory. In LM-Kit.NET, the LM.TensorDistribution class maps model layers or rows to GPUs with configurable proportions, while LM.DeviceConfiguration controls GPU selection and offloading. The Configuration.FavorDistributedInference flag provides a simple one-line toggle for automatic multi-GPU spreading. For production deployments, the GpuDeviceInfo class discovers available hardware at runtime, and the DeviceConfiguration utility recommends optimal context sizes based on free GPU memory. Layer splitting is the default and most common strategy; row splitting provides an alternative for balanced workloads on similar GPUs.
