Offload MoE Expert Weights to CPU with Tensor Overrides
Mixture of Experts (MoE) models like GLM 4.7 Flash (30B total parameters, 64 experts) deliver exceptional quality but require significant VRAM to load all expert weights. With tensor overrides, you can keep attention layers and the router on GPU while offloading the large expert FFN weights to CPU. Only 2 of 64 experts activate per token, so the CPU handles a small fraction of the computation while the GPU handles the latency-sensitive parts.
Why This Matters
MoE models store dozens of expert subnetworks, but only a few activate per token. Loading all experts into VRAM wastes expensive GPU memory on weights that are idle most of the time. Tensor overrides solve this by letting you place specific tensors on CPU based on regex pattern matching against tensor names, keeping the GPU focused on the work that benefits most from parallel computation.
Without tensor overrides: GLM 4.7 Flash Q4 requires ~17 GB VRAM (all 64 experts on GPU). With tensor overrides: Expert FFN weights move to CPU, reducing GPU usage to ~3 GB for attention layers, router, and KV cache. The model runs on a GPU with as little as 6 GB VRAM.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 16 GB (experts live in system memory) |
| VRAM | 6+ GB (attention layers, router, KV cache) |
| GPU backend | CUDA 12/13 or Vulkan enabled |
Step 1: Understand Tensor Names in MoE Models
MoE models use a naming convention for their tensors. Expert FFN weights typically follow patterns like:
blk.0.ffn_gate_exps.weight # Expert gate weights in block 0
blk.0.ffn_up_exps.weight # Expert up-projection in block 0
blk.0.ffn_down_exps.weight # Expert down-projection in block 0
blk.1.ffn_gate_exps.weight # Same pattern in block 1
...
The key suffix is _exps (short for "experts"). Attention layers use names like blk.0.attn_q.weight, blk.0.attn_k.weight, etc. The regex pattern \.ffn_.*_exps\.weight matches all expert FFN weights across all blocks.
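As a quick sanity check, you can test the pattern against sample tensor names with .NET's built-in regex engine before using it in an override. The names below are illustrative examples of the convention described above, not values read from a real model file:

```csharp
using System;
using System.Text.RegularExpressions;

string[] tensorNames =
{
    "blk.0.ffn_gate_exps.weight",  // expert FFN -> should match
    "blk.12.ffn_down_exps.weight", // expert FFN -> should match
    "blk.0.attn_q.weight",         // attention  -> should not match
    "blk.0.ffn_gate.weight"        // non-expert FFN -> should not match
};

var expertPattern = new Regex(@"\.ffn_.*_exps\.weight");

foreach (string name in tensorNames)
    Console.WriteLine($"{name,-30} matched: {expertPattern.IsMatch(name)}");
```

If an attention or shared-FFN tensor matches, the pattern is too broad and would offload more than intended.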
Step 2: Offload All Expert Weights to CPU
The simplest configuration offloads every expert FFN weight to CPU while keeping everything else on GPU:
using LMKit.Model;
var config = new LM.DeviceConfiguration
{
GpuLayerCount = int.MaxValue, // all layers on GPU
TensorOverrides = new List<LM.TensorOverride>
{
// Move all expert FFN weights to CPU
LM.TensorOverride.Cpu(@"\.ffn_.*_exps\.weight")
}
};
using LM model = new LM(
new Uri("https://huggingface.co/lm-kit/glm-4.7-flash-gguf/resolve/main/GLM-4.7-Flash-64x2.6B-Q4_K_M.gguf"),
deviceConfiguration: config,
loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}% "); return true; });
Console.WriteLine($"\nModel loaded on GPU with experts on CPU");
Key points:
- GpuLayerCount = int.MaxValue tells LM-Kit.NET to place all layers on GPU by default.
- The TensorOverride.Cpu(...) override then moves matching tensors back to CPU.
- Overrides are applied in order; the first matching pattern wins for each tensor.
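Because the first matching override wins, you can carve out exceptions before a broad rule. The sketch below reuses the LM-Kit.NET types shown above to keep block 0's experts on GPU while offloading all others; treat it as an illustration of override ordering rather than a tuned configuration:

```csharp
using System.Collections.Generic;
using LMKit.Model;

var config = new LM.DeviceConfiguration
{
    GpuLayerCount = int.MaxValue,
    TensorOverrides = new List<LM.TensorOverride>
    {
        // Listed first, so it wins for block 0's expert tensors
        LM.TensorOverride.Gpu(@"blk\.0\.ffn_.*_exps\.weight", gpuIndex: 0),
        // Every other expert tensor falls through to this rule
        LM.TensorOverride.Cpu(@"\.ffn_.*_exps\.weight")
    }
};
```

Swapping the two entries would send block 0's experts to CPU along with the rest, since the broad pattern would match first.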
Step 3: Validate Memory Fit Before Loading
Use MemoryEstimation.FitParameters with tensor overrides to verify that the configuration fits your hardware before committing to a full load:
using LMKit.Hardware;
using LMKit.Model;
string modelPath = "path/to/GLM-4.7-Flash-64x2.6B-Q4_K_M.lmk";
var config = new LM.DeviceConfiguration
{
GpuLayerCount = int.MaxValue,
TensorOverrides = new List<LM.TensorOverride>
{
LM.TensorOverride.Cpu(@"\.ffn_.*_exps\.weight")
}
};
var fit = MemoryEstimation.FitParameters(
modelPath,
contextSize: 0, // auto-detect maximum context
deviceConfiguration: config);
if (fit.Success)
{
Console.WriteLine($"Fits! Context: {fit.ContextSize} tokens, GPU layers: {fit.GpuLayerCount}");
}
else
{
Console.WriteLine("Does not fit. Try reducing context or offloading more tensors.");
}
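When the maximum context does not fit, a common follow-up is to probe smaller context sizes with the same configuration before changing the overrides. This sketch reuses the `MemoryEstimation.FitParameters` call from above; the specific context sizes are illustrative assumptions, not recommendations:

```csharp
// Hypothetical fallback: probe progressively smaller context sizes
// with the same device configuration before giving up.
foreach (int candidate in new[] { 32768, 16384, 8192, 4096 })
{
    var retry = MemoryEstimation.FitParameters(
        modelPath,
        contextSize: candidate,
        deviceConfiguration: config);

    if (retry.Success)
    {
        Console.WriteLine($"Fits with a {candidate}-token context.");
        break;
    }
}
```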
Step 4: Selectively Offload Specific Layers
For more control, you can offload experts from specific transformer blocks. This is useful when you want to keep early layers fully on GPU (they process every token first) and offload later layers:
using LMKit.Model;
var config = new LM.DeviceConfiguration
{
GpuLayerCount = int.MaxValue,
TensorOverrides = new List<LM.TensorOverride>
{
// Keep experts in blocks 0-9 on GPU (early layers)
// Offload experts in blocks 10-99 to CPU
LM.TensorOverride.Cpu(@"blk\.[1-9]\d\.ffn_.*_exps\.weight")
}
};
using LM model = new LM(modelUri, deviceConfiguration: config);
The first matching override wins for each tensor. If no override matches, the tensor stays on its default device (GPU when GpuLayerCount covers that layer).
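Rather than hand-writing block ranges, you can generate the pattern from a cutoff block. `ExpertPatternFromBlock` is a hypothetical helper written for this guide, not part of LM-Kit.NET, and it assumes the model has fewer than 100 blocks:

```csharp
using System.Linq;

// Hypothetical helper: build a pattern matching expert tensors
// in every block at or above firstCpuBlock (blocks 0-99 assumed).
static string ExpertPatternFromBlock(int firstCpuBlock)
{
    // e.g. firstCpuBlock = 10 -> "blk\.(10|11|...|99)\.ffn_.*_exps\.weight"
    var blocks = string.Join("|", Enumerable.Range(firstCpuBlock, 100 - firstCpuBlock));
    return $@"blk\.({blocks})\.ffn_.*_exps\.weight";
}
```

The generated string can then be passed to `LM.TensorOverride.Cpu(...)` exactly like the hand-written patterns above.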
Step 5: Combine with Multi-GPU
On systems with multiple GPUs, you can distribute work across devices while offloading experts to CPU:
using LMKit.Model;
var config = new LM.DeviceConfiguration
{
GpuLayerCount = int.MaxValue,
TensorOverrides = new List<LM.TensorOverride>
{
// Place attention layers for early blocks on GPU 0
LM.TensorOverride.Gpu(@"blk\.(0|1|2|3|4)\.attn", gpuIndex: 0),
// Place attention layers for later blocks on GPU 1
LM.TensorOverride.Gpu(@"blk\.([5-9]|[1-9]\d)\.attn", gpuIndex: 1),
// Offload all expert weights to CPU
LM.TensorOverride.Cpu(@"\.ffn_.*_exps\.weight")
}
};
using LM model = new LM(modelUri, deviceConfiguration: config);
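Before assigning `gpuIndex` values, it is worth confirming that both devices actually exist; a nonexistent index raises the `InvalidOperationException` covered in Troubleshooting. This sketch assumes `GpuDeviceInfo.Devices` (the API referenced in that section) exposes a countable collection:

```csharp
using System;
using LMKit.Hardware;

// Guard against referencing a GPU index that is not present.
// Assumption: GpuDeviceInfo.Devices is enumerable with a Count.
if (GpuDeviceInfo.Devices.Count < 2)
{
    Console.WriteLine("Fewer than two GPUs detected; " +
                      "fall back to a single-GPU configuration.");
}
```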
Step 6: Use with Inference
Once loaded, the model behaves identically to a fully GPU-loaded model. LM-Kit.NET handles the CPU/GPU data movement transparently:
using LMKit.TextGeneration;
var chat = new MultiTurnConversation(model);
chat.SystemPrompt = "You are a helpful assistant.";
var response = chat.Submit("Explain quantum computing in simple terms.", CancellationToken.None);
Console.WriteLine(response);
Performance Characteristics
| Configuration | VRAM Usage | Prompt Processing | Token Generation | Best For |
|---|---|---|---|---|
| Full GPU | ~17 GB | Fastest | Fastest | GPUs with 24+ GB VRAM |
| Experts on CPU | ~3 GB | Slower (CPU experts) | Moderate | GPUs with 6-16 GB VRAM |
| Partial layer offload | Varies | Moderate | Moderate | When tensor patterns are unknown |
| CPU only | 0 GB | Slowest | Slowest | No GPU available |
Tensor overrides provide a better trade-off than partial layer offloading for MoE models because they keep all attention and routing computation on GPU. Since only 2 of 64 experts activate per token, the CPU workload stays small.
Common Tensor Patterns
| Pattern | What It Matches | Use Case |
|---|---|---|
| \.ffn_.*_exps\.weight | All expert FFN weights | Full expert offloading |
| blk\.(0\|1\|2)\.ffn_.*_exps | Experts in specific blocks | Selective offloading |
| \.attn | Attention layers | Move attention to a specific GPU |
| \.ffn_gate_exps | Expert gate weights only | Partial expert offloading |
Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| Model still uses full VRAM | Pattern does not match any tensors | Verify your regex matches the model's tensor naming convention |
| InvalidOperationException on GPU device | GPU index does not exist | Check available GPUs with GpuDeviceInfo.Devices |
| Slow prompt processing | Expert computation on CPU is the bottleneck | Ensure you have enough system RAM; consider offloading fewer layers |
| Out of memory on CPU | System RAM insufficient for expert weights | Reduce the number of experts offloaded or use a smaller model |
Next Steps
- Configure GPU Backends and Optimize Performance: set up CUDA, Vulkan, or Metal before using tensor overrides.
- Distribute Large Models Across Multiple GPUs: combine tensor overrides with multi-GPU distribution.
- Estimating Memory and Context Size: use MemoryEstimation.FitParameters with tensor overrides for pre-flight validation.
- What Happens When a Model Does Not Fit in GPU Memory?: overview of all memory management strategies.