Offload MoE Expert Weights to CPU with Tensor Overrides
Mixture of Experts (MoE) models like GLM 4.7 Flash (30B total parameters, 64 experts) deliver exceptional quality but require significant VRAM to load all expert weights. With tensor overrides, you can keep attention layers and the router on GPU while offloading the large expert FFN weights to CPU. Only 2 of 64 experts activate per token, so the CPU handles a small fraction of the computation while the GPU handles the latency-sensitive parts.
Why This Matters
MoE models store dozens of expert subnetworks, but only a few activate per token. Loading all experts into VRAM wastes expensive GPU memory on weights that are idle most of the time. Tensor overrides solve this by letting you place specific tensors on CPU based on regex pattern matching against tensor names, keeping the GPU focused on the work that benefits most from parallel computation.
Without tensor overrides: GLM 4.7 Flash Q4 requires ~17 GB VRAM (all 64 experts on GPU). With tensor overrides: Expert FFN weights move to CPU, reducing GPU usage to ~3 GB for attention layers, router, and KV cache. The model runs on a GPU with as little as 6 GB VRAM.
Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 16 GB (experts live in system memory) |
| VRAM | 6+ GB (attention layers, router, KV cache) |
| GPU backend | CUDA 12/13 or Vulkan enabled |
Step 1: Understand Tensor Names in MoE Models
MoE models use a naming convention for their tensors. Expert FFN weights typically follow patterns like:
blk.0.ffn_gate_exps.weight # Expert gate weights in block 0
blk.0.ffn_up_exps.weight # Expert up-projection in block 0
blk.0.ffn_down_exps.weight # Expert down-projection in block 0
blk.1.ffn_gate_exps.weight # Same pattern in block 1
...
The key suffix is _exps (short for "experts"). Attention layers use names like blk.0.attn_q.weight, blk.0.attn_k.weight, etc. The regex pattern \.ffn_.*_exps\.weight matches all expert FFN weights across all blocks.
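As a quick sanity check, you can test the pattern against sample tensor names with .NET's built-in regex engine before using it in an override. The names below are illustrative examples of the convention described above, not values read from a real model file:

```csharp
using System;
using System.Text.RegularExpressions;

string[] tensorNames =
{
    "blk.0.ffn_gate_exps.weight",  // expert FFN -> should match
    "blk.12.ffn_down_exps.weight", // expert FFN -> should match
    "blk.0.attn_q.weight",         // attention  -> should not match
    "blk.0.ffn_gate.weight"        // non-expert FFN -> should not match
};

var expertPattern = new Regex(@"\.ffn_.*_exps\.weight");

foreach (string name in tensorNames)
    Console.WriteLine($"{name,-30} matched: {expertPattern.IsMatch(name)}");
```

If an attention or shared-FFN tensor matches, the pattern is too broad and would offload more than intended.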
Step 2: Offload All Expert Weights to CPU
The simplest configuration offloads every expert FFN weight to CPU while keeping everything else on GPU:
using LMKit.Model;
var config = new LM.DeviceConfiguration
{
GpuLayerCount = int.MaxValue, // all layers on GPU
TensorOverrides = new List<LM.TensorOverride>
{
// Move all expert FFN weights to CPU
LM.TensorOverride.Cpu(@"\.ffn_.*_exps\.weight")
}
};
using LM model = new LM(
new Uri("https://huggingface.co/lm-kit/glm-4.7-flash-gguf/resolve/main/GLM-4.7-Flash-64x2.6B-Q4_K_M.gguf"),
deviceConfiguration: config,
loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}% "); return true; });
Console.WriteLine($"\nModel loaded on GPU with experts on CPU");
Key points:
- GpuLayerCount = int.MaxValue tells LM-Kit.NET to place all layers on GPU by default.
- The TensorOverride.Cpu(...) override then moves matching tensors back to CPU.
- Overrides are applied in order; the first matching pattern wins for each tensor.
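Because the first matching override wins, you can carve out exceptions before a broad rule. The sketch below reuses the LM-Kit.NET types shown above to keep block 0's experts on GPU while offloading all others; treat it as an illustration of override ordering rather than a tuned configuration:

```csharp
using System.Collections.Generic;
using LMKit.Model;

var config = new LM.DeviceConfiguration
{
    GpuLayerCount = int.MaxValue,
    TensorOverrides = new List<LM.TensorOverride>
    {
        // Listed first, so it wins for block 0's expert tensors
        LM.TensorOverride.Gpu(@"blk\.0\.ffn_.*_exps\.weight", gpuIndex: 0),
        // Every other expert tensor falls through to this rule
        LM.TensorOverride.Cpu(@"\.ffn_.*_exps\.weight")
    }
};
```

Swapping the two entries would send block 0's experts to CPU along with the rest, since the broad pattern would match first.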
Step 3: Validate Memory Fit Before Loading
Use MemoryEstimation.FitParameters with tensor overrides to verify that the configuration fits your hardware before committing to a full load:
using LMKit.Hardware;
using LMKit.Model;
string modelPath = "path/to/GLM-4.7-Flash-64x2.6B-Q4_K_M.lmk";
var config = new LM.DeviceConfiguration
{
GpuLayerCount = int.MaxValue,
TensorOverrides = new List<LM.TensorOverride>
{
LM.TensorOverride.Cpu(@"\.ffn_.*_exps\.weight")
}
};
var fit = MemoryEstimation.FitParameters(
modelPath,
contextSize: 0, // auto-detect maximum context
deviceConfiguration: config);
if (fit.Success)
{
Console.WriteLine($"Fits! Context: {fit.ContextSize} tokens, GPU layers: {fit.GpuLayerCount}");
}
else
{
Console.WriteLine("Does not fit. Try reducing context or offloading more tensors.");
}
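When the maximum context does not fit, a common follow-up is to probe smaller context sizes with the same configuration before changing the overrides. This sketch reuses the `MemoryEstimation.FitParameters` call from above; the specific context sizes are illustrative assumptions, not recommendations:

```csharp
// Hypothetical fallback: probe progressively smaller context sizes
// with the same device configuration before giving up.
foreach (int candidate in new[] { 32768, 16384, 8192, 4096 })
{
    var retry = MemoryEstimation.FitParameters(
        modelPath,
        contextSize: candidate,
        deviceConfiguration: config);

    if (retry.Success)
    {
        Console.WriteLine($"Fits with a {candidate}-token context.");
        break;
    }
}
```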
Step 4: Selectively Offload Specific Layers
For more control, you can offload experts from specific transformer blocks. This is useful when you want to keep early layers fully on GPU (they process every token first) and offload later layers:
using LMKit.Model;
var config = new LM.DeviceConfiguration
{
GpuLayerCount = int.MaxValue,
TensorOverrides = new List<LM.TensorOverride>
{
// Keep experts in blocks 0-9 on GPU (early layers)
// Offload experts in blocks 10-99 to CPU
LM.TensorOverride.Cpu(@"blk\.[1-9]\d\.ffn_.*_exps\.weight")
}
};
using LM model = new LM(modelUri, deviceConfiguration: config);
The first matching override wins for each tensor. If no override matches, the tensor stays on its default device (GPU when GpuLayerCount covers that layer).
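Rather than hand-writing block ranges, you can generate the pattern from a cutoff block. `ExpertPatternFromBlock` is a hypothetical helper written for this guide, not part of LM-Kit.NET, and it assumes the model has fewer than 100 blocks:

```csharp
using System.Linq;

// Hypothetical helper: build a pattern matching expert tensors
// in every block at or above firstCpuBlock (blocks 0-99 assumed).
static string ExpertPatternFromBlock(int firstCpuBlock)
{
    // e.g. firstCpuBlock = 10 -> "blk\.(10|11|...|99)\.ffn_.*_exps\.weight"
    var blocks = string.Join("|", Enumerable.Range(firstCpuBlock, 100 - firstCpuBlock));
    return $@"blk\.({blocks})\.ffn_.*_exps\.weight";
}
```

The generated string can then be passed to `LM.TensorOverride.Cpu(...)` exactly like the hand-written patterns above.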
Step 5: Combine with Multi-GPU
On systems with multiple GPUs, you can distribute work across devices while offloading experts to CPU:
using LMKit.Model;
var config = new LM.DeviceConfiguration
{
GpuLayerCount = int.MaxValue,
TensorOverrides = new List<LM.TensorOverride>
{
// Place attention layers for early blocks on GPU 0
LM.TensorOverride.Gpu(@"blk\.(0|1|2|3|4)\.attn", gpuIndex: 0),
// Place attention layers for later blocks on GPU 1
LM.TensorOverride.Gpu(@"blk\.([5-9]|[1-9]\d)\.attn", gpuIndex: 1),
// Offload all expert weights to CPU
LM.TensorOverride.Cpu(@"\.ffn_.*_exps\.weight")
}
};
using LM model = new LM(modelUri, deviceConfiguration: config);
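Before assigning `gpuIndex` values, it is worth confirming that both devices actually exist; a nonexistent index raises the `InvalidOperationException` covered in Troubleshooting. This sketch assumes `GpuDeviceInfo.Devices` (the API referenced in that section) exposes a countable collection:

```csharp
using System;
using LMKit.Hardware;

// Guard against referencing a GPU index that is not present.
// Assumption: GpuDeviceInfo.Devices is enumerable with a Count.
if (GpuDeviceInfo.Devices.Count < 2)
{
    Console.WriteLine("Fewer than two GPUs detected; " +
                      "fall back to a single-GPU configuration.");
}
```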
Step 6: Use with Inference
Once loaded, the model behaves identically to a fully GPU-loaded model. LM-Kit.NET handles the CPU/GPU data movement transparently:
using LMKit.TextGeneration;
var chat = new MultiTurnConversation(model);
chat.SystemPrompt = "You are a helpful assistant.";
var response = chat.Submit("Explain quantum computing in simple terms.", CancellationToken.None);
Console.WriteLine(response);
Performance Characteristics
| Configuration | VRAM Usage | Prompt Processing | Token Generation | Best For |
|---|---|---|---|---|
| Full GPU | ~17 GB | Fastest | Fastest | GPUs with 24+ GB VRAM |
| Experts on CPU | ~3 GB | Slower (CPU experts) | Moderate | GPUs with 6-16 GB VRAM |
| Partial layer offload | Varies | Moderate | Moderate | When tensor patterns are unknown |
| CPU only | 0 GB | Slowest | Slowest | No GPU available |
Tensor overrides provide a better trade-off than partial layer offloading for MoE models because they keep all attention and routing computation on GPU. Since only 2 of 64 experts activate per token, the CPU workload stays small.
Common Tensor Patterns
| Pattern | What It Matches | Use Case |
|---|---|---|
| \.ffn_.*_exps\.weight | All expert FFN weights | Full expert offloading |
| blk\.(0\|1\|2)\.ffn_.*_exps | Experts in specific blocks | Selective offloading |
| \.attn | Attention layers | Move attention to a specific GPU |
| \.ffn_gate_exps | Expert gate weights only | Partial expert offloading |
Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| Model still uses full VRAM | Pattern does not match any tensors | Verify your regex matches the model's tensor naming convention |
| InvalidOperationException on GPU device | GPU index does not exist | Check available GPUs with GpuDeviceInfo.Devices |
| Slow prompt processing | Expert computation on CPU is the bottleneck | Ensure you have enough system RAM; consider offloading fewer layers |
| Out of memory on CPU | System RAM insufficient for expert weights | Reduce the number of experts offloaded or use a smaller model |
Next Steps
- Configure GPU Backends and Optimize Performance: set up CUDA, Vulkan, or Metal before using tensor overrides.
- Distribute Large Models Across Multiple GPUs: combine tensor overrides with multi-GPU distribution.
- Estimating Memory and Context Size: use MemoryEstimation.FitParameters with tensor overrides for pre-flight validation.
- What Happens When a Model Does Not Fit in GPU Memory?: overview of all memory management strategies.