Offload MoE Expert Weights to CPU with Tensor Overrides

Mixture of Experts (MoE) models like GLM 4.7 Flash (30B total parameters, 64 experts) deliver exceptional quality but require significant VRAM to load all expert weights. With tensor overrides, you can keep attention layers and the router on GPU while offloading the large expert FFN weights to CPU. Only 2 of 64 experts activate per token, so the CPU handles a small fraction of the computation while the GPU handles the latency-sensitive parts.


Why This Matters

MoE models store dozens of expert subnetworks, but only a few activate per token. Loading all experts into VRAM wastes expensive GPU memory on weights that are idle most of the time. Tensor overrides solve this by letting you place specific tensors on CPU based on regex pattern matching against tensor names, keeping the GPU focused on the work that benefits most from parallel computation.

Without tensor overrides: GLM 4.7 Flash Q4 requires ~17 GB VRAM (all 64 experts on GPU). With tensor overrides: Expert FFN weights move to CPU, reducing GPU usage to ~3 GB for attention layers, router, and KV cache. The model runs on a GPU with as little as 6 GB VRAM.


Prerequisites

| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 16 GB (experts live in system memory) |
| VRAM | 6+ GB (attention layers, router, KV cache) |
| GPU backend | CUDA 12/13 or Vulkan enabled |

Step 1: Understand Tensor Names in MoE Models

MoE models use a naming convention for their tensors. Expert FFN weights typically follow patterns like:

blk.0.ffn_gate_exps.weight    # Expert gate weights in block 0
blk.0.ffn_up_exps.weight      # Expert up-projection in block 0
blk.0.ffn_down_exps.weight    # Expert down-projection in block 0
blk.1.ffn_gate_exps.weight    # Same pattern in block 1
...

The key suffix is _exps (short for "experts"). Attention layers use names like blk.0.attn_q.weight, blk.0.attn_k.weight, etc. The regex pattern \.ffn_.*_exps\.weight matches all expert FFN weights across all blocks.
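
Before loading a multi-gigabyte model, you can sanity-check a pattern against sample tensor names with the standard System.Text.RegularExpressions API (the tensor names below follow the convention shown above):

```csharp
using System.Text.RegularExpressions;

string pattern = @"\.ffn_.*_exps\.weight";

// Expert FFN tensors match...
Console.WriteLine(Regex.IsMatch("blk.0.ffn_gate_exps.weight", pattern));  // True
Console.WriteLine(Regex.IsMatch("blk.17.ffn_down_exps.weight", pattern)); // True

// ...while attention and shared (non-expert) FFN tensors do not.
Console.WriteLine(Regex.IsMatch("blk.0.attn_q.weight", pattern));   // False
Console.WriteLine(Regex.IsMatch("blk.0.ffn_gate.weight", pattern)); // False
```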


Step 2: Offload All Expert Weights to CPU

The simplest configuration offloads every expert FFN weight to CPU while keeping everything else on GPU:

using LMKit.Model;

var config = new LM.DeviceConfiguration
{
    GpuLayerCount = int.MaxValue,  // all layers on GPU
    TensorOverrides = new List<LM.TensorOverride>
    {
        // Move all expert FFN weights to CPU
        LM.TensorOverride.Cpu(@"\.ffn_.*_exps\.weight")
    }
};

using LM model = new LM(
    new Uri("https://huggingface.co/lm-kit/glm-4.7-flash-gguf/resolve/main/GLM-4.7-Flash-64x2.6B-Q4_K_M.gguf"),
    deviceConfiguration: config,
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}%   "); return true; });

Console.WriteLine("\nModel loaded on GPU with experts on CPU");

Key points:

  • GpuLayerCount = int.MaxValue tells LM-Kit.NET to place all layers on GPU by default.
  • The TensorOverride.Cpu(...) override then moves matching tensors back to CPU.
  • Overrides are applied in order; the first matching pattern wins for each tensor.
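
The first-match-wins behavior can be illustrated with a small conceptual sketch (this models the resolution order described above, not LM-Kit.NET's internal implementation; the `Resolve` helper and device labels are hypothetical):

```csharp
using System.Text.RegularExpressions;

// Scan the override list in order and return the device of the first
// matching pattern; null means "use the default device" (GPU here,
// since GpuLayerCount covers all layers).
static string? Resolve(string tensorName, (string Pattern, string Device)[] overrides)
{
    foreach (var (pattern, device) in overrides)
        if (Regex.IsMatch(tensorName, pattern))
            return device;
    return null;
}

var overrides = new[]
{
    (@"blk\.0\.ffn_.*_exps\.weight", "GPU0"),  // more specific rule first
    (@"\.ffn_.*_exps\.weight",       "CPU")    // broader rule second
};

Console.WriteLine(Resolve("blk.0.ffn_gate_exps.weight", overrides)); // GPU0 (first match wins)
Console.WriteLine(Resolve("blk.5.ffn_gate_exps.weight", overrides)); // CPU
Console.WriteLine(Resolve("blk.5.attn_q.weight", overrides) ?? "default device");
```

Because the first match wins, place narrow patterns before broad ones; reversing the two entries above would send block 0's experts to CPU as well.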

Step 3: Validate Memory Fit Before Loading

Use MemoryEstimation.FitParameters with tensor overrides to verify that the configuration fits your hardware before committing to a full load:

using LMKit.Hardware;
using LMKit.Model;

string modelPath = "path/to/GLM-4.7-Flash-64x2.6B-Q4_K_M.lmk";

var config = new LM.DeviceConfiguration
{
    GpuLayerCount = int.MaxValue,
    TensorOverrides = new List<LM.TensorOverride>
    {
        LM.TensorOverride.Cpu(@"\.ffn_.*_exps\.weight")
    }
};

var fit = MemoryEstimation.FitParameters(
    modelPath,
    contextSize: 0,  // auto-detect maximum context
    deviceConfiguration: config);

if (fit.Success)
{
    Console.WriteLine($"Fits! Context: {fit.ContextSize} tokens, GPU layers: {fit.GpuLayerCount}");
}
else
{
    Console.WriteLine("Does not fit. Try reducing context or offloading more tensors.");
}

Step 4: Selectively Offload Specific Layers

For more control, you can offload experts from specific transformer blocks. This is useful when you want to keep early layers fully on GPU (they process every token first) and offload later layers:

using LMKit.Model;

var config = new LM.DeviceConfiguration
{
    GpuLayerCount = int.MaxValue,
    TensorOverrides = new List<LM.TensorOverride>
    {
        // Keep experts in blocks 0-9 on GPU (early layers)
        // Offload experts in blocks 10+ to CPU
        LM.TensorOverride.Cpu(@"blk\.[1-9]\d+\.ffn_.*_exps\.weight")
    }
};

using LM model = new LM(modelUri, deviceConfiguration: config);

The first matching override wins for each tensor. If no override matches, the tensor stays on its default device (GPU when GpuLayerCount covers that layer).
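
A pattern like `blk\.[1-9]\d+\.` matches only block indices with two or more digits whose first digit is non-zero, i.e. blocks 10 and above. A quick check against sample tensor names:

```csharp
using System.Text.RegularExpressions;

// Matches expert tensors in blocks 10+ only.
string pattern = @"blk\.[1-9]\d+\.ffn_.*_exps\.weight";

Console.WriteLine(Regex.IsMatch("blk.9.ffn_gate_exps.weight", pattern));  // False: stays on GPU
Console.WriteLine(Regex.IsMatch("blk.10.ffn_gate_exps.weight", pattern)); // True: offloaded to CPU
Console.WriteLine(Regex.IsMatch("blk.42.ffn_up_exps.weight", pattern));   // True: offloaded to CPU
```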


Step 5: Combine with Multi-GPU

On systems with multiple GPUs, you can distribute work across devices while offloading experts to CPU:

using LMKit.Model;

var config = new LM.DeviceConfiguration
{
    GpuLayerCount = int.MaxValue,
    TensorOverrides = new List<LM.TensorOverride>
    {
        // Place attention layers for early blocks on GPU 0
        LM.TensorOverride.Gpu(@"blk\.[0-4]\.attn", gpuIndex: 0),
        // Place attention layers for later blocks on GPU 1
        LM.TensorOverride.Gpu(@"blk\.([5-9]|[1-9]\d)\.attn", gpuIndex: 1),
        // Offload all expert weights to CPU
        LM.TensorOverride.Cpu(@"\.ffn_.*_exps\.weight")
    }
};

using LM model = new LM(modelUri, deviceConfiguration: config);

Step 6: Use with Inference

Once loaded, the model behaves identically to a fully GPU-loaded model. LM-Kit.NET handles the CPU/GPU data movement transparently:

using LMKit.TextGeneration;

var chat = new MultiTurnConversation(model);
chat.SystemPrompt = "You are a helpful assistant.";

var response = chat.Submit("Explain quantum computing in simple terms.", CancellationToken.None);
Console.WriteLine(response);

Performance Characteristics

| Configuration | VRAM Usage | Prompt Processing | Token Generation | Best For |
|---|---|---|---|---|
| Full GPU | ~17 GB | Fastest | Fastest | GPUs with 24+ GB VRAM |
| Experts on CPU | ~3 GB | Slower (CPU experts) | Moderate | GPUs with 6-16 GB VRAM |
| Partial layer offload | Varies | Moderate | Moderate | When tensor patterns are unknown |
| CPU only | 0 GB | Slowest | Slowest | No GPU available |

Tensor overrides provide a better trade-off than partial layer offloading for MoE models because they keep all attention and routing computation on GPU. Since only 2 of 64 experts activate per token, the CPU workload stays small.
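
The size of that CPU workload follows from simple arithmetic (numbers from the model described above):

```csharp
// Per-token fraction of expert weights actually read with top-2 routing.
int expertsPerLayer = 64;  // total routed experts per MoE layer
int activeExperts = 2;     // experts selected per token by the router

double activeFraction = (double)activeExperts / expertsPerLayer;
Console.WriteLine($"Active expert fraction per token: {activeFraction:P2}"); // ~3%
```

So even though the bulk of the model's weights live in system memory, the CPU only touches about 3% of them per token.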


Common Tensor Patterns

| Pattern | What It Matches | Use Case |
|---|---|---|
| `\.ffn_.*_exps\.weight` | All expert FFN weights | Full expert offloading |
| `blk\.(0\|1\|2)\.ffn_.*_exps` | Experts in specific blocks | Selective offloading |
| `\.attn` | Attention layers | Move attention to specific GPU |
| `\.ffn_gate_exps` | Expert gate weights only | Partial expert offloading |

Troubleshooting

| Problem | Cause | Fix |
|---|---|---|
| Model still uses full VRAM | Pattern does not match any tensors | Verify your regex matches the model's tensor naming convention |
| `InvalidOperationException` on GPU device | GPU index does not exist | Check available GPUs with `GpuDeviceInfo.Devices` |
| Slow prompt processing | Expert computation on CPU is the bottleneck | Ensure you have enough system RAM; consider offloading fewer layers |
| Out of memory on CPU | System RAM insufficient for expert weights | Reduce the number of experts offloaded or use a smaller model |
