👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/local-inference/multi-gpu/moe_expert_cpu_offload
MoE Expert CPU Offload for C# .NET Applications
🎯 Purpose of the Demo
An interactive console app that loads a Mixture-of-Experts model larger than a single GPU's VRAM by splitting tensors across CPU and GPU with regex-based placement rules. Demonstrates LM.DeviceConfiguration.TensorOverrides, the same llama.cpp feature that makes a 30B MoE fit on a 24 GB consumer GPU.
All inference runs on-device.
👥 Industry Target Audience
- Teams running 20B+ MoE models on prosumer or workstation hardware.
- Multi-GPU hosts wanting explicit per-tensor placement.
- Inference platform engineers writing per-customer placement policies.
- Latency / throughput investigators benchmarking offload patterns.
🚀 Problem Solved
A 30B MoE has a few hot tensors (attention blocks) that every token passes through and many cold tensors (expert weights), of which only a few are active per token. Loading it naively on a 24 GB GPU runs out of memory (OOM). With LM.TensorOverride.Cpu() you push the cold experts to system RAM, keep the hot tensors on GPU, and trade a small PCIe penalty for a model that actually fits.
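To make the regex-based split concrete, here is a minimal sketch in plain .NET that does not call LM-Kit at all. It only shows how one pattern separates expert tensors from the rest. The tensor names follow the llama.cpp GGUF naming style and the pattern is an assumption, not taken from the demo source. The `PlacementSketch` class and `PlaceTensor` helper are hypothetical names used for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PlacementSketch
{
    // Assumed pattern: matches llama.cpp-style expert tensors such as
    // "blk.3.ffn_up_exps.weight". Not taken from the demo source.
    public const string ExpertPattern = @"\.ffn_(gate|up|down)_exps\.";

    // Returns "CPU" when the pattern matches anywhere in the tensor name,
    // "GPU" otherwise. Regex.IsMatch searches the whole string, so the
    // pattern does not need to cover the full name.
    public static string PlaceTensor(string tensorName, string cpuPattern) =>
        Regex.IsMatch(tensorName, cpuPattern) ? "CPU" : "GPU";

    public static void Main()
    {
        // Illustrative names only; real names depend on the model file.
        string[] names =
        {
            "blk.0.attn_q.weight",
            "blk.0.attn_output.weight",
            "blk.0.ffn_gate_inp.weight",   // router, stays on GPU
            "blk.0.ffn_gate_exps.weight",
            "blk.0.ffn_up_exps.weight",
            "blk.0.ffn_down_exps.weight",
        };

        foreach (var name in names)
            Console.WriteLine($"{PlaceTensor(name, ExpertPattern),-4} {name}");
    }
}
```

In the demo, a pattern of this kind would be passed to LM.TensorOverride.Cpu(pattern). The sketch only previews which names such a pattern would capture.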
💻 Application Overview
Interactive menu (no command-line arguments) with three modes plus a quit option:
| Mode | What it does |
|---|---|
| Load | Pick an MoE model from the list (gptoss:20b, qwen3.6:35b-a3b, qwen3.5:35b-a3b, gemma4:26b-a4b, glm4.7-flash) or type a custom id. Loads with the CPU-experts + GPU-hot override list. |
| Chat | Free-form streamed prompt on the loaded model. Reports throughput. |
| Bench | Run the standard prompt N times and report mean tokens/sec. |
| Quit | Exit. |
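The Bench mode's "mean tokens/sec" can be read as the average of the per-run rates. The sketch below shows that arithmetic in plain .NET, assuming each run yields a token count and an elapsed time. `BenchSketch`, `TimeRun`, and `MeanTokensPerSecond` are hypothetical names, and the demo itself may compute or weight the mean differently.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

public static class BenchSketch
{
    // Times one generation call. `generate` is a stand-in for whatever
    // produces text and returns the number of tokens it emitted.
    public static (int Tokens, double Seconds) TimeRun(Func<int> generate)
    {
        var stopwatch = Stopwatch.StartNew();
        int tokens = generate();
        stopwatch.Stop();
        return (tokens, stopwatch.Elapsed.TotalSeconds);
    }

    // Mean of the per-run rates (tokens / seconds), one rate per run.
    public static double MeanTokensPerSecond(
        IEnumerable<(int Tokens, double Seconds)> runs)
    {
        var rates = runs
            .Where(r => r.Seconds > 0)          // skip degenerate timings
            .Select(r => r.Tokens / r.Seconds)
            .ToList();

        if (rates.Count == 0)
            throw new ArgumentException("No valid runs to average.", nameof(runs));

        return rates.Average();
    }

    public static void Main()
    {
        // Fixed sample data so the arithmetic is easy to follow.
        var runs = new List<(int Tokens, double Seconds)>
        {
            (100, 2.0),   // 50 tokens/sec
            (120, 3.0),   // 40 tokens/sec
        };
        Console.WriteLine($"{MeanTokensPerSecond(runs):F1} tokens/sec");
    }
}
```

Averaging per-run rates gives each run equal weight. Dividing total tokens by total time would instead weight longer runs more heavily, so the two figures can differ when run lengths vary.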
At startup the demo enumerates GpuDeviceInfo.Devices and sets Configuration.FavorDistributedInference = (gpuCount > 1) so multi-GPU hosts split automatically.
✨ Key Features
- LM.TensorOverride.Cpu(pattern) / Gpu(pattern, gpuIndex): regex-based placement.
- Configuration.FavorDistributedInference: split across visible GPUs.
- LM.DeviceConfiguration.AutoFitToVram = true: retry-on-OOM behavior.
- Reports LayerCount, GpuLayerCount, ParameterCount, TokenGenerationRate.
🧠 Models (all current MoE)
- gptoss:20b: GPT-OSS 20B MoE
- qwen3.6:35b-a3b: Qwen 3.6 35B-A3B MoE (newest)
- qwen3.5:35b-a3b: Qwen 3.5 35B-A3B MoE
- gemma4:26b-a4b: Gemma 4 26B-A4B MoE
- glm4.7-flash: GLM 4.7 Flash MoE
🛠️ Getting Started
📋 Prerequisites
- .NET 8.0 or later
- At least one GPU
- Enough system RAM to hold the expert tensors (varies by model)
- Free disk space for the model file (10-25 GB)
▶️ Running the Application
git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/local-inference/multi-gpu/moe_expert_cpu_offload
dotnet run
Pick a mode from the menu.
🔧 Troubleshooting
- OOM during load: the experts pattern did not match. Check the model's tensor names; for some architectures the regex must be relaxed.
- Throughput is low: Configuration.FavorDistributedInference defaults to false; the demo enables it when more than one GPU is visible.
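When an OOM suggests the experts pattern matched nothing, one quick check is to count how many tensor names a pattern actually hits. The sketch below is plain .NET with assumed llama.cpp-style tensor names. `PatternCheck` and `CountMatches` are hypothetical helpers, not part of LM-Kit. It contrasts an over-anchored pattern with a relaxed one.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class PatternCheck
{
    // Counts the tensor names in which the pattern is found.
    public static int CountMatches(string pattern, string[] tensorNames) =>
        tensorNames.Count(name => Regex.IsMatch(name, pattern));

    public static void Main()
    {
        // Illustrative names only; list the real ones from your model file.
        string[] names =
        {
            "blk.0.attn_q.weight",
            "blk.0.ffn_gate_exps.weight",
            "blk.0.ffn_up_exps.weight",
            "blk.0.ffn_down_exps.weight",
        };

        // Anchored with '$' but missing the ".weight" suffix: matches nothing.
        string strict = @"^blk\.\d+\.ffn_up_exps$";
        // Unanchored and tolerant of the gate/up/down variants.
        string relaxed = @"ffn_.*_exps";

        foreach (var pattern in new[] { strict, relaxed })
        {
            int hits = CountMatches(pattern, names);
            string note = hits == 0 ? "  <- no tensors matched, nothing is offloaded" : "";
            Console.WriteLine($"{pattern}: {hits} match(es){note}");
        }
    }
}
```

A pattern with zero hits leaves every tensor on its default device, which would be consistent with the OOM symptom described above.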