👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/local-inference/multi-gpu/moe_expert_cpu_offload

MoE Expert CPU Offload for C# .NET Applications

🎯 Purpose of the Demo

An interactive console app that loads a Mixture-of-Experts (MoE) model larger than a single GPU's VRAM by splitting its tensors between CPU and GPU with regex-based placement rules. It demonstrates LM.DeviceConfiguration.TensorOverrides, the same llama.cpp feature that makes a 30B MoE fit on a 24 GB consumer GPU.

All inference runs on-device.

👥 Industry Target Audience

Teams running 20B+ MoE models on prosumer or workstation hardware.
Multi-GPU hosts wanting explicit per-tensor placement.
Inference platform engineers writing per-customer placement policies.
Latency / throughput investigators benchmarking offload patterns.

🚀 Problem Solved

A 30B MoE has a few hot tensors (attention blocks) that every token uses and many cold tensors (expert weights), of which only a few are active per token. Loading the whole model onto a 24 GB GPU runs out of memory (OOM). With LM.TensorOverride.Cpu() you push the cold experts to system RAM, keep the hot tensors on the GPU, and trade a small PCIe penalty for a model that actually fits.
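
To make the hot/cold split concrete, here is a minimal sketch of regex-based placement using only the .NET base library. It does not call LM-Kit. The tensor names, both patterns, the device labels, and the first-match-wins ordering are assumptions chosen for illustration (the names follow the llama.cpp GGUF style); the demo's real rule list may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Ordered placement rules: the first pattern that matches a tensor name wins.
// Both patterns are illustrative assumptions, not the demo's actual rule list.
var rules = new List<(Regex Pattern, string Device)>
{
    (new Regex(@"\.ffn_(up|down|gate)_exps\."), "CPU"),  // cold expert weights
    (new Regex(@".*"), "GPU0"),                           // everything else stays on GPU
};

// Returns the device of the first matching rule, or "default" if none match.
string Place(string tensorName)
{
    foreach (var (pattern, device) in rules)
    {
        if (pattern.IsMatch(tensorName)) return device;
    }
    return "default";
}

// Hypothetical tensor names in the llama.cpp GGUF naming style.
string[] tensors =
{
    "blk.0.attn_q.weight",
    "blk.0.ffn_gate_exps.weight",
    "blk.0.ffn_down_exps.weight",
    "output.weight",
};

foreach (var name in tensors)
{
    Console.WriteLine($"{name,-30} -> {Place(name)}");
}
```

In this sketch the expert rule comes first because the catch-all rule would otherwise claim every tensor.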

💻 Application Overview

Interactive menu (no command-line arguments) with three modes plus Quit:

Mode	What it does
Load	Pick an MoE model from the list (gptoss:20b, qwen3.6:35b-a3b, qwen3.5:35b-a3b, gemma4:26b-a4b, glm4.7-flash) or type a custom id. Loads with the CPU-experts + GPU-hot override list.
Chat	Free-form streamed prompt on the loaded model. Reports throughput.
Bench	Run the standard prompt N times and report mean tokens/sec.
Quit	Exit.
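
The Bench mode's "mean tokens/sec" can be sketched as below. This is not the demo's code: `generate` is a hypothetical stand-in for one completion on the loaded model, and averaging the per-run rates (rather than dividing total tokens by total time) is an assumption about what "mean" refers to.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;

// Mean of the per-run rates (tokens / seconds) over all runs.
double MeanRate(IEnumerable<(int Tokens, double Seconds)> runs) =>
    runs.Average(r => r.Tokens / r.Seconds);

// Times `generate` n times. `generate` is a hypothetical stand-in for one
// completion on the loaded model and returns the number of tokens produced.
List<(int Tokens, double Seconds)> TimeRuns(Func<int> generate, int n)
{
    var runs = new List<(int Tokens, double Seconds)>();
    for (int i = 0; i < n; i++)
    {
        var sw = Stopwatch.StartNew();
        int tokens = generate();
        sw.Stop();
        runs.Add((tokens, sw.Elapsed.TotalSeconds));
    }
    return runs;
}

// Fake generator so the sketch runs without a model: 64 tokens after a ~50 ms sleep.
var measured = TimeRuns(() => { Thread.Sleep(50); return 64; }, n: 3);
Console.WriteLine($"mean: {MeanRate(measured):F1} tokens/sec over {measured.Count} runs");
```

Keeping the timing and the averaging in separate functions makes the arithmetic checkable with fixed inputs: two runs of 100 tokens in 2 s and 90 tokens in 3 s give rates of 50 and 30, so a mean of 40 tokens/sec.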

At startup the demo enumerates GpuDeviceInfo.Devices and sets Configuration.FavorDistributedInference = (gpuCount > 1) so multi-GPU hosts split automatically.

✨ Key Features

LM.TensorOverride.Cpu(pattern) / Gpu(pattern, gpuIndex): regex-based placement.
Configuration.FavorDistributedInference: split across visible GPUs.
LM.DeviceConfiguration.AutoFitToVram = true: retry-on-OOM behavior.
Reports LayerCount, GpuLayerCount, ParameterCount, TokenGenerationRate.

🧠 Models (all current MoE)

gptoss:20b — GPT-OSS 20B MoE
qwen3.6:35b-a3b — Qwen 3.6 35B-A3B MoE (newest)
qwen3.5:35b-a3b — Qwen 3.5 35B-A3B MoE
gemma4:26b-a4b — Gemma 4 26B-A4B MoE
glm4.7-flash — GLM 4.7 Flash MoE

🛠️ Getting Started

📋 Prerequisites

.NET 8.0 or later
At least one GPU
Enough system RAM to hold the expert tensors (varies by model)
Free disk space for the model file (10-25 GB)

▶️ Running the Application

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/local-inference/multi-gpu/moe_expert_cpu_offload
dotnet run

Pick a mode from the menu.

🔧 Troubleshooting

OOM during load: the experts pattern probably did not match, so the expert tensors stayed on the GPU. Check the model's tensor names; for some architectures the regex must be relaxed.
Throughput is low: Configuration.FavorDistributedInference defaults to false; the demo enables it when more than one GPU is visible.
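
One way to investigate the OOM case is to count how many tensor names a pattern matches before loading. The sketch below uses only the .NET base library; the tensor names (including the fused `ffn_gate_up_exps` variant) and both patterns are hypothetical, picked to show a strict pattern missing part of the experts.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Counts how many tensor names a placement pattern matches.
int CountMatches(string pattern, string[] tensorNames) =>
    tensorNames.Count(name => Regex.IsMatch(name, pattern));

// Hypothetical tensor names; the fused gate_up variant is invented for illustration.
string[] names =
{
    "blk.0.attn_q.weight",
    "blk.0.ffn_gate_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
};

string strict = @"\.ffn_(up|down|gate)_exps\.";  // misses the fused gate_up tensor
string relaxed = @"_exps\.";                      // matches any expert tensor name here

foreach (var pattern in new[] { strict, relaxed })
{
    int hits = CountMatches(pattern, names);
    Console.WriteLine($"{pattern,-32} matched {hits} of {names.Length} tensors");
    if (hits == 0)
    {
        Console.WriteLine("  no matches: every expert would stay on the GPU");
    }
}
```

A pattern that matches fewer tensors than expected leaves the remaining experts on the GPU, which can still exhaust VRAM.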
