👉 Try the demo: https://github.com/LM-Kit/lm-kit-net-samples/tree/main/console_net/local-inference/multi-gpu/moe_expert_cpu_offload

MoE Expert CPU Offload for C# .NET Applications


🎯 Purpose of the Demo

An interactive console app that loads a Mixture-of-Experts (MoE) model larger than a single GPU's VRAM by splitting its tensors between CPU and GPU with regex-based placement rules. It demonstrates LM.DeviceConfiguration.TensorOverrides, the same llama.cpp feature that lets a 30B MoE fit on a 24 GB consumer GPU.

All inference runs on-device.


👥 Industry Target Audience

  • Teams running 20B+ MoE models on prosumer or workstation hardware.
  • Multi-GPU hosts wanting explicit per-tensor placement.
  • Inference platform engineers writing per-customer placement policies.
  • Latency / throughput investigators benchmarking offload patterns.

🚀 Problem Solved

A 30B MoE has a few hot tensors (attention blocks), which are used for every token, and many cold tensors (expert weights), of which only a few are active per token. Loading the whole model onto a 24 GB GPU fails with an out-of-memory (OOM) error. With LM.TensorOverride.Cpu() you push the cold experts to system RAM, keep the hot tensors on the GPU, and accept a small PCIe penalty in exchange for a model that actually fits.


💻 Application Overview

Interactive menu (no command-line arguments) with three modes:

| Mode  | What it does |
|-------|--------------|
| Load  | Pick an MoE model from the list (gptoss:20b, qwen3.6:35b-a3b, qwen3.5:35b-a3b, gemma4:26b-a4b, glm4.7-flash) or type a custom id. Loads with the CPU-experts + GPU-hot override list. |
| Chat  | Free-form streamed prompt on the loaded model. Reports throughput. |
| Bench | Run the standard prompt N times and report mean tokens/sec. |
| Quit  | Exit. |
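
The Bench mode's "mean tokens/sec" can be illustrated with a minimal sketch. It is not the demo's code: the `generate` delegate is a hypothetical stand-in that sleeps and returns a fixed token count, where the real demo would run a completion on the loaded model. The mean here is the average of the per-run rates, which is an assumption about how the demo aggregates.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;

// Hypothetical stand-in for a generation call: returns the number of tokens produced.
Func<string, int> generate = prompt =>
{
    Thread.Sleep(20); // simulate inference work
    return 64;        // pretend 64 tokens were generated
};

// Run the prompt `runs` times and return the tokens/sec measured for each run.
double[] Bench(Func<string, int> gen, string prompt, int runs)
{
    var perRun = new double[runs];
    for (int i = 0; i < runs; i++)
    {
        var sw = Stopwatch.StartNew();
        int tokens = gen(prompt);
        sw.Stop();
        perRun[i] = tokens / sw.Elapsed.TotalSeconds;
    }
    return perRun;
}

double[] results = Bench(generate, "Explain mixture-of-experts routing.", 5);
Console.WriteLine($"mean: {results.Average():F1} tok/s over {results.Length} runs");
```

Averaging per-run rates weights every run equally; dividing total tokens by total time is an equally valid choice and gives more weight to longer runs.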

At startup the demo enumerates GpuDeviceInfo.Devices and sets Configuration.FavorDistributedInference = (gpuCount > 1) so multi-GPU hosts split automatically.

✨ Key Features

  • LM.TensorOverride.Cpu(pattern) / Gpu(pattern, gpuIndex): regex-based placement.
  • Configuration.FavorDistributedInference: split across visible GPUs.
  • LM.DeviceConfiguration.AutoFitToVram = true: retry-on-OOM behavior.
  • Reports LayerCount, GpuLayerCount, ParameterCount, TokenGenerationRate.
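
To make regex-based placement concrete, here is a small self-contained sketch that resolves tensor names against an ordered rule list. It does not call LM-Kit: the `rules` tuple list and the `Resolve` helper are hypothetical, the tensor names are illustrative examples in the llama.cpp GGUF naming style, and "first matching rule wins" is an assumption rather than documented behavior.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical rule list (pattern, target). It mirrors the idea behind
// TensorOverride.Cpu(pattern) / Gpu(pattern, gpuIndex), not LM-Kit's actual types.
var rules = new List<(Regex Pattern, string Target)>
{
    (new Regex(@"\.ffn_.*_exps\."), "CPU"),  // expert weights -> system RAM
    (new Regex(@"\.attn_"), "GPU0"),         // attention tensors -> first GPU
};

// Assumption: the first matching rule wins; unmatched tensors use the fallback device.
string Resolve(string tensorName, string fallback = "GPU0")
{
    foreach (var (pattern, target) in rules)
        if (pattern.IsMatch(tensorName))
            return target;
    return fallback;
}

// Illustrative tensor names; real names depend on the model architecture.
string[] tensors =
{
    "blk.0.attn_q.weight",
    "blk.0.ffn_gate_inp.weight",
    "blk.0.ffn_gate_exps.weight",
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
    "output.weight",
};

foreach (var name in tensors)
    Console.WriteLine($"{name,-28} -> {Resolve(name)}");
```

Only the three `_exps` tensors resolve to CPU in this sketch; the router tensor (`ffn_gate_inp`) and everything else stay on the GPU.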

🧠 Models (all current MoE)

  • gptoss:20b — GPT-OSS 20B MoE
  • qwen3.6:35b-a3b — Qwen 3.6 35B-A3B MoE (newest)
  • qwen3.5:35b-a3b — Qwen 3.5 35B-A3B MoE
  • gemma4:26b-a4b — Gemma 4 26B-A4B MoE
  • glm4.7-flash — GLM 4.7 Flash MoE

🛠️ Getting Started

📋 Prerequisites

  • .NET 8.0 or later
  • At least one GPU
  • Enough system RAM to hold the expert tensors (varies by model)
  • Free disk space for the model file (10-25 GB)

▶️ Running the Application

git clone https://github.com/LM-Kit/lm-kit-net-samples
cd lm-kit-net-samples/console_net/local-inference/multi-gpu/moe_expert_cpu_offload
dotnet run

Pick a mode from the menu.

🔧 Troubleshooting

  • OOM during load: the experts pattern probably did not match any tensors, so the expert weights stayed on the GPU. Check the model's tensor names; for some architectures the regex must be relaxed.
  • Throughput is low: Configuration.FavorDistributedInference defaults to false, and the demo enables it only when more than one GPU is visible. Set it explicitly if you adapt the code for a multi-GPU host.
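
One way to check whether an experts pattern is too strict is to count its matches against the model's tensor names before loading. The sketch below assumes you already have the names as strings; the list is illustrative, and `CountMatches` is a hypothetical helper, not part of LM-Kit.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Illustrative names only; real names depend on the model architecture.
string[] names =
{
    "blk.0.ffn_gate_exps.weight",
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
    "blk.0.ffn_gate_exps.bias",
    "blk.0.attn_q.weight",
};

// Count how many tensor names a pattern would capture.
int CountMatches(string pattern) => names.Count(n => Regex.IsMatch(n, pattern));

string strict = @"\.ffn_(gate|up|down)_exps\.weight$"; // misses the .bias tensor
string relaxed = @"\.ffn_.*_exps\.";                   // catches every *_exps tensor

Console.WriteLine($"strict:  {CountMatches(strict)} of {names.Length}");  // strict:  3 of 5
Console.WriteLine($"relaxed: {CountMatches(relaxed)} of {names.Length}"); // relaxed: 4 of 5
```

A pattern that matches zero names, or fewer than expected, is a likely cause of the load still running out of VRAM.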