What Happens When a Model Does Not Fit in My GPU Memory?
TL;DR
LM-Kit.NET does not crash. If a model is too large for your GPU, you have several options: partially offload some layers to GPU and keep the rest on CPU, offload MoE expert weights to CPU with tensor overrides, reduce the context size to free memory for model weights, or fall back to CPU-only inference. The MemoryEstimation.FitParameters() API tells you exactly what fits before you attempt to load.
Check Before You Load
The safest approach is to check compatibility before loading the model. The MemoryEstimation API probes your current GPU memory without loading the full model weights:
```csharp
using LMKit.Model;

var fit = MemoryEstimation.FitParameters(
    modelPath: "path/to/gemma3-12b-Q4_K_M.lmk",
    contextSize: 0 // 0 = auto-detect maximum context
);

if (fit.Success)
{
    Console.WriteLine($"Fits! Context: {fit.ContextSize} tokens, GPU layers: {fit.GpuLayerCount}");
}
else
{
    Console.WriteLine("Does not fit with current memory. Try a smaller model or reduce context.");
}
```
The API accounts for memory already consumed by other applications and the operating system, so it gives you a realistic answer for the current state of your machine.
Option 1: Partial GPU Offloading
If the full model does not fit in VRAM, you can offload only a portion of the model layers to the GPU. The remaining layers stay on CPU. This is slower than full GPU offloading, but significantly faster than CPU-only inference:
```csharp
using LMKit.Model;

// Let FitParameters tell you how many layers fit
var fit = MemoryEstimation.FitParameters(
    modelPath: "path/to/gemma3-12b-Q4_K_M.lmk",
    contextSize: 8192
);

// Use the recommended GPU layer count
var loadingOptions = new LMLoadingOptions
{
    GpuLayerCount = fit.GpuLayerCount // e.g., 20 out of 40 total layers
};

// modelUri points to the same model file checked above
using LM model = new LM(modelUri, loadingOptions: loadingOptions);
```
The more layers on GPU, the faster the inference. Even offloading 50% of layers to GPU provides a meaningful speed improvement over pure CPU.
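Putting the check and the load together, here is a minimal sketch that offloads whatever fits and falls back to CPU-only inference when nothing does. It uses only the calls shown in the snippets above; the model path and the `modelUri` declaration are illustrative placeholders for your own model file:

```csharp
using LMKit.Model;

// Hypothetical path; substitute your own model file.
const string modelPath = "path/to/gemma3-12b-Q4_K_M.lmk";
var modelUri = new Uri(Path.GetFullPath(modelPath));

var fit = MemoryEstimation.FitParameters(modelPath, contextSize: 8192);

var loadingOptions = new LMLoadingOptions
{
    // Offload as many layers as fit; 0 forces all layers onto the CPU.
    GpuLayerCount = fit.Success ? fit.GpuLayerCount : 0
};

using LM model = new LM(modelUri, loadingOptions: loadingOptions);
```

The same pattern extends to a retry loop: if `fit.Success` is false at your preferred context size, probe again with a smaller one before giving up.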
Option 2: Reduce Context Size
Memory usage has two components: model weights and the KV cache (which grows with context size). If you reduce the context window, more memory becomes available for model layers:
```csharp
using LMKit.Model;

// Check with a large context
var fitLarge = MemoryEstimation.FitParameters(modelPath, contextSize: 16384);
// GPU layers: 15

// Check with a smaller context
var fitSmall = MemoryEstimation.FitParameters(modelPath, contextSize: 4096);
// GPU layers: 30 (more layers fit because less memory is used by the KV cache)
```
A context of 4,096 tokens is sufficient for most single-turn tasks, short conversations, and structured extraction. Only increase context when your use case requires processing long documents or extended multi-turn dialogues.
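If you want the largest context that still allows full GPU offloading, you can probe a few candidate sizes with `FitParameters` and stop at the first one where every layer fits. This sketch uses only the API shown above; the model path and the total layer count are assumptions you would replace with your model's actual values:

```csharp
using LMKit.Model;

// Hypothetical values; substitute your own model file and its layer count.
const string modelPath = "path/to/gemma3-12b-Q4_K_M.lmk";
const int totalLayers = 40;

// Probe from largest to smallest so the first hit is the best context size.
int[] candidates = { 16384, 8192, 4096, 2048 };

foreach (int contextSize in candidates)
{
    var fit = MemoryEstimation.FitParameters(modelPath, contextSize: contextSize);
    if (fit.Success && fit.GpuLayerCount >= totalLayers)
    {
        Console.WriteLine($"Full GPU offload possible at {contextSize} tokens.");
        break;
    }
}
```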
Option 3: CPU-Only Fallback
If no GPU is available or the model does not fit even partially, LM-Kit.NET falls back to CPU inference automatically. You can also force CPU-only mode:
```csharp
var loadingOptions = new LMLoadingOptions
{
    GpuLayerCount = 0 // Force all layers onto the CPU
};

using LM model = new LM(modelUri, loadingOptions: loadingOptions);
```
CPU inference is slower, but it needs only system RAM: as long as the model fits in RAM, it will run. For models under 3B parameters, CPU performance is often sufficient for interactive use.
Option 4: Offload MoE Expert Weights to CPU (Tensor Overrides)
For Mixture of Experts (MoE) models like GLM 4.7 Flash, tensor overrides offer a smarter alternative to partial layer offloading. MoE models contain dozens of expert subnetworks, but only 2 activate per token. You can offload the idle expert weights to CPU while keeping attention layers and the router on GPU:
```csharp
using LMKit.Model;

var config = new LM.DeviceConfiguration
{
    GpuLayerCount = int.MaxValue,
    TensorOverrides = new List<LM.TensorOverride>
    {
        LM.TensorOverride.Cpu(@"\.ffn_.*_exps\.weight")
    }
};

using LM model = new LM(modelUri, deviceConfiguration: config);
```
This reduces GPU usage from ~17 GB to ~3 GB for GLM 4.7 Flash, making it runnable on GPUs with 6+ GB VRAM. Since only a few experts compute per token, the CPU handles a small fraction of the total work. For a full walkthrough, see Offload MoE Expert Weights to CPU with Tensor Overrides.
Option 5: Use a Smaller Model
If none of the above options meet your performance requirements, switch to a smaller model. The quality difference between adjacent model sizes is often smaller than expected:
| If this does not fit | Try this instead | Quality impact |
|---|---|---|
| gemma3:27b (17.1 GB) | gemma3:12b (7.9 GB) | Moderate. Still strong for most tasks. |
| qwen3.5:27b (18.0 GB) | qwen3.5:9b (7.0 GB) | Small. 9B is the most popular production size. |
| qwen3.5:9b (7.0 GB) | qwen3.5:4b (3.5 GB) | Noticeable on complex reasoning. Good for agents and extraction. |
| qwen3.5:4b (3.5 GB) | qwen3.5:2b (2.0 GB) | Larger drop. Suitable for simple chat and classification. |
Multi-GPU Distribution
For very large models that exceed a single GPU, LM-Kit.NET supports distributing model layers across multiple GPUs. This lets you run models that would not fit on any single GPU in your system. See Distributed Inference Across Multiple GPUs for setup instructions.
📚 Related Content
- Offload MoE Expert Weights to CPU with Tensor Overrides: Full walkthrough for running large MoE models on limited VRAM.
- How do I choose the right model size for my hardware?: Match model sizes to your available memory with the quick decision table.
- Do I need a GPU to run AI models with LM-Kit.NET?: Understand when CPU is enough and when GPU matters.
- Estimating Memory and Context Size: Full walkthrough of the MemoryEstimation API with advanced usage patterns.
- Distributed Inference Across Multiple GPUs: Split large models across multiple GPUs when one is not enough.
- How fast is local inference compared to cloud APIs?: Understand the performance impact of partial offloading vs full GPU.