# Estimating Memory and Context Size
Before loading a model in production, you need to answer two questions: does this model fit on my hardware, and what is the largest context size I can use? LM-Kit.NET provides `MemoryEstimation.FitParameters()` to answer both in a single call, without loading the full model weights.
## TL;DR
```csharp
using LMKit.Hardware;

// Auto-detect the maximum context size and GPU layers that fit
var result = MemoryEstimation.FitParameters("path/to/model.gguf");

if (result.Success)
{
    Console.WriteLine($"Context size: {result.ContextSize} tokens");
    Console.WriteLine($"GPU layers: {result.GpuLayerCount}");
}
else
{
    Console.WriteLine("Model does not fit on this hardware.");
}
```
- Pass `contextSize: 0` (the default) to auto-detect the maximum context size.
- Pass a specific value like `contextSize: 8192` to check whether that size fits.
- The estimation runs without loading the full model, making it fast and safe for pre-flight checks.
## Prerequisites
| Requirement | Details |
|---|---|
| LM-Kit.NET | Installed via NuGet (LM-Kit.NET package) |
| .NET | .NET 8.0 or later (or .NET Standard 2.0 compatible) |
| Completed | Understanding Model Loading and Caching |
## Why Memory Estimation Matters
When you load a model, memory is consumed by three things:
- Model weights: the parameters stored in the GGUF file.
- KV cache: grows with context size. A 32K context uses significantly more memory than a 4K context.
- Compute buffers: scratch space used by the inference engine for tensor operations.
If the total exceeds your available VRAM (or RAM for CPU-only setups), loading fails or performance degrades severely as layers spill to system memory.
`MemoryEstimation.FitParameters()` probes all three components using the same native engine that will run inference. It returns the largest context size and GPU layer count that fit your actual available memory, accounting for other processes that may be using the GPU.
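To build intuition for why the KV cache dominates at large contexts, here is a back-of-envelope calculation. The layer count, KV head count, and head dimension below are illustrative placeholders, not metadata from any particular model; `FitParameters` measures the real figures for you.

```csharp
// Rough KV-cache size: K and V tensors per layer, per KV head, per token.
// All figures below are illustrative placeholders, not real model metadata.
long layers = 48, kvHeads = 8, headDim = 128;
long contextTokens = 32_768, bytesPerElement = 2; // fp16 cache

long kvCacheBytes = 2 /* K and V */ * layers * kvHeads * headDim
                    * contextTokens * bytesPerElement;

Console.WriteLine($"KV cache: {kvCacheBytes / (1024.0 * 1024 * 1024):F1} GiB"); // 6.0 GiB
```

At 4K tokens the same cache would be one eighth of that, which is why halving the context can free gigabytes of VRAM for GPU layers.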
## How It Differs from GetPerformanceScore

You may have seen `DeviceConfiguration.GetPerformanceScore()` in the Choosing the Right Model guide. Here is how the two approaches compare:
| | `GetPerformanceScore` | `MemoryEstimation.FitParameters` |
|---|---|---|
| What it returns | A 0.0 to 1.0 score (rough fit estimate) | Exact context size and GPU layer count |
| How it works | Heuristic based on file size vs. total VRAM | Native memory probing across all devices |
| Accounts for KV cache | No | Yes |
| Accounts for other GPU usage | No | Yes (probes current available memory) |
| Speed | Instant (no file I/O) | Fast (reads model metadata, does not load weights) |
| Best for | Quick filtering of the model catalog | Pre-flight validation before loading |
Use `GetPerformanceScore` to narrow down model candidates. Use `MemoryEstimation.FitParameters` to validate and configure the final choice.
## Step 1: Find the Maximum Context Size

Pass `contextSize: 0` to let the fitter determine the largest context that fits in your available memory.
```csharp
using LMKit.Hardware;

var result = MemoryEstimation.FitParameters(
    "path/to/gemma-3-12b-it-Q4_K_M.gguf",
    contextSize: 0);

if (result.Success)
{
    Console.WriteLine($"Max context size: {result.ContextSize} tokens");
    Console.WriteLine($"GPU layers: {result.GpuLayerCount}");
}
else
{
    Console.WriteLine("This model does not fit on this hardware, even with minimal context.");
}
```
When `contextSize` is `0`, the fitter starts from the model's native context length and reduces it until it fits, stopping at `minimumContextSize` (default: 2048). If even the minimum does not fit, `Success` is `false`.
## Step 2: Check a Specific Context Size
If your application requires a specific context size (for example, 16K tokens for RAG), pass it explicitly.
```csharp
using LMKit.Hardware;

var result = MemoryEstimation.FitParameters(
    "path/to/gemma-3-12b-it-Q4_K_M.gguf",
    contextSize: 16384);

if (result.Success)
{
    Console.WriteLine($"16K context fits with {result.GpuLayerCount} GPU layers.");
}
else
{
    Console.WriteLine("16K context does not fit. Try a smaller model or reduce context size.");
}
```
The fitter will attempt to use exactly the requested context size. If it does not fit, it reduces the context down to `minimumContextSize`. Check `result.ContextSize` to see what actually fits.
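If your application can tolerate a reduced context rather than failing outright, compare the fitted value against your request. A minimal sketch using the same call:

```csharp
using LMKit.Hardware;

const uint requested = 16384;
var result = MemoryEstimation.FitParameters(
    "path/to/gemma-3-12b-it-Q4_K_M.gguf",
    contextSize: requested);

if (result.Success && result.ContextSize < requested)
{
    // The fitter fell back to a smaller context; decide whether that is acceptable.
    Console.WriteLine($"Requested {requested} tokens, fitted {result.ContextSize}.");
}
```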
## Step 3: Use with a Loaded Model

If you already have a loaded `LM` instance, pass it directly. The fitter inherits the model's current GPU and device configuration.
```csharp
using LMKit.Hardware;
using LMKit.Model;

using LM model = LM.LoadFromModelID("gemma3:12b");

var result = MemoryEstimation.FitParameters(model, contextSize: 0);

if (result.Success)
{
    Console.WriteLine($"Max context for loaded model: {result.ContextSize} tokens");
    Console.WriteLine($"GPU layers: {result.GpuLayerCount}");
}
```
This overload uses the model's existing `MainGpu` and `GpuLayerCount` settings, so the result reflects the same device configuration the model is already using.
## Step 4: Test Different Hardware Configurations

You can pass a custom `DeviceConfiguration` to simulate different hardware scenarios without changing the global settings.

### CPU-only estimation
```csharp
using LMKit.Hardware;
using LMKit.Model;

var cpuOnly = new LM.DeviceConfiguration { GpuLayerCount = 0 };

var result = MemoryEstimation.FitParameters(
    "path/to/model.gguf",
    contextSize: 4096,
    deviceConfiguration: cpuOnly);

if (result.Success)
{
    Console.WriteLine($"CPU-only: context {result.ContextSize}, GPU layers {result.GpuLayerCount}");
}
```
### Specific GPU selection
```csharp
using LMKit.Hardware;
using LMKit.Model;

var gpu1 = new LM.DeviceConfiguration { MainGpu = 1 };

var result = MemoryEstimation.FitParameters(
    "path/to/model.gguf",
    contextSize: 0,
    deviceConfiguration: gpu1);
```
## Step 5: Set a Minimum Context Floor

The `minimumContextSize` parameter prevents the fitter from reducing context below a usable threshold. The default is 2048 tokens. Raise it if your application has a hard minimum.
```csharp
using LMKit.Hardware;

// Require at least 8K context. If 8K doesn't fit, report failure.
var result = MemoryEstimation.FitParameters(
    "path/to/model.gguf",
    contextSize: 0,
    minimumContextSize: 8192);

if (!result.Success)
{
    Console.WriteLine("Cannot fit this model with at least 8K context on this hardware.");
}
```
## Complete Example: Pre-Flight Check Before Loading
This example shows a complete workflow: scan the catalog, estimate memory for the best candidate, and load it with the fitted parameters.
```csharp
using LMKit.Global;
using LMKit.Hardware;
using LMKit.Model;

Runtime.Initialize();

// Step 1: Pick a model from the catalog
var card = ModelCard.GetPredefinedModelCardByModelID("gemma3:12b");

// Step 2: Download if needed (without loading)
if (!card.IsLocallyAvailable)
{
    await card.DownloadAsync((path, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rDownloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    });
    Console.WriteLine();
}

// Step 3: Estimate memory before loading
var fit = MemoryEstimation.FitParameters(card.LocalPath, contextSize: 0);

if (!fit.Success)
{
    Console.WriteLine($"Model '{card.ModelID}' does not fit on this hardware.");
    Console.WriteLine("Consider a smaller model such as gemma3:4b.");
    return;
}

Console.WriteLine($"Model: {card.ModelID}");
Console.WriteLine($"Context: {fit.ContextSize} tokens");
Console.WriteLine($"GPU layers: {fit.GpuLayerCount}");

// Step 4: Load the model with the fitted parameters
var deviceConfig = new LM.DeviceConfiguration
{
    GpuLayerCount = fit.GpuLayerCount
};

using LM model = new LM(card, deviceConfiguration: deviceConfig,
    loadingProgress: p =>
    {
        Console.Write($"\rLoading: {p * 100:F0}% ");
        return true;
    });

Console.WriteLine($"\nLoaded: {model.Name}");
Console.WriteLine($"Actual context: {model.ContextLength}");
Console.WriteLine($"Actual GPU layers: {model.GpuLayerCount}");
```
## Understanding `FitResult`
| Property | Type | Description |
|---|---|---|
| `Success` | `bool` | `true` when the model fits on the current hardware with at least the minimum context size. |
| `ContextSize` | `uint` | The context size (in tokens) that fits. May be smaller than the requested size. `0` when `Success` is `false`. |
| `GpuLayerCount` | `int` | The number of model layers offloaded to the GPU. May be smaller than the system default if memory is tight. `0` when `Success` is `false`. |
## API Reference

### `FitParameters` (file path)
```csharp
public static FitResult FitParameters(
    string modelPath,
    uint contextSize = 0,
    uint minimumContextSize = 2048,
    LM.DeviceConfiguration deviceConfiguration = null)
```
| Parameter | Default | Description |
|---|---|---|
| `modelPath` | (required) | Path to a `.gguf` or `.lmk` model file. |
| `contextSize` | `0` | Desired context size in tokens. `0` = auto-detect maximum. |
| `minimumContextSize` | `2048` | Floor below which the fitter reports failure. |
| `deviceConfiguration` | `null` | Custom GPU config. `null` = system default. |
**Exceptions:** `ArgumentNullException` (null path), `FileNotFoundException` (missing file), `InvalidDataException` (unsupported format).
### `FitParameters` (loaded model)
```csharp
public static FitResult FitParameters(
    LM model,
    uint contextSize = 0,
    uint minimumContextSize = 2048)
```
| Parameter | Default | Description |
|---|---|---|
| `model` | (required) | A loaded `LM` instance. Device config is inherited from the model. |
| `contextSize` | `0` | Desired context size in tokens. `0` = auto-detect maximum. |
| `minimumContextSize` | `2048` | Floor below which the fitter reports failure. |
**Exceptions:** `ArgumentNullException` (null model).
## Context Size vs. GPU Layers Trade-Off

Context size and GPU layer count compete for the same memory pool. A larger context means more KV cache, which leaves less room for GPU layers (and vice versa). `FitParameters` finds the best balance automatically, but it helps to understand the trade-off:
| Scenario | Context Size | GPU Layers | Use Case |
|---|---|---|---|
| Maximum context | Auto (0) | May be reduced | Long documents, extended conversations |
| Fixed context | e.g. 8192 | Maximized for that context | Known workload with predictable input size |
| CPU-only | Any | 0 | No GPU available, all layers on CPU |
**Tip:** If you need both a large context and full GPU offloading, the solution is a smaller model or a GPU with more VRAM.
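One way to see the trade-off on your own hardware is to probe several fixed context sizes and compare the GPU layer counts the fitter returns, using the same `FitParameters` call shown earlier:

```csharp
using LMKit.Hardware;

// Probe several fixed context sizes and compare how many GPU layers fit at each.
uint[] candidates = { 4096, 8192, 16384, 32768 };

foreach (uint ctx in candidates)
{
    var fit = MemoryEstimation.FitParameters("path/to/model.gguf", contextSize: ctx);
    Console.WriteLine(fit.Success
        ? $"Requested {ctx}: fitted {fit.ContextSize} tokens, {fit.GpuLayerCount} GPU layers"
        : $"Requested {ctx}: does not fit");
}
```

Since each probe reads only model metadata, looping over a handful of sizes stays fast enough for startup-time diagnostics.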
## Troubleshooting
| Problem | Solution |
|---|---|
| `Success` is always `false` | The model is too large for your hardware. Try a smaller model or reduce `minimumContextSize`. |
| Context size is much smaller than expected | Other applications may be consuming GPU memory. Close GPU-intensive programs and retry. |
| `GpuLayerCount` is `0` | No compatible GPU was detected, or the GPU backend is not enabled. See Configure GPU Backends. |
| `FileNotFoundException` | The model file is not at the specified path. If using a model ID, download it first with `ModelCard.DownloadAsync()`. |
| `InvalidDataException` | The file is not a valid GGUF or LMK archive. Verify the file is not corrupted. |
## Next Steps
- Your First AI Agent: build a working agent with the model you just validated.
- Configure GPU Backends: set up CUDA, Vulkan, or Metal for maximum performance.
- Distributed Inference Across Multiple GPUs: split large models across multiple GPUs when a single card is not enough.
- Optimize Memory with Context Recycling and KV-Cache Configuration: fine-tune memory usage after loading.
- Choosing the Right Model: go back to model selection with better hardware awareness.