# Estimating Memory and Context Size
Before loading a model in production, you need to answer two questions: does this model fit on my hardware, and what is the largest context size I can use? LM-Kit.NET provides `MemoryEstimation.FitParameters()` to answer both in a single call, without loading the full model weights.
## TL;DR
```csharp
using LMKit.Hardware;

// Auto-detect the maximum context size and GPU layers that fit
var result = MemoryEstimation.FitParameters("path/to/model.gguf");

if (result.Success)
{
    Console.WriteLine($"Context size: {result.ContextSize} tokens");
    Console.WriteLine($"GPU layers: {result.GpuLayerCount}");
}
else
{
    Console.WriteLine("Model does not fit on this hardware.");
}
```
- Pass `contextSize: 0` (the default) to auto-detect the maximum context size.
- Pass a specific value like `contextSize: 8192` to check whether that size fits.
- The estimation runs without loading the full model, making it fast and safe for pre-flight checks.
## Prerequisites
| Requirement | Details |
|---|---|
| LM-Kit.NET | Installed via NuGet (LM-Kit.NET package) |
| .NET | .NET 8.0 or later (or .NET Standard 2.0 compatible) |
| Completed | Understanding Model Loading and Caching |
## Why Memory Estimation Matters
When you load a model, memory is consumed by three things:
- Model weights: the parameters stored in the GGUF file.
- KV cache: grows with context size. A 32K context uses significantly more memory than a 4K context.
- Compute buffers: scratch space used by the inference engine for tensor operations.
If the total exceeds your available VRAM (or RAM for CPU-only setups), loading fails or performance degrades severely as layers spill to system memory.
`MemoryEstimation.FitParameters()` probes all three components using the same native engine that will run inference. It returns the largest context size and GPU layer count that fit your actual available memory, accounting for other processes that may be using the GPU.
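To build intuition for why the KV cache dominates at large contexts, here is a back-of-envelope calculation. The layer count, KV head count, and head dimension below are illustrative placeholders, not metadata from any particular model; `FitParameters` measures the real figures for you.

```csharp
// Rough KV-cache size: K and V tensors per layer, per KV head, per token.
// All figures below are illustrative placeholders, not real model metadata.
long layers = 48, kvHeads = 8, headDim = 128;
long contextTokens = 32_768, bytesPerElement = 2; // fp16 cache

long kvCacheBytes = 2 /* K and V */ * layers * kvHeads * headDim
                    * contextTokens * bytesPerElement;

Console.WriteLine($"KV cache: {kvCacheBytes / (1024.0 * 1024 * 1024):F1} GiB"); // 6.0 GiB
```

At 4K tokens the same cache would be one eighth of that, which is why halving the context can free gigabytes of VRAM for GPU layers.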
## How It Differs from GetPerformanceScore

You may have seen `DeviceConfiguration.GetPerformanceScore()` in the Choosing the Right Model guide. Here is how the two approaches compare:
| | `GetPerformanceScore` | `MemoryEstimation.FitParameters` |
|---|---|---|
| What it returns | A 0.0 to 1.0 score (rough fit estimate) | Exact context size and GPU layer count |
| How it works | Heuristic based on file size vs. total VRAM | Native memory probing across all devices |
| Accounts for KV cache | No | Yes |
| Accounts for other GPU usage | No | Yes (probes current available memory) |
| Speed | Instant (no file I/O) | Fast (reads model metadata, does not load weights) |
| Best for | Quick filtering of the model catalog | Pre-flight validation before loading |
Use `GetPerformanceScore` to narrow down model candidates. Use `MemoryEstimation.FitParameters` to validate and configure the final choice.
## Step 1: Find the Maximum Context Size

Pass `contextSize: 0` to let the fitter determine the largest context that fits in your available memory.
```csharp
using LMKit.Hardware;

var result = MemoryEstimation.FitParameters(
    "path/to/gemma-3-12b-it-Q4_K_M.gguf",
    contextSize: 0);

if (result.Success)
{
    Console.WriteLine($"Max context size: {result.ContextSize} tokens");
    Console.WriteLine($"GPU layers: {result.GpuLayerCount}");
}
else
{
    Console.WriteLine("This model does not fit on this hardware, even with minimal context.");
}
```
When `contextSize` is `0`, the fitter starts from the model's native context length and reduces it until it fits, stopping at `minimumContextSize` (default: 2048). If even the minimum does not fit, `Success` is `false`.
## Step 2: Check a Specific Context Size
If your application requires a specific context size (for example, 16K tokens for RAG), pass it explicitly.
```csharp
using LMKit.Hardware;

var result = MemoryEstimation.FitParameters(
    "path/to/gemma-3-12b-it-Q4_K_M.gguf",
    contextSize: 16384);

if (result.Success)
{
    Console.WriteLine($"16K context fits with {result.GpuLayerCount} GPU layers.");
}
else
{
    Console.WriteLine("16K context does not fit. Try a smaller model or reduce context size.");
}
```
The fitter will attempt to use exactly the requested context size. If it does not fit, it reduces the context down to `minimumContextSize`. Check `result.ContextSize` to see what actually fits.
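If your application can tolerate a reduced context rather than failing outright, compare the fitted value against your request. A minimal sketch using the same call:

```csharp
using LMKit.Hardware;

const uint requested = 16384;
var result = MemoryEstimation.FitParameters(
    "path/to/gemma-3-12b-it-Q4_K_M.gguf",
    contextSize: requested);

if (result.Success && result.ContextSize < requested)
{
    // The fitter fell back to a smaller context; decide whether that is acceptable.
    Console.WriteLine($"Requested {requested} tokens, fitted {result.ContextSize}.");
}
```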
## Step 3: Use with a Loaded Model

If you already have a loaded `LM` instance, pass it directly. The fitter inherits the model's current GPU and device configuration.
```csharp
using LMKit.Hardware;
using LMKit.Model;

using LM model = LM.LoadFromModelID("gemma3:12b");

var result = MemoryEstimation.FitParameters(model, contextSize: 0);

if (result.Success)
{
    Console.WriteLine($"Max context for loaded model: {result.ContextSize} tokens");
    Console.WriteLine($"GPU layers: {result.GpuLayerCount}");
}
```
This overload uses the model's existing `MainGpu` and `GpuLayerCount` settings, so the result reflects the same device configuration the model is already using.
## Step 4: Test Different Hardware Configurations

You can pass a custom `DeviceConfiguration` to simulate different hardware scenarios without changing the global settings.

### CPU-only estimation
```csharp
using LMKit.Hardware;
using LMKit.Model;

var cpuOnly = new LM.DeviceConfiguration { GpuLayerCount = 0 };

var result = MemoryEstimation.FitParameters(
    "path/to/model.gguf",
    contextSize: 4096,
    deviceConfiguration: cpuOnly);

if (result.Success)
{
    Console.WriteLine($"CPU-only: context {result.ContextSize}, GPU layers {result.GpuLayerCount}");
}
```
### Specific GPU selection
```csharp
using LMKit.Hardware;
using LMKit.Model;

var gpu1 = new LM.DeviceConfiguration { MainGpu = 1 };

var result = MemoryEstimation.FitParameters(
    "path/to/model.gguf",
    contextSize: 0,
    deviceConfiguration: gpu1);
```
## Step 5: Set a Minimum Context Floor

The `minimumContextSize` parameter prevents the fitter from reducing context below a usable threshold. The default is 2048 tokens. Raise it if your application has a hard minimum.
```csharp
using LMKit.Hardware;

// Require at least 8K context. If 8K doesn't fit, report failure.
var result = MemoryEstimation.FitParameters(
    "path/to/model.gguf",
    contextSize: 0,
    minimumContextSize: 8192);

if (!result.Success)
{
    Console.WriteLine("Cannot fit this model with at least 8K context on this hardware.");
}
```
## Complete Example: Pre-Flight Check Before Loading
This example shows a complete workflow: scan the catalog, estimate memory for the best candidate, and load it with the fitted parameters.
```csharp
using LMKit.Global;
using LMKit.Hardware;
using LMKit.Model;

Runtime.Initialize();

// Step 1: Pick a model from the catalog
var card = ModelCard.GetPredefinedModelCardByModelID("gemma3:12b");

// Step 2: Download if needed (without loading)
if (!card.IsLocallyAvailable)
{
    await card.DownloadAsync((path, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rDownloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    });
    Console.WriteLine();
}

// Step 3: Estimate memory before loading
var fit = MemoryEstimation.FitParameters(card.LocalPath, contextSize: 0);

if (!fit.Success)
{
    Console.WriteLine($"Model '{card.ModelID}' does not fit on this hardware.");
    Console.WriteLine("Consider a smaller model such as gemma3:4b.");
    return;
}

Console.WriteLine($"Model: {card.ModelID}");
Console.WriteLine($"Context: {fit.ContextSize} tokens");
Console.WriteLine($"GPU layers: {fit.GpuLayerCount}");

// Step 4: Load the model with the fitted parameters
var deviceConfig = new LM.DeviceConfiguration
{
    GpuLayerCount = fit.GpuLayerCount
};

using LM model = new LM(card, deviceConfiguration: deviceConfig,
    loadingProgress: p =>
    {
        Console.Write($"\rLoading: {p * 100:F0}% ");
        return true;
    });

Console.WriteLine($"\nLoaded: {model.Name}");
Console.WriteLine($"Actual context: {model.ContextLength}");
Console.WriteLine($"Actual GPU layers: {model.GpuLayerCount}");
```
## Understanding `FitResult`
| Property | Type | Description |
|---|---|---|
| `Success` | `bool` | `true` when the model fits on the current hardware with at least the minimum context size. |
| `ContextSize` | `uint` | The context size (in tokens) that fits. May be smaller than the requested size. `0` when `Success` is `false`. |
| `GpuLayerCount` | `int` | The number of model layers offloaded to the GPU. May be smaller than the system default if memory is tight. `0` when `Success` is `false`. |
## API Reference

### `FitParameters` (file path)
```csharp
public static FitResult FitParameters(
    string modelPath,
    uint contextSize = 0,
    uint minimumContextSize = 2048,
    LM.DeviceConfiguration deviceConfiguration = null)
```
| Parameter | Default | Description |
|---|---|---|
| `modelPath` | (required) | Path to a `.gguf` or `.lmk` model file. |
| `contextSize` | `0` | Desired context size in tokens. `0` = auto-detect maximum. |
| `minimumContextSize` | `2048` | Floor below which the fitter reports failure. |
| `deviceConfiguration` | `null` | Custom GPU config. `null` = system default. |
**Exceptions:** `ArgumentNullException` (null path), `FileNotFoundException` (missing file), `InvalidDataException` (unsupported format).
### `FitParameters` (loaded model)
```csharp
public static FitResult FitParameters(
    LM model,
    uint contextSize = 0,
    uint minimumContextSize = 2048)
```
| Parameter | Default | Description |
|---|---|---|
| `model` | (required) | A loaded `LM` instance. Device config is inherited from the model. |
| `contextSize` | `0` | Desired context size in tokens. `0` = auto-detect maximum. |
| `minimumContextSize` | `2048` | Floor below which the fitter reports failure. |
**Exceptions:** `ArgumentNullException` (null model).
## Context Size vs. GPU Layers Trade-Off

Context size and GPU layer count compete for the same memory pool. A larger context means more KV cache, which leaves less room for GPU layers (and vice versa). `FitParameters` finds the best balance automatically, but it helps to understand the trade-off:
| Scenario | Context Size | GPU Layers | Use Case |
|---|---|---|---|
| Maximum context | Auto (0) | May be reduced | Long documents, extended conversations |
| Fixed context | e.g. 8192 | Maximized for that context | Known workload with predictable input size |
| CPU-only | Any | 0 | No GPU available, all layers on CPU |
**Tip:** If you need both a large context and full GPU offloading, the solution is a smaller model or a GPU with more VRAM.
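One way to see the trade-off on your own hardware is to probe several fixed context sizes and compare the GPU layer counts the fitter returns, using the same `FitParameters` call shown earlier:

```csharp
using LMKit.Hardware;

// Probe several fixed context sizes and compare how many GPU layers fit at each.
uint[] candidates = { 4096, 8192, 16384, 32768 };

foreach (uint ctx in candidates)
{
    var fit = MemoryEstimation.FitParameters("path/to/model.gguf", contextSize: ctx);
    Console.WriteLine(fit.Success
        ? $"Requested {ctx}: fitted {fit.ContextSize} tokens, {fit.GpuLayerCount} GPU layers"
        : $"Requested {ctx}: does not fit");
}
```

Since each probe reads only model metadata, looping over a handful of sizes stays fast enough for startup-time diagnostics.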
## Troubleshooting
| Problem | Solution |
|---|---|
| `Success` is always `false` | The model is too large for your hardware. Try a smaller model or reduce `minimumContextSize`. |
| Context size is much smaller than expected | Other applications may be consuming GPU memory. Close GPU-intensive programs and retry. |
| `GpuLayerCount` is `0` | No compatible GPU was detected, or the GPU backend is not enabled. See Configure GPU Backends. |
| `FileNotFoundException` | The model file is not at the specified path. If using a model ID, download it first with `ModelCard.DownloadAsync()`. |
| `InvalidDataException` | The file is not a valid GGUF or LMK archive. Verify the file is not corrupted. |
## Next Steps
- Your First AI Agent: build a working agent with the model you just validated.
- Configure GPU Backends: set up CUDA, Vulkan, or Metal for maximum performance.
- Distributed Inference Across Multiple GPUs: split large models across multiple GPUs when a single card is not enough.
- Optimize Memory with Context Recycling and KV-Cache Configuration: fine-tune memory usage after loading.
- Choosing the Right Model: go back to model selection with better hardware awareness.