Estimating Memory and Context Size

Before loading a model in production, you need to answer two questions: does this model fit on my hardware? and what is the largest context size I can use? LM-Kit.NET provides MemoryEstimation.FitParameters() to answer both in a single call, without loading the full model weights.


TL;DR

using LMKit.Hardware;

// Auto-detect the maximum context size and GPU layers that fit
var result = MemoryEstimation.FitParameters("path/to/model.gguf");

if (result.Success)
{
    Console.WriteLine($"Context size: {result.ContextSize} tokens");
    Console.WriteLine($"GPU layers:   {result.GpuLayerCount}");
}
else
{
    Console.WriteLine("Model does not fit on this hardware.");
}
  • Pass contextSize: 0 (the default) to auto-detect the maximum context size.
  • Pass a specific value like contextSize: 8192 to check whether that size fits.
  • The estimation runs without loading the full model, making it fast and safe for pre-flight checks.

Prerequisites

Requirement      Details
LM-Kit.NET       Installed via NuGet (LM-Kit.NET package)
.NET             .NET 8.0 or later (or .NET Standard 2.0 compatible)
Completed guide  Understanding Model Loading and Caching

Why Memory Estimation Matters

When you load a model, memory is consumed by three things:

  1. Model weights: the parameters stored in the GGUF file.
  2. KV cache: grows with context size. A 32K context uses significantly more memory than a 4K context.
  3. Compute buffers: scratch space used by the inference engine for tensor operations.

If the total exceeds your available VRAM (or RAM for CPU-only setups), loading fails or performance degrades severely as layers spill to system memory.
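To get an intuition for item 2, the KV cache can be approximated from the model's shape: two tensors (keys and values) per layer, each growing linearly with context length. The layer count and head dimensions below are illustrative placeholders, not values read from any real model file:

```csharp
using System;

// Rough KV-cache estimate: 2 tensors (K and V) per layer, each holding
// contextSize rows of kvHeadCount * headDim elements at bytesPerElement each.
static long KvCacheBytes(long contextSize, int layerCount,
                         int kvHeadCount, int headDim, int bytesPerElement = 2)
    => 2L * layerCount * contextSize * kvHeadCount * headDim * bytesPerElement;

// Hypothetical 12B-class model: 48 layers, 8 KV heads of dimension 128, FP16 cache.
long at4K  = KvCacheBytes(4_096, 48, 8, 128);
long at32K = KvCacheBytes(32_768, 48, 8, 128);

Console.WriteLine($"4K context:  {at4K / 1e9:F1} GB");   // roughly 0.8 GB
Console.WriteLine($"32K context: {at32K / 1e9:F1} GB");  // roughly 6.4 GB
```

The growth is linear, so an 8x larger context needs 8x the cache. This is why a pre-flight check has to probe the KV cache rather than just compare the file size against VRAM.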

MemoryEstimation.FitParameters() probes all three components using the same native engine that will run inference. It returns the largest context size and GPU layer count that fit your actual available memory, accounting for other processes that may be using the GPU.


How It Differs from GetPerformanceScore

You may have seen DeviceConfiguration.GetPerformanceScore() in the Choosing the Right Model guide. Here is how the two approaches compare:

Aspect                        GetPerformanceScore                      MemoryEstimation.FitParameters
What it returns               A 0.0 to 1.0 score (rough fit estimate)  Exact context size and GPU layer count
How it works                  Heuristic: file size vs. total VRAM      Native memory probing across all devices
Accounts for KV cache         No                                       Yes
Accounts for other GPU usage  No                                       Yes (probes current available memory)
Speed                         Instant (no file I/O)                    Fast (reads metadata, does not load weights)
Best for                      Quick filtering of the model catalog     Pre-flight validation before loading

Use GetPerformanceScore to narrow down model candidates. Use MemoryEstimation.FitParameters to validate and configure the final choice.


Step 1: Find the Maximum Context Size

Pass contextSize: 0 to let the fitter determine the largest context that fits in your available memory.

using LMKit.Hardware;

var result = MemoryEstimation.FitParameters(
    "path/to/gemma-3-12b-it-Q4_K_M.gguf",
    contextSize: 0);

if (result.Success)
{
    Console.WriteLine($"Max context size: {result.ContextSize} tokens");
    Console.WriteLine($"GPU layers:       {result.GpuLayerCount}");
}
else
{
    Console.WriteLine("This model does not fit on this hardware, even with minimal context.");
}

When contextSize is 0, the fitter starts from the model's native context length and reduces it until it fits, stopping at minimumContextSize (default: 2048). If even the minimum does not fit, Success is false.


Step 2: Check a Specific Context Size

If your application requires a specific context size (for example, 16K tokens for RAG), pass it explicitly.

using LMKit.Hardware;

var result = MemoryEstimation.FitParameters(
    "path/to/gemma-3-12b-it-Q4_K_M.gguf",
    contextSize: 16384);

if (result.Success)
{
    Console.WriteLine($"16K context fits with {result.GpuLayerCount} GPU layers.");
}
else
{
    Console.WriteLine("16K context does not fit. Try a smaller model or reduce context size.");
}

The fitter first tries the exact requested context size. If that does not fit, it reduces the context until it does, stopping at minimumContextSize. Check result.ContextSize to see what was actually fitted.
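Because the fitter can fall back silently, comparing the returned size against the request makes the fallback explicit. A sketch using the same call as above (the model path is a placeholder):

```csharp
using System;
using LMKit.Hardware;

const uint requested = 16384;
var result = MemoryEstimation.FitParameters("path/to/model.gguf", contextSize: requested);

if (!result.Success)
{
    Console.WriteLine("Does not fit even at the minimum context size.");
}
else if (result.ContextSize < requested)
{
    // The fitter fell back below the requested size; decide whether the
    // reduced context is still acceptable for your workload.
    Console.WriteLine($"Requested {requested} tokens, fitted {result.ContextSize}.");
}
else
{
    Console.WriteLine($"16K context fits with {result.GpuLayerCount} GPU layers.");
}
```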


Step 3: Use with a Loaded Model

If you already have a loaded LM instance, pass it directly. The fitter inherits the model's current GPU and device configuration.

using LMKit.Hardware;
using LMKit.Model;

using LM model = LM.LoadFromModelID("gemma3:12b");

var result = MemoryEstimation.FitParameters(model, contextSize: 0);

if (result.Success)
{
    Console.WriteLine($"Max context for loaded model: {result.ContextSize} tokens");
    Console.WriteLine($"GPU layers: {result.GpuLayerCount}");
}

This overload uses the model's existing MainGpu and GpuLayerCount settings, so the result reflects the same device configuration the model is already using.


Step 4: Test Different Hardware Configurations

You can pass a custom DeviceConfiguration to simulate different hardware scenarios without changing the global settings.

CPU-only estimation

using LMKit.Hardware;
using LMKit.Model;

var cpuOnly = new LM.DeviceConfiguration { GpuLayerCount = 0 };

var result = MemoryEstimation.FitParameters(
    "path/to/model.gguf",
    contextSize: 4096,
    deviceConfiguration: cpuOnly);

if (result.Success)
{
    Console.WriteLine($"CPU-only: context {result.ContextSize}, GPU layers {result.GpuLayerCount}");
}

Specific GPU selection

using LMKit.Hardware;
using LMKit.Model;

var gpu1 = new LM.DeviceConfiguration { MainGpu = 1 };

var result = MemoryEstimation.FitParameters(
    "path/to/model.gguf",
    contextSize: 0,
    deviceConfiguration: gpu1);

Step 5: Set a Minimum Context Floor

The minimumContextSize parameter prevents the fitter from reducing context below a usable threshold. The default is 2048 tokens. Raise it if your application has a hard minimum.

using LMKit.Hardware;

// Require at least 8K context. If 8K doesn't fit, report failure.
var result = MemoryEstimation.FitParameters(
    "path/to/model.gguf",
    contextSize: 0,
    minimumContextSize: 8192);

if (!result.Success)
{
    Console.WriteLine("Cannot fit this model with at least 8K context on this hardware.");
}

Complete Example: Pre-Flight Check Before Loading

This example shows a complete workflow: scan the catalog, estimate memory for the best candidate, and load it with the fitted parameters.

using LMKit.Global;
using LMKit.Hardware;
using LMKit.Model;

Runtime.Initialize();

// Step 1: Pick a model from the catalog
var card = ModelCard.GetPredefinedModelCardByModelID("gemma3:12b");

// Step 2: Download if needed (without loading)
if (!card.IsLocallyAvailable)
{
    await card.DownloadAsync((path, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rDownloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    });
    Console.WriteLine();
}

// Step 3: Estimate memory before loading
var fit = MemoryEstimation.FitParameters(card.LocalPath, contextSize: 0);

if (!fit.Success)
{
    Console.WriteLine($"Model '{card.ModelID}' does not fit on this hardware.");
    Console.WriteLine("Consider a smaller model such as gemma3:4b.");
    return;
}

Console.WriteLine($"Model:       {card.ModelID}");
Console.WriteLine($"Context:     {fit.ContextSize} tokens");
Console.WriteLine($"GPU layers:  {fit.GpuLayerCount}");

// Step 4: Load the model with the fitted parameters
var deviceConfig = new LM.DeviceConfiguration
{
    GpuLayerCount = fit.GpuLayerCount
};

using LM model = new LM(card, deviceConfiguration: deviceConfig,
    loadingProgress: p =>
    {
        Console.Write($"\rLoading: {p * 100:F0}%   ");
        return true;
    });

Console.WriteLine($"\nLoaded: {model.Name}");
Console.WriteLine($"Actual context: {model.ContextLength}");
Console.WriteLine($"Actual GPU layers: {model.GpuLayerCount}");

Understanding FitResult

Property       Type  Description
Success        bool  true when the model fits on the current hardware with at least the minimum context size.
ContextSize    uint  The context size (in tokens) that fits. May be smaller than the requested size; 0 when Success is false.
GpuLayerCount  int   The number of model layers offloaded to the GPU. May be smaller than the system default when memory is tight; 0 when Success is false.

API Reference

FitParameters (file path)

public static FitResult FitParameters(
    string modelPath,
    uint contextSize = 0,
    uint minimumContextSize = 2048,
    LM.DeviceConfiguration deviceConfiguration = null)
Parameter            Default     Description
modelPath            (required)  Path to a .gguf or .lmk model file.
contextSize          0           Desired context size in tokens; 0 = auto-detect maximum.
minimumContextSize   2048        Floor below which the fitter reports failure.
deviceConfiguration  null        Custom GPU configuration; null = system default.

Exceptions: ArgumentNullException (null path), FileNotFoundException (missing file), InvalidDataException (unsupported format).

FitParameters (loaded model)

public static FitResult FitParameters(
    LM model,
    uint contextSize = 0,
    uint minimumContextSize = 2048)
Parameter           Default     Description
model               (required)  A loaded LM instance; device config is inherited from the model.
contextSize         0           Desired context size in tokens; 0 = auto-detect maximum.
minimumContextSize  2048        Floor below which the fitter reports failure.

Exceptions: ArgumentNullException (null model).


Context Size vs. GPU Layers Trade-Off

Context size and GPU layer count compete for the same memory pool. A larger context means more KV cache, which leaves less room for GPU layers (and vice versa). FitParameters finds the best balance automatically, but it helps to understand the trade-off:

Scenario         Context Size  GPU Layers                  Use Case
Maximum context  Auto (0)      May be reduced              Long documents, extended conversations
Fixed context    e.g. 8192     Maximized for that context  Known workload with predictable input size
CPU-only         Any           0                           No GPU available; all layers on the CPU

Tip: If you need both a large context and full GPU offloading, the solution is a smaller model or a GPU with more VRAM.
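One way to see the trade-off concretely is to fit the same file twice, once in auto mode and once with a fixed context, and compare the layer counts. A sketch using the documented file-path overload (the model path is a placeholder):

```csharp
using System;
using LMKit.Hardware;

// Auto mode maximizes context; a fixed 8K context frees memory for layers.
var autoFit = MemoryEstimation.FitParameters("path/to/model.gguf", contextSize: 0);
var fit8K   = MemoryEstimation.FitParameters("path/to/model.gguf", contextSize: 8192);

if (autoFit.Success && fit8K.Success)
{
    Console.WriteLine($"Auto: {autoFit.ContextSize,6} tokens, {autoFit.GpuLayerCount} GPU layers");
    Console.WriteLine($"8K:   {fit8K.ContextSize,6} tokens, {fit8K.GpuLayerCount} GPU layers");
}
```

On memory-constrained hardware you would typically see the fixed 8K call report at least as many GPU layers as the auto call, since the smaller KV cache leaves more room for weights.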


Troubleshooting

  • Success is always false: the model is too large for your hardware. Try a smaller model or reduce minimumContextSize.
  • Context size is much smaller than expected: other applications may be consuming GPU memory. Close GPU-intensive programs and retry.
  • GpuLayerCount is 0: no compatible GPU was detected, or the GPU backend is not enabled. See Configure GPU Backends.
  • FileNotFoundException: the model file is not at the specified path. If using a model ID, download it first with ModelCard.DownloadAsync().
  • InvalidDataException: the file is not a valid GGUF or LMK archive. Verify the file is not corrupted.
