Understanding Model Loading and Caching

Every LM-Kit.NET feature starts with a loaded model. This guide explains the three ways to load a model, how automatic downloading and caching work, which properties are available after loading, and how to tune memory behavior for your deployment scenario.


TL;DR

using LMKit.Model;

// Option 1: Load by model ID (auto-downloads and caches)
using LM model = LM.LoadFromModelID("gemma3:4b");

// Option 2: Load from a HuggingFace URI
using LM model = new LM(new Uri("https://huggingface.co/lm-kit/gemma-3-4b-instruct-lmk/resolve/main/gemma-3-4b-it-Q4_K_M.lmk"));

// Option 3: Load from a local file
using LM model = new LM(@"C:\models\gemma-3-4b-it-Q4_K_M.lmk");

  • Models are cached automatically after the first download. Subsequent loads are instant.
  • Use ModelCard.GetPredefinedModelCards() to browse the full catalog in code.
  • Both download and loading accept progress callbacks that return false to cancel.

Prerequisites

Requirement    Details
LM-Kit.NET     Installed via NuGet (LM-Kit.NET package)
.NET           .NET 8.0 or later (or .NET Standard 2.0 compatible)
Disk space     Enough free space for the model file (typically 2 to 18 GB depending on the model)
Internet       Required only for the first download when using a model ID or remote URI

Three Ways to Load a Model

LM-Kit.NET provides three approaches for loading a model. Each returns an LM instance that you pass to inference classes such as MultiTurnConversation, Agent, or TextExtraction.

1. Load by Model ID

The simplest option. Pass a short identifier such as "gemma3:4b" and LM-Kit.NET resolves it to the correct HuggingFace URI, downloads the file if needed, and loads it.

using LMKit.Model;

using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (path, contentLength, bytesRead) =>
    {
        if (contentLength.HasValue)
        {
            double pct = (double)bytesRead / contentLength.Value * 100;
            Console.Write($"\rDownloading: {pct:F1}%   ");
        }
        return true; // return false to cancel
    },
    loadingProgress: progress =>
    {
        Console.Write($"\rLoading: {progress * 100:F0}%   ");
        return true; // return false to cancel
    });

Console.WriteLine($"\nModel loaded: {model.Name}");

Common model IDs include: gemma3:1b, gemma3:4b, gemma3:12b, qwen3:4b, qwen3:8b, phi4-mini:3.8b, phi4:14.7b, and llama3.1:8b.

2. Load from a URI

Use this approach when you need a specific model file from HuggingFace or another host. LM-Kit.NET downloads and caches the file automatically.

using LMKit.Model;

var uri = new Uri("https://huggingface.co/lm-kit/gemma-3-4b-instruct-lmk/resolve/main/gemma-3-4b-it-Q4_K_M.lmk");

using LM model = new LM(uri,
    downloadingProgress: (path, contentLength, bytesRead) =>
    {
        if (contentLength.HasValue)
        {
            double pct = (double)bytesRead / contentLength.Value * 100;
            Console.Write($"\rDownloading: {pct:F1}%   ");
        }
        return true;
    },
    loadingProgress: progress =>
    {
        Console.Write($"\rLoading: {progress * 100:F0}%   ");
        return true;
    });

Console.WriteLine($"\nModel loaded: {model.Name}");

3. Load from a Local File Path

If you have already downloaded a model file (GGUF, or LM-Kit's .lmk format), pass its path directly. No network access is required.

using LMKit.Model;

using LM model = new LM(@"C:\models\gemma-3-4b-it-Q4_K_M.lmk",
    loadingProgress: progress =>
    {
        Console.Write($"\rLoading: {progress * 100:F0}%   ");
        return true;
    });

Console.WriteLine($"Model loaded: {model.Name}");

Progress Callbacks

Both download and load operations accept optional progress callbacks. They follow the same pattern: return true to continue, or false to cancel the operation.

Callback              Signature                                                  When It Fires
downloadingProgress   bool (string path, long? contentLength, long bytesRead)    Periodically during file download
loadingProgress       bool (float progress)                                      Periodically while the model tensors are loaded into memory (0.0 to 1.0)

The contentLength parameter in the downloading callback may be null if the server does not provide a Content-Length header.
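
As a sketch of cancellation, the downloading callback below aborts when the server-reported file size exceeds a budget. The 10 GB limit and the gemma3:12b ID are illustrative, and exactly how LM-Kit.NET surfaces the cancellation to the caller (for example, which exception is thrown) is not shown here:

```csharp
using LMKit.Model;

// Illustrative budget: refuse downloads larger than 10 GB.
const long maxDownloadBytes = 10L * 1024 * 1024 * 1024;

using LM model = LM.LoadFromModelID("gemma3:12b",
    downloadingProgress: (path, contentLength, bytesRead) =>
    {
        // Returning false cancels the download.
        if (contentLength.HasValue && contentLength.Value > maxDownloadBytes)
            return false;

        return true;
    });
```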


Download and Caching Behavior

When you load a model by ID or URI, LM-Kit.NET checks the local cache first. If the file is not cached, it downloads the model and stores it for future use.

  • Cache location: The default cache directory is managed internally by LM-Kit.NET.
  • Subsequent loads: If the file already exists in the cache, no download occurs and loading starts immediately.
  • Custom storage path: Pass a storagePath argument to control where the downloaded file is saved.

using LMKit.Model;

// Store the model in a custom directory
using LM model = LM.LoadFromModelID("qwen3:4b",
    storagePath: @"D:\my-models");

Model Properties After Loading

Once a model is loaded, you can inspect its capabilities and configuration through properties on the LM instance.

using LMKit.Model;

using LM model = LM.LoadFromModelID("gemma3:12b");

Console.WriteLine($"Name:            {model.Name}");
Console.WriteLine($"Context length:  {model.ContextLength}");
Console.WriteLine($"GPU layers:      {model.GpuLayerCount}");
Console.WriteLine($"Has text gen:    {model.HasTextGeneration}");
Console.WriteLine($"Has vision:      {model.HasVision}");
Console.WriteLine($"Has tool calls:  {model.HasToolCalls}");

Property            Type     Description
Name                string   The name embedded in the model metadata.
ContextLength       int      Maximum number of tokens the model can process in a single inference.
GpuLayerCount       int      Number of layers currently offloaded to the GPU.
HasTextGeneration   bool     Whether the model supports text generation tasks.
HasVision           bool     Whether the model supports image input (vision language model).
HasToolCalls        bool     Whether the model supports native tool/function calling.
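
The capability flags are useful for gating features at runtime instead of hard-coding model names. A minimal sketch:

```csharp
using LMKit.Model;

using LM model = LM.LoadFromModelID("gemma3:4b");

// Branch on capability flags rather than on the model ID.
if (model.HasVision)
    Console.WriteLine("Image input is supported.");
else
    Console.WriteLine("Text-only model; skip vision features.");

if (!model.HasToolCalls)
    Console.WriteLine("No native tool calling; fall back to prompt-based tools.");
```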

The Model Catalog

LM-Kit.NET ships with a curated catalog of validated models accessible through ModelCard.GetPredefinedModelCards(). Each ModelCard contains the download URI, file size, quantization precision, and capability flags.

using LMKit.Model;

var catalog = ModelCard.GetPredefinedModelCards();

Console.WriteLine($"Available models: {catalog.Count}\n");

foreach (var card in catalog)
{
    Console.WriteLine($"  {card.ModelID,-25} Size: {card.FileSize / (1024.0 * 1024.0 * 1024.0):F1} GB");
}

You can load a model directly from a ModelCard:

using LMKit.Model;

var catalog = ModelCard.GetPredefinedModelCards();
var card = catalog.First(c => c.ModelID == "gemma3:4b");

using LM model = new LM(card);

Evaluating Hardware Compatibility

Before loading a large model, check whether your hardware can run it efficiently. The DeviceConfiguration.GetPerformanceScore() method returns a value between 0 and 1 based on your GPU memory relative to the model size.

using LMKit.Hardware;
using LMKit.Model;

var catalog = ModelCard.GetPredefinedModelCards();

foreach (var card in catalog)
{
    float score = DeviceConfiguration.GetPerformanceScore(card);
    string rating = score >= 0.9f ? "Excellent"
                  : score >= 0.5f ? "Acceptable"
                  : "May be slow";

    Console.WriteLine($"  {card.ModelID,-25} Score: {score:F2}  ({rating})");
}

Score Range   Meaning
0.9 to 1.0    The model fits comfortably in VRAM. Expect fast inference.
0.5 to 0.9    The model can run, but some layers may spill to system memory.
Below 0.5     Consider a smaller model or a more powerful GPU.
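
Combining the score with the catalog, you can restrict choices to models that fit entirely in VRAM. A sketch (the 0.9 threshold follows the table above):

```csharp
using System.Linq;
using LMKit.Hardware;
using LMKit.Model;

// Keep only catalog entries this machine can run comfortably.
var comfortable = ModelCard.GetPredefinedModelCards()
    .Where(card => DeviceConfiguration.GetPerformanceScore(card) >= 0.9f)
    .ToList();

foreach (var card in comfortable)
    Console.WriteLine(card.ModelID);
```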

Memory and Cache Configuration

LM-Kit.NET provides global settings that control how models and inference state are cached in memory.

using LMKit.Global;

// Enable model caching so reloading the same model is instant (default: true)
Configuration.EnableModelCache = true;

// Enable KV cache recycling to reuse attention caches across requests (default: true)
Configuration.EnableKVCacheRecycling = true;

Setting                                Default   Description
Configuration.EnableModelCache         true      Keeps model weights in memory after the LM instance is disposed, speeding up subsequent loads of the same model.
Configuration.EnableKVCacheRecycling   true      Reuses key-value attention caches across inference calls, reducing memory allocations and improving throughput.

Tip: For server or batch scenarios where the same model handles many requests, keep both settings enabled. For memory-constrained environments running different models sequentially, consider disabling EnableModelCache to free memory sooner.
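
For example, a memory-constrained pipeline that runs several different models one after another might disable the model cache so each model's weights are released on dispose (the model IDs here are illustrative):

```csharp
using LMKit.Global;
using LMKit.Model;

// Free weights as soon as each LM is disposed.
Configuration.EnableModelCache = false;

foreach (string id in new[] { "gemma3:1b", "qwen3:4b" })
{
    using LM model = LM.LoadFromModelID(id);
    Console.WriteLine($"Running with {model.Name}...");
    // ... inference work for this model ...
}   // weights released at the end of each iteration because caching is off
```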


Putting It All Together

The following example demonstrates a complete model selection and loading workflow:

using LMKit.Global;
using LMKit.Hardware;
using LMKit.Model;

// Configure runtime
Runtime.EnableCuda = true;
Configuration.EnableKVCacheRecycling = true;
Runtime.Initialize();

// Find the best model for this hardware
var catalog = ModelCard.GetPredefinedModelCards();

ModelCard bestCard = null;
float bestScore = 0;

foreach (var card in catalog)
{
    if (!card.ModelID.Contains("embedding") && !card.ModelID.Contains("whisper"))
    {
        float score = DeviceConfiguration.GetPerformanceScore(card);
        if (score > bestScore)
        {
            bestScore = score;
            bestCard = card;
        }
    }
}

if (bestCard is null)
{
    Console.WriteLine("No compatible model found for this hardware.");
    return;
}

Console.WriteLine($"Selected model: {bestCard.ModelID} (score: {bestScore:F2})");

// Load with progress
using LM model = new LM(bestCard,
    downloadingProgress: (path, len, read) =>
    {
        if (len.HasValue)
            Console.Write($"\rDownloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p =>
    {
        Console.Write($"\rLoading: {p * 100:F0}%   ");
        return true;
    });

Console.WriteLine($"\nLoaded: {model.Name}");
Console.WriteLine($"Context:    {model.ContextLength} tokens");
Console.WriteLine($"GPU layers: {model.GpuLayerCount}");
Console.WriteLine($"Vision:     {model.HasVision}");
Console.WriteLine($"Tools:      {model.HasToolCalls}");
