# Understanding Model Loading and Caching
Every LM-Kit.NET feature starts with a loaded model. This guide explains the three ways to load a model, how automatic downloading and caching work, which properties are available after loading, and how to tune memory behavior for your deployment scenario.
## TL;DR
```csharp
using LMKit.Model;

// Option 1: Load by model ID (auto-downloads and caches)
using LM model = LM.LoadFromModelID("gemma3:4b");

// Option 2: Load from a HuggingFace URI
using LM model = new LM(new Uri("https://huggingface.co/lm-kit/gemma-3-4b-instruct-lmk/resolve/main/gemma-3-4b-it-Q4_K_M.lmk"));

// Option 3: Load from a local file
using LM model = new LM(@"C:\models\gemma-3-4b-it-Q4_K_M.lmk");
```
- Models are cached automatically after the first download, so subsequent loads are instant.
- Use `ModelCard.GetPredefinedModelCards()` to browse the full catalog in code.
- Both download and loading accept progress callbacks that return `false` to cancel.
## Prerequisites
| Requirement | Details |
|---|---|
| LM-Kit.NET | Installed via NuGet (LM-Kit.NET package) |
| .NET | .NET 8.0 or later (or .NET Standard 2.0 compatible) |
| Disk space | Enough free space for the model file (typically 2 to 18 GB depending on the model) |
| Internet | Required only for the first download when using a model ID or remote URI |
## Three Ways to Load a Model
LM-Kit.NET provides three approaches for loading a model. Each returns an `LM` instance that you pass to inference classes such as `MultiTurnConversation`, `Agent`, or `TextExtraction`.
### 1. Load by Model ID
The simplest option: pass a short identifier such as `"gemma3:4b"`, and LM-Kit.NET resolves it to the correct HuggingFace URI, downloads the file if needed, and loads it.
```csharp
using LMKit.Model;

using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (path, contentLength, bytesRead) =>
    {
        if (contentLength.HasValue)
        {
            double pct = (double)bytesRead / contentLength.Value * 100;
            Console.Write($"\rDownloading: {pct:F1}% ");
        }
        return true; // return false to cancel
    },
    loadingProgress: progress =>
    {
        Console.Write($"\rLoading: {progress * 100:F0}% ");
        return true; // return false to cancel
    });

Console.WriteLine($"\nModel loaded: {model.Name}");
```
Common model IDs include `gemma3:1b`, `gemma3:4b`, `gemma3:12b`, `qwen3:4b`, `qwen3:8b`, `phi4-mini:3.8b`, `phi4:14.7b`, and `llama3.1:8b`.
### 2. Load from a URI
Use this approach when you need a specific model file from HuggingFace or another host. LM-Kit.NET downloads and caches the file automatically.
```csharp
using LMKit.Model;

var uri = new Uri("https://huggingface.co/lm-kit/gemma-3-4b-instruct-lmk/resolve/main/gemma-3-4b-it-Q4_K_M.lmk");

using LM model = new LM(uri,
    downloadingProgress: (path, contentLength, bytesRead) =>
    {
        if (contentLength.HasValue)
        {
            double pct = (double)bytesRead / contentLength.Value * 100;
            Console.Write($"\rDownloading: {pct:F1}% ");
        }
        return true;
    },
    loadingProgress: progress =>
    {
        Console.Write($"\rLoading: {progress * 100:F0}% ");
        return true;
    });

Console.WriteLine($"\nModel loaded: {model.Name}");
```
### 3. Load from a Local File Path
If you have already downloaded a GGUF model file, pass its path directly. No network access is required.
```csharp
using LMKit.Model;

using LM model = new LM(@"C:\models\gemma-3-4b-it-Q4_K_M.lmk",
    loadingProgress: progress =>
    {
        Console.Write($"\rLoading: {progress * 100:F0}% ");
        return true;
    });

Console.WriteLine($"Model loaded: {model.Name}");
```
## Progress Callbacks
Both download and load operations accept optional progress callbacks. They follow the same pattern: return `true` to continue, or `false` to cancel the operation.
| Callback | Signature | When It Fires |
|---|---|---|
| `downloadingProgress` | `bool (string path, long? contentLength, long bytesRead)` | Periodically during file download |
| `loadingProgress` | `bool (float progress)` | Periodically while the model tensors are loaded into memory (0.0 to 1.0) |
The `contentLength` parameter in the downloading callback may be `null` if the server does not provide a `Content-Length` header.
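When the total size is unknown, you can still report progress in raw bytes instead of a percentage. A minimal sketch of such a callback, using only the signature described above (the byte-to-MB formatting is illustrative):

```csharp
using LMKit.Model;

using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (path, contentLength, bytesRead) =>
    {
        if (contentLength.HasValue)
        {
            // Determinate progress: the total size is known
            Console.Write($"\rDownloading: {(double)bytesRead / contentLength.Value * 100:F1}% ");
        }
        else
        {
            // Indeterminate progress: report bytes received so far
            Console.Write($"\rDownloaded: {bytesRead / (1024.0 * 1024.0):F1} MB ");
        }
        return true;
    });
```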
## Download and Caching Behavior
When you load a model by ID or URI, LM-Kit.NET checks the local cache first. If the file is not cached, it downloads the model and stores it for future use.
- **Cache location:** the default cache directory is managed internally by LM-Kit.NET. You can override it by passing a `storagePath` parameter.
- **Subsequent loads:** if the file already exists in the cache, no download occurs and loading starts immediately.
- **Custom storage path:** provide a `storagePath` argument to control where the downloaded file is saved.
```csharp
using LMKit.Model;

// Store the model in a custom directory
using LM model = LM.LoadFromModelID("qwen3:4b",
    storagePath: @"D:\my-models");
```
## Model Properties After Loading
Once a model is loaded, you can inspect its capabilities and configuration through properties on the `LM` instance.
```csharp
using LMKit.Model;

using LM model = LM.LoadFromModelID("gemma3:12b");

Console.WriteLine($"Name: {model.Name}");
Console.WriteLine($"Context length: {model.ContextLength}");
Console.WriteLine($"GPU layers: {model.GpuLayerCount}");
Console.WriteLine($"Has text gen: {model.HasTextGeneration}");
Console.WriteLine($"Has vision: {model.HasVision}");
Console.WriteLine($"Has tool calls: {model.HasToolCalls}");
```
| Property | Type | Description |
|---|---|---|
| `Name` | `string` | The name embedded in the model metadata. |
| `ContextLength` | `int` | Maximum number of tokens the model can process in a single inference. |
| `GpuLayerCount` | `int` | Number of layers currently offloaded to the GPU. |
| `HasTextGeneration` | `bool` | Whether the model supports text generation tasks. |
| `HasVision` | `bool` | Whether the model supports image input (vision language model). |
| `HasToolCalls` | `bool` | Whether the model supports native tool/function calling. |
## The Model Catalog
LM-Kit.NET ships with a curated catalog of validated models, accessible through `ModelCard.GetPredefinedModelCards()`. Each `ModelCard` contains the download URI, file size, quantization precision, and capability flags.
```csharp
using LMKit.Model;

var catalog = ModelCard.GetPredefinedModelCards();
Console.WriteLine($"Available models: {catalog.Count}\n");

foreach (var card in catalog)
{
    Console.WriteLine($"  {card.ModelID,-25} Size: {card.FileSize / (1024.0 * 1024.0 * 1024.0):F1} GB");
}
```
You can load a model directly from a `ModelCard`:
```csharp
using System.Linq;
using LMKit.Model;

var catalog = ModelCard.GetPredefinedModelCards();
var card = catalog.First(c => c.ModelID == "gemma3:4b");
using LM model = new LM(card);
```
## Evaluating Hardware Compatibility
Before loading a large model, check whether your hardware can run it efficiently. The `DeviceConfiguration.GetPerformanceScore()` method returns a value between 0 and 1 based on your GPU memory relative to the model size.
```csharp
using LMKit.Hardware;
using LMKit.Model;

var catalog = ModelCard.GetPredefinedModelCards();

foreach (var card in catalog)
{
    float score = DeviceConfiguration.GetPerformanceScore(card);
    string rating = score >= 0.9f ? "Excellent"
                  : score >= 0.5f ? "Acceptable"
                  : "May be slow";
    Console.WriteLine($"  {card.ModelID,-25} Score: {score:F2} ({rating})");
}
```
| Score Range | Meaning |
|---|---|
| 0.9 to 1.0 | The model fits comfortably in VRAM. Expect fast inference. |
| 0.5 to 0.9 | The model can run but some layers may spill to system memory. |
| Below 0.5 | Consider a smaller model or a more powerful GPU. |
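One practical use of the score bands above: rather than picking the highest-scoring model, pick the *smallest* model that still lands in the 0.9-and-above band, minimizing memory use while keeping inference fast. A sketch under the assumption that `FileSize` is in bytes, as in the catalog listing earlier:

```csharp
using System.Linq;
using LMKit.Hardware;
using LMKit.Model;

// Smallest model that still fits comfortably in VRAM (score >= 0.9)
var candidate = ModelCard.GetPredefinedModelCards()
    .Where(c => DeviceConfiguration.GetPerformanceScore(c) >= 0.9f)
    .OrderBy(c => c.FileSize)
    .FirstOrDefault();

if (candidate != null)
    Console.WriteLine($"Smallest comfortable model: {candidate.ModelID}");
else
    Console.WriteLine("No model scores 0.9 or above on this hardware.");
```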
## Memory and Cache Configuration
LM-Kit.NET provides global settings that control how models and inference state are cached in memory.
```csharp
using LMKit.Global;

// Enable model caching so reloading the same model is instant (default: true)
Configuration.EnableModelCache = true;

// Enable KV cache recycling to reuse attention caches across requests (default: true)
Configuration.EnableKVCacheRecycling = true;
```
| Setting | Default | Description |
|---|---|---|
| `Configuration.EnableModelCache` | `true` | Keeps model weights in memory after the `LM` instance is disposed, speeding up subsequent loads of the same model. |
| `Configuration.EnableKVCacheRecycling` | `true` | Reuses key-value attention caches across inference calls, reducing memory allocations and improving throughput. |
Tip: For server or batch scenarios where the same model handles many requests, keep both settings enabled. For memory-constrained environments running different models sequentially, consider disabling `EnableModelCache` to free memory sooner.
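For a memory-constrained pipeline that loads different models one after another, the sequential scenario can be sketched like this (the model IDs and loop are illustrative):

```csharp
using LMKit.Global;
using LMKit.Model;

// Don't keep weights in memory after dispose; each model is used only once.
Configuration.EnableModelCache = false;

foreach (string id in new[] { "gemma3:4b", "qwen3:4b" })
{
    using LM model = LM.LoadFromModelID(id);
    // ... run this model's workload ...
}   // weights are released here before the next model loads
```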
## Putting It All Together
The following example demonstrates a complete model selection and loading workflow:
```csharp
using LMKit.Global;
using LMKit.Hardware;
using LMKit.Model;

// Configure runtime
Runtime.EnableCuda = true;
Configuration.EnableKVCacheRecycling = true;
Runtime.Initialize();

// Find the best model for this hardware
var catalog = ModelCard.GetPredefinedModelCards();
ModelCard bestCard = null;
float bestScore = 0;

foreach (var card in catalog)
{
    if (!card.ModelID.Contains("embedding") && !card.ModelID.Contains("whisper"))
    {
        float score = DeviceConfiguration.GetPerformanceScore(card);
        if (score > bestScore)
        {
            bestScore = score;
            bestCard = card;
        }
    }
}

if (bestCard == null)
{
    Console.WriteLine("No suitable model found in the catalog.");
    return;
}

Console.WriteLine($"Selected model: {bestCard.ModelID} (score: {bestScore:F2})");

// Load with progress
using LM model = new LM(bestCard,
    downloadingProgress: (path, len, read) =>
    {
        if (len.HasValue)
            Console.Write($"\rDownloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p =>
    {
        Console.Write($"\rLoading: {p * 100:F0}% ");
        return true;
    });

Console.WriteLine($"\nLoaded: {model.Name}");
Console.WriteLine($"Context: {model.ContextLength} tokens");
Console.WriteLine($"GPU layers: {model.GpuLayerCount}");
Console.WriteLine($"Vision: {model.HasVision}");
Console.WriteLine($"Tools: {model.HasToolCalls}");
```
## Next Steps
- Estimating Memory and Context Size: validate whether a model fits your hardware and find the optimal context size before loading.
- Configure GPU Backends: set up CUDA, Vulkan, or Metal for GPU acceleration.
- Your First AI Agent: use a loaded model to build an agent with tools.
- Choosing the Right Model: understand the trade-offs between model size, quantization, and hardware.
- Model Catalog: browse the full list of predefined models with download URIs and specifications.