# Load a Model and Generate Your First Response
This tutorial takes you from zero to a working LLM response in a .NET console app. By the end, you will have a running program that downloads a model, loads it (with GPU acceleration if available), and generates a chat response.
## Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 8 GB |
| VRAM (optional) | 4 GB for GPU acceleration |
| Disk | ~3 GB free for model download |
## Step 1: Create the Project and Install LM-Kit.NET

```bash
dotnet new console -n MyFirstLMKit
cd MyFirstLMKit
dotnet add package LM-Kit.NET
```
For NVIDIA GPU acceleration, also install the CUDA backend:
```bash
# Windows
dotnet add package LM-Kit.NET.Backend.Cuda12.Windows

# Linux
dotnet add package LM-Kit.NET.Backend.Cuda12.Linux
```
## Step 2: Understand Model Loading Options
LM-Kit.NET gives you three ways to load a model. Pick the one that fits your workflow:
| Method | When to Use | Example |
|---|---|---|
| `LoadFromModelID` | You want a curated, tested model by name. Simplest option. | `LM.LoadFromModelID("gemma3:4b")` |
| URI constructor | You have a direct Hugging Face or HTTP URL to a `.gguf` file. | `new LM(new Uri("https://huggingface.co/..."))` |
| Local path | The model file is already on disk. No download needed. | `new LM("C:/models/my-model.gguf")` |
`LoadFromModelID` is the recommended starting point. It resolves to a known-good Hugging Face URI and handles download and caching automatically.
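For comparison, here are the other two options written out as full statements. This is a sketch: the Hugging Face URL and the local path are placeholders, not real artifacts.

```csharp
using LMKit.Model;

// Placeholder URL: point this at a direct link to a real .gguf file.
using LM fromUri = new LM(
    new Uri("https://huggingface.co/<repo>/resolve/main/<model>.gguf"));

// Placeholder path: the file must already exist on disk; no download occurs.
using LM fromDisk = new LM("C:/models/my-model.gguf");
```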
### Available Model IDs (subset)
| Model ID | Parameters | VRAM Needed | Best For |
|---|---|---|---|
| `gemma3:1b` | 1B | ~1.5 GB | Low-resource devices, quick tests |
| `gemma3:4b` | 4B | ~3.5 GB | General chat, good quality/speed tradeoff |
| `qwen3:4b` | 4B | ~3.5 GB | Multilingual, tool calling |
| `gemma3:12b` | 12B | ~8 GB | High-quality reasoning |
| `qwen3:8b` | 8B | ~6 GB | Complex tasks, coding |
To list all available models programmatically:
```csharp
var cards = ModelCard.GetPredefinedModelCards();

foreach (var card in cards)
    Console.WriteLine($"{card.ModelID} ({card.ParameterCount / 1_000_000_000.0:F1}B) - ctx:{card.ContextLength}");
```
## Step 3: Write the Program

Replace the contents of `Program.cs`:
```csharp
using System.Text;
using LMKit.Model;
using LMKit.TextGeneration;
// Optional: set a license key if available.
// A free community license can be obtained from: https://lm-kit.com/products/community-edition/
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// --- 1. Download and load the model ---
Console.WriteLine("Downloading and loading model (first run downloads ~3 GB)...\n");
using LM model = LM.LoadFromModelID(
"gemma3:4b",
downloadingProgress: (path, contentLength, bytesRead) =>
{
if (contentLength.HasValue)
{
double pct = (double)bytesRead / contentLength.Value * 100;
Console.Write($"\rDownloading: {pct:F1}% ");
}
return true; // return false to cancel
},
loadingProgress: progress =>
{
Console.Write($"\rLoading: {progress * 100:F0}% ");
return true;
});
Console.WriteLine($"\n\nModel loaded: {model.Name}");
Console.WriteLine($" Context length: {model.ContextLength} tokens");
Console.WriteLine($" GPU layers: {model.GpuLayerCount}");
Console.WriteLine($" Capabilities: text={model.HasTextGeneration}, vision={model.HasVision}, tools={model.HasToolCalls}\n");
// --- 2. Create a multi-turn conversation ---
var chat = new MultiTurnConversation(model)
{
SystemPrompt = "You are a helpful assistant. Be concise.",
MaximumCompletionTokens = 512
};
// Stream tokens as they are generated
chat.AfterTextCompletion += (sender, e) =>
{
if (e.SegmentType == TextSegmentType.UserVisible)
Console.Write(e.Text);
};
// --- 3. Chat loop ---
Console.WriteLine("Type a message (or 'quit' to exit):\n");
while (true)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.Write("You: ");
Console.ResetColor();
string? input = Console.ReadLine();
if (string.IsNullOrWhiteSpace(input) || input.Equals("quit", StringComparison.OrdinalIgnoreCase))
break;
Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write("Assistant: ");
Console.ResetColor();
var result = chat.Submit(input);
Console.WriteLine($"\n [{result.GeneratedTokenCount} tokens, {result.TokenGenerationRate:F1} tok/s]\n");
}
```
## Step 4: Run It

```bash
dotnet run
```
Expected output on first run:
```text
Downloading and loading model (first run downloads ~3 GB)...
Downloading: 100.0%
Loading: 100%
Model loaded: Gemma 3 4B Instruct
Context length: 8192 tokens
GPU layers: 35
Capabilities: text=True, vision=False, tools=True
Type a message (or 'quit' to exit):
You: What is retrieval-augmented generation?
Assistant: RAG combines a retrieval system with a language model. When a query arrives,
relevant documents are fetched from a knowledge base and injected into the prompt context.
The model then generates a response grounded in those documents, reducing hallucinations
and keeping answers up to date without retraining.
  [87 tokens, 42.3 tok/s]
```
## GPU Configuration

By default, LM-Kit.NET offloads all model layers to the GPU (`GpuLayerCount = int.MaxValue`). If you run out of VRAM, reduce the layer count:
```csharp
using LM model = LM.LoadFromModelID(
"gemma3:4b",
deviceConfiguration: new LM.DeviceConfiguration
{
GpuLayerCount = 20 // offload only 20 layers, keep the rest on CPU
    });
```
Set `GpuLayerCount = 0` to force CPU-only inference.
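For example, the same call pinned to the CPU (a minimal variant of the snippet above):

```csharp
using LM model = LM.LoadFromModelID(
    "gemma3:4b",
    deviceConfiguration: new LM.DeviceConfiguration
    {
        GpuLayerCount = 0 // nothing offloaded; inference runs entirely on the CPU
    });
```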
## Choosing Between `SingleTurnConversation` and `MultiTurnConversation`
| Class | Keeps History | Use Case |
|---|---|---|
| `SingleTurnConversation` | No | Stateless tasks: classification, extraction, one-shot Q&A |
| `MultiTurnConversation` | Yes | Chatbots, assistants, anything that needs context across turns |
`MultiTurnConversation` accumulates messages in its `History` property. Call `chat.ClearHistory()` to reset the conversation context.
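As a sketch of the stateless variant, here is a one-shot classifier built on the same `model` object from Step 3. The constructor and `Submit` call mirror the multi-turn example above; the `Completion` property on the result is an assumption to verify against your LM-Kit.NET version.

```csharp
// Stateless: each Submit() call stands alone; no history carries over.
var classifier = new SingleTurnConversation(model)
{
    SystemPrompt = "Classify the sentiment of the user's message as Positive, Negative, or Neutral. Answer with one word.",
    MaximumCompletionTokens = 8
};

var verdict = classifier.Submit("Setup took five minutes and everything just worked.");
Console.WriteLine(verdict.Completion); // assumed property name for the generated text
```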
## Common Issues
| Problem | Cause | Fix |
|---|---|---|
| `OutOfMemoryException` on load | Model too large for available VRAM | Use a smaller model (`gemma3:1b`) or reduce `GpuLayerCount` |
| Slow generation (~1 tok/s) | Running on CPU without a GPU backend | Install the CUDA or Vulkan backend NuGet package |
| Download hangs | Network/firewall blocking Hugging Face | Download the `.gguf` file manually and load it from a local path (see the sketch below) |
| Garbled output | Wrong chat template format | Use `LoadFromModelID` (auto-detects the template) or set `model.ChatTemplateFormat` explicitly |
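For the manual-download workaround, here is a minimal sketch using only standard .NET APIs (and the default console template's implicit usings). The URL is a placeholder; substitute a direct link to the `.gguf` file you need:

```csharp
using LMKit.Model;

// Placeholder URL: replace with a direct link to a real .gguf file.
string url = "https://huggingface.co/<repo>/resolve/main/<model>.gguf";
string localPath = Path.Combine(AppContext.BaseDirectory, "model.gguf");

if (!File.Exists(localPath))
{
    using var http = new HttpClient();
    await using var remote = await http.GetStreamAsync(url);
    await using var file = File.Create(localPath);
    await remote.CopyToAsync(file); // fetch once; later runs reuse the local file
}

using LM model = new LM(localPath); // loading from a local path skips the download entirely
```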
## Next Steps
- Build a RAG Pipeline Over Your Own Documents: ground model responses in your own data.
- Create an AI Agent with Tools: give the model the ability to search the web, do math, and call your code.
- Extract Structured Data from Unstructured Text: pull typed fields (names, dates, amounts) from free-form text.