Load a Model and Generate Your First Response

This tutorial takes you from zero to a working LLM response in a .NET console app. By the end, you will have a running program that downloads a model, loads it (with GPU acceleration if available), and generates a chat response.


Prerequisites

| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 8 GB |
| VRAM (optional) | 4 GB for GPU acceleration |
| Disk | ~3 GB free for model download |

Step 1: Create the Project and Install LM-Kit.NET

```bash
dotnet new console -n MyFirstLMKit
cd MyFirstLMKit
dotnet add package LM-Kit.NET
```

For NVIDIA GPU acceleration, also install the CUDA backend:

```bash
# Windows
dotnet add package LM-Kit.NET.Backend.Cuda12.Windows

# Linux
dotnet add package LM-Kit.NET.Backend.Cuda12.Linux
```

Step 2: Understand Model Loading Options

LM-Kit.NET gives you three ways to load a model. Pick the one that fits your workflow:

| Method | When to Use | Example |
|---|---|---|
| `LoadFromModelID` | You want a curated, tested model by name. Simplest option. | `LM.LoadFromModelID("gemma3:4b")` |
| URI constructor | You have a direct Hugging Face or HTTP URL to a `.gguf` file. | `new LM(new Uri("https://huggingface.co/..."))` |
| Local path | The model file is already on disk. No download needed. | `new LM("C:/models/my-model.gguf")` |

`LoadFromModelID` is the recommended starting point. It resolves to a known-good Hugging Face URI and handles download and caching automatically.
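
For comparison, here is a minimal sketch of all three approaches; the Hugging Face URL and the local path are placeholders, not real locations:

```csharp
using LMKit.Model;

// Pick ONE of these in a real program -- all three are shown only for comparison.

// 1. Curated model ID: download and caching are handled for you.
using LM byId = LM.LoadFromModelID("gemma3:4b");

// 2. Direct URI to a .gguf file (placeholder URL -- substitute a real one).
using LM byUri = new LM(new Uri("https://huggingface.co/<repo>/resolve/main/<file>.gguf"));

// 3. File already on disk: no download (placeholder path).
using LM byPath = new LM("C:/models/my-model.gguf");
```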

Available Model IDs (subset)

| Model ID | Parameters | VRAM Needed | Best For |
|---|---|---|---|
| `gemma3:1b` | 1B | ~1.5 GB | Low-resource devices, quick tests |
| `gemma3:4b` | 4B | ~3.5 GB | General chat, good quality/speed tradeoff |
| `qwen3:4b` | 4B | ~3.5 GB | Multilingual, tool calling |
| `gemma3:12b` | 12B | ~8 GB | High-quality reasoning |
| `qwen3:8b` | 8B | ~6 GB | Complex tasks, coding |

To list all available models programmatically:

```csharp
var cards = ModelCard.GetPredefinedModelCards();
foreach (var card in cards)
    Console.WriteLine($"{card.ModelID} ({card.ParameterCount / 1_000_000_000.0:F1}B) - ctx:{card.ContextLength}");
```

Step 3: Write the Program

Replace the contents of `Program.cs`:

```csharp
using System.Text;
using LMKit.Model;
using LMKit.TextGeneration;

// Optional: set a license key if available.
// A free community license can be obtained from: https://lm-kit.com/products/community-edition/
LMKit.Licensing.LicenseManager.SetLicenseKey("");

Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;

// --- 1. Download and load the model ---
Console.WriteLine("Downloading and loading model (first run downloads ~3 GB)...\n");

using LM model = LM.LoadFromModelID(
    "gemma3:4b",
    downloadingProgress: (path, contentLength, bytesRead) =>
    {
        if (contentLength.HasValue)
        {
            double pct = (double)bytesRead / contentLength.Value * 100;
            Console.Write($"\rDownloading: {pct:F1}%   ");
        }
        return true; // return false to cancel
    },
    loadingProgress: progress =>
    {
        Console.Write($"\rLoading: {progress * 100:F0}%   ");
        return true;
    });

Console.WriteLine($"\n\nModel loaded: {model.Name}");
Console.WriteLine($"  Context length: {model.ContextLength} tokens");
Console.WriteLine($"  GPU layers: {model.GpuLayerCount}");
Console.WriteLine($"  Capabilities: text={model.HasTextGeneration}, vision={model.HasVision}, tools={model.HasToolCalls}\n");

// --- 2. Create a multi-turn conversation ---
var chat = new MultiTurnConversation(model)
{
    SystemPrompt = "You are a helpful assistant. Be concise.",
    MaximumCompletionTokens = 512
};

// Stream tokens as they are generated
chat.AfterTextCompletion += (sender, e) =>
{
    if (e.SegmentType == TextSegmentType.UserVisible)
        Console.Write(e.Text);
};

// --- 3. Chat loop ---
Console.WriteLine("Type a message (or 'quit' to exit):\n");

while (true)
{
    Console.ForegroundColor = ConsoleColor.Green;
    Console.Write("You: ");
    Console.ResetColor();

    string? input = Console.ReadLine();
    if (string.IsNullOrWhiteSpace(input) || input.Equals("quit", StringComparison.OrdinalIgnoreCase))
        break;

    Console.ForegroundColor = ConsoleColor.Cyan;
    Console.Write("Assistant: ");
    Console.ResetColor();

    var result = chat.Submit(input);

    Console.WriteLine($"\n  [{result.GeneratedTokenCount} tokens, {result.TokenGenerationRate:F1} tok/s]\n");
}
```

Step 4: Run It

```bash
dotnet run
```

Expected output on first run:

```
Downloading and loading model (first run downloads ~3 GB)...

Downloading: 100.0%
Loading: 100%

Model loaded: Gemma 3 4B Instruct
  Context length: 8192 tokens
  GPU layers: 35
  Capabilities: text=True, vision=False, tools=True

Type a message (or 'quit' to exit):

You: What is retrieval-augmented generation?
Assistant: RAG combines a retrieval system with a language model. When a query arrives,
relevant documents are fetched from a knowledge base and injected into the prompt context.
The model then generates a response grounded in those documents, reducing hallucinations
and keeping answers up to date without retraining.
  [87 tokens, 42.3 tok/s]
```

GPU Configuration

By default, LM-Kit.NET offloads all model layers to the GPU (`GpuLayerCount = int.MaxValue`). If you run out of VRAM, reduce the layer count:

```csharp
using LM model = LM.LoadFromModelID(
    "gemma3:4b",
    deviceConfiguration: new LM.DeviceConfiguration
    {
        GpuLayerCount = 20  // offload only 20 layers, keep the rest on CPU
    });
```

Set `GpuLayerCount = 0` to force CPU-only inference.
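
To verify the configuration took effect, you can read back the same `GpuLayerCount` property the tutorial printed in Step 3. A minimal CPU-only sketch:

```csharp
using LM cpuModel = LM.LoadFromModelID(
    "gemma3:4b",
    deviceConfiguration: new LM.DeviceConfiguration
    {
        GpuLayerCount = 0  // no layers offloaded; inference runs entirely on CPU
    });

// Should report 0 when nothing was offloaded to the GPU.
Console.WriteLine($"GPU layers in use: {cpuModel.GpuLayerCount}");
```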


Choosing Between SingleTurnConversation and MultiTurnConversation

| Class | Keeps History | Use Case |
|---|---|---|
| `SingleTurnConversation` | No | Stateless tasks: classification, extraction, one-shot Q&A |
| `MultiTurnConversation` | Yes | Chatbots, assistants, anything that needs context across turns |

`MultiTurnConversation` accumulates messages in its `History` property. Call `chat.ClearHistory()` to reset the conversation context.
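
A minimal sketch contrasting the two, reusing the `model` loaded in Step 3 (the sentiment prompt is illustrative, and reading the response text via `result.Completion` is an assumption about the result type):

```csharp
// Stateless: every Submit call stands alone -- nothing is remembered between calls.
var classifier = new SingleTurnConversation(model)
{
    SystemPrompt = "Classify the sentiment of the user's text as positive, negative, or neutral."
};
var result = classifier.Submit("The checkout flow is fast and painless.");
Console.WriteLine(result.Completion); // assumption: the generated text is exposed as Completion

// Stateful: later turns can draw on earlier ones via the History property.
var assistant = new MultiTurnConversation(model);
assistant.Submit("My name is Ada.");
assistant.Submit("What is my name?"); // can answer from the accumulated history

assistant.ClearHistory(); // drop all turns and start a fresh conversation
```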


Common Issues

| Problem | Cause | Fix |
|---|---|---|
| `OutOfMemoryException` on load | Model too large for available VRAM | Use a smaller model (`gemma3:1b`) or reduce `GpuLayerCount` |
| Slow generation (~1 tok/s) | Running on CPU without a GPU backend | Install the CUDA or Vulkan backend NuGet package |
| Download hangs | Network/firewall blocking Hugging Face | Download the `.gguf` file manually and load it from a local path (see the sketch after this table) |
| Garbled output | Wrong chat template format | Use `LoadFromModelID` (auto-detects the template) or set `model.ChatTemplateFormat` explicitly |
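
For the download-hang case, the fallback is to fetch the `.gguf` file yourself (from a browser or another machine) and load it from disk, as shown in Step 2; the path below is a placeholder:

```csharp
using LMKit.Model;

// Placeholder path -- point this at the .gguf file you downloaded manually.
using LM model = new LM("C:/models/my-model.gguf");
```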

Next Steps