# Load a Model and Generate Your First Response
This tutorial takes you from zero to a working LLM response in a .NET console app. By the end, you will have a running program that downloads a model, loads it (with GPU acceleration if available), and generates a chat response.
## Prerequisites
| Requirement | Minimum |
|---|---|
| .NET SDK | 8.0+ |
| RAM | 8 GB |
| VRAM (optional) | 4 GB for GPU acceleration |
| Disk | ~3 GB free for model download |
## Step 1: Create the Project and Install LM-Kit.NET

```bash
dotnet new console -n MyFirstLMKit
cd MyFirstLMKit
dotnet add package LM-Kit.NET
```
For NVIDIA GPU acceleration, also install the CUDA backend:
```bash
# Windows
dotnet add package LM-Kit.NET.Backend.Cuda12.Windows

# Linux
dotnet add package LM-Kit.NET.Backend.Cuda12.Linux
```
## Step 2: Understand Model Loading Options
LM-Kit.NET gives you three ways to load a model. Pick the one that fits your workflow:
| Method | When to Use | Example |
|---|---|---|
| `LoadFromModelID` | You want a curated, tested model by name. Simplest option. | `LM.LoadFromModelID("gemma3:4b")` |
| URI constructor | You have a direct Hugging Face or HTTP URL to a `.gguf` file. | `new LM(new Uri("https://huggingface.co/..."))` |
| Local path | The model file is already on disk. No download needed. | `new LM("C:/models/my-model.gguf")` |
`LoadFromModelID` is the recommended starting point. It resolves to a known-good Hugging Face URI and handles download and caching automatically.
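For comparison, here are the other two options written out as full statements. This is a sketch: the Hugging Face URL and the local path are placeholders, not real artifacts.

```csharp
using LMKit.Model;

// Placeholder URL: point this at a direct link to a real .gguf file.
using LM fromUri = new LM(
    new Uri("https://huggingface.co/<repo>/resolve/main/<model>.gguf"));

// Placeholder path: the file must already exist on disk; no download occurs.
using LM fromDisk = new LM("C:/models/my-model.gguf");
```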
### Available Model IDs (subset)
| Model ID | Parameters | VRAM Needed | Best For |
|---|---|---|---|
| `gemma3:1b` | 1B | ~1.5 GB | Low-resource devices, quick tests |
| `gemma3:4b` | 4B | ~3.5 GB | General chat, good quality/speed tradeoff |
| `qwen3:4b` | 4B | ~3.5 GB | Multilingual, tool calling |
| `gemma3:12b` | 12B | ~8 GB | High-quality reasoning |
| `qwen3:8b` | 8B | ~6 GB | Complex tasks, coding |
To list all available models programmatically:
```csharp
var cards = ModelCard.GetPredefinedModelCards();

foreach (var card in cards)
    Console.WriteLine($"{card.ModelID} ({card.ParameterCount / 1_000_000_000.0:F1}B) - ctx:{card.ContextLength}");
```
## Step 3: Write the Program

Replace the contents of `Program.cs`:
```csharp
using System.Text;
using LMKit.Model;
using LMKit.TextGeneration;
// Optional: set a license key if available.
// A free community license can be obtained from: https://lm-kit.com/products/community-edition/
LMKit.Licensing.LicenseManager.SetLicenseKey("");
Console.InputEncoding = Encoding.UTF8;
Console.OutputEncoding = Encoding.UTF8;
// --- 1. Download and load the model ---
Console.WriteLine("Downloading and loading model (first run downloads ~3 GB)...\n");
using LM model = LM.LoadFromModelID(
"gemma3:4b",
downloadingProgress: (path, contentLength, bytesRead) =>
{
if (contentLength.HasValue)
{
double pct = (double)bytesRead / contentLength.Value * 100;
Console.Write($"\rDownloading: {pct:F1}% ");
}
return true; // return false to cancel
},
loadingProgress: progress =>
{
Console.Write($"\rLoading: {progress * 100:F0}% ");
return true;
});
Console.WriteLine($"\n\nModel loaded: {model.Name}");
Console.WriteLine($" Context length: {model.ContextLength} tokens");
Console.WriteLine($" GPU layers: {model.GpuLayerCount}");
Console.WriteLine($" Capabilities: text={model.HasTextGeneration}, vision={model.HasVision}, tools={model.HasToolCalls}\n");
// --- 2. Create a multi-turn conversation ---
var chat = new MultiTurnConversation(model)
{
SystemPrompt = "You are a helpful assistant. Be concise.",
MaximumCompletionTokens = 512
};
// Stream tokens as they are generated
chat.AfterTextCompletion += (sender, e) =>
{
if (e.SegmentType == TextSegmentType.UserVisible)
Console.Write(e.Text);
};
// --- 3. Chat loop ---
Console.WriteLine("Type a message (or 'quit' to exit):\n");
while (true)
{
Console.ForegroundColor = ConsoleColor.Green;
Console.Write("You: ");
Console.ResetColor();
string? input = Console.ReadLine();
if (string.IsNullOrWhiteSpace(input) || input.Equals("quit", StringComparison.OrdinalIgnoreCase))
break;
Console.ForegroundColor = ConsoleColor.Cyan;
Console.Write("Assistant: ");
Console.ResetColor();
var result = chat.Submit(input);
Console.WriteLine($"\n [{result.GeneratedTokenCount} tokens, {result.TokenGenerationRate:F1} tok/s]\n");
}
```
## Step 4: Run It

```bash
dotnet run
```
Expected output on first run:
```text
Downloading and loading model (first run downloads ~3 GB)...
Downloading: 100.0%
Loading: 100%
Model loaded: Gemma 3 4B Instruct
Context length: 8192 tokens
GPU layers: 35
Capabilities: text=True, vision=False, tools=True
Type a message (or 'quit' to exit):
You: What is retrieval-augmented generation?
Assistant: RAG combines a retrieval system with a language model. When a query arrives,
relevant documents are fetched from a knowledge base and injected into the prompt context.
The model then generates a response grounded in those documents, reducing hallucinations
and keeping answers up to date without retraining.
  [87 tokens, 42.3 tok/s]
```
## GPU Configuration

By default, LM-Kit.NET offloads all model layers to the GPU (`GpuLayerCount = int.MaxValue`). If you run out of VRAM, reduce the layer count:
```csharp
using LM model = LM.LoadFromModelID(
"gemma3:4b",
deviceConfiguration: new LM.DeviceConfiguration
{
GpuLayerCount = 20 // offload only 20 layers, keep the rest on CPU
    });
```
Set `GpuLayerCount = 0` to force CPU-only inference.
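For example, the same call pinned to the CPU (a minimal variant of the snippet above):

```csharp
using LM model = LM.LoadFromModelID(
    "gemma3:4b",
    deviceConfiguration: new LM.DeviceConfiguration
    {
        GpuLayerCount = 0 // nothing offloaded; inference runs entirely on the CPU
    });
```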
## Choosing Between `SingleTurnConversation` and `MultiTurnConversation`
| Class | Keeps History | Use Case |
|---|---|---|
| `SingleTurnConversation` | No | Stateless tasks: classification, extraction, one-shot Q&A |
| `MultiTurnConversation` | Yes | Chatbots, assistants, anything that needs context across turns |
`MultiTurnConversation` accumulates messages in its `History` property. Call `chat.ClearHistory()` to reset the conversation context.
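As a sketch of the stateless variant, here is a one-shot classifier built on the same `model` object from Step 3. The constructor and `Submit` call mirror the multi-turn example above; the `Completion` property on the result is an assumption to verify against your LM-Kit.NET version.

```csharp
// Stateless: each Submit() call stands alone; no history carries over.
var classifier = new SingleTurnConversation(model)
{
    SystemPrompt = "Classify the sentiment of the user's message as Positive, Negative, or Neutral. Answer with one word.",
    MaximumCompletionTokens = 8
};

var verdict = classifier.Submit("Setup took five minutes and everything just worked.");
Console.WriteLine(verdict.Completion); // assumed property name for the generated text
```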
## Common Issues
| Problem | Cause | Fix |
|---|---|---|
| `OutOfMemoryException` on load | Model too large for available VRAM | Use a smaller model (`gemma3:1b`) or reduce `GpuLayerCount` |
| Slow generation (~1 tok/s) | Running on CPU without a GPU backend | Install the CUDA or Vulkan backend NuGet package |
| Download hangs | Network/firewall blocking Hugging Face | Download the `.gguf` file manually and load it from a local path (see the sketch below) |
| Garbled output | Wrong chat template format | Use `LoadFromModelID` (auto-detects the template) or set `model.ChatTemplateFormat` explicitly |
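For the manual-download workaround, here is a minimal sketch using only standard .NET APIs (and the default console template's implicit usings). The URL is a placeholder; substitute a direct link to the `.gguf` file you need:

```csharp
using LMKit.Model;

// Placeholder URL: replace with a direct link to a real .gguf file.
string url = "https://huggingface.co/<repo>/resolve/main/<model>.gguf";
string localPath = Path.Combine(AppContext.BaseDirectory, "model.gguf");

if (!File.Exists(localPath))
{
    using var http = new HttpClient();
    await using var remote = await http.GetStreamAsync(url);
    await using var file = File.Create(localPath);
    await remote.CopyToAsync(file); // fetch once; later runs reuse the local file
}

using LM model = new LM(localPath); // loading from a local path skips the download entirely
```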
## Next Steps
- Build a RAG Pipeline Over Your Own Documents: ground model responses in your own data.
- Create an AI Agent with Tools: give the model the ability to search the web, do math, and call your code.
- Extract Structured Data from Unstructured Text: pull typed fields (names, dates, amounts) from free-form text.