Choosing the Right Model for Your Use Case and Hardware

LM-Kit.NET ships with a curated catalog of 60+ models that have been validated and benchmarked for compatibility. Selecting the right model depends on your task, your hardware, and the trade-off between quality and speed. This guide walks you through the decision process.

Recently added to the catalog: GPT OSS 20B (OpenAI, MoE reasoning), GLM 4.7 Flash (Z.ai, MoE coding/agentic), Falcon H1R 7B (hybrid Mamba-2 reasoning), Devstral Small 2 (agentic coding, 393K context), Qwen 3 VL 30B (MoE vision), Granite 4 Hybrid (1M context), Nemotron 3 Nano 30B (MoE, 1M context), SmolLM3 3B.


TL;DR

using LMKit.Model;

// Load a model by its catalog ID (auto-downloads if needed)
using LM model = LM.LoadFromModelID("gemma3:4b");

| Use Case | Recommended Model |
|---|---|
| General-purpose assistant | gemma3:4b or qwen3:4b |
| Budget hardware (CPU only) | gemma3:1b or qwen3:1.7b |
| High-quality reasoning | gptoss:20b or glm4.7-flash |
| Tool-calling agent | glm4.7-flash or qwen3:8b |
| Agentic coding | devstral-small2 or glm4.7-flash |
| Vision / multimodal | gemma3:12b or qwen3-vl:8b |
| Embeddings for RAG | embeddinggemma-300m or qwen3-embedding:0.6b |
| Speech-to-text | whisper-large-turbo3 |

Browse all available models in the Model Catalog. For GPU-specific picks, multi-model stacks, and upgrade paths, see Model Recommendations.


The Model Catalog

LM-Kit maintains a predefined catalog of models hosted on Hugging Face. Every model in the catalog has been tested for correctness, performance, and stability with the SDK.

You can browse the full catalog interactively in the Model Catalog page, which lets you filter by capability, size, and format.

In code, you can retrieve the full catalog programmatically:

using LMKit.Model;

// Get all predefined models
List<ModelCard> models = ModelCard.GetPredefinedModelCards();

foreach (var card in models)
{
    Console.WriteLine($"{card.ModelID,-30} {card.ParameterCount / 1e9,5:F1}B  {card.Capabilities}");
}

To load a model by its catalog ID:

using LMKit.Model;

using LM model = LM.LoadFromModelID("gemma3:4b");

LoadFromModelID downloads the model automatically if it is not already cached locally.


Step 1: Match the Model to Your Task

Every model in the catalog has a set of capabilities that describe what it can do. Choose a model whose capabilities match your use case.

| Capability | Description | Example Models |
|---|---|---|
| Chat | Multi-turn dialogue, Q&A, assistants | Gemma 3, Qwen 3, GPT OSS, GLM 4.7, Phi 4 |
| Text Generation | Content creation, summarization, rewriting | Gemma 3, Qwen 3, GPT OSS, GLM 4.7, Mistral |
| Code Completion | Code generation and completion | Devstral, GLM 4.7, GPT OSS, DeepSeek Coder, Falcon H1R |
| Reasoning | Multi-step reasoning, chain-of-thought | GPT OSS, GLM 4.7, Falcon H1R, Magistral, QwQ, Nemotron 3 Nano |
| Tools Call | Function/tool invocation by the model | GLM 4.7, Qwen 3, GPT OSS, Mistral Small 3.2, Ministral 3, Granite 4 Hybrid |
| Vision | Image understanding, visual Q&A | Gemma 3 (4B+), Qwen 3 VL, Ministral 3, Devstral, MiniCPM, Pixtral |
| Text Embeddings | Semantic similarity, clustering, RAG retrieval | Embedding Gemma, Qwen 3 Embedding, Nomic Embed, BGE-M3 |
| Image Embeddings | Image similarity and visual search | Nomic Embed Vision |
| Text Reranking | Reranking search candidates by relevance | BGE M3 Reranker |
| Speech-to-Text | Audio transcription | Whisper (tiny through large-v3-turbo) |
| Sentiment Analysis | Sentiment and emotion detection | LM-Kit Sentiment Analysis (finetuned) |
| Math | Mathematical reasoning | GLM 4.7, GPT OSS, Qwen 3 (4B+), Falcon H1R |
| Image Segmentation | Image partitioning into regions | U2-Net |

For detailed descriptions and benchmark data for each family, see Model Families and Benchmarks.

You can filter models by capability in code:

using LMKit.Model;

var chatModels = ModelCard.GetPredefinedModelCards()
    .Where(c => c.Capabilities.HasFlag(ModelCapabilities.Chat))
    .ToList();

Console.WriteLine($"Found {chatModels.Count} chat-capable models");

Step 2: Size the Model to Your Hardware

Model size directly determines how much memory (RAM or VRAM) you need. The general rule: larger models produce better outputs but require more powerful hardware.

Quick Sizing Guide

| Model Size | RAM / VRAM Needed (4-bit) | Hardware | Typical Use |
|---|---|---|---|
| Under 1B | ~1 GB | CPU | Embeddings, lightweight classification |
| 1B to 3B | 1 to 2 GB | CPU or entry-level GPU | Simple chat, basic classification, translation |
| 4B to 8B | 3 to 6 GB | GPU with 6+ GB VRAM | General-purpose chat, RAG, tool calling |
| 12B to 14B | 8 to 10 GB | GPU with 10+ GB VRAM | High-quality chat, complex reasoning |
| 20B to 30B | 12 to 20 GB | GPU with 16+ GB VRAM or multi-GPU | Advanced reasoning, large-scale production. MoE models (GPT OSS, GLM 4.7) activate only ~3B params per token, delivering 20B/30B quality at lower compute cost. |
| 70B | 40+ GB | Multi-GPU setup | Maximum quality, enterprise server workloads |
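As a rule of thumb, a 4-bit model needs roughly half a byte per parameter for the weights, plus headroom for runtime buffers. A back-of-envelope sketch (EstimateMemoryGB is a hypothetical helper, and the 1.2x overhead factor is an illustrative assumption, not an SDK constant):

```csharp
// Rough memory estimate for a 4-bit (Q4_K_M) model.
// The 1.2x overhead factor is an assumption covering runtime
// buffers and KV cache headroom; real usage varies per model.
double EstimateMemoryGB(double parameterCount)
{
    double weightBytes = parameterCount * 0.5; // ~4 bits per weight
    return weightBytes * 1.2 / 1e9;
}

Console.WriteLine($"4B model:  ~{EstimateMemoryGB(4e9):F1} GB");   // ~2.4 GB
Console.WriteLine($"20B model: ~{EstimateMemoryGB(20e9):F1} GB");  // ~12.0 GB
```

These figures line up with the sizing table above; for precise numbers, prefer the SDK's own scoring and estimation APIs over this arithmetic.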

Once you know what fits your hardware, pick from these task-specific recommendations:

| Use Case | Recommended Model | Why | How-To Guide |
|---|---|---|---|
| General-purpose assistant | gemma3:4b or qwen3:4b | Good quality-to-size ratio, vision support | Build a Conversational Assistant |
| Budget hardware (CPU only) | gemma3:1b or qwen3:1.7b | Fast on CPU, acceptable quality | Load Model and Generate |
| High-quality reasoning | gptoss:20b or glm4.7-flash | MoE efficiency: only ~3B active params with 20B/30B quality reasoning | Control Reasoning and Chain-of-Thought |
| Reasoning on small hardware | falcon-h1r:7b or qwen3:8b | Falcon H1R scores 88% AIME 2024, outperforming many larger models on math | Control Reasoning and Chain-of-Thought |
| Agentic coding | devstral-small2 or glm4.7-flash | Top SWE-bench scores, agentic multi-file coding | Build a Function-Calling Agent |
| Code generation | devstral-small2 or gptoss:20b | Specialized for code and tool-driven development | Extract Structured Data |
| Tool-calling agent | glm4.7-flash or qwen3:8b | GLM 4.7 leads agentic benchmarks; Qwen 3 offers native MCP support | Create an Agent with Tools |
| Vision / multimodal | gemma3:12b or qwen3-vl:8b | Strong vision with reasoning. For lighter hardware, gemma3:4b or qwen3-vl:2b | Analyze Images with Vision |
| Embeddings for RAG | embeddinggemma-300m or qwen3-embedding:0.6b | Embedding Gemma is the top open model under 500M on MTEB. Qwen 3 Embedding for higher accuracy and multilingual | Build a RAG Pipeline |
| Multilingual embeddings | qwen3-embedding:8b or bge-m3 | Broad language coverage for cross-lingual RAG | Build Semantic Search |
| Speech-to-text | whisper-large-turbo3 | Best speed/quality trade-off | Transcribe Audio |
| Long context (100K+ tokens) | granite4-h:3b or granite4-h:7b | Up to 1M token context with hybrid Mamba-2 architecture | Handle Long Inputs |
| Advanced reasoning (large) | qwq or nemotron3-nano | 32B/30B class, top-tier math and reasoning | Control Reasoning and Chain-of-Thought |

Need more guidance? See Model Recommendations for GPU-specific picks, ready-made multi-model stacks, and upgrade paths.


Step 3: Measure Performance on Your Hardware

Instead of guessing, use the built-in performance scorer to evaluate how well each model will run on your specific machine:

using LMKit.Model;

var models = ModelCard.GetPredefinedModelCards();

foreach (var card in models.Where(c => c.Capabilities.HasFlag(ModelCapabilities.Chat)))
{
    float score = LM.DeviceConfiguration.GetPerformanceScore(card);
    string rating = score > 0.7f ? "Good"
                  : score > 0.4f ? "Acceptable"
                  : "Too slow";

    Console.WriteLine($"{card.ModelID,-30} Score: {score:F2}  ({rating})");
}

| Score | Meaning |
|---|---|
| 0.7 to 1.0 | Model runs comfortably on your hardware. |
| 0.4 to 0.7 | Model works but may be slow. Consider partial GPU offloading. |
| Below 0.4 | Model is too large for your hardware. Choose a smaller variant. |

Auto-Filter by Hardware

You can also let the SDK drop smaller models from the list when your hardware comfortably runs a larger variant of the same family:

// Drops smaller siblings when a larger model in the same family scores 1.0
var bestModels = ModelCard.GetPredefinedModelCards(dropSmallerModels: true);

Step 4: Understand Quantization

All models in the LM-Kit catalog are distributed as pre-quantized files, primarily in 4-bit (Q4_K_M) format. Quantization compresses model weights to reduce file size and memory usage with minimal quality loss.

| Precision | File Size (relative) | Quality | Use Case |
|---|---|---|---|
| 4-bit (Q4_K_M) | ~1x (baseline) | Very good for most tasks | Recommended default |
| 8-bit (Q8) | ~2x | Slightly better | When quality matters more than memory |
| 16-bit (F16) | ~4x | Original precision | Embedding models, fine-tuning base |

For most use cases, the default 4-bit quantization in the catalog provides the best balance between quality and resource usage. You do not need to select a quantization level manually.
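The relative sizes follow directly from bits per weight: file size is roughly parameters times bits-per-weight divided by 8. A quick illustrative calculation for a 4B-parameter model (the bit widths below are approximate effective values, since Q4_K_M and Q8 carry per-block metadata, so real GGUF files vary slightly):

```csharp
// Approximate file size = parameters * bits-per-weight / 8.
// Bit widths are rough effective values, not exact format specs.
const double paramCount = 4e9;

foreach (var (name, bitsPerWeight) in new[] { ("Q4_K_M", 4.5), ("Q8", 8.5), ("F16", 16.0) })
{
    Console.WriteLine($"{name,-7} ~{paramCount * bitsPerWeight / 8 / 1e9:F1} GB");
}
```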


Step 5: Consider Context Length

Context length determines how much text the model can process in a single inference pass. Longer context means the model can handle larger documents, longer conversations, or more retrieved chunks in a RAG pipeline.

| Context Length | Typical Use |
|---|---|
| 2K to 4K tokens | Short prompts, simple Q&A |
| 8K to 32K tokens | Multi-turn chat, moderate documents |
| 128K tokens | Long documents, extended conversations |
| 1M tokens | Entire codebases, book-length documents |

Most models in the catalog support 8K to 128K tokens. A few specialized models (Granite 4 Hybrid, Nemotron 3 Nano) support up to 1M tokens.

The SDK can recommend an optimal context size based on your available memory:

using LMKit.Model;

using LM model = LM.LoadFromModelID("gemma3:4b");
int optimalContext = LM.DeviceConfiguration.GetOptimalContextSize(model);

Console.WriteLine($"Recommended context size: {optimalContext} tokens");

Note: Larger context sizes consume more VRAM for the KV cache. If you run into memory issues, reduce the context size or enable KV cache recycling with Configuration.EnableKVCacheRecycling = true.
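The KV cache grows linearly with context length, which is why large contexts are memory-hungry. A rough sketch, where the layer count, KV head count, and head size are illustrative assumptions typical of a mid-size model, not values read from any catalog entry:

```csharp
// KV cache holds one K and one V entry per layer per token.
// All dimensions below are illustrative assumptions.
long KvCacheBytes(long contextTokens, int layers = 34, int kvHeads = 8,
                  int headDim = 128, int bytesPerElement = 2 /* F16 */)
    => 2L * layers * contextTokens * kvHeads * headDim * bytesPerElement;

Console.WriteLine($"8K context:   ~{KvCacheBytes(8_192) / 1e9:F1} GB");   // ~1.1 GB
Console.WriteLine($"128K context: ~{KvCacheBytes(131_072) / 1e9:F1} GB"); // ~18.3 GB
```

Sixteen times the context means sixteen times the cache, which is why the optimal context size depends on how much memory is free after loading the weights.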

For precise, hardware-aware estimation that accounts for KV cache and current GPU usage, use MemoryEstimation.FitParameters(). See Estimating Memory and Context Size for the full guide.


Step 6: Load and Verify

Once you have chosen a model, load it and verify that it is running on the expected backend:

using LMKit.Global;
using LMKit.Model;

Runtime.Initialize();

using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rDownloading: {(double)read / len.Value * 100:F1}%   ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}%   "); return true; });

Console.WriteLine($"\nModel:        {model.Name}");
Console.WriteLine($"Parameters:   {model.ParameterCount / 1e9:F1}B");
Console.WriteLine($"Context:      {model.ContextLength} tokens");
Console.WriteLine($"Backend:      {Runtime.Backend}");
Console.WriteLine($"GPU layers:   {model.GpuLayerCount}");
Console.WriteLine($"Has vision:   {model.HasVision}");
Console.WriteLine($"Has tools:    {model.HasToolCalls}");
Console.WriteLine($"Has reasoning:{model.HasReasoning}");

Model Storage and Caching

Downloaded models are stored locally so subsequent loads are instant. The SDK resolves the storage directory in this order:

  1. Programmatic: Configuration.ModelStorageDirectory = "D:/my-models";
  2. Environment variable: LMKIT_MODELS_DIR
  3. Default: %APPDATA%/LM-Kit/models (Windows) or ~/.local/share/LM-Kit/models (Linux/macOS)

You can also pre-download a model without loading it:

var card = ModelCard.GetPredefinedModelCardByModelID("gemma3:4b");

if (!card.IsLocallyAvailable)
{
    await card.DownloadAsync((path, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r  {(double)read / len.Value * 100:F1}%");
        return true;
    });
}

Console.WriteLine($"Model stored at: {card.LocalPath}");

To validate file integrity after download:

bool valid = card.ValidateFileChecksum();
Console.WriteLine($"Checksum valid: {valid}");

Loading Custom Models

You are not limited to the predefined catalog. LM-Kit.NET supports any GGUF-compatible model:

// From a local file
using LM model = new LM("path/to/my-model.gguf");

// From a Hugging Face URL (auto-downloads)
using LM model = new LM(new Uri("https://huggingface.co/org/repo/resolve/main/model.gguf"));

// Quick metadata inspection (without loading weights)
var card = ModelCard.CreateFromFile("path/to/my-model.gguf");
Console.WriteLine($"Architecture: {card.Architecture}");
Console.WriteLine($"Parameters:   {card.ParameterCount}");
Console.WriteLine($"Context:      {card.ContextLength}");

Decision Flowchart

  1. What is my task? Pick a capability (Chat, Vision, Embeddings, Speech-to-Text, etc.).
  2. What hardware do I have? Check the Quick Sizing Guide table or run GetPerformanceScore to find which sizes fit.
  3. Do I need special features? Tool calling, reasoning, vision, long context?
  4. Start with the recommended model from the table above, then experiment.
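
The flow above can be sketched in code by combining the capability filter from Step 1 with the performance score from Step 3. The 0.7 threshold mirrors the "Good" rating, and picking the largest fitting model is one reasonable strategy, not the only one:

```csharp
using System.Linq;
using LMKit.Model;

// Step 1: filter by capability. Steps 2-3: keep only models that
// score "Good" on this machine, then prefer the largest remaining one.
var pick = ModelCard.GetPredefinedModelCards()
    .Where(c => c.Capabilities.HasFlag(ModelCapabilities.Chat))
    .Where(c => LM.DeviceConfiguration.GetPerformanceScore(c) >= 0.7f)
    .OrderByDescending(c => c.ParameterCount)
    .FirstOrDefault();

Console.WriteLine(pick is null
    ? "No chat model runs comfortably here; try a smaller variant."
    : $"Suggested starting point: {pick.ModelID}");
```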
