# Choosing the Right Model for Your Use Case and Hardware
LM-Kit.NET ships with a curated catalog of 60+ models that have been validated and benchmarked for compatibility. Selecting the right model depends on your task, your hardware, and the trade-off between quality and speed. This guide walks you through the decision process.
Recently added to the catalog: GPT OSS 20B (OpenAI, MoE reasoning), GLM 4.7 Flash (Z.ai, MoE coding/agentic), Falcon H1R 7B (hybrid Mamba-2 reasoning), Devstral Small 2 (agentic coding, 393K context), Qwen 3 VL 30B (MoE vision), Granite 4 Hybrid (1M context), Nemotron 3 Nano 30B (MoE, 1M context), SmolLM3 3B.
## TL;DR

```csharp
using LMKit.Model;

// Load a model by its catalog ID (auto-downloads if needed)
using LM model = LM.LoadFromModelID("gemma3:4b");
```
| Use Case | Recommended Model |
|---|---|
| General-purpose assistant | gemma3:4b or qwen3:4b |
| Budget hardware (CPU only) | gemma3:1b or qwen3:1.7b |
| High-quality reasoning | gptoss:20b or glm4.7-flash |
| Tool-calling agent | glm4.7-flash or qwen3:8b |
| Agentic coding | devstral-small2 or glm4.7-flash |
| Vision / multimodal | gemma3:12b or qwen3-vl:8b |
| Embeddings for RAG | embeddinggemma-300m or qwen3-embedding:0.6b |
| Speech-to-text | whisper-large-turbo3 |
Browse all available models in the Model Catalog. For GPU-specific picks, multi-model stacks, and upgrade paths, see Model Recommendations.
## The Model Catalog
LM-Kit maintains a predefined catalog of models hosted on Hugging Face. Every model in the catalog has been tested for correctness, performance, and stability with the SDK.
You can browse the full catalog interactively in the Model Catalog page, which lets you filter by capability, size, and format.
In code, you can retrieve the full catalog programmatically:
```csharp
using LMKit.Model;

// Get all predefined models
List<ModelCard> models = ModelCard.GetPredefinedModelCards();

foreach (var card in models)
{
    Console.WriteLine($"{card.ModelID,-30} {card.ParameterCount / 1e9,5:F1}B {card.Capabilities}");
}
```
To load a model by its catalog ID:
```csharp
using LMKit.Model;

using LM model = LM.LoadFromModelID("gemma3:4b");
```

`LoadFromModelID` downloads the model automatically if it is not already cached locally.
## Step 1: Match the Model to Your Task
Every model in the catalog has a set of capabilities that describe what it can do. Choose a model whose capabilities match your use case.
| Capability | Description | Example Models |
|---|---|---|
| Chat | Multi-turn dialogue, Q&A, assistants | Gemma 3, Qwen 3, GPT OSS, GLM 4.7, Phi 4 |
| Text Generation | Content creation, summarization, rewriting | Gemma 3, Qwen 3, GPT OSS, GLM 4.7, Mistral |
| Code Completion | Code generation and completion | Devstral, GLM 4.7, GPT OSS, DeepSeek Coder, Falcon H1R |
| Reasoning | Multi-step reasoning, chain-of-thought | GPT OSS, GLM 4.7, Falcon H1R, Magistral, QwQ, Nemotron 3 Nano |
| Tools Call | Function/tool invocation by the model | GLM 4.7, Qwen 3, GPT OSS, Mistral Small 3.2, Ministral 3, Granite 4 Hybrid |
| Vision | Image understanding, visual Q&A | Gemma 3 (4B+), Qwen 3 VL, Ministral 3, Devstral, MiniCPM, Pixtral |
| Text Embeddings | Semantic similarity, clustering, RAG retrieval | Embedding Gemma, Qwen 3 Embedding, Nomic Embed, BGE-M3 |
| Image Embeddings | Image similarity and visual search | Nomic Embed Vision |
| Text Reranking | Reranking search candidates by relevance | BGE M3 Reranker |
| Speech-to-Text | Audio transcription | Whisper (tiny through large-v3-turbo) |
| Sentiment Analysis | Sentiment and emotion detection | LM-Kit Sentiment Analysis (finetuned) |
| Math | Mathematical reasoning | GLM 4.7, GPT OSS, Qwen 3 (4B+), Falcon H1R |
| Image Segmentation | Image partitioning into regions | U2-Net |
For detailed descriptions and benchmark data for each family, see Model Families and Benchmarks.
You can filter models by capability in code:
```csharp
using LMKit.Model;

var chatModels = ModelCard.GetPredefinedModelCards()
    .Where(c => c.Capabilities.HasFlag(ModelCapabilities.Chat))
    .ToList();

Console.WriteLine($"Found {chatModels.Count} chat-capable models");
```
## Step 2: Size the Model to Your Hardware
Model size directly determines how much memory (RAM or VRAM) you need. The general rule: larger models produce better outputs but require more powerful hardware.
### Quick Sizing Guide
| Model Size | RAM / VRAM Needed (4-bit) | Hardware | Typical Use |
|---|---|---|---|
| Under 1B | ~1 GB | CPU | Embeddings, lightweight classification |
| 1B to 3B | 1 to 2 GB | CPU or entry-level GPU | Simple chat, basic classification, translation |
| 4B to 8B | 3 to 6 GB | GPU with 6+ GB VRAM | General-purpose chat, RAG, tool calling |
| 12B to 14B | 8 to 10 GB | GPU with 10+ GB VRAM | High-quality chat, complex reasoning |
| 20B to 30B | 12 to 20 GB | GPU with 16+ GB VRAM or multi-GPU | Advanced reasoning, large-scale production. MoE models (GPT OSS, GLM 4.7) activate only ~3B params per token, delivering 20B/30B quality at lower compute cost. |
| 70B | 40+ GB | Multi-GPU setup | Maximum quality, enterprise server workloads |
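The RAM/VRAM figures above follow a simple rule of thumb that can be sketched as a quick calculation. This is an illustrative estimate only, not the SDK's own estimator; the effective bit-width and overhead percentage are assumptions:

```csharp
// Rough memory estimate for a quantized model (a sketch, not an LM-Kit API):
// weights ≈ parameters × effective bits per weight / 8, plus ~20% headroom
// for runtime buffers. Both figures are assumptions for illustration.
static double EstimateMemoryGB(double parameterCount, double bitsPerWeight = 4.5)
{
    double weightBytes = parameterCount * bitsPerWeight / 8;
    return weightBytes * 1.2 / 1e9; // +20% overhead, result in GB
}

Console.WriteLine($"4B model:  ~{EstimateMemoryGB(4e9):F1} GB");
Console.WriteLine($"20B model: ~{EstimateMemoryGB(20e9):F1} GB");
```

The results land inside the ranges in the table (roughly 2.7 GB for a 4B model, 13.5 GB for a 20B dense model); MoE models need the full weight footprint in memory even though only a few billion parameters are active per token.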
### Recommended Starting Points
| Use Case | Recommended Model | Why | How-To Guide |
|---|---|---|---|
| General-purpose assistant | gemma3:4b or qwen3:4b | Good quality-to-size ratio, vision support | Build a Conversational Assistant |
| Budget hardware (CPU only) | gemma3:1b or qwen3:1.7b | Fast on CPU, acceptable quality | Load Model and Generate |
| High-quality reasoning | gptoss:20b or glm4.7-flash | MoE efficiency: only ~3B active params with 20B/30B-class reasoning quality | Control Reasoning and Chain-of-Thought |
| Reasoning on small hardware | falcon-h1r:7b or qwen3:8b | Falcon H1R scores 88% on AIME 2024, outperforming many larger models on math | Control Reasoning and Chain-of-Thought |
| Agentic coding | devstral-small2 or glm4.7-flash | Top SWE-bench scores, agentic multi-file coding | Build a Function-Calling Agent |
| Code generation | devstral-small2 or gptoss:20b | Specialized for code and tool-driven development | Extract Structured Data |
| Tool-calling agent | glm4.7-flash or qwen3:8b | GLM 4.7 leads agentic benchmarks; Qwen 3 offers native MCP support | Create an Agent with Tools |
| Vision / multimodal | gemma3:12b or qwen3-vl:8b | Strong vision with reasoning. For lighter hardware, gemma3:4b or qwen3-vl:2b | Analyze Images with Vision |
| Embeddings for RAG | embeddinggemma-300m or qwen3-embedding:0.6b | Embedding Gemma is the top open model under 500M parameters on MTEB; Qwen 3 Embedding for higher accuracy and multilingual coverage | Build a RAG Pipeline |
| Multilingual embeddings | qwen3-embedding:8b or bge-m3 | Broad language coverage for cross-lingual RAG | Build Semantic Search |
| Speech-to-text | whisper-large-turbo3 | Best speed/quality trade-off | Transcribe Audio |
| Long context (100K+ tokens) | granite4-h:3b or granite4-h:7b | Up to 1M-token context with hybrid Mamba-2 architecture | Handle Long Inputs |
| Advanced reasoning (large) | qwq or nemotron3-nano | 32B/30B class, top-tier math and reasoning | Control Reasoning and Chain-of-Thought |
Need more guidance? See Model Recommendations for GPU-specific picks, ready-made multi-model stacks, and upgrade paths.
## Step 3: Measure Performance on Your Hardware
Instead of guessing, use the built-in performance scorer to evaluate how well each model will run on your specific machine:
```csharp
using LMKit.Model;

var models = ModelCard.GetPredefinedModelCards();

foreach (var card in models.Where(c => c.Capabilities.HasFlag(ModelCapabilities.Chat)))
{
    float score = LM.DeviceConfiguration.GetPerformanceScore(card);

    string rating = score > 0.7f ? "Good"
                  : score > 0.4f ? "Acceptable"
                  : "Too slow";

    Console.WriteLine($"{card.ModelID,-30} Score: {score:F2} ({rating})");
}
```
| Score | Meaning |
|---|---|
| 0.7 to 1.0 | Model runs comfortably on your hardware. |
| 0.4 to 0.7 | Model works but may be slow. Consider partial GPU offloading. |
| Below 0.4 | Model is too large for your hardware. Choose a smaller variant. |
### Auto-Filter by Hardware
You can also let the SDK drop models that are too small when your hardware can handle larger ones:
```csharp
// Drops smaller siblings when a larger model in the same family scores 1.0
var bestModels = ModelCard.GetPredefinedModelCards(dropSmallerModels: true);
```
## Step 4: Understand Quantization
All models in the LM-Kit catalog are distributed as pre-quantized files, primarily in 4-bit (Q4_K_M) format. Quantization compresses model weights to reduce file size and memory usage with minimal quality loss.
| Precision | File Size (relative) | Quality | Use Case |
|---|---|---|---|
| 4-bit (Q4_K_M) | ~1x (baseline) | Very good for most tasks | Recommended default |
| 8-bit (Q8) | ~2x | Slightly better | When quality matters more than memory |
| 16-bit (F16) | ~4x | Original precision | Embedding models, fine-tuning base |
For most use cases, the default 4-bit quantization in the catalog provides the best balance between quality and resource usage. You do not need to select a quantization level manually.
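The relative file sizes in the table follow directly from bits per weight. A quick arithmetic sketch; the effective bit-widths below are rough assumptions for GGUF-style formats (K-quants store scales and other metadata alongside the weights), not figures from the LM-Kit catalog:

```csharp
// Approximate on-disk size: parameters × effective bits per weight / 8.
// Effective bit-widths are assumptions for illustration.
double paramCount = 4e9; // a 4B-parameter model

foreach (var (name, bits) in new[] { ("Q4_K_M", 4.5), ("Q8", 8.5), ("F16", 16.0) })
{
    double gb = paramCount * bits / 8 / 1e9;
    Console.WriteLine($"{name,-7} ~{gb:F1} GB");
}
```

For a 4B model this works out to roughly 2.3 GB at 4-bit versus 8 GB at F16, which is why the catalog defaults to Q4_K_M.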
## Step 5: Consider Context Length
Context length determines how much text the model can process in a single inference pass. Longer context means the model can handle larger documents, longer conversations, or more retrieved chunks in a RAG pipeline.
| Context Length | Typical Use |
|---|---|
| 2K to 4K tokens | Short prompts, simple Q&A |
| 8K to 32K tokens | Multi-turn chat, moderate documents |
| 128K tokens | Long documents, extended conversations |
| 1M tokens | Entire codebases, book-length documents |
Most models in the catalog support 8K to 128K tokens. A few specialized models (Granite 4 Hybrid, Nemotron 3 Nano) support up to 1M tokens.
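Context length matters for memory because the KV cache grows linearly with it. A back-of-the-envelope sketch; the layer, head, and dimension figures below are illustrative assumptions for a mid-size transformer, not a specific catalog model:

```csharp
// KV-cache size ≈ 2 (K and V) × layers × kvHeads × headDim × contextTokens × bytesPerElement.
// Architecture numbers are assumed for illustration; 2 bytes per element = F16 cache.
static double KvCacheGB(int layers, int kvHeads, int headDim, int contextTokens, int bytesPerElement = 2)
{
    return 2.0 * layers * kvHeads * headDim * contextTokens * bytesPerElement / 1e9;
}

Console.WriteLine($"8K context:   ~{KvCacheGB(32, 8, 128, 8192):F1} GB");
Console.WriteLine($"128K context: ~{KvCacheGB(32, 8, 128, 131072):F1} GB");
```

Under these assumptions, moving from 8K to 128K context grows the cache from about 1 GB to about 17 GB, which is why long-context workloads often need more VRAM than the weights alone suggest.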
The SDK can recommend an optimal context size based on your available memory:
```csharp
using LMKit.Model;

using LM model = LM.LoadFromModelID("gemma3:4b");

int optimalContext = LM.DeviceConfiguration.GetOptimalContextSize(model);
Console.WriteLine($"Recommended context size: {optimalContext} tokens");
```
Note: Larger context sizes consume more VRAM for the KV cache. If you run into memory issues, reduce the context size or enable KV cache recycling with `Configuration.EnableKVCacheRecycling = true`.
For precise, hardware-aware estimation that accounts for KV cache and current GPU usage, use `MemoryEstimation.FitParameters()`. See Estimating Memory and Context Size for the full guide.
## Step 6: Load and Verify
Once you have chosen a model, load it and verify that it is running on the expected backend:
```csharp
using LMKit.Global;
using LMKit.Model;

Runtime.Initialize();

using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rDownloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}% "); return true; });

Console.WriteLine($"\nModel: {model.Name}");
Console.WriteLine($"Parameters: {model.ParameterCount / 1e9:F1}B");
Console.WriteLine($"Context: {model.ContextLength} tokens");
Console.WriteLine($"Backend: {Runtime.Backend}");
Console.WriteLine($"GPU layers: {model.GpuLayerCount}");
Console.WriteLine($"Has vision: {model.HasVision}");
Console.WriteLine($"Has tools: {model.HasToolCalls}");
Console.WriteLine($"Has reasoning: {model.HasReasoning}");
```
## Model Storage and Caching
Downloaded models are stored locally so subsequent loads are instant. The SDK resolves the storage directory in this order:
- Programmatic: `Configuration.ModelStorageDirectory = "D:/my-models";`
- Environment variable: `LMKIT_MODELS_DIR`
- Default: `%APPDATA%/LM-Kit/models` (Windows) or `~/.local/share/LM-Kit/models` (Linux/macOS)
You can also pre-download a model without loading it:
```csharp
var card = ModelCard.GetPredefinedModelCardByModelID("gemma3:4b");

if (!card.IsLocallyAvailable)
{
    await card.DownloadAsync((path, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r {(double)read / len.Value * 100:F1}%");
        return true;
    });
}

Console.WriteLine($"Model stored at: {card.LocalPath}");
```
To validate file integrity after download:
```csharp
bool valid = card.ValidateFileChecksum();
Console.WriteLine($"Checksum valid: {valid}");
```
## Loading Custom Models
You are not limited to the predefined catalog. LM-Kit.NET supports any GGUF-compatible model:
```csharp
// From a local file
using LM localModel = new LM("path/to/my-model.gguf");

// From a Hugging Face URL (auto-downloads)
using LM remoteModel = new LM(new Uri("https://huggingface.co/org/repo/resolve/main/model.gguf"));

// Quick metadata inspection (without loading weights)
var card = ModelCard.CreateFromFile("path/to/my-model.gguf");
Console.WriteLine($"Architecture: {card.Architecture}");
Console.WriteLine($"Parameters: {card.ParameterCount}");
Console.WriteLine($"Context: {card.ContextLength}");
```
## Decision Flowchart
- What is my task? Pick a capability (Chat, Vision, Embeddings, Speech-to-Text, etc.).
- What hardware do I have? Check the Hardware Quick Pick table or run `GetPerformanceScore` to find which sizes fit.
- Do I need special features? Tool calling, reasoning, vision, long context?
- Start with the recommended model from the table above, then experiment.
## Next Steps
- Model Recommendations: GPU-specific picks, multi-model stacks, and upgrade paths.
- Model Families and Benchmarks: detailed descriptions and benchmark data for every model family.
- Model Catalog: browse all available models with interactive filtering.
- Configure GPU Backends: set up GPU acceleration for faster inference.
- Distributed Inference Across Multiple GPUs: split large models across multiple GPUs.
- Understanding Model Loading and Caching: learn about download behavior, caching, and model properties.
- Estimating Memory and Context Size: validate whether a model fits and find the optimal context size before loading.
- Your First AI Agent: build a working agent with tools.