# Choosing the Right Model for Your Use Case and Hardware
LM-Kit.NET ships with a curated catalog of 60+ models that have been validated and benchmarked for compatibility. Selecting the right model depends on your task, your hardware, and the trade-off between quality and speed. This guide walks you through the decision process.
Recently added to the catalog: GPT OSS 20B (OpenAI, MoE reasoning), GLM 4.7 Flash (Z.ai, MoE coding/agentic), Falcon H1R 7B (hybrid Mamba-2 reasoning), Devstral Small 2 (agentic coding, 393K context), Qwen 3 VL 30B (MoE vision), Granite 4 Hybrid (1M context), Nemotron 3 Nano 30B (MoE, 1M context), SmolLM3 3B.
## TL;DR

```csharp
using LMKit.Model;

// Load a model by its catalog ID (auto-downloads if needed)
using LM model = LM.LoadFromModelID("gemma3:4b");
```
| Use Case | Recommended Model |
|---|---|
| General-purpose assistant | gemma3:4b or qwen3:4b |
| Budget hardware (CPU only) | gemma3:1b or qwen3:1.7b |
| High-quality reasoning | gptoss:20b or glm4.7-flash |
| Tool-calling agent | glm4.7-flash or qwen3:8b |
| Agentic coding | devstral-small2 or glm4.7-flash |
| Vision / multimodal | gemma3:12b or qwen3-vl:8b |
| Embeddings for RAG | embeddinggemma-300m or qwen3-embedding:0.6b |
| Speech-to-text | whisper-large-turbo3 |
Browse all available models in the Model Catalog. For GPU-specific picks, multi-model stacks, and upgrade paths, see Model Recommendations.
## The Model Catalog
LM-Kit maintains a predefined catalog of models hosted on Hugging Face. Every model in the catalog has been tested for correctness, performance, and stability with the SDK.
You can browse the full catalog interactively in the Model Catalog page, which lets you filter by capability, size, and format.
In code, you can retrieve the full catalog programmatically:
```csharp
using LMKit.Model;

// Get all predefined models
List<ModelCard> models = ModelCard.GetPredefinedModelCards();

foreach (var card in models)
{
    Console.WriteLine($"{card.ModelID,-30} {card.ParameterCount / 1e9,5:F1}B {card.Capabilities}");
}
```
To load a model by its catalog ID:
```csharp
using LMKit.Model;

using LM model = LM.LoadFromModelID("gemma3:4b");
```

`LoadFromModelID` downloads the model automatically if it is not already cached locally.
## Step 1: Match the Model to Your Task
Every model in the catalog has a set of capabilities that describe what it can do. Choose a model whose capabilities match your use case.
| Capability | Description | Example Models |
|---|---|---|
| Chat | Multi-turn dialogue, Q&A, assistants | Gemma 3, Qwen 3, GPT OSS, GLM 4.7, Phi 4 |
| Text Generation | Content creation, summarization, rewriting | Gemma 3, Qwen 3, GPT OSS, GLM 4.7, Mistral |
| Code Completion | Code generation and completion | Devstral, GLM 4.7, GPT OSS, DeepSeek Coder, Falcon H1R |
| Reasoning | Multi-step reasoning, chain-of-thought | GPT OSS, GLM 4.7, Falcon H1R, Magistral, QwQ, Nemotron 3 Nano |
| Tools Call | Function/tool invocation by the model | GLM 4.7, Qwen 3, GPT OSS, Mistral Small 3.2, Ministral 3, Granite 4 Hybrid |
| Vision | Image understanding, visual Q&A | Gemma 3 (4B+), Qwen 3 VL, Ministral 3, Devstral, MiniCPM, Pixtral |
| Text Embeddings | Semantic similarity, clustering, RAG retrieval | Embedding Gemma, Qwen 3 Embedding, Nomic Embed, BGE-M3 |
| Image Embeddings | Image similarity and visual search | Nomic Embed Vision |
| Text Reranking | Reranking search candidates by relevance | BGE M3 Reranker |
| Speech-to-Text | Audio transcription | Whisper (tiny through large-v3-turbo) |
| Sentiment Analysis | Sentiment and emotion detection | LM-Kit Sentiment Analysis (finetuned) |
| Math | Mathematical reasoning | GLM 4.7, GPT OSS, Qwen 3 (4B+), Falcon H1R |
| Image Segmentation | Image partitioning into regions | U2-Net |
For detailed descriptions and benchmark data for each family, see Model Families and Benchmarks.
You can filter models by capability in code:
```csharp
using LMKit.Model;

var chatModels = ModelCard.GetPredefinedModelCards()
    .Where(c => c.Capabilities.HasFlag(ModelCapabilities.Chat))
    .ToList();

Console.WriteLine($"Found {chatModels.Count} chat-capable models");
```
## Step 2: Size the Model to Your Hardware
Model size directly determines how much memory (RAM or VRAM) you need. The general rule: larger models produce better outputs but require more powerful hardware.
### Quick Sizing Guide
| Model Size | RAM / VRAM Needed (4-bit) | Hardware | Typical Use |
|---|---|---|---|
| Under 1B | ~1 GB | CPU | Embeddings, lightweight classification |
| 1B to 3B | 1 to 2 GB | CPU or entry-level GPU | Simple chat, basic classification, translation |
| 4B to 8B | 3 to 6 GB | GPU with 6+ GB VRAM | General-purpose chat, RAG, tool calling |
| 12B to 14B | 8 to 10 GB | GPU with 10+ GB VRAM | High-quality chat, complex reasoning |
| 20B to 30B | 12 to 20 GB | GPU with 16+ GB VRAM or multi-GPU | Advanced reasoning, large-scale production. MoE models (GPT OSS, GLM 4.7) activate only ~3B params per token, delivering 20B/30B quality at lower compute cost. |
| 70B | 40+ GB | Multi-GPU setup | Maximum quality, enterprise server workloads |
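The RAM/VRAM figures above follow a simple rule of thumb that can be sketched as a quick calculation. This is an illustrative estimate only, not the SDK's own estimator; the effective bit-width and overhead percentage are assumptions:

```csharp
// Rough memory estimate for a quantized model (a sketch, not an LM-Kit API):
// weights ≈ parameters × effective bits per weight / 8, plus ~20% headroom
// for runtime buffers. Both figures are assumptions for illustration.
static double EstimateMemoryGB(double parameterCount, double bitsPerWeight = 4.5)
{
    double weightBytes = parameterCount * bitsPerWeight / 8;
    return weightBytes * 1.2 / 1e9; // +20% overhead, result in GB
}

Console.WriteLine($"4B model:  ~{EstimateMemoryGB(4e9):F1} GB");
Console.WriteLine($"20B model: ~{EstimateMemoryGB(20e9):F1} GB");
```

The results land inside the ranges in the table (roughly 2.7 GB for a 4B model, 13.5 GB for a 20B dense model); MoE models need the full weight footprint in memory even though only a few billion parameters are active per token.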
### Recommended Starting Points
| Use Case | Recommended Model | Why | How-To Guide |
|---|---|---|---|
| General-purpose assistant | gemma3:4b or qwen3:4b | Good quality-to-size ratio, vision support | Build a Conversational Assistant |
| Budget hardware (CPU only) | gemma3:1b or qwen3:1.7b | Fast on CPU, acceptable quality | Load Model and Generate |
| High-quality reasoning | gptoss:20b or glm4.7-flash | MoE efficiency: only ~3B active params with 20B/30B-class reasoning quality | Control Reasoning and Chain-of-Thought |
| Reasoning on small hardware | falcon-h1r:7b or qwen3:8b | Falcon H1R scores 88% on AIME 2024, outperforming many larger models on math | Control Reasoning and Chain-of-Thought |
| Agentic coding | devstral-small2 or glm4.7-flash | Top SWE-bench scores, agentic multi-file coding | Build a Function-Calling Agent |
| Code generation | devstral-small2 or gptoss:20b | Specialized for code and tool-driven development | Extract Structured Data |
| Tool-calling agent | glm4.7-flash or qwen3:8b | GLM 4.7 leads agentic benchmarks; Qwen 3 offers native MCP support | Create an Agent with Tools |
| Vision / multimodal | gemma3:12b or qwen3-vl:8b | Strong vision with reasoning. For lighter hardware, gemma3:4b or qwen3-vl:2b | Analyze Images with Vision |
| Embeddings for RAG | embeddinggemma-300m or qwen3-embedding:0.6b | Embedding Gemma is the top open model under 500M parameters on MTEB; Qwen 3 Embedding for higher accuracy and multilingual coverage | Build a RAG Pipeline |
| Multilingual embeddings | qwen3-embedding:8b or bge-m3 | Broad language coverage for cross-lingual RAG | Build Semantic Search |
| Speech-to-text | whisper-large-turbo3 | Best speed/quality trade-off | Transcribe Audio |
| Long context (100K+ tokens) | granite4-h:3b or granite4-h:7b | Up to 1M-token context with hybrid Mamba-2 architecture | Handle Long Inputs |
| Advanced reasoning (large) | qwq or nemotron3-nano | 32B/30B class, top-tier math and reasoning | Control Reasoning and Chain-of-Thought |
Need more guidance? See Model Recommendations for GPU-specific picks, ready-made multi-model stacks, and upgrade paths.
## Step 3: Measure Performance on Your Hardware
Instead of guessing, use the built-in performance scorer to evaluate how well each model will run on your specific machine:
```csharp
using LMKit.Model;

var models = ModelCard.GetPredefinedModelCards();

foreach (var card in models.Where(c => c.Capabilities.HasFlag(ModelCapabilities.Chat)))
{
    float score = LM.DeviceConfiguration.GetPerformanceScore(card);

    string rating = score > 0.7f ? "Good"
                  : score > 0.4f ? "Acceptable"
                  : "Too slow";

    Console.WriteLine($"{card.ModelID,-30} Score: {score:F2} ({rating})");
}
```
| Score | Meaning |
|---|---|
| 0.7 to 1.0 | Model runs comfortably on your hardware. |
| 0.4 to 0.7 | Model works but may be slow. Consider partial GPU offloading. |
| Below 0.4 | Model is too large for your hardware. Choose a smaller variant. |
### Auto-Filter by Hardware
You can also let the SDK drop models that are too small when your hardware can handle larger ones:
```csharp
// Drops smaller siblings when a larger model in the same family scores 1.0
var bestModels = ModelCard.GetPredefinedModelCards(dropSmallerModels: true);
```
## Step 4: Understand Quantization
All models in the LM-Kit catalog are distributed as pre-quantized files, primarily in 4-bit (Q4_K_M) format. Quantization compresses model weights to reduce file size and memory usage with minimal quality loss.
| Precision | File Size (relative) | Quality | Use Case |
|---|---|---|---|
| 4-bit (Q4_K_M) | ~1x (baseline) | Very good for most tasks | Recommended default |
| 8-bit (Q8) | ~2x | Slightly better | When quality matters more than memory |
| 16-bit (F16) | ~4x | Original precision | Embedding models, fine-tuning base |
For most use cases, the default 4-bit quantization in the catalog provides the best balance between quality and resource usage. You do not need to select a quantization level manually.
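The relative file sizes in the table follow directly from bits per weight. A quick arithmetic sketch; the effective bit-widths below are rough assumptions for GGUF-style formats (K-quants store scales and other metadata alongside the weights), not figures from the LM-Kit catalog:

```csharp
// Approximate on-disk size: parameters × effective bits per weight / 8.
// Effective bit-widths are assumptions for illustration.
double paramCount = 4e9; // a 4B-parameter model

foreach (var (name, bits) in new[] { ("Q4_K_M", 4.5), ("Q8", 8.5), ("F16", 16.0) })
{
    double gb = paramCount * bits / 8 / 1e9;
    Console.WriteLine($"{name,-7} ~{gb:F1} GB");
}
```

For a 4B model this works out to roughly 2.3 GB at 4-bit versus 8 GB at F16, which is why the catalog defaults to Q4_K_M.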
## Step 5: Consider Context Length
Context length determines how much text the model can process in a single inference pass. Longer context means the model can handle larger documents, longer conversations, or more retrieved chunks in a RAG pipeline.
| Context Length | Typical Use |
|---|---|
| 2K to 4K tokens | Short prompts, simple Q&A |
| 8K to 32K tokens | Multi-turn chat, moderate documents |
| 128K tokens | Long documents, extended conversations |
| 1M tokens | Entire codebases, book-length documents |
Most models in the catalog support 8K to 128K tokens. A few specialized models (Granite 4 Hybrid, Nemotron 3 Nano) support up to 1M tokens.
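Context length matters for memory because the KV cache grows linearly with it. A back-of-the-envelope sketch; the layer, head, and dimension figures below are illustrative assumptions for a mid-size transformer, not a specific catalog model:

```csharp
// KV-cache size ≈ 2 (K and V) × layers × kvHeads × headDim × contextTokens × bytesPerElement.
// Architecture numbers are assumed for illustration; 2 bytes per element = F16 cache.
static double KvCacheGB(int layers, int kvHeads, int headDim, int contextTokens, int bytesPerElement = 2)
{
    return 2.0 * layers * kvHeads * headDim * contextTokens * bytesPerElement / 1e9;
}

Console.WriteLine($"8K context:   ~{KvCacheGB(32, 8, 128, 8192):F1} GB");
Console.WriteLine($"128K context: ~{KvCacheGB(32, 8, 128, 131072):F1} GB");
```

Under these assumptions, moving from 8K to 128K context grows the cache from about 1 GB to about 17 GB, which is why long-context workloads often need more VRAM than the weights alone suggest.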
The SDK can recommend an optimal context size based on your available memory:
```csharp
using LMKit.Model;

using LM model = LM.LoadFromModelID("gemma3:4b");

int optimalContext = LM.DeviceConfiguration.GetOptimalContextSize(model);
Console.WriteLine($"Recommended context size: {optimalContext} tokens");
```
Note: Larger context sizes consume more VRAM for the KV cache. If you run into memory issues, reduce the context size or enable KV cache recycling with `Configuration.EnableKVCacheRecycling = true`.
For precise, hardware-aware estimation that accounts for KV cache and current GPU usage, use `MemoryEstimation.FitParameters()`. See Estimating Memory and Context Size for the full guide.
## Step 6: Load and Verify
Once you have chosen a model, load it and verify that it is running on the expected backend:
```csharp
using LMKit.Global;
using LMKit.Model;

Runtime.Initialize();

using LM model = LM.LoadFromModelID("gemma3:4b",
    downloadingProgress: (_, len, read) =>
    {
        if (len.HasValue) Console.Write($"\rDownloading: {(double)read / len.Value * 100:F1}% ");
        return true;
    },
    loadingProgress: p => { Console.Write($"\rLoading: {p * 100:F0}% "); return true; });

Console.WriteLine($"\nModel: {model.Name}");
Console.WriteLine($"Parameters: {model.ParameterCount / 1e9:F1}B");
Console.WriteLine($"Context: {model.ContextLength} tokens");
Console.WriteLine($"Backend: {Runtime.Backend}");
Console.WriteLine($"GPU layers: {model.GpuLayerCount}");
Console.WriteLine($"Has vision: {model.HasVision}");
Console.WriteLine($"Has tools: {model.HasToolCalls}");
Console.WriteLine($"Has reasoning: {model.HasReasoning}");
```
## Model Storage and Caching
Downloaded models are stored locally so subsequent loads are instant. The SDK resolves the storage directory in this order:
- Programmatic: `Configuration.ModelStorageDirectory = "D:/my-models";`
- Environment variable: `LMKIT_MODELS_DIR`
- Default: `%APPDATA%/LM-Kit/models` (Windows) or `~/.local/share/LM-Kit/models` (Linux/macOS)
You can also pre-download a model without loading it:
```csharp
var card = ModelCard.GetPredefinedModelCardByModelID("gemma3:4b");

if (!card.IsLocallyAvailable)
{
    await card.DownloadAsync((path, len, read) =>
    {
        if (len.HasValue) Console.Write($"\r {(double)read / len.Value * 100:F1}%");
        return true;
    });
}

Console.WriteLine($"Model stored at: {card.LocalPath}");
```
To validate file integrity after download:
```csharp
bool valid = card.ValidateFileChecksum();
Console.WriteLine($"Checksum valid: {valid}");
```
## Loading Custom Models
You are not limited to the predefined catalog. LM-Kit.NET supports any GGUF-compatible model:
```csharp
// From a local file
using LM localModel = new LM("path/to/my-model.gguf");

// From a Hugging Face URL (auto-downloads)
using LM remoteModel = new LM(new Uri("https://huggingface.co/org/repo/resolve/main/model.gguf"));

// Quick metadata inspection (without loading weights)
var card = ModelCard.CreateFromFile("path/to/my-model.gguf");
Console.WriteLine($"Architecture: {card.Architecture}");
Console.WriteLine($"Parameters: {card.ParameterCount}");
Console.WriteLine($"Context: {card.ContextLength}");
```
## Decision Flowchart
- What is my task? Pick a capability (Chat, Vision, Embeddings, Speech-to-Text, etc.).
- What hardware do I have? Check the Hardware Quick Pick table or run `GetPerformanceScore` to find which sizes fit.
- Do I need special features? Tool calling, reasoning, vision, long context?
- Start with the recommended model from the table above, then experiment.
## Next Steps
- Model Recommendations: GPU-specific picks, multi-model stacks, and upgrade paths.
- Model Families and Benchmarks: detailed descriptions and benchmark data for every model family.
- Model Catalog: browse all available models with interactive filtering.
- Configure GPU Backends: set up GPU acceleration for faster inference.
- Distributed Inference Across Multiple GPUs: split large models across multiple GPUs.
- Understanding Model Loading and Caching: learn about download behavior, caching, and model properties.
- Estimating Memory and Context Size: validate whether a model fits and find the optimal context size before loading.
- Your First AI Agent: build a working agent with tools.