How Do I Choose the Right Model Size for My Hardware?
TL;DR
Start with a simple rule: the model file size is a good approximation of the VRAM (or RAM) it will consume. A 5 GB model file needs roughly 5 GB of VRAM for GPU inference, plus additional memory for context. LM-Kit.NET provides a MemoryEstimation.FitParameters() API that calculates the exact context size and GPU layer count your hardware can support before you load the model.
The Quick Decision Table
Use this table to pick a model based on your available VRAM (GPU) or RAM (CPU):
| Available Memory | Recommended Models | Typical Use Cases |
|---|---|---|
| 2 GB | gemma3:1b, qwen3.5:0.8b | Edge devices, simple chat, classification |
| 4 GB | qwen3.5:4b, gemma3:4b, phi4-mini:3.8b | General chat, tool calling, summarization |
| 6 to 8 GB | qwen3.5:9b, gemma3:12b | Strong chat, reasoning, agents, code generation |
| 12 to 16 GB | phi4:14.7b, gptoss:20b, gemma3:27b | High-quality generation, advanced reasoning, long context |
| 24 GB+ | qwen3.5:27b, glm4.7-flash, qwen2.5-vl:32b | Maximum quality, large vision models |
These recommendations assume Q4_K_M quantization (4-bit), which is the default for all models in the LM-Kit.NET catalog.
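The tiers above are easy to encode as a small first-run hardware check. The helper below is just a sketch that mirrors the table (same thresholds, same catalog model IDs, assuming Q4_K_M quantization); it is not an LM-Kit.NET API.

```csharp
using System;

static class ModelTiers
{
    // Maps available memory (GB) to the recommended models from the table above.
    // Thresholds and model IDs mirror the table; treat borderline values conservatively.
    public static string[] Recommend(double availableGb)
    {
        if (availableGb >= 24) return new[] { "qwen3.5:27b", "glm4.7-flash", "qwen2.5-vl:32b" };
        if (availableGb >= 12) return new[] { "phi4:14.7b", "gptoss:20b", "gemma3:27b" };
        if (availableGb >= 6)  return new[] { "qwen3.5:9b", "gemma3:12b" };
        if (availableGb >= 4)  return new[] { "qwen3.5:4b", "gemma3:4b", "phi4-mini:3.8b" };
        return new[] { "gemma3:1b", "qwen3.5:0.8b" };
    }
}

Console.WriteLine(string.Join(", ", ModelTiers.Recommend(8))); // qwen3.5:9b, gemma3:12b
```

A lookup like this is only a starting point; the MemoryEstimation API described below gives a real answer for your specific GPU.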
Understanding Memory Consumption
A model's memory usage has two main components:
- Model weights. Determined by the file size. A 5 GB model file loads approximately 5 GB into VRAM or RAM.
- KV cache. Grows with the context size (the number of tokens the model can process in a single conversation). Larger context windows require more memory.
This means the same model uses more memory with a 16,384-token context than with a 2,048-token context. If you run out of memory, reducing the context size is often the easiest fix.
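As a back-of-envelope check, total memory ≈ model file size + KV cache, where a transformer's KV cache is roughly 2 (K and V) × layers × context tokens × KV heads × head dimension × bytes per element. The sketch below plugs in illustrative architecture numbers for a hypothetical 8B-class model (36 layers, 8 KV heads, head dimension 128, 16-bit cache); real models vary, so treat the result as an estimate only.

```csharp
using System;

static class KvCacheEstimate
{
    // Rough KV-cache size in bytes: 2 (K and V) * layers * tokens * kvHeads * headDim * bytesPerElem.
    public static long Bytes(int layers, int contextTokens, int kvHeads, int headDim, int bytesPerElem = 2)
        => 2L * layers * contextTokens * kvHeads * headDim * bytesPerElem;
}

// Illustrative numbers for a hypothetical 8B-class model (assumptions, not catalog data):
long smallCtx = KvCacheEstimate.Bytes(layers: 36, contextTokens: 2048, kvHeads: 8, headDim: 128);
long largeCtx = KvCacheEstimate.Bytes(layers: 36, contextTokens: 16384, kvHeads: 8, headDim: 128);
Console.WriteLine($"2K context:  {smallCtx / (1024.0 * 1024):F0} MB");  // 288 MB
Console.WriteLine($"16K context: {largeCtx / (1024.0 * 1024):F0} MB");  // 2304 MB
```

Because the cache grows linearly with context, the 16K window costs 8× the 2K window here, which is why shrinking the context is usually the cheapest way to make a model fit.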
Use the MemoryEstimation API
Instead of guessing, let LM-Kit.NET calculate exactly what your hardware can handle:
```csharp
using LMKit.Model;

// Auto-detect the maximum context size for your GPU
var fit = MemoryEstimation.FitParameters(
    modelPath: "path/to/qwen3.5-9b-Q4_K_M.lmk",
    contextSize: 0 // 0 = auto-detect maximum
);

if (fit.Success)
{
    Console.WriteLine($"Max context size: {fit.ContextSize} tokens");
    Console.WriteLine($"GPU layers: {fit.GpuLayerCount}");
}
else
{
    Console.WriteLine("Model does not fit in available memory.");
}
```
You can also check a specific context size:
```csharp
// Check whether a 16K context fits
var fit = MemoryEstimation.FitParameters(
    modelPath: "path/to/qwen3.5-9b-Q4_K_M.lmk",
    contextSize: 16384
);
```
This API accounts for your current GPU memory usage, the KV cache, and compute buffers. It gives you a reliable answer before committing to a model load.
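FitParameters also lends itself to a fallback pattern: probe a list of context sizes from largest to smallest and take the first that fits. The helper below is a sketch that separates the search from the probe so it runs without a GPU; in real code the predicate would wrap the MemoryEstimation.FitParameters call shown above, while the stand-in predicate here simply pretends the hardware handles up to 8K tokens.

```csharp
using System;

static class ContextFallback
{
    // Returns the largest candidate context size for which `fits` reports success, or -1 if none fit.
    public static int FindLargest(int[] candidates, Func<int, bool> fits)
    {
        Array.Sort(candidates);
        Array.Reverse(candidates);
        foreach (int size in candidates)
            if (fits(size)) return size;
        return -1;
    }
}

// In production, plug in the real probe from the snippets above:
//   size => MemoryEstimation.FitParameters(modelPath, contextSize: size).Success
// Stand-in predicate for illustration: assume the hardware fits up to 8K tokens.
int chosen = ContextFallback.FindLargest(
    new[] { 32768, 16384, 8192, 4096, 2048 },
    size => size <= 8192);
Console.WriteLine($"Chosen context: {chosen} tokens"); // 8192
```

Probing a handful of power-of-two sizes keeps the check fast while still landing close to the true maximum.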
Choosing by Capability
Not all models support the same features. Match the model to the task you need:
| Task | Required Capability | Recommended Models |
|---|---|---|
| Chat and conversation | Chat | qwen3.5:9b, gemma3:12b, gptoss:20b |
| AI agents with tool calling | ToolsCall | qwen3.5:9b, qwen3.5:27b, gptoss:20b, glm4.7-flash |
| Image understanding | Vision | qwen3.5:9b, gemma3:12b, glm-4.6v-flash (~7 GB) |
| Text embeddings and RAG | TextEmbeddings | qwen3-embedding:0.6b, embeddinggemma-300m, bge-m3 |
| Speech-to-text | SpeechToText | whisper-small (264 MB), whisper-large-turbo3 (874 MB) |
| Code generation | CodeCompletion | qwen3-coder:30b-a3b, qwen3.5:9b, deepseek-coder-v2:16b, gptoss:20b |
| Math and reasoning | Math, Reasoning | qwen3.5:27b, gptoss:20b, glm4.7-flash |
| OCR | OCR | paddleocr-vl:0.9b, glm-ocr, glm-4.6v-flash |
You can check a model's capabilities in the Model Catalog or programmatically:
```csharp
using LMKit.Model;

using LM model = LM.LoadFromModelID("qwen3.5:9b");

Console.WriteLine($"Chat: {model.Capabilities.HasFlag(ModelCapabilities.Chat)}");
Console.WriteLine($"Tools: {model.Capabilities.HasFlag(ModelCapabilities.ToolsCall)}");
Console.WriteLine($"Vision: {model.Capabilities.HasFlag(ModelCapabilities.Vision)}");
```
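The same flag check extends naturally to filtering candidates: walk a list of models and keep the first whose capabilities cover everything the task needs. The sketch below uses a stand-in [Flags] enum and an illustrative capability list so it runs without downloading anything; with LM-Kit.NET you would read model.Capabilities via LM.LoadFromModelID as above, using the library's own ModelCapabilities enum.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for a capabilities flags enum (assumption: LMKit's real values differ).
[Flags]
enum Caps { None = 0, Chat = 1, ToolsCall = 2, Vision = 4 }

static class ModelPicker
{
    // Returns the first model whose capabilities include every required flag, or null.
    public static string PickFirst(IEnumerable<(string Id, Caps Caps)> catalog, Caps required)
    {
        foreach (var (id, caps) in catalog)
            if ((caps & required) == required) return id;
        return null;
    }
}

// Illustrative capability list (for real data, query model.Capabilities as above):
var catalog = new List<(string, Caps)>
{
    ("qwen3.5:4b", Caps.Chat | Caps.ToolsCall),
    ("qwen3.5:9b", Caps.Chat | Caps.ToolsCall | Caps.Vision),
};
Console.WriteLine(ModelPicker.PickFirst(catalog, Caps.Chat | Caps.Vision)); // qwen3.5:9b
```

Ordering the candidate list from smallest to largest model makes this pick the cheapest model that can do the job.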
Model Size vs Quality Trade-Off
Larger models produce higher-quality output, but the improvement is not linear. Here is a practical guide:
- 1B models handle simple classification, short-form generation, and basic Q&A. They run fast even on CPU.
- 4B models are the sweet spot for most tool-calling agents and structured extraction tasks. Good balance of quality and speed.
- 8B models deliver strong multi-turn chat, reasoning, and code generation. This is the most popular size for production agents.
- 12B to 20B models provide noticeably better reasoning, longer coherent outputs, and more nuanced instruction following.
- 27B+ models approach the quality ceiling for local inference. Best for high-stakes generation where accuracy matters most.
When in doubt, start with an 8B model. It covers the widest range of tasks with good quality and reasonable hardware requirements.
📚 Related Content
- Do I need a GPU to run AI models with LM-Kit.NET?: Understand when CPU is enough and when GPU acceleration makes a real difference.
- Model Catalog: Browse every supported model with download sizes, parameter counts, and capabilities.
- Choosing the Right Model: In-depth guide to model selection by use case and performance tier.
- Estimating Memory and Context Size: Detailed walkthrough of the MemoryEstimation API with advanced usage patterns.
- How much disk space do LM-Kit.NET binaries add to my application?: Plan total deployment size including model files and native binaries.