How Do I Choose the Right Model Size for My Hardware?


TL;DR

Start with a simple rule: the model file size is a good approximation of the VRAM (or RAM) it will consume. A 5 GB model file needs roughly 5 GB of VRAM for GPU inference, plus additional memory for context. LM-Kit.NET provides a MemoryEstimation.FitParameters() API that calculates the exact context size and GPU layer count your hardware can support before you load the model.


The Quick Decision Table

Use this table to pick a model based on your available VRAM (GPU) or RAM (CPU):

| Available Memory | Recommended Models | Typical Use Cases |
|------------------|--------------------|-------------------|
| 2 GB | gemma3:1b, qwen3.5:0.8b | Edge devices, simple chat, classification |
| 4 GB | qwen3.5:4b, gemma3:4b, phi4-mini:3.8b | General chat, tool calling, summarization |
| 6 to 8 GB | qwen3.5:9b, gemma3:12b | Strong chat, reasoning, agents, code generation |
| 12 to 16 GB | phi4:14.7b, gptoss:20b, gemma3:27b | High-quality generation, advanced reasoning, long context |
| 24 GB+ | qwen3.5:27b, glm4.7-flash, qwen2.5-vl:32b | Maximum quality, large vision models |

These recommendations assume Q4_K_M quantization (4-bit), which is the default for all models in the LM-Kit.NET catalog.
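
The file-size rule from the TL;DR follows directly from the quantization: a quantized file weighs roughly parameters × bits-per-weight / 8 bytes. As a rough sketch (this is back-of-envelope arithmetic, not an LM-Kit.NET API; Q4_K_M averages about 4.5 to 4.8 bits per weight once its mixed-precision layers are included):

```csharp
using System;

// Back-of-envelope estimate: file size ≈ parameters × bits-per-weight / 8.
// The 1e9 for "billion parameters" and the 1e9 bytes per GB cancel out.
static double EstimateFileSizeGb(double billionParams, double bitsPerWeight = 4.6)
    => billionParams * bitsPerWeight / 8.0;

Console.WriteLine($"9B @ Q4_K_M  ≈ {EstimateFileSizeGb(9):F1} GB");   // ≈ 5.2 GB
Console.WriteLine($"27B @ Q4_K_M ≈ {EstimateFileSizeGb(27):F1} GB");  // ≈ 15.5 GB
```

This is why a 9B model at 4-bit quantization lands in the 6 to 8 GB row above, while the same model at 8-bit would need roughly twice the memory.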


Understanding Memory Consumption

A model's memory usage has two main components:

  1. Model weights. Determined by the file size. A 5 GB model file loads approximately 5 GB into VRAM or RAM.
  2. KV cache. Grows with the context size (the number of tokens the model can process in a single conversation). Larger context windows require more memory.

This means the same model uses more memory with a 16,384-token context than with a 2,048-token context. If you run out of memory, reducing the context size is often the easiest fix.
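
To make the context-size effect concrete, here is an illustrative sketch of the KV cache calculation. The architecture numbers below (layers, KV heads, head dimension) are assumptions for a typical 8B-class model, not values read from any specific file; the point is the linear scaling with context length:

```csharp
using System;

// The KV cache stores one key and one value vector per layer per token,
// so its size grows linearly with the context length.
static double KvCacheGb(int layers, int kvHeads, int headDim, int contextTokens, int bytesPerValue = 2)
    => 2.0 /* key + value */ * layers * kvHeads * headDim * contextTokens * bytesPerValue
       / (1024.0 * 1024 * 1024);

// Hypothetical 8B-class model: 32 layers, 8 KV heads, head dim 128, fp16 cache.
Console.WriteLine($"2,048-token context:  {KvCacheGb(32, 8, 128, 2048):F2} GiB");   // 0.25 GiB
Console.WriteLine($"16,384-token context: {KvCacheGb(32, 8, 128, 16384):F2} GiB");  // 2.00 GiB
```

Under these assumptions, going from a 2,048-token to a 16,384-token context adds almost 2 GiB on top of the model weights, which is why trimming the context is the quickest way out of an out-of-memory situation.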


Use the MemoryEstimation API

Instead of guessing, let LM-Kit.NET calculate exactly what your hardware can handle:

using LMKit.Model;

// Auto-detect the maximum context size for your GPU
var fit = MemoryEstimation.FitParameters(
    modelPath: "path/to/qwen3.5-9b-Q4_K_M.lmk",
    contextSize: 0  // 0 = auto-detect maximum
);

if (fit.Success)
{
    Console.WriteLine($"Max context size: {fit.ContextSize} tokens");
    Console.WriteLine($"GPU layers: {fit.GpuLayerCount}");
}
else
{
    Console.WriteLine("Model does not fit in available memory.");
}

You can also check a specific context size:

// Check if 16K context fits
var fit = MemoryEstimation.FitParameters(
    modelPath: "path/to/qwen3.5-9b-Q4_K_M.lmk",
    contextSize: 16384
);

This API accounts for your current GPU memory usage, the KV cache, and compute buffers. It gives you a reliable answer before committing to a model load.
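
One practical pattern is to walk a list of candidate models largest-first and load the first one that fits with a usable context. The sketch below uses only the FitParameters call shown above; the file paths and the 4,096-token minimum are placeholder assumptions:

```csharp
using System;
using LMKit.Model;

// Hypothetical candidate list, ordered from largest (preferred) to smallest.
string[] candidates =
{
    "path/to/qwen3.5-27b-Q4_K_M.lmk",
    "path/to/qwen3.5-9b-Q4_K_M.lmk",
    "path/to/qwen3.5-4b-Q4_K_M.lmk"
};

foreach (var path in candidates)
{
    // contextSize: 0 asks for the maximum context this hardware supports.
    var fit = MemoryEstimation.FitParameters(modelPath: path, contextSize: 0);

    // Accept the first model that fits with at least a 4K-token context.
    if (fit.Success && fit.ContextSize >= 4096)
    {
        Console.WriteLine($"Selected {path}: {fit.ContextSize}-token context, " +
                          $"{fit.GpuLayerCount} GPU layers.");
        break;
    }
}
```

Because the estimation runs without loading the model, this loop is cheap even when the first candidates do not fit.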


Choosing by Capability

Not all models support the same features. Match the model to the task you need:

| Task | Required Capability | Recommended Models |
|------|---------------------|--------------------|
| Chat and conversation | Chat | qwen3.5:9b, gemma3:12b, gptoss:20b |
| AI agents with tool calling | ToolsCall | qwen3.5:9b, qwen3.5:27b, gptoss:20b, glm4.7-flash |
| Image understanding | Vision | qwen3.5:9b, gemma3:12b, glm-4.6v-flash (~7 GB) |
| Text embeddings and RAG | TextEmbeddings | qwen3-embedding:0.6b, embeddinggemma-300m, bge-m3 |
| Speech-to-text | SpeechToText | whisper-small (264 MB), whisper-large-turbo3 (874 MB) |
| Code generation | CodeCompletion | qwen3-coder:30b-a3b, qwen3.5:9b, deepseek-coder-v2:16b, gptoss:20b |
| Math and reasoning | Math, Reasoning | qwen3.5:27b, gptoss:20b, glm4.7-flash |
| OCR | OCR | paddleocr-vl:0.9b, glm-ocr, glm-4.6v-flash |

You can check a model's capabilities in the Model Catalog or programmatically:

using LMKit.Model;

using LM model = LM.LoadFromModelID("qwen3.5:9b");

Console.WriteLine($"Chat: {model.Capabilities.HasFlag(ModelCapabilities.Chat)}");
Console.WriteLine($"Tools: {model.Capabilities.HasFlag(ModelCapabilities.ToolsCall)}");
Console.WriteLine($"Vision: {model.Capabilities.HasFlag(ModelCapabilities.Vision)}");
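
When a task needs several capabilities at once, the individual HasFlag checks can be collapsed into a single mask comparison. This assumes ModelCapabilities behaves as a standard .NET flags enum, which the HasFlag usage above suggests:

```csharp
using System;
using LMKit.Model;

// Combine every capability the workload requires into one mask.
var required = ModelCapabilities.Chat | ModelCapabilities.ToolsCall;

using LM model = LM.LoadFromModelID("qwen3.5:9b");

// A bitwise AND checks all required flags in a single comparison.
bool suitable = (model.Capabilities & required) == required;

Console.WriteLine(suitable
    ? "Model supports chat and tool calling."
    : "Model is missing a required capability.");
```

This keeps capability gating in one place when you select models dynamically, for example in the fallback loop from the previous section.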

Model Size vs Quality Trade-Off

Larger models produce higher-quality output, but the improvement is not linear. Here is a practical guide:

  • 1B models handle simple classification, short-form generation, and basic Q&A. They run fast even on CPU.
  • 4B models are the sweet spot for most tool-calling agents and structured extraction tasks. Good balance of quality and speed.
  • 8B models deliver strong multi-turn chat, reasoning, and code generation. This is the most popular size for production agents.
  • 12B to 20B models provide noticeably better reasoning, longer coherent outputs, and more nuanced instruction following.
  • 27B+ models approach the quality ceiling for local inference. Best for high-stakes generation where accuracy matters most.

When in doubt, start with an 8B model. It covers the widest range of tasks with good quality and reasonable hardware requirements.

