How Do I Choose the Right Model Size for My Hardware?


TL;DR

Start with a simple rule: the model file size is a good approximation of the VRAM (or RAM) it will consume. A 5 GB model file needs roughly 5 GB of VRAM for GPU inference, plus additional memory for context. LM-Kit.NET provides a MemoryEstimation.FitParameters() API that calculates the exact context size and GPU layer count your hardware can support before you load the model.


The Quick Decision Table

Use this table to pick a model based on your available VRAM (GPU) or RAM (CPU):

| Available Memory | Recommended Models | Typical Use Cases |
|------------------|--------------------|-------------------|
| 2 GB | gemma3:1b, qwen3.5:0.8b | Edge devices, simple chat, classification |
| 4 GB | qwen3.5:4b, gemma3:4b, phi4-mini:3.8b | General chat, tool calling, summarization |
| 6 to 8 GB | qwen3.5:9b, gemma3:12b | Strong chat, reasoning, agents, code generation |
| 12 to 16 GB | phi4:14.7b, gptoss:20b, gemma3:27b | High-quality generation, advanced reasoning, long context |
| 24 GB+ | qwen3.5:27b, glm4.7-flash, qwen2.5-vl:32b | Maximum quality, large vision models |

These recommendations assume Q4_K_M quantization (4-bit), which is the default for all models in the LM-Kit.NET catalog.
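
The file-size rule from the TL;DR follows directly from the quantization: a quantized file weighs roughly parameters × bits-per-weight / 8 bytes. As a rough sketch (this is back-of-envelope arithmetic, not an LM-Kit.NET API; Q4_K_M averages about 4.5 to 4.8 bits per weight once its mixed-precision layers are included):

```csharp
using System;

// Back-of-envelope estimate: file size ≈ parameters × bits-per-weight / 8.
// The 1e9 for "billion parameters" and the 1e9 bytes per GB cancel out.
static double EstimateFileSizeGb(double billionParams, double bitsPerWeight = 4.6)
    => billionParams * bitsPerWeight / 8.0;

Console.WriteLine($"9B @ Q4_K_M  ≈ {EstimateFileSizeGb(9):F1} GB");   // ≈ 5.2 GB
Console.WriteLine($"27B @ Q4_K_M ≈ {EstimateFileSizeGb(27):F1} GB");  // ≈ 15.5 GB
```

This is why a 9B model at 4-bit quantization lands in the 6 to 8 GB row above, while the same model at 8-bit would need roughly twice the memory.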


Understanding Memory Consumption

A model's memory usage has two main components:

  1. Model weights. Determined by the file size. A 5 GB model file loads approximately 5 GB into VRAM or RAM.
  2. KV cache. Grows with the context size (the number of tokens the model can process in a single conversation). Larger context windows require more memory.

This means the same model uses more memory with a 16,384-token context than with a 2,048-token context. If you run out of memory, reducing the context size is often the easiest fix.
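
To make the context-size effect concrete, here is an illustrative sketch of the KV cache calculation. The architecture numbers below (layers, KV heads, head dimension) are assumptions for a typical 8B-class model, not values read from any specific file; the point is the linear scaling with context length:

```csharp
using System;

// The KV cache stores one key and one value vector per layer per token,
// so its size grows linearly with the context length.
static double KvCacheGb(int layers, int kvHeads, int headDim, int contextTokens, int bytesPerValue = 2)
    => 2.0 /* key + value */ * layers * kvHeads * headDim * contextTokens * bytesPerValue
       / (1024.0 * 1024 * 1024);

// Hypothetical 8B-class model: 32 layers, 8 KV heads, head dim 128, fp16 cache.
Console.WriteLine($"2,048-token context:  {KvCacheGb(32, 8, 128, 2048):F2} GiB");   // 0.25 GiB
Console.WriteLine($"16,384-token context: {KvCacheGb(32, 8, 128, 16384):F2} GiB");  // 2.00 GiB
```

Under these assumptions, going from a 2,048-token to a 16,384-token context adds almost 2 GiB on top of the model weights, which is why trimming the context is the quickest way out of an out-of-memory situation.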


Use the MemoryEstimation API

Instead of guessing, let LM-Kit.NET calculate exactly what your hardware can handle:

using LMKit.Model;

// Auto-detect the maximum context size for your GPU
var fit = MemoryEstimation.FitParameters(
    modelPath: "path/to/qwen3.5-9b-Q4_K_M.lmk",
    contextSize: 0  // 0 = auto-detect maximum
);

if (fit.Success)
{
    Console.WriteLine($"Max context size: {fit.ContextSize} tokens");
    Console.WriteLine($"GPU layers: {fit.GpuLayerCount}");
}
else
{
    Console.WriteLine("Model does not fit in available memory.");
}

You can also check a specific context size:

// Check if 16K context fits
var fit = MemoryEstimation.FitParameters(
    modelPath: "path/to/qwen3.5-9b-Q4_K_M.lmk",
    contextSize: 16384
);

This API accounts for your current GPU memory usage, the KV cache, and compute buffers. It gives you a reliable answer before committing to a model load.
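
One practical pattern is to walk a list of candidate models largest-first and load the first one that fits with a usable context. The sketch below uses only the FitParameters call shown above; the file paths and the 4,096-token minimum are placeholder assumptions:

```csharp
using System;
using LMKit.Model;

// Hypothetical candidate list, ordered from largest (preferred) to smallest.
string[] candidates =
{
    "path/to/qwen3.5-27b-Q4_K_M.lmk",
    "path/to/qwen3.5-9b-Q4_K_M.lmk",
    "path/to/qwen3.5-4b-Q4_K_M.lmk"
};

foreach (var path in candidates)
{
    // contextSize: 0 asks for the maximum context this hardware supports.
    var fit = MemoryEstimation.FitParameters(modelPath: path, contextSize: 0);

    // Accept the first model that fits with at least a 4K-token context.
    if (fit.Success && fit.ContextSize >= 4096)
    {
        Console.WriteLine($"Selected {path}: {fit.ContextSize}-token context, " +
                          $"{fit.GpuLayerCount} GPU layers.");
        break;
    }
}
```

Because the estimation runs without loading the model, this loop is cheap even when the first candidates do not fit.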


Choosing by Capability

Not all models support the same features. Match the model to the task you need:

| Task | Required Capability | Recommended Models |
|------|---------------------|--------------------|
| Chat and conversation | Chat | qwen3.5:9b, gemma3:12b, gptoss:20b |
| AI agents with tool calling | ToolsCall | qwen3.5:9b, qwen3.5:27b, gptoss:20b, glm4.7-flash |
| Image understanding | Vision | qwen3.5:9b, gemma3:12b, glm-4.6v-flash (~7 GB) |
| Text embeddings and RAG | TextEmbeddings | qwen3-embedding:0.6b, embeddinggemma-300m, bge-m3 |
| Speech-to-text | SpeechToText | whisper-small (264 MB), whisper-large-turbo3 (874 MB) |
| Code generation | CodeCompletion | qwen3-coder:30b-a3b, qwen3.5:9b, deepseek-coder-v2:16b, gptoss:20b |
| Math and reasoning | Math, Reasoning | qwen3.5:27b, gptoss:20b, glm4.7-flash |
| OCR | OCR | paddleocr-vl:0.9b, glm-ocr, glm-4.6v-flash |

You can check a model's capabilities in the Model Catalog or programmatically:

using LMKit.Model;

using LM model = LM.LoadFromModelID("qwen3.5:9b");

Console.WriteLine($"Chat: {model.Capabilities.HasFlag(ModelCapabilities.Chat)}");
Console.WriteLine($"Tools: {model.Capabilities.HasFlag(ModelCapabilities.ToolsCall)}");
Console.WriteLine($"Vision: {model.Capabilities.HasFlag(ModelCapabilities.Vision)}");
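
When a task needs several capabilities at once, the individual HasFlag checks can be collapsed into a single mask comparison. This assumes ModelCapabilities behaves as a standard .NET flags enum, which the HasFlag usage above suggests:

```csharp
using System;
using LMKit.Model;

// Combine every capability the workload requires into one mask.
var required = ModelCapabilities.Chat | ModelCapabilities.ToolsCall;

using LM model = LM.LoadFromModelID("qwen3.5:9b");

// A bitwise AND checks all required flags in a single comparison.
bool suitable = (model.Capabilities & required) == required;

Console.WriteLine(suitable
    ? "Model supports chat and tool calling."
    : "Model is missing a required capability.");
```

This keeps capability gating in one place when you select models dynamically, for example in the fallback loop from the previous section.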

Model Size vs Quality Trade-Off

Larger models produce higher-quality output, but the improvement is not linear. Here is a practical guide:

  • 1B models handle simple classification, short-form generation, and basic Q&A. They run fast even on CPU.
  • 4B models are the sweet spot for most tool-calling agents and structured extraction tasks. Good balance of quality and speed.
  • 8B models deliver strong multi-turn chat, reasoning, and code generation. This is the most popular size for production agents.
  • 12B to 20B models provide noticeably better reasoning, longer coherent outputs, and more nuanced instruction following.
  • 27B+ models approach the quality ceiling for local inference. Best for high-stakes generation where accuracy matters most.

When in doubt, start with an 8B model. It covers the widest range of tasks with good quality and reasonable hardware requirements.

