Model Recommendations

This page provides practical, ready-to-use model picks organized three ways: by GPU hardware, by multi-model pipeline, and by upgrade path. All sizes listed are for 4-bit quantized weights.

Looking for the step-by-step decision guide? See Choosing the Right Model. For detailed descriptions of each model family, see Model Families and Benchmarks.


Hardware Quick Pick

Find your GPU below and use the recommended models directly.

| Hardware | VRAM | Best Chat Model | Best Vision Model | Best Embedding |
|---|---|---|---|---|
| CPU only (16 GB+ RAM) | system RAM | gemma3:1b or qwen3:1.7b | qwen3-vl:2b (~1.4 GB) | embeddinggemma-300m |
| Entry GPU (GTX 1660, RTX 3060 6 GB) | 6 GB | gemma3:4b (~3.1 GB) | gemma3:4b (~3.1 GB) | embeddinggemma-300m |
| Mid-range GPU (RTX 4060 8 GB) | 8 GB | qwen3:8b (~5 GB) | qwen3-vl:8b (~5.6 GB) | qwen3-embedding:0.6b |
| High-end GPU (RTX 4070 Ti 12 GB) | 12 GB | gemma3:12b (~7.9 GB) | gemma3:12b (~7.9 GB) | qwen3-embedding:0.6b |
| Enthusiast GPU (RTX 4090 24 GB) | 24 GB | gptoss:20b (~12.1 GB) or devstral-small2 (~15.2 GB) | qwen3-vl:8b + larger chat model | qwen3-embedding:4b |
| Workstation / multi-GPU | 48+ GB | glm4.7-flash (~18.1 GB) + room for embeddings | qwen3-vl:30b | qwen3-embedding:8b |

Tip: Actual VRAM usage is slightly higher than the file size due to the KV cache and runtime overhead. Use GetPerformanceScore for a precise measurement on your machine. See Step 3 in the choosing guide.
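The overhead described in the tip can be budgeted with simple arithmetic. A minimal sketch of that rule of thumb (the KV-cache and overhead figures here are illustrative assumptions, not the library's exact accounting; actual usage depends on context length, batch size, and backend):

```python
def estimate_vram_gb(file_size_gb: float,
                     kv_cache_gb: float = 1.0,
                     overhead_gb: float = 0.5) -> float:
    """Rough VRAM budget: quantized weights + KV cache + runtime overhead.

    The 1.0 GB KV-cache and 0.5 GB overhead defaults are assumptions
    for illustration; measure on your own machine for real numbers.
    """
    return file_size_gb + kv_cache_gb + overhead_gb

# gemma3:12b has a ~7.9 GB file, so budget roughly 9.4 GB under these assumptions
print(round(estimate_vram_gb(7.9), 1))
```

This is why the 7.9 GB gemma3:12b is paired with 12 GB cards rather than 8 GB ones in the table above.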


Model Stack Recipes

Most real-world applications combine multiple models. Here are pre-built stacks for common scenarios with approximate total VRAM requirements.

RAG Pipeline (10 GB VRAM)

Retrieve relevant documents and generate answers grounded in your data.

| Role | Model | Size |
|---|---|---|
| Embedding | embeddinggemma-300m | ~0.3 GB |
| Reranker | bge-m3-reranker | ~0.4 GB |
| Generator | gemma3:12b | ~7.9 GB |

How-to: Build a RAG Pipeline, Improve RAG Results with Reranking
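The three roles wire together in sequence: embed the query, retrieve the nearest chunks, rerank the shortlist, then generate a grounded answer. A minimal sketch of that control flow, where `embed`, `rerank`, and `generate` are hypothetical stand-ins for calls to embeddinggemma-300m, bge-m3-reranker, and gemma3:12b (not the library's actual API):

```python
def dot(a, b):
    """Dot product used as a similarity score between embedding vectors."""
    return sum(x * y for x, y in zip(a, b))

def rag_answer(query, chunks, embed, rerank, generate, top_k=3):
    """Retrieve -> rerank -> generate, the flow used by the stack above."""
    # 1. Coarse retrieval: score every chunk against the query embedding.
    q = embed(query)
    shortlist = sorted(chunks, key=lambda c: -dot(q, embed(c)))[: top_k * 3]
    # 2. Precise reranking on the shortlist only (rerankers are slower per pair).
    best = sorted(shortlist, key=lambda c: -rerank(query, c))[:top_k]
    # 3. Grounded generation with the top chunks as context.
    context = "\n".join(best)
    return generate(f"Answer using only this context:\n{context}\n\nQ: {query}")
```

The design point: the cheap embedding model filters the whole corpus, and the more accurate (but slower) reranker only scores the shortlist before the generator sees it.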

Agentic Assistant with Web Search (14 GB VRAM)

An agent that plans, searches the web, calls tools, and reasons through multi-step tasks.

| Role | Model | Size |
|---|---|---|
| Agent (chat + tools + reasoning) | gptoss:20b | ~12.1 GB |
| Embedding (for memory/context) | embeddinggemma-300m | ~0.3 GB |

How-to: Build an Agent with Web Search, Create an Agent with Tools
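Under the hood, an agent alternates between model output and tool execution until the model stops requesting tools. A minimal sketch of that loop, assuming a simplified two-tuple action protocol (an illustrative stand-in, not the library's actual tool-calling format):

```python
def run_agent(model, tools, task, max_steps=5):
    """Plan/act loop: feed tool results back until the model answers.

    model(transcript) is assumed to return either ("tool", name, arg)
    or ("final", text); real tool-calling protocols are richer.
    """
    transcript = [task]
    for _ in range(max_steps):
        action = model(transcript)
        if action[0] == "final":
            return action[1]
        _, name, arg = action
        result = tools[name](arg)          # e.g. a web-search tool
        # Append the observation so the next step can reason over it.
        transcript.append(f"{name}({arg}) -> {result}")
    return "step budget exhausted"
```

The `max_steps` cap matters in practice: it bounds both latency and the risk of a model looping on the same tool call.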

Document Q&A with Vision (10 GB VRAM)

Chat with PDFs, images, and scanned documents using vision and embeddings.

| Role | Model | Size |
|---|---|---|
| Vision + Chat | gemma3:12b | ~7.9 GB |
| Embedding | qwen3-embedding:0.6b | ~0.6 GB |

How-to: Chat with PDF Documents, Analyze Images with Vision

Meeting Transcription + Summary (10 GB VRAM)

Transcribe audio and generate structured meeting notes.

| Role | Model | Size |
|---|---|---|
| Speech-to-text | whisper-large-turbo3 | ~0.9 GB |
| Summarizer | gemma3:12b | ~7.9 GB |

How-to: Transcribe Audio with Speech-to-Text, Summarize Documents and Text
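The two models run strictly in sequence, so the hand-off is plain function composition. A sketch of the chaining, where `transcribe` and `summarize` are hypothetical stand-ins for whisper-large-turbo3 and gemma3:12b calls:

```python
def meeting_notes(audio_path, transcribe, summarize):
    """Chain speech-to-text into summarization: notes = summarize(transcribe(audio))."""
    transcript = transcribe(audio_path)
    prompt = ("Produce structured meeting notes (decisions, action items) "
              "from this transcript:\n" + transcript)
    return summarize(prompt)
```

Because the stages never overlap, you can also load the models one at a time on tighter hardware, trading load time for a lower peak VRAM requirement than the ~10 GB quoted above.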

Lightweight Edge Deployment (3 GB VRAM or CPU)

Run on devices with minimal resources: laptops, embedded systems, or CPU-only servers.

| Role | Model | Size |
|---|---|---|
| Chat | gemma3:1b or qwen3:1.7b | ~0.8 to 1.3 GB |
| Embedding | embeddinggemma-300m | ~0.3 GB |

How-to: Build a Conversational Assistant with Memory


Model Upgrade Paths

Start small during prototyping, then scale up as your hardware budget or accuracy requirements grow. Models within the same family share the same prompt format and behavior, making upgrades seamless.

Chat and Reasoning

gemma3:1b  →  gemma3:4b  →  gemma3:12b  →  gemma3:27b
 ~0.8 GB       ~3.1 GB       ~7.9 GB        ~17 GB

qwen3:0.6b  →  qwen3:1.7b  →  qwen3:4b  →  qwen3:8b  →  qwen3:14b
 ~0.5 GB        ~1.3 GB        ~2.5 GB      ~5.0 GB       ~9.2 GB

Vision

qwen3-vl:2b  →  qwen3-vl:4b  →  qwen3-vl:8b  →  qwen3-vl:30b
  ~1.4 GB         ~2.9 GB         ~5.6 GB         ~18 GB

gemma3:4b (with vision)  →  gemma3:12b  →  gemma3:27b
     ~3.1 GB                   ~7.9 GB       ~17 GB

Embeddings

embeddinggemma-300m  →  qwen3-embedding:0.6b  →  qwen3-embedding:4b  →  qwen3-embedding:8b
      ~0.3 GB                ~0.6 GB                  ~2.5 GB                ~5.2 GB

Speech-to-Text

whisper-tiny  →  whisper-small  →  whisper-medium  →  whisper-large-turbo3
  ~39 MB           ~244 MB          ~488 MB              ~874 MB

Tip: Switching models within the same family usually requires no code changes. Just change the model ID passed to LoadFromModelID.
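In practice this means the model ID can live in configuration rather than code. A sketch of that pattern, where `load_from_model_id` is a hypothetical stand-in for the library's LoadFromModelID (the real signature may differ):

```python
# Hypothetical stand-in for the real loader; only the ID string varies.
def load_from_model_id(model_id: str):
    return {"id": model_id}  # placeholder for a loaded model handle

# Start small during prototyping; bump to "gemma3:12b" when VRAM allows.
CONFIG = {"chat_model": "gemma3:4b"}

def get_chat_model():
    # Upgrading within a family is a one-line config change, no call-site edits.
    return load_from_model_id(CONFIG["chat_model"])
```

Keeping the ID in one place also makes it easy to A/B two family members (say, gemma3:4b vs gemma3:12b) on the same prompts before committing to the larger download.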

