# Model Recommendations
This page provides practical, ready-to-use model picks organized three ways: by GPU hardware, by multi-model pipeline, and by upgrade path. All sizes listed are for 4-bit quantized weights.
Looking for the step-by-step decision guide? See Choosing the Right Model. For detailed descriptions of each model family, see Model Families and Benchmarks.
## Hardware Quick Pick
Find your GPU below and use the recommended models directly.
| Hardware | VRAM | Best Chat Model | Best Vision Model | Best Embedding |
|---|---|---|---|---|
| CPU only (16 GB+ RAM) | — | gemma3:1b or qwen3:1.7b | qwen3-vl:2b (~1.4 GB) | embeddinggemma-300m |
| Entry GPU (GTX 1660, RTX 3060 6 GB) | 6 GB | gemma3:4b (~3.1 GB) | gemma3:4b (~3.1 GB) | embeddinggemma-300m |
| Mid-range GPU (RTX 4060 8 GB) | 8 GB | qwen3:8b (~5.0 GB) | qwen3-vl:8b (~5.6 GB) | qwen3-embedding:0.6b |
| High-end GPU (RTX 4070 Ti 12 GB) | 12 GB | gemma3:12b (~7.9 GB) | gemma3:12b (~7.9 GB) | qwen3-embedding:0.6b |
| Enthusiast GPU (RTX 4090 24 GB) | 24 GB | gptoss:20b (~12.1 GB) or devstral-small2 (~15.2 GB) | qwen3-vl:8b + larger chat model | qwen3-embedding:4b |
| Workstation / multi-GPU | 48+ GB | glm4.7-flash (~18.1 GB) + room for embeddings | qwen3-vl:30b | qwen3-embedding:8b |
Tip: Actual VRAM usage is slightly higher than the file size due to the KV cache and runtime overhead. Use `GetPerformanceScore` for a precise measurement on your machine. See Step 3 in the choosing guide.
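As a rough illustration of why runtime VRAM exceeds the weight-file size, the KV cache grows linearly with context length. A back-of-the-envelope sketch (the layer and head counts below are illustrative, not tied to any specific model on this page):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate KV-cache size: keys + values, every layer, fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 8B-class model with grouped-query attention at an 8k context:
gib = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192) / 2**30
print(f"{gib:.1f} GiB")  # ~1.0 GiB of VRAM on top of the weight file itself
```

Doubling the context length doubles this overhead, which is why a model that "fits" at the file size can still fail to load at long contexts.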
## Model Stack Recipes
Most real-world applications combine multiple models. Here are pre-built stacks for common scenarios with approximate total VRAM requirements.
### RAG Pipeline (10 GB VRAM)
Retrieve relevant documents and generate answers grounded in your data.
| Role | Model | Size |
|---|---|---|
| Embedding | embeddinggemma-300m | ~0.3 GB |
| Reranker | bge-m3-reranker | ~0.4 GB |
| Generator | gemma3:12b | ~7.9 GB |
How-to: Build a RAG Pipeline, Improve RAG Results with Reranking
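The three roles chain together as embed-and-retrieve, then rerank, then generate. The functions below are toy stand-ins for the three models in the table, not a real client API; swap in actual inference calls from your runtime:

```python
import re

def tokens(text):
    """Lowercased word set, used by the toy relevance score below."""
    return set(re.findall(r"\w+", text.lower()))

def rerank(query, docs):
    # Toy stand-in for the reranker: score documents by word overlap.
    return sorted(docs, key=lambda d: len(tokens(query) & tokens(d)), reverse=True)

def generate(prompt):
    # Toy stand-in for the generator model.
    return f"[grounded answer from: {prompt}]"

docs = ["Quantized weights shrink VRAM needs.", "Bananas are yellow."]
query = "How does quantization affect VRAM?"

top = rerank(query, docs)[0]                      # retrieve + rerank
answer = generate(f"Context: {top}\nQ: {query}")  # answer grounded in context
```

The shape is what matters: the embedding model narrows the corpus cheaply, the reranker reorders a small candidate set more accurately, and only the final prompt reaches the large generator.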
### Agentic Assistant with Web Search (14 GB VRAM)
An agent that plans, searches the web, calls tools, and reasons through multi-step tasks.
| Role | Model | Size |
|---|---|---|
| Agent (chat + tools + reasoning) | gptoss:20b | ~12.1 GB |
| Embedding (for memory/context) | embeddinggemma-300m | ~0.3 GB |
How-to: Build an Agent with Web Search, Create an Agent with Tools
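The agent model drives a loop: decide on a tool call, execute it, feed the result back, repeat until it can answer. The `plan` function below is a toy stand-in for the model's tool-use decision, and `web_search` is a hypothetical tool, not a provided API:

```python
def web_search(q):
    # Hypothetical tool; a real one would hit a search API.
    return f"top result for {q!r}"

TOOLS = {"web_search": web_search}

def plan(history):
    # Toy stand-in for the agent model: search once, then answer.
    if not any(m["role"] == "tool" for m in history):
        return {"tool": "web_search", "args": {"q": history[0]["content"]}}
    return {"answer": f"Based on: {history[-1]['content']}"}

def run_agent(question, max_steps=4):
    history = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        step = plan(history)
        if "answer" in step:
            return step["answer"]
        result = TOOLS[step["tool"]](**step["args"])  # execute the tool call
        history.append({"role": "tool", "content": result})
    return "step budget exhausted"
```

The `max_steps` cap is the important design choice: it bounds cost and prevents a confused model from looping forever.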
### Document Q&A with Vision (10 GB VRAM)
Chat with PDFs, images, and scanned documents using vision and embeddings.
| Role | Model | Size |
|---|---|---|
| Vision + Chat | gemma3:12b | ~7.9 GB |
| Embedding | qwen3-embedding:0.6b | ~0.6 GB |
How-to: Chat with PDF Documents, Analyze Images with Vision
### Meeting Transcription + Summary (10 GB VRAM)
Transcribe audio and generate structured meeting notes.
| Role | Model | Size |
|---|---|---|
| Speech-to-text | whisper-large-turbo3 | ~0.9 GB |
| Summarizer | gemma3:12b | ~7.9 GB |
How-to: Transcribe Audio with Speech-to-Text, Summarize Documents and Text
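This stack is a simple two-stage chain: the speech-to-text output becomes the summarizer's prompt. Both functions below are placeholders for the two models in the table, and the notes format is just one possible prompt target:

```python
def transcribe(audio_path):
    # Placeholder for the speech-to-text model; returns the spoken text.
    return "Alice: we ship Friday. Bob: agreed, pending QA sign-off."

def summarize(transcript):
    # Placeholder for the summarizer LLM, prompted with the full transcript.
    prompt = f"Summarize as structured meeting notes:\n{transcript}"
    return {"prompt": prompt, "notes": "Decision: ship Friday (pending QA)."}

result = summarize(transcribe("meeting.wav"))
```

Because the stages run sequentially, you can also unload the STT model before loading the summarizer if VRAM is tight.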
### Lightweight Edge Deployment (3 GB VRAM or CPU)
Run on devices with minimal resources: laptops, embedded systems, or CPU-only servers.
| Role | Model | Size |
|---|---|---|
| Chat | gemma3:1b or qwen3:1.7b | ~0.8 to 1.3 GB |
| Embedding | embeddinggemma-300m | ~0.3 GB |
How-to: Build a Conversational Assistant with Memory
## Model Upgrade Paths
Start small during prototyping, then scale up as your hardware budget or accuracy requirements grow. Models within the same family share the same prompt format and behavior, making upgrades seamless.
### Chat and Reasoning
gemma3:1b (~0.8 GB) → gemma3:4b (~3.1 GB) → gemma3:12b (~7.9 GB) → gemma3:27b (~17 GB)

qwen3:0.6b (~0.5 GB) → qwen3:1.7b (~1.3 GB) → qwen3:4b (~2.5 GB) → qwen3:8b (~5.0 GB) → qwen3:14b (~9.2 GB)
### Vision
qwen3-vl:2b (~1.4 GB) → qwen3-vl:4b (~2.9 GB) → qwen3-vl:8b (~5.6 GB) → qwen3-vl:30b (~18 GB)

gemma3:4b (with vision, ~3.1 GB) → gemma3:12b (~7.9 GB) → gemma3:27b (~17 GB)
### Embeddings
embeddinggemma-300m (~0.3 GB) → qwen3-embedding:0.6b (~0.6 GB) → qwen3-embedding:4b (~2.5 GB) → qwen3-embedding:8b (~5.2 GB)
### Speech-to-Text
whisper-tiny (~39 MB) → whisper-small (~244 MB) → whisper-medium (~488 MB) → whisper-large-turbo3 (~874 MB)
Tip: Switching models within the same family usually requires no code changes. Just change the model ID passed to `LoadFromModelID`.
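One way to make an upgrade path concrete is to pick the largest variant that fits the available VRAM. A sketch using the gemma3 sizes from this page; the ~20% margin is a rough allowance for KV cache and runtime overhead, not a measured value, and `pick_model` is an illustrative helper, not part of any API:

```python
# Approximate 4-bit weight sizes for the gemma3 upgrade path on this page.
GEMMA3_PATH = [("gemma3:1b", 0.8), ("gemma3:4b", 3.1),
               ("gemma3:12b", 7.9), ("gemma3:27b", 17.0)]

def pick_model(vram_gb, path=GEMMA3_PATH, margin=1.2):
    """Return the largest model ID whose size (plus overhead margin) fits."""
    choice = path[0][0]  # smallest variant as the low-VRAM fallback
    for model_id, size_gb in path:
        if size_gb * margin <= vram_gb:
            choice = model_id
    return choice

print(pick_model(12))  # gemma3:12b (7.9 GB * 1.2 ≈ 9.5 GB fits in 12 GB)
```

The returned ID is then the only thing you change when loading the model, which is what makes the within-family upgrade path cheap to exploit.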
## Next Steps
- Choosing the Right Model: step-by-step guide to pick a model for your task and hardware.
- Model Families and Benchmarks: detailed descriptions and benchmark data for every model family.
- Model Catalog: browse all available models with interactive filtering.
- Configure GPU Backends: set up GPU acceleration for faster inference.