Model Recommendations
This page provides practical, ready-to-use model picks organized three ways: by GPU hardware, by multi-model pipeline, and by upgrade path. All sizes listed are for 4-bit quantized weights.
Looking for the step-by-step decision guide? See Choosing the Right Model. For detailed descriptions of each model family, see Model Families and Benchmarks.
Hardware Quick Pick
Find your GPU below and use the recommended models directly.
| Hardware | VRAM | Best Chat Model | Best Coding Model | Best Vision Model | Best OCR Model | Best Embedding |
|---|---|---|---|---|---|---|
| CPU only (16 GB+ RAM) | — | gemma3:1b or qwen3.5:2b | qwen3.5:2b (~2 GB) | qwen3.5:2b (~2 GB) | paddleocr-vl:0.9b (~0.7 GB) or glm-ocr (~1 GB) | embeddinggemma-300m |
| Entry GPU (GTX 1660, RTX 3060 6 GB) | 6 GB | gemma3:4b (~3.1 GB) | qwen3.5:4b (~3.5 GB) | gemma3:4b (~3.1 GB) or qwen3.5:4b (~3.5 GB) | paddleocr-vl:0.9b or glm-ocr or lightonocr-2:1b | embeddinggemma-300m |
| Mid-range GPU (RTX 4060 8 GB) | 8 GB | qwen3.5:9b (~7 GB) | qwen3.5:9b (~7 GB) | glm-4.6v-flash (~7 GB) or qwen3.5:9b (~7 GB) | glm-4.6v-flash (~7 GB) or qwen3.5:9b | qwen3-embedding:0.6b |
| High-end GPU (RTX 4070 Ti 12 GB) | 12 GB | gemma3:12b (~7.9 GB) | qwen3.5:9b (~7 GB) | gemma3:12b (~7.9 GB) | qwen3.5:9b | qwen3-embedding:0.6b |
| Enthusiast GPU (RTX 4090 24 GB) | 24 GB | gptoss:20b (~12.1 GB) or devstral-small2 (~15.2 GB) or qwen3-coder:30b-a3b (~17.3 GB) | devstral-small2 (~15.2 GB) or qwen3-coder:30b-a3b (~17.3 GB) | qwen3.5:27b (~18 GB) or qwen3.5:9b + larger chat model | qwen3.5:27b (~18 GB) | qwen3-embedding:4b |
| Workstation / multi-GPU | 48+ GB | glm4.7-flash (~18.1 GB) + room for embeddings | qwen3-coder:30b-a3b (~17.3 GB) | qwen3.5:35b-a3b (~22 GB) | qwen3.5:35b-a3b (~22 GB) | qwen3-embedding:8b |
Tip: Actual VRAM usage is slightly higher than the file size due to the KV cache and runtime overhead. Use `GetPerformanceScore` for a precise measurement on your machine. See Step 3 in the choosing guide.
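To make that overhead concrete, here is a rough back-of-the-envelope estimate. The layer count, KV-head count, and head dimension below are illustrative placeholders, not the actual architecture of any listed model; treat this as a sketch, not a substitute for measuring on your own machine.

```python
def estimate_vram_gb(weight_file_gb: float,
                     n_layers: int,
                     n_kv_heads: int,
                     head_dim: int,
                     context_len: int,
                     kv_bytes: int = 2,               # fp16 KV cache entries
                     runtime_overhead_gb: float = 0.5) -> float:
    """Rough upper bound: weights + KV cache + a fixed runtime allowance."""
    # The KV cache stores one key and one value vector per layer per token.
    kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes
    return weight_file_gb + kv_cache_bytes / 1024**3 + runtime_overhead_gb

# Example: a ~7 GB model file with made-up architecture numbers
# and an 8K context window.
print(estimate_vram_gb(7.0, n_layers=36, n_kv_heads=8,
                       head_dim=128, context_len=8192))
```

With these illustrative numbers the KV cache alone adds about 1.1 GB, which is why a "~7 GB" model does not quite fit an 8 GB card at long context lengths.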
Model Stack Recipes
Most real-world applications combine multiple models. Here are pre-built stacks for common scenarios with approximate total VRAM requirements.
RAG Pipeline (10 GB VRAM)
Retrieve relevant documents and generate answers grounded in your data.
| Role | Model | Size |
|---|---|---|
| Embedding | embeddinggemma-300m | ~0.3 GB |
| Reranker | bge-m3-reranker | ~0.4 GB |
| Generator | gemma3:12b | ~7.9 GB |
How-to: Build a RAG Pipeline, Improve RAG Results with Reranking
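The flow of this stack is retrieve → rerank → generate. The snippet below uses toy stand-ins (a bag-of-words "embedding", cosine similarity, and a prompt-building stub) purely to show how the three roles connect; a real deployment would load embeddinggemma-300m, bge-m3-reranker, and gemma3:12b instead.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real embedding model returns dense vectors.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rag_answer(question: str, docs: list[str], top_k: int = 2) -> str:
    # Retrieve: rank all documents by similarity to the question.
    scored = sorted(docs, key=lambda d: cosine(embed(question), embed(d)),
                    reverse=True)
    # A reranker model would re-score this shortlist for higher precision.
    context = scored[:top_k]
    # Generate: the chat model would complete this grounded prompt.
    return "Answer from context:\n" + "\n".join(context) + f"\nQ: {question}"

docs = ["The KV cache grows with context length.",
        "Quantization shrinks model weights.",
        "Paris is the capital of France."]
print(rag_answer("What shrinks model weights?", docs))
```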
Agentic Assistant with Web Search (14 GB VRAM)
An agent that plans, searches the web, calls tools, and reasons through multi-step tasks.
| Role | Model | Size |
|---|---|---|
| Agent (chat + tools + reasoning) | gptoss:20b | ~12.1 GB |
| Embedding (for memory/context) | embeddinggemma-300m | ~0.3 GB |
How-to: Build an Agent with Web Search, Create an Agent with Tools
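The agent's control flow is a loop of plan → call tool → observe → answer. In this toy sketch the model's planning step is replaced by a hard-coded rule, and `web_search` and `calculator` are hypothetical stub tools, not real APIs; it only illustrates the dispatch shape.

```python
def web_search(query: str) -> str:
    return f"[stub results for '{query}']"   # a real tool would hit a search API

def calculator(expr: str) -> str:
    return str(eval(expr, {"__builtins__": {}}))  # toy; sandbox properly in practice

TOOLS = {"web_search": web_search, "calculator": calculator}

def run_agent(task: str) -> str:
    # A real agent loops: the model emits a tool call, we execute it, feed
    # the observation back, and repeat until the model answers directly.
    # Here a trivial rule stands in for the model's decision.
    tool, arg = ("calculator", task) if task[0].isdigit() else ("web_search", task)
    observation = TOOLS[tool](arg)
    return f"{tool} -> {observation}"

print(run_agent("2 + 3"))
print(run_agent("latest GPU prices"))
```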
Agentic Coding Assistant (19 GB VRAM)
A code-focused agent that analyzes repositories, generates code, and uses tools for file operations and web search.
| Role | Model | Size |
|---|---|---|
| Agent (code + tools) | qwen3-coder:30b-a3b | ~17.3 GB |
| Embedding (for code search) | embeddinggemma-300m | ~0.3 GB |
How-to: Build a Function-Calling Agent, Create an Agent with Tools. Demos: Code Analysis Assistant, Code Writing Assistant
Document Q&A with Vision (10 GB VRAM)
Chat with PDFs, images, and scanned documents using vision and embeddings.
| Role | Model | Size |
|---|---|---|
| Vision + Chat | gemma3:12b | ~7.9 GB |
| Embedding | qwen3-embedding:0.6b | ~0.6 GB |
How-to: Chat with PDF Documents, Analyze Images with Vision
Meeting Transcription + Summary (10 GB VRAM)
Transcribe audio and generate structured meeting notes.
| Role | Model | Size |
|---|---|---|
| Speech-to-text | whisper-large-turbo3 | ~0.9 GB |
| Summarizer | gemma3:12b | ~7.9 GB |
How-to: Transcribe Audio with Speech-to-Text, Summarize Documents and Text
Lightweight Edge Deployment (3 GB VRAM or CPU)
Run on devices with minimal resources: laptops, embedded systems, or CPU-only servers.
| Role | Model | Size |
|---|---|---|
| Chat | gemma3:1b or qwen3.5:2b | ~0.8 to 2 GB |
| Embedding | embeddinggemma-300m | ~0.3 GB |
How-to: Build a Conversational Assistant with Memory
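A quick way to sanity-check any of these stacks against your GPU is to sum the model sizes and add headroom for KV cache and runtime overhead. The 15% headroom factor below is an illustrative assumption, not a measured constant; the recipe headers above bake in a similar margin.

```python
def stack_fits(models_gb: dict[str, float], budget_gb: float,
               headroom: float = 1.15) -> tuple[bool, float]:
    """Return (fits, estimated need in GB), assuming ~15% headroom
    for KV cache and runtime overhead (an illustrative factor)."""
    need = sum(models_gb.values()) * headroom
    return need <= budget_gb, round(need, 2)

# The RAG pipeline stack: embedding + reranker + generator.
rag = {"embeddinggemma-300m": 0.3, "bge-m3-reranker": 0.4, "gemma3:12b": 7.9}
print(stack_fits(rag, budget_gb=10))
```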
Model Upgrade Paths
Start small during prototyping, then scale up as your hardware budget or accuracy requirements grow. Models within the same family share the same prompt format and behavior, making upgrades seamless.
Chat and Reasoning
gemma3:1b (~0.8 GB) → gemma3:4b (~3.1 GB) → gemma3:12b (~7.9 GB) → gemma3:27b (~17 GB)
qwen3.5:0.8b (~1 GB) → qwen3.5:2b (~2 GB) → qwen3.5:4b (~3.5 GB) → qwen3.5:9b (~7 GB) → qwen3.5:27b (~18 GB)
Coding
qwen3.5:2b (~2 GB) → qwen3.5:4b (~3.5 GB) → qwen3.5:9b (~7 GB) → devstral-small2 (~15.2 GB) → qwen3-coder:30b-a3b (~17.3 GB)
Tip: The `qwen3.5` family provides strong coding performance at smaller sizes. When you have 16+ GB VRAM, dedicated coding models (`devstral-small2`, `qwen3-coder`) deliver significantly better results on code generation, analysis, and refactoring tasks.
Vision
qwen3.5:2b (~2 GB) → qwen3.5:4b (~3.5 GB) → qwen3.5:9b (~7 GB) → qwen3.5:27b (~18 GB) → qwen3.5:35b-a3b (~22 GB)
gemma3:4b (with vision, ~3.1 GB) → gemma3:12b (~7.9 GB) → gemma3:27b (~17 GB)
OCR and Document Understanding
paddleocr-vl:0.9b (~0.7 GB) / glm-ocr (~1 GB) → lightonocr-2:1b (~0.6 GB) → glm-4.6v-flash (~7 GB) → qwen3.5:9b (~7 GB) → qwen3.5:27b (~18 GB) → qwen3.5:35b-a3b (~22 GB)
Tip: The dedicated OCR models (`paddleocr-vl`, `glm-ocr`, `lightonocr-2`) are extremely lightweight and can run alongside a chat model with minimal VRAM overhead. Use `glm-4.6v-flash` (~7 GB) when you need OCR combined with chat and tool calling in a single model. Scale up to VLM-based OCR (`qwen3.5:9b`, `qwen3.5:27b`) when you need higher accuracy on complex multilingual documents.
Embeddings
embeddinggemma-300m (~0.3 GB) → qwen3-embedding:0.6b (~0.6 GB) → qwen3-embedding:4b (~2.5 GB) → qwen3-embedding:8b (~5.2 GB)
Speech-to-Text
whisper-tiny (~39 MB) → whisper-small (~244 MB) → whisper-medium (~488 MB) → whisper-large-turbo3 (~874 MB)
Tip: Switching models within the same family usually requires no code changes. Just change the model ID passed to `LoadFromModelID`.
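One convenient pattern is to keep the model ID in a single configuration entry, so an upgrade really is a one-line change. The sketch below is hypothetical: `load_from_model_id` is a placeholder standing in for the runtime's actual loader (`LoadFromModelID`), whose real signature and return type may differ.

```python
# Keep the model ID in one place; upgrading within a family means
# editing only this entry.
CONFIG = {"chat_model": "gemma3:4b"}   # later: "gemma3:12b", "gemma3:27b"

def load_from_model_id(model_id: str) -> str:
    # Placeholder for the runtime's real loader; here it just echoes the ID.
    return f"<loaded {model_id}>"

def get_chat_model() -> str:
    return load_from_model_id(CONFIG["chat_model"])

CONFIG["chat_model"] = "gemma3:12b"    # upgrade within the same family
print(get_chat_model())
```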
Next Steps
- Choosing the Right Model: step-by-step guide to pick a model for your task and hardware.
- Model Families and Benchmarks: detailed descriptions and benchmark data for every model family.
- Model Catalog: browse all available models with interactive filtering.
- Configure GPU Backends: set up GPU acceleration for faster inference.