Model Recommendations
This page provides practical, ready-to-use model picks organized three ways: by GPU hardware, by multi-model pipeline, and by upgrade path. All sizes listed are for 4-bit quantized weights.
Looking for the step-by-step decision guide? See Choosing the Right Model. For detailed descriptions of each model family, see Model Families and Benchmarks.
Hardware Quick Pick
Find your GPU below and use the recommended models directly.
| Hardware | VRAM | Best Chat Model | Best Coding Model | Best Vision Model | Best OCR Model | Best Embedding |
|---|---|---|---|---|---|---|
| CPU only (16 GB+ RAM) | — | gemma3:1b or qwen3.5:2b | qwen3.5:2b (~2 GB) | qwen3.5:2b (~2 GB) | paddleocr-vl:0.9b (~0.7 GB) or glm-ocr (~1 GB) | embeddinggemma-300m |
| Entry GPU (GTX 1660, RTX 3060 6 GB) | 6 GB | gemma3:4b (~3.1 GB) | qwen3.5:4b (~3.5 GB) | gemma3:4b (~3.1 GB) or qwen3.5:4b (~3.5 GB) | paddleocr-vl:0.9b or glm-ocr or lightonocr-2:1b | embeddinggemma-300m |
| Mid-range GPU (RTX 4060 8 GB) | 8 GB | qwen3.5:9b (~7 GB) | qwen3.5:9b (~7 GB) | glm-4.6v-flash (~7 GB) or qwen3.5:9b (~7 GB) | glm-4.6v-flash (~7 GB) or qwen3.5:9b | qwen3-embedding:0.6b |
| High-end GPU (RTX 4070 Ti 12 GB) | 12 GB | gemma3:12b (~7.9 GB) | qwen3.5:9b (~7 GB) | gemma3:12b (~7.9 GB) | qwen3.5:9b | qwen3-embedding:0.6b |
| Enthusiast GPU (RTX 4090 24 GB) | 24 GB | gptoss:20b (~12.1 GB) or devstral-small2 (~15.2 GB) or qwen3-coder:30b-a3b (~17.3 GB) | devstral-small2 (~15.2 GB) or qwen3-coder:30b-a3b (~17.3 GB) | qwen3.5:27b (~18 GB) or qwen3.5:9b + larger chat model | qwen3.5:27b (~18 GB) | qwen3-embedding:4b |
| Workstation / multi-GPU | 48+ GB | glm4.7-flash (~18.1 GB) + room for embeddings | qwen3-coder:30b-a3b (~17.3 GB) | qwen3.5:35b-a3b (~22 GB) | qwen3.5:35b-a3b (~22 GB) | qwen3-embedding:8b |
Tip: Actual VRAM usage is slightly higher than the file size due to the KV cache and runtime overhead. Use `GetPerformanceScore` for a precise measurement on your machine. See Step 3 in the choosing guide.
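To make that overhead concrete, here is a rough back-of-the-envelope estimate. The layer count, KV-head count, and head dimension below are illustrative placeholders, not the actual architecture of any listed model; treat this as a sketch, not a substitute for measuring on your own machine.

```python
def estimate_vram_gb(weight_file_gb: float,
                     n_layers: int,
                     n_kv_heads: int,
                     head_dim: int,
                     context_len: int,
                     kv_bytes: int = 2,               # fp16 KV cache entries
                     runtime_overhead_gb: float = 0.5) -> float:
    """Rough upper bound: weights + KV cache + a fixed runtime allowance."""
    # The KV cache stores one key and one value vector per layer per token.
    kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes
    return weight_file_gb + kv_cache_bytes / 1024**3 + runtime_overhead_gb

# Example: a ~7 GB model file with made-up architecture numbers
# and an 8K context window.
print(estimate_vram_gb(7.0, n_layers=36, n_kv_heads=8,
                       head_dim=128, context_len=8192))
```

With these illustrative numbers the KV cache alone adds about 1.1 GB, which is why a "~7 GB" model does not quite fit an 8 GB card at long context lengths.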
Model Stack Recipes
Most real-world applications combine multiple models. Here are pre-built stacks for common scenarios with approximate total VRAM requirements.
RAG Pipeline (10 GB VRAM)
Retrieve relevant documents and generate answers grounded in your data.
| Role | Model | Size |
|---|---|---|
| Embedding | embeddinggemma-300m | ~0.3 GB |
| Reranker | bge-m3-reranker | ~0.4 GB |
| Generator | gemma3:12b | ~7.9 GB |
How-to: Build a RAG Pipeline, Improve RAG Results with Reranking
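The flow of this stack is retrieve → rerank → generate. The snippet below uses toy stand-ins (a bag-of-words "embedding", cosine similarity, and a prompt-building stub) purely to show how the three roles connect; a real deployment would load embeddinggemma-300m, bge-m3-reranker, and gemma3:12b instead.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real embedding model returns dense vectors.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rag_answer(question: str, docs: list[str], top_k: int = 2) -> str:
    # Retrieve: rank all documents by similarity to the question.
    scored = sorted(docs, key=lambda d: cosine(embed(question), embed(d)),
                    reverse=True)
    # A reranker model would re-score this shortlist for higher precision.
    context = scored[:top_k]
    # Generate: the chat model would complete this grounded prompt.
    return "Answer from context:\n" + "\n".join(context) + f"\nQ: {question}"

docs = ["The KV cache grows with context length.",
        "Quantization shrinks model weights.",
        "Paris is the capital of France."]
print(rag_answer("What shrinks model weights?", docs))
```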
Agentic Assistant with Web Search (14 GB VRAM)
An agent that plans, searches the web, calls tools, and reasons through multi-step tasks.
| Role | Model | Size |
|---|---|---|
| Agent (chat + tools + reasoning) | gptoss:20b | ~12.1 GB |
| Embedding (for memory/context) | embeddinggemma-300m | ~0.3 GB |
How-to: Build an Agent with Web Search, Create an Agent with Tools
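The agent's control flow is a loop of plan → call tool → observe → answer. In this toy sketch the model's planning step is replaced by a hard-coded rule, and `web_search` and `calculator` are hypothetical stub tools, not real APIs; it only illustrates the dispatch shape.

```python
def web_search(query: str) -> str:
    return f"[stub results for '{query}']"   # a real tool would hit a search API

def calculator(expr: str) -> str:
    return str(eval(expr, {"__builtins__": {}}))  # toy; sandbox properly in practice

TOOLS = {"web_search": web_search, "calculator": calculator}

def run_agent(task: str) -> str:
    # A real agent loops: the model emits a tool call, we execute it, feed
    # the observation back, and repeat until the model answers directly.
    # Here a trivial rule stands in for the model's decision.
    tool, arg = ("calculator", task) if task[0].isdigit() else ("web_search", task)
    observation = TOOLS[tool](arg)
    return f"{tool} -> {observation}"

print(run_agent("2 + 3"))
print(run_agent("latest GPU prices"))
```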
Agentic Coding Assistant (19 GB VRAM)
A code-focused agent that analyzes repositories, generates code, and uses tools for file operations and web search.
| Role | Model | Size |
|---|---|---|
| Agent (code + tools) | qwen3-coder:30b-a3b | ~17.3 GB |
| Embedding (for code search) | embeddinggemma-300m | ~0.3 GB |
How-to: Build a Function-Calling Agent, Create an Agent with Tools. Demos: Code Analysis Assistant, Code Writing Assistant
Document Q&A with Vision (10 GB VRAM)
Chat with PDFs, images, and scanned documents using vision and embeddings.
| Role | Model | Size |
|---|---|---|
| Vision + Chat | gemma3:12b | ~7.9 GB |
| Embedding | qwen3-embedding:0.6b | ~0.6 GB |
How-to: Chat with PDF Documents, Analyze Images with Vision
Meeting Transcription + Summary (10 GB VRAM)
Transcribe audio and generate structured meeting notes.
| Role | Model | Size |
|---|---|---|
| Speech-to-text | whisper-large-turbo3 | ~0.9 GB |
| Summarizer | gemma3:12b | ~7.9 GB |
How-to: Transcribe Audio with Speech-to-Text, Summarize Documents and Text
Lightweight Edge Deployment (3 GB VRAM or CPU)
Run on devices with minimal resources: laptops, embedded systems, or CPU-only servers.
| Role | Model | Size |
|---|---|---|
| Chat | gemma3:1b or qwen3.5:2b | ~0.8 to 2 GB |
| Embedding | embeddinggemma-300m | ~0.3 GB |
How-to: Build a Conversational Assistant with Memory
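A quick way to sanity-check any of these stacks against your GPU is to sum the model sizes and add headroom for KV cache and runtime overhead. The 15% headroom factor below is an illustrative assumption, not a measured constant; the recipe headers above bake in a similar margin.

```python
def stack_fits(models_gb: dict[str, float], budget_gb: float,
               headroom: float = 1.15) -> tuple[bool, float]:
    """Return (fits, estimated need in GB), assuming ~15% headroom
    for KV cache and runtime overhead (an illustrative factor)."""
    need = sum(models_gb.values()) * headroom
    return need <= budget_gb, round(need, 2)

# The RAG pipeline stack: embedding + reranker + generator.
rag = {"embeddinggemma-300m": 0.3, "bge-m3-reranker": 0.4, "gemma3:12b": 7.9}
print(stack_fits(rag, budget_gb=10))
```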
Model Upgrade Paths
Start small during prototyping, then scale up as your hardware budget or accuracy requirements grow. Models within the same family share the same prompt format and behavior, making upgrades seamless.
Chat and Reasoning
gemma3:1b (~0.8 GB) → gemma3:4b (~3.1 GB) → gemma3:12b (~7.9 GB) → gemma3:27b (~17 GB)
qwen3.5:0.8b (~1 GB) → qwen3.5:2b (~2 GB) → qwen3.5:4b (~3.5 GB) → qwen3.5:9b (~7 GB) → qwen3.5:27b (~18 GB)
Coding
qwen3.5:2b (~2 GB) → qwen3.5:4b (~3.5 GB) → qwen3.5:9b (~7 GB) → devstral-small2 (~15.2 GB) → qwen3-coder:30b-a3b (~17.3 GB)
Tip: The `qwen3.5` family provides strong coding performance at smaller sizes. When you have 16+ GB VRAM, dedicated coding models (`devstral-small2`, `qwen3-coder`) deliver significantly better results on code generation, analysis, and refactoring tasks.
Vision
qwen3.5:2b (~2 GB) → qwen3.5:4b (~3.5 GB) → qwen3.5:9b (~7 GB) → qwen3.5:27b (~18 GB) → qwen3.5:35b-a3b (~22 GB)
gemma3:4b (with vision, ~3.1 GB) → gemma3:12b (~7.9 GB) → gemma3:27b (~17 GB)
OCR and Document Understanding
paddleocr-vl:0.9b (~0.7 GB) / glm-ocr (~1 GB) → lightonocr-2:1b (~0.6 GB) → glm-4.6v-flash (~7 GB) → qwen3.5:9b (~7 GB) → qwen3.5:27b (~18 GB) → qwen3.5:35b-a3b (~22 GB)
Tip: The dedicated OCR models (`paddleocr-vl`, `glm-ocr`, `lightonocr-2`) are extremely lightweight and can run alongside a chat model with minimal VRAM overhead. Use `glm-4.6v-flash` (~7 GB) when you need OCR combined with chat and tool calling in a single model. Scale up to VLM-based OCR (`qwen3.5:9b`, `qwen3.5:27b`) when you need higher accuracy on complex multilingual documents.
Embeddings
embeddinggemma-300m (~0.3 GB) → qwen3-embedding:0.6b (~0.6 GB) → qwen3-embedding:4b (~2.5 GB) → qwen3-embedding:8b (~5.2 GB)
Speech-to-Text
whisper-tiny (~39 MB) → whisper-small (~244 MB) → whisper-medium (~488 MB) → whisper-large-turbo3 (~874 MB)
Tip: Switching models within the same family usually requires no code changes. Just change the model ID passed to `LoadFromModelID`.
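One convenient pattern is to keep the model ID in a single configuration entry, so an upgrade really is a one-line change. The sketch below is hypothetical: `load_from_model_id` is a placeholder standing in for the runtime's actual loader (`LoadFromModelID`), whose real signature and return type may differ.

```python
# Keep the model ID in one place; upgrading within a family means
# editing only this entry.
CONFIG = {"chat_model": "gemma3:4b"}   # later: "gemma3:12b", "gemma3:27b"

def load_from_model_id(model_id: str) -> str:
    # Placeholder for the runtime's real loader; here it just echoes the ID.
    return f"<loaded {model_id}>"

def get_chat_model() -> str:
    return load_from_model_id(CONFIG["chat_model"])

CONFIG["chat_model"] = "gemma3:12b"    # upgrade within the same family
print(get_chat_model())
```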
Next Steps
- Choosing the Right Model: step-by-step guide to pick a model for your task and hardware.
- Model Families and Benchmarks: detailed descriptions and benchmark data for every model family.
- Model Catalog: browse all available models with interactive filtering.
- Configure GPU Backends: set up GPU acceleration for faster inference.