Model Families and Benchmarks
This page describes every model family in the LM-Kit catalog, organized by category. Use it as a reference when comparing models or exploring alternatives beyond the recommended starting points.
Looking for a quick recommendation? See Choosing the Right Model for the step-by-step decision guide, or Model Recommendations for hardware-based picks and ready-made stacks.
Chat and Reasoning Models
| Family | Sizes | Strengths |
|---|---|---|
| GPT OSS | 20B (MoE, ~3.6B active) | OpenAI open-weight. Near o3-mini on reasoning benchmarks (96% AIME 2024). Configurable reasoning effort. Strong agentic and tool-use capabilities. Runs on 16 GB VRAM thanks to MoE efficiency. |
| GLM 4.7 | 30B (MoE, ~3B active) | Z.ai. Leads the 30B class on coding and agentic benchmarks (59% SWE-bench Verified, 79% Tau2-Bench). Strong math (92% AIME 2025). Interleaved thinking preserves reasoning context across tool calls. 200K context. |
| Gemma 4 | E2B (4.6B MoE), E4B (7.5B MoE), 12B (dense), 26B-A4B (MoE, ~4B active), 31B (dense) | Google's next-generation multimodal family. Vision, tool calling, coding, math, and reasoning across 140+ languages. The MoE variants deliver large-model quality at small-model speed: gemma4:26b-a4b activates only ~4B parameters per token, so it reasons at 26B-class quality while running in ~18 GB on a single 24 GB GPU. 128K context (256K on the 12B and 31B dense variants). |
| Qwen 3.5 | 0.8B, 2B, 4B, 9B, 27B (dense), 35B-A3B (MoE, 3B active) | Alibaba. Next-generation hybrid models using Gated Delta Networks. Vision, tool calling, code, math, and OCR across 200+ languages. 262K native context. Dual-mode thinking (reasoning on/off). The 35B-A3B is a sparse MoE delivering 35B quality at a fraction of the compute. |
| Falcon H1R | 7B | TII. Hybrid Transformer + Mamba2 reasoning model. 88% on AIME 2024, outperforming many models up to 7x its size on math benchmarks. Exceptional inference speed (~1,500 tok/sec/GPU). 256K context. |
| Falcon 3 | 3B, 7B, 10B | TII. Open-weight dense models. Solid general-purpose chat with math and code. |
| Llama 3 | 1B, 3B, 8B, 70B | Meta. Well-rounded, large community. 131K context. Tool calling on 8B (3.1) and 70B (3.3). |
| Phi 4 | 3.8B (Mini), 14.7B | Microsoft. Compact and efficient. Strong for its size class, good tool calling support. |
| QwQ | 32.5B | Alibaba. Dedicated reasoning model with math, coding, and tool calling. 40K context. |
| Nemotron 3 Nano | 30B (MoE, ~3.5B active) | NVIDIA. Hybrid Mamba-2/Transformer reasoning model. 1M context. Strong on math and coding. |
| SmolLM3 | 3B | HuggingFace. Lightweight, math and code capable. 65K context. |
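
Several rows above give both a total and an active parameter count for MoE models. The following is a rough sketch of why both numbers matter: all experts must be resident in memory, so weight memory follows the total count, while per-token compute follows the active count. The `ModelSpec` type, the helper names, and the 4.5 bits-per-weight figure are illustrative assumptions, not values taken from the catalog, and the estimate covers weights only (no context cache or runtime overhead).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical container for the sizes listed in the table above."""
    name: str
    total_params_b: float   # total parameters, in billions
    active_params_b: float  # parameters used per token, in billions


def weight_memory_gb(spec: ModelSpec, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    # billions of params * bits per weight / 8 bits per byte = GB
    return spec.total_params_b * bits_per_weight / 8


def active_fraction(spec: ModelSpec) -> float:
    """Share of the weights that take part in computing each token."""
    return spec.active_params_b / spec.total_params_b


# Sizes from the Gemma 4 row; the quantization level is an assumption.
gemma_moe = ModelSpec("gemma4:26b-a4b", total_params_b=26.0, active_params_b=4.0)
ASSUMED_BITS_PER_WEIGHT = 4.5

print(f"{gemma_moe.name}: ~{weight_memory_gb(gemma_moe, ASSUMED_BITS_PER_WEIGHT):.1f} GB of weights")
print(f"active per token: {active_fraction(gemma_moe):.0%} of parameters")
```

Under these assumptions the weights alone come to roughly 14.6 GB. The remainder of the ~18 GB quoted above would plausibly be context cache and runtime buffers, although the quantization behind that figure is not stated on this page.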
Code Generation Models
| Family | Sizes | Strengths |
|---|---|---|
| Qwen 3 Coder | 30B (MoE, ~3.3B active) | Alibaba. Purpose-built for agentic coding with 128 experts. Native tool calling, 262K context for repository-scale code understanding. Apache 2.0 license. |
| Devstral | 24B | Mistral. Purpose-built for agentic software engineering. 68% on SWE-bench Verified, the highest among open models under 30B. 393K context. Vision capable. |
| DeepSeek Coder | 16B | Specialized code generation. 163K context. |
| DeepSeek R1 | 8B (distilled) | Code and math reasoning. Distilled from the full R1 model. |
→ Try it: Code Analysis Assistant · Code Writing Assistant
Mistral Family (Chat, Vision, Reasoning)
| Family | Sizes | Strengths |
|---|---|---|
| Ministral 3 | 3B, 8B, 14B | Edge-optimized with vision and tool calling. 262K context. Great for on-device deployment. |
| Mistral Small 3.2 | 24B | Strong tool calling and code. 131K context. |
| Magistral Small | 24B | Reasoning specialist with transparent chain-of-thought. Tool calling support. |
| Pixtral | 12B | Vision-language model. 1M context. |
Vision / Multimodal Models
| Family | Sizes | Strengths |
|---|---|---|
| GLM-V 4.6 Flash | 10B | Z.ai. Lightweight vision-language model optimized for low-latency local deployment. Strong at OCR and document/screenshot understanding (text, layout, charts, tables). Native function calling for tool-driven multimodal agents. 131K context. |
| Gemma 4 | E2B, E4B, 12B, 26B-A4B, 31B | Also listed under Chat. Google's multimodal family with vision on every variant, plus tool calling, reasoning, coding, and math in 140+ languages. The gemma4:26b-a4b MoE pairs 26B-class visual understanding with ~4B-active efficiency (~18 GB, fits a single 24 GB GPU), while gemma4:e4b (~4.8 GB) brings vision to entry GPUs. 128K context. |
| Qwen 3.5 | 0.8B, 2B, 4B, 9B, 27B, 35B-A3B | Also listed under Chat. Vision-language with tool calling, code, and math. 262K context. |
| MiniCPM | 8B, 9B | OpenBMB. Compact vision models. MiniCPM-o 4.5 is the latest with strong visual understanding, OCR, and document parsing in 30+ languages. |
OCR and Document Understanding Models
| Family | Sizes | Strengths |
|---|---|---|
| GLM-OCR | 0.9B | Z.ai. Specialized in document parsing, OCR, and structured information extraction. Supports text, formula, table, and complex layout recognition across multiple languages. 131K context. |
| PaddleOCR VL | 0.9B | PaddlePaddle. Ultra-compact (0.7 GB) vision-language model achieving 94.5% on OmniDocBench v1.5. Supports OCR, table recognition, formula recognition, chart recognition, text spotting, and seal recognition. 131K context. |
| LightOnOCR 2 | 1B | LightOn. Efficient end-to-end OCR model refined with RLVR training. Converts documents (PDFs, scans, images) into clean, naturally ordered text. Excels at tables, receipts, forms, multi-column layouts, and math notation. 16K context. Replaces LightOnOCR 1025. |
| GLM-V 4.6 Flash | 10B | Also listed under Vision. Strong at OCR and document/screenshot understanding, with native function calling. 131K context. A step up in accuracy from the sub-1B OCR models (~7 GB VRAM). |
| Qwen 3.5 | 0.8B, 2B, 4B, 9B, 27B, 35B-A3B | Also listed under Chat and Vision. OCR across 200+ languages with 262K context. The 9B variant offers the best balance of accuracy and resource usage. The 27B and 35B-A3B deliver the highest accuracy for complex documents when VRAM allows. |
| MiniCPM-V 4.5 | 8B | Also listed under Vision. GPT-4o-level OCR and document parsing with multilingual support (30+ languages). |
Tip: For dedicated OCR workloads, start with paddleocr-vl-1.6:0.9b, glm-ocr, or lightonocr-2:1b (all under 1 GB). Use glm-4.6v-flash (~7 GB) when you need OCR combined with chat and tool calling. Scale up to qwen3.5:9b or qwen3.6:27b for complex multilingual documents or when you need additional vision reasoning alongside OCR.
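
The tip above amounts to a small decision rule, which could be sketched as follows. The function name, its parameters, and the order in which the two conditions are checked are assumptions made for illustration. Only the model IDs come from the tip itself.

```python
def pick_ocr_models(needs_chat_and_tools: bool = False,
                    complex_multilingual: bool = False) -> list[str]:
    """Return candidate model IDs for an OCR workload (hypothetical helper)."""
    if complex_multilingual:
        # Complex multilingual documents, or extra vision reasoning alongside OCR.
        return ["qwen3.5:9b", "qwen3.6:27b"]
    if needs_chat_and_tools:
        # OCR combined with chat and tool calling (~7 GB).
        return ["glm-4.6v-flash"]
    # Dedicated OCR: the compact specialist models.
    return ["paddleocr-vl-1.6:0.9b", "glm-ocr", "lightonocr-2:1b"]


print(pick_ocr_models())
print(pick_ocr_models(needs_chat_and_tools=True))
print(pick_ocr_models(complex_multilingual=True))
```

Treat the result as a shortlist to benchmark on your own documents, not as a final choice.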
Enterprise and Long-Context Models
| Family | Sizes | Strengths |
|---|---|---|
| Granite 4 Hybrid | 3B, 7B (MoE) | IBM. Hybrid Mamba-2/Transformer. Up to 1M token context with 70% less memory than standard transformers. ISO 42001 certified. Strong instruction following and function calling. |
Embedding Models
| Family | Sizes | Strengths |
|---|---|---|
| Embedding Gemma | 300M | Google. Derived from Gemma. Highest-ranked open model under 500M params on MTEB. Excellent default for lightweight RAG. 2K context. |
| Qwen 3 Embedding | 0.6B, 4B, 8B | Top 3 on MTEB multilingual (8B variant). 32K/40K context. Best overall accuracy for RAG, especially multilingual and code. |
| Nomic Embed Text | 137M | Outperforms OpenAI Ada-002. Excellent quality-to-size ratio. 2K context. |
| Nomic Embed Vision | 92M | Image embeddings for visual search and similarity. ONNX format. |
| BGE-M3 | 568M | BAAI. Multilingual embeddings. 8K context. |
| BGE Small | 33M | BAAI. Ultra-lightweight English embeddings. 512 context. |
Reranking Models
| Family | Sizes | Strengths |
|---|---|---|
| BGE M3 Reranker | 568M | Reranks search candidates by relevance. Use after initial embedding retrieval for better accuracy. |
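
The "retrieve, then rerank" pattern described in the table can be sketched in plain Python. This is a minimal illustration, not LM-Kit's API: the hand-written vectors stand in for an embedding model's output, and the word-overlap scorer `toy_rerank_score` stands in for a real reranking model. All names here are hypothetical.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_then_rerank(query: str,
                         query_vec: Sequence[float],
                         docs: Sequence[str],
                         doc_vecs: Sequence[Sequence[float]],
                         rerank_score: Callable[[str, str], float],
                         k_retrieve: int = 3,
                         k_final: int = 2) -> list[str]:
    # Stage 1: cheap embedding similarity narrows the corpus to k_retrieve candidates.
    by_similarity = sorted(range(len(docs)),
                           key=lambda i: cosine(query_vec, doc_vecs[i]),
                           reverse=True)
    candidates = by_similarity[:k_retrieve]
    # Stage 2: the (more expensive) reranker reorders only those candidates.
    reranked = sorted(candidates,
                      key=lambda i: rerank_score(query, docs[i]),
                      reverse=True)
    return [docs[i] for i in reranked[:k_final]]


def toy_rerank_score(query: str, doc: str) -> float:
    """Stand-in for a reranking model: counts shared lowercase words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


docs = [
    "reset your password from the login page",
    "password policy for administrators",
    "reset the router to factory settings",
    "quarterly sales report",
]
# Toy embeddings, chosen by hand for the example.
doc_vecs = [[0.8, 0.3, 0.0], [0.9, 0.1, 0.0], [0.7, 0.0, 0.4], [0.0, 0.1, 0.9]]
query = "how do I reset my password"
query_vec = [1.0, 0.0, 0.0]

top = retrieve_then_rerank(query, query_vec, docs, doc_vecs, toy_rerank_score)
print(top)
```

In this toy data the second document is closest by embedding similarity, but the reranker promotes the first one because it matches both "reset" and "password". That reordering of a short candidate list is the step a reranking model contributes.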
Speech-to-Text Models
| Family | Sizes | Strengths |
|---|---|---|
| Whisper | 39M to 1.5B | OpenAI. Seven size tiers from tiny to large-v3-turbo. The turbo variant offers the best speed/quality balance. |
Specialized Models
| Family | Sizes | Strengths |
|---|---|---|
| LM-Kit Sentiment Analysis | 1.2B | Finetuned for sentiment and emotion detection. |
| LM-Kit Sarcasm Detection | 1.1B | Finetuned for sarcasm detection. |
| U2-Net | 44M | Image segmentation. |
Tip: Models marked with a "Replaced by" indicator in the Model Catalog have a newer successor. Prefer the replacement model for new projects.
Compare Models with Public Leaderboards
Model benchmarks evolve rapidly. Use these independent leaderboards to compare models before committing:
| Leaderboard | What It Measures | Link |
|---|---|---|
| LMArena Chatbot Arena | Human preference Elo ratings from blind A/B tests | lmarena.ai |
| Open LLM Leaderboard | Standardized benchmarks (MMLU, ARC, HellaSwag, etc.) for open models | huggingface.co/open-llm-leaderboard |
| MTEB Leaderboard | Embedding model quality (retrieval, classification, clustering, reranking) | huggingface.co/mteb |
| SWE-bench Verified | Real-world coding ability (fixing GitHub issues) | swebench.com |
| BFCL Leaderboard | Function/tool calling accuracy | gorilla.cs.berkeley.edu |
| LiveCodeBench | Code generation on fresh, unseen problems | livecodebench.github.io |
| Open VLM Leaderboard | Vision-language model comparison | huggingface.co/opencompass |
| AIME Benchmark | Mathematical reasoning (competition-level math problems) | Results published per model release |
Tip: Benchmark scores published with a model release are typically self-reported by the model's authors and may use different evaluation settings. Cross-reference multiple leaderboards, and always test on your own data before choosing a model for production.
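
One simple way to cross-reference leaderboards is to average each model's rank across them. The sketch below assumes you have copied the rankings by hand. The leaderboard and model names are placeholders, not real results, and `mean_rank` is a hypothetical helper.

```python
from statistics import mean


def mean_rank(rankings: dict[str, list[str]]) -> list[tuple[str, float]]:
    """Average 1-based rank per model, best first.

    Only models that appear on every leaderboard are compared, so a model
    is not rewarded or penalized for being absent from one of them.
    """
    if not rankings:
        return []
    common = set.intersection(*(set(order) for order in rankings.values()))
    scores = {
        model: mean(order.index(model) + 1 for order in rankings.values())
        for model in common
    }
    # Sort by mean rank, breaking ties alphabetically for stable output.
    return sorted(scores.items(), key=lambda item: (item[1], item[0]))


# Placeholder data: each list is ordered best to worst.
rankings = {
    "leaderboard-1": ["model-a", "model-b", "model-c"],
    "leaderboard-2": ["model-b", "model-a", "model-c"],
    "leaderboard-3": ["model-b", "model-c", "model-a"],
}

for model, rank in mean_rank(rankings):
    print(f"{model}: mean rank {rank:.2f}")
```

Mean rank treats every leaderboard as equally relevant. If one of them measures the task you care about most, weighting it more heavily or looking at it alone may serve you better.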
Next Steps
- Choosing the Right Model: step-by-step guide to pick a model for your task and hardware.
- Model Recommendations: hardware quick picks, multi-model stacks, and upgrade paths.
- Model Catalog: browse all available models with interactive filtering.