Model Families and Benchmarks

This page describes every model family in the LM-Kit catalog, organized by category. Use it as a reference when comparing models or exploring alternatives beyond the recommended starting points.

Looking for a quick recommendation? See Choosing the Right Model for the step-by-step decision guide, or Model Recommendations for hardware-based picks and ready-made stacks.


Chat and Reasoning Models

| Family | Sizes | Strengths |
|--------|-------|-----------|
| GPT OSS | 20B (MoE, ~3.6B active) | OpenAI open-weight. Near o3-mini on reasoning benchmarks (96% AIME 2024). Configurable reasoning effort. Strong agentic and tool-use capabilities. Runs on 16 GB VRAM thanks to MoE efficiency. |
| GLM 4.7 | 30B (MoE, ~3B active) | Z.ai. Leads the 30B class on coding and agentic benchmarks (59% SWE-bench Verified, 79% Tau2-Bench). Strong math (92% AIME 2025). Interleaved thinking preserves reasoning context across tool calls. 200K context. |
| Gemma 3 | 270M, 1B, 4B, 12B, 27B | Google. Versatile all-rounder with vision support (4B+). The 27B is among the highest-rated open models on LMArena. 128K context. Excellent quality-to-size ratio across the full range. |
| Qwen 3 | 0.6B, 1.7B, 4B, 8B, 14B | Alibaba. Dual-mode thinking (reasoning on/off in one model). 119 languages, strong math and tool calling. Each size matches a Qwen 2.5 model roughly 2x larger. |
| Falcon H1R | 7B | TII. Hybrid Transformer + Mamba2 reasoning model. 88% on AIME 2024, outperforming many models up to 7x its size on math benchmarks. Exceptional inference speed (~1,500 tok/sec/GPU). 256K context. |
| Falcon 3 | 3B, 7B, 10B | TII. Open-weight dense models. Solid general-purpose chat with math and code. |
| Llama 3 | 1B, 3B, 8B, 70B | Meta. Well-rounded, large community. 131K context. Tool calling on 8B (3.1) and 70B (3.3). |
| Phi 4 | 3.8B (Mini), 14.7B | Microsoft. Compact and efficient. Strong for its size class, good tool-calling support. |
| QwQ | 32.5B | Alibaba. Dedicated reasoning model with math, coding, and tool calling. 40K context. |
| Nemotron 3 Nano | 30B (MoE, ~3.5B active) | NVIDIA. Hybrid Mamba-2/Transformer reasoning model. 1M context. Strong on math and coding. |
| SmolLM3 | 3B | HuggingFace. Lightweight, math and code capable. 65K context. |

Code Generation Models

| Family | Sizes | Strengths |
|--------|-------|-----------|
| Devstral | 24B | Mistral. Purpose-built for agentic software engineering. 68% on SWE-bench Verified, the highest among open models under 30B. 393K context. Vision capable. |
| DeepSeek Coder | 16B | Specialized code generation. 163K context. |
| DeepSeek R1 | 8B (distilled) | Code and math reasoning. Distilled from the full R1 model. |

Mistral Family (Chat, Vision, Reasoning)

| Family | Sizes | Strengths |
|--------|-------|-----------|
| Ministral 3 | 3B, 8B, 14B | Edge-optimized with vision and tool calling. 262K context. Great for on-device deployment. |
| Mistral Small 3.2 | 24B | Strong tool calling and code. 131K context. |
| Magistral Small | 24B | Reasoning specialist with transparent chain-of-thought. Tool calling support. |
| Pixtral | 12B | Vision-language model. 1M context. |

Vision / Multimodal Models

| Family | Sizes | Strengths |
|--------|-------|-----------|
| Qwen 3 VL | 2B, 4B, 8B, 30B | Vision-language with tool calling, code, and math. 262K context. The 30B is an MoE variant. |
| MiniCPM | 8B, 9B | OpenBMB. Compact vision models. MiniCPM-o 4.5 is the latest with strong visual understanding. |
| LightOnOCR | 1B | Specialized for OCR tasks. Lightweight. |

Enterprise and Long-Context Models

| Family | Sizes | Strengths |
|--------|-------|-----------|
| Granite 4 Hybrid | 3B, 7B (MoE) | IBM. Hybrid Mamba-2/Transformer. Up to 1M token context with 70% less memory than standard transformers. ISO 42001 certified. Strong instruction following and function calling. |

Embedding Models

| Family | Sizes | Strengths |
|--------|-------|-----------|
| Embedding Gemma | 300M | Google. Derived from Gemma. Highest-ranked open model under 500M params on MTEB. Excellent default for lightweight RAG. 2K context. |
| Qwen 3 Embedding | 0.6B, 4B, 8B | Top 3 on MTEB multilingual (8B variant). 32K/40K context. Best overall accuracy for RAG, especially multilingual and code. |
| Nomic Embed Text | 137M | Outperforms OpenAI Ada-002. Excellent quality-to-size ratio. 2K context. |
| Nomic Embed Vision | 92M | Image embeddings for visual search and similarity. ONNX format. |
| BGE-M3 | 568M | BAAI. Multilingual embeddings. 8K context. |
| BGE Small | 33M | BAAI. Ultra-lightweight English embeddings. 512 context. |
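Whichever family you pick, the retrieval mechanics are the same: the model maps each text to a vector, and search reduces to nearest-neighbor comparison, typically by cosine similarity. A minimal, library-free sketch of that step (the vectors below are toy values, not real model output):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 3-dimensional "embeddings"; a real model such as Embedding Gemma
# or Qwen 3 Embedding would produce vectors with hundreds of dimensions.
docs = {
    "refund policy":  [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.8, 0.2],
    "api reference":  [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # pretend embedding of "how do I get a refund?"

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # → refund policy
```

In production the vectors come from the embedding model and usually live in a vector index rather than a Python dict, but the similarity computation is exactly this.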

Reranking Models

| Family | Sizes | Strengths |
|--------|-------|-----------|
| BGE M3 Reranker | 568M | Reranks search candidates by relevance. Use after initial embedding retrieval for better accuracy. |
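The two-stage pattern a reranker is built for: cheap embedding retrieval narrows the whole corpus to a short candidate list, then the slower but more accurate reranker reorders just those candidates. A sketch of the control flow, where `embed_score` and `rerank_score` are hypothetical placeholders standing in for real model calls:

```python
def embed_score(query: str, doc: str) -> float:
    # Placeholder for bi-encoder similarity (query and doc embedded
    # separately, then compared). Here: naive word overlap.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def rerank_score(query: str, doc: str) -> float:
    # Placeholder for a cross-encoder reranker, which reads query and
    # document together; slower per pair, but more accurate.
    return embed_score(query, doc) + (1.0 if query.lower() in doc.lower() else 0.0)

def search(query, corpus, k_retrieve=50, k_final=5):
    # Stage 1: cheap retrieval over the whole corpus.
    candidates = sorted(corpus, key=lambda d: embed_score(query, d),
                        reverse=True)[:k_retrieve]
    # Stage 2: expensive reranking over the short candidate list only.
    return sorted(candidates, key=lambda d: rerank_score(query, d),
                  reverse=True)[:k_final]

corpus = ["reset your password", "password reset steps explained", "billing faq"]
print(search("password reset", corpus, k_retrieve=3, k_final=2))
# → ['password reset steps explained', 'reset your password']
```

The design point is cost: the reranker scores only `k_retrieve` pairs instead of the full corpus, so you can afford a much stronger model in stage 2.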

Speech-to-Text Models

| Family | Sizes | Strengths |
|--------|-------|-----------|
| Whisper | 39M to 1.5B | OpenAI. Seven size tiers from tiny to large-v3-turbo. The turbo variant offers the best speed/quality balance. |

Specialized Models

| Family | Sizes | Strengths |
|--------|-------|-----------|
| LM-Kit Sentiment Analysis | 1.2B | Finetuned for sentiment and emotion detection. |
| LM-Kit Sarcasm Detection | 1.1B | Finetuned for sarcasm detection. |
| U2-Net | 44M | Image segmentation. |

Tip: Models marked with a "Replaced by" indicator in the Model Catalog have a newer successor. Prefer the replacement model for new projects.


Compare Models with Public Leaderboards

Model benchmarks evolve rapidly. Use these independent leaderboards to compare models before committing:

| Leaderboard | What It Measures | Link |
|-------------|------------------|------|
| LMArena Chatbot Arena | Human preference Elo ratings from blind A/B tests | lmarena.ai |
| Open LLM Leaderboard | Standardized benchmarks (MMLU, ARC, HellaSwag, etc.) for open models | huggingface.co/open-llm-leaderboard |
| MTEB Leaderboard | Embedding model quality (retrieval, classification, clustering, reranking) | huggingface.co/mteb |
| SWE-bench Verified | Real-world coding ability (fixing GitHub issues) | swebench.com |
| BFCL Leaderboard | Function/tool calling accuracy | gorilla.cs.berkeley.edu |
| LiveCodeBench | Code generation on fresh, unseen problems | livecodebench.github.io |
| Open VLM Leaderboard | Vision-language model comparison | huggingface.co/opencompass |
| AIME Benchmark | Mathematical reasoning (competition-level math problems) | Results published per model release |

Tip: The benchmark scores quoted in the tables above are self-reported by model authors and may use different evaluation settings. Cross-reference multiple leaderboards, and always test on your own data before choosing a model for production.

