Model Families and Benchmarks

This page describes every model family in the LM-Kit catalog, organized by category. Use it as a reference when comparing models or exploring alternatives beyond the recommended starting points.

Looking for a quick recommendation? See Choosing the Right Model for the step-by-step decision guide, or Model Recommendations for hardware-based picks and ready-made stacks.


Chat and Reasoning Models

| Family | Sizes | Strengths |
|--------|-------|-----------|
| GPT OSS | 20B (MoE, ~3.6B active) | OpenAI open-weight. Near o3-mini on reasoning benchmarks (96% AIME 2024). Configurable reasoning effort. Strong agentic and tool-use capabilities. Runs on 16 GB VRAM thanks to MoE efficiency. |
| GLM 4.7 | 30B (MoE, ~3B active) | Z.ai. Leads the 30B class on coding and agentic benchmarks (59% SWE-bench Verified, 79% Tau2-Bench). Strong math (92% AIME 2025). Interleaved thinking preserves reasoning context across tool calls. 200K context. |
| Gemma 3 | 270M, 1B, 4B, 12B, 27B | Google. Versatile all-rounder with vision support (4B+). The 27B is among the highest-rated open models on LMArena. 128K context. Excellent quality-to-size ratio across the full range. |
| Qwen 3 | 0.6B, 1.7B, 4B, 8B, 14B | Alibaba. Dual-mode thinking (reasoning on/off in one model). 119 languages, strong math and tool calling. Each size matches a Qwen 2.5 model roughly 2x larger. |
| Falcon H1R | 7B | TII. Hybrid Transformer + Mamba2 reasoning model. 88% on AIME 2024, outperforming many models up to 7x its size on math benchmarks. Exceptional inference speed (~1,500 tok/sec/GPU). 256K context. |
| Falcon 3 | 3B, 7B, 10B | TII. Open-weight dense models. Solid general-purpose chat with math and code. |
| Llama 3 | 1B, 3B, 8B, 70B | Meta. Well-rounded, large community. 131K context. Tool calling on 8B (3.1) and 70B (3.3). |
| Phi 4 | 3.8B (Mini), 14.7B | Microsoft. Compact and efficient. Strong for its size class, good tool-calling support. |
| QwQ | 32.5B | Alibaba. Dedicated reasoning model with math, coding, and tool calling. 40K context. |
| Nemotron 3 Nano | 30B (MoE, ~3.5B active) | NVIDIA. Hybrid Mamba-2/Transformer reasoning model. 1M context. Strong on math and coding. |
| SmolLM3 | 3B | HuggingFace. Lightweight, math and code capable. 65K context. |

Code Generation Models

| Family | Sizes | Strengths |
|--------|-------|-----------|
| Devstral | 24B | Mistral. Purpose-built for agentic software engineering. 68% on SWE-bench Verified, the highest among open models under 30B. 393K context. Vision capable. |
| DeepSeek Coder | 16B | Specialized code generation. 163K context. |
| DeepSeek R1 | 8B (distilled) | Code and math reasoning. Distilled from the full R1 model. |

Mistral Family (Chat, Vision, Reasoning)

| Family | Sizes | Strengths |
|--------|-------|-----------|
| Ministral 3 | 3B, 8B, 14B | Edge-optimized with vision and tool calling. 262K context. Great for on-device deployment. |
| Mistral Small 3.2 | 24B | Strong tool calling and code. 131K context. |
| Magistral Small | 24B | Reasoning specialist with transparent chain-of-thought. Tool calling support. |
| Pixtral | 12B | Vision-language model. 1M context. |

Vision / Multimodal Models

| Family | Sizes | Strengths |
|--------|-------|-----------|
| Qwen 3 VL | 2B, 4B, 8B, 30B | Vision-language with tool calling, code, and math. 262K context. The 30B is an MoE variant. |
| MiniCPM | 8B, 9B | OpenBMB. Compact vision models. MiniCPM-o 4.5 is the latest with strong visual understanding. |
| LightOnOCR | 1B | Specialized for OCR tasks. Lightweight. |

Enterprise and Long-Context Models

| Family | Sizes | Strengths |
|--------|-------|-----------|
| Granite 4 Hybrid | 3B, 7B (MoE) | IBM. Hybrid Mamba-2/Transformer. Up to 1M token context with 70% less memory than standard transformers. ISO 42001 certified. Strong instruction following and function calling. |

Embedding Models

| Family | Sizes | Strengths |
|--------|-------|-----------|
| Embedding Gemma | 300M | Google. Derived from Gemma. Highest-ranked open model under 500M params on MTEB. Excellent default for lightweight RAG. 2K context. |
| Qwen 3 Embedding | 0.6B, 4B, 8B | Top 3 on MTEB multilingual (8B variant). 32K/40K context. Best overall accuracy for RAG, especially multilingual and code. |
| Nomic Embed Text | 137M | Outperforms OpenAI Ada-002. Excellent quality-to-size ratio. 2K context. |
| Nomic Embed Vision | 92M | Image embeddings for visual search and similarity. ONNX format. |
| BGE-M3 | 568M | BAAI. Multilingual embeddings. 8K context. |
| BGE Small | 33M | BAAI. Ultra-lightweight English embeddings. 512 context. |
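Whichever family you pick, the retrieval mechanics are the same: the model maps each text to a vector, and search reduces to nearest-neighbor comparison, typically by cosine similarity. A minimal, library-free sketch of that step (the vectors below are toy values, not real model output):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy 3-dimensional "embeddings"; a real model such as Embedding Gemma
# or Qwen 3 Embedding would produce vectors with hundreds of dimensions.
docs = {
    "refund policy":  [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.8, 0.2],
    "api reference":  [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # pretend embedding of "how do I get a refund?"

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked[0])  # → refund policy
```

In production the vectors come from the embedding model and usually live in a vector index rather than a Python dict, but the similarity computation is exactly this.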

Reranking Models

| Family | Sizes | Strengths |
|--------|-------|-----------|
| BGE M3 Reranker | 568M | Reranks search candidates by relevance. Use after initial embedding retrieval for better accuracy. |
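The two-stage pattern a reranker is built for: cheap embedding retrieval narrows the whole corpus to a short candidate list, then the slower but more accurate reranker reorders just those candidates. A sketch of the control flow, where `embed_score` and `rerank_score` are hypothetical placeholders standing in for real model calls:

```python
def embed_score(query: str, doc: str) -> float:
    # Placeholder for bi-encoder similarity (query and doc embedded
    # separately, then compared). Here: naive word overlap.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def rerank_score(query: str, doc: str) -> float:
    # Placeholder for a cross-encoder reranker, which reads query and
    # document together; slower per pair, but more accurate.
    return embed_score(query, doc) + (1.0 if query.lower() in doc.lower() else 0.0)

def search(query, corpus, k_retrieve=50, k_final=5):
    # Stage 1: cheap retrieval over the whole corpus.
    candidates = sorted(corpus, key=lambda d: embed_score(query, d),
                        reverse=True)[:k_retrieve]
    # Stage 2: expensive reranking over the short candidate list only.
    return sorted(candidates, key=lambda d: rerank_score(query, d),
                  reverse=True)[:k_final]

corpus = ["reset your password", "password reset steps explained", "billing faq"]
print(search("password reset", corpus, k_retrieve=3, k_final=2))
# → ['password reset steps explained', 'reset your password']
```

The design point is cost: the reranker scores only `k_retrieve` pairs instead of the full corpus, so you can afford a much stronger model in stage 2.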

Speech-to-Text Models

| Family | Sizes | Strengths |
|--------|-------|-----------|
| Whisper | 39M to 1.5B | OpenAI. Seven size tiers from tiny to large-v3-turbo. The turbo variant offers the best speed/quality balance. |

Specialized Models

| Family | Sizes | Strengths |
|--------|-------|-----------|
| LM-Kit Sentiment Analysis | 1.2B | Finetuned for sentiment and emotion detection. |
| LM-Kit Sarcasm Detection | 1.1B | Finetuned for sarcasm detection. |
| U2-Net | 44M | Image segmentation. |

Tip: Models marked with a "Replaced by" indicator in the Model Catalog have a newer successor. Prefer the replacement model for new projects.


Compare Models with Public Leaderboards

Model benchmarks evolve rapidly. Use these independent leaderboards to compare models before committing:

| Leaderboard | What It Measures | Link |
|-------------|------------------|------|
| LMArena Chatbot Arena | Human preference Elo ratings from blind A/B tests | lmarena.ai |
| Open LLM Leaderboard | Standardized benchmarks (MMLU, ARC, HellaSwag, etc.) for open models | huggingface.co/open-llm-leaderboard |
| MTEB Leaderboard | Embedding model quality (retrieval, classification, clustering, reranking) | huggingface.co/mteb |
| SWE-bench Verified | Real-world coding ability (fixing GitHub issues) | swebench.com |
| BFCL Leaderboard | Function/tool calling accuracy | gorilla.cs.berkeley.edu |
| LiveCodeBench | Code generation on fresh, unseen problems | livecodebench.github.io |
| Open VLM Leaderboard | Vision-language model comparison | huggingface.co/opencompass |
| AIME Benchmark | Mathematical reasoning (competition-level math problems) | Results published per model release |

Tip: The benchmark scores quoted in the tables above are self-reported by model authors and may use different evaluation settings. Cross-reference multiple leaderboards, and always test on your own data before choosing a model for production.

