What is Model Distillation?


TL;DR

Model distillation (or knowledge distillation) is the process of transferring the capabilities of a large, expensive "teacher" model into a smaller, faster "student" model. The teacher generates high-quality outputs that serve as training data for the student, which learns to replicate the teacher's behavior at a fraction of the size and cost. This is how many of the best small language models are created: a 70B-parameter teacher produces reasoning traces and answers that train a 7B student to perform nearly as well on specific tasks. Distillation is the bridge between "best possible quality" and "deployable in production," and it is foundational to the 2026 trend of heterogeneous architectures where expensive models plan and cheap models execute.


What Exactly is Model Distillation?

Training a language model from scratch requires enormous amounts of raw text data. The resulting model has broad knowledge but no task-specific optimization. Instruction tuning teaches it to follow commands, but the best-performing models are still very large (30B-70B+ parameters), making them expensive to deploy.

Distillation offers an alternative path: instead of training a small model from scratch, you teach it by example using a large model's outputs:

Teacher Model (70B parameters)
    |
    | Generates high-quality outputs:
    |   "Explain quantum entanglement" → [excellent explanation]
    |   "Debug this Python code"       → [correct fix with reasoning]
    |   "Classify this document"       → [accurate classification]
    |
    v
Training Dataset: 10,000-100,000 (input, teacher_output) pairs
    |
    v
Student Model (7B parameters)
    |
    | Trained on teacher's outputs
    | Learns to replicate teacher's quality
    |
    v
Result: Small model with large-model quality on target tasks

The student model cannot match the teacher on every task, but on the specific tasks it was distilled for, it often achieves 85-95% of the teacher's quality at 10-20% of the cost.

Why Distillation Works

A large model's outputs contain more information than raw text data:

  • Reasoning patterns: The teacher demonstrates how to approach problems step by step
  • Output format: The student learns the teacher's response structure and style
  • Nuanced judgments: The teacher encodes subtle distinctions that raw training data does not make explicit
  • Error avoidance: The teacher's outputs (mostly) avoid the mistakes that raw data might contain

This is sometimes called dark knowledge: information about the task that is implicit in the teacher's outputs but not explicitly present in any training document.


Why Model Distillation Matters

  1. Production Cost Reduction: A distilled 4B model serving 10,000 requests per day costs a fraction of what a 70B model would. For many applications, distillation is the difference between a viable product and an unsustainable prototype.

  2. Latency Reduction: Smaller models generate tokens faster. For real-time applications (chatbots, coding assistants, agents with many tool calls), latency matters. A distilled model can be 5-10x faster than its teacher.

  3. Edge Deployment: Distilled models are small enough to run on consumer hardware, mobile devices, and edge AI deployments where large models simply cannot fit in memory. See Quantization for further size reduction.

  4. Heterogeneous Architectures: The 2026 architectural trend uses expensive reasoning models for planning and cheap distilled models for execution. This "Plan-and-Execute" pattern can reduce costs by 90%. See Route Prompts Across Models.

  5. Task Specialization: Distillation naturally produces task specialists. A general 70B model distilled on medical Q&A produces a compact medical specialist that outperforms the original on in-domain tasks.

  6. Privacy and Data Sovereignty: Organizations that cannot send data to cloud-hosted large models can distill capabilities into models small enough to run entirely on premises.


Technical Insights

Distillation Techniques

1. Response-Based Distillation (Standard)

The simplest and most common approach. The teacher generates complete responses, and the student is trained to produce the same outputs:

For each training example:
  1. Feed input prompt to teacher model
  2. Teacher generates response
  3. Train student on (input, teacher_response) pair

This is essentially supervised fine-tuning where the training data comes from a model rather than human annotators.
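The three-step loop above can be sketched in a few lines of Python. `call_teacher` is a hypothetical stand-in for querying a real teacher model (e.g. an API call to a hosted 70B model); it is stubbed here so the sketch is self-contained.

```python
# Minimal sketch of response-based distillation data collection.
# call_teacher is a hypothetical placeholder, stubbed so the example
# runs on its own; a real pipeline would call the large teacher model.

def call_teacher(prompt: str) -> str:
    """Placeholder teacher: a real implementation would query the 70B model."""
    return f"[teacher response to: {prompt}]"

def build_distillation_dataset(prompts: list[str]) -> list[dict]:
    """Collect (input, teacher_output) pairs for supervised fine-tuning."""
    return [{"input": p, "output": call_teacher(p)} for p in prompts]

prompts = [
    "Explain quantum entanglement",
    "Debug this Python code",
    "Classify this document",
]
dataset = build_distillation_dataset(prompts)
# Each pair is then used as an ordinary supervised fine-tuning example
# for the student model.
```

From here, the student is trained with any standard SFT stack; the only difference from ordinary fine-tuning is that the labels come from a model instead of annotators.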

2. Reasoning Trace Distillation

The teacher generates not just the answer but the full reasoning process. The student learns to reason, not just to produce the right output:

Teacher output:
  "Let me analyze this step by step.
   1. The function takes a list as input
   2. The loop iterates over each element
   3. The condition checks for even numbers
   4. Bug: the index starts at 1 instead of 0
   Therefore, the fix is to change the loop start index."

Student learns: the reasoning pattern, not just "change index to 0"

This produces students that generalize better to new problems because they have learned the reasoning process, not just the answers. This technique benefits from inference-time compute scaling on the teacher side: letting the teacher reason extensively produces higher-quality training data.
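Because the teacher's outputs only mostly avoid mistakes, reasoning-trace pipelines often verify the final answer before keeping a pair, so the student never trains on faulty reasoning. A minimal sketch, assuming a hypothetical trace format where the last line ends in the answer:

```python
# Sketch: quality-filtering reasoning traces before training. The trace
# format ("Therefore, the answer is: <answer>" on the last line) is an
# illustrative assumption, not a standard; real pipelines vary.

def extract_final_answer(trace: str) -> str:
    """Pull the answer from a trace whose last line ends in ': <answer>'."""
    last_line = trace.strip().splitlines()[-1]
    return last_line.split(":")[-1].strip()

def filter_traces(examples):
    """Keep only (task, trace, reference) triples whose answer checks out."""
    kept = []
    for task, trace, reference in examples:
        if extract_final_answer(trace) == reference:
            kept.append({"input": task, "target": trace})  # train on full trace
    return kept

examples = [
    ("What is 2 + 2?",
     "1. Add the two numbers.\nTherefore, the answer is: 4", "4"),
    ("What is 3 * 3?",
     "1. Multiply.\nTherefore, the answer is: 8", "9"),  # wrong trace, dropped
]
dataset = filter_traces(examples)
```

Note that the surviving pair's training target is the whole trace, not just the verified answer, which is what lets the student learn the reasoning pattern.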

3. Logit-Based Distillation (Soft Labels)

Instead of training on the teacher's final output text, the student learns to match the teacher's full next-token probability distribution over the vocabulary (computed from its logits):

Teacher's prediction for next token:
  "Paris"     : 0.85 probability
  "Lyon"      : 0.08 probability
  "Marseille" : 0.04 probability
  ...

Hard label: "Paris" (just the top prediction)
Soft label: full distribution (all probabilities)

Soft labels contain richer information: the student learns not just that "Paris" is correct but that "Lyon" is a reasonable alternative while "banana" is not. This produces better-calibrated students with smoother probability distributions.
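This objective is typically implemented as a loss between temperature-softened distributions. A minimal pure-Python sketch over a toy four-token vocabulary (logit values are illustrative, not from any real model):

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into a probability distribution; a higher temperature
    flattens it, exposing the relative ranking of non-top tokens."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the student's softened distribution to the
    teacher's soft labels: zero when the student matches exactly."""
    p = softmax(teacher_logits, temperature)  # teacher's soft labels
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy logits over ["Paris", "Lyon", "Marseille", "banana"]
teacher = [5.0, 2.6, 1.9, -4.0]
student = [4.0, 1.0, 0.5, -1.0]

loss = soft_label_loss(teacher, student)     # positive: distributions differ
perfect = soft_label_loss(teacher, teacher)  # 0: student matches teacher
```

Hard-label training would keep only the argmax ("Paris"); the KL term additionally penalizes the student for, say, ranking "banana" above "Lyon", which is exactly the extra signal soft labels carry.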

4. Progressive Distillation

Distill in stages through a chain of progressively smaller models:

70B Teacher → 30B Intermediate → 7B Student

Each step loses less information than jumping directly from 70B to 7B, producing a higher-quality final student.
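The chaining pattern is simple: each stage's student becomes the next stage's teacher. A sketch, where `distill_into` is a placeholder for a full generate-data-then-fine-tune cycle and the model names are illustrative:

```python
# Sketch of a progressive distillation chain. distill_into stands in for a
# complete distillation round (teacher generates data, student fine-tunes).

def distill_into(teacher: str, student: str) -> str:
    """Placeholder: generate data with `teacher`, fine-tune `student` on it."""
    return student  # a real version would return the trained student model

chain = ["70B-teacher", "30B-intermediate", "7B-student"]

teacher = chain[0]
for student in chain[1:]:
    teacher = distill_into(teacher, student)  # each student teaches the next

final_student = teacher  # reached via the 30B intermediate, not in one jump
```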

How distillation relates to neighboring techniques:

Technique                  What It Does                                  Relationship to Distillation
-------------------------  --------------------------------------------  -------------------------------------------
Distillation               Large model teaches small model via outputs   Core technique
LoRA Adapters              Add small trainable layers to a base model    Can use distilled data for LoRA training
Quantization               Reduce numerical precision of weights         Complementary: distill first, then quantize
Synthetic Data Generation  Generate training data with LLMs              Distillation is a specific form of this
Instruction Tuning         Train on instruction-response pairs           Distillation uses teacher-generated pairs

In practice, these techniques are combined: distill a large model into a smaller one, apply LoRA for additional task specialization, then quantize for deployment.

Quality Factors

What determines distillation quality:

  • Teacher quality: A better teacher produces better students. Use the best available model as teacher.
  • Data diversity: Training examples should cover the full range of tasks the student will encounter.
  • Data volume: More examples generally help, but quality matters more than quantity. 10,000 high-quality examples often beat 100,000 mediocre ones.
  • Task alignment: Examples should closely match the student's deployment tasks.
  • Student capacity: The student must be large enough to absorb the teacher's knowledge. A 0.5B model cannot capture everything a 70B model knows.

Practical Use Cases

  • Deploying Cost-Effective Agents: Distill a large model's planning and reasoning capabilities into a smaller model that handles routine agent tasks, reserving the large model for complex queries. See Route Prompts Across Models.

  • Domain-Specific Specialists: Use a large general model to generate domain-specific training data (medical, legal, financial), then train a compact specialist via LoRA adapters on the distilled data. See Prepare Training Datasets.

  • On-Device Deployment: Distill cloud-quality capabilities into models small enough to run on consumer hardware or edge devices, enabling offline and privacy-preserving AI applications.

  • Improving Small Model Quality: When a small language model underperforms on a specific task, distillation from a larger model is often more effective than collecting and annotating training data manually.

  • Building Evaluation Data: Use a large model to generate ground-truth answers for an evaluation benchmark, then use those answers to both train and evaluate smaller models.


Key Terms

  • Model Distillation: The process of training a small student model to replicate the behavior of a large teacher model.

  • Teacher Model: The large, capable model that generates training data for distillation.

  • Student Model: The smaller model being trained on the teacher's outputs.

  • Dark Knowledge: Information implicit in the teacher's outputs (probability distributions, reasoning patterns, stylistic choices) that is not present in raw training data.

  • Soft Labels: The teacher's full probability distribution over the vocabulary, which provides richer training signal than the single best prediction (hard label).

  • Response Distillation: Training the student on the teacher's complete text outputs.

  • Logit Distillation: Training the student on the teacher's next-token probability distributions (soft labels derived from its logits) rather than just the final output text.

  • Progressive Distillation: Multi-stage distillation through a chain of progressively smaller models.

  • Capability Gap: The performance difference between teacher and student, which distillation aims to minimize.





Summary

Model distillation is the practical technique that makes high-quality AI accessible and affordable. By training smaller student models on the outputs of larger teacher models, distillation transfers capabilities, including reasoning patterns, output quality, and domain knowledge, at a fraction of the deployment cost. The result is compact, fast models that achieve 85-95% of the teacher's quality on targeted tasks while being small enough for edge deployment, cheap enough for high-volume production, and fast enough for latency-sensitive agent workflows. Combined with LoRA adapters for further task specialization and quantization for additional size reduction, distillation is the cornerstone of the heterogeneous architecture trend: expensive LLMs reason and plan, distilled SLMs execute at scale.
