Understanding the Transformer Architecture


TL;DR

The Transformer is the neural network architecture behind virtually all modern large language models. Introduced in 2017, it replaced recurrent networks by processing entire sequences in parallel through a mechanism called self-attention. All models in the LM-Kit.NET catalog (Gemma, Qwen, Llama, Phi, GPT-OSS, GLM, Whisper) are transformer-based. Understanding the transformer explains why concepts like context windows, KV-cache, tokenization, attention, and quantization work the way they do.


What is the Transformer?

Definition: The Transformer is a deep learning architecture that processes sequential data (text, audio, images) using self-attention instead of recurrence. Unlike RNNs and LSTMs, which read input one element at a time, transformers compute relationships between all elements simultaneously. This parallelism enabled the training of models with billions of parameters, launching the era of large language models.

The Original Design

The 2017 "Attention Is All You Need" paper introduced two configurations:

Encoder-Decoder Transformer          Decoder-Only Transformer
(used for translation)               (used for text generation)

+-------------------+                +------------------+
| Encoder           |                | Decoder          |
| (reads input)     |                | (generates text) |
|                   |                |                  |
| Self-Attention    |                | Masked Self-     |
| Feed-Forward      |                | Attention        |
| Normalization     |                | Feed-Forward     |
+-------------------+                | Normalization    |
         |                           +------------------+
      context
         |
         v
+-------------------+
| Decoder           |
| (generates output)|
| Cross-Attention   |
| Feed-Forward      |
+-------------------+
  • Encoder-Decoder: The encoder reads the full input and the decoder generates output while attending to the encoded representation. Used in sequence-to-sequence models such as T5, BART, and Whisper.
  • Decoder-Only: A single stack of decoder layers generates text autoregressively, one token at a time. This is the architecture used by nearly all modern LLMs, including every model in the LM-Kit.NET catalog.
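The autoregressive loop that decoder-only models run can be sketched in a few lines. This is a minimal NumPy sketch with a toy stand-in for the model (a real transformer replaces `forward` with the full layer stack); all names and dimensions here are illustrative, not LM-Kit.NET APIs:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 16          # toy vocabulary size
D = 8               # toy hidden dimension

# Stand-in for a decoder-only transformer: any function that maps the
# full token prefix to a logit vector over the vocabulary.
W_embed = rng.normal(size=(VOCAB, D))
W_out = rng.normal(size=(D, VOCAB))

def forward(tokens):
    """Toy 'model': mean of token embeddings projected to vocabulary logits."""
    h = W_embed[tokens].mean(axis=0)
    return h @ W_out                         # logits, shape (VOCAB,)

def generate(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        logits = forward(np.array(tokens))   # one forward pass per new token
        tokens.append(int(np.argmax(logits)))  # greedy: take the top logit
    return tokens

out = generate([1, 2, 3], n_new=4)
print(out)  # 3 prompt tokens followed by 4 generated tokens
```

Each iteration conditions on the entire prefix, which is exactly why caching per-token attention state (the KV-cache) pays off during generation.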

Inside a Transformer Layer

Every transformer model is a stack of identical layers (or "blocks"). A typical decoder-only model has 24 to 80+ layers. Each layer contains:

Input tokens (embeddings)
         |
         v
+------------------------------+
|  1. Layer Normalization      |  Stabilize activations
+------------------------------+
         |
         v
+------------------------------+
|  2. Multi-Head Self-Attention|  Learn token relationships
|     (the core mechanism)     |  [see: Attention Mechanism]
+------------------------------+
         |  + residual connection
         v
+------------------------------+
|  3. Layer Normalization      |  Stabilize again
+------------------------------+
         |
         v
+------------------------------+
|  4. Feed-Forward Network     |  Process each token independently
|     (two linear layers       |  through a wider hidden dimension
|      with activation)        |
+------------------------------+
         |  + residual connection
         v
Output (passed to next layer)

1. Self-Attention

The attention mechanism is the defining innovation. For each token, it computes how much to "attend to" every other token in the sequence. This allows the model to capture long-range dependencies ("The cat that sat on the mat was sleeping" connects "was" to "cat" across six tokens).

Self-attention operates through three learned projections:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information do I carry?"

In decoder-only models, causal masking ensures each token can only attend to tokens before it (not future tokens), preserving the autoregressive property needed for text generation.
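The whole mechanism, including the causal mask, fits in a short NumPy sketch. The Q/K/V projections and the scaled dot-product softmax(QK^T / sqrt(d_k))V follow the original paper; the weight values and dimensions here are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(X, Wq, Wk, Wv):
    """Single-head causal self-attention over a (seq_len, d_model) input."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # learned projections
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # (seq, seq) relevance scores
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores) # hide future tokens
    return softmax(scores) @ V               # weighted sum of values

rng = np.random.default_rng(0)
seq, d = 4, 8
X = rng.normal(size=(seq, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = causal_self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Because of the mask, perturbing the last token cannot change the outputs at earlier positions, which is the autoregressive property in action.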

2. Multi-Head Attention

Instead of one attention computation, the model runs multiple attention "heads" in parallel. Each head can learn different relationship patterns (syntactic structure, semantic meaning, positional proximity). The results are concatenated and projected back to the model dimension.
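The split-compute-concatenate-project pattern can be sketched as follows (a NumPy sketch without causal masking, kept minimal for clarity; dimensions and names are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Multi-head self-attention; masking omitted for brevity."""
    seq, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # split the model dimension into independent heads: (heads, seq, d_head)
    split = lambda M: M.reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ Vh                     # (heads, seq, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ Wo                               # project back to d_model

rng = np.random.default_rng(0)
seq, d_model, n_heads = 4, 8, 2
X = rng.normal(size=(seq, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads).shape)  # (4, 8)
```

Each head attends over the full sequence but in its own lower-dimensional subspace, which is what lets different heads specialize in different relationship patterns.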

3. Feed-Forward Network

After attention aggregates information across tokens, a feed-forward network (two linear transformations with a nonlinear activation) processes each token independently. This is where much of the model's "knowledge" is stored in its weights.
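The two-layer structure is simple to sketch. This NumPy sketch uses GELU as the activation (one common choice; Llama-style models use SwiGLU instead), and the key property is that each token's vector is transformed independently:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, a common FFN activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    """Two linear layers with a nonlinearity, applied to each token separately."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32            # hidden width is typically ~4x d_model
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(3, d_model))    # three tokens, processed independently
print(feed_forward(x, W1, b1, W2, b2).shape)  # (3, 8)
```

Because the FFN never mixes information across positions, running it on one token alone gives the same result as running it on the whole batch; only attention moves information between tokens.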

4. Residual Connections and Normalization

Residual connections (adding the input of each sub-layer to its output) prevent the vanishing gradient problem in deep stacks. Layer normalization keeps activations stable across layers.
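The pre-norm wiring shown in the layer diagram above can be sketched with stand-in sub-layers (a NumPy sketch, not LM-Kit.NET code; classic LayerNorm is used here, while modern Llama-style models use RMSNorm):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def transformer_block(x, attention, ffn):
    """Pre-norm wiring: normalize, apply the sub-layer, add the residual."""
    x = x + attention(layer_norm(x))   # residual around attention
    x = x + ffn(layer_norm(x))         # residual around the feed-forward net
    return x

# Placeholder sub-layers make the wiring visible without real weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = transformer_block(x, attention=lambda h: 0.5 * h, ffn=lambda h: 0.5 * h)
print(out.shape)  # (4, 8)
```

Because the input is added back after each sub-layer, gradients have a direct path through the whole stack, which is what makes 80-layer models trainable.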


How the Transformer Connects to LM-Kit.NET Concepts

Each transformer component maps directly to an LM-Kit.NET concept:

  • Layer count (LM.LayerCount): more layers mean more capacity but also more memory; distributed inference splits layers across GPUs.
  • Self-attention (Attention Mechanism): the core operation; its compute cost scales quadratically with sequence length.
  • KV vectors (KV-Cache): attention keys and values are cached per token to avoid recomputation during generation.
  • Maximum sequence length (Context Windows): the transformer's positional encoding limits how many tokens it can process.
  • Positional encoding (LM.RopeAlgorithm): how the model encodes token positions; RoPE (Rotary Position Embedding) is the standard in modern LLMs.
  • Weight matrices (Weights, Quantization): each layer's attention and feed-forward parameters; quantization compresses these to reduce memory.
  • Vocabulary projection (Logits, Tokenization): the final layer projects hidden states to vocabulary logits for token prediction.
  • Autoregressive generation (Inference): the model generates one token per forward pass, each conditioned on all previous tokens.
  • Model architecture string (LM.Architecture, ModelCard.Architecture): identifies the transformer variant: "llama", "qwen2", "phi3", "whisper", etc.

Transformer Variants in the LM-Kit.NET Catalog

  • LLaMA (Llama 3.1, Gemma 3): RoPE embeddings, RMSNorm, SwiGLU activation.
  • Qwen2/3 (Qwen 3, Qwen 2 VL): Grouped-Query Attention, extended context.
  • Phi (Phi-4, Phi-4 Mini): compact architecture trained on high-quality data.
  • GPT-OSS (GPT-OSS 20B): long context (131k tokens), strong tool calling.
  • GLM (GLM 4.7 Flash): Mixture of Experts (MoE) with efficient routing.
  • Whisper (Whisper family): encoder-decoder transformer for speech-to-text.
  • BERT/Nomic (embedding models): encoder-only transformer for embeddings.

Why the Transformer Won

The transformer replaced RNNs and LSTMs for three fundamental reasons:

  1. Parallelism: RNNs process tokens sequentially (token 1, then token 2, then token 3...). Transformers process all tokens simultaneously during training, enabling massive GPU parallelism and scaling to billions of parameters.

  2. Long-Range Dependencies: RNNs struggle to connect information across long sequences (the "vanishing gradient" problem). Self-attention directly connects every token to every other token, regardless of distance.

  3. Scalability: Transformer performance improves predictably with more data, more parameters, and more compute. This "scaling law" property drove the development of ever-larger models, from GPT-2 (1.5B) to modern models with hundreds of billions of parameters.


Key Terms

  • Transformer: A neural network architecture based on self-attention, processing sequences in parallel rather than sequentially.
  • Self-Attention: The mechanism that computes relevance scores between all pairs of tokens in a sequence.
  • Multi-Head Attention: Running multiple attention computations in parallel, each learning different relationship patterns.
  • Causal Masking: Restricting attention so each token can only see preceding tokens, enabling autoregressive generation.
  • Decoder-Only: A transformer variant with only decoder layers, used by most modern LLMs for text generation.
  • Encoder-Decoder: A transformer variant with separate encoder and decoder stacks, used for tasks like translation and speech-to-text.
  • Feed-Forward Network (FFN): The per-token nonlinear transformation applied after attention in each layer.
  • Residual Connection: Adding a layer's input directly to its output, preventing information loss in deep networks.
  • RoPE (Rotary Position Embedding): A positional encoding method that encodes relative token positions through rotation in embedding space.
  • Autoregressive Generation: Producing output one token at a time, each conditioned on all previously generated tokens.

Related LM-Kit.NET APIs

  • LM: Model class exposing Architecture, LayerCount, ParameterCount, RopeAlgorithm
  • ModelCard: Static catalog with architecture metadata and capabilities
  • ModelCapabilities: Flags for what a model can do (Chat, Vision, Reasoning, etc.)



Summary

The Transformer is the foundational architecture behind every model in the LM-Kit.NET catalog. By replacing sequential recurrence with parallel self-attention, transformers enabled the training of models with billions of parameters that capture long-range dependencies and scale predictably with compute. Each transformer layer combines multi-head attention (learning token relationships), feed-forward networks (storing knowledge in weights), and normalization with residual connections. Understanding this architecture illuminates why context windows have limits, why KV-cache accelerates generation, why quantization compresses models effectively, and why distributed inference splits layers across GPUs. The LM.Architecture and ModelCard.Architecture properties in LM-Kit.NET expose the specific transformer variant for each loaded model.
