Understanding the Transformer Architecture
TL;DR
The Transformer is the neural network architecture behind virtually all modern large language models. Introduced in 2017, it replaced recurrent networks by processing entire sequences in parallel through a mechanism called self-attention. All models in the LM-Kit.NET catalog (Gemma, Qwen, Llama, Phi, GPT-OSS, GLM, Whisper) are transformer-based. Understanding the transformer explains why concepts like context windows, KV-cache, tokenization, attention, and quantization work the way they do.
What is the Transformer?
Definition: The Transformer is a deep learning architecture that processes sequential data (text, audio, images) using self-attention instead of recurrence. Unlike RNNs and LSTMs, which read input one element at a time, transformers compute relationships between all elements simultaneously. This parallelism enabled the training of models with billions of parameters, launching the era of large language models.
The Original Design
The 2017 "Attention Is All You Need" paper introduced two configurations:
Encoder-Decoder Transformer          Decoder-Only Transformer
  (used for translation)             (used for text generation)

+--------------------+               +--------------------+
|      Encoder       |               |      Decoder       |
|   (reads input)    |               |  (generates text)  |
|                    |               |                    |
|   Self-Attention   |               |   Masked Self-     |
|   Feed-Forward     |               |   Attention        |
|   Normalization    |               |   Feed-Forward     |
+--------------------+               |   Normalization    |
          |                          +--------------------+
       context
          |
          v
+--------------------+
|      Decoder       |
| (generates output) |
|  Cross-Attention   |
|  Feed-Forward      |
+--------------------+
- Encoder-Decoder: The encoder reads the full input and the decoder generates output while attending to the encoded representation. Used in early translation models (T5, BART).
- Decoder-Only: A single stack of decoder layers generates text autoregressively, one token at a time. This is the architecture used by nearly all modern LLMs, including every model in the LM-Kit.NET catalog.
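The autoregressive loop above can be sketched in a few lines. This is a minimal illustration, not LM-Kit.NET code: `next_token_logits` is a hypothetical stand-in for a decoder-only model's forward pass, and the loop uses greedy decoding for simplicity.

```python
import numpy as np

def next_token_logits(tokens, vocab_size=16):
    """Hypothetical stand-in for a decoder-only model's forward pass:
    returns one score (logit) per vocabulary entry for the next token."""
    rng = np.random.default_rng(sum(tokens))  # deterministic toy scores
    return rng.standard_normal(vocab_size)

def generate(prompt_tokens, n_new, eos_id=0):
    """Greedy autoregressive decoding: one token per forward pass,
    each conditioned on all tokens produced so far."""
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        logits = next_token_logits(tokens)
        next_id = int(np.argmax(logits))   # greedy: pick the highest-scoring token
        tokens.append(next_id)
        if next_id == eos_id:              # stop at end-of-sequence
            break
    return tokens

print(generate([3, 7, 1], n_new=5))
```

Real inference replaces `next_token_logits` with the transformer stack described below, and greedy `argmax` with a sampling strategy.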
Inside a Transformer Layer
Every transformer model is a stack of identical layers (or "blocks"). A typical decoder-only model has 24 to 80+ layers. Each layer contains:
Input tokens (embeddings)
        |
        v
+------------------------------+
| 1. Layer Normalization       |  Stabilize activations
+------------------------------+
        |
        v
+------------------------------+
| 2. Multi-Head Self-Attention |  Learn token relationships
|    (the core mechanism)      |  [see: Attention Mechanism]
+------------------------------+
        |  + residual connection
        v
+------------------------------+
| 3. Layer Normalization       |  Stabilize again
+------------------------------+
        |
        v
+------------------------------+
| 4. Feed-Forward Network      |  Process each token independently
|    (two linear layers        |  through a wider hidden dimension
|     with activation)         |
+------------------------------+
        |  + residual connection
        v
Output (passed to next layer)
1. Self-Attention
The attention mechanism is the defining innovation. For each token, it computes how much to "attend to" every other token in the sequence. This allows the model to capture long-range dependencies ("The cat that sat on the mat was sleeping" connects "was" to "cat" across six tokens).
Self-attention operates through three learned projections:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I carry?"
In decoder-only models, causal masking ensures each token can only attend to tokens before it (not future tokens), preserving the autoregressive property needed for text generation.
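The Q/K/V projections and the causal mask can be shown concretely. The following is a single-head NumPy sketch for illustration only; the weight matrices are random placeholders, and production implementations fuse and batch these operations.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over a (seq_len, d) input.
    Q, K, and V are projections of the *same* sequence (hence self-attention)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                   # pairwise relevance scores
    # Causal mask: token i may only attend to tokens j <= i (no future peeking).
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over allowed positions
    return weights @ v                              # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d = 4, 8
x = rng.standard_normal((seq_len, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

Note that the first token can only attend to itself, so its output is exactly its own value vector; later tokens mix information from everything before them.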
2. Multi-Head Attention
Instead of one attention computation, the model runs multiple attention "heads" in parallel. Each head can learn different relationship patterns (syntactic structure, semantic meaning, positional proximity). The results are concatenated and projected back to the model dimension.
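The split-and-concatenate step can be sketched as a pair of reshapes. This is an illustrative helper, not library code; real implementations also apply a learned output projection after merging the heads.

```python
import numpy as np

def split_heads(x, n_heads):
    """Reshape (seq_len, d_model) into (n_heads, seq_len, d_head) so each
    head attends over its own lower-dimensional slice of the features."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    return x.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

def merge_heads(x):
    """Concatenate per-head outputs back into (seq_len, d_model)."""
    n_heads, seq_len, d_head = x.shape
    return x.transpose(1, 0, 2).reshape(seq_len, n_heads * d_head)

x = np.arange(24, dtype=float).reshape(4, 6)  # seq_len=4, d_model=6
heads = split_heads(x, n_heads=3)             # 3 heads, d_head=2 each
print(heads.shape)                            # (3, 4, 2)
print(np.array_equal(merge_heads(heads), x))  # True: lossless round trip
```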
3. Feed-Forward Network
After attention aggregates information across tokens, a feed-forward network (two linear transformations with a nonlinear activation) processes each token independently. This is where much of the model's "knowledge" is stored in its weights.
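A minimal sketch of the two-linear-layer FFN, using the tanh approximation of GELU as the example activation (actual models vary, e.g. SwiGLU in LLaMA-family architectures); weights here are random placeholders.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: expand each token to a wider hidden dimension,
    apply a nonlinearity, then project back down. No information moves
    between positions in this step -- each token is processed alone."""
    h = x @ W1 + b1
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (h + 0.044715 * h**3)))  # tanh-approx GELU
    return h @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                  # hidden dim is commonly ~4x the model dim
x = rng.standard_normal((4, d_model))  # 4 tokens
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
print(feed_forward(x, W1, b1, W2, b2).shape)  # (4, 8)
```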
4. Residual Connections and Normalization
Residual connections (adding the input of each sub-layer to its output) prevent the vanishing gradient problem in deep stacks. Layer normalization keeps activations stable across layers.
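The residual-plus-normalization pattern can be sketched as a wrapper around any sub-layer. This shows the pre-norm arrangement common in modern decoder-only models (standard LayerNorm here; some architectures use RMSNorm instead), purely as an illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's activation vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer(x, fn):
    """Pre-norm residual pattern: output = x + fn(layer_norm(x)).
    The identity path lets gradients flow straight through deep stacks."""
    return x + fn(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 8))
out = sublayer(x, lambda h: h @ W)  # fn would be attention or the FFN in a real block
print(out.shape)  # (4, 8)
```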
How the Transformer Connects to LM-Kit.NET Concepts
| Transformer Component | LM-Kit.NET Concept | Why It Matters |
|---|---|---|
| Layer count | LM.LayerCount | More layers = more capacity but more memory. Distributed inference splits layers across GPUs. |
| Self-attention | Attention Mechanism | The core operation. Its compute cost scales quadratically with sequence length. |
| KV vectors | KV-Cache | Attention keys and values are cached per token to avoid recomputation during generation. |
| Maximum sequence length | Context Windows | The transformer's positional encoding limits how many tokens it can process. |
| Positional encoding (RoPE) | LM.RopeAlgorithm | How the model encodes token positions. RoPE (Rotary Position Embedding) is the standard in modern LLMs. |
| Weight matrices | Weights, Quantization | Each layer's attention and feed-forward parameters. Quantization compresses these to reduce memory. |
| Vocabulary projection | Logits, Tokenization | The final layer projects hidden states to vocabulary logits for token prediction. |
| Autoregressive generation | Inference | The model generates one token per forward pass, each conditioned on all previous tokens. |
| Model architecture string | LM.Architecture, ModelCard.Architecture | Identifies the transformer variant: "llama", "qwen2", "phi3", "whisper", etc. |
Transformer Variants in the LM-Kit.NET Catalog
| Architecture | Models | Key Innovations |
|---|---|---|
| LLaMA | Llama 3.1, Gemma 3 | RoPE embeddings, RMSNorm, SwiGLU activation |
| Qwen2/3 | Qwen 3, Qwen 2 VL | Grouped-Query Attention, extended context |
| Phi | Phi-4, Phi-4 Mini | Compact architecture with high data quality training |
| GPT-OSS | GPT-OSS 20B | Long context (131k tokens), strong tool-calling |
| GLM | GLM 4.7 Flash | Mixture of Experts (MoE) with efficient routing |
| Whisper | Whisper family | Encoder-decoder transformer for speech-to-text |
| BERT/Nomic | Embedding models | Encoder-only transformer for embeddings |
Why the Transformer Won
The transformer replaced RNNs and LSTMs for three fundamental reasons:
Parallelism: RNNs process tokens sequentially (token 1, then token 2, then token 3...). Transformers process all tokens simultaneously during training, enabling massive GPU parallelism and scaling to billions of parameters.
Long-Range Dependencies: RNNs struggle to connect information across long sequences (the "vanishing gradient" problem). Self-attention directly connects every token to every other token, regardless of distance.
Scalability: Transformer performance improves predictably with more data, more parameters, and more compute. This "scaling law" property drove the development of ever-larger models, from GPT-2 (1.5B) to modern models with hundreds of billions of parameters.
Key Terms
- Transformer: A neural network architecture based on self-attention, processing sequences in parallel rather than sequentially.
- Self-Attention: The mechanism that computes relevance scores between all pairs of tokens in a sequence.
- Multi-Head Attention: Running multiple attention computations in parallel, each learning different relationship patterns.
- Causal Masking: Restricting attention so each token can only see preceding tokens, enabling autoregressive generation.
- Decoder-Only: A transformer variant with only decoder layers, used by most modern LLMs for text generation.
- Encoder-Decoder: A transformer variant with separate encoder and decoder stacks, used for tasks like translation and speech-to-text.
- Feed-Forward Network (FFN): The per-token nonlinear transformation applied after attention in each layer.
- Residual Connection: Adding a layer's input directly to its output, preventing information loss in deep networks.
- RoPE (Rotary Position Embedding): A positional encoding method that encodes relative token positions through rotation in embedding space.
- Autoregressive Generation: Producing output one token at a time, each conditioned on all previously generated tokens.
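The RoPE entry above has a neat property worth seeing concretely: because positions are encoded as rotations, the dot product between a rotated query and key depends only on their *relative* distance. The sketch below uses a simplified half-split pairing for illustration; real implementations differ in how feature pairs are grouped.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Simplified Rotary Position Embedding: rotate feature pairs
    (x[i], x[i + d/2]) by a position-dependent angle pos * freq[i]."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)  # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
# Same relative offset (4 positions) => same attention score,
# regardless of where the pair sits in the sequence.
print(np.isclose(rope(q, 3) @ rope(k, 7), rope(q, 103) @ rope(k, 107)))  # True
```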
Related API Documentation
- LM: Model class exposing Architecture, LayerCount, ParameterCount, RopeAlgorithm
- ModelCard: Static catalog with architecture metadata and capabilities
- ModelCapabilities: Flags for what a model can do (Chat, Vision, Reasoning, etc.)
Related Glossary Topics
- Attention Mechanism: The core operation inside each transformer layer
- Inference: How transformers generate text, one token at a time
- Context Windows: The maximum sequence length a transformer can process
- KV-Cache: Caching attention keys and values for efficient generation
- Weights: The learned parameters stored in each transformer layer
- Quantization: Compressing transformer weights to reduce memory
- Logits: The output scores produced by the final transformer layer
- Tokenization: Converting text to token IDs that the transformer processes
- Distributed Inference: Splitting transformer layers across multiple GPUs
- Mixture of Experts (MoE): A variant where each layer routes to specialized sub-networks
- Large Language Model (LLM): Transformer-based models with billions of parameters
External Resources
- Attention Is All You Need (Vaswani et al., 2017): The original transformer paper
- The Illustrated Transformer (Alammar, 2018): Visual guide to the transformer architecture
- RoFormer: Enhanced Transformer with Rotary Position Embedding (Su et al., 2021): The RoPE positional encoding used by most LM-Kit.NET models
- Scaling Laws for Neural Language Models (Kaplan et al., 2020): How transformer performance scales with size, data, and compute
Summary
The Transformer is the foundational architecture behind every model in the LM-Kit.NET catalog. By replacing sequential recurrence with parallel self-attention, transformers enabled the training of models with billions of parameters that capture long-range dependencies and scale predictably with compute. Each transformer layer combines multi-head attention (learning token relationships), feed-forward networks (storing knowledge in weights), and normalization with residual connections. Understanding this architecture illuminates why context windows have limits, why KV-cache accelerates generation, why quantization compresses models effectively, and why distributed inference splits layers across GPUs. The LM.Architecture and ModelCard.Architecture properties in LM-Kit.NET expose the specific transformer variant for each loaded model.