Attention Mechanism
TL;DR
An attention mechanism enables a model to assign dynamic, content-based weights over its inputs, such as words in a sentence or regions of an image, so it can focus on the most relevant information when generating each output. It is the foundational operation inside transformers that supports long-range dependencies, parallel computation, and interpretability.
What Is Attention?
Definition: Attention computes, for each output position, a weighted sum of input representations. Each input element is scored for relevance, those scores are normalized into weights, and the values are aggregated into a context vector that informs the model's next prediction.
The Three Steps of Attention
- Scoring: Compute similarity between a query vector and each key vector.
- Normalization: Apply a softmax to the scores to obtain attention weights.
- Aggregation: Combine the corresponding value vectors using these weights.
Mathematically, for queries \(Q\), keys \(K\), and values \(V\):
\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \]
where \(d_k\) is the dimension of the keys (used for scaling).
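These three steps can be traced in a few dozen lines over plain arrays. The snippet below is a minimal, illustrative sketch only; the `ScaledDotProductAttention` helper, its parameter names, and its shapes are assumptions made for this entry, not part of any library API.

```csharp
using System;
using System.Linq;

// Minimal sketch of scaled dot-product attention over plain 2-D arrays.
// Shapes: Q is [n, dk], K is [m, dk], V is [m, dv]; the result is [n, dv].
static double[,] ScaledDotProductAttention(double[,] Q, double[,] K, double[,] V)
{
    int n = Q.GetLength(0), m = K.GetLength(0);
    int dk = Q.GetLength(1), dv = V.GetLength(1);
    var output = new double[n, dv];

    for (int i = 0; i < n; i++)
    {
        // 1. Scoring: dot product of query i with every key, scaled by sqrt(dk).
        var scores = new double[m];
        for (int j = 0; j < m; j++)
        {
            double dot = 0;
            for (int k = 0; k < dk; k++) dot += Q[i, k] * K[j, k];
            scores[j] = dot / Math.Sqrt(dk);
        }

        // 2. Normalization: softmax turns scores into attention weights.
        double max = scores.Max(); // subtract the max for numerical stability
        var weights = scores.Select(s => Math.Exp(s - max)).ToArray();
        double sum = weights.Sum();

        // 3. Aggregation: the context vector is the weighted sum of the values.
        for (int j = 0; j < m; j++)
            for (int c = 0; c < dv; c++)
                output[i, c] += (weights[j] / sum) * V[j, c];
    }
    return output;
}
```

Production implementations batch these loops as matrix multiplications (often fused on the GPU), but the scoring, normalization, and aggregation steps are the same.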
Historical Note
The core transformer architecture, built entirely around self-attention and point-wise feed-forward layers, was introduced in the paper "Attention Is All You Need" (NeurIPS 2017). It demonstrated that pure attention, without recurrence or convolutions, could achieve state-of-the-art performance on sequence tasks.
Core Components
Linear Projections: Inputs \(X \in \mathbb{R}^{n \times d}\) are mapped to queries, keys, and values via learned matrices:
\[ Q = XW_Q,\quad K = XW_K,\quad V = XW_V \]
Scaled Dot-Product: The dot products \(QK^\top\) are divided by \(\sqrt{d_k}\) to keep gradient magnitudes stable.
Multi-Head Attention: Splits the projections into \(h\) subspaces (heads), runs attention in parallel, and concatenates the results. This lets the model capture different types of relationships simultaneously; a minimal sketch appears below.
Residual Connections & Layer Normalization: Each attention block is wrapped with a residual connection and followed by layer normalization, aiding training stability and gradient flow.
Masking: An additive mask can be applied to the attention logits to prevent certain positions from attending (e.g., causal masks block "future" tokens in autoregressive decoding).
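Building on the earlier sketch, multi-head attention can be illustrated by slicing the projected matrices into head subspaces, running the same scaled dot-product routine in each, and concatenating the results. The `MultiHeadAttention` and `Slice` helpers below are illustrative names only, and the output projection \(W_O\) that real transformer blocks apply after concatenation is omitted for brevity.

```csharp
// Illustrative multi-head attention: split the model dimension d into h equal
// head subspaces, run attention independently in each, and concatenate.
// Assumes Q, K, V are already projected to shape [n, d] and d is divisible by h.
static double[,] MultiHeadAttention(double[,] Q, double[,] K, double[,] V, int h)
{
    int n = Q.GetLength(0), d = Q.GetLength(1), dHead = d / h;
    var output = new double[n, d];

    for (int head = 0; head < h; head++)
    {
        // Slice out this head's column block from Q, K, and V.
        var q = Slice(Q, head * dHead, dHead);
        var k = Slice(K, head * dHead, dHead);
        var v = Slice(V, head * dHead, dHead);

        // Attention runs independently in the head's subspace
        // (ScaledDotProductAttention is the sketch from earlier).
        var headOut = ScaledDotProductAttention(q, k, v);

        // Concatenation: write the head output back into its column block.
        for (int i = 0; i < n; i++)
            for (int j = 0; j < dHead; j++)
                output[i, head * dHead + j] = headOut[i, j];
    }
    return output; // a real block would also apply an output projection W_O
}

static double[,] Slice(double[,] M, int start, int width)
{
    int rows = M.GetLength(0);
    var result = new double[rows, width];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < width; j++)
            result[i, j] = M[i, start + j];
    return result;
}
```

Because each head works in a lower-dimensional subspace of size \(d/h\), the total cost is comparable to a single full-width attention while letting heads specialize in different relationships.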
Common Terms
| Term | Description |
|---|---|
| Query (Q) | Representation of the item seeking relevant information. |
| Key (K) | Representation of each potential source of information. |
| Value (V) | Representation carrying the content to be aggregated. |
| Head | One parallel attention computation; multiple heads learn diverse patterns. |
| Context Vector | Weighted sum of values for a given query, summarizing relevant inputs. |
| Attention Map | Matrix of normalized weights showing how much each input influences each output. |
Variants & Extensions
Causal (Autoregressive) Attention: Masks out future positions so each token attends only to past and current tokens (an additive-mask sketch appears below).
Self-Attention vs. Cross-Attention:
- Self-Attention: Q, K, and V all come from the same sequence.
- Cross-Attention: Q from one sequence attends to K and V from another (e.g., decoder attending to encoder outputs).
Sparse & Efficient Attention: Techniques like sliding-window, locality-sensitive hashing, or kernel approximations that reduce the quadratic cost to near-linear.
Relative & Rotary Positional Encodings: Methods to incorporate order information directly into the attention mechanism without fixed positional vectors.
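Several of these variants, notably causal and sliding-window attention, reduce to an additive mask over the attention logits: allowed positions get 0, blocked positions get negative infinity, and the mask is added to \(QK^\top/\sqrt{d_k}\) before the softmax. The helpers below are an illustrative sketch with assumed names, not a library API.

```csharp
// Causal mask: position i may attend only to positions j <= i.
// Blocked entries get negative infinity, so their softmax weight becomes 0.
static double[,] CausalMask(int n)
{
    var mask = new double[n, n];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            mask[i, j] = j <= i ? 0.0 : double.NegativeInfinity;
    return mask;
}

// Sliding-window mask: position i attends only to the previous w positions,
// reducing the quadratic cost of full attention to roughly O(n * w).
static double[,] SlidingWindowMask(int n, int w)
{
    var mask = new double[n, n];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            mask[i, j] = (j <= i && i - j < w) ? 0.0 : double.NegativeInfinity;
    return mask;
}
```

Combining the causal constraint with a fixed window gives the local-attention pattern used by models such as Mistral, referenced in the LM-Kit.NET section below.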
Attention in LM-Kit.NET
In LM-Kit.NET, attention mechanisms power all transformer-based models. Key features related to attention include:
- KV-Cache Management: LM-Kit.NET caches key-value pairs to accelerate generation by avoiding recomputation of attention for previous tokens
- Context Overflow Policies: Configure how the model handles context overflow, including KV-cache shifting strategies
- Sliding Window Attention: Supported for models like Mistral that use local attention windows
```csharp
// Configure context and attention-related settings
var chat = new MultiTurnConversation(model);
chat.InferencePolicies.ContextOverflowPolicy = ContextOverflowPolicy.KVCacheShifting;
```
Related Glossary Topics
- KV-Cache: Caching attention key-value pairs for efficient inference
- Context Windows: Token limits and attention scope
- Large Language Model (LLM): Models powered by attention mechanisms
- Inference: The process where attention computes outputs
External Resources
- Attention Is All You Need (Vaswani et al., 2017): The foundational Transformer paper
- FlashAttention (Dao et al., 2022): Memory-efficient attention implementations
- RoPE: Rotary Position Embeddings (Su et al., 2021): Modern positional encoding method
Summary
Attention is the process of dynamically focusing on relevant parts of the input by computing weighted combinations of learned representations. This mechanism powers the parallelizable, long-range capabilities of transformer models and underlies many modern advances in language, vision, and multimodal AI.