🧠 Attention Mechanism


📄 TL;DR

An attention mechanism enables a model to assign dynamic, content-based weights over its inputs, such as words in a sentence or regions of an image, so it can focus on the most relevant information when generating each output. It is the foundational operation inside transformers that supports long-range dependencies, parallel computation, and interpretability.


📚 What Is Attention?

Definition: Attention computes, for each output position, a weighted sum of input representations. Each input element is scored for relevance, those scores are normalized into weights, and the values are aggregated into a context vector that informs the model's next prediction.

The Three Steps of Attention

  1. Scoring: Compute similarity between a query vector and each key vector.
  2. Normalization: Apply a softmax to the scores to obtain attention weights.
  3. Aggregation: Combine the corresponding value vectors using these weights.

Mathematically, for queries Q, keys K, and values V:

\[ \text{Attention}(Q,K,V) = \text{softmax}\!\bigl(\tfrac{QK^\top}{\sqrt{d_k}}\bigr)\,V \]

where \(d_k\) is the dimension of the keys (used for scaling).
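The three steps above can be sketched numerically. The following is a minimal NumPy illustration of the formula (shapes and random values are invented for the example; this is not LM-Kit.NET code):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # 1. Scoring (scaled)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # 2. Normalization (softmax)
    return weights @ V                               # 3. Aggregation

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 queries, d_k = 4
K = rng.normal(size=(5, 4))   # 5 keys, same dimension as queries
V = rng.normal(size=(5, 8))   # 5 values, d_v = 8
out = attention(Q, K, V)
print(out.shape)  # (3, 8): one context vector per query
```

Note that the number of queries and the number of keys/values need not match; each query simply produces one context vector over all key/value pairs.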


📜 Historical Note

The core transformer architecture, built entirely around self-attention and point-wise feed-forward layers, was introduced in the paper "Attention Is All You Need" (NeurIPS 2017). It demonstrated that pure attention, without recurrence or convolutions, could achieve state-of-the-art performance on sequence tasks.


🔑 Core Components

  • Linear Projections: Inputs \(X \in \mathbb{R}^{n \times d}\) are mapped to queries, keys, and values via learned matrices:

    \[ Q = XW_Q,\quad K = XW_K,\quad V = XW_V \]
  • Scaled Dot-Product: The dot products \(QK^\top\) are divided by \(\sqrt{d_k}\) to keep gradient magnitudes stable.

  • Multi-Head Attention: Splits the projections into \(h\) subspaces (heads), runs attention in parallel, and concatenates the results. This lets the model capture different types of relationships simultaneously.

  • Residual Connections & Layer Normalization: Each attention block is wrapped with a residual connection and followed by layer normalization, aiding training stability and gradient flow.

  • Masking: An additive mask can be applied to the attention logits to prevent certain positions from attending (e.g., causal masks block "future" tokens in autoregressive decoding).
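As a toy sketch of how these components fit together, the following NumPy code (parameter shapes and names invented for illustration, residual connections and layer normalization omitted for brevity) projects inputs into \(h\) heads, runs scaled dot-product attention per head, and concatenates the results:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Toy multi-head self-attention over X (n x d), split into h heads."""
    n, d = X.shape
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V               # linear projections
    d_head = d // h
    # Reshape (n, d) -> (h, n, d_head): one subspace per head
    split = lambda M: M.reshape(n, h, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # scaled dot-product
    heads = softmax(scores) @ Vh                      # (h, n, d_head), in parallel
    concat = heads.transpose(1, 0, 2).reshape(n, d)   # concatenate heads
    return concat @ W_O                               # final output projection

rng = np.random.default_rng(1)
n, d, h = 6, 16, 4
X = rng.normal(size=(n, d))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d, d)) for _ in range(4))
Y = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(Y.shape)  # (6, 16): same shape as the input sequence
```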


📖 Common Terms

  • Query (Q): Representation of the item seeking relevant information.
  • Key (K): Representation of each potential source of information.
  • Value (V): Representation carrying the content to be aggregated.
  • Head: One parallel attention computation; multiple heads learn diverse patterns.
  • Context Vector: Weighted sum of values for a given query, summarizing relevant inputs.
  • Attention Map: Matrix of normalized weights showing how much each input influences each output.

🛠️ Variants & Extensions

  • Causal (Autoregressive) Attention: Masks out future positions so each token only attends to past and current tokens.

  • Self-Attention vs. Cross-Attention:

    • Self-Attention: Q, K, and V all come from the same sequence.
    • Cross-Attention: Q from one sequence attends to K and V from another (e.g., decoder attending to encoder outputs).
  • Sparse & Efficient Attention: Techniques like sliding-window, locality-sensitive hashing, or kernel approximations that reduce the quadratic cost to near-linear.

  • Relative & Rotary Positional Encodings: Methods to incorporate order information directly into the attention mechanism without fixed positional vectors.
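A causal mask, for instance, can be realized as the additive mask described earlier: positions above the diagonal receive \(-\infty\) before the softmax, so each token attends only to itself and earlier tokens. A toy NumPy sketch (values invented; not LM-Kit.NET API):

```python
import numpy as np

n, d_k = 4, 8
rng = np.random.default_rng(2)
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))

scores = Q @ K.T / np.sqrt(d_k)
# Additive causal mask: -inf above the diagonal blocks "future" positions
mask = np.triu(np.full((n, n), -np.inf), k=1)
masked = scores + mask
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.allclose(np.triu(weights, k=1), 0))  # True: no weight on future tokens
```

Because \(e^{-\infty} = 0\), the masked positions contribute nothing after the softmax, while each row still sums to 1 over the allowed positions.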


🛠️ Attention in LM-Kit.NET

In LM-Kit.NET, attention mechanisms power all transformer-based models. Key features related to attention include:

  • KV-Cache Management: LM-Kit.NET caches key-value pairs to accelerate generation by avoiding recomputation of attention for previous tokens
  • Context Overflow Policies: Configure how the model handles context overflow, including KV-cache shifting strategies
  • Sliding Window Attention: Supported for models like Mistral that use local attention windows

// Configure context and attention-related settings
var chat = new MultiTurnConversation(model);
chat.InferencePolicies.ContextOverflowPolicy = ContextOverflowPolicy.KVCacheShifting;



📝 Summary

Attention is the process of dynamically focusing on relevant parts of the input by computing weighted combinations of learned representations. This mechanism powers the parallelizable, long-range capabilities of transformer models and underlies many modern advances in language, vision, and multimodal AI.