Attention Mechanism
TL;DR
An attention mechanism enables a model to assign dynamic, content-based weights over its inputs, such as words in a sentence or regions of an image, so it can focus on the most relevant information when generating each output. It is the foundational operation inside transformers that supports long-range dependencies, parallel computation, and interpretability.
What Is Attention?
Definition: Attention computes, for each output position, a weighted sum of input representations. Each input element is scored for relevance, those scores are normalized into weights, and the values are aggregated into a context vector that informs the model's next prediction.
The Three Steps of Attention
- Scoring: Compute similarity between a query vector and each key vector.
- Normalization: Apply a softmax to the scores to obtain attention weights.
- Aggregation: Combine the corresponding value vectors using these weights.
Mathematically, for queries \(Q\), keys \(K\), and values \(V\):
\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \]
where \(d_k\) is the dimension of the keys (used for scaling).
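These three steps can be traced in a few dozen lines over plain arrays. The snippet below is a minimal, illustrative sketch only; the `ScaledDotProductAttention` helper, its parameter names, and its shapes are assumptions made for this entry, not part of any library API.

```csharp
using System;
using System.Linq;

// Minimal sketch of scaled dot-product attention over plain 2-D arrays.
// Shapes: Q is [n, dk], K is [m, dk], V is [m, dv]; the result is [n, dv].
static double[,] ScaledDotProductAttention(double[,] Q, double[,] K, double[,] V)
{
    int n = Q.GetLength(0), m = K.GetLength(0);
    int dk = Q.GetLength(1), dv = V.GetLength(1);
    var output = new double[n, dv];

    for (int i = 0; i < n; i++)
    {
        // 1. Scoring: dot product of query i with every key, scaled by sqrt(dk).
        var scores = new double[m];
        for (int j = 0; j < m; j++)
        {
            double dot = 0;
            for (int k = 0; k < dk; k++) dot += Q[i, k] * K[j, k];
            scores[j] = dot / Math.Sqrt(dk);
        }

        // 2. Normalization: softmax turns scores into attention weights.
        double max = scores.Max(); // subtract the max for numerical stability
        var weights = scores.Select(s => Math.Exp(s - max)).ToArray();
        double sum = weights.Sum();

        // 3. Aggregation: the context vector is the weighted sum of the values.
        for (int j = 0; j < m; j++)
            for (int c = 0; c < dv; c++)
                output[i, c] += (weights[j] / sum) * V[j, c];
    }
    return output;
}
```

Production implementations batch these loops as matrix multiplications (often fused on the GPU), but the scoring, normalization, and aggregation steps are the same.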
Historical Note
The core transformer architecture, built entirely around self-attention and point-wise feed-forward layers, was introduced in the paper "Attention Is All You Need" (NeurIPS 2017). It demonstrated that pure attention, without recurrence or convolutions, could achieve state-of-the-art performance on sequence tasks.
Core Components
Linear Projections: Inputs \(X \in \mathbb{R}^{n \times d}\) are mapped to queries, keys, and values via learned matrices:
\[ Q = XW_Q,\quad K = XW_K,\quad V = XW_V \]
Scaled Dot-Product: The dot products \(QK^\top\) are divided by \(\sqrt{d_k}\) to keep gradient magnitudes stable.
Multi-Head Attention: Splits the projections into \(h\) subspaces (heads), runs attention in parallel, and concatenates the results. This lets the model capture different types of relationships simultaneously; a minimal sketch appears below.
Residual Connections & Layer Normalization: Each attention block is wrapped with a residual connection and followed by layer normalization, aiding training stability and gradient flow.
Masking: An additive mask can be applied to the attention logits to prevent certain positions from attending (e.g., causal masks block "future" tokens in autoregressive decoding).
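Building on the earlier sketch, multi-head attention can be illustrated by slicing the projected matrices into head subspaces, running the same scaled dot-product routine in each, and concatenating the results. The `MultiHeadAttention` and `Slice` helpers below are illustrative names only, and the output projection \(W_O\) that real transformer blocks apply after concatenation is omitted for brevity.

```csharp
// Illustrative multi-head attention: split the model dimension d into h equal
// head subspaces, run attention independently in each, and concatenate.
// Assumes Q, K, V are already projected to shape [n, d] and d is divisible by h.
static double[,] MultiHeadAttention(double[,] Q, double[,] K, double[,] V, int h)
{
    int n = Q.GetLength(0), d = Q.GetLength(1), dHead = d / h;
    var output = new double[n, d];

    for (int head = 0; head < h; head++)
    {
        // Slice out this head's column block from Q, K, and V.
        var q = Slice(Q, head * dHead, dHead);
        var k = Slice(K, head * dHead, dHead);
        var v = Slice(V, head * dHead, dHead);

        // Attention runs independently in the head's subspace
        // (ScaledDotProductAttention is the sketch from earlier).
        var headOut = ScaledDotProductAttention(q, k, v);

        // Concatenation: write the head output back into its column block.
        for (int i = 0; i < n; i++)
            for (int j = 0; j < dHead; j++)
                output[i, head * dHead + j] = headOut[i, j];
    }
    return output; // a real block would also apply an output projection W_O
}

static double[,] Slice(double[,] M, int start, int width)
{
    int rows = M.GetLength(0);
    var result = new double[rows, width];
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < width; j++)
            result[i, j] = M[i, start + j];
    return result;
}
```

Because each head works in a lower-dimensional subspace of size \(d/h\), the total cost is comparable to a single full-width attention while letting heads specialize in different relationships.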
Common Terms
| Term | Description |
|---|---|
| Query (Q) | Representation of the item seeking relevant information. |
| Key (K) | Representation of each potential source of information. |
| Value (V) | Representation carrying the content to be aggregated. |
| Head | One parallel attention computation; multiple heads learn diverse patterns. |
| Context Vector | Weighted sum of values for a given query, summarizing relevant inputs. |
| Attention Map | Matrix of normalized weights showing how much each input influences each output. |
Variants & Extensions
Causal (Autoregressive) Attention: Masks out future positions so each token attends only to past and current tokens (an additive-mask sketch appears below).
Self-Attention vs. Cross-Attention:
- Self-Attention: Q, K, and V all come from the same sequence.
- Cross-Attention: Q from one sequence attends to K and V from another (e.g., decoder attending to encoder outputs).
Sparse & Efficient Attention: Techniques like sliding-window, locality-sensitive hashing, or kernel approximations that reduce the quadratic cost to near-linear.
Relative & Rotary Positional Encodings: Methods to incorporate order information directly into the attention mechanism without fixed positional vectors.
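Several of these variants, notably causal and sliding-window attention, reduce to an additive mask over the attention logits: allowed positions get 0, blocked positions get negative infinity, and the mask is added to \(QK^\top/\sqrt{d_k}\) before the softmax. The helpers below are an illustrative sketch with assumed names, not a library API.

```csharp
// Causal mask: position i may attend only to positions j <= i.
// Blocked entries get negative infinity, so their softmax weight becomes 0.
static double[,] CausalMask(int n)
{
    var mask = new double[n, n];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            mask[i, j] = j <= i ? 0.0 : double.NegativeInfinity;
    return mask;
}

// Sliding-window mask: position i attends only to the previous w positions,
// reducing the quadratic cost of full attention to roughly O(n * w).
static double[,] SlidingWindowMask(int n, int w)
{
    var mask = new double[n, n];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            mask[i, j] = (j <= i && i - j < w) ? 0.0 : double.NegativeInfinity;
    return mask;
}
```

Combining the causal constraint with a fixed window gives the local-attention pattern used by models such as Mistral, referenced in the LM-Kit.NET section below.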
Attention in LM-Kit.NET
In LM-Kit.NET, attention mechanisms power all transformer-based models. Key features related to attention include:
- KV-Cache Management: LM-Kit.NET caches key-value pairs to accelerate generation by avoiding recomputation of attention for previous tokens
- Context Overflow Policies: Configure how the model handles context overflow, including KV-cache shifting strategies
- Sliding Window Attention: Supported for models like Mistral that use local attention windows
```csharp
// Configure context and attention-related settings
var chat = new MultiTurnConversation(model);
chat.InferencePolicies.ContextOverflowPolicy = ContextOverflowPolicy.KVCacheShifting;
```
Related Glossary Topics
- KV-Cache: Caching attention key-value pairs for efficient inference
- Context Windows: Token limits and attention scope
- Large Language Model (LLM): Models powered by attention mechanisms
- Inference: The process where attention computes outputs
External Resources
- Attention Is All You Need (Vaswani et al., 2017): The foundational Transformer paper
- FlashAttention (Dao et al., 2022): Memory-efficient attention implementations
- RoPE: Rotary Position Embeddings (Su et al., 2021): Modern positional encoding method
Summary
Attention is the process of dynamically focusing on relevant parts of the input by computing weighted combinations of learned representations. This mechanism powers the parallelizable, long-range capabilities of transformer models and underlies many modern advances in language, vision, and multimodal AI.