Attention Mechanisms

The core mechanism enabling transformer-based LLMs to process sequential data.

Overview

Attention allows models to weigh the importance of different tokens when processing each token.

Key Concepts

Scaled Dot-Product Attention

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

Here $Q$, $K$, and $V$ are the query, key, and value matrices, and $d_k$ is the key dimension; dividing by $\sqrt{d_k}$ keeps the dot products from growing with dimension and saturating the softmax.
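
A minimal NumPy sketch of the formula above for a single (unbatched) sequence; the function names are illustrative, not any particular library's API:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (seq_q, d_k), K: (seq_k, d_k), V: (seq_k, d_v) -> (seq_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # query-key similarity, scaled
    if mask is not None:
        # Positions where mask is False get a large negative score,
        # so the softmax assigns them (near-)zero weight.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # weighted sum of value vectors

# Toy usage: 4 tokens with d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```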

Multi-Head Attention

  • Multiple attention heads run in parallel, each on its own projection of the input
  • Each head can learn a different kind of relationship between tokens
  • Head outputs are concatenated and projected back to the model dimension (see the sketch after this list)
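
A NumPy sketch of this pattern, assuming self-attention over one sequence with learned projection matrices `W_q`, `W_k`, `W_v`, `W_o` (illustrative names and shapes, not a specific library's API):

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """x: (seq, d_model); each W_*: (d_model, d_model); d_model % num_heads == 0."""
    seq, d_model = x.shape
    d_head = d_model // num_heads

    # Project the inputs, then split the feature dimension into separate heads.
    def project(W):
        return (x @ W).reshape(seq, num_heads, d_head).transpose(1, 0, 2)  # (heads, seq, d_head)

    Q, K, V = project(W_q), project(W_k), project(W_v)

    # Scaled dot-product attention, computed independently per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    heads = weights @ V                                    # (heads, seq, d_head)

    # Concatenate head outputs and apply the final output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ W_o                                    # (seq, d_model)

# Toy usage: 5 tokens, d_model = 16, 4 heads of size 4
rng = np.random.default_rng(1)
x = rng.normal(size=(5, 16))
W_q, W_k, W_v, W_o = (0.1 * rng.normal(size=(16, 16)) for _ in range(4))
print(multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads=4).shape)  # (5, 16)
```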

Variations

Causal/Autoregressive Attention

  • Masks out future positions so each token attends only to itself and earlier tokens
  • Used in GPT-style decoder models (see the mask sketch after this list)
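
As a concrete illustration, the causal mask can be built as a lower-triangular boolean matrix and passed to the scaled dot-product sketch above:

```python
import numpy as np

seq_len = 5
# Lower-triangular boolean matrix: position i may attend to positions 0..i only.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(causal_mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
# Passing this as `mask` to the scaled_dot_product_attention sketch above
# (with matching sequence length) removes attention to future tokens, so each
# position's output depends only on the current and earlier tokens.
```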

Bidirectional Attention

  • Every token attends to the full sequence, using both left and right context
  • Used in BERT-style encoder models (see the padding-mask sketch after this list)
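
For contrast with the causal case, a minimal sketch of the masking typically used with bidirectional attention, assuming right-padded batches (only padding is masked out):

```python
import numpy as np

# No causal mask: every token may attend to every other token. In practice only
# padding positions are masked, e.g. for a right-padded batch with real lengths
# [3, 5] and a padded length of 5:
lengths = np.array([3, 5])
seq_len = 5
padding_mask = np.arange(seq_len) < lengths[:, None]   # (batch, seq), True = real token
# Broadcast to (batch, seq_q, seq_k): queries may attend to any non-padding key.
attn_mask = padding_mask[:, None, :] & padding_mask[:, :, None]
print(attn_mask.shape)  # (2, 5, 5)
```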