Positional Encoding
Methods for incorporating position information into Transformer models, either through the token embeddings or through the attention scores.
Overview
Self-attention is permutation-invariant, so without positional encodings the model cannot distinguish token order; these methods inject that ordering information.
Key Methods
Sinusoidal Positional Encoding
Used in the original Transformer:
- Encodes absolute positions
- Fixed (non-learned) sine and cosine functions added to the token embeddings
$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right) $$
$$ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right) $$
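A minimal NumPy sketch of these formulas (the function name and the even-`d_model` assumption are illustrative, not from the paper); each row of the returned matrix is added to the corresponding token embedding:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    assert d_model % 2 == 0, "assumes an even model dimension"
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angles = positions / (10000 ** (dims / d_model))  # pos / 10000^(2i/d)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                      # PE(pos, 2i+1)
    return pe

# Example: encodings for a 4-token sequence with model width 8
print(sinusoidal_positional_encoding(4, 8).round(3))
```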
Rotary Position Embedding (RoPE)
- Encodes relative positions by rotating query and key vectors
- Used in LLaMA, GPT-NeoX
- Often generalizes better to sequences longer than those seen in training (see the sketch below)
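A minimal sketch of the rotation, assuming the interleaved pairing of dimensions (some codebases, e.g. LLaMA's, split the head dimension in half instead); `apply_rope` is an illustrative name, not a library API:

```python
import numpy as np

def apply_rope(x: np.ndarray, positions: np.ndarray) -> np.ndarray:
    """Rotate pairs of dimensions of x (seq_len, d) by position-dependent angles."""
    d = x.shape[-1]
    assert d % 2 == 0, "assumes an even head dimension"
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))  # per-pair rotation frequencies
    angles = positions[:, None] * freqs[None, :]       # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    rotated = np.empty_like(x)
    rotated[:, 0::2] = x_even * cos - x_odd * sin      # 2-D rotation of each pair
    rotated[:, 1::2] = x_even * sin + x_odd * cos
    return rotated

# Applied to queries and keys before the dot product
q, k = np.random.randn(5, 8), np.random.randn(5, 8)
pos = np.arange(5)
q_rot, k_rot = apply_rope(q, pos), apply_rope(k, pos)
```

Because queries and keys are each rotated by their own position angles, their dot product depends only on the difference between positions, which is what makes the encoding relative.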
ALiBi (Attention with Linear Biases)
- Simple and effective
- No positional embeddings at all; position enters only through the attention scores
- Adds a penalty to attention scores that grows linearly with query-key distance (see the sketch below)
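A minimal sketch of the ALiBi bias matrix, assuming the head-specific geometric slopes 2^{-8h/n} for n heads described in the paper; `alibi_bias` is an illustrative name:

```python
import numpy as np

def alibi_bias(num_heads: int, seq_len: int) -> np.ndarray:
    """Return a (num_heads, seq_len, seq_len) tensor of ALiBi attention biases."""
    # Head-specific slopes: 2^(-8/n), 2^(-16/n), ..., 2^(-8)
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = np.arange(seq_len)
    distance = pos[None, :] - pos[:, None]  # key index minus query index (<= 0 for the past)
    return slopes[:, None, None] * distance[None, :, :]  # added to attention logits
```

The bias is added to the attention logits before the softmax, so more distant keys are penalized linearly; future positions are still removed by the usual causal mask.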