Positional Encoding

Methods for incorporating position information into token embeddings.

Overview

Since self-attention is permutation-invariant, positional encodings supply the model with information about token order.

Key Methods

Sinusoidal Positional Encoding

Used in the original Transformer:

  • Encodes absolute positions
  • Fixed (non-learned) sine/cosine functions of position

$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right) $$

$$ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right) $$

where $pos$ is the token position, $i$ indexes dimension pairs, and $d$ is the embedding dimension.
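A minimal sketch of the formulas above in NumPy; the function and argument names are illustrative (not from a specific library), and an even embedding dimension is assumed.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d: int) -> np.ndarray:
    """Return a (seq_len, d) matrix of sinusoidal positional encodings."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d, 2)[None, :]                 # (1, d/2): the 2i values
    angle_rates = positions / (10000 ** (dims / d))    # pos / 10000^(2i/d)

    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angle_rates)                  # even dimensions: sine
    pe[:, 1::2] = np.cos(angle_rates)                  # odd dimensions: cosine
    return pe

# The encoding is added to the token embeddings before the first layer:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d)
```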

Rotary Position Embedding (RoPE)

  • Encodes relative positions by rotating query/key vector pairs by position-dependent angles (see the sketch after this list)
  • Used in LLaMA, GPT-NeoX
  • Generalizes better to sequences longer than those seen during training
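A rough sketch of the rotation idea, assuming an even head dimension; this is an illustrative NumPy helper, not the LLaMA or GPT-NeoX implementation.

```python
import numpy as np

def rope_rotate(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Apply a rotary position embedding to x of shape (seq_len, d), d even.

    Each dimension pair (2i, 2i+1) is rotated by the angle pos * theta_i,
    with theta_i = base^(-2i/d). The dot product of two rotated vectors then
    depends only on their relative position, not their absolute positions.
    """
    seq_len, d = x.shape
    theta = base ** (-np.arange(0, d, 2) / d)           # (d/2,) rotation frequencies
    angles = positions[:, None] * theta[None, :]        # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)

    x_even, x_odd = x[:, 0::2], x[:, 1::2]              # split into dimension pairs
    rotated = np.empty_like(x)
    rotated[:, 0::2] = x_even * cos - x_odd * sin       # 2D rotation of each pair
    rotated[:, 1::2] = x_even * sin + x_odd * cos
    return rotated

# Queries and keys are rotated before the attention dot product:
# scores = rope_rotate(q, pos) @ rope_rotate(k, pos).T
```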

ALiBi (Attention with Linear Biases)

  • Simple and effective
  • No positional embeddings added to the token embeddings
  • Head-specific linear bias on attention scores, proportional to the query-key distance (see the sketch below)
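A simplified sketch of the bias term: each head subtracts a penalty from the raw attention scores that grows linearly with how far a key is behind the query. The geometric slope schedule below follows the one commonly described for ALiBi; the function name and shapes are illustrative assumptions.

```python
import numpy as np

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """Return a (num_heads, seq_len, seq_len) additive bias for attention scores.

    For head h, the bias at (query i, key j) is -slope_h * (i - j), so keys
    farther in the past receive a larger linear penalty. No positional
    information is added to the token embeddings themselves.
    """
    # Head-specific slopes: a geometric sequence, e.g. 1/2, 1/4, ..., 1/256 for 8 heads.
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)

    # distance[i, j] = i - j: how many positions key j is behind query i.
    positions = np.arange(seq_len)
    distance = positions[:, None] - positions[None, :]          # (seq_len, seq_len)

    return -slopes[:, None, None] * distance[None, :, :]        # (heads, L, L)

# Added to the pre-softmax attention scores (with the usual causal mask on top):
# scores = q @ k.T / np.sqrt(d_head) + alibi_bias(seq_len, num_heads)[h]
```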