Tokenization
How text is converted into tokens that LLMs process.
Overview
Tokenization is the first step in the LLM pipeline: it converts raw text into a sequence of discrete tokens, each mapped to an integer ID in a fixed vocabulary.
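As a concrete illustration, here is roughly what this step looks like with the tiktoken library; the encoding name (cl100k_base, used by GPT-4) and the sample text are example choices, not part of the pipeline itself.

```python
# Minimal illustration with tiktoken (pip install tiktoken).
# The encoding name and sample text are example choices.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization converts raw text into discrete tokens."
token_ids = enc.encode(text)                    # text -> integer token IDs
pieces = [enc.decode([t]) for t in token_ids]   # surface form of each token

print(token_ids)  # IDs indexing into a fixed vocabulary
print(pieces)     # the substrings the text was split into
```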
Key Concepts
Tokenization Strategies
Byte-Pair Encoding (BPE)
- Most common approach (GPT, LLaMA)
- Builds its vocabulary by iteratively merging the most frequent adjacent symbol pairs (see the training sketch below)
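A toy sketch of the BPE training loop, loosely following Sennrich et al. (2016). The corpus and the number of merges are illustrative assumptions; real implementations work at the byte level with many optimizations.

```python
# Toy BPE training sketch. Corpus and merge count are illustrative.
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word split into characters.
words = {tuple("lower"): 5, tuple("lowest"): 2,
         tuple("newer"): 6, tuple("wider"): 3}

merges = []
for _ in range(10):  # number of merges ~ target vocabulary size
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(words, best)

print(merges)  # learned merge rules, replayed in order at encoding time
```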
WordPiece
- Used by BERT
- Similar to BPE, but applies its vocabulary with a greedy longest-match-first rule at encoding time (see the sketch below)
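A sketch of that greedy longest-match-first encoding step, using BERT's "##" prefix convention for word-internal pieces. The tiny vocabulary here is an illustrative assumption.

```python
# WordPiece-style encoding sketch: greedy longest-match-first.
# The vocabulary below is an illustrative assumption.
def wordpiece_encode(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:                  # try the longest substring first
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece        # continuation prefix inside a word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]                    # no piece matched: word is unknown
        pieces.append(match)
        start = end
    return pieces

vocab = {"un", "aff", "##aff", "##able"}
print(wordpiece_encode("unaffable", vocab))  # ['un', '##aff', '##able']
```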
SentencePiece
- Language-agnostic
- Handles raw text without pre-tokenization, encoding whitespace as an ordinary symbol (▁) (see the usage sketch below)
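A sketch of training and loading a SentencePiece model with its Python bindings (pip install sentencepiece). The file paths, model prefix, and vocabulary size are hypothetical placeholders.

```python
# SentencePiece usage sketch; paths, prefix, and vocab size are placeholders.
import sentencepiece as spm

# Train directly on raw text -- no whitespace pre-tokenization needed;
# whitespace is carried as the meta symbol "▁" (U+2581).
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # hypothetical raw-text training file
    model_prefix="tokenizer",  # writes tokenizer.model and tokenizer.vocab
    vocab_size=8000,           # illustrative size; see the next section
    model_type="bpe",          # SentencePiece also supports "unigram"
)

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode("Hello world", out_type=str))  # e.g. ['▁Hello', '▁world']
```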
Vocabulary Size
- Typical sizes: 32K, 64K, 100K tokens
- Trade-off between compression and expressivity: a larger vocabulary encodes the same text in fewer tokens, but enlarges the embedding and output layers (measured in the sketch below)
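One way to see the compression side of the trade-off is to tokenize the same text with encodings of different vocabulary sizes. The sketch below uses two tiktoken encodings (roughly 50K and 100K entries) as stand-ins; the comparison is confounded by their different training data, so treat it as illustrative only.

```python
# Compare token counts for the same text under two vocabulary sizes.
# The sample text is an illustrative assumption.
import tiktoken

text = "Tokenization quality varies with vocabulary size. " * 20

for name in ("gpt2", "cl100k_base"):  # ~50K vs ~100K vocabulary
    enc = tiktoken.get_encoding(name)
    n = len(enc.encode(text))
    print(f"{name}: {n} tokens, {len(text) / n:.2f} chars/token")
```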