Model Architecture
The overall structure of transformer-based language models.
Overview
LLMs are built by stacking the same transformer block many times, so the architecture reduces to a small set of repeated components.
Core Components
Transformer Block
Attention Layer
- Multi-head self-attention
- Output projection
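To make the attention layer concrete, here is a minimal PyTorch sketch of multi-head self-attention followed by an output projection. The class and parameter names (MultiHeadSelfAttention, d_model, n_heads) are illustrative, not from the text, and the causal mask used in decoder-only LLMs is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention with an output projection."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        # Fused projection producing queries, keys, and values.
        self.qkv_proj = nn.Linear(d_model, 3 * d_model)
        # Output projection back to the model dimension.
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, d_model = x.shape
        qkv = self.qkv_proj(x)                      # (batch, seq, 3*d_model)
        q, k, v = qkv.chunk(3, dim=-1)

        def split_heads(t):
            # (batch, seq, d_model) -> (batch, heads, seq, d_head)
            return t.view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = map(split_heads, (q, k, v))
        # Scaled dot-product attention per head (causal mask omitted).
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = F.softmax(scores, dim=-1)
        out = weights @ v                           # (batch, heads, seq, d_head)
        # Recombine heads and apply the output projection.
        out = out.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out_proj(out)
```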
Feed-Forward Network (FFN)
- Two linear transformations
- GELU activation
- Expanded hidden dimension (typically 4x embedding size)
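A minimal sketch of the position-wise FFN under the same assumptions (PyTorch, illustrative names): two linear layers with a GELU in between, expanding to 4x the embedding size and projecting back.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: expand to 4x the model dimension, apply GELU, project back."""
    def __init__(self, d_model: int, expansion: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, expansion * d_model),  # first linear: expand
            nn.GELU(),                                # nonlinearity
            nn.Linear(expansion * d_model, d_model),  # second linear: project back
        )

    def forward(self, x):
        return self.net(x)
```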
Layer Normalization
- Pre-norm (normalize before each sublayer; more stable to train at depth) or post-norm (normalize after the residual add, as in the original Transformer)
Residual Connections
- Around both the attention layer and the FFN, so each sublayer computes x + Sublayer(x) (see the sketch below)
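Putting the pieces together, a pre-norm block applies layer normalization before each sublayer and wraps a residual connection around both attention and the FFN. This sketch reuses the illustrative MultiHeadSelfAttention and FeedForward modules defined above.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block: pre-norm variant, residual connections around attention and FFN."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadSelfAttention(d_model, n_heads)  # defined above
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model)                       # defined above

    def forward(self, x):
        # Pre-norm: normalize before each sublayer, then add the residual.
        x = x + self.attn(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        # Post-norm would instead compute: x = norm(x + sublayer(x)).
        return x
```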
Stacking
- Number of layers (N): typically 12–96+, e.g., 12 for GPT-2 small, 32 for LLaMA-7B, 96 for GPT-3 175B
- Each block has the same structure; weights are normally distinct per layer, though some architectures (e.g., ALBERT) share weights across layers
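A full stack is then just N structurally identical blocks applied in sequence. This sketch keeps separate weights per layer (the common case) and adds a final LayerNorm, which pre-norm models typically include; as above, the names are illustrative.

```python
import torch.nn as nn

class TransformerStack(nn.Module):
    """N structurally identical blocks applied in sequence (separate weights per layer)."""
    def __init__(self, d_model: int, n_heads: int, n_layers: int):
        super().__init__()
        self.blocks = nn.ModuleList(
            [TransformerBlock(d_model, n_heads) for _ in range(n_layers)]
        )
        # Final LayerNorm, commonly used in pre-norm models.
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return self.final_norm(x)
```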