Overview
The Transformer, introduced in Attention Is All You Need (Vaswani et al., 2017), discards recurrence and convolutions in favor of self-attention. Every position attends to every other position in a single pass, enabling massive parallelism during training.
Core Components
Multi-Head Attention computes attention over h separate subspaces and concatenates the results:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O

where each head is:

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

Scaled Dot-Product Attention scales the logits by 1/sqrt(d_k) to prevent softmax saturation in high dimensions:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

Position-wise FFN applies two linear layers with a ReLU in between:

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
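The two attention operations above can be sketched in NumPy. This is a minimal illustration, not reference code: the function names, weight shapes, and random inputs are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the max for numerical stability before exponentiating
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    # project X, split d_model into n_heads subspaces, attend per head,
    # concatenate the heads, and apply the output projection W^O
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    split = lambda M: M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    heads = attention(split(Q), split(K), split(V))        # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, seq_len, n_heads = 64, 10, 8
X = rng.standard_normal((seq_len, d_model))
Ws = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, *Ws, n_heads)
print(out.shape)  # (10, 64)
```

Note that splitting d_model across heads keeps the total cost comparable to one full-width attention, which is why the paper can use h = 8 heads "for free".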
Positional Encoding injects sequence order via sinusoidal signals added to the token embeddings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
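The sinusoidal encoding can be computed in a few lines. A minimal sketch (the function name is mine; it assumes an even d_model):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=32)
print(pe.shape)  # (50, 32)
```

Each dimension pair oscillates at a different wavelength, so nearby positions get similar encodings while distant ones remain distinguishable.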
Encoder–Decoder Structure
The original Transformer has two stacks:
- Encoder: N = 6 identical layers, each with self-attention and an FFN, both wrapped in residual connections and layer normalization.
- Decoder: N = 6 layers, each with masked (causal) self-attention, cross-attention over the encoder output, and an FFN, with the same residual and layer-norm wrapping.
Modern language models (GPT family) use the decoder only; BERT and its variants use the encoder only.
Scaling Properties
Empirical scaling laws (Kaplan et al., 2020) show that loss decreases smoothly as a power law with model size, data, and compute — the key insight behind the large-scale pretraining paradigm.
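The power-law form can be made concrete with a one-line function. The constants below follow the L(N) = (N_c / N)^alpha parameterization in Kaplan et al., but treat the specific values as illustrative assumptions rather than authoritative fits:

```python
def loss_vs_params(N, N_c=8.8e13, alpha=0.076):
    # L(N) = (N_c / N)^alpha — loss falls smoothly as a power law
    # in parameter count N; constants are illustrative assumptions
    return (N_c / N) ** alpha

for n in (1e8, 1e10, 1e12):
    print(f"N = {n:.0e}: L = {loss_vs_params(n):.3f}")
```

The practical takeaway is that there is no visible plateau over many orders of magnitude, which is what justifies ever-larger pretraining runs.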
Key Papers
- Vaswani et al. (2017) — Attention Is All You Need
- Devlin et al. (2018) — BERT
- Brown et al. (2020) — GPT-3
- Kaplan et al. (2020) — Scaling Laws for Neural Language Models