
Transformer Architecture

A sequence-to-sequence architecture built entirely on attention mechanisms, replacing recurrence and convolutions with parallelizable self-attention layers.

Tags: Architecture · Attention · NLP · LLM · MLLM

Overview

The Transformer, introduced in Attention Is All You Need (Vaswani et al., 2017), discards recurrence and convolutions in favor of self-attention. Every position attends to every other position in a single pass, enabling massive parallelism during training.

Core Components

Multi-Head Attention computes attention in h separate subspaces and concatenates the results:

\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1,\ldots,\text{head}_h)W^O

where each head is:

\text{head}_i = \text{Attention}(QW_i^Q,\, KW_i^K,\, VW_i^V)
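The two formulas above can be sketched in NumPy. This is a minimal illustration, not a production implementation: the function name and the weight layout (per-head projection tensors of shape `(h, d_model, d_k)` and an output matrix of shape `(h*d_k, d_model)`) are assumptions chosen for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, W_Q, W_K, W_V, W_O):
    """Project into h subspaces, attend in each, concatenate, project out.

    W_Q, W_K, W_V: shape (h, d_model, d_k); W_O: shape (h*d_k, d_model).
    """
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        q, k, v = Q @ Wq, K @ Wk, V @ Wv
        # Scaled dot-product attention inside each head.
        scores = q @ k.swapaxes(-2, -1) / np.sqrt(q.shape[-1])
        heads.append(softmax(scores) @ v)
    # Concat(head_1, ..., head_h) W^O
    return np.concatenate(heads, axis=-1) @ W_O
```

In practice frameworks fuse the per-head loop into one batched matrix multiply, but the loop makes the subspace decomposition explicit.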

Scaled Dot-Product Attention prevents softmax saturation in high dimensions:

\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
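A direct transcription of this formula, as a self-contained sketch (the function name and the optional boolean mask argument are assumptions, not part of the original paper's notation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    if mask is not None:
        # Masked-out positions get a large negative score, so softmax
        # assigns them (effectively) zero weight.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights
```

The 1/sqrt(d_k) factor keeps the dot products at unit variance as d_k grows, which is exactly the softmax-saturation fix described above.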

Position-wise FFN applies two linear layers with a ReLU in between:

\text{FFN}(x) = \max(0,\, xW_1 + b_1)W_2 + b_2
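"Position-wise" means the same two-layer network is applied independently at every sequence position, which a single pair of matrix multiplies achieves via broadcasting. A minimal sketch (function name is ours; in the original paper the inner dimension is d_ff = 2048 for d_model = 512):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied at every position.

    x: (..., d_model); W1: (d_model, d_ff); W2: (d_ff, d_model).
    """
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```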

Positional Encoding injects sequence order via sinusoidal signals added to the token embeddings.
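The sinusoidal scheme from the paper sets PE[pos, 2i] = sin(pos / 10000^(2i/d_model)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)). A sketch assuming an even d_model (the function name is ours):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table of sinusoidal position signals."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]         # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dims: sine
    pe[:, 1::2] = np.cos(angles)   # odd dims: cosine
    return pe
```

Each dimension pair is a sinusoid of a different wavelength, so relative offsets correspond to fixed linear transformations of the encoding.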

Encoder–Decoder Structure

The original Transformer has two stacks:

  • Encoder: N identical layers, each with self-attention + FFN + residual connections.
  • Decoder: N layers with masked self-attention, cross-attention over the encoder output, and an FFN.
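The decoder's "masked" self-attention is implemented by applying a lower-triangular mask to the attention scores before the softmax, so position i never sees positions after i. A minimal sketch (function name is ours):

```python
import numpy as np

def causal_mask(n):
    """Boolean lower-triangular mask: query i may attend to keys j <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

# Positions the mask disallows are pushed to a large negative score,
# so the subsequent softmax gives them (effectively) zero weight.
scores = np.zeros((4, 4))
masked = np.where(causal_mask(4), scores, -1e9)
```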

Modern language models (GPT family) use the decoder only; BERT and its variants use the encoder only.

Scaling Properties

Empirical scaling laws (Kaplan et al., 2020) show that loss decreases smoothly as a power law with model size, data, and compute — the key insight behind the large-scale pretraining paradigm.
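For intuition, the parameter-count term of these laws takes the form L(N) = (N_c / N)^α_N. The constants below are approximate fitted values reported by Kaplan et al. for non-embedding parameters, used here purely as an illustration:

```python
# Power-law loss vs. non-embedding parameter count: L(N) = (N_c / N) ** alpha_N.
# N_c ~ 8.8e13 and alpha_N ~ 0.076 are approximate fitted values from
# Kaplan et al. (2020); treat them as illustrative, not exact.
N_c, alpha_N = 8.8e13, 0.076

def predicted_loss(n_params: float) -> float:
    return (N_c / n_params) ** alpha_N

# Growing the model 10x divides the predicted loss by a constant factor
# of 10 ** alpha_N (about 1.19), regardless of the starting size.
```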

Key Papers

  • Vaswani et al. (2017) — Attention Is All You Need
  • Devlin et al. (2018) — BERT
  • Brown et al. (2020) — GPT-3
  • Kaplan et al. (2020) — Scaling Laws for Neural Language Models