Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin · 2017 · arXiv:1706.03762

Replace recurrence with self-attention. Every position attends to every other position in constant depth. The architecture that made scale possible.

[Diagram: scaled dot-product attention, softmax(QKᵀ / √dₖ) · V]

The problem

Recurrent networks process sequences one step at a time. Each step depends on the previous step's output. This sequential dependency makes parallelization across time steps impossible during training and creates a bottleneck: information from early positions must survive through every intermediate step to reach late positions.

Convolutional approaches can parallelize but need many layers to connect distant positions — the receptive field grows linearly or logarithmically with depth.

Self-attention connects every position to every other position in a single layer. Constant path length. Full parallelization. The cost is quadratic in sequence length. That tradeoff proved worth making.

The architecture

Encoder-decoder

Six identical layers in each stack. Every layer has two sub-layers (encoder) or three (decoder), each wrapped in a residual connection and layer normalization.

Encoder layer: (1) multi-head self-attention, (2) position-wise feed-forward network.

Decoder layer: (1) masked multi-head self-attention, (2) multi-head cross-attention over encoder output, (3) position-wise feed-forward network. The mask prevents attending to future positions: the decoder can only see what it has already generated.
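The masking itself is simple: before the softmax, scores for future positions are set to negative infinity, so they receive exactly zero weight. A minimal NumPy sketch (the function name is illustrative, not from the paper's code):

```python
import numpy as np

def causal_mask(seq_len):
    # True above the diagonal marks future positions to be blocked.
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

# Masked scores become -inf before softmax, so exp(-inf) = 0:
# position i can only attend to positions 0..i.
scores = np.zeros((4, 4))
scores[causal_mask(4)] = -np.inf
```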

Scaled dot-product attention

The core operation. Three inputs: queries Q, keys K, values V. Compute similarity between each query and all keys, normalize, use the result to weight values.

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

The 1/sqrt(d_k) scaling prevents dot products from growing large enough to push softmax into near-zero gradient regions. Without it, attention becomes approximately one-hot for large dimensions — the model stops learning.
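The whole operation is a few lines of NumPy. This is a sketch of the formula above, not the paper's implementation; shapes and names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # (seq_q, seq_k) similarities
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ V                              # weighted blend of values
```

Each output row is a convex combination of the value rows: a soft lookup into the key-value store.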

Multi-head attention

[Diagram: multi-head attention, heads 1…8 over projected Q, K, V, concatenated and multiplied by Wᴼ. 8 heads × 64 dims ≈ 1 head × 512 dims, but each head learns a different pattern.]

Instead of one attention function over the full dimension, project Q, K, V into h smaller subspaces, run attention in parallel, concatenate, project back.

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W_O
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)

Eight heads, each with d_k = d_v = 64, total cost comparable to single-head attention at d_model = 512. Each head can learn a different attention pattern: one head tracks syntactic dependencies, another tracks semantic similarity, another tracks positional proximity.
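A loop-based NumPy sketch makes the bookkeeping concrete. The random projection matrices here stand in for learned weights; everything else follows the formula above:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h = 512, 8
d_k = d_model // h  # 64 dims per head

# Stand-ins for learned projections: one (d_model, d_k) matrix per head.
Wq = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
Wk = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
Wv = rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model)
Wo = rng.standard_normal((h * d_k, d_model)) / np.sqrt(d_model)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head(X):
    heads = []
    for i in range(h):
        Q, K, V = X @ Wq[i], X @ Wk[i], X @ Wv[i]  # project into a 64-dim subspace
        A = softmax(Q @ K.T / np.sqrt(d_k))        # attention within the subspace
        heads.append(A @ V)
    return np.concatenate(heads, axis=-1) @ Wo     # concat back to (seq, 512), project
```

Real implementations fuse the loop into one batched matrix multiply, but the per-head structure is the same.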

Position-wise feed-forward network

A two-layer MLP applied identically to each position. The hidden dimension (2048) is four times the model dimension (512). This is where per-position computation happens. Attention handles cross-position communication.

FFN(x) = max(0, x W_1 + b_1) W_2 + b_2
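In code this is two matrix multiplies and a ReLU, applied to every position at once because the same weights are shared across positions (a sketch with illustrative shapes, not the paper's code):

```python
import numpy as np

d_model, d_ff = 512, 2048  # paper's dimensions: hidden layer is 4x wider

def ffn(x, W1, b1, W2, b2):
    # x is (seq, d_model); the same W1, b1, W2, b2 apply at every position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2
```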

Positional encoding

Attention is permutation-equivariant. It does not know position. Position information is injected by adding sinusoidal signals to the input embeddings:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

The key property: PE(pos + k) is a linear function of PE(pos) for any fixed offset k. The model can learn relative-position attention from absolute-position encodings. Sinusoidal encodings also extrapolate to sequence lengths not seen during training.
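Computing the table directly from the formulas (a sketch; the vectorized layout is one common way to write it):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]         # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]     # (1, d_model/2) even indices 2i
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)              # even dims get sin
    pe[:, 1::2] = np.cos(angles)              # odd dims get cos
    return pe
```

Each dimension pair is a sinusoid at a different wavelength, from 2π up to 10000·2π, so nearby positions differ in the fast dimensions and distant positions differ in the slow ones.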

Training recipe

Warmup + decay schedule: Linear warmup for 4000 steps, then inverse square root decay. Widely adopted.
Label smoothing: ε = 0.1. Hurts perplexity, improves BLEU. The model learns to be less confident.
Residual dropout: P = 0.1 on sub-layer outputs and on summed embeddings + positional encoding.
Weight sharing: Same matrix for encoder embeddings, decoder embeddings, and the pre-softmax linear. Fewer parameters, works.
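The paper gives the schedule in closed form: lrate = d_model⁻⁰·⁵ · min(step⁻⁰·⁵, step · warmup⁻¹·⁵). The two branches cross exactly at step = warmup, which is where the learning rate peaks:

```python
def lrate(step, d_model=512, warmup=4000):
    # Linear warmup (second branch) until step == warmup,
    # then inverse square-root decay (first branch).
    step = max(step, 1)  # avoid 0 ** -0.5 at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```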

What this architecture gives you

Attention is Attend. Scaled dot-product attention is a soft lookup: query a key-value store with continuous keys, get a weighted blend of values. Soar's staged preferences do the same job with discrete operators (reject first, rank survivors). The transformer version is differentiable; Soar's is interpretable; Minsky's cross-exclusion is the simplest: winner takes all.

Multi-head projection is polyneme broadcast made precise. Each head projects Q, K, V into a 64-dimensional subspace. One head learns syntax, another semantics, another positional proximity. Minsky described polynemes as signals that activate partial states across many agencies at once. Same idea, learned projections instead of hand-wired connections.

Causal masking is the simplest possible Filter. Prevent information flow from future to past. Soar's truth maintenance is richer: auto-retract any structure whose justification no longer holds. MemAgent learns which information to mask. Causal masking doesn't learn. It enforces a static constraint. But it's the one that enabled everything else.

Residual connections are what make depth survivable. Without them, gradients vanish. With them, 96 layers. This is a Cache operation: preserve the input alongside whatever the layer computes. Soar's working memory serves the same function. The central hub that every module reads from and writes to, never overwritten, always available.

The frozen forward pass

The transformer as described here runs one direction. No memory beyond the context window. No learning at inference time. No mechanism to update its own weights. Soar has all three: episodic memory, chunking, and RL update the architecture as it runs.

Everything that makes modern LLMs useful (pre-training, RLHF, in-context learning, agent loops with tools and memory) is built on top of this paper's foundation. The foundation itself is a single forward pass. What came next is the story of bolting on the missing pieces.
