Attention Is All You Need (and So Are You)

Co-authored by Rin & Mei

Jun 24, 2026

Introduction

Transformers replaced recurrence with a single, surprisingly simple idea: let every token decide, for itself, which other tokens matter. That decision is attention — and once you see it as a weighted average, the rest of the architecture quietly falls into place.

In this post we build scaled dot-product attention from the ground up, add multiple heads, and poke at why the $\sqrt{d_k}$ term keeps everything numerically calm.

Note

Attention was introduced for neural machine translation a few years before Transformers, as an add-on to recurrent encoder-decoder models. The 2017 paper’s trick was to throw away recurrence entirely and keep only attention.

Scaled dot-product attention

Given queries $Q$, keys $K$ and values $V$, attention scores each query against every key, normalises with a softmax, and uses the result to mix the values:

$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$

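Here $d_k$ is the dimension of the keys. If the components of a query and a key are roughly independent with zero mean and unit variance, their dot product has variance $d_k$, so the raw logits grow like $\sqrt{d_k}$. Dividing by $\sqrt{d_k}$ brings them back to roughly unit scale, which keeps the softmax out of the saturated regime where its gradients are close to zero.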
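To make the scaling argument concrete, here is a small simulation sketch. It assumes query and key components are independent draws from a standard normal distribution, which is an illustrative assumption rather than anything a trained model guarantees. The helper name `dot_product_variance` is hypothetical.

```python
import random

def dot_product_variance(d_k, trials=2000, seed=0):
    """Empirical variance of q.k when q and k have i.i.d. N(0, 1) components."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        samples.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / trials

for d_k in (4, 64, 256):
    raw = dot_product_variance(d_k)
    # Dividing each logit by sqrt(d_k) divides its variance by d_k.
    scaled = raw / d_k
    print(f"d_k={d_k:4d}  raw variance ~ {raw:7.1f}  scaled variance ~ {scaled:.2f}")
```

Under these assumptions the raw variance should track $d_k$ while the scaled variance stays near 1, whatever the key dimension.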

Tip

In code, scale before the softmax, not after. Dividing the probabilities instead of the logits does nothing to prevent saturation, and the rows no longer sum to 1.
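The difference between the two orderings is easy to see on a single row. The sketch below uses made-up logits and $d_k = 64$ purely for illustration; `softmax` is a plain-Python helper, not a library call.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical safety
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

d_k = 64
scale = math.sqrt(d_k)
logits = [40.0, 24.0, 8.0]  # hypothetical raw dot products for one query

# Scale the logits, then apply the softmax.
correct = softmax([x / scale for x in logits])
# Apply the softmax first, then scale the probabilities.
wrong = [p / scale for p in softmax(logits)]

print("scaled logits:", [round(p, 4) for p in correct], "sum =", round(sum(correct), 4))
print("scaled probs: ", [round(p, 4) for p in wrong], "sum =", round(sum(wrong), 4))
```

In the second ordering the softmax has already put almost all of its mass on the largest logit, and dividing afterwards only shrinks the row so that it no longer sums to 1.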

Figure: attention as a soft lookup table

Multi-head attention

One attention map can only emphasise one kind of relationship. Multi-head attention runs several in parallel, each with its own learned projections, then concatenates them:

$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)\,W^{O}$

where each $\mathrm{head}_i=\mathrm{Attention}(QW_i^{Q},\,KW_i^{K},\,VW_i^{V})$ learns to ask a different question of the same sentence.

Think of each head as a different lens: one might track syntax, another long-range topic, another simple position. Concatenation followed by $W^{O}$ lets the next layer use them all.
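The two formulas above can be sketched in plain Python on a toy input. This is a minimal illustration of the self-attention case, where $Q$, $K$ and $V$ are all the same matrix `x`. The weights are hand-picked rather than learned, and the helper names (`matmul`, `attention`, `multi_head`) are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for one head."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)

def multi_head(x, heads, w_o):
    """heads is a list of (W_q, W_k, W_v) triples, one per head."""
    outputs = [attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
               for w_q, w_k, w_v in heads]
    # Concatenate the head outputs along the feature axis, row by row.
    concat = [sum((out[i] for out in outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)

# Three tokens with four features each (toy values).
x = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 0.0]]
# Hand-picked projections: head 1 looks at the first two features, head 2 at the last two.
first_two = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
last_two = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(first_two, first_two, first_two), (last_two, last_two, last_two)]
w_o = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # identity

out = multi_head(x, heads, w_o)
for row in out:
    print([round(v, 3) for v in row])
```

Each head works in a 2-dimensional subspace, and the concatenated result has the same width as the input. A real implementation would learn the projections and batch the heads into a single tensor operation.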

Why does it work?

Self-attention costs $O(n^2 d)$ time for sequence length $n$ and representation dimension $d$, because every token looks at every other token. That quadratic cost buys something precious: a constant path length between any two positions, so gradients flow without the $O(n)$ chain of steps an RNN forces.

Because the graph is fully connected, information from the first token can reach the last in a single layer. Depth then stops being about reach and becomes about refinement: each block re-weights the same global context a little more sharply.

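import math
# Q, K: tensors of shape (..., n, d_k); V: shape (..., n, d_v), e.g. torch.Tensor
d_k     = Q.size(-1)
scores  = (Q @ K.transpose(-2, -1)) / math.sqrt(d_k)
weights = scores.softmax(dim=-1)   # each row sums to 1
out     = weights @ V              # the soft lookup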

Warning

That n² memory is real: a 4k-token (4,096) sequence needs an attention matrix of roughly 16.8M entries per head, about 64 MiB in fp32 before counting batch size or layers. Watch your VRAM.
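The arithmetic behind that warning is short enough to check directly. The sketch below assumes the full $n \times n$ weight matrix is materialised in memory, which is the naive case. Implementations that compute attention in blocks would not pay this cost. The function name `attention_matrix_bytes` is hypothetical.

```python
def attention_matrix_bytes(seq_len, bytes_per_entry=4, num_heads=1):
    """Bytes needed to hold seq_len x seq_len attention weights (fp32 by default)."""
    return seq_len * seq_len * bytes_per_entry * num_heads

for n in (1024, 4096, 16384):
    entries = n * n
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n={n:6d}  entries={entries:>12,d}  fp32 per head ~ {mib:,.0f} MiB")
```

Quadrupling the sequence length multiplies the memory by sixteen, which is why long contexts run out of memory long before the parameter count becomes the bottleneck.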

Danger

Never feed unmasked future tokens to a decoder at train time. The model will “cheat” by attending ahead, then fall apart at inference, when those future tokens do not exist yet.
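One common way to enforce this is a causal mask: set the score for every future position to $-\infty$ before the softmax, so its weight becomes exactly zero. A minimal plain-Python sketch follows. The scores are made up and `causal_weights` is a hypothetical helper name.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_weights(scores):
    """Row i may only attend to columns 0..i; later columns are masked to -inf."""
    n = len(scores)
    masked = [[scores[i][j] if j <= i else -math.inf for j in range(n)]
              for i in range(n)]
    return [softmax(row) for row in masked]

scores = [[0.5, 2.0, 1.0],
          [1.0, 0.0, 3.0],
          [0.2, 0.4, 0.6]]  # hypothetical scaled logits for three tokens

for row in causal_weights(scores):
    print([round(w, 3) for w in row])
```

The resulting weight matrix is lower triangular: the first token can only attend to itself, and no row puts any weight on a later position.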

A note on numerical stability

Softmax is shift-invariant, so subtracting the row maximum before exponentiating changes nothing mathematically but everything numerically: $\mathrm{softmax}(x)_i=\mathrm{softmax}(x-\max_j x_j)_i$. After the shift the largest exponent is $e^0=1$, so nothing overflows. Standard library implementations typically do this for you.
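A quick way to see the difference is to compare a naive softmax with the max-subtracted version on large logits. This is a plain-Python sketch for illustration. The function names are hypothetical and the inputs are made up.

```python
import math

def naive_softmax(xs):
    exps = [math.exp(x) for x in xs]  # overflows once x is large enough
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1001.0, 1002.0, 1003.0]  # the same values shifted by 1000

print(stable_softmax(small))
print(stable_softmax(large))  # same distribution as the line above

try:
    naive_softmax(large)
except OverflowError as err:
    print("naive softmax failed:", err)
```

Both inputs describe the same distribution, but only the shifted computation survives the large logits.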

Symbol	Meaning	Shape
$Q$	queries	$n \times d_k$
$K$	keys	$n \times d_k$
$V$	values	$n \times d_v$

Conclusion

Attention is just a learned, differentiable lookup. Everything else — heads, scaling, positional encodings — is bookkeeping around that one idea. Next time we’ll add the positions back in. ✦
