Attention Is All You Need (and So Are You)
A friendly, illustrated walk through scaled dot-product attention — why softmax, why the √dₖ scaling, and how multi-head attention lets a model look in many directions at once.
Long-form notes with proper equations and illustrated covers — built on Material Design, with light & dark themes you can switch up top.
Sines, cosines and the surprisingly cozy idea that any signal is just a chord. We build the Fourier transform from scratch, with pictures.
What does it mean for a matrix to "just stretch" a vector? An intuition-first tour of eigenvectors, spectra, and why they show up everywhere.
A tiny cat decides whether it is dinner time. Along the way we meet priors, likelihoods and the posterior — Bayes' rule, gently.
Learning rates, momentum and the small hours. A visual diary of rolling downhill toward a minimum without overshooting the valley.
One function, infinite drama. Temperature, logits and why your model is suspiciously confident — a short, warm look at softmax.