Gradient Descent at 3 A.M.

Mei

Apr 30, 2026

Rolling downhill

The update everyone knows:

$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t)$

The learning rate $\eta$ is the whole personality of the optimiser: too big and you bounce out of the valley, too small and it’s 3 a.m. before you converge.
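To see the three regimes in numbers, here is a minimal sketch of the update on a toy loss. The loss $\mathcal{L}(\theta) = \theta^2$, the function names, and the three values of $\eta$ are all assumptions picked for illustration, not anything canonical.

```python
def gradient_descent(grad, theta0, eta, steps):
    """Repeat theta <- theta - eta * grad(theta) for a fixed number of steps."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta


def grad_quadratic(theta):
    # Gradient of the toy loss L(theta) = theta**2 (hypothetical example loss).
    return 2.0 * theta


# Same start, same step budget, three learning rates.
good = gradient_descent(grad_quadratic, theta0=1.0, eta=0.1, steps=50)
too_big = gradient_descent(grad_quadratic, theta0=1.0, eta=1.1, steps=50)
too_small = gradient_descent(grad_quadratic, theta0=1.0, eta=0.001, steps=50)

print(f"eta=0.1   -> theta={good:.3g}")       # shrinks toward the minimum at 0
print(f"eta=1.1   -> theta={too_big:.3g}")    # overshoots further on every step
print(f"eta=0.001 -> theta={too_small:.3g}")  # has barely moved from 1.0
```

On this loss each step multiplies $\theta$ by $(1 - 2\eta)$, so anything with $\eta > 1$ grows in magnitude instead of shrinking.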

Warning

A loss that suddenly explodes to NaN almost always means $\eta$ is too large. Halve it before blaming the data.
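One way to automate that first check is sketched below: run a short training loop, and if the loss stops being finite, halve $\eta$ and start again from the same initial point. The quartic toy loss (chosen only because it blows up within a few steps), the function names, and the `max_halvings` cap are assumptions for illustration.

```python
import math


def run(loss, grad, theta0, eta, steps):
    """Plain gradient descent; returns None as soon as the loss is inf or NaN."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
        if not math.isfinite(loss(theta)):
            return None
    return theta


def halve_until_stable(loss, grad, theta0, eta, steps, max_halvings=20):
    """Retry from theta0 with eta halved each time the run blows up."""
    for _ in range(max_halvings):
        theta = run(loss, grad, theta0, eta, steps)
        if theta is not None:
            return theta, eta
        eta /= 2
    raise RuntimeError("loss still non-finite; time to look at the data")


def loss_quartic(theta):
    # Hypothetical toy loss theta^4. Written with multiplication so that
    # overflow yields inf instead of raising OverflowError.
    sq = theta * theta
    return sq * sq


def grad_quartic(theta):
    return 4.0 * theta * theta * theta


theta, eta = halve_until_stable(loss_quartic, grad_quartic,
                                theta0=3.0, eta=1.0, steps=200)
print(f"settled on eta={eta}, theta={theta:.3g}")
```

A finite loss only shows that the run did not blow up, not that it converged, so the surviving $\eta$ is a starting point rather than a tuned value.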

Momentum helps

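Accumulating an exponentially decaying average of past gradients smooths the descent and carries the update through flat spots, where the current gradient alone is close to zero.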
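A minimal sketch of one common formulation, assuming a velocity $v_{t+1} = \beta v_t + \nabla_\theta \mathcal{L}(\theta_t)$ followed by $\theta_{t+1} = \theta_t - \eta \, v_{t+1}$. The coefficient $\beta = 0.9$, the nearly flat toy loss, and the function names are assumptions for illustration; other variants scale the gradient by $(1 - \beta)$.

```python
def momentum_descent(grad, theta0, eta, beta, steps):
    """Gradient descent with a velocity term; beta=0 gives the plain update."""
    theta = theta0
    velocity = 0.0
    for _ in range(steps):
        # Decay the old velocity, then add the fresh gradient.
        velocity = beta * velocity + grad(theta)
        theta = theta - eta * velocity
    return theta


def grad_flat(theta):
    # Gradient of a nearly flat toy loss L(theta) = 0.005 * theta**2 (hypothetical).
    return 0.01 * theta


plain = momentum_descent(grad_flat, theta0=1.0, eta=0.1, beta=0.0, steps=100)
with_momentum = momentum_descent(grad_flat, theta0=1.0, eta=0.1, beta=0.9, steps=100)

print(f"plain:    theta={plain:.3f}")
print(f"momentum: theta={with_momentum:.3f}")  # ends closer to the minimum at 0
```

With the same $\eta$ and the same 100 steps, the momentum run gets noticeably closer to the minimum because the velocity keeps building while the gradient stays small and points the same way.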