λmem.ac

Gradient Descent at 3 A.M.

Rolling downhill

The update everyone knows:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \mathcal{L}(\theta_t)$$

The learning rate $\eta$ is the whole personality of the optimiser: too big and you bounce out of the valley, too small and it’s 3 a.m. before you converge.
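To see that personality in action, here is a minimal sketch of the update on a toy loss. The quadratic L(θ) = θ² and the names `grad` and `gradient_descent` are assumptions made for illustration; nothing here comes from a real training setup.

```python
def grad(theta):
    # Gradient of the assumed toy loss L(theta) = theta**2.
    return 2.0 * theta


def gradient_descent(theta0, eta, steps):
    # Repeatedly apply theta <- theta - eta * grad(theta).
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta


good = gradient_descent(1.0, eta=0.1, steps=50)         # shrinks towards the minimum at 0
too_big = gradient_descent(1.0, eta=1.1, steps=50)      # overshoots further on every step
too_small = gradient_descent(1.0, eta=0.001, steps=50)  # has barely left the start
```

On this toy loss each step multiplies θ by (1 − 2η), so a learning rate of 0.1 shrinks it, 1.1 makes it grow while flipping sign, and 0.001 hardly moves it in 50 steps.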

Warning

A loss that suddenly explodes to NaN almost always means $\eta$ is too large. Halve it before blaming the data.
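The halving advice can be automated, at least in sketch form. The version below assumes a toy loss L(θ) = θ² and restarts from the initial parameters whenever the loss stops being finite; a real training loop would more likely restore a checkpoint. The helpers `train` and `train_with_halving` are hypothetical names, not part of any library.

```python
import math


def loss(theta):
    # Assumed toy loss; written as a product so overflow yields inf instead of raising.
    return theta * theta


def grad(theta):
    return 2.0 * theta


def train(theta0, eta, steps):
    # Returns (theta, ok); ok is False as soon as the loss becomes inf or NaN.
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
        if not math.isfinite(loss(theta)):
            return theta, False
    return theta, True


def train_with_halving(theta0, eta, steps, max_retries=10):
    # Halve eta and restart from theta0 each time the run blows up.
    for _ in range(max_retries):
        theta, ok = train(theta0, eta, steps)
        if ok:
            return theta, eta
        eta = eta / 2.0
    raise RuntimeError("loss stayed non-finite after halving eta")


theta, eta_used = train_with_halving(1.0, eta=3.0, steps=2000)
```

Starting from a learning rate of 3.0, the first two attempts overflow and the third, at 0.75, settles at the minimum. A finite loss only rules out the blow-up, so it is still worth checking that the run actually converged.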

Momentum helps

Keeping a decaying average of past gradients smooths the descent and carries the optimiser through flat spots, where any single gradient is too small to make much progress.
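A minimal sketch of one common formulation, assumed here because the post does not pin one down: a velocity term accumulates an exponentially decaying sum of past gradients, and the parameters move along the velocity. The nearly flat toy loss, the `curvature` parameter and the name `sgd_momentum` are illustrative assumptions.

```python
def grad(theta, curvature=0.01):
    # Gradient of the assumed toy loss L(theta) = 0.5 * curvature * theta**2.
    # A small curvature makes the landscape nearly flat.
    return curvature * theta


def sgd_momentum(theta0, eta, beta, steps):
    # velocity <- beta * velocity + grad(theta); theta <- theta - eta * velocity.
    # With beta = 0 this reduces to plain gradient descent.
    theta, velocity = theta0, 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad(theta)
        theta = theta - eta * velocity
    return theta


plain = sgd_momentum(1.0, eta=0.1, beta=0.0, steps=200)
with_momentum = sgd_momentum(1.0, eta=0.1, beta=0.9, steps=200)
```

With the same learning rate and step budget, the momentum run ends several times closer to the minimum at zero on this flat loss, because consecutive small gradients point the same way and add up in the velocity.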