Gradient Descent at 3 A.M.
Rolling downhill
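The update everyone knows:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta)$$

Here $\theta$ is the parameter vector, $L$ is the loss, and $\eta$ is the learning rate.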
The learning rate is the whole personality of the optimiser: too big and you bounce out of the valley, too small and it’s 3 a.m. before you converge.
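To see how strongly the learning rate matters, here is a minimal sketch on a toy problem. The loss `L(x) = x**2`, the starting point, and the three learning rates are all assumptions chosen for illustration. They do not come from any particular model.

```python
def gradient_descent(grad, x0, lr, steps):
    """Repeat x <- x - lr * grad(x) for a fixed number of steps."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x


def grad(x):
    # Gradient of the hypothetical toy loss L(x) = x**2; the minimum is at x = 0.
    return 2.0 * x


# Same start, same number of steps, three illustrative learning rates.
too_small = gradient_descent(grad, x0=1.0, lr=0.001, steps=50)
reasonable = gradient_descent(grad, x0=1.0, lr=0.1, steps=50)
too_big = gradient_descent(grad, x0=1.0, lr=1.1, steps=50)

print(too_small, reasonable, too_big)
```

On this toy loss each step multiplies `x` by `1 - 2 * lr`. The tiny rate barely moves and the middle one shrinks `x` towards zero. The large one flips the sign and grows the magnitude on every step, which is the "bouncing out of the valley" case.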
Warning
A loss that suddenly explodes to NaN almost always means the learning rate is too large.
Halve it before blaming the data.
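The "halve it first" habit can be automated. The sketch below is one possible way to do it, not a standard recipe. The helper `train_with_backoff`, the quartic toy loss, and the starting values are hypothetical. It restarts from the same point with half the learning rate whenever the loss stops being finite.

```python
import math


def train_with_backoff(loss_fn, grad_fn, x0, lr, steps, max_halvings=20):
    """Run gradient descent; if the loss becomes inf or NaN, halve lr and restart."""
    for _ in range(max_halvings + 1):
        x = x0
        diverged = False
        for _ in range(steps):
            x = x - lr * grad_fn(x)
            if not math.isfinite(loss_fn(x)):  # catches both inf and NaN
                diverged = True
                break
        if not diverged:
            return x, lr
        lr = lr / 2.0  # halve the learning rate before suspecting anything else
    raise RuntimeError("loss still non-finite after repeated halving")


def loss(x):
    # Hypothetical toy loss x**4, written with multiplication so that
    # overflow yields inf instead of raising OverflowError.
    return x * x * x * x


def loss_grad(x):
    return 4.0 * x * x * x


x_final, lr_final = train_with_backoff(loss, loss_grad, x0=1.5, lr=1.0, steps=100)
print(x_final, lr_final)
```

In a real training loop the same check would sit on the logged loss value. Restarting from a saved checkpoint is usually cheaper than restarting from scratch.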
Momentum helps
Averaging past gradients smooths the descent and powers through flat spots.
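A small sketch of the idea, assuming the exponential-moving-average form of momentum. The classic heavy-ball variant accumulates gradients without the `1 - beta` factor. The piecewise toy gradient, `beta = 0.9`, and the other constants are made up for illustration. The landscape has a slope, then a perfectly flat plateau, then a bowl with its minimum at 0.

```python
def momentum_descent(grad, x0, lr, beta, steps):
    """Gradient descent on an exponential moving average of past gradients."""
    x = x0
    velocity = 0.0
    for _ in range(steps):
        # Blend the new gradient into the running average (beta = 0 gives plain GD).
        velocity = beta * velocity + (1.0 - beta) * grad(x)
        x = x - lr * velocity
    return x


def plateau_grad(x):
    # Hypothetical landscape: slope for x > 1, flat for 0.5 < x <= 1,
    # and a bowl x**2 (gradient 2x) for x <= 0.5.
    if x > 1.0:
        return 1.0
    if x > 0.5:
        return 0.0
    return 2.0 * x


plain = momentum_descent(plateau_grad, x0=2.0, lr=0.1, beta=0.0, steps=500)
with_momentum = momentum_descent(plateau_grad, x0=2.0, lr=0.1, beta=0.9, steps=500)

print(plain, with_momentum)
```

With `beta=0.0` the gradient vanishes on the plateau and the iterate stops there. With `beta=0.9` the velocity built up on the slope carries it across the flat stretch and into the bowl.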