Momentum
Momentum is a popular heuristic used alongside SGD in place of full second-order optimization: it updates the weights based on an exponentially weighted moving average of the past gradients.
This helps smooth out the updates and approximate the curvature, finding a better path towards the minimum. Essentially, if past updates all point in the same direction, we can be more confident taking larger steps in that direction.
The update rule for gradient descent with momentum (in its classical form) is:

$$
v_{t+1} = \mu\, v_t + \eta\, \nabla_\theta L(\theta_t)
$$
$$
\theta_{t+1} = \theta_t - v_{t+1}
$$

where $\mu$ is the momentum coefficient (typically around 0.9), $\eta$ is the learning rate, and $v$ is the velocity, i.e. the running accumulation of past gradients.
(reminiscent of how full-batch gradient descent smooths out the noisy updates of mini-batch gradient descent)
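A minimal NumPy sketch of this classical update, assuming a toy quadratic loss and illustrative hyperparameters (the function name `sgd_momentum_step` and the values lr=0.01, mu=0.9 are just for illustration):

```python
import numpy as np

def sgd_momentum_step(theta, velocity, grad, lr=0.01, mu=0.9):
    """One SGD-with-momentum update (classical formulation).

    theta    : current parameters
    velocity : running accumulation of past gradients, same shape as theta
    grad     : gradient of the loss at theta
    lr       : learning rate (eta)
    mu       : momentum coefficient
    """
    velocity = mu * velocity + lr * grad   # exponentially weighted gradient accumulation
    theta = theta - velocity               # step along the accumulated direction
    return theta, velocity

# toy usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself
theta = np.array([5.0, -3.0])
velocity = np.zeros_like(theta)
for _ in range(200):
    grad = theta                           # gradient of the toy quadratic
    theta, velocity = sgd_momentum_step(theta, velocity, grad)
print(theta)                               # approaches the minimum at [0, 0]
```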
TODO: revisit these equations after reading the Nesterov paper. Note: $v$ most likely stands for velocity.
Note: I got confused with PyTorch's torch.optim.SGD (https://pytorch.org/docs/stable/generated/torch.optim.SGD.html), which implements momentum differently (see the note in its docs).
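For reference, a minimal PyTorch usage sketch with illustrative hyperparameters (lr=0.01, momentum=0.9); per its docs, PyTorch applies the learning rate at the parameter update rather than inside the velocity accumulation, i.e. $b_{t+1} = \mu b_t + g_{t+1}$, $\theta_{t+1} = \theta_t - \eta\, b_{t+1}$:

```python
import torch

# toy model and data, purely for illustration
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)

optimizer.zero_grad()   # clear stale gradients
loss.backward()         # compute gradients
optimizer.step()        # momentum-SGD update of the parameters
```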