## Optimization methods

One of the biggest mysteries for beginners in neural networks is figuring out when to use which optimization method. This post focuses on describing popular first-order methods: SGD, Momentum, Nesterov Momentum, Adagrad, RMSProp and Adam.

Second order methods are not included, largely because computing and inverting the Hessian is too computationally expensive, which makes them impractical for most deep learning work.

Using stochastic gradient descent, we simply step in the direction opposite to the gradient. The size of each step is controlled by a learning rate, which is typically decayed (often linearly) over the course of training. In equation form, $\theta_{t+1} = \theta_t - \alpha \nabla J(\theta_t)$, where $\alpha$ is the learning rate.
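
As a concrete reference, here is a minimal NumPy sketch of the vanilla SGD update; the function name, learning rate, and example values are illustrative assumptions, not from the post:

```python
import numpy as np

# Hypothetical helper: one vanilla SGD step on a parameter array.
def sgd_step(theta, grad, lr=0.01):
    return theta - lr * grad

theta = np.array([1.0, -2.0])
grad = np.array([0.5, 0.3])   # gradient of J at theta (made-up numbers)
theta = sgd_step(theta, grad)
```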

The problem with SGD is that choosing a proper learning rate is difficult, and it is easy to get stuck near a saddle point when optimizing a highly non-convex loss function.

### Momentum

Stochastic Gradient Descent can oscillate and have trouble converging. Momentum tries to improve on SGD by damping the oscillations and reinforcing progress along the consistent descent direction.

If we call the update $v$, the SGD equation becomes $\theta = \theta - v$. Momentum combines the past update with the current gradient step in order to stabilize the updates:

$$v_t = \mu \cdot v_{t-1} + \alpha \nabla J(\theta_t), \qquad \theta_{t+1} = \theta_t - v_t$$

The value of the momentum coefficient $\mu$ is usually close to 1, e.g. 0.9 or 0.95.
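
A minimal sketch of one momentum step, under the same assumed helper-function style as above (names and hyperparameter values are illustrative):

```python
import numpy as np

# Hypothetical helper: one momentum step. `v` carries the running velocity.
def momentum_step(theta, grad, v, lr=0.01, mu=0.9):
    v = mu * v + lr * grad    # v_t = mu * v_{t-1} + alpha * grad
    return theta - v, v       # theta_{t+1} = theta_t - v_t

theta, v = np.array([1.0, -2.0]), np.zeros(2)
grad = np.array([0.5, 0.3])
theta, v = momentum_step(theta, grad, v)
```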

### Nesterov Momentum

Momentum is an improvement on SGD because $v_t$ is combined with $v_{t-1}$. However, because $v_t$ depends heavily on $v_{t-1}$, momentum by itself is slow to adapt and change direction.

Nesterov momentum builds on raw momentum. Instead of calculating $v_t$ via $\nabla J(\theta_t)$, Nesterov tries to use $\nabla J(\theta_{t+1})$, the gradient at the future parameters. But how can we calculate the gradient of parameters we do not have yet? We can’t. However, we can approximate the future parameters by assuming that $v_t \approx \mu \cdot v_{t-1}$, which is a good approximation because the momentum term usually dominates the gradient term. This gives the look-ahead point $\theta_t - \mu \cdot v_{t-1}$, and the gradient is evaluated there instead of at $\theta_t$.
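
A sketch of one Nesterov step under the look-ahead approximation above; `grad_fn` is an assumed placeholder for any function returning $\nabla J(\theta)$, and the toy loss in the example is made up for illustration:

```python
import numpy as np

# Hypothetical sketch: `grad_fn(theta)` returns dJ/dtheta.
def nesterov_step(theta, grad_fn, v, lr=0.01, mu=0.9):
    theta_ahead = theta - mu * v            # approximate future parameters
    v = mu * v + lr * grad_fn(theta_ahead)  # gradient at the look-ahead point
    return theta - v, v

# Example with a toy quadratic loss J(theta) = 0.5 * ||theta||^2, so grad = theta.
grad_fn = lambda t: t
theta, v = np.array([1.0, -2.0]), np.zeros(2)
theta, v = nesterov_step(theta, grad_fn, v)
```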

Adagrad, RMSProp and Adam take a different approach to improving SGD. SGD, Momentum and Nesterov Momentum all use a single learning rate for every parameter. The following three methods instead adaptively tune the learning rate for each parameter.

### RMSProp

RMSProp improves on Adagrad by decaying the cache of squared gradients. Because the cache is “leaky” rather than accumulating forever, the per-parameter step sizes do not shrink toward 0 the way they do in Adagrad. Hinton recommends setting the decay rate $\gamma$ to 0.9 and the learning rate $\alpha$ to 0.001.
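
A minimal sketch of one RMSProp step, using the defaults mentioned above; the helper name, the epsilon term for numerical stability, and the example values are assumptions for illustration:

```python
import numpy as np

# Hypothetical helper: one RMSProp step. `cache` is the leaky running
# average of squared gradients, one entry per parameter.
def rmsprop_step(theta, grad, cache, lr=0.001, gamma=0.9, eps=1e-8):
    cache = gamma * cache + (1 - gamma) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache

theta, cache = np.array([1.0, -2.0]), np.zeros(2)
grad = np.array([0.5, 0.3])
theta, cache = rmsprop_step(theta, grad, cache)
```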