Optimization in Deep Learning

Algorithms that optimize the parameters \(\bm{\theta}\) of deep neural networks \(f(\bm{x};\bm{\theta})\) by minimizing some loss function \(L(f(\bm{x};\bm{\theta}),y)\) are mostly (if not all) variants of gradient descent, whose update rule is \(\bm{\theta}_t \leftarrow \bm{\theta}_{t-1} - \eta \nabla_{\bm{\theta}} L(f(\bm{x};\bm{\theta}),y)\big|_{\bm{\theta}=\bm{\theta}_{t-1}}\). Here we assume a supervised learning setting in which the quantity being minimized is the empirical risk, i.e. the average of the loss over all training samples, under the assumption that the data is generated i.i.d. In the deep learning literature this is often called batch gradient descent.
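As a concrete illustration, here is a minimal NumPy sketch of one full-batch gradient descent step; the linear model, squared-error loss, learning rate, and toy data are assumptions made for the example, not details from the text above.

#+begin_src python
import numpy as np

def batch_gd_step(theta, X, y, lr=0.1):
    """One full-batch gradient descent step for a linear model X @ theta
    with mean squared error loss (both are illustrative choices)."""
    preds = X @ theta                          # f(x; theta) for every training sample
    grad = (2.0 / len(y)) * X.T @ (preds - y)  # gradient of the empirical risk
    return theta - lr * grad                   # theta_t <- theta_{t-1} - eta * grad

# Toy usage: recover a known parameter vector from noiseless data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true
theta = np.zeros(3)
for _ in range(200):
    theta = batch_gd_step(theta, X, y)
#+end_src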

If instead we average the gradient over a mini-batch of \(m\) randomly selected samples at each parameter update, the method is called mini-batch gradient descent; the special case \(m=1\) is often called stochastic gradient descent.
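The corresponding mini-batch step might look like the sketch below, reusing the same illustrative linear model and squared-error loss as above; the default batch size and learning rate are assumptions.

#+begin_src python
import numpy as np

def minibatch_gd_step(theta, X, y, m=32, lr=0.1, rng=None):
    """One mini-batch gradient descent step: average the gradient over m
    randomly selected samples; m=1 corresponds to stochastic gradient descent."""
    rng = rng if rng is not None else np.random.default_rng()
    idx = rng.choice(len(y), size=m, replace=False)
    Xb, yb = X[idx], y[idx]
    grad = (2.0 / m) * Xb.T @ (Xb @ theta - yb)
    return theta - lr * grad
#+end_src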

In practice, it is common to use momentum, i.e. an exponential average of previous gradients, \(\bar{\nabla}_{\bm{\theta}}^{(t)} \leftarrow \nabla_{\bm{\theta}} L(f(\bm{x};\bm{\theta}),y)\big|_{\bm{\theta}=\bm{\theta}_{t-1}} + \beta \bar{\nabla}^{(t-1)}_{\bm{\theta}}\), which the parameter update then follows instead of the raw gradient; according to Ruslan Salakhutdinov, this enables passing plateaus more quickly.
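In code, the momentum recursion above might look like the following sketch; stepping along the averaged gradient and the choice \(\beta = 0.9\) are common conventions assumed here rather than details given above.

#+begin_src python
import numpy as np

def momentum_step(theta, avg_grad, grad, lr=0.1, beta=0.9):
    """Momentum: keep an exponential average of past gradients and step
    along that average instead of the raw gradient."""
    avg_grad = grad + beta * avg_grad  # bar_grad_t = grad_t + beta * bar_grad_{t-1}
    theta = theta - lr * avg_grad      # parameter step follows the averaged gradient
    return theta, avg_grad

# Usage: initialise avg_grad to zeros with the same shape as theta,
# then call momentum_step once per (mini-)batch gradient.
#+end_src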

Adaptive learning rates (i.e. one learning rate per parameter) currently work very well. For example, Adagrad divides the learning rate by the square root of the cumulative sum of squared gradients (plus a small epsilon). The idea is to shrink the effective learning rate for parameters that have accumulated large gradients, and keep it relatively large for those that have not.
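A per-parameter sketch of the Adagrad update described above; the default learning rate, the epsilon value, and placing epsilon outside the square root are assumptions (implementations differ on the last point).

#+begin_src python
import numpy as np

def adagrad_step(theta, grad_sq_sum, grad, lr=0.01, eps=1e-8):
    """Adagrad: accumulate squared gradients per parameter and divide the
    learning rate by the square root of that cumulative sum (plus eps)."""
    grad_sq_sum = grad_sq_sum + grad ** 2
    theta = theta - lr * grad / (np.sqrt(grad_sq_sum) + eps)
    return theta, grad_sq_sum
#+end_src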

RMSProp uses an exponential moving average of squared gradients instead of Adagrad's cumulative sum, and Adam combines RMSProp with momentum.
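Sketches of both updates, continuing the same per-parameter style; the decay rates, learning rates, epsilon, and Adam's bias-correction terms are standard defaults assumed here rather than details given above.

#+begin_src python
import numpy as np

def rmsprop_step(theta, sq_avg, grad, lr=1e-3, rho=0.9, eps=1e-8):
    """RMSProp: exponential moving average of squared gradients
    in place of Adagrad's cumulative sum."""
    sq_avg = rho * sq_avg + (1.0 - rho) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(sq_avg) + eps)
    return theta, sq_avg

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: RMSProp-style second-moment scaling combined with a momentum-style
    first-moment average; both averages are bias-corrected because they start at zero."""
    m = beta1 * m + (1.0 - beta1) * grad       # first moment (momentum)
    v = beta2 * v + (1.0 - beta2) * grad ** 2  # second moment (RMSProp)
    m_hat = m / (1.0 - beta1 ** t)             # bias correction, t starts at 1
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
#+end_src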
