# Optimization in Deep Learning

Algorithms that optimize the parameters of deep neural networks
\(f(\bm{x};\bm{\theta})\) by minimizing some loss function \(L(f(\bm{x};\bm{\theta}),y)\)
are mostly (if not all) variants of gradient descent, where the update
rule is \(\bm{\theta}_t \leftarrow \bm{\theta}_{t-1} -\eta
\left.\nabla_{\bm{\theta}} L(f(\bm{x};\bm{\theta}),y)\right|_{\bm{\theta}=\bm{\theta}_{t-1}}\). Here we assume a
supervised learning setting where the loss function is the empirical
risk, i.e. the average of the loss over all training samples,
assuming the data are generated i.i.d. In the deep learning literature
this is often called **batch gradient descent**.
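As a minimal sketch, batch gradient descent on a least-squares loss looks like the following; the data, learning rate, and function name here are illustrative assumptions, not part of the text above:

```python
import numpy as np

# Batch gradient descent sketch: each update averages the gradient of a
# least-squares loss L(theta) = mean((X @ theta - y)^2) over ALL samples.
def batch_gradient_descent(X, y, eta=0.1, steps=500):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = 2 * X.T @ (X @ theta - y) / len(y)  # gradient of empirical risk
        theta = theta - eta * grad                 # theta <- theta - eta * grad
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])  # exactly y = 1 + x, so the optimum is (1, 1)
theta = batch_gradient_descent(X, y)
```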

If, instead, we take the average over a mini-batch of \(m\) randomly
selected samples at each parameter update, the method is called
**mini-batch gradient descent**; the special case \(m=1\) is often called
**stochastic gradient descent** (SGD).
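A minimal mini-batch variant of the sketch above, again with illustrative data and hyperparameters of our choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Mini-batch gradient descent sketch on the same least-squares loss:
# each epoch shuffles the data and steps on batches of m samples.
def minibatch_sgd(X, y, eta=0.05, m=2, epochs=3000):
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, m):
            batch = idx[start:start + m]  # m randomly selected samples
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ theta - yb) / len(yb)
            theta -= eta * grad           # m = 1 recovers plain SGD
    return theta

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([2.0, 3.0, 4.0, 5.0])  # y = 1 + x, optimum theta = (1, 1)
theta = minibatch_sgd(X, y)
```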

In practice, it is common to use **momentum**, i.e. an exponentially
decaying average of previous gradients, \(\bar{\nabla}_{\bm{\theta}}^{(t)}\leftarrow
\left.\nabla_{\bm{\theta}}L(f(\bm{x};\bm{\theta}),y)\right|_{\bm{\theta}=\bm{\theta}_{t-1}}
+\beta \bar{\nabla}^{(t-1)}_{\bm{\theta}}\), which is then used in the
update rule in place of the raw gradient. According to Ruslan
Salakhutdinov, this enables passing plateaus more quickly.
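The momentum update can be sketched as follows, following the accumulation rule given above; the quadratic loss and hyperparameter values are illustrative assumptions:

```python
import numpy as np

# Momentum sketch following the update in the text:
# g_bar <- grad + beta * g_bar, then theta <- theta - eta * g_bar.
def gd_momentum(grad_fn, theta0, eta=0.01, beta=0.9, steps=500):
    theta = np.asarray(theta0, dtype=float)
    g_bar = np.zeros_like(theta)
    for _ in range(steps):
        g_bar = grad_fn(theta) + beta * g_bar  # exponential average of gradients
        theta = theta - eta * g_bar            # step along the averaged gradient
    return theta

# Illustrative quadratic loss L(theta) = theta_0^2 + 10 * theta_1^2,
# whose minimum is at the origin.
grad = lambda th: np.array([2 * th[0], 20 * th[1]])
theta = gd_momentum(grad, [3.0, 3.0])
```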

Adaptive learning rates (i.e. one learning rate per parameter) currently work well in practice. For example, Adagrad divides the learning rate by the square root of the cumulative sum of squared gradients (plus a small epsilon), separately for each parameter. The idea is to reduce the learning rate where the gradient variance is high, and increase it otherwise.
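A per-parameter Adagrad step can be sketched as below; as before, the quadratic test loss and hyperparameters are our own illustrative choices:

```python
import numpy as np

# Adagrad sketch: per-parameter learning rates obtained by dividing eta by
# the square root of the cumulative sum of squared gradients (plus epsilon).
def adagrad(grad_fn, theta0, eta=0.5, eps=1e-8, steps=300):
    theta = np.asarray(theta0, dtype=float)
    accum = np.zeros_like(theta)  # cumulative sum of squared gradients
    for _ in range(steps):
        g = grad_fn(theta)
        accum += g ** 2
        theta -= eta * g / (np.sqrt(accum) + eps)  # per-parameter scaling
    return theta

# Illustrative quadratic loss with minimum at the origin.
grad = lambda th: np.array([2 * th[0], 20 * th[1]])
theta = adagrad(grad, [3.0, 3.0])
```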

RMSProp uses an exponential moving average instead of Adagrad's cumulative sum, and Adam combines RMSProp with momentum.
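The combination can be sketched as the standard Adam update (including the usual bias corrections, which the text does not mention); the test loss and hyperparameters are illustrative assumptions:

```python
import numpy as np

# Adam sketch: an RMSProp-style squared-gradient average rescales a
# momentum-style gradient average, with the standard bias corrections.
def adam(grad_fn, theta0, eta=0.01, b1=0.9, b2=0.999, eps=1e-8, steps=2000):
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)  # exponential average of gradients (momentum)
    v = np.zeros_like(theta)  # exponential average of squared gradients (RMSProp)
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1 - b2 ** t)
        theta -= eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Illustrative quadratic loss with minimum at the origin.
grad = lambda th: np.array([2 * th[0], 20 * th[1]])
theta = adam(grad, [3.0, 3.0])
```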