Gradient Descent
A descent-direction-based optimization algorithm to minimize \(f(\mathbf{ x }) : \mathbb{ R }^p \to \mathbb{ R }\), where the descent direction is taken to be the negative normalized gradient \(\mathbf{d}^{(k)} = -\frac{\nabla_{\mathbf{x}} f(\mathbf{x}^{(k)})}{||\nabla_{\mathbf{x}} f(\mathbf{x}^{(k)})||}\).
The idea is that \(\nabla_\mathbf{ x }f(\mathbf{ x })\) (the vector whose components tell how fast \(f\) is changing along each standard basis direction) points in the direction of steepest ascent, so stepping against it decreases \(f\) as quickly as possible locally. Two reasons why this makes sense:
- Given we are at some point \(\mathbf{ x }\), we can try to figure out which direction to move in by looking at the directional derivative \(\lim_{h\to 0}\frac{f(\mathbf{ x }+h\mathbf{ v })-f(\mathbf{ x })}{h}\) (the rate of change of \(f\) along the direction of \(\mathbf{ v }\), where \(\mathbf{ v }\) is a unit vector). This is maximized by choosing \(\mathbf{ v }\) to point along \(\nabla_\mathbf{ x }f(\mathbf{ x })\), since the directional derivative equals \(\nabla_\mathbf{ x }f(\mathbf{ x })\cdot \mathbf{ v }=||\nabla_\mathbf{ x }f(\mathbf{ x })||\cos\theta\) (with \(\theta\) the angle between the two vectors), which is maximized when \(\theta=0\), i.e. when \(\mathbf{ v }\) points in the same direction as \(\nabla_\mathbf{ x }f(\mathbf{ x })\) (see the worked example after this list).
- The first-order Taylor expansion \(f(\mathbf{ x }+\delta \mathbf{ x }) =f(\mathbf{ x })+\nabla_\mathbf{x}f \cdot \delta\mathbf{ x } + \text{HOT}\) shows that, for a fixed step length \(||\delta\mathbf{ x }||\), the first-order change \(\nabla_\mathbf{x}f \cdot \delta\mathbf{ x }\) is maximized when \(\delta\mathbf{ x }\) points in the same direction as \(\nabla_\mathbf{ x }f\) (and minimized when it points in the opposite direction, which is why we step along \(-\nabla_\mathbf{ x }f\)).
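As a quick worked example (a toy function of my own, not from the notes above), take \(f(x,y)=x^2+y^2\) at the point \((1,2)\):
\[
\nabla f(1,2) = (2,4), \qquad D_{\mathbf{ v }}f(1,2) = (2,4)\cdot \mathbf{ v } = \sqrt{20}\,\cos\theta \le \sqrt{20},
\]
with equality exactly when \(\mathbf{ v } = (2,4)/\sqrt{20}\); the steepest descent direction is its negative, \(\mathbf{ d } = -(2,4)/\sqrt{20}\).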
The update rule is thus \(\mathbf{x}^{(k+1)} \leftarrow \mathbf{x}^{(k)} + \alpha^{(k)}\mathbf{d}^{(k)}\) (the minus sign is already inside \(\mathbf{d}^{(k)}\)) with \(\alpha^{(k)} = \text{argmin}_\alpha f(\mathbf{x}^{(k)}+\alpha \mathbf{d}^{(k)})\), i.e. exact line search. For the optimal \(\alpha^{(k)}\), we have \(0=\frac{\partial f(\mathbf{x}^{(k+1)})}{\partial \alpha^{(k)}} = \frac{\partial f(\mathbf{x}^{(k+1)})}{\partial \mathbf{x}^{(k+1)}}\frac{\partial \mathbf{x}^{(k+1)}}{\partial \alpha^{(k)}} = \nabla_\mathbf{x} f(\mathbf{x}^{(k+1)})\cdot \mathbf{d}^{(k)}\), and since \(\mathbf{d}^{(k+1)}\) is parallel to \(-\nabla_\mathbf{x} f(\mathbf{x}^{(k+1)})\), this implies the directions taken in steps \(k\) and \(k+1\) are orthogonal to each other.
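A minimal sketch of this update with exact line search (my own illustration, assuming a toy quadratic \(f(\mathbf{ x })=\tfrac{1}{2}\mathbf{ x }^\top A\mathbf{ x }-\mathbf{ b }^\top\mathbf{ x }\), for which the line-search minimizer has the closed form \(\alpha^{(k)} = -\frac{\mathbf{d}^{(k)\top}\nabla_\mathbf{x} f(\mathbf{x}^{(k)})}{\mathbf{d}^{(k)\top}A\mathbf{d}^{(k)}}\)); the printed dot products of consecutive directions should come out numerically zero, matching the orthogonality argument above:

```python
import numpy as np

# Toy quadratic f(x) = 0.5 x^T A x - b^T x (hypothetical example values).
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])          # symmetric positive definite
b = np.array([1.0, -2.0])

def f(x):
    return 0.5 * x @ A @ x - b @ x

def grad(x):
    return A @ x - b

x = np.array([5.0, 5.0])            # arbitrary starting point
prev_d = None
for k in range(10):
    g = grad(x)
    d = -g / np.linalg.norm(g)       # descent direction d^(k): negative normalized gradient
    alpha = -(d @ g) / (d @ A @ d)   # exact line search step (closed form for a quadratic)
    if prev_d is not None:
        # consecutive exact-line-search directions should be (numerically) orthogonal
        print(f"k={k}: d^(k) . d^(k-1) = {d @ prev_d: .2e}, f(x) = {f(x):.4f}")
    x = x + alpha * d                # update x^(k+1) = x^(k) + alpha^(k) d^(k)
    prev_d = d
```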