Numerics of Machine Learning 2023 (University of Tuebingen)
Introduction
TODO Link to YouTube playlist. The course focuses on the numerical algorithms used in machine learning, treating the computations themselves as machine learning problems.
Content
Introduction
Modern machine learning (ML) brings together data and our models using computations. Compared to rule-based AI, modern ML relies on numerical computation.
Numerical methods (e.g. conjugate gradients, linear system solvers) are different from atomic operations (e.g. trigonometric functions, exponentials and logarithms): the former estimate a latent quantity and carry approximation error, while the latter are evaluated essentially to machine precision.
Numerical computations are not as straightforward as the corresponding mathematical formula may make them look: e.g. a linear system with a symmetric positive-definite matrix should be solved via a Cholesky decomposition rather than by explicitly forming the matrix inverse.
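As a small illustration (not from the lecture), the sketch below solves a symmetric positive-definite system once via a Cholesky factorization and once via an explicit inverse; the matrix, its size, and the random seed are arbitrary choices:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)

# Build a random symmetric positive-definite matrix A and a right-hand side b.
n = 500
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

# Preferred: factorize once with Cholesky, then solve two triangular systems.
x_chol = cho_solve(cho_factor(A), b)

# Discouraged: explicitly form the inverse and multiply.
x_inv = np.linalg.inv(A) @ b

print("residual (Cholesky):        ", np.linalg.norm(A @ x_chol - b))
print("residual (explicit inverse):", np.linalg.norm(A @ x_inv - b))
```

The factorization route is both cheaper and numerically more stable, and the factor can be reused for further right-hand sides.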
Matrix inversion itself is often quoted as having cubic cost, but given the right structure we can obtain up to exponential speed-ups, and the cost drops below cubic if we tolerate imprecise computations, e.g. by stopping an iterative solver early.
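A minimal sketch of the last point, assuming SciPy's conjugate gradient solver: capping the number of iterations trades solution accuracy for compute. The system and the iteration budgets are made up for illustration:

```python
import numpy as np
from scipy.sparse.linalg import cg

rng = np.random.default_rng(1)

# Symmetric positive-definite system, as before.
n = 1000
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)

# Each CG iteration costs one matrix-vector product (O(n^2) for a dense A,
# much less for sparse or otherwise structured A), so an early stop buys
# a cheaper but less precise solution.
for maxiter in (5, 20, 100):
    x, _ = cg(A, b, maxiter=maxiter)
    print(f"{maxiter:4d} iterations -> residual {np.linalg.norm(A @ x - b):.2e}")
```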
Simulation is another area where numerical computations are used: simulations estimate the trajectory of a dynamical system, and include Ordinary Differential Equations (ODEs), Partial Differential Equations (PDEs), and differential algebraic equations (DAEs). They are useful when we know something about the structure of a prediction task where off-the-shelf deep learning and kernel machines may not work, e.g. an SIR model for COVID-19 data, as sketched below. We will see later that simulation is inference from information operators.
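For concreteness, a toy simulation of the SIR model mentioned above using scipy.integrate.solve_ivp; the rates beta and gamma and the initial condition are invented, not fitted to COVID-19 data:

```python
from scipy.integrate import solve_ivp

# Toy SIR model: S (susceptible), I (infected), R (recovered) as fractions.
# beta and gamma are made-up contact and recovery rates.
beta, gamma = 0.3, 0.1

def sir(t, y):
    S, I, R = y
    return [-beta * S * I, beta * S * I - gamma * I, gamma * I]

# Simulate 160 days, starting from 1% infected.
sol = solve_ivp(sir, t_span=(0.0, 160.0), y0=[0.99, 0.01, 0.0])
print("final state (S, I, R):", sol.y[:, -1])
```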
Integration is an important problem that comes up in areas such as Bayesian inference. Classical quadrature rules are the go-to numerical methods and work well in low dimensions. These can be thought of as a form of Gaussian Process (GP) regression.
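A small sketch of a classical quadrature rule, here the trapezoidal rule on a hand-picked 1D integral with a known value; the node counts are arbitrary:

```python
import numpy as np

# Toy 1D integral with a known answer: the integral of exp(x) over [0, 1] is e - 1.
f = np.exp
true_value = np.e - 1.0

# Trapezoidal rule on an evenly spaced grid: a weighted sum of function values.
# Under suitable GP priors, such classical rules can be recovered as the
# posterior mean of a GP fitted to the same evaluations (Bayesian quadrature).
for n in (5, 20, 100):
    x = np.linspace(0.0, 1.0, n)
    estimate = np.sum((f(x[:-1]) + f(x[1:])) * np.diff(x)) / 2.0
    print(f"{n:4d} nodes -> error {abs(estimate - true_value):.2e}")
```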
Much of modern ML is based on minimizing some loss function. This is an optimization task, and the availability of differentiable programming allows us to use gradient-based optimization algorithms like Gradient Descent (GD). In practice we use Stochastic Gradient Descent (SGD), which computes the gradient on a uniform random sample of the data points instead of all of them. Each gradient in SGD is therefore a realization of a random vector whose mean is the gradient over all data points; by the central limit theorem it is approximately Gaussian for large batches. (#TODO Think a bit more on that variance). Overall, training deep networks with SGD is considerably more cumbersome than, for example, using the LBFGS algorithm.
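To make the "noisy but unbiased gradient" point concrete, here is a toy least-squares example (synthetic data, arbitrary batch size) comparing the full-data gradient with the average of many minibatch gradients:

```python
import numpy as np

rng = np.random.default_rng(2)

# Least-squares loss L(w) = (1/2N) * sum_i (x_i^T w - y_i)^2 on synthetic data.
N, d = 10_000, 5
X = rng.standard_normal((N, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)
w = np.zeros(d)  # evaluation point for the gradients

def gradient(idx):
    # Gradient of the loss restricted to the rows in idx, normalized by batch size.
    return X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)

full_grad = gradient(np.arange(N))

# Minibatch gradients are random vectors whose mean is the full gradient.
batch_grads = np.array([gradient(rng.choice(N, size=64, replace=False))
                        for _ in range(1000)])

print("full gradient:          ", full_grad)
print("mean minibatch gradient:", batch_grads.mean(axis=0))
```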
A common pattern arises: we can treat a numerical method as an algorithm that infers a latent quantity from tractable observations, which is exactly what Bayesian inference does. The difference is that the data comes from the chip instead of the disk, so the tools we use to train machines are themselves machines.