Gaussian Processes (Aalto University CS-E4895 2023)
Introduction
Course content:
- Gaussian Process (GP) regression and classification.
- Kernel learning.
- Model selection.
- Approximate inference and speeding up GPs.
- Spatio-temporal modeling.
- Latent modeling.
- Connection to deep learning.
- GP theory.
Lectures:
- GP introduction.
- Bayesian regression.
- GP regression.
- Integration and model selection.
- Kernel learning.
- GP classification.
- Large-scale GPs.
- GP theory.
- Deep GPs.
- Latent modeling and unsupervised learning.
- State-space GPs.
- Sequential decision making.
6 exercise sheets, done at home, discussed in the first exercise class and handed in before Thursday of the following week.
Content
Introduction
The 3 elements of data-driven science:
- Data - what we observe.
- Models - a formalization of your beliefs of what generated the data.
- Algorithms - how to match data and models, make predictions etc.
(Definition) A random vector \(\mathbf{x} \in \mathbb{R}^d\) is said to have a multivariate Gaussian distribution if for all \(\mathbf{a} \in \mathbb{R}^d\), \(y=\mathbf{a}\cdot\mathbf{x} \sim \mathcal{N}(m,v)\) for some \(m \in \mathbb{R}\) and \(v \ge 0\).
A Gaussian Process (GP) is a collection of random variables over some indexing set such that any finite subset of them have a joint Gaussian distribution.
GPs can be considered as a distribution over functions \(f:\mathcal{X}\to\mathbb{R}\) \[ f(\mathbf{x}) \sim \mathcal{GP}(\mu(\mathbf{x}),\kappa(\mathbf{x},\mathbf{x}')) \] It is characterized by a mean function \(\mu\) and kernel function \(\kappa\), such that \(\mathbb{E}[f(\mathbf{x})]=\mu(\mathbf{x}), \text{Cov}(f(\mathbf{x}),f(\mathbf{x}'))=\kappa(\mathbf{x},\mathbf{x}')\).
The distribution over any subset of function values \(\mathbf{f} = (f(\mathbf{x}_1),\cdots,f(\mathbf{x}_N))\) at any collection of inputs \(\mathbf{x}_i\) is Gaussian, specifically \(p(\mathbf{f}) = \mathcal{N}(\mathbf{f}|\mathbf{m},\mathbf{K})\) where \(\mathbf{m},\mathbf{K}\) are the mean and kernel functions evaluated at the inputs.
When \(\mathcal{X}=\mathbb{R}^d\), the GP prior describes infinitely many random variables, even though in practice we only deal with a finite subset corresponding to the data set at hand and the test inputs where we want to evaluate the function - giving rise to the non-parametric nature of GPs.
The kernel function encodes prior beliefs about the data-generating process, e.g.
- Continuity (#ASK are there kernels that give discontinuous GPs).
- Differentiability i.e. smoothness.
- Periodicity.
- Invariances.
Kernel functions typically have hyperparameters.
In machine learning, a GP model typically consists of the following generative process: \[ f(\mathbf{x})\sim \mathcal{GP}(\mu(\mathbf{x}),\kappa(\mathbf{x},\mathbf{x}')) \] \[ \mathbf{y}|\mathbf{f} \sim \prod_i p(y_i|f(\mathbf{x}_i)) \] In simple GP regression, the likelihood is Gaussian, and the posterior mean and variances at unseen test inputs can be derived in closed form, while hyperparameters can be learnt by maximizing the log marginal likelihood, which is also available in closed form.
When the likelihood is non-Gaussian, approximations are needed.
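For the conjugate (Gaussian likelihood) case above, here is a minimal NumPy sketch of the closed-form posterior predictive equations, assuming a squared exponential kernel and a zero mean function; all names and the toy data are illustrative, not from the course material:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    # Squared exponential kernel k(x, x') = s^2 exp(-||x - x'||^2 / (2 l^2))
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def gp_posterior(X, y, X_star, noise_var=0.1, **kernel_args):
    # Closed-form posterior mean and covariance for GP regression
    K = rbf_kernel(X, X, **kernel_args) + noise_var * np.eye(len(X))
    K_s = rbf_kernel(X, X_star, **kernel_args)
    K_ss = rbf_kernel(X_star, X_star, **kernel_args)
    K_inv_y = np.linalg.solve(K, y)                 # (K + sigma^2 I)^{-1} y
    mean = K_s.T @ K_inv_y
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov

# Toy data: noisy observations of a sine function
X = np.linspace(0, 5, 20)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(20)
X_star = np.linspace(0, 6, 100)[:, None]
mu, cov = gp_posterior(X, y, X_star, noise_var=0.01, lengthscale=0.8)
```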
In a nutshell, GPs can:
- Fit non-linear functions to data.
- Make predictions to new inputs.
- Provide sensible uncertainties.
- Adjust model complexity to data (non-parametric).
Some challenges:
- Scaling to large data - there is a matrix inverse which is cubic in the number of data points. Some approaches to tackle this:
- Exploiting structure in the data e.g. data on equidistant grid.
- Exploiting structure in the GP prior e.g. separability over input dimensions.
- Solving linear systems approximately e.g. conjugate-gradient methods.
- Split problem into smaller chunks.
- Approximate the problem e.g. low rank approximation for kernels.
- Approximate the problem solution e.g. sparse variational methods.
- Non-conjugate likelihoods require approximate inference. Some approaches to tackle this:
- MCMC sampling.
- Laplace approximation.
- Not all problems are amenable to specification of meaningful priors, e.g. in image classification it is hard to decide what a cat prior would look like. Some approaches to tackle this:
- GPs can be used as a building block of a larger model.
Alternative representations of GPs:
- Moment representation.
- Spectral representation.
- State-space representation.
Some applications of GPs:
- Regression.
- Time series analysis/dynamical models.
- EEG brain imaging.
- Survival analysis.
- Robot dynamics.
- Spatial modeling.
- Classification.
- Image recognition.
- Brain decoding.
- Dimensionality reduction.
- Optimization of black box functions.
- Numerical integration (Bayesian quadrature).
- Solving differential equations (Probabilistic numerics).
- Experimental design/active learning.
- Reinforcement learning.
Now, for some more details on the multivariate Gaussian distribution.
The multivariate Gaussian is completely characterized by its mean and covariance matrix. In the 2D case, we can visualize how the covariance matrix elements affect the density by plotting the density contours.
Gaussians have interesting properties:
- They are closed under addition.
- They are closed under affine transformations.
- They are closed under marginalization.
- They are closed under conditioning.
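As a quick illustration of the last two properties, a small sketch of marginalization and conditioning for a 2D Gaussian (the numbers are made up for illustration):

```python
import numpy as np

# Joint Gaussian over (x1, x2) with mean m and covariance S, partitioned in blocks
m = np.array([1.0, -1.0])
S = np.array([[2.0, 0.8],
              [0.8, 1.0]])

# Marginalization: p(x1) = N(m1, S11) -- just read off the corresponding blocks
m1, S11 = m[0], S[0, 0]

# Conditioning: p(x1 | x2 = a) = N(m1 + S12 S22^{-1} (a - m2), S11 - S12 S22^{-1} S21)
a = 0.5
cond_mean = m[0] + S[0, 1] / S[1, 1] * (a - m[1])
cond_var = S[0, 0] - S[0, 1] ** 2 / S[1, 1]
print(cond_mean, cond_var)
```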
Bayesian Regression
Gaussian Process Regression
Recall that we have a weight space view and function space view for functions. GPs can be seen as a prior over functions, and are characterized by the mean and a kernel function. Combining this with a likelihood, we can model systems of interest where we infer the unknown latent function via Bayesian inference. We can use this posterior to make predictions at test points. For GP regression with Gaussian likelihoods, we have closed form distributions for predictions at test points.
Some properties of kernel functions (/covariance functions) include stationarity (depending only on the difference of the inputs) and isotropy (depending only on the distance between the inputs, i.e. rotation invariance).
Among the most commonly used kernels are the squared exponential kernel and kernels from the Matérn family (and perhaps the rational quadratic kernel, often used in spatial statistics). We can also induce specific structure, e.g. periodicity with the periodic kernel.
Kernels can also be built by combining different kernels with operations such as addition etc. This can be helpful to model data with specific structure, e.g. the Mauna Loa dataset.
How do we choose the kernel to use, which hyperparameters to choose for it, which mean function to use, etc? This is called the model selection problem.
To get the kernel hyperparameters, ideally we would put prior distributions over them and compute the posterior, but in this case the posterior is intractable.
Another approach is to use a Maximum A Posteriori (MAP) estimate, i.e. optimize the log joint distribution to get a point estimate (without a hyperprior this is a Type 2 Maximum Likelihood estimate). Getting GP kernel hyperparameters via a MAP estimate is more robust than fitting model parameters this way, since the latent function values themselves are still inferred the Bayesian way.
For GP kernels with a lengthscale parameter, this parameter controls the effective model complexity.
When the marginal likelihood is available in closed form (e.g. in GP regression with a conjugate likelihood), we can use gradient-based optimization to find optimal hyperparameters - this involves no cross-validation.
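A minimal sketch of this hyperparameter optimization, assuming an SE kernel; it uses SciPy's default finite-difference gradients rather than analytic/autodiff gradients, and the names and toy data are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X1, X2, lengthscale, variance):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def neg_log_marginal_likelihood(log_params, X, y):
    # Work with log-parameters so the optimizer stays in the positive domain
    lengthscale, variance, noise = np.exp(log_params)
    K = rbf_kernel(X, X, lengthscale, variance) + noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # -log N(y | 0, K) = 0.5 y^T K^{-1} y + sum(log diag(L)) + n/2 log(2 pi)
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(y) * np.log(2 * np.pi)

X = np.linspace(0, 5, 30)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(30)

# Gradients here come from finite differences; in practice autodiff is used
res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]), args=(X, y))
lengthscale, variance, noise = np.exp(res.x)
```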
To evaluate the model in a regression setting, one could:
- Use the mean squared error metric, but it does not take uncertainty into account.
- Use the pointwise mean log posterior predictive density (MLPPD); its negative is also called the negative log likelihood or negative log predictive density.
In general, computing the predictive density in GP regression has cubic cost due to the matrix inverse involved.
Also note that the kernel hyperparameters from optimizing the marginal likelihood are used in the predictive distribution as a point estimate, thus the predictive distribution does not marginalize out these parameters, meaning we are not doing full Bayesian inference in this case.
Model Selection
GPs are used in many applications:
- Learning the nonlinear relation between complex inputs and product quality in materials science.
- As priors for drug concentration regression, where there are few samples and we have some domain knowledge, e.g. a monotonic maturation effect.
When it comes to model selection:
- Point estimates work sometimes, but integration is better in general.
- Approximations such as the Laplace approximation and Variational Inference are useful, but can also fail catastrophically.
- MCMC is often very accurate but slower.
Point estimates do not work when we use MLE to estimate a Bernoulli probability and there are no success (or failure) outcomes observed among \(n\) samples. Point estimates can also fail when there are few samples, e.g. as shown for a homoskedastic GP.
For heteroskedastic GPs, the MAP point estimates would result in overfitting, while MCMC would model the non-stationary aspects without overfitting. The difference between the MAP estimate and the MCMC samples can be seen when they are plotted on the same axes.
Note that MCMC often requires many evaluations of the log joint density. One approximation is to use a linear combination of a finite set of basis functions - this however works well only in lower dimensions. Such an approximation can give a 160x speedup without a practical difference in model accuracy.
Machine learning folklore says that variational inference is fast. However, you can only choose 2 of:
- Fast
- Blackbox
- Accurate
In the heteroskedastic GP example, Automatic Differentiation Variational Inference (ADVI) with a normal mean field approximation is unable to capture the funnel structure of the posterior - this is a great example which shows that VI is not necessarily mode-seeking and not necessarily underestimating the variance.
In the GPML book, chapter 5 refers to finding the kernel hyperparameter values as model selection. In statistics, model selection usually refers to selecting from a discrete model space such as:
- Different observational models (e.g. homoskedastic vs. heteroskedastic).
- Covariate selection.
- Functional shapes (e.g. covariance function).
When MAP or Type 2 MAP (integrating out \(f\)) works, marginal likelihood may be useful for model selection.
In comparison, cross-validation is easy.
In general, full joint posteriors with GPs have difficult geometry, thus MCMC is slow and distributional approximations are likely bad. Normalizing flows may be an alternative, but needs more experiments.
There is also flexibility that we want to incorporate, specifically, there are different observation models which have nice conjugacy properties, multiple latent values, censored/missing data, derivative observations etc.
There is also a combinatorial explosion of different features that need to work together:
- Approximate computations related to the covariance matrix.
- Approximate integration.
- Different observation models.
- Different priors.
- Combining with other models like ODEs.
Note that the speed of autodiff systems is not automatic. (#DOUBT What does that mean?). For example, what counts as a node in the autodiff graph, and whether to use forward, reverse, or mixed mode (adjoints), are application-specific choices (e.g. tuned for neural networks).
In general, inference speed depends on:
- Computational cost of a single (marginal) log density. (#DOUBT Is this the cost of evaluating the log joint density?).
- Posterior geometries.
- Integration vs. maximizing marginal log likelihood.
It is unlikely that one software would be best for everything, there is a trade-off between:
- Flexibility.
- Speed.
- Additional implementation effort.
The future is hopeful - we are heading towards more modular and interoperable approaches between computational autodiff engines and probabilistic programming interfaces and more.
Kernel Learning
In GPs, everything that happens is related to the kernel.
Recall that the kernel hyperparameters can be found by optimizing the log marginal likelihood in classical GP regression as we have it in closed form. This is relatively robust to overfitting, and does not require cross-validation.
Kernels must induce symmetric positive semidefinite matrices. A stationary kernel depends on the difference between the inputs, and an isotropic kernel depends only on the distance between the inputs.
Kernels help map input features to another (often higher dimensional) feature space. E.g. if the input space is 2D, the kernel \(k(\mathbf{x},\mathbf{y}) = (\mathbf{x}^T\mathbf{y})^2\) is the inner product of features in a 3D space. A dataset may become linearly separable in the 3D setting while not being linearly separable in 2D space; see the numerical check after the list below. (#DOUBT Markus mentions that this mapping results in a "kernel that is nonlinear in the 2D space and linear in the 3D space" - can a kernel be nonlinear? Isn't it the data that becomes linearly separable, not the kernel?).
Any kernel can always be represented as an inner product in some feature space.
- Kernels can be seen as feature extractors.
- There is always a space where the kernel is linear, i.e. they can be seen as linear functions in some higher dimensional space. (#DOUBT Same as doubt above).
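A tiny numerical check of the quadratic-kernel example above, assuming 2D inputs; the explicit feature map \(\phi\) is the standard one for this kernel and the inputs are made up:

```python
import numpy as np

def quad_kernel(x, y):
    # k(x, y) = (x . y)^2 for 2D inputs
    return (x @ y) ** 2

def phi(x):
    # Explicit 3D feature map such that k(x, y) = <phi(x), phi(y)>
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([-0.5, 3.0])
print(quad_kernel(x, y), phi(x) @ phi(y))  # both give 30.25
```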
When choosing a kernel, we should consider properties of the function of interest, e.g. function smoothness, periodicity, non-stationarity etc.
Spectral kernels can learn arbitrary kernel forms.
Non-stationary kernels can learn evolving processes.
Structured kernels can learn over discrete inputs.
Deep kernels can learn arbitrary feature representations.
The default kernel is the Gaussian kernel - specifically the Automatic Relevance Determination (ARD) variant, since it can take into account the relevance of each input dimension.
We can construct kernels piece by piece, starting with a dictionary of primitive kernels and doing a bottom-up exhaustive search, optimizing the marginal log likelihood along the way.
The Neural Kernel Network creates a neural network from a dictionary of kernels, and the task is to learn the network weights.
Deep Kernel Learning learns a neural network used as a feature extractor, and the resulting feature vector is fed into the kernel. This however can give pathological results: since the neural network can map the inputs in almost any way, the model can be too flexible and overfit to the data. For example, after fitting a deep kernel GP we may have a fit that looks good visually. However, if we look at the similarity between \(m\) fixed input points and other points, we can see the similarity being very high even for points very far away from each of the \(m\) points. Meanwhile, a proper RBF GP would show a Gaussian similarity centered at each of the \(m\) points. Deep kernel GPs make sense for data with many covariates where we may not know how to combine them, and if we limit the neural network capacity we may avoid pathological results.
On structured spaces, i.e. if we want to perform regression on non-vectorial inputs, it suffices to define a kernel on that space. \(k\)-mers in bio-informatics and fingerprint features for molecules are examples. For images, we can define a GP on \(k\times k\) patches of the image: one GP maps the patches to \(\mathbb{R}\), and another GP uses the sum of the kernels over all pairs of patches (since kernels are closed under addition). (#DOUBT Reread previous sentence more carefully). This kernel compares patch responses across all positions, and represents a non-linear discrete convolution - 2 images are similar if all of their patches are similar on average.
Spectral kernels are great at extracting periodic patterns in data. Recall the Fourier transform. Suppose we apply the Fourier transform to a stationary kernel \(K(\tau)\). By Bochner's theorem, \(K(\tau)\) and its spectral density \(S(\omega)\) are Fourier duals. That is, if someone gives us a kernel \(K(\tau)\), we can solve for the frequencies it considers by taking the Fourier transform. On the other hand, every spectral density defines a stationary covariance function \(K(\tau)\) - i.e. given some spectral density \(S(\omega)\), its inverse Fourier transform defines a kernel. This allows for inter-domain kernel learning, where we parametrize the kernel in the spectral domain and learn it in that frequency space.
One nice result is that, assuming a symmetric frequency distribution (#NOTE I guess frequency distribution = spectral density) \(S(\omega)\), we can write the kernel as \(K(\tau) = \mathbb{E}_{S(\omega)}[\cos 2\pi\tau\omega]\). Setting \(S(\omega)\) to be a Dirac delta at \(\pm \theta\) gives \(K(\tau) = \cos 2\pi\tau\theta\).
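A quick Monte Carlo check of this identity, assuming the SE kernel \(\exp(-\tau^2/(2\ell^2))\), whose spectral density is a Gaussian with standard deviation \(1/(2\pi\ell)\); the sample size and lengthscale are arbitrary:

```python
import numpy as np

lengthscale = 0.7
taus = np.linspace(0, 3, 50)

# Spectral density of exp(-tau^2 / (2 l^2)) is Gaussian with std 1 / (2 pi l)
omegas = np.random.randn(100_000) / (2 * np.pi * lengthscale)

# K(tau) = E_{S(omega)}[cos(2 pi tau omega)], approximated by Monte Carlo
K_mc = np.array([np.mean(np.cos(2 * np.pi * t * omegas)) for t in taus])
K_exact = np.exp(-taus ** 2 / (2 * lengthscale ** 2))
print(np.max(np.abs(K_mc - K_exact)))  # small Monte Carlo error
```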
Using custom spectral densities, we can construct kernels, e.g. the Sparse Spectrum (SS) kernel, which uses a Dirac distribution over discrete points (prone to overfitting), or the Spectral Mixture (SM) kernel, which uses a mixture of Gaussians for the spectral density (this representation can approximate any stationary kernel). For the SM kernel, there are 3 hyperparameters for each mixture component in the spectral density, which results in a marginal likelihood that is highly non-convex. (#DOUBT Try to understand the spectral density corresponding to the GP fitted on the CO2 concentration dataset).
Recall that in standard GP regression, the noise variance and kernel hyperparameters are assumed to be the same over all inputs. If the noise depends on the inputs, we have a heteroskedastic GP, and if the function dynamics itself depends on the inputs, we have a non-stationary GP. Recall that stationary kernels are translation invariant, which is not a good model in some cases - e.g. it would assume that a variable like sleeping time covaries between 1- and 2-year-olds in the same way as between 23- and 24-year-olds. The simplest non-stationary kernel is the Euclidean inner product.
Non-stationary kernels like the Gibbs kernel have hyperparameters that depend on the input space - e.g. the lengthscale of the GP now varies over the inputs, which allows for a rapidly varying function in some areas and a smoothly varying function in others. Overfitting in such a model is prevented by the kernels chosen for the GPs defined on the hyperparameters - they should not be too flexible. We can also add hyperpriors to prevent this.
Non-Gaussian Likelihoods - Classification
Recall some problems with GPs:
- Closed form formulas are only available for Gaussian likelihoods.
- Computational complexity.
Many problems involve observational samples which are constrained, e.g. in classification, the observed values are class labels which come from a finite set, and for count data, it is a non-negative integer.
For classification, we can use a Bernoulli likelihood - GPs can model non-linear boundaries and adapt the decision boundary to the data (i.e. it is nonparametric). GPs also give model uncertainties.
For regression data with outliers, we could use a Student-t likelihood which models heavy-tailed noise. For count data, we can use a Poisson likelihood and so on.
Recall that the likelihood \(p(y|f)\) is a function of \(f\) and \(y\). Given a likelihood and a latent variable, we need link functions (which should be invertible) to bring them together - for example the parameter of a Bernoulli distribution lies in \([0,1]\). Log-concavity (#DOUBT of the likelihood?) is also important when doing approximate inference.
Non-Gaussian likelihoods result in intractable posterior and predictive distributions, so what follows is approximate inference.
Consider the simplest approach to approximate a distribution - a point estimate at the mode. Another option is to use a mixture of delta distributions like in MCMC or Neural network ensembles. Another option is to use a Gaussian, or a mixture of Gaussians.
We will consider some approaches which use the Gaussian distribution for posterior approximation. The question is now how to find the mean and covariance of the approximating Gaussian model. In short, the Laplace approximation locally matches the mean and variance at a point (specifically the mode), meanwhile Variational Inference (VI) and Expectation Propagation (EP) perform a global approximation by minimizing the KLd divergence between the true posterior and the approximating Gaussian.
Laplace approximation is based on the Taylor expansion of the log posterior at its mode - which can be found efficiently (and with convergence) if the likelihood is log-concave. The mode and the surrounding curvature is then matched with the mean and covariance of a Gaussian to get the approximation.
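A sketch of this mode-finding for GP binary classification with a logistic likelihood, loosely following the standard Newton scheme (GPML, Algorithm 3.1) in simplified form; the function name is illustrative and convergence checks are omitted:

```python
import numpy as np

def laplace_gp_classification(K, y, n_iter=20):
    """Laplace approximation q(f) = N(f_hat, Sigma) for GP binary classification
    with a logistic likelihood; labels y in {-1, +1}. Simplified Newton iteration."""
    n = len(y)
    f = np.zeros(n)
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-f))              # sigmoid(f)
        grad = (y + 1) / 2 - pi                    # gradient of log p(y|f)
        W = pi * (1 - pi)                          # -Hessian of log p(y|f) (diagonal)
        sW = np.sqrt(W)
        B = np.eye(n) + sW[:, None] * K * sW[None, :]
        L = np.linalg.cholesky(B)
        b = W * f + grad
        a = b - sW * np.linalg.solve(L.T, np.linalg.solve(L, sW * (K @ b)))
        f = K @ a                                  # Newton update towards the mode
    # At (approximate) convergence: covariance (K^{-1} + W)^{-1} = K - K W^{1/2} B^{-1} W^{1/2} K
    Sigma = K - (K * sW[None, :]) @ np.linalg.solve(B, sW[:, None] * K)
    return f, Sigma
```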
VI represents a general framework for approximate Bayesian inference. It is used for approximating models with non-Gaussian likelihoods, GPs for big data, Variational Auto-Encoders (VAEs) etc.
The general recipe is to choose a family of distributions \(\mathcal{Q}\) and find \(q\in \mathcal{Q}\) to approximate the posterior \(p\), using some form of a distance between \(p\) and \(q\). Using a Gaussian approximation for \(q\), we can use Jensen's inequality to get a lower bound (ELBO) on the log marginal likelihood, which we can optimize with respect to the parameters of the variational approximation (in this case, the mean and the Cholesky factor of the covariance matrix).
(See The Science behind the Art, Cools, paper on quadrature etc).
VI is principled (it minimizes a divergence from the true posterior), mode-seeking, and maximizes a true lower bound so the optimization eventually converges; however, it typically underestimates the variance.
Large-scale Gaussian Processes
Recall the main bottleneck is related to the Gram matrix - specifically storing/computing it (quadratic) and/or inverting it (cubic).
(Small note on LAPACK Linear Algebra PACKage, the classic FORTRAN numerical computing library).
Some approximations:
- When the data lies on a grid, we can exploit this structure by using a separable kernel. In this case, the kernel matrix can be written as a Kronecker product, which allows us to compute inverses of the smaller matrices that make up the Kronecker product (see the sketch after this list).
- If the inputs lie on a regular grid and the kernel is stationary, (i.e. has Toeplitz structure), we can use the Fast Fourier Transform.
- If the inputs are not exactly on a grid but close to a grid, one can do cubic interpolation on the kernel matrix between the grid points - this method is called KISS-GP. This works well in small dimensions (up to 4).
- Other formulations with special structure allow for approximations, e.g. additive/product nature of kernels, and state space formulation results in linear time algorithms.
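Relating to the Kronecker-structured case in the list above, a minimal check of the identity it relies on (the matrices are arbitrary positive-definite examples):

```python
import numpy as np

# For a separable kernel on a grid, K = K1 kron K2, and
# inv(K1 kron K2) = inv(K1) kron inv(K2): invert two small matrices, not one big one.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
B = np.array([[1.5, 0.2, 0.0], [0.2, 1.0, 0.3], [0.0, 0.3, 2.0]])

K = np.kron(A, B)                                  # 6x6
K_inv_direct = np.linalg.inv(K)
K_inv_kron = np.kron(np.linalg.inv(A), np.linalg.inv(B))
print(np.allclose(K_inv_direct, K_inv_kron))       # True
```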
("You know what they say about ML people. Machine learning people are the cowboys of statistics, they drive in with a horse, shoot around with their models and ride out")
From a linear algebra perspective, solving against the matrix inverse can be thought of as quadratic optimization. From this perspective we can use conjugate-gradient methods, which work iteratively.
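A small sketch of solving \(\mathbf{K}\boldsymbol{\alpha}=\mathbf{y}\) with conjugate gradients via SciPy; the kernel, data, and sizes are illustrative. The point is that CG only needs matrix-vector products, so the kernel matrix never has to be factorized (and can even be represented implicitly):

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

n = 500
X = np.random.rand(n, 1)
K = np.exp(-0.5 * (X - X.T) ** 2 / 0.2 ** 2) + 0.1 * np.eye(n)   # SE kernel + noise
y = np.sin(6 * X).ravel()

# Only a matvec is needed; K could equally be a structured/implicit operator
K_op = LinearOperator((n, n), matvec=lambda v: K @ v)
alpha, info = cg(K_op, y)
print(info, np.linalg.norm(K @ alpha - y))          # info == 0 means converged
```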
Local approximation methods involve splitting the problem into smaller chunks. One approach combines so-called "local experts" - the dataset of size \(N\) is broken into \(M\) subsets, and \(M\) independent GPs with shared kernel hyperparameters are fit. The predictions involve aggregation - sums (mixture of experts) or products (product of experts).
Global approximation methods often involve inducing points. Conceptually, in the real world, lots of data does not imply the data generating process is complex - there may be redundancies. With redundant data, the covariance matrix becomes low rank.
Sparse GPs refer to approximations that somehow reduce the computations related to the kernel. This includes:
- Nyström approximation of the kernel (applied to GPs by selecting a random subset of the training data to get \(\mathbf{ K }_{zz}\), making the approximation \(\mathbf{ K }_{ff}\approx\mathbf{ K }_{fz}\mathbf{ K }_{zz}^{-1}\mathbf{ K }_{zf}\), and using it alongside the Woodbury matrix identity). Note that the Nyström approximation is not specific to GPs.
- Fully Independent Training Conditional (FITC).
- Variational Sparse Gaussian Processes.
Inducing point methods aim to "represent" information from the full dataset by a smaller (sometimes virtual) dataset. Compared to the Nystrom approximation, inducing point approaches are more flexible since we can select the points from the dataset ourselves, or treat them as parameters to optimize as well.
The sparse GP approximation starts by augmenting the GP model with inducing variables to have \(p(\mathbf{y},\mathbf{f},\mathbf{u})\) where \(\mathbf{u }\) comes from a GP evaluated at inducing points \(\mathbf{ z }_i\). If we can choose the inducing points ourselves, we will have a more flexible approach compared to selecting a subset of the existing data as in the Nystrom approximation. Learning the inducing variables can be formulated as an elegant variational approximation for GPs.
We can decompose the augmented model as: \[ p(\mathbf{y,f,u}) = p(\mathbf{y}|\mathbf{f})p(\mathbf{f,u}) \] Where the observations depend only on the latent function values at \(\mathbf{x}_i\), not \(\mathbf{z}_i\). The original model can be recovered by marginalizing over \(\mathbf{u}\).
We can further decompose the augmented model as: \[ p(\mathbf{f,y,u})=p(\mathbf{y}|\mathbf{f})p(\mathbf{f}|\mathbf{u})p(\mathbf{u}) \]
If we make the variational approximation \(q(\mathbf{f,u})=p(\mathbf{f}|\mathbf{u},\mathbf{x}_i,\mathbf{z}_i)q(\mathbf{u})\) where \(q(\mathbf{u})=\mathcal{N}(\mathbf{u}|\mu, \Sigma)\) and plug it into the ELBO, we get a relatively nice optimization objective that depends on the inducing inputs in addition to the parameters of \(q(\mathbf{u})\). Since all of these are variational parameters, we do not get overfitting (#DOUBT Why do we not get overfitting exactly?).
Recall mini-batch learning, where we approximate the optimization objective (which are expectations) by subsampling the data which results in an unbiased estimator. Note however the collapsed ELBO for sparse GPs is not an expectation - so mini-batching cannot be applied straightforwardly. If we look at the uncollapsed bound however, we see that it is an expectation (w.r.t. data) thus mini-batching could be applied.
Other important points:
- The kernel matrix may have a high condition number (especially for the squared exponential kernel), meaning the direct inverse is numerically unstable. Using a Cholesky-decomposition-based inversion is a better option.
- Add a small jitter term (~\(10^{-6}\mathbf{I}\)) to the covariance matrix diagonal to avoid numerical instabilities (see the sketch below).
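A minimal sketch of both points combined - jitter plus a Cholesky-based solve - with illustrative data and an SE kernel:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

n = 200
X = np.linspace(0, 1, n)[:, None]
K = np.exp(-0.5 * (X - X.T) ** 2 / 0.3 ** 2)       # SE kernel: often badly conditioned
y = np.sin(4 * X).ravel()

K_jittered = K + 1e-6 * np.eye(n)                  # small jitter on the diagonal
c, low = cho_factor(K_jittered, lower=True)        # Cholesky factorization
alpha = cho_solve((c, low), y)                     # solves K alpha = y stably
log_det = 2 * np.sum(np.log(np.diag(c)))           # log|K| from the Cholesky factor
```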
Whitening: The prior \(p(\mathbf{u})\) induces strong correlations between the elements of \(q(\mathbf{u})=\mathcal{N}(\mathbf{u}|\mathbf{m},\mathbf{S})\), and this makes optimization over \(\mathbf{m,S}\) challenging. One remedy is whitening, where we:
- Define \(\mathbf{u}=\mathbf{Lv} + \text{Prior mean}\), where \(\mathbf{L} = \text{Cholesky}(\mathbf{K})\)
- Let \(p(\mathbf{v}) = \mathcal{N}(\mathbf{v}|\mathbf{0},\mathbf{I})\) which is completely uncorrelated (#DOUBT To the prior \(p(\mathbf{u})\)?).
- Optimize over \(q(\mathbf{v})=\mathcal{N}(\mathbf{m}',\mathbf{S}')\) instead.
- Convert to the approximate posterior \(q(\mathbf{u})=\mathcal{N}(\mathbf{Lm}', \mathbf{L}\mathbf{S}'\mathbf{L}^T)\) (plus the prior mean).
(Similar idea to preconditioning).
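A tiny sketch of the whitening map described in the list above; the function name and the toy inducing-point setup are illustrative:

```python
import numpy as np

def unwhiten(K_zz, prior_mean, m_white, S_white):
    """Whitened parameterization: optimize q(v) = N(m', S') with p(v) = N(0, I),
    then map back to q(u) = N(L m' + prior_mean, L S' L^T) with L = chol(K_zz)."""
    L = np.linalg.cholesky(K_zz + 1e-6 * np.eye(len(K_zz)))   # jitter for stability
    m_u = L @ m_white + prior_mean
    S_u = L @ S_white @ L.T
    return m_u, S_u

# Toy inducing-point kernel matrix and whitened variational parameters (illustrative)
Z = np.linspace(0, 1, 5)[:, None]
K_zz = np.exp(-0.5 * (Z - Z.T) ** 2 / 0.2 ** 2)
m_u, S_u = unwhiten(K_zz, np.zeros(5), np.random.randn(5), 0.1 * np.eye(5))
```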
RKHS
A Hilbert space is an inner product space that is complete (all Cauchy sequences converge).
Recall that a function \(k(\cdot,\cdot) : \mathcal{X}\times\mathcal{X}\to \mathbb{R}\) is a kernel function if and only if there exists a Hilbert space \(\mathcal{H}\) and a map \(\phi:\mathcal{X}\to\mathcal{H}\) such that \(k(x,y) = \langle \phi(x), \phi(y) \rangle\).
To get a feeling for why a given kernel has a corresponding Hilbert space, first choose \(\phi(x) = k(x,\cdot) : \mathcal{X}\to\mathbb{R}^\mathcal{X}\). Then, let \(\mathcal{G}\) be the vector space spanned by \(\{k(x,\cdot)|x\in \mathcal{ X }\}\) (i.e. all finite linear combinations of \(k(x,\cdot)\)), so \(\mathcal{G} \subseteq \mathbb{R}^\mathcal{X}\). We can define an inner product on \(\mathcal{G}\): for \(f = \sum_{i}^{}\alpha_i k(x_i,\cdot)\) and \(g=\sum_{j}^{}\beta_j k(y_j,\cdot)\), both in \(\mathcal{G}\), \[ \langle f,g\rangle = \langle \sum_{i}^{}\alpha_i k(x_i,\cdot),\sum_{j}^{}\beta_jk(y_j,\cdot)\rangle\\= \sum_{ij}^{}\alpha_i\beta_j \langle k(x_i,\cdot),k(y_j,\cdot)\rangle=\sum_{ij}^{}\alpha_i\beta_j k(x_i,y_j) \]
Reproducing Kernel Hilbert Spaces (RKHSs) have an extra property called the reproducing property: \[ \langle k(x,\cdot),f(\cdot)\rangle =\langle k(x,\cdot), \sum_{i}^{}\alpha_i k(x_i,\cdot) \rangle = \sum_{i}^{}\alpha_i\langle k(x,\cdot),k(x_i,\cdot)\rangle = \sum_{i}^{}\alpha_i k(x,x_i)=f(x) \] Given a kernel, there is a unique RKHS, and given an RKHS, there is a unique kernel (Moore-Aronszajn theorem).
Representer theorem: Suppose we have a kernel \(k\), and let \(\mathcal{H}\) be the corresponding RKHS. We wish to learn a function \(f\in\mathcal{H}\) (linear in the feature space) from a finite dataset. The representer theorem says that the risk minimization problem of the form: \[ \min_{f\in\mathcal{H}} R_n(\mathbf{y},\mathbf{f}) + \lambda \Omega(||f||_\mathcal{ H }) \]
where \(\mathbf{f}=f(\mathbf{x}_1),\cdots,f(\mathbf{x}_n)\) and \(\mathbf{y}=y_1,\cdots,y_n\) has an optimal solution of the form \[ f(\mathbf{x}) = \sum_{i=1}^{n}\alpha_i k(\mathbf{x}_i,\mathbf{x}) \]
Choosing \(R_n\) to be the squared loss and the regularizer \(\Omega\) to be the squared norm, the resulting minimization problem, called kernel ridge regression, has an optimal solution that coincides with the predictive mean of the posterior GP given the same data and kernel.
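A quick numerical illustration of this correspondence, assuming an SE kernel and identifying the ridge parameter with the GP noise variance; the data and names are illustrative:

```python
import numpy as np

def rbf(X1, X2, l=0.5):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-0.5 * sq / l**2)

X = np.random.rand(40, 1)
y = np.sin(6 * X).ravel() + 0.05 * np.random.randn(40)
X_star = np.linspace(0, 1, 100)[:, None]
lam = 0.05                                           # ridge regularizer = GP noise variance

# Kernel ridge regression: alpha = (K + lam I)^{-1} y, f(x) = sum_i alpha_i k(x_i, x)
alpha = np.linalg.solve(rbf(X, X) + lam * np.eye(len(X)), y)
f_krr = rbf(X_star, X) @ alpha

# GP posterior mean with noise variance lam -- the same expression, hence identical
f_gp = rbf(X_star, X) @ np.linalg.solve(rbf(X, X) + lam * np.eye(len(X)), y)
print(np.allclose(f_krr, f_gp))                      # True
```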
Infinite Width Neural Networks
In classification problems, NNs are very confident even in regions where there is no data - being Bayesian however (for e.g. using a Laplace approximation at the last layer) helps overcome this problem.
Putting priors on a single-hidden-layer Multi-Layer Perceptron and increasing the hidden width to infinity, with the weights sampled from zero-centered Gaussians, results in a GP. One thing that is lost here is the representation learning capacity of neural networks, since in the limit the kernel does not have trainable weight parameters.
The Neural Tangent Kernel (NTK) links gradient descent training in NNs to kernel methods. It aims to describe the training dynamics of NNs. Consider NNs of the form \(f(\mathbf{x};\theta) = \sum_{i=1}^{K}a_i \sigma(\sum_{d=1}^{D}b_{i,d}x_d)\). In a regression case, consider the gradient descent update term. Empirically, if \(K\) is large, it has been observed that the NN parameters learnt with GD do not change much over training, a phenomenon called "lazy training". This observation motivates taking a first-order Taylor approximation \(f(\mathbf{x};\theta) \approx f(\mathbf{x};\theta_0) + \nabla_\theta f(\mathbf{x};\theta_0)^T (\theta-\theta_0)\) - where the gradient term does not depend on \(\theta\) and is nonlinear in \(\mathbf{x}\). The NTK is then defined to be the inner product induced by taking the feature vector to be \(\phi(\mathbf{x}) = \nabla_\theta f(\mathbf{x};\theta_0)\).
The NTK allows us to derive the kernel induced from an infinite width NN in closed form, and allows us to analyze the training dynamics, even in the continuous case (gradient flow).
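A sketch of the empirical (finite-width) NTK for the one-hidden-layer network above, with \(\sigma=\tanh\) and hand-written gradients; the initialization scale and sizes are illustrative assumptions:

```python
import numpy as np

def ntk(x1, x2, a, b):
    """Empirical NTK at initialization for f(x; theta) = sum_i a_i tanh(b_i . x):
    k(x1, x2) = <grad_theta f(x1; theta0), grad_theta f(x2; theta0)>."""
    def grads(x):
        pre = b @ x                                    # pre-activations, shape (K,)
        h = np.tanh(pre)
        d_a = h                                        # df/da_i = tanh(b_i . x)
        d_b = (a * (1 - h**2))[:, None] * x[None, :]   # df/db_{i,d} = a_i tanh'(b_i.x) x_d
        return np.concatenate([d_a, d_b.ravel()])
    return grads(x1) @ grads(x2)

K_hidden, D = 1000, 3
a = np.random.randn(K_hidden) / np.sqrt(K_hidden)      # illustrative scaling
b = np.random.randn(K_hidden, D)
x1, x2 = np.random.randn(D), np.random.randn(D)
print(ntk(x1, x2, a, b))
```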
Deep Gaussian Processes
Standard GPs make strong assumptions via the kernel function. This results in some undesired behaviour:
- Hard to capture discontinuities and jumps, and variance becomes higher even in regions with data.
- Outliers are given the same weight, which can result in the likelihood variance being too high for non-outlier samples.
- Hard to capture non-stationarity which is seen in many real-world datasets.
Modern machine learning is dominated by deep learning, and function compositions are at the heart of deep learning. Similarly, deep GPs also compose functions (from GPs), whilst incorporating uncertainty and prior knowledge.
The composition in deep GPs allows for more complex, non-stationary functions, where each "layer" is a warping of the inputs before being fed to the next layer.
Note that multiple function compositions lead to "clumping" of function values. At each predecessor layer, there will be functions where inputs (which can be distant) will have function outputs that are close together. Depending on the kernel, they will highly correlate together - since at the next layer we evaluate the kernel over these function outputs. Once we clump points together, we cannot unclump them, but we can separate clumps from each other. (#NOTE Would Shreyas's paper apply here?).
One question is how to avoid overfitting in deep GPs. See slide 11 for deep GP model. (#TODO Write down the deep GP model).
A Gaussian propagated through a nonlinearity is not Gaussian - similarly a GP propagated through a nonlinearity is not a GP.
(#DOUBT Understand the plot in Slide 13).
Inference in Deep GPs can be done via VI (e.g. importance-weighted VI with latent variables), EP (Deep GP Expectation Propagation), or Hamiltonian MCMC. We will focus on sparse, stochastic VI.
By introducing inducing inputs and outputs at each layer, we can derive an ELBO similar to the single-layer GP ELBO. One main difference for the deep GP bound is that the marginal \(q(f_{L,n})\), over which we compute the expectation of the likelihood, is not Gaussian. Fortunately, sampling from this marginal is efficient - this is called doubly stochastic VI.
Composing kernels keeps the model a GP (the resulting samples are still Gaussian), whereas composing GP functions does not preserve Gaussianity.
Deep GPs have some issues:
- More sensitive to initializations than standard GPs.
- Training can be slow.
- Training is more susceptible to local minima.
- Current VI approaches tend to "turn off" layers/reduce their variance to near 0.
Deep GPs do, however:
- Achieve great performance on many medium-to-large machine learning tasks, with good uncertainty estimates.
- Have been combined with convolutional kernels to produce SOTA results on image classification.
- Performance is okay compared to e.g. deep CNNs, but improves uncertainty quantification in predictions.
Latent variable models and Unsupervised learning
Latent Variable Models (LVMs) are a useful tool under the manifold hypothesis - that high dimensional data often lie on a lower dimensional manifold, whose dimension is called the intrinsic dimensionality. In unsupervised learning, we do not have output covariates, and instead want to find a lower dimensional representation of the data.
Dimensionality reduction is the task of taking some data \(\mathbf{ x }\) and learning a projection onto a lower dimensional embedding \(\mathbf{ z }\). Manifold learning is the task of learning the embedding and a map \(g: \mathbf{x}\to \mathbf{ z }\).
Consider Principal Component Analysis (PCA): \[ \mathbf{ x }_n = \mathbf{ W }\mathbf{ z }_n +\epsilon_n \] where \(\epsilon_n \sim \mathcal{ N }(0,\sigma^2\mathbf{ I }_D)\) and \(\mathbf{ W }\in\mathbb{ R }^{D\times Q}\) maps the latent coordinates \(\mathbf{ z }_n\in\mathbb{ R }^Q\) to data space; the principal components span the directions of greatest variance.
Performing PCA as above on the swiss roll data is not optimal, and the model above does not model any uncertainty.
Consider the Probabilistic PCA model, where we have the following:
- Likelihood \(p(\mathbf{ X }|\mathbf{ W },\mathbf{ Z},\sigma) = \prod_{n=1}^{N}\mathcal{ N }(\mathbf{ x }_n|\mathbf{ Wz }_n,\sigma^2\mathbf{ I })\).
- Prior over embeddings \(p(\mathbf{ Z })=\prod_{n=1}^{N}\mathcal{ N }(\mathbf{ z }_n|0,\mathbf{ I })\).
When integrating out the latent variables \(\mathbf{ Z}\), we get the marginal likelihood in closed form: \[ p(\mathbf{ X }|\mathbf{ W },\sigma) = \prod_{n=1}^{N}\mathcal{ N }(\mathbf{ x }_n|0,\mathbf{ WW }^T+\sigma^2\mathbf{ I }) \]
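A quick sampling-based sanity check of this marginal covariance (the dimensions and number of samples are arbitrary):

```python
import numpy as np

D, Q, N = 5, 2, 200_000
W = np.random.randn(D, Q)
sigma2 = 0.1

# Sample from the generative model x_n = W z_n + eps_n, z_n ~ N(0, I), eps ~ N(0, sigma2 I)
Z = np.random.randn(N, Q)
X = Z @ W.T + np.sqrt(sigma2) * np.random.randn(N, D)

# Empirical covariance of x matches the analytic marginal covariance W W^T + sigma2 I
emp_cov = np.cov(X.T)
analytic_cov = W @ W.T + sigma2 * np.eye(D)
print(np.max(np.abs(emp_cov - analytic_cov)))   # small (Monte Carlo error)
```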
Now, consider Dual PPCA, where in the likelihood we take the product over the data dimensions rather than the samples, and place a prior over \(\mathbf{ W }\).
- Likelihood \(p(\mathbf{ X }|\mathbf{ W },\mathbf{ Z },\sigma)=\prod_{d=1}^{D}\mathcal{ N }(\mathbf{ x }_d|\mathbf{ Z }\mathbf{ w }_d^T,\sigma^2\mathbf{ I })\).
- Prior \(p(\mathbf{ W })=\prod_{d=1}^{D}\mathcal{ N }(\mathbf{ w }_d|0,\mathbf{ I })\).
Integrating out \(\mathbf{ W }\) now gives the following marginal likelihood: \[ p(\mathbf{ X }|\mathbf{ Z },\sigma)=\prod_{d=1}^{D}\mathcal{ N }(\mathbf{ x }_d|0,\mathbf{ ZZ }^T+\sigma^2\mathbf{ I }) \]
Notice here the covariance is a linear kernel - this suggests we can perform the kernel trick here. Doing so results in the GPLVM, where for each dimension \(d\in [1,\cdots,D]\) we have: \[ \mathbf{ x }_d(\mathbf{ z }) \sim \mathcal{ GP }(m(\mathbf{ z }),k(\mathbf{ z },\mathbf{ z' })) \]
Training is done via maximum marginal likelihood using gradient-based optimization over \(\mathbf{ Z },\theta,\sigma\), where \(\theta\) are the kernel hyperparameters - or via a maximum a posteriori estimate of the marginal posterior, where an additional prior is put over \(\mathbf{ Z }\) and optimized alongside the marginal likelihood.
Being Bayesian involves putting a prior over \(p(\mathbf{ Z })\) and integrating it out. That is, we now have:
- (Marginal) Likelihood \(p(\mathbf{ X}|\mathbf{ Z })=\prod_{d=1}^{D}\mathcal{ N }(\mathbf{ x }_d|0,\mathbf{ K }_{\mathbf{ ZZ }}+\sigma^2\mathbf{ I })\).
- Prior \(p(\mathbf{ Z })=\prod_{n=1}^{N}\mathcal{N}(\mathbf{ z }_n|0,\mathbf{ I})\).
Since \(\mathbf{ Z }\) appears non-linearly inside the kernel matrix (which is then inverted), integrating it out is intractable (#TODO Think a bit more about this).
A variational approach can be used, by first introducing a variational approximation \(q(\mathbf{ Z })=\prod_{n=1}^{N}\mathcal{N}(\mathbf{ z }_n|\mathbf{ m }_n,\mathbf{ S }_n)\) and applying an inducing point approach.
GPLVMs have many flavours:
- Shared GPLVMs, which map functions from a shared latent space to separate observation spaces.
- Back-constrained GPLVMs, which preserve locality (nearby data points map to nearby latent points).
- Dynamic GPLVMs, which add a dynamical prior over the latent space.
- Using Deep GPs instead of just GPs.
Multi-Output Gaussian Processes
Often we may have multiple processes which are correlated.
Intrinsic Model of Coregionalization (IMC):
- Sample \(S\) iid functions \(\mathbf{ u }^{(s)}\) from some shared underlying process, which is a GP in this case.
- The vector valued process is the weighted sum \(\mathbf{ f }(\mathbf{ x })=\sum_{s=1}^{S}\mathbf{ a }^{(s)}\mathbf{ u }^{(s)}(\mathbf{ x })\) where \(\mathbf{ f },\mathbf{ a }\) are \(P\) dimensional. (#DOUBT Is \(\mathbf{ f }\) \(P\) dimensional only because \(\mathbf{ a}\) is?).
The corresponding kernel for the GP over \(\mathbf{ f}\) can be expressed using Kronecker products.
If we use 1 sample from a different kernel \(k_q\) for each \(\mathbf{ u }\), we get the Semiparametric Latent Factor Model (SLFM). If we use \(S_q\) samples from \(Q\) different processes, we get the Linear Model of Coregionalization.
In both of the extended models above, \(\mathbf{ f }\) is still a GP whose kernel can be defined via Kronecker products.
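A minimal sketch of the IMC construction described above; the function name and the toy setup (3 outputs, 2 shared latent processes, SE base kernel) are illustrative:

```python
import numpy as np

def icm_kernel(X, A, base_kernel):
    """Intrinsic model of coregionalization: for f(x) = sum_s a^(s) u^(s)(x) with iid
    u^(s) ~ GP(0, k), Cov(f_p(x), f_q(x')) = B_pq k(x, x'), i.e. K = B kron K_x
    with coregionalization matrix B = sum_s a^(s) a^(s)^T."""
    B = A @ A.T                     # A has one column per shared latent process
    K_x = base_kernel(X, X)
    return np.kron(B, K_x)          # (P*N) x (P*N) covariance over all outputs and inputs

rbf = lambda X1, X2: np.exp(-0.5 * (X1 - X2.T) ** 2 / 0.3 ** 2)
X = np.linspace(0, 1, 50)[:, None]
A = np.random.randn(3, 2)           # P = 3 outputs, S = 2 shared latent functions
K = icm_kernel(X, A, rbf)
```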
Other interesting models:
- Gaussian Process Regression Networks (GPRNs).
State-Space Gaussian Processes
These models are motivated by temporal models, which are one-dimensional models indexed by time; the data has a natural ordering. Adding spatial information gives spatio-temporal models, where something develops over time. Often the time horizon is unbounded, e.g. in sensor data streams.
Stationary GPs can be viewed from 3 perspectives:
- Kernel (moment) based representation.
- Spectral (Fourier) based representation.
- State-space (path) based representation.
The state-space (path) representation aims to define GPs as a solution to a Linear Time Invariant (LTI) Stochastic Differential Equation (SDE): \[ d\mathbf{ f }=\mathbf{ F }\mathbf{ f}dt+\mathbf{ L }d\mathbf{ \beta } \]
Equivalently but more informally: \[ \frac{d\mathbf{ f }(t)}{dt}=\mathbf{ F }\mathbf{ f }(t) + \mathbf{ L }\mathbf{ w }(t) \]
Here:
- \(\mathbf{ w }(t)\) is white noise with spectral density \(\mathbf{ Q_c }\in\mathbb{ R }^{s\times s}\).
- \(\mathbf{ F }\in \mathbb{ R }^{m\times m}\) is the drift matrix.
- \(\mathbf{ L }\in\mathbb{ R }^{m\times s}\) is the diffusion matrix.
The scalar-valued GP is recovered via \(f(t)=\mathbf{ H }\mathbf{ f }(t)\), where \(\mathbf{ H }\) is the measurement matrix.
The initial state is given by the stationary state \(\mathbf{ f }(0) \sim \mathcal{ N }(0,\mathbf{ P }_\infty)\), where \[ \mathbf{ F }\mathbf{ P }_\infty + \mathbf{ P }_\infty\mathbf{ F }^T +\mathbf{ LQ }_c\mathbf{ L }^T=0 \]
The covariance function at the stationary state can be recovered by: (#TODO Get Equation in slide 10/41)
For discrete time points \(\{t_i\}_{i=1}^n\) the resulting discrete state-space model involves a Gaussian whose covariance can be computed in closed form in the stationary case.
For the state-space formalism, the covariance function must be Markovian.
Inference involves filtering (the Kalman filter: prediction + update) followed by smoothing (the Rauch-Tung-Striebel smoother). The log marginal likelihood is also obtained as a by-product of the filtering pass, via the one-step prediction errors.
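A sketch of the filtering pass for the simplest case, the exponential (Matérn-1/2 / Ornstein-Uhlenbeck) kernel, which has a one-dimensional state; the smoothing pass is omitted, and the function name, hyperparameters, and toy data are illustrative:

```python
import numpy as np

def kalman_filter_ou(t, y, lengthscale=1.0, variance=1.0, noise_var=0.1):
    """Sequential GP regression for k(tau) = variance * exp(-|tau| / lengthscale),
    written as the 1D state-space model df = -(1/lengthscale) f dt + dbeta.
    Filtering is O(n); the log marginal likelihood accumulates from the
    one-step prediction errors."""
    m, P = 0.0, variance                 # stationary initial state f(0) ~ N(0, P_inf)
    log_lik = 0.0
    means, variances = [], []
    for i in range(len(t)):
        if i > 0:                        # predict: exact discretization of the SDE
            A = np.exp(-(t[i] - t[i - 1]) / lengthscale)
            m, P = A * m, A * P * A + variance * (1 - A ** 2)
        S = P + noise_var                # innovation variance
        v = y[i] - m                     # innovation (prediction error)
        log_lik += -0.5 * (np.log(2 * np.pi * S) + v ** 2 / S)
        K_gain = P / S                   # update with observation y_i
        m, P = m + K_gain * v, (1 - K_gain) * P
        means.append(m)
        variances.append(P)
    return np.array(means), np.array(variances), log_lik

t = np.sort(np.random.rand(100)) * 10
y = np.sin(t) + 0.2 * np.random.randn(100)
m_filt, P_filt, lml = kalman_filter_ou(t, y, lengthscale=1.0, variance=1.0, noise_var=0.04)
```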