Strengthening Nonparametric Bayesian Methods with Structured Kernels - 2022
Details
Title: Strengthening Nonparametric Bayesian Methods with Structured Kernels
Author(s): Shen, Zheyang
Link(s): https://aaltodoc.aalto.fi/handle/123456789/117726
Rough Notes
The thesis starts by constructing nonstationary analogues of the Spectral Mixture (SM) kernel, whose authors argue that kernels are equivalent to using a fixed feature representation in Bayesian Neural Networks (BNNs). The SM kernel makes the basis functions more expressive: start from a spectral density, model it as a mixture of Gaussians, and use the resulting kernel, which is guaranteed to exist by Bochner's theorem. A less flexible approach models the spectral density as a mixture of Dirac distributions; the resulting kernel is called the Sparse Spectrum (SS) kernel.
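A minimal sketch of both kernels for scalar inputs, to make the "spectral density to kernel" step concrete. It assumes the usual closed forms obtained from a symmetrized Gaussian mixture (SM) and a symmetrized Dirac mixture (SS); the function and parameter names (`sm_kernel`, `ss_kernel`, `weights`, `means`, `variances`, `freqs`) are mine, not the thesis's.

```python
import numpy as np

def sm_kernel(x1, x2, weights, means, variances):
    """1-D SM kernel: inverse Fourier transform of a symmetrized Gaussian mixture."""
    tau = x1[:, None] - x2[None, :]  # pairwise differences x - x'
    k = np.zeros_like(tau, dtype=float)
    for w, mu, v in zip(weights, means, variances):
        # Gaussian component (mean mu, variance v) in frequency space
        k += w * np.exp(-2.0 * np.pi**2 * tau**2 * v) * np.cos(2.0 * np.pi * tau * mu)
    return k

def ss_kernel(x1, x2, weights, freqs):
    """1-D SS kernel: spectral density is a symmetrized mixture of Dirac deltas."""
    tau = x1[:, None] - x2[None, :]
    k = np.zeros_like(tau, dtype=float)
    for w, omega in zip(weights, freqs):
        k += w * np.cos(2.0 * np.pi * tau * omega)
    return k
```

Under these assumptions the SS kernel is the zero-variance limit of the SM kernel, which is one way to see why it is the less flexible of the two.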
It is noted that complex-valued kernels can be turned into real-valued kernels by "symmetrizing" the spectral density: \(\tilde{\psi}(\mathbf{\omega})=\frac{\psi(\mathbf{\omega})+\psi(-\mathbf{\omega})}{2}\).
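A short derivation of why this works in the stationary case, writing \(\tau = x - x'\) and assuming \(\psi\) is an integrable spectral density (the substitution \(\omega \to -\omega\) is applied to the second integral):
\[ \int e^{2i\pi\langle \omega, \tau\rangle}\tilde{\psi}(\omega)\,d\omega = \frac{1}{2}\int e^{2i\pi\langle \omega, \tau\rangle}\psi(\omega)\,d\omega + \frac{1}{2}\int e^{-2i\pi\langle \omega, \tau\rangle}\psi(\omega)\,d\omega = \int \cos(2\pi\langle \omega, \tau\rangle)\,\psi(\omega)\,d\omega \]
So the symmetrized density gives the real part of the original kernel.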
To obtain a spectral representation of a nonstationary kernel, one can use the Wigner transform, which yields an input-dependent Wigner Distribution Function (WDF). The WDFs of Locally Stationary (LS) kernels, i.e. kernels of the form \(k_1(\frac{x+x'}{2})k_2(x-x')\), can take negative values, hence they are not probability densities in general.
A kernel \(k_{HM}\) is said to be harmonizable if it can be represented as the generalized Fourier transform of a Lebesgue-Stieltjes measure associated with a positive-definite bimeasure \(\mathbf{\psi}\) of bounded variation. This bimeasure is called the Generalized Spectral Distribution (GSD).
\[ k_{HM}(x,x') = \int \int_{\mathbb{R}^D \times \mathbb{R}^D} \text{exp}(2i \pi(\langle \omega, x\rangle - \langle \omega',x'\rangle))\mathbf{\psi}(d\omega, d\omega') \]
This kernel class captures many important kernels of interest, while kernels outside of it are a bit "unusual, even pathological" (see Yaglom 1987).
The joint GSD reduces to the spectral density in Bochner's theorem when the kernel is stationary. The GSD is also not a proper probability density function, even though it is positive semi-definite and has bounded variation (Yaglom 1987).
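To see the stationary reduction explicitly, suppose all the mass of the GSD sits on the diagonal \(\omega = \omega'\), i.e. \(\mathbf{\psi}(d\omega, d\omega') = \delta_{\omega}(d\omega')\,S(d\omega)\) for some non-negative measure \(S\) (the symbol \(S\) is my notation, not the thesis's). Then the double integral collapses to
\[ k_{HM}(x,x') = \int_{\mathbb{R}^D} \exp(2i \pi\langle \omega, x - x'\rangle)\,S(d\omega), \]
which depends only on \(x - x'\) and is exactly the representation in Bochner's theorem.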
Starting from the GSD, one can go back to kernels. For example, letting \(\psi(\omega,\omega')=\sum_{1\leq i,j\leq M}\mathbf{B}_{ij}\delta_{\omega=\omega_i}\delta_{\omega'=\omega_j}\) with positive semi-definite \(\mathbf{B}\) generalizes the SS kernel; the result is called the Harmonizable SS (HSS) kernel. The kernel resulting from multiplying the HSS kernel with an LS kernel is called the Harmonizable Mixture (HM) kernel, which happens to admit a closed-form Wigner transform.
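A rough sketch of the HSS construction for scalar inputs, obtained by plugging the Dirac-mixture GSD into the harmonizable kernel integral. This is my reading of the formula rather than the thesis's implementation; `hss_kernel`, `B` and `freqs` (the locations of the Dirac atoms) are illustrative names.

```python
import numpy as np

def hss_kernel(x1, x2, B, freqs):
    """k(x, x') = sum_ij B_ij * exp(2i*pi*(freqs_i * x - freqs_j * x'))."""
    phi1 = np.exp(2j * np.pi * np.outer(x1, freqs))  # complex features, shape (n, M)
    phi2 = np.exp(2j * np.pi * np.outer(x2, freqs))  # complex features, shape (m, M)
    # Conjugating the second feature map gives the minus sign on the x' term.
    return phi1 @ B @ phi2.conj().T
```

With a diagonal `B` the cross terms vanish and the kernel depends only on `x - x'`, recovering a (complex, unsymmetrized) SS kernel; off-diagonal entries of `B` are what make the kernel nonstationary. The Gram matrix has the form `Phi @ B @ Phi^H`, so it is Hermitian positive semi-definite whenever `B` is.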
Another approach to constructing non-stationary kernels is to make the lengthscales input dependent.
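One well-known instance of this idea is the Gibbs kernel, a squared-exponential kernel whose lengthscale is a function of the input. The notes do not say which construction the thesis uses, so the sketch below is only an illustration for scalar inputs; `gibbs_kernel` and `lengthscale_fn` are hypothetical names.

```python
import numpy as np

def gibbs_kernel(x1, x2, lengthscale_fn):
    """1-D Gibbs kernel with an input-dependent lengthscale l(x) > 0."""
    l1 = lengthscale_fn(x1)[:, None]  # l(x),  shape (n, 1)
    l2 = lengthscale_fn(x2)[None, :]  # l(x'), shape (1, m)
    sq = l1**2 + l2**2
    prefactor = np.sqrt(2.0 * l1 * l2 / sq)  # keeps the kernel positive semi-definite
    diff = x1[:, None] - x2[None, :]
    return prefactor * np.exp(-diff**2 / sq)

def example_lengthscale(x):
    """Hypothetical choice: short lengthscale near 0, longer further away."""
    return 0.2 + 0.5 * np.abs(x)
```

With a constant lengthscale the prefactor equals 1 and the expression reduces to the ordinary squared-exponential kernel.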
#TODO Read Matthews et al 2016 for difference between VFE and VI.