Marrying Graphical Models & Deep Learning - Max Welling - MLSS 2017 - 2017
Details
Title: Marrying Graphical Models & Deep Learning - Max Welling - MLSS 2017
Author(s): Max Welling (video uploaded by the Max Planck Institute for Intelligent Systems)
Link(s): https://www.youtube.com/watch?v=H0hGn78SL2E
Rough Notes
The main quantity in this lecture is: \(\mathbb{E}_{Q(V)}[\log P(X|V)] - KL(Q(V)||P(V))\)
This can be interpreted as:
- (Physics) Energy minus entropy (see the rearrangement sketched below).
- (Compression) Error of predictions with penalization for complexity.
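A standard rearrangement makes the physics reading precise: writing the energy as \(E(V):=-\log P(X,V)\),

\[
\mathbb{E}_{Q(V)}[\log P(X|V)] - KL(Q(V)||P(V)) = \mathbb{E}_{Q(V)}[\log P(X,V)] - \mathbb{E}_{Q(V)}[\log Q(V)] = -\mathbb{E}_{Q(V)}[E(V)] + H[Q],
\]

so the negative of this objective is the variational free energy: expected energy minus entropy.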
Modern machine learning assumes the data is generated from some unknown distribution and then performs optimization, either of the maximum likelihood objective or of some loss-based objective. But it is better to treat the problem as a statistical problem rather than just an optimization problem: optimizing the objective on different realizations of the data would give different fitted parameters, so with only finite data there is always uncertainty in the parameters.
Bias-variance trade-off: given \(Y=f(X)+\epsilon\) with \(\epsilon \sim \mathcal{N}(0,\sigma_\epsilon^2)\), consider the expected prediction error \(\mathbb{E}[(Y-\hat{f}(x))^2]\). Adding and subtracting \(\mathbb{E}[\hat{f}(x)]\) and expanding gives Error = Bias squared + Variance + Irreducible error.
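Written out (a standard derivation; the cross terms vanish because \(\epsilon\) is zero-mean and independent of \(\hat{f}(x)\)):

\[
\mathbb{E}[(Y-\hat{f}(x))^2] = \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}} + \underbrace{\sigma_\epsilon^2}_{\text{Irreducible error}}
\]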
Latent variable models: Assume the observed variables are generated from some unknown (latent) variables, e.g. the generative model for observations \(X\) is assumed to factor as \(P(X)=\int_Z P(X|Z)P(Z) dZ\).
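As a concrete (illustrative, not from the lecture) instance, a linear-Gaussian latent variable model, i.e. factor analysis, can be sampled ancestrally; the sketch below just draws \(Z\) from the prior and then \(X\) from \(P(X|Z)\), which is all the generative direction requires. The dimensions and noise scale are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Factor analysis as a simple latent variable model (illustrative choice):
#   Z ~ N(0, I_k),   X | Z ~ N(W Z + mu, sigma^2 I_d)
d, k, sigma = 5, 2, 0.1          # observed dim, latent dim, noise scale (made up)
W = rng.normal(size=(d, k))      # loading matrix
mu = rng.normal(size=d)          # mean offset

def sample_x(n):
    """Ancestral sampling: first the latents, then the observations."""
    Z = rng.normal(size=(n, k))                          # Z ~ P(Z)
    X = Z @ W.T + mu + sigma * rng.normal(size=(n, d))   # X ~ P(X|Z)
    return X, Z

X, Z = sample_x(1000)
print(X.shape, Z.shape)   # (1000, 5) (1000, 2)
```

The hard part is the reverse direction, \(P(Z|X)\), which is what the rest of the lecture is about.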
Computing \(P(Z|X)\) is intractable, yet it is needed for learning/inference (#DOUBT Difference between learning and inference here). As a result we turn to approximate inference, of which variational inference and sampling are the most popular families.
The most common sampling approaches are based on Markov Chain Monte Carlo (MCMC), e.g. Metropolis-Hastings. However, computing the accept-reject ratio depends on the whole dataset, i.e. it is \(O(N)\) per proposal; in the era of big data we would rather use a mini-batch. Variational inference, on the other hand, introduces a distribution \(Q(Z|X)\) to approximate \(P(Z|X)\) and minimizes \(KL(Q(Z|X)||P(Z|X))\), or equivalently maximizes over \(\Phi\) the objective \(L(\Phi)=\int Q(Z|X,\Phi)(\log P(X|Z,\Theta)P(Z) - \log Q(Z|X,\Phi))dZ\).
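A minimal sketch of why the Metropolis-Hastings accept step is \(O(N)\): with i.i.d. data the log accept ratio contains a sum of per-datapoint log-likelihoods over the full dataset. The model, prior and proposal below are made-up placeholders, not anything from the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=100_000)    # toy dataset (placeholder)

def log_prior(theta):
    return -0.5 * theta**2                             # N(0,1) prior, up to a constant

def log_lik(theta, x):
    return -0.5 * (x - theta)**2                       # N(theta,1) likelihood, up to a constant

def mh_step(theta, step=0.05):
    prop = theta + step * rng.normal()                 # symmetric random-walk proposal
    # The sums over *all* N points are the O(N) cost paid per proposal.
    log_ratio = (log_prior(prop) + log_lik(prop, data).sum()
                 - log_prior(theta) - log_lik(theta, data).sum())
    return prop if np.log(rng.uniform()) < log_ratio else theta

theta = 0.0
for _ in range(500):
    theta = mh_step(theta)
print(theta)   # drifts toward the data mean (~2.0)
```

Subsampling the data naively biases this ratio, which is why mini-batch-friendly MCMC needs extra machinery, whereas the variational objective decomposes over data points and is mini-batch-friendly by construction.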
In learning, we want to maximize the log probability of the observed variables \(X\) given the parameters \(\Theta\); for a latent variable model this is \(\log P(X|\Theta)=\log \sum_Z P(X|Z,\Theta)P(Z)\geq L(\Phi)\). Writing the bound as \(B(\Theta,\Phi)\) (the same quantity \(L(\Phi)\) with its dependence on \(\Theta\) made explicit; the gap \(\log P(X|\Theta)-B(\Theta,\Phi)\) is exactly \(KL(Q(Z|X)||P(Z|X))\)), the variational EM algorithm alternates an E-step \(\text{argmax}_\Phi B(\Theta,\Phi)\) (variational inference) and an M-step \(\text{argmax}_\Theta B(\Theta,\Phi)\) (approximate learning). This, however, assumes a separate density \(Q_i\) for each data point \(i\). Amortized inference instead uses a single conditional distribution \(Q_\phi(Z|X)\) that maps any \(X\) to a distribution over \(Z\), with the parameters \(\phi\) (e.g. of a neural network) shared across all samples. From this perspective, "deepifying" factor analysis gives the Variational Autoencoder (VAE), whose encoder \(Q_\phi(Z|X)\) and decoder \(P_\theta(X|Z)\) are parametrized by neural networks.
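A minimal VAE sketch in PyTorch (my own illustrative architecture and sizes, not the lecture's): the encoder amortizes inference by mapping each \(x\) to the parameters of \(Q_\phi(Z|X)\), and the training loss is the negative of the bound above, i.e. a reconstruction term plus a KL term to a standard normal prior.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=200, z_dim=20):
        super().__init__()
        # Encoder Q_phi(Z|X): outputs mean and log-variance of a diagonal Gaussian.
        self.enc = nn.Linear(x_dim, h_dim)
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        # Decoder P_theta(X|Z): Bernoulli likelihood over pixels.
        self.dec = nn.Linear(z_dim, h_dim)
        self.dec_out = nn.Linear(h_dim, x_dim)

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps keeps gradients flowing to phi.
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        logits = self.dec_out(torch.relu(self.dec(z)))
        return logits, mu, logvar

def negative_elbo(logits, x, mu, logvar):
    # -E_Q[log P(X|Z)]: reconstruction error (Bernoulli cross-entropy).
    recon = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
    # KL(Q(Z|X) || N(0, I)) in closed form for diagonal Gaussians.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(32, 784)                 # fake batch standing in for binarized images
logits, mu, logvar = model(x)
loss = negative_elbo(logits, x, mu, logvar)
opt.zero_grad()
loss.backward()
opt.step()
print(loss.item())
```

Because \(\phi\) is shared, a single gradient step on a mini-batch improves inference for every data point, which is exactly the amortization argument above.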
Prior work often made use of the wake-sleep algorithm, where the updates for \(Q\) and \(P\) optimize two different objectives; this, however, results in very high-variance gradient estimates.
The slide at 53:30 positions VAEs as a mix of generative and discriminative models.
Around 1:06:20, 3 real-world successes of deep learning in medical applications are shown.
Variational Bayes can also be seen from a compression viewpoint (1:17:00); in fact we can compress neural networks by zeroing out most of their weights, with a compression rate of around 400 times in some cases.
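The lecture's approach is Bayesian (variational) compression; as a much simpler stand-in, the sketch below only illustrates the bookkeeping: zero all but a small fraction of the weights by magnitude and report the resulting compression rate. The matrix, threshold and numbers here are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(512, 512))                 # stand-in for a trained layer's weights

def prune(W, keep_fraction=0.0025):
    """Zero all but the largest-magnitude weights (simple magnitude pruning,
    not the Bayesian compression used in the lecture)."""
    thresh = np.quantile(np.abs(W), 1.0 - keep_fraction)
    W_pruned = np.where(np.abs(W) >= thresh, W, 0.0)
    rate = W.size / np.count_nonzero(W_pruned)  # compression rate from sparsity alone
    return W_pruned, rate

_, rate = prune(W)
print(f"compression rate: ~{rate:.0f}x")        # keeping 0.25% of weights gives ~400x
```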