Machine Learning Advanced Probabilistic Methods (Aalto University CS-E4820 2023)
Introduction
Content
#TODO Fill in content from Lecture 1-4.
Probabilistic Modeling
The goal of probabilistic modelling is to answer questions about data while quantifying uncertainty using probabilities.
In a nutshell, the process is:
- Select a model (e.g. linear models, Gaussian mixture models etc).
- Infer the model parameters i.e. train/fit the model (e.g. via maximum likelihood, expectation maximization, variational Bayes etc).
- Use the fitted model to answer the questions of interest.
In practice, several models are considered, which leads to the problem of model selection.
Important concepts from probability theory:
- Random variables.
- Expectations.
- Marginalization.
- Independence.
- Conditional distributions.
- Conditional independence.
Bayesian Networks
A Bayesian network is a DAG \(G\) together with a distribution that factorizes according to the graph as \(p(x_1, \cdots, x_D) = \prod_{i=1}^{D}p(x_i|PA(x_i))\), where \(PA(x_i)\) denotes the parents of \(x_i\) in \(G\).
Given a Bayesian Network, d-separation in Directed Acyclic Graphs (DAGs) implies conditional independence in the distribution. The main substructures to consider in the DAG are forks, chains and colliders, each consisting of 3 nodes/variables. By factorizing their joint densities according to the graph structure, one can read off the conditional independences implied by them.
The Markov Equivalence Class (MEC) consists of all possible graphs that share the same skeleton and v-structures - given observational data from a Bayesian Network we can only identify the corresponding DAG up to its MEC.
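As a sanity check on the collider case, here is a small numpy sketch (the binary distributions and variable names are made up for illustration): build a joint \(p(x,y,z)=p(x)p(y)p(z|x,y)\) where \(z\) is a noisy-OR collider, and verify numerically that \(x\) and \(y\) are marginally independent but become dependent once we condition on \(z\).

```python
import numpy as np

# Collider x -> z <- y with binary variables.
p_x = np.array([0.7, 0.3])            # p(x)
p_y = np.array([0.6, 0.4])            # p(y)
p_z_given_xy = np.zeros((2, 2, 2))    # p(z | x, y), a noisy OR
for x in range(2):
    for y in range(2):
        p_z1 = 1 - 0.9 ** (x + y)     # chance of z = 1 grows with x + y
        p_z_given_xy[x, y] = [1 - p_z1, p_z1]

# Full joint p(x, y, z), shape (2, 2, 2), factorized according to the DAG.
joint = p_x[:, None, None] * p_y[None, :, None] * p_z_given_xy

# Marginal p(x, y) factorizes, so x and y are marginally independent.
p_xy = joint.sum(axis=2)
print(np.allclose(p_xy, np.outer(p_x, p_y)))   # True

# Conditional p(x, y | z=1) does not factorize: x and y are dependent given z.
p_xy_given_z1 = joint[:, :, 1] / joint[:, :, 1].sum()
p_x_given_z1 = p_xy_given_z1.sum(axis=1)
p_y_given_z1 = p_xy_given_z1.sum(axis=0)
print(np.allclose(p_xy_given_z1, np.outer(p_x_given_z1, p_y_given_z1)))  # False
```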
Gaussian distribution
- Single and multivariate Gaussian densities.
- Characterization
- Conjugate priors.
- Marginal and conditional distributions.
- Linear transformations.
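For instance, the conditional of a partitioned Gaussian \(x=(x_a,x_b)\) is again Gaussian, \(p(x_a|x_b)=\mathcal{N}\left(\mu_a+\Sigma_{ab}\Sigma_{bb}^{-1}(x_b-\mu_b),\ \Sigma_{aa}-\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\right)\). A short numpy sketch with toy numbers of my own:

```python
import numpy as np

# Joint Gaussian over (x_a, x_b) with a 1 + 1 split (toy numbers).
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
mu_a, mu_b = mu[:1], mu[1:]
S_aa, S_ab = Sigma[:1, :1], Sigma[:1, 1:]
S_ba, S_bb = Sigma[1:, :1], Sigma[1:, 1:]

# Condition on an observed value of x_b.
x_b = np.array([3.0])
cond_mean = mu_a + S_ab @ np.linalg.solve(S_bb, x_b - mu_b)
cond_cov = S_aa - S_ab @ np.linalg.solve(S_bb, S_ba)
print(cond_mean, cond_cov)   # mean = 1 + 0.8*(3-2)/1 = 1.8, cov = 2 - 0.64 = 1.36
```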
ML-II and Laplace approximations
- ML-II, also called evidence maximization, empirical Bayes, or maximum marginal likelihood involves choosing hyperparameters which maximize the marginal likelihood.
- Hyperparameter selection can also be done by choosing hyperparameters which minimize prediction error in a separate validation dataset, or maximizing the marginal likelihood on the validation data.
- Laplace Approximation, and its application to Bayesian logistic regression with a Gaussian prior on the weights.
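A minimal sketch of the Laplace approximation for Bayesian logistic regression with a \(\mathcal{N}(0,\alpha^{-1}I)\) prior on the weights (the simulated data and all variable names are my own): find the MAP weights by optimizing the log posterior, then approximate the posterior with a Gaussian centred at the mode whose covariance is the inverse Hessian of the negative log posterior there.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic sigmoid

rng = np.random.default_rng(0)
N, D, alpha = 200, 3, 1.0                     # alpha = prior precision
X = rng.normal(size=(N, D))
w_true = np.array([1.5, -2.0, 0.5])
y = rng.binomial(1, expit(X @ w_true))        # simulated binary labels

def neg_log_post(w):
    # -log p(w | y, X) up to a constant: Bernoulli log-likelihood + Gaussian prior.
    logits = X @ w
    loglik = y @ logits - np.logaddexp(0.0, logits).sum()
    logprior = -0.5 * alpha * w @ w
    return -(loglik + logprior)

# MAP estimate = mode of the posterior.
w_map = minimize(neg_log_post, np.zeros(D)).x

# Hessian of the negative log posterior at the mode.
s = expit(X @ w_map)
H = X.T @ (X * (s * (1 - s))[:, None]) + alpha * np.eye(D)

# Laplace approximation: p(w | y, X) ~ N(w_map, H^{-1}).
Sigma_laplace = np.linalg.inv(H)
print(w_map)                                  # posterior mode
print(np.sqrt(np.diag(Sigma_laplace)))        # approximate posterior std devs
```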
Gaussian Mixture Models (GMMs)
Introduce latent variables \(z_n = (z_{n1},\cdots, z_{nK})\) which specify the component \(k\) of observation \(x_n\), where \(z_n\) is a one-hot vector \(e_k\).
Define \(p(z_n)=\prod_{k=1}^K \pi_k^{z_{nk}}\) and \(p(x_n|z_n) = \prod_{k=1}^K N(x_n|\mu_k, \Sigma_k)^{z_{nk}}\).
\(\pi_k\) are called mixing coefficients, and they sum to 1.
The marginal distribution \(p(x_n)\) is a GMM: \[ p(x_n) = \sum_{k=1}^K \pi_k \mathcal{N}(x_n|\mu_k, \Sigma_k) \]
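A small numpy sketch of ancestral sampling from this marginal (the mixing coefficients and component parameters below are made up): draw the component indicator \(z_n\) from the mixing coefficients, then draw \(x_n\) from the chosen Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])      # mixing coefficients, sum to 1
mus = np.array([-3.0, 0.0, 4.0])    # 1D component means
sigmas = np.array([1.0, 0.5, 2.0])  # component standard deviations

# Ancestral sampling: z_n ~ Categorical(pi), then x_n ~ N(mu_k, sigma_k^2).
z = rng.choice(len(pi), size=1000, p=pi)
x = rng.normal(mus[z], sigmas[z])

# Evaluate the mixture density p(x) = sum_k pi_k N(x | mu_k, sigma_k^2).
def gmm_pdf(xs):
    comps = np.exp(-0.5 * ((xs[:, None] - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))
    return comps @ pi

print(gmm_pdf(np.array([-3.0, 0.0, 4.0])))
```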
Expectation Maximization
Given a model with observed data \(x\) and latent variables \(z\), the MLE is the solution to the following optimization problem: \(\text{argmax}_\theta \log \int p(x,z|\theta)\,dz\).
In Expectation Maximization (EM), we assume we have some idea of \(z\) from the estimate of \(\theta\) at iteration \(k\), \(\hat{\theta}_k\). This is done by first computing the posterior \(p(z|x,\hat{\theta}_k)\) (E-step). Under this posterior we take the expectation of the complete-data log likelihood, \(Q(\theta,\hat{\theta}_k) := \mathbb{E}_{p(z|x,\hat{\theta}_k)}[\log p(x,z|\theta)]\), which we then maximize with respect to \(\theta\) (M-step). This is repeated until convergence.
Each iteration is guaranteed not to decrease the log likelihood \(\log p(x|\theta)\), since \(Q(\theta, \hat{\theta}_k)\) is, up to a term constant in \(\theta\), a lower bound on it that is tight at \(\theta=\hat{\theta}_k\). However, EM can get stuck in local optima.
In general it is applicable when the observed data \(x\) can be augmented with some other variable \(z\) per datapoint, and \(p(x,z|\theta)\) is easy to optimize with respect to \(\theta\).
Note that MLE is not well defined for GMMs since given a component with a single data point, the likelihood tends to infinity when the variance of the Gaussian corresponding to that component tends to 0.
There is also the label switching phenomenon where switching the cluster labels leads to the same likelihood, introducing a form of non-identifiability.
In practice, EM for GMMs is initialized using some classical clustering approach such as k-means.
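For reference, scikit-learn's `GaussianMixture` runs this EM loop and by default initializes the responsibilities with k-means; a minimal usage sketch on toy 1D data (the data itself is made up):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy 1D data from two well-separated clusters.
x = np.concatenate([rng.normal(-3.0, 1.0, 300), rng.normal(4.0, 2.0, 200)])[:, None]

# EM for a 2-component GMM, initialized with k-means (sklearn's default).
gmm = GaussianMixture(n_components=2, init_params="kmeans", n_init=5, random_state=0)
gmm.fit(x)

print(gmm.weights_)        # estimated mixing coefficients pi_k
print(gmm.means_.ravel())  # estimated component means mu_k
print(gmm.lower_bound_)    # log-likelihood lower bound reached by EM
```

Using several restarts (`n_init`) mitigates the local-optima issue mentioned above.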
Variational Inference
Recall we want to learn the posterior distribution \(p(z|x)\) of latent variables given observed variables \(x\). In a fully Bayesian model, the parameters \(\theta\) have priors and are included as part of \(z\), unlike in EM.
Variational Inference (VI) is one form of inference used to approximate the posterior.
The idea is to approximate \(p(z|x)\) with a tractable distribution \(q(z)\), which has a simpler form, e.g. it is Gaussian and/or factorized. Given such an approximation \(q(z)\), we have that \[ \log p(x) = \mathcal{L}(q(z)) + \text{KL}(q(z)||p(z|x)) \] where \(\mathcal{L}\) is a lower bound for \(\log p(x)\), also called the Evidence Lower BOund (ELBO), and KL refers to the Kullback-Leibler (KL) Divergence.
It should be noted that VI tends to underestimate uncertainty: the \(\text{KL}(q(z)||p(z|x))\) objective is zero-forcing, i.e. \(q\) is penalized heavily for putting mass where the posterior has little, and a factorized \(q\) cannot represent posterior correlations, so the approximation is typically narrower than the true posterior.
Since \(\log p(x)\) is a constant, maximizing \(\mathcal{L}\) minimizes the KL divergence between the posterior and our approximation \(q(z)\).
The ELBO itself can be written as: \[ \mathbb{E}_{q(z)}[\log p(x|z)] - KL(q(z)||p(z)) \]
The first term can be seen as the reconstruction error, where we want \(q(z)\) to put its mass where \(p(x|z)\) is high, and the second term can be seen as reducing the complexity of \(q(z)\) by forcing it not to stray too far from the prior \(p(z)\).
Mean-field Variational Bayes (VB) assumes that \(q(z)\) factorizes, and each \(q_i(z_i)\) is called a factor. Suppose we update each factor \(q_i(z_i)\) separately while the others are fixed - then one can show that the ELBO is maximized with respect to \(q_i(z_i)\) by setting \[ \log q_i^*(z_i) = \mathbb{E}_{q_{-i}(z_{-i})}[\log p(x,z)] + \text{const} \] where \(q_{-i}(z_{-i}) = \prod_{j\neq i}q_j(z_j)\).
The learning algorithm now consists of updating each factor separately.
Note the similarity with the Expectation Maximization (EM) algorithm - it can be shown that EM is a special case of VI.
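As a concrete instance of these coordinate-wise factor updates, here is a minimal CAVI sketch for the classic conjugate model \(x_n \sim \mathcal{N}(\mu,\tau^{-1})\) with priors \(\mu|\tau \sim \mathcal{N}(\mu_0,(\lambda_0\tau)^{-1})\) and \(\tau\sim\text{Gamma}(a_0,b_0)\), assuming the factorization \(q(\mu,\tau)=q(\mu)q(\tau)\) (this is the example from Bishop's PRML Section 10.1.3; the variable names and toy data are mine).

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.5, size=100)   # data with unknown mean and precision
N, xbar = len(x), x.mean()

# Prior hyperparameters.
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

# q(tau) = Gamma(a_N, b_N); q(mu) = N(mu_N, 1/lam_N). Update each factor in turn.
a_N, b_N = a0, b0
for _ in range(50):
    E_tau = a_N / b_N
    # Update q(mu): Gaussian factor with closed-form parameters.
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    E_mu, E_mu2 = mu_N, mu_N ** 2 + 1.0 / lam_N
    # Update q(tau): Gamma factor, using expectations under the current q(mu).
    a_N = a0 + (N + 1) / 2
    b_N = b0 + 0.5 * (np.sum(x ** 2) - 2 * E_mu * np.sum(x) + N * E_mu2
                      + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0 ** 2))

print(mu_N, 1.0 / np.sqrt(lam_N))  # q(mu): posterior mean of mu and its std dev
print(a_N / b_N)                   # E_q[tau]; compare with 1 / 1.5**2 ~ 0.44
```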
Model Selection
Reading: Barber Chapter 12.
Possible goals may be:
- Get the most useful model, e.g. one that best predicts the future.
- The most probable model, e.g. when comparing between scientific hypotheses.
Bayesian Model Selection involves comparing \(m\) models \(M_i\), each with parameters \(\theta_i\), where \(p( x, \theta_i | M_i ) = p( x|\theta_i,M_i )\,p( \theta_i | M_i )\).
The model posterior probabilities are then \[ p(M_i|x) = \frac{p(x|M_i)p(M_i)}{p(x)} \] where we have the marginal likelihood as \(p(x|M_i)=\int p(x|\theta_i, M_i)p(\theta_i|M_i)d\theta_i\) and the model evidence as \(p(x)=\sum_i p(x|M_i)p(M_i)\).
The posterior odds are defined as \[ \frac{p(M_i|x)}{p(M_j|x)} = \frac{p(x|M_i)p(M_i)}{p(x|M_j)p(M_j)} \]
With \(\frac{p(x|M_i)}{p(x|M_j)}\) being called the Bayes factor - it is the ratio of the marginal likelihoods of the models \(M_i,M_j\) under consideration.
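A small worked example with closed-form marginal likelihoods (the models and counts here are made up for illustration): compare \(M_1\), a coin with a uniform \(\text{Beta}(1,1)\) prior on its bias, against \(M_0\), a fair coin, given \(k\) heads out of \(n\) tosses.

```python
from math import comb

import numpy as np
from scipy.special import betaln

n, k = 20, 15                      # 15 heads out of 20 tosses (made-up data)

# M0: fair coin, theta fixed at 0.5, so p(x | M0) = C(n, k) * 0.5^n.
log_ml_M0 = np.log(comb(n, k)) + n * np.log(0.5)

# M1: theta ~ Beta(1, 1); the Beta-Binomial marginal likelihood is
# p(x | M1) = C(n, k) * B(k + 1, n - k + 1) / B(1, 1).
log_ml_M1 = np.log(comb(n, k)) + betaln(k + 1, n - k + 1) - betaln(1, 1)

bayes_factor = np.exp(log_ml_M1 - log_ml_M0)
print(bayes_factor)                # ~3.2 > 1, so the data favour the biased-coin model
```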
Recall that \(\log p(x) = \mathcal{L}(q) + \text{KL}(q||p)\), where the ELBO \(\mathcal{ L }\) can be computed directly if we use exponential families and conjugate priors. The ELBO gives:
- A different way to define factor updates, by maximizing \(\mathcal{ L }\).
- A check for the algorithm - \(\mathcal{ L }\) should never decrease.
- Criterion to monitor convergence.
- An estimate of \(\log p(x)\) which can be used for model selection.
This last point is why the ELBO is a relevant quantity in Bayesian model selection: it serves as a tractable approximation to the log marginal likelihood.
There are also predictive model selection criteria - they aim to approximate the expected prediction accuracy in a new data set. Often this involves:
- Analytic computation (e.g. AIC, DIC, WAIC).
- Efficient sample re-use (e.g. cross validation).
Hence the goal is to find a model that is suitable for prediction.
Asymptotically, AIC and leave-one-out cross validation are equivalent.
| Bayesian Model Selection | Predictive Model Selection |
|---|---|
| Asymptotically consistent | No consistency guarantees |
| Better at identifying the "true" model among distinct, interpretable choices | No need to assume a true model |
| Heavily penalizes complexity | Less complexity penalization |
| May be sensitive to priors | |
In practice, the two approaches are used interchangeably.
Factor Analysis
Reading: Barber Chapter 21.
Consider an \(N\times D\) data matrix. We may be interested in knowing about:
- Samples (i.e. the rows) - techniques include clustering, multi-dimensional scaling, discriminant analysis, where we start from similarities between the samples themselves.
- Features (i.e. columns) - techniques include factor analysis, Principal Component Analysis, Canonical Correlation Analysis, where we start from the correlations between the variables.
In Factor Analysis (FA), the aim is to model the correlations between many visible variables \(\mathbf{ v }\) as arising from a smaller number of hidden factors \(\mathbf{ h }\). The factors are not observed directly, while the visible variables depend on the factors and some random error.
The relation is as follows: \[ \mathbf{ v }_n = F\mathbf{ h }_n + \mathbf{ c }+\epsilon_n \]
Where \(\epsilon_n \sim \mathcal{ N }(0,\Psi)\) and \(\Psi = \text{diag}(\psi_1,\cdots,\psi_D)\), and the factor loading matrix \(F\) (which is estimated from the data) tells how the factors affect the observations, i.e. \(F_{ij}\) is the effect of the factor \(h_j\) on variable \(v_i\).
The full generative model can be written as: \[ \mathbf{ h } \sim\mathcal{ N}(0,\mathbf{ I }_H) \\ \mathbf{ v |h} \sim \mathcal{ N }(F\mathbf{ h }+\mathbf{ c },\Psi) \]
Integrating over \(\mathbf{ h }\) gives the marginal as \(p(\mathbf{ v }) = \mathcal{N}(\mathbf{ c },FF^T+\Psi)\).
Note that the marginal likelihood is invariant to rotations of the factors: given a rotation matrix \(R\), \(FR(FR)^T + \Psi = FRR^TF^T + \Psi = FF^T + \Psi\).
Often, \(R\) is selected to give interpretable factors (varimax rotation), which makes the columns of \(F\) have a small number of large values. The rotation itself is not relevant for fitting the model, but it is relevant for interpreting the learnt factors.
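A quick numpy check of the two facts above, with a randomly drawn loading matrix and a random rotation (all names are mine): samples from the generative model have covariance close to \(FF^T+\Psi\), and rotating the loadings leaves that covariance unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, N = 5, 2, 50000

F = rng.normal(size=(D, H))                    # factor loading matrix
c = rng.normal(size=D)                         # offset
Psi = np.diag(rng.uniform(0.1, 0.5, size=D))   # diagonal noise covariance

# Generative model: h ~ N(0, I_H), v | h ~ N(F h + c, Psi).
h = rng.normal(size=(N, H))
v = h @ F.T + c + rng.normal(size=(N, D)) @ np.sqrt(Psi)

# Empirical covariance of v is close to the marginal covariance F F^T + Psi.
print(np.abs(np.cov(v.T) - (F @ F.T + Psi)).max())            # small sampling error

# Rotation invariance: FR (FR)^T + Psi equals F F^T + Psi for any rotation R.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
print(np.allclose((F @ R) @ (F @ R).T + Psi, F @ F.T + Psi))  # True
```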
When \(\Psi\) is constrained to the isotropic form \(\sigma^2 \mathbf{ I }\), we get Probabilistic PCA (Tipping & Bishop, 1999). For example, if we train both models on images of some MNIST digit, the FA model learns the noise in the edges better than the PPCA model, as it can model the variance of each pixel separately.
Compared to GMMs whose hidden variables stem from a categorical distribution (which decides the component that a sample belongs to), the hidden variables in FA are from a Gaussian.
Geometrically, FA assumes that the data lies close to a low-dimensional linear manifold: each observation is a point \(F\mathbf{ h }_n+\mathbf{ c }\) on the \(H\)-dimensional affine subspace \(\{F\mathbf{ h }+\mathbf{ c }: \mathbf{ h }\in\mathbb{R}^H\}\), perturbed by the Gaussian noise \(\epsilon_n\).
We can extend and combine these models, e.g. make a mixture of FA models.
Some approaches to determining how many factors to include:
- Bayesian model selection ideas.
- Cross validation.
- Automatic Relevance Determination (ARD) - shrinkage priors etc.
- Bayesian nonparametrics (e.g. Dirichlet Processes for clustering, Beta process priors for Factor Analysis).
Differentiable Variational Inference
Reading: Weight Uncertainty in Neural Networks, Blundell et al ICML 2015.
Bayesian Neural Networks (BNNs) place a distribution over the network's weights. BNNs capture uncertainty better than classical NNs, which tend to be severely overconfident away from the training data.
Classical Variational Bayes typically requires conjugate distributions to obtain closed-form updates, and the derivations need to be redone for each model. Hence, for BNNs, we consider VB by backpropagation. The main ideas are:
- Use SGD to compute gradients of the ELBO, avoiding the need for model-specific derivations.
- Compute the updates using mini-batches, increasing the speed when there is a lot of data.
- Use Monte Carlo estimates to compute required expectations, removing the need for conjugate priors.
The reparametrization trick is used when computing ELBO updates, so that Monte Carlo estimates of the gradient can be backpropagated through the sampling step.
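A minimal PyTorch sketch of the reparametrization trick (a toy mean-field Gaussian over the weights of a linear model, not the full Bayes by Backprop setup from the paper; the data, prior and noise level are made up): sample \(w = \mu + \sigma \odot \epsilon\) with \(\epsilon\sim\mathcal{N}(0,I)\), so that gradients of a single-sample Monte Carlo ELBO estimate flow back into \(\mu\) and \(\sigma\).

```python
import torch

torch.manual_seed(0)
N, D = 100, 3
X = torch.randn(N, D)
y = X @ torch.tensor([1.5, -2.0, 0.5]) + 0.1 * torch.randn(N)   # toy regression data

# Variational parameters of q(w) = N(mu, diag(sigma^2)); rho parametrizes sigma > 0.
mu = torch.zeros(D, requires_grad=True)
rho = torch.full((D,), -3.0, requires_grad=True)
opt = torch.optim.Adam([mu, rho], lr=0.05)

for step in range(2000):
    opt.zero_grad()
    sigma = torch.nn.functional.softplus(rho)
    eps = torch.randn(D)
    w = mu + sigma * eps                     # reparametrized sample from q(w)
    # Negative single-sample ELBO: -log likelihood + KL(q(w) || N(0, I)).
    log_lik = -0.5 * ((y - X @ w) ** 2 / 0.01).sum()   # assumes noise variance 0.01
    kl = 0.5 * (sigma ** 2 + mu ** 2 - 1 - 2 * torch.log(sigma)).sum()
    loss = -log_lik + kl
    loss.backward()                          # gradients flow through the sampled w
    opt.step()

print(mu.detach(), torch.nn.functional.softplus(rho).detach())
```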
Anki
How does conditional independence relate to d-separation?
In a Bayesian network, if two variables \(A,B\) are d-separated by some set of variables \(C\) then \(A,B\) are conditionally independent given \(C\). Note that the converse does not hold: conditional independence in the distribution does not imply d-separation in the graph.
Give an example of conditional independence statements that cannot be represented by a Bayesian Network.
(Lecture 2, 2023, Slide 14/29)
- \(T_1\) independent of \(T_2, Y_2\).
- \(T_2\) independent of \(T_1, Y_1\).
What is the marginal likelihood?
Given data \(\mathcal{ D }\), the (frequentist) Maximum-Likelihood approach is to find parameters that maximize \(\log p(\mathcal{ D }|\mathbf{ w },\lambda)\) where \(\mathbf{ w }\) are model parameters and \(\lambda\) are hyperparameters.
In the Bayesian setting, we integrate out the model parameters to get the marginal likelihood \(p(\mathcal{ D }|\lambda)=\int p(\mathcal{ D }|\mathbf{ w },\lambda)p(\mathbf{ w }|\lambda)d\mathbf{ w }\). This is a function of the hyperparameters \(\lambda\), and finding the optimal \(\lambda\) via the marginal likelihood is called Maximum-Likelihood Type 2, or evidence maximization, or empirical Bayes, or Maximum Marginal Likelihood.
Define odds and the odds ratio
Given 2 classes, let \(p_i\) be the probability of some sample belonging to class \(i\). We have that \(p_0 + p_1=1\).
The odds of belonging to class \(i\) is \(\frac{p_i}{1-p_i}\).
In the context of binary linear logistic regression, i.e. \(p(c=1|x)= \sigma(w_0 + \sum_{i}^{}w_i x_i)\), if a feature \(x_1\) is also binary, then \(e^{w_1}\) is called the odds ratio.
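To see why, note that for logistic regression the odds are \(\frac{p(c=1|x)}{1-p(c=1|x)} = e^{w_0+\sum_i w_i x_i}\). Holding the other features fixed, \[ \frac{\text{odds}(c=1|x_1=1)}{\text{odds}(c=1|x_1=0)} = \frac{e^{w_0+w_1+\sum_{i\geq 2}w_i x_i}}{e^{w_0+\sum_{i\geq 2}w_i x_i}} = e^{w_1} \]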
Describe the EM algorithm alongside any assumptions involved
We are given some model with latent variables \(\mathbf{ z }\) and want to fit parameters by maximizing the likelihood.
Denote \(\mathbf{ x }\) as the incomplete data. Then, \(\mathbf{ x,z }\) is called the complete data.
One assumption is that \(\log p(\mathbf{ x,z }|\theta)\) is easy to maximize.
The EM algorithm is about maximizing the following with respect to \(\theta\) given its current estimate \(\theta_0\): \[ Q(\theta,\theta_0) = \mathbb{ E }_{\mathbf{ z }\sim p(\mathbf{ z }|\mathbf{ x },\theta_0)}[\log p(\mathbf{ x,z }|\theta)] \]
The E-step computes the expectation in \(Q(\theta,\theta_0)\), which requires the posterior \(p(\mathbf{ z }|\mathbf{ x },\theta_0)\). The M-step maximizes \(Q(\theta,\theta_0)\) with respect to \(\theta\) and sets \(\theta_0\) to the maximizer.
Given a variational distribution \(q(z_1,\cdots,z_M)\) and the true log joint distribution \(\log p(\mathbf{ x },\mathbf{ z })\), what is the mean-field variational update?
The mean-field variational update first assumes the variational distribution completely factorizes into \(\prod_{i=1}^{M}q_i(z_i)\). One can then show that the evidence lower bound is maximized by updating factor \(q_i(z_i)\) as \[ \log q_i^*(z_i) = \mathbb{ E }_{q(\mathbf{ z }_{-i})}[\log p(\mathbf{ x,z })] + \text{const} \] where \(q(\mathbf{ z }_{-i}) = \prod_{j\neq i}q_j(z_j)\).