Bayesian Data Analysis (Aalto University CS-E5710 2022)

Links

Content

Introduction

Some real-world examples of Bayesian statistics:

  • Predicting the compressive strength of concrete, a measure of concrete quality. Here we want to ask whether we trust the predictions, instead of just making predictions. It was discovered that the concentration of natural sand could be lowered substantially, making the concrete cheaper to produce.

  • Choosing the concentration of a drug so that it is high enough to have an effect but not so high that it introduces side effects. If we have a model fitted on adults, we have to tweak the model to extrapolate to children.

The modelling pipeline starts from expert information, which is formalized as a mathematical model alongside the relevant (prior) probabilities. We then use Bayes' theorem to incorporate data into this model and obtain an updated model of the system of interest. This updated model is often called the posterior distribution.

A lot of ideas go into the actual computation of the posterior distribution and into checking that everything worked (model diagnostics, inference diagnostics, etc.). A complete workflow involves iterating and improving.

Two types of uncertainty:

  • Aleatoric uncertainty is due to randomness; it cannot be reduced by repeated observations.
  • Epistemic uncertainty is due to lack of knowledge; it can be reduced. Two observers may have different epistemic uncertainties.

\(p(y|\theta)\) as a function of \(y\) given fixed \(\theta\) is called the model, and describes the aleatoric uncertainty. As a function of \(\theta\) given fixed \(y\), it is called the likelihood, and provides information about the epistemic uncertainty - this is not a probability distribution.

Rock-paper-scissors AIs can beat humans after around 6 games (#TODO Search for reference).

Basics of Bayesian inference

The Binomial model is used for multiple trials of a phenomenon with binary outcomes. For known \(\theta,n\), as a function of \(y\), we call \(p(y|\theta,n) = {n \choose y}\theta^y(1-\theta)^{n-y}\) the observation model, which describes the aleatoric uncertainty. For known \(y,n\), as a function of \(\theta\), we call \(p(y|\theta,n)={n\choose y}\theta^y (1-\theta)^{n-y}\) the likelihood, which describes the epistemic uncertainty.

The predictive distribution for new \(\tilde{y}\) is given by \(p(\tilde{y}|y)=\int p(\tilde{y},\theta|y)d\theta\) where the joint distribution is decomposed to take into account the posterior distribution.
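As a minimal sketch of this predictive computation in R, assume a Beta(1,1) prior on \(\theta\) (not specified above), so that the posterior is Beta(\(y+1,n-y+1\)); the integral over \(\theta\) is then approximated by averaging over posterior draws:

# Posterior predictive for one new trial in the Binomial model,
# assuming a Beta(1, 1) prior so the posterior is Beta(y + 1, n - y + 1).
y <- 7; n <- 10; S <- 10000
theta <- rbeta(S, y + 1, n - y + 1)           # draws from p(theta | y)
y_tilde <- rbinom(S, size = 1, prob = theta)  # draws from p(y_tilde | y)
mean(y_tilde)                                 # approximates p(y_tilde = 1 | y)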

"There was a time where people dismissed priors as arbitrary, meanwhile forgetting models are also arbitrary".

Rough taxonomy of priors (see Stan-dev team's recommendations):

  • Vague/flat/diffuse/non-informative - The attitude is to let the data speak for itself. A flat prior is not uninformative, and it can be stupid. Making the prior flat somewhere can make it non-flat somewhere else.
  • Proper - A prior whose density integrates to 1.
  • Improper - A prior whose density does not have a finite integral but can still give a valid posterior distribution.
  • Weakly informative - Produce computationally better behaving posteriors. In practice, we might know the scale of certain quantities.

Example of using informative priors: a study on the percentage \(\theta\) of girl births as a function of the parents' attractiveness \(x\), measured on a 1-5 scale. A linear regression model \(\theta= ax+b\) shows a positive slope. However, this uses a default prior; simulating \(a,b\) from it gives linear fits that are all over the place, whereas an informative prior gives linear fits whose slopes are effectively 0.

Incorrect priors introduce bias, but often give similar estimates due to reduced variance (#ASK Is there a bias-variance theorem in Bayesian terms?).

A quantity \(t(y)\) is called a sufficient statistic for \(\theta\) if the likelihood for \(\theta\) depends on the data \(y\) only through the value \(t(y)\).

Central limit theorem: given finite variance, the distribution of the mean of \(n\) iid random variables approaches a Normal distribution as \(n\to \infty\).

Some (one parameter) models of interest:

  • Poisson model for count data (e.g. epidemiology)
  • Exponential model for time to some event (e.g. particle decay)

Multi-dimensional posteriors

The Normal distribution is used in:

  • Linear regression, giving another view for least squares.
  • Approximations for both continuous and discrete observations.
  • Many priors due to convenience.
  • Gaussian processes are in practice multivariate normals.
  • Kalman filters are Normals with chain rule.
  • Approximations such as Laplace, Variational Inference (VI) and Expectation Propagation (EP).

Recall that the pdf of the Normal distribution is \(p(\theta|\mu,\sigma)=\frac{1}{\sqrt{2\pi}\sigma}\exp(-\frac{1}{2\sigma^2}(\theta-\mu)^2)\). Expectations under the posterior generally have no closed form and must be approximated.

A crude sampling-based approximation for \(\mathbb{E}[f(\theta)]=\int f(\theta) p(\theta) d\theta\) starts by dividing the domain into boxes and approximating the area under the pdf (which equals probability mass) by the total area of the boxes. In the 1D case, the regions are of the form \((\theta_i,\theta_{i+1})\), and each box has area \((\theta_{i+1}-\theta_i)p(\theta_{i+\frac{1}{2}})\), where \(\theta_{i+\frac{1}{2}}\) is the midpoint of \((\theta_i,\theta_{i+1})\). The total area of all such boxes is approximately 1, and we can approximate the integral by \(\sum_{i} f(\theta_{i+\frac{1}{2}})(\theta_{i+1}-\theta_i)p(\theta_{i+\frac{1}{2}}) = \sum_i f(\theta_{i+\frac{1}{2}})w_{i+\frac{1}{2}}\). If we instead use draws from \(p(\theta)\) and shrink the widths until each box (or bin, in histogram terms) contains exactly 1 draw, each draw gets weight \(1/S\), where \(S\) is the total number of draws. This is called the Monte Carlo approximation.
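A minimal R sketch of this Monte Carlo approximation, using a Normal as a stand-in for \(p(\theta)\):

# Monte Carlo approximation of E[f(theta)]: each draw gets weight 1/S.
S <- 10000
theta <- rnorm(S, mean = 1, sd = 2)  # stand-in draws from p(theta)
f <- function(t) t^2                 # some function of interest
mean(f(theta))                       # approximates E[theta^2] = 1^2 + 2^2 = 5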

Given a joint distribution over parameters \(p(\theta_1,\theta_2|y)\propto p(y|\theta_1,\theta_2)p(\theta_1,\theta_2)\), marginalization is the process of integrating out some of the parameters to get a marginal distribution \(p(\theta_1|y)=\int p(\theta_1,\theta_2|y)d\theta_2\). This can be done via the Monte Carlo approximation \(p(\theta_1|y) \approx \frac{1}{S}\sum_{s=1}^S p(\theta_1|\theta_2^{(s)},y)\) where \(\theta_2^{(s)}\sim p(\theta_2|y)\).

The posterior predictive distribution is computed via marginalization. For the Normal distribution, we can get marginal distributions in closed form depending on the chosen priors.

For comparing means of 2 Normals, we could generate samples \(\theta^{(s)}_1,\theta^{(s)}_2\) and compute their difference \(\delta^{(s)}\) for each \(s=1,\cdots,S\).
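A hedged R sketch of this comparison, with rnorm calls standing in for the posterior draws of the two means (the numbers are hypothetical):

# Comparing group means via posterior draws; the rnorm calls are stand-ins
# for draws from each group's marginal posterior.
S <- 4000
mu1 <- rnorm(S, 26.1, 0.6)       # hypothetical posterior draws, group 1
mu2 <- rnorm(S, 24.3, 0.8)       # hypothetical posterior draws, group 2
delta <- mu1 - mu2               # draws from the posterior of the difference
mean(delta > 0)                  # P(group 1 mean > group 2 mean | y)
quantile(delta, c(0.05, 0.95))   # 90% central posterior interval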

There is a newly recommended LKJ prior for multivariate normals; see Appendix A and the Stan manual.

Many useful distributions can be represented as a scale mixture of Normals:

  • Student-t
  • Cauchy
  • Laplace
  • Horseshoe
  • R2-D2

For categorical data beyond the binary case, the multinomial model is useful. The observation model for \(k\) categories is \(p(y|\theta)\propto \prod_{j=1}^k \theta_j^{y_j}\) where \(y_j\) is the count for category \(j\).

A Generalized Linear Model (GLM) is a model of the form \(y_i \sim p(g^{-1}(\alpha+\beta x_i),\phi)\) where:

  • \(p\) is non-Gaussian (classically in the exponential family but modern MCMC approaches often allow us to bypass this)
  • \(g\) is a link function which maps a non-linear relationship between parameters and data to a linear one.
  • \(\phi\) are extra model parameters

One example is Logistic Regression.
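A sketch of how such a model might be fit with rstanarm (the data frame df and its columns are simulated stand-ins, not from the course):

library(rstanarm)
# Logistic regression: Bernoulli observation model with a logit link,
# so g^{-1} is the inverse-logit. The data here are simulated stand-ins.
df <- data.frame(x = rnorm(100))
df$y <- rbinom(100, 1, plogis(0.5 + 1.5 * df$x))
fit <- stan_glm(y ~ x, data = df, family = binomial(link = "logit"),
                prior = normal(0, 2.5))  # weakly informative slope prior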

Monte Carlo methods

(#ASK Equation 10.2 in the book has 2 integrals one in the numerator another in the denominator, why is it not 1 integral, in which case taking the same sample and evaluating it in the numerator and denominator make more sense.)

Floating point representation of numbers means we need to work with log densities, taking the exponential only at the very final step. For example, even with 64 bits the closest representable value to 0 is \(\approx 2.2\times 10^{-308}\), meaning the joint density of even 600 iid Normal draws underflows to 0, while the closest value to 1 is \(\approx 1 \pm 2.2\times 10^{-16}\), which gives rounding errors when evaluating the cumulative density of the Beta distribution.
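A small R illustration of why log densities are needed, together with the standard log-sum-exp trick for summing densities stably:

# Joint log density of 600 iid standard Normal draws.
lp <- dnorm(rnorm(600), log = TRUE)
sum(lp)          # fine in log space (around -850)
exp(sum(lp))     # underflows to 0 in double precision

# log-sum-exp trick: compute log(sum(exp(lp))) without underflow.
log_sum_exp <- function(lp) {
  m <- max(lp)
  m + log(sum(exp(lp - m)))
}
log_sum_exp(lp)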

The variance of the Monte Carlo estimator is \(\frac{\sigma_\theta^2}{S}\), where \(\sigma^2_\theta\) is the variance of the distribution we sample from. Note that this variance is independent of the dimension of the distribution.

When using Monte Carlo estimates for quantiles, as we go to lower probability mass regions (more to the left or right in unimodal 1D distributions), there will be higher variance. Similarly, to estimate small probabilities (via expectations of the indicator function) a large number of draws is needed.

If the inverse cumulative distribution function \(F^{-1}\) of a distribution is available, we can draw from the uniform distribution over \([0,1]\) and pass the draws through \(F^{-1}\) to get valid samples from the distribution whose cdf is \(F\). There are other direct simulation methods, e.g. the Box-Muller method.
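A minimal inverse-CDF sampling sketch in R, using the Exponential distribution, where \(F^{-1}(u) = -\log(1-u)/\lambda\):

# Inverse-CDF sampling: push Uniform(0, 1) draws through F^{-1}.
S <- 10000
u <- runif(S)
x <- -log(1 - u) / 2   # draws from Exponential(rate = 2)
mean(x)                # approximately 1/2, the Exponential mean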

Rejection Sampling - Requires a proposal density that, after suitable scaling, is greater than or equal to the target density at all points. Often used to sample from truncated distributions. The effective sample size is the number of accepted samples.
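A sketch of the truncation use case in R: propose from the untruncated Normal and keep only the draws inside the support (here, a Normal truncated to positive values):

# Sampling a Normal truncated to theta > 0 by rejection.
S <- 10000
proposal <- rnorm(S)
accepted <- proposal[proposal > 0]  # keep draws in the truncated support
length(accepted)                    # effective sample size, roughly S/2 here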

Importance Sampling - Removes the constraint that the proposal density be greater than the target density, instead requiring only that its support be a superset of the support of the target density. The weights should behave nicely: if they have finite variance, by the Central Limit Theorem (CLT) the variance of the estimate goes down as \(\frac{1}{S}\). We can sometimes guarantee finite variance by construction, but in general this is not trivial. If the variance is infinite but the mean is finite, we can use results from the generalized CLT and asymptotic consistency. In special cases it scales really well, even to 10,000 dimensions, e.g. in:

  • Fast leave-one-out cross-validation
  • Fast bootstrapping
  • Fast prior and likelihood sensitivity analysis
  • Conformal Bayesian computation
  • Particle filtering
  • Improving distributional approximations (Laplace approximation, Variational Inference)

Sampling Importance Resampling (SIR) - We would want the weights mentioned above to be more or less equal so that the estimator resembles the plain Monte Carlo estimator. This can be achieved by resampling the draws according to the weights, which means some of the original draws will be included more than once.

From the normalized weights \(\tilde{w}\), the Effective Sample Size (ESS) can be computed as \(S_{eff}=\frac{1}{\sum_{s=1}^S \tilde{w}(\theta^{(s)})^2}\). If all the weights are equal, then each \(\tilde{w}=\frac{1}{S}\) and \(S_{eff}=S\); if exactly one of the normalized weights is 1 and the others are 0, \(S_{eff}=1\).
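A combined R sketch of importance sampling, the ESS formula above, and the SIR resampling step, with a Student-t proposal and a standard Normal target (both stand-ins):

# Importance sampling with self-normalized weights, ESS, and SIR.
S <- 10000
theta <- rt(S, df = 4)   # heavy-tailed proposal draws (stand-in)
log_w <- dnorm(theta, log = TRUE) - dt(theta, df = 4, log = TRUE)
w <- exp(log_w - max(log_w))    # stabilized importance ratios
w_tilde <- w / sum(w)           # self-normalized weights
sum(w_tilde * theta)            # IS estimate of E[theta] (here about 0)
1 / sum(w_tilde^2)              # effective sample size
theta_sir <- sample(theta, S, replace = TRUE, prob = w_tilde)  # SIR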

The efficacy of importance sampling approaches can be checked using the Pareto-k diagnostic (see slide 47 for more applications of this).

Markov Chain Monte Carlo (MCMC)

Andrey Markov proved the weak Law of Large Numbers (LLN) and the CLT for certain dependent sequences of random variables, i.e. Markov chains. In a Markov chain, the probability of any upcoming event depends only on the current state, i.e. there is no memory.

The idea of MCMC methods is to draw from a Markov Chain which is constructed such that the equilibrium distribution is the posterior distribution we want to sample from.

The Gibbs sampler samples from the conditional distribution of each variable one at a time. With conditionally conjugate priors, sampling from these conditionals becomes easy, with no tuning parameters. One problem, however, is that it takes a long time to explore highly correlated distributions.

One idea is to generate samples from a multivariate distribution instead of one variable at a time. For example, suppose we use a Gaussian as the proposal distribution and accept samples based on some acceptance ratio - this is the idea behind the Metropolis and Metropolis-Hastings algorithms.

To prove that MCMC approaches work, we first have to show the sequence is indeed a Markov chain with a unique stationary distribution, and prove that this stationary distribution is the target distribution. The first part involves proving irreducibility, aperiodicity and recurrence. The second part involves showing reversibility.

In practice we run several chains for diagnostic purposes. Some of the initial samples are discarded in what is called warm-up.

The most commonly used convergence diagnostic is \(\hat{R}\) the Potential Scale Reduction Factor (PSRF), which compares the means and variances of multiple chains.

Autocorrelation can be used to assess the mixing of MCMC. It can also be used to compute ESS, where \(S_{eff}=\frac{S}{\tau}\), where \(\tau\) is the sum of the autocorrelation values (in both directions) for \(t=1,2,\cdots\). In practice, the autocorrelations are estimated over multiple chains (slide 41), and for some finite \(t=1,\cdots,T\). The truncation point \(T\) can be decided adaptively, see Geyer's adaptive window estimator.
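A single-chain R sketch, assuming the common estimator \(\tau \approx 1 + 2\sum_{t\geq 1}\hat{\rho}_t\) with a fixed truncation (the AR(1) chain below is a stand-in for MCMC draws):

# ESS from autocorrelation: S_eff = S / tau, tau = 1 + 2 * sum(rho_t).
S <- 5000
x <- arima.sim(model = list(ar = 0.9), n = S)       # autocorrelated chain
rho <- acf(x, lag.max = 100, plot = FALSE)$acf[-1]  # drop the lag-0 term
tau <- 1 + 2 * sum(rho)
S / tau   # effective sample size, far below S for this chain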

Stan, Hamiltonian Monte Carlo (HMC) and Probabilistic Programming Languages (PPLs)

  • Diverging trajectories are an important diagnostic.
  • Exceeding the maximum treedepth (i.e. too many doublings) in NUTS is also a diagnostic.
  • #DOUBT Where does slice sampler sample from exactly?
  • #ASK For HMC divergences (e.g. funnel) could we maybe apply a transform to the joint density such that for e.g. we match it to an isotropic Gaussian using Normalizing flows? (Possible project idea) Simple mass scaling does not help for nonlinear dependencies, so maybe this means for complex mass scaling? Deep kernel learning?
  • Is sampling \(\phi\) separately equivalent to saying the augmented model \(p(\theta,\phi)\) is decomposed as \(p(\theta) p(\phi|\theta)\)? Here, what's the relation between using the quadratic form with \(\phi\) and having some Gaussian involved? Could that instead be any distribution, specifically one which can e.g. give more information about \(\theta\)?

Hierarchical models and exchangeability

In a simple model, we want the posterior for the parameters, meanwhile in hierarchical models we also want the posterior for the prior parameters.

For example, if we model patient deaths \(y_i\) with a parameter \(\theta_i\) for hospital \(i\), this is a model with separate/independent effects, while having the same \(\theta\) for all hospitals results in a model with a common effect (a pooled model). A hierarchical model achieves a mix of these by including a parameter \(\theta_i\) for each hospital, where the \(\theta_i\) themselves share a common hyperparameter \(\tau\), whose posterior we are also interested in.

We say parameters/observations are exchangeable if their joint distribution is invariant to permutation of their indices. Assuming exchangeability means we can justify the use of a joint model for the data and a prior for model parameters. It is less restrictive than the independence assumption.

If there is no information to separate the parameters/observations from each other, we can make the assumption that they are exchangeable - ignorance here implies exchangeability.

Assuming exchangeability does not mean we believe the results of the experiments cannot be different. We could know, for example, that some observations will be higher or lower than others, but if we do not know which is which a priori, we can still make this assumption.

See also conditional exchangeability and partial exchangeability.

An example of exchangeable but not independent parameters: a die with 6 sides, with probability \(\theta_i\) that it shows face \(i\). We may not know whether the die is fair, but the constraint that the probabilities of all sides must add to 1 means the parameters \(\theta_i\) are not independent, so their joint distribution cannot be factored using the iid assumption.

Reading: BDA ch5

Model checking and cross-validation

See Bayesian Workflow 2019 paper, Chapter 6,7 of BDA textbook.

In a typical Bayesian workflow we may start from what we think are sensible parameters for the model we have constructed. We then want to check whether samples generated a priori are sensible, e.g. does increasing a hazardous chemical concentration increase mortality? If it does not, we need to iterate.

Similarly, we may want to validate the model's predictions on completely new data, and check that real-world constraints are obeyed, e.g. in physical models, are all speeds less than the speed of light?

The bayesplot package in R has more diagnostics for model checking. Some functions of interest include the following (a usage sketch follows the list):

  • ppc_dens_overlay
  • ppc_ecdf_overlay
  • ppc_scatter
  • ppc_stat / ppc_stat2d
  • ppc_hist
  • ppc_rootogram
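For example, a hedged sketch of the density overlay check (the observed vector y and the replicate matrix yrep are simulated stand-ins; normally yrep comes from the fitted model):

library(bayesplot)
# y: observed data; yrep: S x N matrix of replicated datasets.
y <- rnorm(100)
yrep <- matrix(rnorm(50 * 100), nrow = 50)  # stand-in replicates
ppc_dens_overlay(y, yrep)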

In posterior predictive checking, a good test statistic \(T(y,\theta)\) is (almost) ancillary, i.e. it depends only on observed data and its distribution is independent of the model parameters. We compare \(T(y,\theta)\) and \(T(y^{rep},\theta)\), i.e. statistics from the actual data and from replicated data generated from the fitted model. This is best done for several replicated datasets, often of the same size as the original dataset.

Marginal and cross-validation (CV) predictive checking involves considering the predictive distributions \(p(\tilde{y}_i|y)\) for each observation separately. The marginal posterior p-values are \(p_i = p(T(y_i^{rep})\leq T(y_i) |y)\), and if \(T(y_i)=y_i\) this is simply \(p(y_i^{rep}\leq y_i|y)\), the predictive cdf evaluated at the actual data. If \(p(\tilde{y}_i|y)\) is well calibrated, the \(p_i\)'s will be uniform on [0,1] (consider reversing the process of generating samples from a known cdf). See the "Visualization in Bayesian workflow" paper, demosrstan/ppc/poisson-ppc/Rmd, and Aki's webpage (+ the CV FAQ there) for concrete examples.

A sensitivity analysis studies how different model structures and priors affect the results.

True predictive performance can be computed by making predictions and comparing them to unseen true observations, i.e. via external validation. With infinite data this would be feasible, but since we do not have it, we settle for expected predictive performance as an approximation. This involves choosing an application-specific utility/cost function. If choosing such a utility is a challenge, the theoretically justified default is the log score \(\log p(y^{rep}|y,M)\).

Some relevant packages for CV in R are elpdf_dd, loo.

Some values of interest include:

  • Leave-one-out (LOO) residual: \(y_i-\mathbb{E}[\tilde{y}|\tilde{x}=x_i,x_{-i},y_{-i}]\), which can then be used to compute RMSE, \(R^2\), 90% error, etc.

  • LOO predictive distribution: \(p(\tilde{y}|\tilde{x}=x_i,x_{-i},y_{-i})=\int p(\tilde{y}|\tilde{x}=x_i,\theta)p(\theta|x_{-i},y_{-i})d\theta\), which can be compared to what we get when we include all the data (i.e. dropping the \(-i\) subscripts).

LOO-CV can be used for time-series since the latent function(s) obey exchangeability even if the actual observations do not. #DOUBT.

However, for time series, leave-future-out CV is better for predicting the future. Another option is \(m\)-step-ahead leave-a-block-out CV, which can help in learning about hyperparameters.

Overall, view CV as a tool for analysing different parts of the model even if there is no clear prediction task.

Model comparison and selection

In importance sampling leave-one-out cross-validation, we wish to compute \(p(y_i|x_{-i},y_{-i})=\int p(y_i|x_i,\theta)p(\theta|x_{-i},y_{-i})d\theta\). Here the proposal distribution is the full posterior, \(\theta^{(s)}\sim p(\theta|x,y)\), the target distribution is the LOO posterior \(p(\theta|x_{-i},y_{-i})\), and the importance ratio is \[w_i^{(s)} = \frac{p(\theta^{(s)}|x_{-i},y_{-i})}{p(\theta^{(s)}|x,y)}\propto \frac{1}{p(y_i|\theta^{(s)})} \]

This gives the self-normalized weights \[\tilde{w}_i^{(s)} = \frac{w_i^{(s)}}{\sum_{s'=1}^S w_i^{(s')}}\]

These self-normalized weights can be used to approximate the predictive posterior as \(p(y_i|x_i,x_{-i},y_{-i})\approx \sum_{s=1}^S \tilde{w}_i^{(s)}p(y_i|x_i,\theta^{(s)})\).

Using the definitions of \(\tilde{w}_i^{(s)}\) and \(w_i^{(s)}\), this reduces to \(p(y_i|x_i,x_{-i},y_{-i})\approx \frac{1}{\frac{1}{S}\sum_{s'=1}^S w_i^{(s')}} = \frac{1}{\frac{1}{S}\sum_{s'=1}^S \frac{1}{p(y_i|\theta^{(s')})}}\)
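A minimal R sketch of these computations, assuming log_lik is an \(S \times N\) matrix of pointwise log-likelihood draws under the full posterior (the stand-in values below are for illustration only; in practice the loo package does this with Pareto smoothing):

# Stand-in pointwise log-likelihood draws (S draws x N observations).
S <- 4000; N <- 5; i <- 1
log_lik <- matrix(dnorm(rnorm(S * N), log = TRUE), nrow = S)
# Raw importance ratios for observation i: 1 / p(y_i | theta^(s)).
log_ratios <- -log_lik[, i]
w <- exp(log_ratios - max(log_ratios))  # stabilized in log space
w_tilde <- w / sum(w)                   # self-normalized weights
sum(w_tilde * exp(log_lik[, i]))        # LOO predictive density for y_i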

Either way, the distribution of the self-normalized weights matters; notably, ESS \(\approx \frac{1}{\sum_{s=1}^S (\tilde{w}^{(s)})^2}\). This, however, assumes finite variance.

Pareto \(\hat{k}\) estimates and Pareto smoothing can be used to get diagnostics beyond the finite variance assumption.

Pareto Smoothed Importance Sampling (PSIS) replaces the largest weights with order statistics of a fitted Pareto distribution, motivated by the fact that for many distributions the far tail can be modelled by a Pareto distribution. This gives better variance than plain importance sampling and reduced bias compared to truncated importance sampling.

In Stan code:

generated quantities {
  vector[N] log_lik;
  for (i in 1:N)
    log_lik[i] = normal_lpdf(y[i] | mu[i], sigma);
}

The for loop is needed because the vectorized normal_lpdf call returns a single summed log density for the whole vector y, whereas LOO needs the pointwise values.
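With log_lik stored this way, a typical PSIS-LOO call might look as follows (fit is a hypothetical stanfit object; the functions are from the loo package):

library(loo)
# Extract the pointwise log-likelihood draws, kept per chain.
log_lik <- extract_log_lik(fit, merge_chains = FALSE)
r_eff <- relative_eff(exp(log_lik))   # corrects ESS for MCMC autocorrelation
loo_result <- loo(log_lik, r_eff = r_eff)
print(loo_result)                     # elpd_loo and Pareto k-hat diagnostics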

If one gets many high Pareto \(\hat{k}\)'s in the loo function, try running it with moment_matching=TRUE to use implicitly adaptive importance sampling with moment matching (see Paananen et al 2021). Another "fix" is to run MCMC for folds with \(\hat{k}\) above a certain threshold, which can be done by including the argument k_threshold=0.7 for example.

In comparison to WAIC, PSIS-LOO is more accurate, more flexible (e.g. it can be adapted to leave-one-group-out, etc.), has better diagnostics and makes the assumptions clearer. Other information criteria exist, e.g. AIC, DIC, BIC, TIC, NIC, RIC, PIC, BPIC, QIC, QICc, etc., which are not Bayesian.

In general, marginal likelihood and Bayes factor based methods are not advisable for more complex models. Through the chain rule of probability, the marginal likelihood is equivalent to leave-future-out 1-step-ahead cross-validation starting from 0 observations: the first observation is predicted using the prior alone, the model is then fit to that observation to predict the second, and so on. This makes the marginal likelihood very sensitive to the choice of priors, and unstable when the model is misspecified.

Sometimes, cross-validation is not needed, and posterior predictive checks are often sufficient.

Is model selection needed to avoid overfitting? Not necessarily. Consider a polynomial basis function regression problem versus a basis function approximation to a Gaussian process: increasing the number of basis functions in the former leads to overfitting, while increasing it in the latter does not. Overall, there is no overfitting when using good priors that keep the prior in the predictive space approximately unchanged as more components are added, e.g. Gaussian processes, regularized horseshoe priors, R2-D2 and R2-D2-M2 priors, etc. #ASK What are the basis functions used in slide 56/66?

Predictive model comparison can be useful: for example, the marginals of 2 coefficients could both overlap 0 even though the coefficients are highly collinear. #TODO see collinearity demo.

There are other options when, given 2 models, one is not better than the other.

#TODO Finish rest of lecture from the polynomial example above.

Decision analysis

Bayesian decision theory involves the following components:

  • Potential decisions \(d\) (sometimes called actions \(a\)).
  • Potential consequences \(x\), which may be categorical, ordinal, etc.
  • Distributions of consequences given decisions, \(p(x|d)\). Since the decisions are set by us, \(p(d)\) does not exist. #ASK What if we had a prior over decisions before we see any data?
  • A utility function \(U(x)\) that quantifies consequences to some value in \(\mathbb{R}\), e.g. USD, expected lifetime etc.

The proper protocol is then to choose the decision \(d^*\) that maximizes the expected utility, i.e. \(d^* = \text{argmax}_d\mathbb{E}[U(x)|d]=\text{argmax}_d\int U(x)p(x|d)\:dx\).

For example, one may find a paw print \(f\) which could be from a dog or a wolf. Suppose we want to go pick mushrooms; the decision to go or not depends on whether it was in fact a wolf \(w\) or not. We can assign utilities to each of the 4 possibilities (staying home and the paw print is a dog's, going to the forest and the paw print is a dog's, …), and use the posterior probability of the paw print being from a dog or a wolf to compute the expected utility (a numeric sketch follows the list below). There are 3 different things we can look at here:

  • Maximum likelihood estimate (MLE), \(\text{argmax}_w p(f|w)\).
  • MAP estimate, \(\text{argmax}_w p(f|w)p(w)\).
  • Maximum utility decision, which suggests the decision to take based on the utility.
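A numeric sketch of the maximum utility decision in R, with hypothetical utilities and a hypothetical posterior probability that the print is a wolf's:

# Expected utility of each decision; all numbers are hypothetical.
p_wolf <- 0.3                                 # posterior p(wolf | paw print)
U <- rbind(go   = c(wolf = -100, dog = 10),   # utilities U(x) per outcome
           stay = c(wolf = 0,    dog = 0))
EU <- U %*% c(p_wolf, 1 - p_wolf)             # E[U(x) | d] for each decision
rownames(U)[which.max(EU)]                    # maximum utility decision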

In real-world applications, not everyone agrees on the utility function, and what can be agreed on is often not a linear function.

Model selection is also a decision problem, for e.g. when selecting the best predictive distribution, the log score (#ASK was it log score or log loglikelihood?) utility is used.

In real world applications, decision problems may have multiple stages. Such multi-stage decision making problems involve many unknown variables, and earlier decisions can be "designed" to be chosen such that they provide the most information - this is called sequential experimental design.

Normal approximation, frequency properties

The posterior distribution often converges to a Gaussian under certain assumptions as \(n \to \infty\) (bounded posterior, non-singular, number of parameters does not grow with \(n\)). This suggests that in 1D we can make the approximation \(p(\theta|y)\approx \mathcal{N}(\theta|\hat{\theta},\sigma_{\theta})\). The mean is the mode \(\hat{\theta}\) of \(p(\theta|y)\). For the variance, first note that near the mode \[ \log p(\theta|y) \approx \alpha(\theta-\hat{\theta})^2+C \] Comparing with the Taylor series expansion around \(\theta=\hat{\theta}\), \[ f(\theta)=f(\hat{\theta}) + f'(\hat{\theta})(\theta-\hat{\theta})+\frac{f''(\hat{\theta})}{2!}(\theta-\hat{\theta})^2 + \cdots \] As \(\hat{\theta}\) is the mode, the first-order term disappears, and assuming terms beyond the quadratic vanish as \(n\to \infty\), we have \[ \log p(\theta|y) \approx \log p(\hat{\theta}|y) + \frac{1}{2}\left[\frac{d^2}{d\theta^2}\log p(\theta|y)\right]_{\theta=\hat{\theta}}(\theta-\hat{\theta})^2 \] so the approximate variance is \(\sigma_\theta^2 = \left(-\left[\frac{d^2}{d\theta^2}\log p(\theta|y)\right]_{\theta=\hat{\theta}}\right)^{-1}\).

Going to higher dimensions, we have \[ \log p(\theta|y) \approx \log p(\hat{\theta}|y) + \frac{1}{2}(\theta - \hat{\theta})^T \left[\nabla^2 \log p(\theta|y)\right]_{\theta=\hat{\theta}} (\theta - \hat{\theta}) \] with the covariance given by the negated inverse Hessian at the mode.

See BDA3 Chapter 4 for an example where this is easy to compute.
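A minimal sketch of this normal (Laplace) approximation in R, using a known log density as a stand-in for the log posterior:

# Laplace approximation: mode from optim, variance from the inverse
# Hessian of the negative log posterior at the mode.
log_post <- function(theta) dnorm(theta, 2, 1, log = TRUE)  # stand-in target
fit <- optim(0, function(t) -log_post(t), method = "BFGS", hessian = TRUE)
theta_hat <- fit$par              # posterior mode
Sigma <- solve(fit$hessian)       # approximate posterior variance
c(theta_hat, sqrt(Sigma))         # approximately (2, 1) for this stand-in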

If the density is close to Gaussian, often around 50 density evaluations suffice, far fewer than MCMC requires.

Accuracy can be further improved using importance sampling.

Other distributional approximations: we can in theory use higher-order derivatives at the mode, but this may give unexpected results in the tails. The split-normal and split-t (Geweke 1989) use additional scaling along the principal axes. Other distributions such as the Student-t could be used. See Chapter 13 for Variational Inference (VI) and Expectation Propagation (EP). VI covers many methods; for some models we can derive deterministic algorithms, and some methods also allow mini-batching. In general, however, VI is unlikely to achieve the accuracy of HMC at the same computational cost.

In large sample theory, we always assume there is some true underlying data distribution \(f(y)\) whose exact form is not important as long as it has some certain regularity conditions.

We have consistency if the true data distribution \(f(y)\) is included in the parametric family, so that \(f(y)=p(y|\theta_0)\) for some \(\theta_0\); in this case the posterior converges to the point \(\theta_0\) as \(n \to \infty\). If the true distribution is not in the parametric family, there is no true \(\theta_0\), and it is replaced by the \(\theta_0'\) that minimizes the KL divergence from \(f(y)\) to \(p(y|\theta_0')\). Both statements hold for maximum-likelihood estimators as well.

Counterexamples to large sample theory: we can have under- or non-identifiability. When a model is under-identifiable, it has parameters or parameter combinations about which the data carry no information, and there is no single point \(\theta_0\) to which the posterior converges. For example, in the model \(y \sim \mathcal{N}(a+b+cx,\sigma)\), only the sum \(a+b\) is identified, so the posterior converges to a line, with the prior determining the density along the line.

Another example is a 2D Gaussian \((U,V)\) with zero means and unit variances but unknown correlation \(\rho\). If we never observe the two components at the same time, \(\rho\) is unidentifiable. (#DOUBT How to prove unidentifiability precisely?) If pairs are sometimes sampled together (e.g. 10% of the time \(U\) is sampled we also sample \(V\), and vice versa), then \(\rho\) is asymptotically identifiable.

Another example of under-identifiability arises in Gaussian Mixture Models (GMMs): if the component means are switched, so are the covariances and the mixture coefficients. In a 1D mixture this means the posterior has 2 modes which are mirror images of each other, and the posterior does not converge to a single point. This is called aliasing.

From an MCMC perspective this makes convergence diagnostics harder, since it is difficult to distinguish aliasing from other kinds of multimodality.

Large sample theory also fails if the number of parameters increases with the number of observations. For example, in the time series model \(y_t \sim \mathcal{N}(\theta_t,\sigma)\), the posterior of \(\theta_t\) does not converge to a point if more data does not bring enough information about each \(\theta_t\). (#DOUBT Does more data mean more (noisy) data for a specific \(t\)?)

We could also have unbounded likelihoods. For example, in a GMM with known mixture coefficients (not 0 or 1), centering one component on a single observation and letting its variance go to 0 makes the likelihood grow without bound; if the prior on that variance does not go to zero as the variance goes to 0, the posterior is unbounded.

Improper posteriors: e.g. a Binomial model with a \(\text{Beta}(0,0)\) prior and observation \(y=n\) gives a posterior density that goes to infinity as the parameter approaches 1. This is a problem for any inference method, including MCMC, and can be avoided with proper priors.

And of course the prior distribution may not even include the convergence point. It could also be that the convergence point is at the edge of the parameter space, e.g. \(y_i \sim \mathcal{N}(\theta,1), \theta>0\).

Recall that Bayesian theory accounts for epistemic and aleatoric uncertainty, while frequentist methods account only for aleatoric uncertainty. In frequentist statistics, uncertainty estimates are based on all possible datasets that could have been generated by the data generating mechanism, so inference is based also on unobserved data, and estimators are derived to fulfil frequency properties. They strive for unbiasedness, minimum variance and calibration of confidence intervals given finite data. The unbiasedness requirement may give high-variance, useless estimates. Confidence intervals are defined to contain the true value in \(\alpha \%\) of the cases under repeated data generation from the data generating mechanism.

Some common statistical tests with Bayesian models:

  • t-test : mean of data
  • paired t-test : mean of differences
  • two-sample t-test: group means
  • ANOVA : hierarchical models

In frequentist hypothesis testing, we can make estimates and confidence intervals, but for some reason null hypothesis testing has had a very big role (#ASK Is this because of falsification?). Reporting only the null hypothesis testing result throws away a lot of useful information. One problem with frequentist hypothesis testing is that it asks what happens if the data were generated from the smaller model, without telling whether the more complex model is good enough; doing model checking for all models is preferable.

In the Bayesian setting, instead of hypothesis testing, we can report the full posterior distribution, and use it to compare expert information or combine with utility/cost function. People do all sorts of things except using the full posterior, see slides for more information.

Bayes factors are (very) sensitive to the prior choice. (#TODO Understand the simulation experiment plot)

Gaussian Process (GP) example: a case where optimization gives worse results than integration for smaller sample sizes, and where heteroskedastic noise is modelled as another GP.

Logistic regression example showing that model selection does not necessarily avoid overfitting: with 100 samples and 30 completely irrelevant covariates, integrating over the posterior still gives worse test-set log probability as the number of covariates increases. (#TODO Think about this example very deeply). A Gaussian prior with mean 0 and variance 3 gives prior predictive probabilities close to 0 or 1 as the number of covariates increases; scaling the prior with the number of predictors mitigates this. Going beyond to weakly relevant covariates, the scaled prior improves test log probability as covariates are added, while the unscaled version stays approximately constant. With correlated relevant covariates, the unscaled prior gives worse log probability while the scaled prior improves it, with a roughly constant gap to the log probability of the training data.

Extended topics

Assignments

Main project

Registration deadline: <2022-11-08 Tue> Project Deadline: <2022-12-04 Sun>

Choose a non-trivial dataset and report a Bayesian workflow performed to analyse this data. Ideas for datasets are mentioned in the course webpage.

This involves:

  • Making a notebook in Python or R.
    • This should be similar to one from Stan case studies.
    • Page count is limited to 20, including figures.
  • Oral presentation, sometime between <2022-12-12 Mon>-<2022-12-16 Fri>.
    • A high level but sufficiently detailed presentation of 10 minutes.

The overall Bayesian workflow should involve:

  • Description of the data and a thorough analysis of the problem.
  • Description of at least 2 models, e.g. non-hierarchical and hierarchical, linear and non-linear, models under variable selection, etc.
  • Informative or weakly informative priors, with description of prior choices.
  • Stan code, or rstanarm/brms code, alongside details of use such as the parameters used, the system it is run on, etc.
  • Convergence diagnostics such as \(\hat{R}\), divergences, effective sample size etc.
  • Posterior predictive checks.
  • Model comparison e.g. with Leave-one-out.
  • Predictive performance assessment if applicable.
  • Sensitivity analysis with respect to prior choices.
  • Discussion of challenges and potential improvements.
