Bayesian Inference, Shakir Mohamed - MLSS 2020, Tübingen - 2020

Details

Title : Bayesian Inference, Shakir Mohamed - MLSS 2020, Tübingen
Author(s): virtual mlss2020
Link(s) : https://www.youtube.com/watch?v=x4Y90zPjbq0, https://www.youtube.com/watch?v=DItJbz2OH5U

Rough Notes

Bayesian Basics

The goal is to talk about:

  • Concepts in probability and Bayesian analysis
  • Separation of model, inference and algorithm
  • Bayesian applications and values

Some common definitions of probability:

  • Statistical probability, based on frequency ratios of events.
  • Logical probability, representing a degree of confirmation of a hypothesis based on logical analysis.
  • Probability as propensity, viewing probability as a physical tendency of a system to produce an outcome, which is mainly useful when we are making predictions.
  • Subjective probability, where probabilities reflect a degree of belief.

Probability is sufficient for the task of reasoning under uncertainty (#DOUBT Why, see Cheeseman's In defense of probability).

Some consequences of using probability as a degree of belief:

  • The "probability of an event" is ill defined, as it depends on the evidence.
  • Is subjective since it depends on the believer's information.
  • Different observers with different information will have different beliefs.

A model is a description of the world, of data, or of potential outcomes, e.g. differential equations. In machine learning, models are coupled with data, whereas in other fields such as fluid dynamics they may not be. A probabilistic model is a model written in the language of probability. In the real world we only observe data, so the aim is to learn probability distributions of this data via probabilistic models. We choose what we want to learn from the data: this could be, e.g., the mean or the entire distribution.

De Finetti's Theorem states that given an infinitely exchangeable sequence of random variables \(X_1,X_2,\cdots\), i.e. for any \(N\) and all permutations \(\pi \in S_N\), \(p(X_1,\cdots,X_N)=p(X_{\pi(1)},\cdots,X_{\pi(N)})\), the joint distribution can be represented as \(p(X_1,\cdots,X_N)=\int \prod_{n=1}^N p(X_n|\theta)p(d\theta)\) (#DOUBT Check for rigorous definition of infinite exchangeability from textbook). This theorem explains why we have parameters and priors, and underlies the power of averaging. Assuming the prior admits a density, \(p(d\theta)\) becomes \(p(\theta)d\theta\). The parameters \(\theta\) can have any dimension, even infinite.
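
As a concrete instance of the theorem, a minimal numpy sketch of the mixture representation for a Beta-Bernoulli model (the Beta(2, 2) prior and sequence length are illustrative choices, not from the lecture):

  import numpy as np

  rng = np.random.default_rng(0)

  def sample_exchangeable_sequence(n, a=2.0, b=2.0):
      """Sample an exchangeable binary sequence via de Finetti's mixture
      representation: theta ~ Beta(a, b), then X_n | theta iid
      Bernoulli(theta). Any permutation of the result has the same
      joint distribution."""
      theta = rng.beta(a, b)                 # latent parameter from the prior
      return rng.binomial(1, theta, size=n)  # conditionally iid given theta

  x = sample_exchangeable_sequence(10)
  # The joint probability of x is the Beta-Bernoulli marginal, which depends
  # only on the number of ones -- hence it is invariant to permutations.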

Bayesian Statistics involves:

  • Deciding on prior beliefs.
  • Describing a model of how the observed data may have been generated, which is your explanation of that process.
  • Updating beliefs as new evidence is gathered, via Bayes' theorem.

There are 2 important quantities:

  • Marginal likelihood, or model evidence \(p(y|x)=\int p(y|h(x);\theta)p(\theta)d\theta\)
  • Posterior distribution \(p(\theta|x,y)\propto p(y|h(x);\theta)p(\theta)\)

All variables that are not observed must be marginalized out; this integral is often intractable.
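
For intuition, a minimal sketch assuming a 1-D Beta-Bernoulli toy model: on a grid the marginalization is easy, but the cost of such grids grows exponentially in the dimension of \(\theta\), which is why the integral is often intractable.

  import numpy as np
  from scipy.integrate import trapezoid
  from scipy.stats import beta, binom

  # Grid approximation of the evidence p(y) = int p(y|theta) p(theta) dtheta
  # for a single scalar parameter; in D dimensions the grid would need
  # ~1000**D points, which is why marginalization is often intractable.
  thetas = np.linspace(1e-4, 1 - 1e-4, 1001)
  prior = beta.pdf(thetas, 2.0, 2.0)          # illustrative Beta(2, 2) prior
  lik = binom.pmf(7, 10, thetas)              # likelihood: 7 ones in 10 trials
  evidence = trapezoid(lik * prior, thetas)   # approximate p(y)
  posterior = lik * prior / evidence          # p(theta | y) on the grid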

E.g. consider a multivariate Gaussian prior on \(\theta\) and a Categorical likelihood \(p(y|x,\theta)\) whose parameter \(\pi(x,\theta)\) is a function of both the data and the parameters; the aim is to get the posterior \(p(\theta|x,y)\) (a sketch follows below). Other examples include unsupervised learning, such as factor analysis or density estimation, and decision-making via probabilistic models for environments, actions and rewards.
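
A sketch of this example's unnormalized log-posterior, assuming the concrete choice \(\pi(x,\theta)=\mathrm{softmax}(x^\top\theta)\) (linear logits; the lecture leaves \(\pi\) abstract):

  import numpy as np
  from scipy.special import log_softmax

  def log_joint(theta, X, y, prior_var=1.0):
      """Unnormalized log-posterior log p(y | h(x); theta) + log p(theta).
      theta: (D, K) weights, X: (N, D) inputs, y: (N,) integer class labels.
      Assumes pi(x, theta) = softmax(x @ theta) and an isotropic Gaussian
      prior with variance prior_var."""
      log_prior = -0.5 * np.sum(theta ** 2) / prior_var
      log_probs = log_softmax(X @ theta, axis=1)       # (N, K) class log-probs
      log_lik = log_probs[np.arange(len(y)), y].sum()  # pick observed labels
      return log_prior + log_lik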

The Bayesian approach is also an interpretive approach to modelling and data analysis. Interpretive frameworks such as Marr's Levels of Analysis are important: they are tools for communication and blueprints against which we examine our own work.

Another framework is "Statistical Operations" from Bradley Efron's 1984 World Lecture. It starts with data enumeration (asking how the data arrives, sampling bias, what we observe and what we don't), followed by summarization (finding similarities) and comparison (finding differences). Modelling connects summarization and data enumeration, while experimental design connects comparison and data enumeration. On top of these sits inference, which is the way we connect the observed data with our defined model.

Another framework (arguably today's lingua franca) is the "Architecture and Loss" framework: we build computational graphs of the model and backpropagate errors over some loss function to update model parameters.

Bayesian analysis suggests the "Model-Inference-Algorithm" framework. Models represent what we believe to be the data generating process; learning principles relate data and models, so a single model can admit many different learning principles; algorithms are the result of combining particular models with particular learning principles. Models include all sorts of graphical models, VAEs, etc. Each model requires statistical inference (the learning principle), which divides into direct methods that compute the marginal probability of the data directly (e.g. Laplace approximation, MLE, MAP, variational inference, cavity methods, EM, noise contrastive learning, SMC, MCMC, integrated nested Laplace approximation) and indirect methods that compare and contrast the observed data with the distribution we have (e.g. two-sample comparison, ABC, maximum mean discrepancy, method of moments, transportation methods). Slide 24 shows examples of algorithms that mix different models and learning principles.

The interpretive frameworks fall under epistemic concerns and reflect epistemic values in our field, i.e. values and concerns which relate to technical work: how we think about SOTA, what the important questions are, how we think about reproducibility and falsifiability, etc. An additional layer on top of epistemic values comes from contextual values, i.e. social, political, cultural and economic values which influence what projects we work on, how we consider the use of data, how we think of the end user, etc.

Bayesian Computation

The goal is to talk about:

  • Probabilistic models and priors.
  • Likelihood, marginalization, prediction.
  • Inference and testing.

Linear regression is the most common example of a model, with the optimization objective being the negative log-likelihood (a sketch follows the list below). We could go further and make use of deep networks and deep hierarchical models. Here two themes arise:

  • Deep learning: allows for rich non-linear models which are very good at classification, sequence prediction etc.; these are scalable and composable with gradient-based methods, albeit giving only point estimates and no clear way to do model selection.

  • Bayesian reasoning: inference is intractable in realistic models, but it provides a unified framework for model building, inference, prediction and decision making, explicitly accounts for uncertainty, and is robust to overfitting.
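
As referenced above, a minimal sketch of the linear-regression negative log-likelihood under Gaussian noise (the noise variance is an illustrative assumption):

  import numpy as np

  def gaussian_nll(w, X, y, noise_var=1.0):
      """Negative log-likelihood of linear regression with Gaussian noise:
      y_n ~ N(x_n @ w, noise_var). Minimizing this in w is exactly least
      squares (the noise variance only rescales and shifts the objective)."""
      resid = y - X @ w
      n = len(y)
      return (0.5 * (resid @ resid) / noise_var
              + 0.5 * n * np.log(2 * np.pi * noise_var))

  # Gradient-based or closed-form minimization both work; in closed form:
  # w_mle = np.linalg.solve(X.T @ X, X.T @ y)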

A likelihood function is the density of our model with the data held fixed, hence a function of the model parameters. We use the likelihood to get estimators that are statistically efficient (attaining the Cramér-Rao lower bound #DOUBT What does that mean here?), asymptotically unbiased and consistent, and it incorporates the principle of indifference (#DOUBT How does it relate to maximum entropy and the indifference principle). Likelihoods are also needed to construct hypothesis tests with good power. Likelihoods allow us to combine different data sources, and we can also use knowledge outside the data, like constraints. A badly chosen likelihood results in model misspecification, leading to inefficient estimates or confidence intervals/tests that fail; we can, however, correct biases.
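
To make the definition concrete, a small sketch: the same Gaussian density read as a function of the parameter \(\mu\) with the data held fixed (the data values are made up):

  import numpy as np
  from scipy.stats import norm

  data = np.array([0.8, 1.2, 1.1, 0.9])           # fixed observations

  def log_likelihood(mu, sigma=1.0):
      """Log-likelihood: the model density evaluated at the fixed data,
      read as a function of the parameter mu."""
      return norm.logpdf(data, loc=mu, scale=sigma).sum()

  mus = np.linspace(-1, 3, 401)
  ll = np.array([log_likelihood(m) for m in mus])
  mu_mle = mus[np.argmax(ll)]                      # close to data.mean()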

The classic way to estimate parameters is MLE; adding a regularization (or shrinkage) term results in MAP estimation. Not every regularizer corresponds to a valid probability distribution, i.e. regularization is more general. However, point estimates from MLE or MAP need not correspond to a "typical" solution. Bootstrap estimates can be used to quantify uncertainty. Reparametrization may also change the MAP value; the solution to this, called invariant MAP, uses a modified probabilistic model that reduces this dependence on the parametrization.
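
A minimal sketch of the MLE/MAP correspondence, assuming a Gaussian prior on linear-regression weights, whose log-density is an L2 (ridge) regularizer:

  import numpy as np

  def map_ridge(X, y, prior_var=1.0, noise_var=1.0):
      """MAP estimate for linear regression with a N(0, prior_var * I)
      prior on the weights. The log-prior contributes an L2 penalty with
      strength lam = noise_var / prior_var, i.e. ridge regression."""
      lam = noise_var / prior_var
      d = X.shape[1]
      return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)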

Recall we want to compute the evidence and the posterior. The evidence, which is an integral, can be computed in closed form if we choose conjugate models, e.g. Beta-Bernoulli. This however restricts the model class; approximations include the Laplace approximation (also called the saddle-point approximation or the delta method).
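
A sketch comparing the exact conjugate evidence with its Laplace approximation for the Beta-Bernoulli example (all numbers illustrative):

  import numpy as np
  from scipy.special import betaln

  # Data: k ones out of n Bernoulli(theta) draws, prior theta ~ Beta(a, b).
  a, b, n, k = 2.0, 2.0, 10, 7

  # Exact log-evidence via conjugacy: log B(a+k, b+n-k) - log B(a, b).
  log_evidence_exact = betaln(a + k, b + n - k) - betaln(a, b)

  # Laplace: expand log p(y|theta) p(theta) to second order around the MAP.
  theta_map = (k + a - 1) / (n + a + b - 2)
  log_joint = ((k + a - 1) * np.log(theta_map)
               + (n - k + b - 1) * np.log(1 - theta_map)
               - betaln(a, b))                    # includes prior normalizer
  hess = (k + a - 1) / theta_map**2 + (n - k + b - 1) / (1 - theta_map)**2
  log_evidence_laplace = log_joint + 0.5 * np.log(2 * np.pi / hess)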

Learning and Inference: statistics does not distinguish between these and just calls both estimation. In Bayesian statistics all quantities are probability distributions, so there is only the problem of inference. In software engineering, inference is the forward evaluation of a trained model to get predictions. In decision making and AI, learning is a general term used to describe a mechanism that makes use of past experience to inform future actions and understanding. In machine learning, inference is about reasoning and computing unknown probability distributions, and (parameter) learning is about finding point estimates of quantities in the model.

Given the posterior, the posterior predictive distribution is \(p(x^*|x)=\mathbb{E}_{p(\theta|x)}[p(x^*|\theta)]\).
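
A minimal Monte Carlo sketch of this expectation; the numbers reuse the illustrative Beta-Bernoulli example above, and `likelihood_pdf` is a hypothetical placeholder for the model's likelihood:

  import numpy as np
  from scipy.stats import bernoulli

  rng = np.random.default_rng(1)

  def posterior_predictive(x_new, posterior_samples, likelihood_pdf):
      """Monte Carlo estimate of p(x*|x) = E_{p(theta|x)}[p(x*|theta)]:
      average the likelihood of x* over posterior samples. Both arguments
      are placeholders for whatever model and inference method are in use."""
      return np.mean([likelihood_pdf(x_new, th) for th in posterior_samples])

  # Beta-Bernoulli example (posterior Beta(a + k, b + n - k) = Beta(9, 5)):
  theta_samples = rng.beta(9, 5, size=10_000)
  p_next_one = posterior_predictive(1, theta_samples,
                                    lambda x, th: bernoulli.pmf(x, th))
  # Exact answer here: posterior mean (a + k) / (a + b + n) = 9 / 14.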

Bayes factors are a method to compare two models; they are the Bayesian approach to hypothesis testing. The Bayes factor is \(\frac{p(x|\mathcal{M}_1)}{p(x|\mathcal{M}_2)}\), which arises in the posterior odds \(\frac{p(\mathcal{M}_1|x)}{p(\mathcal{M}_2|x)}\) and equals them when, as is common, we assume a priori that the model priors are equal. In comparison to frequentist hypothesis testing, we do not compare against some unknown model, and we can compare non-nested models. Large-scale approximations of Bayes factors like BIC, AIC, DIC and WAIC can be used. The main problem here is computing the marginal likelihood.
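
A sketch of a Bayes factor computation, assuming two illustrative Beta-Bernoulli models that differ only in their priors (closed-form marginal likelihoods by conjugacy):

  import numpy as np
  from scipy.special import betaln

  def log_evidence_beta_bernoulli(k, n, a, b):
      """Log marginal likelihood of k ones in n Bernoulli trials
      under a Beta(a, b) prior (closed form by conjugacy)."""
      return betaln(a + k, b + n - k) - betaln(a, b)

  # Bayes factor between two illustrative priors for the same data:
  # M1 says theta is near 0.5 (Beta(20, 20)); M2 is uniform (Beta(1, 1)).
  k, n = 7, 10
  log_bf = (log_evidence_beta_bernoulli(k, n, 20, 20)
            - log_evidence_beta_bernoulli(k, n, 1, 1))
  bf = np.exp(log_bf)   # > 1 favours M1, < 1 favours M2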

Marginal likelihoods:

  • Are consistent, i.e. given more data they favour the true model.
  • Embody Occam's razor.
  • Make comparison easier, since the compared models and parameters don't have to be nested or equivalent.
  • Can serve, for a given data-model instance, as a reference for future model selection or comparison.
  • Give the weight of evidence (#DOUBT Did not understand that).

Overall, the model evidence computation embodies a learning principle that is direct, in comparison to e.g. indirect two-sample tests, which may learn the distribution only in relation to some other distribution (a sketch follows below).
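
For contrast, a minimal sketch of one such indirect signal: a (biased) maximum mean discrepancy estimate between two sample sets, with an RBF kernel and bandwidth as illustrative choices:

  import numpy as np

  def mmd2_rbf(x, y, bandwidth=1.0):
      """Biased (V-statistic) estimate of squared MMD with an RBF kernel.
      x: (N, D) samples from one distribution, y: (M, D) from the other.
      An indirect learning signal: it compares samples with samples,
      never evaluating a likelihood."""
      def k(a, b):
          d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
          return np.exp(-d2 / (2 * bandwidth ** 2))
      return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()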

Inferential questions include:

  • Evidence estimation.
  • Moment computation.
  • Parameter estimation.
  • Prediction.
  • Planning.
  • Hypothesis testing.
  • Experimental design.

In light of all of this, there are some "neutrality traps" we should be aware of:

  • Solutionism trap: failure to recognize that the best solution may not involve technology at all.
  • Formalism trap: failure to account for the full meaning of social concepts like fairness.
  • Portability trap: failure to understand that what works in one (social) context may be inaccurate or even harmful in a different one.
  • Ripple effect trap: failure to understand how the insertion of technology into a social system will change the behaviours and embedded values of the pre-existing system.

(#DOUBT What is the formal definition of a calibrated predictive model in the Bayesian context)

Bayesian Approximation

The goal is to talk about:

  • Direct and indirect inferences.
  • Monte Carlo methods.
  • Variational methods.

First of all, how do we represent a distribution?

  • Analytically, via a closed-form density.
  • Via approximations, which could themselves be analytical.
  • Using samples.
  • As a sampling program, with separate components which, when composed, produce the samples (a sketch follows below).
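
As referenced in the last item, a minimal sketch of a sampling program, assuming a Gaussian represented as a base source of randomness plus a deterministic transform:

  import numpy as np

  rng = np.random.default_rng(2)

  def gaussian_sampling_program(mu, sigma, size):
      """Represent N(mu, sigma^2) not by its density but as a program:
      a base source of randomness plus a deterministic transformation
      (here, the location-scale transform)."""
      eps = rng.standard_normal(size)   # base distribution: N(0, 1)
      return mu + sigma * eps           # transform; composes with any mu, sigma

  samples = gaussian_sampling_program(1.0, 2.0, size=10_000)
  # Moments, quantiles, etc. of the represented distribution are then
  # estimated from the samples.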

(#TODO Finish from Part 2 20:00)
