CLAPEM 2019 | Chris Holmes - Bayesian Nonparametric Learning through Randomized Loss Functions - 2019
Details
Title: CLAPEM 2019 | Chris Holmes - Bayesian Nonparametric Learning through Randomized Loss Functions
Author(s): CIMAT
Link(s): https://www.youtube.com/watch?v=Abr8ukffFXo
Rough Notes
Overview:
- Generalization of Bayesian Bootstrap for inference in parametric models under model misspecification.
- Using Bayesian nonparametrics to update parametric Bayesian models:
    - Extending the weighted likelihood bootstrap.
    - Improved robustness properties over conventional Bayesian updating.
    - Fast, independent, parallel MC sampling; avoids MCMC.
Recall the Bayesian update rule: the posterior distribution is proportional to the prior distribution of the parameter (before seeing the data) multiplied by the updating function (i.e. the likelihood), \(\pi(\theta|x) \propto \pi(\theta) f_\theta(x)\). This is a prescriptive updating rule: if you want to adhere to certain notions of coherence and rational decision making, you have to use this rule.
The above principle relies on the model being correct, but we know all models are wrong.
In the real world, the true model is hard to justify or even define - it is hard to define a true generative model over medical images etc.
What happens when you fit a false model? Looking at the MLE \(\hat{\theta} = \text{argmax}_\theta \sum_{i}\log f_\theta(x_i)\): as more and more data arrive, the MLE converges to a point \(\theta_0\). Bayesian models with reasonable priors likewise yield posteriors converging to a point mass at \(\theta_0\).
Whether or not the model is true, \(\theta_0\) is the value that minimizes the KL divergence \(\text{KL}(F_0 \,\|\, F_\theta)\) between Nature's true (unknown) sampling distribution \(F_0(x)\) and the model, i.e. \(\theta_0 = \text{argmax}_\theta \int \log f_\theta(x)\, dF_0(x)\).
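As a concrete illustration (a minimal sketch of my own, not from the talk, assuming NumPy): fit a misspecified Normal\((\mu,\sigma)\) model to Exponential(1) data. The Normal MLE is the sample mean and standard deviation, which converge to the pseudo-true value \(\theta_0 = (1, 1)\), the KL-minimizing Normal for an Exponential(1) population.

```python
# Minimal simulation sketch: the MLE of a misspecified Normal model converges to the
# KL-minimizing pseudo-true parameter theta_0 = (mean, sd) of the true distribution.
import numpy as np

rng = np.random.default_rng(0)

for n in [100, 10_000, 1_000_000]:
    x = rng.exponential(scale=1.0, size=n)   # Nature's F_0: Exponential(1)
    mu_hat = x.mean()                        # Normal-model MLE for mu
    sigma_hat = x.std()                      # Normal-model MLE for sigma
    print(f"n={n:>9}: mu_hat={mu_hat:.3f}, sigma_hat={sigma_hat:.3f}  (theta_0 = (1, 1))")
```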
Now, we want to update the posterior without assuming the model class is true. The speaker's main argument is that Bayesian bootstrapping can take us from the prior to the posterior in this setting.
Uncertainty in \(\theta_0\) stems from uncertainty in \(F_0(x)\). Being Bayesian, we can put a prior \(p(F)\) on this unknown distribution. This is the essence of Bayesian nonparametric learning: using the nonparametric posterior \(p(F|x)\) to train a parametric model \(f_\theta(x)\).
If we can sample \(F\sim p(F|x)\), we can then use this to train our model:
- Sample a distribution function for the data from the posterior \(F^{(i)}\sim p(F|x)\). For each \(F^{(i)}\), there is no uncertainty in the corresponding optimal value \(\theta_0|F^{(i)}\).
- Set \(\theta^{(i)} = \text{argmax}_\theta \int \log f_\theta(x)dF^{(i)}(x)\).
- Repeat to get \(\theta^{(1)},\cdots,\theta^{(T)}\).
For the nonparametric prior, we can use a Dirichlet process (DP). Taking its concentration parameter \(\alpha\to 0\), the DP posterior becomes a discrete random distribution with atoms only at the observations, \(F = \sum_{i=1}^n w_i \delta_{x_i}\) with \(w\sim\text{Dirichlet}(1,\dots,1)\). The optimization step above then becomes a weighted sum instead of an integral: \(\theta^{(t)} = \text{argmax}_\theta \sum_{i=1}^n w_i^{(t)} \log f_\theta(x_i)\).
This is the Bayesian bootstrap.
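A minimal sketch of the Bayesian bootstrap for a simple parametric model (my own illustration, assuming NumPy and the closed-form weighted MLE of a Normal\((\mu,\sigma)\) model; not the speaker's code): each posterior draw uses Dirichlet\((1,\dots,1)\) weights over the observations and solves the weighted-MLE problem.

```python
# Bayesian bootstrap sketch: Dirichlet(1,...,1) weights over the observations,
# then the weighted MLE of a Normal(mu, sigma) model (available in closed form).
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=500)     # observed data (model is misspecified)
n, T = len(x), 2000                          # T = number of posterior draws

thetas = np.empty((T, 2))
for t in range(T):
    w = rng.dirichlet(np.ones(n))            # F^(t): random weights on the observations
    mu = np.sum(w * x)                       # argmax_theta sum_i w_i log f_theta(x_i)
    sigma = np.sqrt(np.sum(w * (x - mu) ** 2))
    thetas[t] = (mu, sigma)

print("posterior mean of (mu, sigma):", thetas.mean(axis=0))
print("posterior std  of (mu, sigma):", thetas.std(axis=0))
```

Note that the draws are independent, so the loop parallelizes trivially, which is the "fast, independent, parallel MC sampling" point from the overview.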
The classical bootstrap conditions on the empirical CDF \(\hat{F}\) and looks at the distribution of the estimator \(\hat{\theta}(x_{1:n})\) under repeated resampling from \(\hat{F}\); this captures the finite-sample properties of the estimator. In contrast, the Bayesian bootstrap draws \(F^{(i)}\sim p(F|\hat{F})\) and looks at the distribution of the limiting estimate \(\hat{\theta}(x_{1:\infty})\) under those draws; this captures the uncertainty in \(\theta_0\) that flows through \(p(F|\hat{F})\).
Now, how can we incorporate prior information into the Bayesian bootstrap? We can encode the prior through synthetic samples drawn from the prior predictive; this is called the posterior bootstrap. It can also be combined with variational Bayes, by sampling from the approximate posterior in the prior-sampling step. A rough sketch follows.
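A rough sketch of the posterior-bootstrap idea under stated assumptions (my reconstruction, not the speaker's code): the prior enters through pseudo-samples from a prior predictive that are pooled with the data before drawing Dirichlet weights. The prior predictive `sample_prior_predictive`, the pseudo-sample count `T_prior`, and the prior mass `c` are hypothetical choices of mine, and the exact weighting scheme in the posterior bootstrap papers may differ from this simplification.

```python
# Posterior bootstrap sketch: augment the data with pseudo-samples from a prior
# predictive, give them total "prior mass" c, and run the weighted-MLE step as before.
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=1.0, size=500)          # observed data
n, T_prior, c = len(x), 50, 10.0                  # 50 pseudo-samples carrying prior mass c

def sample_prior_predictive(size):
    # Hypothetical prior predictive for the Normal(mu, sigma) model:
    # mu ~ N(0, 3), sigma ~ Exp(1), then x ~ N(mu, sigma).
    mu = rng.normal(0.0, 3.0, size)
    sigma = rng.exponential(1.0, size)
    return rng.normal(mu, sigma)

thetas = []
for _ in range(2000):
    x_prior = sample_prior_predictive(T_prior)
    x_aug = np.concatenate([x, x_prior])
    alpha = np.concatenate([np.ones(n), np.full(T_prior, c / T_prior)])
    w = rng.dirichlet(alpha)                      # weights over data + pseudo-samples
    mu = np.sum(w * x_aug)                        # weighted Normal MLE, as before
    sigma = np.sqrt(np.sum(w * (x_aug - mu) ** 2))
    thetas.append((mu, sigma))

print("posterior mean of (mu, sigma):", np.mean(thetas, axis=0))
```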
Because each posterior draw comes from an independent optimization (rather than a local MCMC move), the samples can jump between modes.
Overall, this nonparametric approach aims to build an uncertainty update in its own right, rather than to approximate the conventional Bayesian posterior. Both the Bayesian bootstrap and traditional Bayesian inference aim to make inference statements about the same \(\theta_0\), the parameter value that minimizes the KL divergence between the true distribution and the model.
Note that the Bayesian bootstrap requires conditionally independent likelihood contributions, i.e. the log-likelihood must decompose as a sum over observations.