Bayesian Experimental Design (BED)
A model-based approach to choosing a design \(\xi\) which maximizes the information gained about the model parameters \(\theta\) from the outcome \(y\) of performing experiment \(\xi\). The design is something the experimenter can control, while \(\theta\) is the latent variable of interest, whose posterior is \[ \mathbb{P}(\theta|y,\xi)\propto \mathbb{P}(y|\theta,\xi)\mathbb{P}(\theta) \]
Given the likelihood \(\mathbb{P}(y|\theta,\xi)\) and prior \(\mathbb{P}(\theta)\), the objective is to choose a design \(\xi\) that yields the greatest reduction in our uncertainty, i.e. the difference between the prior entropy and the posterior entropy. This quantity, called the Information Gain (IG) (or Bayesian surprise), is large if the posterior reduces our uncertainty by a large amount. As a function of \(y,\xi\) it is \[ \text{IG}(y,\xi) = \mathbb{H}[\mathbb{P}(\theta)] - \mathbb{H}[\mathbb{P}(\theta|y,\xi)]\]
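For concreteness, a minimal sketch of the IG in a conjugate Beta-Bernoulli model, where the closed-form posterior makes both entropies available directly (the model and all names here are illustrative assumptions, not from the text above):

```python
# Minimal sketch: information gain for a Beta-Bernoulli model.
# Here the "design" is the number of coin flips n; y is the success count.
from scipy.stats import beta

def info_gain(a, b, y, n):
    """IG = H[prior] - H[posterior] after observing y successes in n flips."""
    prior = beta(a, b)
    posterior = beta(a + y, b + n - y)  # conjugate Beta update
    return prior.entropy() - posterior.entropy()

print(info_gain(a=1, b=1, y=7, n=10))  # uncertainty reduction, in nats
```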
Or, more generally in an iterative setting at time \(t\), \(\text{IG}_t(y_t,\xi_t)=\mathbb{H}[\mathbb{P}(\theta|h_{t-1})] - \mathbb{H}[\mathbb{P}(\theta|h_{t-1},y_t,\xi_t)]\), where \(h_t=\{y_1,\xi_1,\cdots, y_t,\xi_t\}\) and \(\mathbb{H}\) is the entropy operator. This quantity requires knowing posterior distributions, which is in general an intractable task in itself. It also depends on \(y_t\), which we have not yet seen. Integrating out \(y_t\) (i.e. taking an expectation w.r.t. \(\mathbb{P}(y_t|\xi_t,h_{t-1})\)) gives us the Expected Information Gain (EIG)
\[ \text{EIG}(\xi_t) = \mathbb{E}_{\mathbb{P}(y_t,\theta|\xi_t,h_{t-1})}\Big[\log \frac{\mathbb{P}(y_t|\theta,\xi_t)}{\mathbb{P}(y_t|\xi_t,h_{t-1})}\Big] \] The formula is derived (conditioning on \(h_{t-1}\) suppressed below) after:
- Writing \(\text{EIG}(\xi_t) = \mathbb{E}_{\mathbb{P}(y_t|\xi_t)}\big[\mathbb{H}[\mathbb{P}(\theta)] - \mathbb{H}[\mathbb{P}(\theta|y_t,\xi_t)]\big]\) and expanding both entropies as expected log-densities.
- Collapsing the double expectation \(\mathbb{E}_{\mathbb{P}(y_t|\xi_t)}\mathbb{E}_{\mathbb{P}(\theta|y_t,\xi_t)}\) into \(\mathbb{E}_{\mathbb{P}(y_t,\theta|\xi_t)}\) and applying Bayes' rule \(\mathbb{P}(\theta|y_t,\xi_t)=\mathbb{P}(y_t|\theta,\xi_t)\mathbb{P}(\theta)/\mathbb{P}(y_t|\xi_t)\) inside the log.
- Cancelling \(\mathbb{E}_{\mathbb{P}(y_t,\theta|\xi_t)}[\log \mathbb{P}(\theta)]\) against the prior entropy term: integrating out \(y_t\) acts on a function constant in \(y_t\), and the DAG leaves \(\xi_t\) and \(\theta\) independent, so this expectation equals \(-\mathbb{H}[\mathbb{P}(\theta)]\).
This amounts to using the model to both define the utility function we optimize and to approximate the true distribution over outcomes \(\mathbb{P}(y|\xi)\).
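As a sanity check on this derivation, a small numerical verification that \(\mathbb{E}[\log \mathbb{P}(y|\theta,\xi)/\mathbb{P}(y|\xi)]\) equals \(\mathbb{H}[\mathbb{P}(\theta)] - \mathbb{E}_{y}\mathbb{H}[\mathbb{P}(\theta|y,\xi)]\) on a tiny discrete model (all numbers illustrative):

```python
# Numerical check of the EIG identity on a two-state discrete model.
import numpy as np

p_theta = np.array([0.3, 0.7])                   # prior p(theta)
p_y_given = np.array([[0.9, 0.1],                # rows: theta, cols: y
                      [0.4, 0.6]])               # p(y | theta, xi), fixed xi

joint = p_theta[:, None] * p_y_given             # p(theta, y | xi)
p_y = joint.sum(axis=0)                          # marginal p(y | xi)
post = joint / p_y                               # p(theta | y, xi), column per y

H = lambda p: -np.sum(p * np.log(p))             # Shannon entropy in nats

lhs = np.sum(joint * np.log(p_y_given / p_y))    # E[log p(y|theta,xi)/p(y|xi)]
rhs = H(p_theta) - sum(p_y[y] * H(post[:, y]) for y in (0, 1))
assert np.isclose(lhs, rhs)                      # both equal the EIG
```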
The Bayes optimal design \(\xi^*\) is then defined as: \[ \xi^* = \text{argmax}_{\xi} \text{EIG}(\xi) \]
In general the objective need not be the EIG; any other utility function \(U(y_t,\xi_t,\theta)\) can be used, as long as we integrate out \(y_t\) and \(\theta\).
Since the EIG is itself an expectation over the IG, the actual objective we optimize here is doubly intractable. A (nested) Monte Carlo approximation samples \(y_n,\theta_n \sim \mathbb{P}(y,\theta|\xi)\) for the outer expectation and, for each such sample, approximates the marginal likelihood \(\mathbb{P}(y_n|\xi) = \int \mathbb{P}(y_n,\theta|\xi) \: d\theta\) with a second, inner set of samples from the prior.
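A minimal sketch of this nested Monte Carlo estimator, assuming a toy linear-Gaussian model \(y = \theta\xi + \epsilon\) with \(\theta \sim \mathcal{N}(0,1)\) and \(\epsilon \sim \mathcal{N}(0,\sigma^2)\) (the model and all names are illustrative, not from the text):

```python
# Nested Monte Carlo EIG for a toy model y ~ N(theta * xi, SIGMA^2), theta ~ N(0, 1).
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 1.0

def log_lik(y, theta, xi):
    """log p(y | theta, xi) for the Gaussian likelihood."""
    return -0.5 * ((y - theta * xi) / SIGMA) ** 2 - np.log(SIGMA * np.sqrt(2 * np.pi))

def eig_nmc(xi, n_outer=500, n_inner=500):
    # Outer samples (theta_n, y_n) ~ p(theta) p(y | theta, xi)
    theta = rng.standard_normal(n_outer)
    y = theta * xi + SIGMA * rng.standard_normal(n_outer)
    # Inner samples from the prior approximate each marginal p(y_n | xi):
    # log p(y_n | xi) ~= logsumexp_m log p(y_n | theta_m, xi) - log(n_inner)
    theta_in = rng.standard_normal(n_inner)
    ll = log_lik(y[:, None], theta_in[None, :], xi)   # shape (n_outer, n_inner)
    log_marg = np.logaddexp.reduce(ll, axis=1) - np.log(n_inner)
    return np.mean(log_lik(y, theta, xi) - log_marg)

# The Bayes optimal design via grid search over candidate designs:
designs = np.linspace(0.1, 5.0, 25)
xi_star = designs[np.argmax([eig_nmc(x) for x in designs])]
print(f"xi* = {xi_star:.2f}")  # larger xi is more informative in this model
```

Note the inner average sits inside a log, so the estimator is biased for finite inner sample size, with the bias vanishing as the inner sample count grows.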
The EIG can also be interpreted as:
- Mutual information between the parameters and the outcomes.
- Expected utility with the KL divergence utility function.
- Expected reduction in predictive uncertainty (the BALD score).
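These equivalences can be stated compactly (standard identities; conditioning on history suppressed): \[ \text{EIG}(\xi) = \mathbb{I}(\theta; y\,|\,\xi) = \mathbb{E}_{\mathbb{P}(y|\xi)}\big[\text{KL}\big(\mathbb{P}(\theta|y,\xi)\,\|\,\mathbb{P}(\theta)\big)\big] = \mathbb{H}[\mathbb{P}(y|\xi)] - \mathbb{E}_{\mathbb{P}(\theta)}\mathbb{H}[\mathbb{P}(y|\theta,\xi)] \]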
To perform adaptive experiments, at step \(t\) we condition on the history \(h_{t-1} = \{\xi_i, y_i\}_{i=1}^{t-1}\) in both entropies of the IG and in the predictive distributions. This gives rise to the greedy adaptive BED loop (a code sketch follows the list):
- Choose a new design \(\xi_t\) by optimizing the conditional EIG given \(h_{t-1}\).
- Sample a new outcome \(y_t \sim \mathbb{P}_{\text{true}}(y_t|\xi_t, h_{t-1})\) by running the experiment.
- Update the posterior \(\mathbb{P}(\theta|h_t)\).
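A minimal sketch of this loop, assuming a toy 1-D threshold model \(y \sim \text{Bernoulli}(\sigma(\theta - \xi))\) with a discretized grid posterior (the model, grid, and all names are illustrative assumptions):

```python
# Greedy adaptive BED loop with an exact grid posterior (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)
GRID = np.linspace(-4, 4, 401)                      # discretized theta support
post = np.exp(-0.5 * GRID**2); post /= post.sum()   # N(0,1) prior on the grid
theta_true = 1.3                                    # ground truth, unknown to us

def lik(y, xi):
    p = 1.0 / (1.0 + np.exp(-(GRID - xi)))          # P(y=1 | theta, xi)
    return p if y == 1 else 1.0 - p

def eig(xi, post):
    # EIG(xi) = sum_y p(y|xi,h) * KL(p(theta|y,xi,h) || p(theta|h)) on the grid
    total = 0.0
    for y in (0, 1):
        joint = lik(y, xi) * post
        p_y = joint.sum()                           # predictive p(y | xi, h)
        post_y = joint / p_y
        total += p_y * np.sum(post_y * np.log(post_y / post))
    return total

candidates = np.linspace(-3, 3, 61)
for t in range(10):
    xi = candidates[np.argmax([eig(x, post) for x in candidates])]  # design step
    p_true = 1.0 / (1.0 + np.exp(-(theta_true - xi)))
    y = int(rng.random() < p_true)                  # outcome from the true process
    post = lik(y, xi) * post; post /= post.sum()    # posterior update
print("posterior mean:", np.sum(GRID * post))
```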