Targeted Active Learning for Bayesian Decision-Making - 2021
Details
Title: Targeted Active Learning for Bayesian Decision-Making
Author(s): Filstroff, Louis and Sundin, Iiris and Mikkola, Petrus and Tiulpin, Aleksei and Kylmäoja, Juuso and Kaski, Samuel
Link(s): http://arxiv.org/abs/2106.04193
Rough Notes
Introduces an active learning criterion that maximizes information gain on the posterior distribution of the optimal decision in the context where we have a probabilistic model for decision making.
The setup considered is one where a user chooses an action (decision) from \(K\) options.
The dataset \(\mathcal{D}\) consists of features \(\mathbf{x}_i\), outcome values \(y_i\) and decisions \(d_i\), with \(y_i = f_{d_i}(\mathbf{x}_i)+\epsilon_i\) where \(\epsilon_i \sim \mathcal{N}(0,\sigma_i^2)\). Given \(\mathcal{D}_k = \{(\mathbf{x}_i,d_i,y_i) \in \mathcal{D} \mid d_i=k\}\), denote by \(f_{k,\mathbf{x}}\) the value at \(\mathbf{x}\) of the outcome function for decision \(k\), whose posterior \(p(f_{k,\mathbf{x}}|\mathcal{D}_k)\) could be modeled with a Gaussian Process (GP) or a Bayesian Neural Network.
The user is assumed to choose the Bayes-optimal decision \(d_{BAYES}\) for a new sample \(\tilde{\mathbf{x}}\), i.e. the decision with the highest expected outcome under the posterior predictive: \[ d_{BAYES} = \text{argmax}_{k\in [K]} \int\int \tilde{y}_k\, p(\tilde{y}_k|f_{k,\tilde{\mathbf{x}}})\, p(f_{k,\tilde{\mathbf{x}}}|\mathcal{D}_k)\, df_{k,\tilde{\mathbf{x}}}\, d\tilde{y}_k \]
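Because the noise has zero mean, the double integral above reduces to the posterior mean of \(f_{k,\tilde{\mathbf{x}}}\), so the Bayes-optimal decision is the argmax of the posterior means. A minimal sketch of this, assuming Gaussian posteriors with made-up means, standard deviations and noise level (all values and names below are hypothetical, not from the paper):

```python
import numpy as np

def bayes_decision(post_means):
    """Index of the decision with the largest posterior-mean outcome."""
    return int(np.argmax(post_means))

# Hypothetical Gaussian posteriors of f_k at x_tilde for K = 3 decisions.
post_means = np.array([0.2, 1.1, 0.7])
post_stds = np.array([0.5, 0.8, 0.3])
noise_std = 0.4  # assumed observation noise, zero mean

# Monte Carlo version of the double integral: sample f, then y given f.
rng = np.random.default_rng(0)
f = rng.normal(post_means, post_stds, size=(100_000, 3))
y = f + rng.normal(0.0, noise_std, size=f.shape)
mc_expected_outcome = y.mean(axis=0)  # should be close to post_means

d_bayes = bayes_decision(post_means)
```

The Monte Carlo average is only there as a sanity check that the integral matches the posterior mean; in practice the argmax of the means is enough.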
We have access to a pool of unlabelled data \(\mathcal{U}=\{(\mathbf{x}_j,d_j)\}_{j=1}^J\), from which outcome values can be actively queried.
Considering the case without decisions, the optimal data point \(\mathbf{x}^*\) to query from an information-theoretic perspective is the one which maximizes the Expected Information Gain (EIG), often written as the entropy of the model parameters \(\theta\) before observing \((x_i,y_i)\) minus the expected entropy after observing them, where the expectation is taken with respect to \(p(y_i|x_i,\mathcal{D})\). This can be rewritten in several ways, e.g. as a mutual information, or in the form used by Bayesian Active Learning by Disagreement (BALD).
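For reference, the standard identities behind this statement can be written as follows, with \(\mathbb{H}\) denoting entropy and \(I\) mutual information (the label \(\text{EIG}(x_i)\) is notation assumed for these notes, not taken from the paper):

\[ \text{EIG}(x_i) = \mathbb{H}(p(\theta|\mathcal{D})) - \mathbb{E}_{p(y_i|x_i,\mathcal{D})}[\mathbb{H}(p(\theta|\mathcal{D}\cup\{(x_i,y_i)\}))] = I(\theta; y_i \mid x_i,\mathcal{D}) \]

\[ I(\theta; y_i \mid x_i,\mathcal{D}) = \mathbb{H}(p(y_i|x_i,\mathcal{D})) - \mathbb{E}_{p(\theta|\mathcal{D})}[\mathbb{H}(p(y_i|x_i,\theta))] \]

The second line is the BALD form: it follows from the symmetry of mutual information and only needs entropies in the outcome space rather than in parameter space.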
Let \(\bar{Y}_k\) be the conditional expectation of the outcome \(Y_k\) of decision \(k\) at \(\tilde{\mathbf{x}}\) given \(f_{k,\tilde{\mathbf{x}}}\). Since the additive noise has zero mean, \(\bar{Y}_k = f_{k,\tilde{\mathbf{x}}}\). The posterior probability that decision \(k\) is optimal, denoted \(\pi_k\), is \[ \pi_k = \mathbb{P}(f_{k,\tilde{\mathbf{x}}}=\max_{k'} f_{k',\tilde{\mathbf{x}}})=\mathbb{P}(\cap_{k'\neq k} \{f_{k,\tilde{\mathbf{x}}}>f_{k',\tilde{\mathbf{x}}}\}) \]
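One simple way to estimate \(\pi_k\) is Monte Carlo: draw joint samples of \((f_{1,\tilde{\mathbf{x}}},\ldots,f_{K,\tilde{\mathbf{x}}})\) and count how often each decision wins. The sketch below assumes independent Gaussian posteriors per decision (plausible here since each model is fit on its own \(\mathcal{D}_k\), but still an assumption); the function name and the numbers are hypothetical.

```python
import numpy as np

def prob_best(means, stds, n_samples=20_000, seed=0):
    """Monte Carlo estimate of pi_k = P(f_k is the largest of f_1..f_K).

    Assumes independent Gaussian posteriors N(means[k], stds[k]^2).
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    # One row per joint posterior sample, one column per decision.
    samples = rng.normal(means, stds, size=(n_samples, means.size))
    winners = samples.argmax(axis=1)
    # Fraction of samples in which each decision had the largest value.
    return np.bincount(winners, minlength=means.size) / n_samples

pi = prob_best([0.0, 0.5, 1.0], [1.0, 1.0, 1.0])
```

For \(K=2\) the estimate can be checked against the closed form \(\Phi\big((\mu_1-\mu_2)/\sqrt{\sigma_1^2+\sigma_2^2}\big)\); for larger \(K\) no such simple expression is available, which is why sampling is used here.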
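The distribution \(\pi = (\pi_1,\cdots,\pi_K) =: D_{best}(\tilde{\mathbf{x}})\) is the posterior distribution over which decision is best at \(\tilde{\mathbf{x}}\). We want to query the pool point whose outcome is expected to leave this distribution with the lowest entropy, so that the optimal decision is identified as sharply as possible. The optimization problem is then: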
\[ (\mathbf{x}^*, d^*) = \text{argmin}_{(\mathbf{x}_j, d_j)\in \mathcal{U}}\mathbb{E}_{p(y_{d_j}|\mathbf{x}_j,\mathcal{D}_{d_j})}[\mathbb{H}(p(D_{best}(\tilde{\mathbf{x}})|\mathcal{D}\cup \{(\mathbf{x}_j,d_j,y_{d_j})\}))] \]
This raises two computational challenges:
- The outer expectation is intractable and needs approximation.
- The probabilities \(\pi_k\) have no closed form, yet they are needed to compute the entropy term.
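To make the two challenges concrete, here is a rough sketch of how the criterion could be approximated for one candidate \((\mathbf{x}_j, d_j)\). It is not claimed to be the paper's implementation: it assumes one zero-mean GP per decision with a unit-variance RBF kernel on 1-D inputs, uses Gauss-Hermite quadrature for the outer expectation and Monte Carlo for \(\pi\), and all function names, data and hyperparameters are hypothetical.

```python
import numpy as np

def rbf(a, b, ls=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def gp_joint(X, y, Xs, noise_var):
    """Zero-mean GP posterior mean and covariance of f at points Xs."""
    K = rbf(X, X) + noise_var * np.eye(len(X))
    Ks = rbf(X, Xs)
    mean = Ks.T @ np.linalg.solve(K, y)
    cov = rbf(Xs, Xs) - Ks.T @ np.linalg.solve(K, Ks)
    return mean, cov

def entropy(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def prob_best(means, stds, z):
    """Monte Carlo pi from shared standard-normal draws z of shape (S, K)."""
    winners = (means + stds * z).argmax(axis=1)
    return np.bincount(winners, minlength=len(means)) / len(z)

def expected_entropy(data, x_tilde, x_j, d_j, noise_var=0.1,
                     n_quad=10, n_mc=5000, seed=0):
    """Approximate E_y[H(D_best(x_tilde))] after querying decision d_j at x_j."""
    z = np.random.default_rng(seed).standard_normal((n_mc, len(data)))
    means, variances = np.zeros(len(data)), np.zeros(len(data))
    for k, (X, y) in enumerate(data):
        m, C = gp_joint(X, y, np.array([x_tilde, x_j]), noise_var)
        means[k], variances[k] = m[0], C[0, 0]
        if k == d_j:  # predictive moments of the outcome to be queried
            m_j, cross, var_y = m[1], C[0, 1], C[1, 1] + noise_var
    # Gaussian conditioning: the updated variance does not depend on y.
    new_vars = variances.copy()
    new_vars[d_j] -= cross**2 / var_y
    stds = np.sqrt(np.maximum(new_vars, 0.0))
    # Gauss-Hermite quadrature over the predictive N(m_j, var_y).
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    total = 0.0
    for t, w in zip(nodes, weights):
        y_new = m_j + np.sqrt(2.0 * var_y) * t
        new_means = means.copy()
        new_means[d_j] += cross / var_y * (y_new - m_j)
        total += w / np.sqrt(np.pi) * entropy(prob_best(new_means, stds, z))
    return total

# Toy data: one (inputs, outcomes) pair per decision, K = 2.
data = [(np.array([-1.0, 0.0]), np.array([0.2, 0.1])),
        (np.array([2.0]), np.array([0.3]))]
pool = [(0.9, 0), (1.0, 1), (5.0, 1)]  # candidate (x_j, d_j) pairs
x_tilde = 1.0
scores = [expected_entropy(data, x_tilde, x, d) for x, d in pool]
best = pool[int(np.argmin(scores))]
```

The same standard-normal draws `z` are reused across quadrature nodes and candidates so that differences in the scores reflect the candidates rather than sampling noise. Candidates far from \(\tilde{\mathbf{x}}\), such as the one at 5.0, barely change the posterior at \(\tilde{\mathbf{x}}\) and therefore leave the entropy almost unchanged.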