Causal Bayesian Optimization - 2020
Details
Title: Causal Bayesian Optimization
Author(s): Aglietti, Virginia and Lu, Xiaoyu and Paleyes, Andrei and González, Javier
Link(s): https://proceedings.mlr.press/v108/aglietti20a.html
Rough Notes
Focus is on global optimization of a variable within a causal model where sequences of interventions can be performed.
In classical BO, all input variables are manipulated, which breaks the dependency structure between them. Intervening on only certain subsets of the input variables may instead yield a better optimum.
Given a Structural Causal Model (SCM), the model variables are divided into:
- Non-manipulable variables \(\mathbf{ C }\).
- Treatment variables \(\mathbf{ X }\).
- Output variable(s) \(\mathbf{ Y }\).
They introduce the problem class of Causal Global Optimization (CGO), where the goal is to find the optimal intervention \(\text{do}(\mathbf{ X=x })\) such that \(\mathbb{ E }_{\mathbb{ P }(\mathbf{ Y }|\text{do}(\mathbf{ X=x }))}[\mathbf{ Y }]\) is optimized.
Some challenges:
- The set of variables in the optimal intervention is chosen from the power set of \(\mathbf{ X }\), which grows exponentially with the number of variables.
- Once the variables in the optimal intervention are determined, finding the values to set them to requires performing multiple interventions on the system.
Challenge 1 is tackled by using a smaller exploration set \(ES\), for which they use the notion of Possibly-Optimal Minimal Intervention Sets.
For each set of variables \(\mathbf{ X }_s\) in \(ES\), they place a GP prior on \(f(\mathbf{ x }_s)=\mathbb{ E }[Y|\text{do}(\mathbf{ X }_s=\mathbf{ x }_s)]\), where the mean is the Monte Carlo estimate of this expectation computed from observational data, and the kernel is an RBF kernel plus \(\sigma(\mathbf{ x }_s)\sigma(\mathbf{ x }'_s)\), where \(\sigma(\mathbf{ x }_s)=\sqrt{\hat{\mathbb{ V }}(Y|\text{do}(\mathbf{ X }_s=\mathbf{ x }_s))}\) is the Monte Carlo estimate of the standard deviation of \(Y\) under the interventional distribution being modelled.
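Below is a minimal sketch of this causal prior in Python (mine, not from the paper), where `mc_mean` and `mc_std` are hypothetical stand-ins for the Monte Carlo do-calculus estimators computed from observational data:

```python
import numpy as np

def rbf(x, xp, lengthscale=1.0, variance=1.0):
    """Standard RBF kernel on 1-D inputs."""
    d2 = (x[:, None] - xp[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def causal_prior(x, mc_mean, mc_std):
    """Causal GP prior at inputs x: mean m(x) = E_hat[Y | do(X_s = x)],
    covariance k_C(x, x') = k_RBF(x, x') + sigma(x) * sigma(x')."""
    return mc_mean(x), rbf(x, x) + np.outer(mc_std(x), mc_std(x))

# Hypothetical MC estimators for a toy one-dimensional intervention set.
mc_mean = lambda x: np.zeros_like(x)    # E_hat[Y | do(X_s = x)]
mc_std = lambda x: 0.1 * (1.0 + x**2)   # sqrt of V_hat[Y | do(X_s = x)]
m, K = causal_prior(np.linspace(-1.0, 1.0, 5), mc_mean, mc_std)
```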
The causal acquisition function is a modified version of the Expected Improvement (EI) function. Specifically, for each set indexed by \(s\in [1,|ES|]\), we have \(EI^s(\mathbf{ x })=\mathbb{ E }_{p(y_s)}[\max(y_s - y^*, 0)]/\text{Cost}(\mathbf{ x })\), where \(y_s = \mathbb{ E }[Y|\text{do}(\mathbf{ X }_s=\mathbf{ x })]\), and \(y^*\) is the optimal value of \(y_s\) observed so far. The solutions to optimizing each of the acquisition functions… [#DOUBT Not sure about what the \(\alpha\)'s exactly are in the acquisition function section.]
We may need to observe the system rather than intervene (i.e. when the intervention set is the null set); this improves the Monte Carlo estimates, for example.
[#DOUBT What are the function evaluations when normal BO is used in the crop yield example on page 1? They say that all variables will be manipulated in this case, and hence the dependency structure will be broken.]
Previous Notes
The paper considers global optimization of a variable that is part of a known Structural Causal Model (SCM), where a sequence of interventions can be performed. This generalizes Bayesian Optimization, which treats all input variables of the objective function as independent. It also generalizes the Causal Multi-Armed Bandit (MAB) problem, which treats arms as interventions; that setting corresponds to the special case where the treatment variables considered in this paper are binary-valued.
To do this, the authors propose a new class of optimization problems called Causal Global Optimization (CGO), which takes causal information into account. The proposed algorithm, called Causal Bayesian Optimization (CBO), automatically balances two trade-offs: exploration/exploitation and observation/intervention.
Let \(\mathcal{D}^O,\mathcal{D}^I\) represent the observational and interventional datasets. Denote all variables of the SCM as \(\mathbf{V}\), which are divided into:
- Context/Non-manipulative variables \(\mathbf{C}\)
- Treatment variables \(\mathbf{X}\)
- Output variables \(\mathbf{Y}\)
The conditional independence relationships in the DAG \(\mathcal{G}\) (which corresponds to the known SCM) entail an observational distribution \(p(\mathbf{V})=\prod_{v\in \mathbf{V}}p(v|Pa(v))\).
Assuming the causal effect is identifiable, do-calculus expresses the interventional distributions (i.e. those conditioned on \(\text{do}(\mathbf{X}=\mathbf{x})\)) in terms of observational distributions, which can then be approximated, e.g. via Monte Carlo.
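As a toy illustration of this step (my own example, not from the paper): in a linear SCM with an observed confounder \(C\), the backdoor adjustment gives \(\mathbb{E}[Y|\text{do}(X=x)]=\mathbb{E}_C[\mathbb{E}[Y|X=x,C]]\), which can be approximated from observational data by fitting \(\mathbb{E}[Y|X,C]\) with least squares and averaging over the observed \(C\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Toy linear SCM conforming to the DAG C -> X, C -> Y, X -> Y,
# where C is an observed confounder.
C = rng.normal(size=n)
X = 0.8 * C + rng.normal(scale=0.5, size=n)
Y = 1.5 * X - 2.0 * C + rng.normal(scale=0.5, size=n)

# Fit E[Y | X, C] by ordinary least squares on the observational data.
A = np.column_stack([np.ones(n), X, C])
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)

def do_mean(x):
    """MC estimate of E[Y | do(X = x)] via backdoor adjustment:
    average the fitted E[Y | X = x, C] over the observed values of C."""
    return np.mean(beta[0] + beta[1] * x + beta[2] * C)

print(do_mean(1.0))  # close to the true value 1.5, unlike naive E[Y | X = 1]
```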
The CGO problem is then defined as: \[ \mathbf{X}^*_s, \mathbf{x}^*_s = \text{argmin}_{\mathbf{X}_s \in \mathcal{P}(\mathbf{X}), \mathbf{x}_s\in D(\mathbf{X}_s)} \mathbb{E}_{P(\mathbf{Y}|\text{do}(\mathbf{X}_s=\mathbf{x}_s))}[\mathbf{Y}] \]
For notational convenience, \(\mathbb{E}_{P(\mathbf{Y}|\text{do}(\mathbf{X}_s=\mathbf{x}_s))}[\mathbf{Y}]:=\mathbb{E}[\mathbf{Y}|\text{do}(\mathbf{X}_s=\mathbf{x}_s)]\), and write \(\hat{\mathbb{E}}[\mathbf{Y}|\text{do}(\mathbf{X}_s=\mathbf{x}_s)]\) when the expectation is taken over the MC approximation of \(P(\mathbf{Y}|\text{do}(\mathbf{X}_s=\mathbf{x}_s))\).
Intractability arises from the optimization space: it includes the power set of \(\mathbf{X}\), and for each such subset we need to find the values to set the variables to. Denote by \(Co(\mathbf{X}_s,\mathbf{x}_s)\) the cost of evaluating the function at the specified point.
The authors define the causal intrinsic dimensionality of \(\mathbb{E}[\mathbf{Y}|\text{do}(\mathbf{X}=\mathbf{x})]\) as the number of parents of \(\mathbf{Y}\), which can be considerably lower than the input dimensionality in standard BO, which is known to perform poorly for higher input dimensions.
CBO relies on:
- Deciding which set of variables is worth intervening on.
- Causal Gaussian Process (GP) to use both observational and interventional data.
- An acquisition function to solve the exploration/exploitation trade-off across interventions.
- An \(\epsilon\)-greedy policy to solve the observation/intervention trade-off.
Variables that are worth intervening on are chosen from what is called the Possibly-Optimal Minimal Intervention Set (POMIS).
First define \(\mathbf{X}_s\in \mathcal{P}(\mathbf{X})\) to be a Minimal Intervention Set (MIS) if there is no proper subset \(\mathbf{X}_s'\subset \mathbf{X}_s\) such that \(\mathbb{E}[\mathbf{Y}|\text{do}(\mathbf{X}_s=\mathbf{x}_s)]=\mathbb{E}[\mathbf{Y}|\text{do}(\mathbf{X}_s'=\mathbf{x}_s')]\).
Denoting \(\mathbb{M}_{\mathcal{G},\mathbf{Y}}^\mathbf{C}\) as the set of MISs for the causal model, \(\mathbf{X}_s \in \mathbb{M}_{\mathcal{G},\mathbf{Y}}^\mathbf{C}\) is a Possibly-Optimal Minimal Intervention Set (POMIS), \(\mathbf{X}_s\in\mathbb{P}_{\mathcal{G},\mathbf{Y}}^\mathbf{C}\), if there exists an SCM conforming to \(\mathcal{G}\) such that \(\mathbb{E}[\mathbf{Y}|\text{do}(\mathbf{X}_s=\mathbf{x}_s^*)] > \mathbb{E}[\mathbf{Y}|\text{do}(\mathbf{W}=\mathbf{w}^*)]\) for all \(\mathbf{W}\in \mathbb{M}_{\mathcal{G},\mathbf{Y}}^\mathbf{C}\setminus \{\mathbf{X}_s\}\), where \(\mathbf{x}_s^*,\mathbf{w}^*\) are the optimal intervention values. For example, in the chain \(X\to Z\to Y\) with no unobserved confounders, the MISs are \(\emptyset,\{X\},\{Z\}\), but only \(\{Z\}\) is a POMIS, since intervening on the parent of \(Y\) can always match or beat intervening further upstream. Define the Exploration Set (ES) to be either the MIS or POMIS, and denote \(\mathbb{B}_{\mathcal{G},\mathbf{Y}}^\mathbf{C}\) as the (unique) set BO performs interventions on, which includes all manipulable variables \(\mathbf{X}\).
The surrogate model is a GP prior \(f(\mathbf{x}_s)=\mathbb{E}[\mathbf{Y}|\text{do}(\mathbf{X}_s=\mathbf{x}_s)]\) for each \(\mathbf{X}_s \in \text{ES}\), with mean function \(m(\mathbf{x}_s)=\hat{\mathbb{E}}[\mathbf{Y}|\text{do}(\mathbf{X}_s=\mathbf{x}_s)]\) and kernel \(k_C(\mathbf{x}_s,\mathbf{x}_s')=k_{RBF}(\mathbf{x}_s,\mathbf{x}_s')+\sigma(\mathbf{x}_s)\sigma(\mathbf{x}_s')\), where \(\sigma(\mathbf{x}_s)=\sqrt{\hat{\mathbb{V}}[\mathbf{Y}|\text{do}(\mathbf{X}_s=\mathbf{x}_s)]}\).
The causal acquisition function is the Expected Improvement (EI) with respect to the current best interventional setting across all sets in ES. Denoting \(y_s=\mathbb{E}[\mathbf{Y}|\text{do}(\mathbf{X}_s=\mathbf{x}_s)]\) and \(y^*\) as the optimal value of \(y_s, s\in \{1,\cdots,|\text{ES}|\}\) seen so far, the (causal) EI is \(EI^s(\mathbf{x})=\mathbb{E}_{p(y_s)}[\max(y_s-y^*,0)]/Co(\mathbf{x})\) (written here in the maximization convention; under the argmin form of CGO the improvement is \(\max(y^*-y_s,0)\)). The best new intervention is selected by evaluating \(EI^s(\mathbf{x})\) for each set in ES and picking the set, and intervention value, with the highest \(EI^s(\mathbf{x})\).
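A minimal sketch of this selection step, assuming each set in ES already has a Gaussian posterior evaluated on a grid of candidate intervention values (the helper names are mine, and the closed-form EI uses the maximization convention above):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sd, y_best):
    """Closed-form E[max(y - y_best, 0)] for y ~ N(mu, sd^2)."""
    z = (mu - y_best) / sd
    return (mu - y_best) * norm.cdf(z) + sd * norm.pdf(z)

def select_intervention(posteriors, costs, y_best):
    """posteriors: {set_name: (candidates, mu, sd)} for each set in ES.
    costs: {set_name: per-evaluation cost}. Returns the (set, value)
    pair with the highest cost-weighted causal EI."""
    best = None
    for s, (xs, mu, sd) in posteriors.items():
        ei = expected_improvement(mu, sd, y_best) / costs[s]
        i = int(np.argmax(ei))
        if best is None or ei[i] > best[0]:
            best = (ei[i], s, xs[i])
    return best[1], best[2]

# Hypothetical posteriors over two candidate intervention sets.
grid = np.linspace(0.0, 1.0, 50)
posteriors = {
    "X": (grid, np.sin(3 * grid), 0.2 * np.ones_like(grid)),
    "Z": (grid, np.cos(3 * grid), 0.3 * np.ones_like(grid)),
}
costs = {"X": 1.0, "Z": 2.0}
print(select_intervention(posteriors, costs, y_best=0.9))
```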
When the null set is in the ES, we may wish to observe rather than intervene. The authors propose an \(\epsilon\)-greedy approach to allow for observing, with \(\epsilon\) being the probability of observing, chosen as \(\frac{\text{Vol}(\mathcal{C}(\mathcal{D}^O))}{\text{Vol}(\times_{X\in \mathbf{X}}D(X))}\times \frac{|\mathcal{D}^O|}{N_{\max}}\), where the first fraction's numerator is the volume of the convex hull of the observational data, its denominator is the volume of the interventional domain, and \(N_{\max}\) is the maximum number of observational samples the agent is willing to collect. The hope is that when the convex hull volume is small relative to the interventional domain, the policy will choose to intervene so as to explore areas not covered by \(\mathcal{D}^O\); otherwise it opts to collect observational samples, which improve the MC causal effect estimators.
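A minimal sketch of the \(\epsilon\) computation, under the assumption that the treatment variables live in a box-shaped interventional domain (the function and argument names are illustrative):

```python
import numpy as np
from scipy.spatial import ConvexHull

def epsilon(obs_data, domain, n_max):
    """obs_data: (n, d) array of observational samples of the treatments.
    domain: list of (low, high) bounds, one per treatment variable.
    n_max: max number of observational samples the agent will collect."""
    hull_vol = ConvexHull(obs_data).volume                # Vol(C(D^O))
    domain_vol = np.prod([hi - lo for lo, hi in domain])  # Vol of box domain
    return (hull_vol / domain_vol) * (len(obs_data) / n_max)

rng = np.random.default_rng(1)
obs = rng.uniform(0.2, 0.6, size=(100, 2))  # D^O covers a small sub-box
eps = epsilon(obs, [(0.0, 1.0), (0.0, 1.0)], n_max=200)
observe = rng.uniform() < eps  # observe with probability eps, else intervene
```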