Interventions, Where and How? Experimental Design for Causal Models at Scale - 2022
Details
Title: Interventions, Where and How? Experimental Design for Causal Models at Scale
Author(s): Tigas, Panagiotis and Annadani, Yashas and Jesson, Andrew and Schölkopf, Bernhard and Gal, Yarin and Bauer, Stefan
Link(s): http://arxiv.org/abs/2203.02016
Rough Notes
This paper extends Bayesian Experimental Design (BED) for learning Structural Causal Models (SCMs), going beyond prior work's linearity assumption and beyond selecting only intervention targets: here both the targets and the values to set them to are chosen.
Directed Acyclic Graphs (DAGs) can be used to model causal relationships, but from observational data alone they can only be identified up to an equivalence class. To recover the true DAG, one needs to run experiments/interventions: choose some variables, set them to specific values, and collect data from the system. To select these interventions efficiently, we can use BED; this involves being Bayesian about the causal model itself, i.e. maintaining a distribution over possible causal models.
The goal is to identify the intervention \(\xi_{\mathbf{x}_j} = \text{do}(X_j=x_j)\) which maximizes the expected information gain on the SCM \((\mathbf{G},\mathbf{\Theta})\) after observing \(\mathbf{Y}\sim p(X_1,\cdots,X_d|\text{do}(X_j=x_j))\).
This means solving the following optimization problem: \[ \xi_{\mathbf{x}_j}^* = \text{argmax}_{j}\,\text{argmax}_{\mathbf{x}_j} MI(\mathbf{Y};\mathbf{G},\mathbf{\Theta}|\xi_{\mathbf{x}_j}, \mathcal{D}) \]
This paper focuses on atomic interventions, i.e. intervening on a single variable at a time.
The Mutual Information (MI) term can be rewritten as proposed in Bayesian Active Learning for Classification and Preference Learning - 2011, which calls this formulation Bayesian Active Learning by Disagreement (BALD). The BALD formulation is chosen in this work because it only requires samples from the current posterior \(p(\mathbf{G},\mathbf{\Theta}|\mathcal{D})\) rather than from \(p(\mathbf{G},\mathbf{\Theta}|\mathbf{Y}, \xi_{\mathbf{x}_j}, \mathcal{D})\).
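To make the BALD decomposition concrete, here is a minimal sketch (not from the paper) for a discrete outcome space: MI is the entropy of the posterior-averaged predictive minus the average entropy of the per-model predictives, computed from posterior samples \((\mathbf{G}_m,\mathbf{\Theta}_m)\sim p(\mathbf{G},\mathbf{\Theta}|\mathcal{D})\). The array shapes and function names are illustrative assumptions.

```python
import numpy as np

def entropy(p, axis=-1):
    # Shannon entropy in nats; clipping avoids log(0).
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=axis)

def bald_mi(pred_probs):
    """BALD estimate of MI(Y; G, Theta | xi, D).

    pred_probs: (n_models, n_outcomes) array, where row m is
    p(Y | G_m, Theta_m, xi) under one posterior sample (G_m, Theta_m).
    MI = H[ E_m p(Y|m) ] - E_m H[ p(Y|m) ]  (total minus expected entropy):
    high when the sampled models disagree about the intervention outcome.
    """
    mean_p = pred_probs.mean(axis=0)
    return entropy(mean_p) - entropy(pred_probs).mean()
```

Models that agree yield MI near zero; maximally disagreeing models yield MI up to the outcome entropy.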
Denote the MI for intervention target \(X_j\) as \(U_j = MI(\mathbf{Y};\mathbf{G},\mathbf{\Theta}|\xi_{\mathbf{x}_j},\mathcal{D})\). Since each \(U_j\) is a continuous function of the intervention value \(\mathbf{x}_j\), we can perform Bayesian Optimization (BO) to find the best intervention value for each variable, and then select the best (variable, value) pair as the intervention. This means we compute each MI for, say, \(t_{BO}\) steps. The authors use the Upper Confidence Bound (UCB) acquisition function for the BO loop.
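A toy sketch of this per-variable BO loop with a UCB acquisition. The hand-rolled GP surrogate, the candidate grid, the lengthscale, and the black-box `mi_fn` (standing in for an MI estimate at a given intervention value) are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rbf(a, b, ls=0.2):
    # Squared-exponential kernel between two 1-D point sets.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp_posterior(X, y, Xs, ls=0.2, noise=1e-4):
    # Standard GP regression posterior mean/variance on grid Xs.
    K = rbf(X, X, ls) + noise * np.eye(len(X))
    Ks = rbf(X, Xs, ls)
    Kinv = np.linalg.inv(K)
    mu = Ks.T @ Kinv @ y
    var = 1.0 + noise - np.sum(Ks * (Kinv @ Ks), axis=0)
    return mu, np.maximum(var, 1e-12)

def bo_ucb(mi_fn, grid, t_bo=20, beta=2.0, seed=0):
    """Maximize a black-box utility U_j(x) over intervention values x
    using a GP surrogate and the UCB acquisition mu + beta * sigma."""
    rng = np.random.default_rng(seed)
    X = [grid[rng.integers(len(grid))]]  # random initial query
    y = [mi_fn(X[0])]
    for _ in range(t_bo - 1):
        mu, var = gp_posterior(np.array(X), np.array(y), grid)
        x_next = grid[np.argmax(mu + beta * np.sqrt(var))]
        X.append(x_next)
        y.append(mi_fn(x_next))
    best = int(np.argmax(y))
    return X[best], y[best]
```

In the paper's setting, `mi_fn` would be the (estimated) \(U_j\) for variable \(j\), and the returned incumbent gives that variable's best intervention value.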
The authors further extend the method to the batch setting, where multiple interventions are selected at once. The MI optimization problem then becomes: \[ \text{argmax}_{\cup_{i=1}^{\mathcal{B}}\{\mathbf{x}_{j_i},\, j_i\}} MI(\mathbf{Y}_i;\mathbf{G},\mathbf{\Theta}|\xi_{\mathbf{x}_{j_i}},\ \mathcal{D})\]
Previous works create the batch greedily, while the authors here generate the whole batch at once by reusing the computed MI values: define a softmax distribution over \(U_{t,j}\) and sample interventions for the batch from this distribution without replacement. Here, \(U_{t,j}\) is the matrix whose \((t,j)\)-th element is the MI value for variable \(j\) at iteration \(t\) of the BO loop. This approach to selecting batches is called softBALD.
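A minimal sketch of this softBALD-style batch selection, assuming the utility matrix \(U_{t,j}\) has already been filled in by the BO loop; the temperature knob is an illustrative assumption.

```python
import numpy as np

def softbald_batch(U, batch_size, temperature=1.0, seed=0):
    """Sample a batch of interventions without replacement from a
    softmax over the MI utilities U[t, j].

    U: (t_bo, d) matrix; entry (t, j) is the MI for intervening on
    variable j with the candidate value from BO iteration t.
    Returns a list of (t, j) index pairs identifying the batch.
    """
    rng = np.random.default_rng(seed)
    logits = U.ravel() / temperature
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    idx = rng.choice(len(probs), size=batch_size, replace=False, p=probs)
    return [np.unravel_index(i, U.shape) for i in idx]
```

Sampling (rather than taking the top-\(\mathcal{B}\) entries) encourages diversity in the batch, while the softmax still concentrates on high-MI interventions.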
(#TODO Talk about experiments section)