# UAI 2023 Tutorial: Causal Representation Learning

## Details

Title: UAI 2023 Tutorial: Causal Representation Learning
Author(s): Uncertainty in Artificial Intelligence
Link(s): https://www.youtube.com/watch?v=f8JrbaTR1vg

## Rough Notes

Causal inference provides modular mechanisms: given suitable assumptions and data, these mechanisms allow for transportability. We use causal inference to estimate modular explanations, i.e. mechanisms. Here, the causal variables are assumed to be known and mostly observed.

In biology:

- We can observe biological systems in different and ever-increasing detail.
- We can collect a lot of data.
- We can perform interventions via drugs, gene knockouts etc.
- However, causal variables are not observed directly.

Thought experiment: What would it take to build an AI bench scientist? Some desiderata:

- Map signals like images to abstract variables.
- Need to work in unlabelled settings.

We start with non-identifiability in autoencoders.

- Given an encoder \(g^{-1}(x)\) and decoder \(g(z)\) satisfying the constraint \(g \circ g^{-1}(x) = x\), any invertible nonlinear map \(a\) yields another valid pair: encoder \(a \circ g^{-1}(x)\) and decoder \(g \circ a^{-1}(z)\), since \(g \circ a^{-1} \circ a \circ g^{-1}(x) = x\).
- Assume linear mixing with independent latents, i.e. \(p(Z)=\prod_{i} p(Z_i)\) and \(X=AZ\). Recovering \(Z\) amounts to asking whether the solution set \(\{(\hat{Z},\hat{A}) : X=\hat{A}\hat{Z} \text{ and } \forall i \neq j, \langle \hat{Z}_i,\hat{Z}_j \rangle = 0\}\) contains exactly one element. Given the perfect solution \(\hat{Z}=A^{-1}X\), left-multiplying by any rotation matrix \(R\) gives another valid solution \((R\hat{Z}, \hat{A}R^\top)\), since rotations preserve uncorrelatedness; so we still cannot unambiguously recover the latents.
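The rotation argument can be checked numerically. In this sketch the mixing matrix `A` and the angle `theta` are arbitrary illustrative choices, not anything from the tutorial:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent standard-normal latents, linear mixing X = A Z.
n = 100_000
Z = rng.standard_normal((2, n))
A = np.array([[2.0, 1.0], [0.5, 1.5]])  # illustrative mixing matrix
X = A @ Z

# "Perfect" solution: Z_hat = A^{-1} X recovers the latents exactly.
Z_hat = np.linalg.inv(A) @ X

# Rotate the recovered latents by any angle theta: the pair
# (R Z_hat, A R^T) still reconstructs X, and the rotated latents
# remain uncorrelated, so the constraint set has more than one element.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
Z_rot = R @ Z_hat
A_rot = A @ R.T

print(np.allclose(A_rot @ Z_rot, X))         # True: both factorizations fit X
print(abs(np.corrcoef(Z_hat)[0, 1]) < 1e-2)  # True: original estimate uncorrelated
print(abs(np.corrcoef(Z_rot)[0, 1]) < 1e-2)  # True: rotated estimate also uncorrelated
```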

To generalize this, for now suppose \(p(Z) = \prod_{i} p(Z_i)\) and \(X=g(Z)\) where \(g\) is injective. Could there be a nonlinear transformation \(\hat{z} = a(z)\) that entangles \(z_i, z_j\) while preserving the independent factorization? Yes: such an \(a\) can be constructed explicitly using the coordinate-wise CDFs and the inverse normal CDF (Gaussianize each coordinate, rotate, map back). This means IID data is not enough.
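A minimal sketch of this construction, using only the standard library's normal CDF; the uniform latents and the rotation angle are illustrative assumptions:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
n = 20_000

# Independent uniform latents, so each marginal CDF is F(z) = z.
z = rng.uniform(size=(2, n))

nd = NormalDist()
phi = np.vectorize(nd.cdf)          # standard normal CDF
phi_inv = np.vectorize(nd.inv_cdf)  # inverse normal CDF

# Step 1: Gaussianize coordinate-wise with the inverse normal CDF.
u = phi_inv(z)

# Step 2: rotate. A rotation of an isotropic Gaussian is again an
# isotropic Gaussian, so the coordinates stay independent.
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
u_rot = R @ u

# Step 3: map back to uniforms with the normal CDF. The composite map
# a = step3 ∘ step2 ∘ step1 is nonlinear, its output has the same
# independent-uniform distribution as z, yet each output coordinate
# mixes both original latents.
z_new = phi(u_rot)

print(abs(np.corrcoef(z_new)[0, 1]) < 2e-2)        # True: still uncorrelated
print(abs(np.corrcoef(z_new[0], z[0])[0, 1]) > 0.3)  # True: entangled with originals
```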

- Time Contrastive Learning (TCL; Hyvärinen & Morioka, 2016). Assume \(p(z|t) = \prod_{i} p(z_i|t)\) and a nonlinear mixing map (same input and output dimension) \(x=g(z)\). Split the time series into \(T\) segments and train a classifier (e.g. a neural network) to predict which segment each sample came from. The optimal hidden representation happens to be a linear function of the source signals.
- iVAE (Khemakhem et al., 2020). Same exponential-family prior as above, but additionally assume observation noise, i.e. \(x=g(z)+\epsilon\).
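The shared generative setup of these two bullets can be sketched as follows; the segment count, the segment-wise scales, and the `tanh` mixing are toy choices made for illustration (TCL itself only requires an invertible \(g\)):

```python
import numpy as np

rng = np.random.default_rng(2)

# T segments, each with its own per-component variance, so
# p(z | t) = prod_i p(z_i | t) is nonstationary across segments.
T, seg_len, d = 5, 1000, 2
scales = rng.uniform(0.5, 3.0, size=(T, d))  # segment-wise std devs
z = np.concatenate([s * rng.standard_normal((seg_len, d)) for s in scales])
t = np.repeat(np.arange(T), seg_len)         # segment label per sample

# Toy nonlinear mixture with equal input/output dimension.
W = rng.standard_normal((d, d))
x = np.tanh(z @ W.T)

# TCL trains a classifier to predict t from x; the segment-wise
# variance differences are what make t predictable at all. For iVAE,
# one would additionally add observation noise: x + eps.
per_segment_var = np.array([z[t == k].var(axis=0) for k in range(T)])
print(per_segment_var.round(2))  # variances clearly differ across segments
```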