Causality - 2009
Details
Title: Causality. Author(s): Pearl, Judea. Link(s): https://doi.org/10.1017/cbo9780511803161
Chapter 1: Introduction to Probabilities, Graphs and Causal Models
Why use probability when talking about causation?
- Causal statements are often made in situations when there is uncertainty. E.g. "You will fail the course because of your laziness" means to say that the antecedent only makes the consequence more likely, not certain.
- Most causal statements in natural language have exceptions, which are hard to handle with the standard rules of deterministic logic. E.g. "1. My neighbour's roof gets wet whenever mine does. 2. If I hose my roof it will get wet." Here deterministic logic yields the absurd conclusion that my neighbour's roof gets wet whenever I hose my own. With deterministic logic we would have to rewrite statement 1 to list all its exceptions, whereas probability theory can handle these exceptions.
Equipped with probability theory, we can focus on the main topics in causality:
- Inference.
- Interventions.
- Identification.
- Ramification.
- Confounding.
- Counterfactuals.
- Explanation.
This book focuses on discrete probability; see Feller 1950, Hoel et al. 1971, and the appendix to Suppes 1970 for additional background.
[Bayes theorem]
[Odds ratio]
[Likelihood ratio]
The posterior indicates the belief about our hypothesis \(H\) after observing evidence \(e\), i.e. \(p(H|e)\). The posterior odds ratio \(\frac{p(H|e)}{p(\neg H|e)}\) is an important object. Let \(O(H)=\frac{p(H)}{p(\neg H)}=\frac{p(H)}{1-p(H)}\) be the prior odds ratio, and \(L(e|H)=\frac{p(e|H)}{p(e|\neg H)}\) be the likelihood ratio. Then the posterior odds ratio is \(O(H|e)=L(e|H)O(H)\). In epidemiology, \(H\) is exposure and \(e\) is disease, hence the posterior odds represent the odds that a person with disease \(e\) has been exposed to \(H\).
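A quick numeric sketch of the odds-form update (all numbers below are made up for illustration):

```python
# Odds-form Bayesian update, O(H|e) = L(e|H) * O(H); all numbers are made up.
p_H = 0.01             # prior p(H), e.g. probability of exposure
p_e_given_H = 0.9      # p(e | H)
p_e_given_not_H = 0.1  # p(e | ~H)

prior_odds = p_H / (1 - p_H)                      # O(H)
likelihood_ratio = p_e_given_H / p_e_given_not_H  # L(e|H)
posterior_odds = likelihood_ratio * prior_odds    # O(H|e)

# Back to a probability: p(H|e) = O(H|e) / (1 + O(H|e))
p_H_given_e = posterior_odds / (1 + posterior_odds)
print(posterior_odds, p_H_given_e)  # ~0.0909, ~0.0833
```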
[#DOUBT In paragraph below Eq. 1.19, not sure what he is referring to when he mentions the first and second factor]
[Random variables and Expectations]
[Conditional independence and graphoids]
[Graph terminology: Adjacent nodes, paths vs. directed paths, undirected vs. bidirected edges, kinship relations between variables]
Undirected graphs are also called Markov networks, often used to model symmetrical spatial relationships.
[Definition 1.2.1: Markovian Parents] Given an ordered set of variables \(X_1,\cdots, X_n\) and a joint distribution \(p_X\) over them, a set of variables \(PA_j\) constitutes the Markovian parents of \(X_j\) if \(PA_j\) is a minimal subset of the predecessors \(X_1,\cdots,X_{j-1}\) such that \(p_X(X_j|PA_j)=p_X(X_j|X_1,\cdots,X_{j-1})\).
P15: \(PA_j\) is unique whenever \(p_X\) is strictly positive (i.e. there are no logical/definitional constraints).
[Definition 1.2.2: Markovian compatibility] If a probability density \(p_X\) can be factorized relative to a DAG \(G\) i.e. \(p_X(X_1,\cdots,X_n)=\prod_{j}{p_{X_j}(X_j|PA_j)}\) then we say \(G\) represents \(p_X\), or \(G\) and \(p_X\) are compatible, or \(p_X\) is Markov relative to \(G\).
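A minimal numeric sketch of Definitions 1.2.1 and 1.2.2 on a made-up binary chain \(X_1\rightarrow X_2\rightarrow X_3\); the brute-force subset search and the CPT values are my own illustration, not from the book:

```python
import itertools
import numpy as np

# Made-up CPTs for the chain X1 -> X2 -> X3; joint[a, b, c] = p(X1=a, X2=b, X3=c).
p1 = np.array([0.3, 0.7])
p2_given_1 = np.array([[0.8, 0.2], [0.4, 0.6]])
p3_given_2 = np.array([[0.9, 0.1], [0.5, 0.5]])
joint = np.einsum("a,ab,bc->abc", p1, p2_given_1, p3_given_2)

def conditional(joint, target, given):
    """p(target | given) as a table that broadcasts against the full joint."""
    drop = tuple(a for a in range(joint.ndim) if a != target and a not in given)
    marg = joint.sum(axis=drop, keepdims=True) if drop else joint
    return marg / marg.sum(axis=target, keepdims=True)

def markovian_parents(joint, j):
    """Definition 1.2.1: a minimal subset PA_j of the predecessors (axes 0..j-1)
    with p(x_j | pa_j) = p(x_j | x_1, ..., x_{j-1}); smallest subsets tried first."""
    predecessors = list(range(j))
    full = conditional(joint, j, predecessors)
    for size in range(len(predecessors) + 1):
        for subset in itertools.combinations(predecessors, size):
            if np.allclose(conditional(joint, j, list(subset)), full):
                return subset

parents = {j: markovian_parents(joint, j) for j in range(joint.ndim)}
print(parents)  # the chain ordering recovers {0: (), 1: (0,), 2: (1,)}

# Definition 1.2.2: the joint factorizes as prod_j p(x_j | pa_j),
# so the DAG with these parent sets is Markov relative to (compatible with) the joint.
factorized = np.ones_like(joint)
for j, pa in parents.items():
    factorized = factorized * conditional(joint, j, list(pa))
print(np.allclose(joint, factorized))  # True
```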
The set of distributions compatible with a DAG \(G\) can be characterized by listing the set of conditional independencies that each such distribution must satisfy. These conditional independence relations can be read from \(G\) via the d-separation criterion.
Selection bias, Berkson's paradox, explaining-away effect: observing a common consequence of two independent causes (or even a descendant of that consequence, which need not itself lie on a path between the two causes) can make the two causes dependent.
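A tiny made-up illustration of explaining away: two independent binary causes \(A\) and \(B\) with common effect \(C=A\lor B\); conditioning on \(C\) makes the causes dependent:

```python
import itertools

# Two independent binary causes A and B, common effect C = A or B; made-up numbers.
p_a, p_b = 0.5, 0.5
joint = {}
for a, b in itertools.product([0, 1], repeat=2):
    c = int(a or b)
    joint[(a, b, c)] = (p_a if a else 1 - p_a) * (p_b if b else 1 - p_b)

def prob(event, given=lambda k: True):
    num = sum(v for k, v in joint.items() if event(k) and given(k))
    den = sum(v for k, v in joint.items() if given(k))
    return num / den

print(prob(lambda k: k[0] == 1))                                    # p(A=1)           = 0.5
print(prob(lambda k: k[0] == 1, lambda k: k[1] == 1))               # p(A=1 | B=1)     = 0.5 (independent)
print(prob(lambda k: k[0] == 1, lambda k: k[2] == 1))               # p(A=1 | C=1)     ~ 0.667
print(prob(lambda k: k[0] == 1, lambda k: k[2] == 1 and k[1] == 1)) # p(A=1 | C=1,B=1) = 0.5: B explains C away
```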
P18: Figure 1.3(a) gives a good example of colliders when there are bidirected edges.
[Theorem 1.2.4] d-separation implies conditional independence in every distribution compatible with \(G\), and lack of d-separation implies conditional dependence in at least one distribution compatible with \(G\).
See Spirtes et al. 1993, and Section 2.4, 2.9.1 in this book for more information on the converse case mentioned in the theorem above.
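A rough sketch of one standard way to test d-separation (the moralized-ancestral-graph method, not an algorithm from this chapter): take the ancestral graph of \(X\cup Y\cup Z\), moralize it, delete \(Z\), and check whether \(X\) and \(Y\) are disconnected. The DAG encoding and example names are my own:

```python
def ancestors(dag, nodes):
    """All ancestors of `nodes` in the DAG (including the nodes themselves)."""
    result, stack = set(nodes), list(nodes)
    while stack:
        child = stack.pop()
        for parent, kids in dag.items():
            if child in kids and parent not in result:
                result.add(parent)
                stack.append(parent)
    return result

def d_separated(dag, xs, ys, zs):
    """dag: dict node -> set of children. Test whether xs and ys are d-separated by zs."""
    keep = ancestors(dag, set(xs) | set(ys) | set(zs))
    adj = {v: set() for v in keep}
    for parent, kids in dag.items():
        for child in kids & keep:
            adj[parent].add(child)      # undirected version of each retained edge
            adj[child].add(parent)
            co_parents = {p for p, k in dag.items() if child in k}
            for a in co_parents:        # "marry" parents of a common child (moralize)
                for b in co_parents:
                    if a != b:
                        adj[a].add(b)
    # Delete the conditioning set and test reachability between xs and ys.
    reachable, stack = set(), [x for x in xs if x not in zs]
    while stack:
        v = stack.pop()
        if v in reachable:
            continue
        reachable.add(v)
        stack.extend(w for w in adj[v] if w not in zs and w not in reachable)
    return reachable.isdisjoint(ys)

# Collider X -> Z <- Y with a descendant Z -> W (names are illustrative).
dag = {"X": {"Z"}, "Y": {"Z"}, "Z": {"W"}, "W": set()}
print(d_separated(dag, {"X"}, {"Y"}, set()))   # True: path blocked at the collider Z
print(d_separated(dag, {"X"}, {"Y"}, {"Z"}))   # False: conditioning on the collider opens it
print(d_separated(dag, {"X"}, {"Y"}, {"W"}))   # False: so does conditioning on its descendant W
```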
P19: The ordered Markov condition and the parental Markov condition are equivalent characterizations of compatibility between a distribution \(p\) and a DAG \(G\).
[Theorem 1.2.8: Observational equivalence] Two DAGs are observationally equivalent iff they have the same skeletons and v-structures. See Verma and Pearl 1990.
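A small sketch that checks Theorem 1.2.8 directly (the graph encoding and example graphs are my own):

```python
from itertools import combinations

# DAGs as dicts node -> set of parents; example graphs are illustrative.
def skeleton(dag):
    return {frozenset((p, v)) for v, parents in dag.items() for p in parents}

def v_structures(dag):
    skel = skeleton(dag)
    return {(frozenset((a, b)), child)
            for child, parents in dag.items()
            for a, b in combinations(sorted(parents), 2)
            if frozenset((a, b)) not in skel}       # a -> child <- b with a, b non-adjacent

def observationally_equivalent(g1, g2):
    """Theorem 1.2.8: same skeleton and same v-structures."""
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

chain     = {"X": set(), "Y": {"X"}, "Z": {"Y"}}        # X -> Y -> Z
rev_chain = {"X": {"Y"}, "Y": {"Z"}, "Z": set()}        # X <- Y <- Z
collider  = {"X": set(), "Y": {"X", "Z"}, "Z": set()}   # X -> Y <- Z

print(observationally_equivalent(chain, rev_chain))  # True: same skeleton, no v-structures
print(observationally_equivalent(chain, collider))   # False: same skeleton, but a new v-structure
```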
P20: Join-tree method and cut-set conditioning is mentioned as examples of algorithms for computing probabilities given DAGs and conditional distributions.
P21: First sentence under Section 1.3 Causal Bayesian Networks: "The interpretation of DAGs as carriers of independence assumptions does not necessarily imply causation; in fact, it will be valid for any set of recursive independencies along any ordering of the variables - not necessarily causal or chronological."
Reasons to build DAG models around causal rather than associational information:
- More meaningful, accessible and hence reliable judgements are required in their construction. This can be appreciated by trying to construct a DAG of Fig 1.2 with the ordering \(X_5,X_1,X_3,X_2,X_4\).
- Allows us to model external/spontaneous changes (interventions). Local changes to environmental mechanisms can be represented by local changes to the network topology. If, instead of the causal directions, the network had been built along the ordering \(X_5,X_1,X_3,X_2,X_4\), representing the same local change would require much more effort.
The flexibility to model interventions comes from the assumption that parent-child relationships in the network represent stable, autonomous mechanisms, i.e. it is conceivable to change one such relationship without changing the others. Organizing knowledge in such a modular way allows us to predict the effects of interventions with minimal extra information. In contrast, without causal graphs, effects under interventions cannot be predicted even if the joint distribution is fully specified.
Predictions under interventions involve distributions of the form \(p(Y|\text{do}(X=x))\) - where the conditioning is done in a mutilated causal graph. This is different from \(P(Y|X=x)\).
[Definition 1.3.1: Causal Bayesian Network] Let \(p(v)\) be a distribution and \(p_x(v)\) denote the distribution resulting from the intervention \(\text{do}(X=x)\). Denote \(\mathbf{ p }^*\) as the set of all interventional distributions (which includes no interventions, i.e. when \(X=\emptyset\)). A DAG \(G\) is a causal Bayesian network compatible with \(\mathbf{ p }^*\) iff:
- \(p_x(v)\) is Markov relative to \(G\)
- \(p_x(v_i)=1\) for all \(V_i \in X\) whenever \(v_i\) is consistent with \(X=x\).
- \(p_x(v_i|pa_i)=p(v_i|pa_i)\) for all \(V_i\notin X\) whenever \(pa_i\) is consistent with \(X=x\), i.e. each conditional distribution \(p(v_i|pa_i)\) is invariant to interventions not involving \(V_i\).
The constraints in Definition 1.3.1 imply the truncated factorization \(p_x(v) = \prod_{i \mid V_i\notin X}p(v_i|pa_i)\).
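A minimal numeric sketch of the truncated factorization on a made-up confounded DAG \(Z\rightarrow X\), \(Z\rightarrow Y\), \(X\rightarrow Y\), contrasting \(p(y\mid x)\) with \(p(y\mid\text{do}(x))\):

```python
import numpy as np

# Made-up CPTs for the confounded DAG Z -> X, Z -> Y, X -> Y (all binary).
p_z = np.array([0.6, 0.4])
p_x_given_z = np.array([[0.9, 0.1],                  # rows z, cols x
                        [0.2, 0.8]])
p_y_given_zx = np.array([[[0.8, 0.2], [0.6, 0.4]],   # z=0: rows x, cols y
                         [[0.5, 0.5], [0.1, 0.9]]])  # z=1: rows x, cols y

joint = np.einsum("z,zx,zxy->zxy", p_z, p_x_given_z, p_y_given_zx)  # p(z, x, y)

# Observational conditioning on X=1.
p_y_given_x1 = joint[:, 1, :].sum(axis=0) / joint[:, 1, :].sum()

# Intervention do(X=1): drop the factor p(x|z) and fix x=1 (truncated factorization).
p_y_do_x1 = np.einsum("z,zy->y", p_z, p_y_given_zx[:, 1, :])

print("p(Y=1 | X=1)     =", p_y_given_x1[1])   # ~0.821 (Z confounds X and Y)
print("p(Y=1 | do(X=1)) =", p_y_do_x1[1])      # 0.6
```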
Two important properties for a causal Bayesian Network \(G\) with distribution \(p\):
- \(p(v_i|pa_i) = p_{pa_i}(v_i)\).
- \(p_{pa_i,s}(v_i) = p_{pa_i}(v_i)\) for subsets \(S\) disjoint from \(\{V_i,PA_i\}\). Represents invariance: once the direct causes (i.e. parents) are held fixed, no other interventions can influence \(V_i\).
We see causal relationships as more stable than probabilistic relationships: causal relationships are ontological (they describe objective physical constraints in the world), whereas probabilistic relationships are epistemic (they reflect what we know/believe about the world).
P25: Mechanism stability is also at the heart of explanatory accounts of causality, where causal models do not necessarily encode behaviour under intervention but rather aim to provide an explanation/understanding of how the data are generated.
[#TODO Continue starting from P26]
Anki
Describe an example where 2 DAGs are Markov Equivalent.
Rough Notes
- In a Bayesian Network (BN) the set \(PA_j\) is unique when the distribution is strictly positive.
- The construction of a BN given an ordering and a distribution over the variables \(X_{[p]}\) is done by setting each parent set \(PA_j\) to a minimal set of predecessors of \(X_j\) that renders \(X_j\) independent of all other predecessors \(X_{[p]}\setminus(PA_j \cup \{X_j\})\), i.e. \(PA_j\) is a minimal subset of \(X_1,\cdots,X_{j-1}\) satisfying \(p(X_j|PA_j)=p(X_j|X_1,\cdots,X_{j-1})\). Hence the BN carries conditional independence information by construction.
- Markov compatibility definition.
- Different names for collider phenomena include selection bias, Berkson's paradox, and the explaining-away effect.
- Fig 1.3(a) is an interesting case of a graph with bidirected edges, \(X\rightarrow Z_1\leftrightarrow Z_3\leftarrow Y\) together with \(Z_1\leftarrow Z_2\leftarrow Z_3\), where conditioning on \(Z_1\) opens pathways through two colliders, since that variable is both part of a collider and a descendant of another collider at \(Z_3\).
- d-separation based on ancestral graphs is mentioned, seems interesting and relevant.
- Theorems 1.2.6 and 1.2.7 define the ordered and parental (local) Markov properties, and Pearl mentions that the local Markov property is sometimes taken as the definition of a Bayesian network.
- Frydenberg 1990 presents a criterion for Markov equivalence classes (MECs) in the context of chain graphs.