# Mutual Information

For (discrete) random variables \(X,Y\) with joint probability mass function (pmf) \(p_{XY}\) and marginal pmfs \(p_X,p_Y\), their **Mutual Information** (MI) is defined as
\[I(X;Y) = \sum_{x,y} p_{XY}(x,y)\log \Big(\frac{p_{XY}(x,y)}{p_X(x)p_Y(y)}\Big)=\mathbb{E}_{p_{XY}}\log \Big[\frac{p_{XY}(x,y)}{p_X(x)p_Y(y)}\Big]\]
Note that this is symmetric in its arguments: \(I(X;Y)=I(Y;X)\).
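As a quick numerical check, MI can be computed directly from a joint pmf table. The \(2\times 2\) pmf below is a hypothetical example; values are in nats (natural log):

```python
import numpy as np

# Hypothetical joint pmf of (X, Y) over {0,1} x {0,1}
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1)  # marginal pmf of X
p_y = p_xy.sum(axis=0)  # marginal pmf of Y

# I(X;Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) ), skipping zero cells
mask = p_xy > 0
mi = np.sum(p_xy[mask] * np.log(p_xy[mask] / np.outer(p_x, p_y)[mask]))

# Symmetry: the transposed table (i.e. I(Y;X)) gives the same value
mi_t = np.sum(p_xy.T[mask.T] * np.log(p_xy.T[mask.T] / np.outer(p_y, p_x)[mask.T]))
print(mi, mi_t)
```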

As \(H(X)=I(X;X)\), the Shannon Entropy \(H\) is sometimes called the *self-information* of \(X\).

We also have the following formulations for \(I(X;Y)\), assuming all entropies and conditional entropies are finite (to avoid \(\infty - \infty\)): \[ I(X;Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)=H(X)+H(Y)-H(X,Y)\]
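These identities are easy to verify numerically. A minimal sketch with a hypothetical joint pmf (all quantities in nats):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (nats) of a pmf array; zero cells contribute 0."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Hypothetical joint pmf of (X, Y)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

h_x, h_y, h_xy = entropy(p_x), entropy(p_y), entropy(p_xy.ravel())

mi_1 = h_x - (h_xy - h_y)   # H(X) - H(X|Y), using H(X|Y) = H(X,Y) - H(Y)
mi_2 = h_y - (h_xy - h_x)   # H(Y) - H(Y|X)
mi_3 = h_x + h_y - h_xy     # H(X) + H(Y) - H(X,Y)
print(mi_1, mi_2, mi_3)     # all three formulations agree
```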

The last formula is reminiscent of set-additive functions \(\mu\) over sets \(A,B\) where \(\mu(A\cap B)=\mu(A)+\mu(B)-\mu(A\cup B)\). More generally, this analogy extends further:

Information theory | Set theory
---|---
\(I\) or \(H\) | \(\mu\) (set additive function)
\(,\) | \(\cup\)
\(;\) | \(\cap\)
\(\mid\) | \(-\) i.e. \(A-B=A\cap B^C\)

For random variables \(X,Y,Z\), the **conditional mutual information** between \(X,Y\) conditioned on \(Z\) is
\[I(X;Y|Z)=\sum_{x,y,z}p_{XYZ}(x,y,z)\log \Big(\frac{p(x,y|z)}{p(x|z)p(y|z)}\Big) = \mathbb{E}_{p_{XYZ}}\log \Big[ \frac{p(x,y|z)}{p(x|z)p(y|z)}\Big]\]

Even though conditioning never increases entropy, i.e. \(H(Y|X)\leq H(Y)\), the analogous statement fails for mutual information: in general, \(I(X;Y|Z)\nleq I(X;Y)\).

Decomposing the joint distribution, we could rewrite this as \(I(X;Y|Z)=\sum_z p(z)I(X;Y|Z=z)\).
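The classic counterexample uses independent fair bits \(X,Y\) with \(Z=X\oplus Y\): then \(I(X;Y)=0\), yet \(I(X;Y|Z)=\log 2\), since given \(Z\) each variable determines the other. A sketch, computing \(I(X;Y|Z)\) by averaging \(I(X;Y|Z=z)\) over \(z\) as above:

```python
import numpy as np
from itertools import product

def mi_from_joint(p):
    """I(X;Y) in nats from a 2-D joint pmf array."""
    px, py = p.sum(axis=1), p.sum(axis=0)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / np.outer(px, py)[mask]))

# X, Y independent fair bits: joint of (X, Y) is uniform on {0,1}^2, so I(X;Y) = 0
p_xy = np.full((2, 2), 0.25)
mi_xy = mi_from_joint(p_xy)

# Z = X xor Y is a fair bit; given Z = z, (X, Y) is uniform over the two
# pairs with x ^ y == z, so I(X;Y|Z=z) = log 2.  Average over z with p(z) = 1/2.
mi_cond = 0.0
for z in (0, 1):
    p_given_z = np.zeros((2, 2))
    for x, y in product(range(2), repeat=2):
        if x ^ y == z:
            p_given_z[x, y] = 0.5
    mi_cond += 0.5 * mi_from_joint(p_given_z)

print(mi_xy, mi_cond)  # 0 and log 2
```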

The **chain rule for mutual information** states that \(I(X_1,X_2,\cdots,X_n;Y)=\sum_{i=1}^n I(X_i;Y\mid X_1,\cdots,X_{i-1})\). Similarly, the **conditional chain rule for mutual information** simply conditions both sides on another random variable \(Z\).

An important inequality here is the **data processing inequality** which states that given a Markov Chain \(X\to Y \to Z\), \(I(X;Y)\geq I(X;Z)\) i.e. processing data after receiving it cannot increase information.
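A sketch checking the data processing inequality on a randomly drawn discrete Markov chain \(X\to Y\to Z\) (the Dirichlet-sampled channels below are arbitrary, purely for illustration):

```python
import numpy as np

def mi_from_joint(p):
    """I(X;Y) in nats from a 2-D joint pmf array."""
    px, py = p.sum(axis=1), p.sum(axis=0)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / np.outer(px, py)[mask]))

rng = np.random.default_rng(0)

# Random chain X -> Y -> Z: a source p(x) and channels p(y|x), p(z|y)
p_x = rng.dirichlet(np.ones(3))
p_y_given_x = rng.dirichlet(np.ones(3), size=3)   # row x: pmf of Y given X = x
p_z_given_y = rng.dirichlet(np.ones(3), size=3)   # row y: pmf of Z given Y = y

p_xy = p_x[:, None] * p_y_given_x                 # p(x, y) = p(x) p(y|x)
p_xz = p_xy @ p_z_given_y                         # p(x, z) = sum_y p(x,y) p(z|y)

mi_xy = mi_from_joint(p_xy)
mi_xz = mi_from_joint(p_xz)
print(mi_xy, mi_xz)  # I(X;Y) >= I(X;Z): processing Y into Z loses information
```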

We also have the following **convexity of mutual information** result: for fixed \(p(x)\), \(I(X;Y)\) is a convex functional of \(p(y|x)\). Dually, the **concavity of mutual information** states that for fixed \(p(y|x)\), \(I(X;Y)\) is a concave functional of \(p(x)\).

All the Shannon entropy measures can be rewritten as special cases of conditional mutual information. To see this, let \(\Phi\) be a constant random variable; then \(H(X)=I(X;X|\Phi)\), \(H(X|Y)=I(X;X|Y)\), and \(I(X;Y)=I(X;Y|\Phi)\). This fact is useful when defining information measures in terms of signed measures.

## Estimation

[Notes from here] The following black-box lower bounds on MI start from its KL-divergence formulation \(I(X;Y)=D_{KL}(p_{XY}\,\|\,p_X p_Y)\):

- Nguyen-Wainwright-Jordan (NWJ) Lower Bound
- Donsker-Varadhan (DV) Lower Bound
- InfoNCE Lower Bound
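As a sketch of how such bounds work, the DV bound \(I(X;Z)\geq \mathbb{E}_{p(x,z)}[T]-\log \mathbb{E}_{p(x)p(z)}[e^{T}]\) holds for any critic \(T(x,z)\), with equality at \(T^*=\log\frac{p(x,z)}{p(x)p(z)}\). On a small discrete example this can be checked exactly (the pmf and the critic perturbation below are hypothetical):

```python
import numpy as np

# Hypothetical joint pmf of (X, Z)
p_xz = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
p_x, p_z = p_xz.sum(axis=1), p_xz.sum(axis=0)
prod = np.outer(p_x, p_z)                     # product of marginals p(x)p(z)

mi = np.sum(p_xz * np.log(p_xz / prod))       # true mutual information (nats)

def dv_bound(T):
    """Donsker-Varadhan lower bound for a critic T given as a table."""
    return np.sum(p_xz * T) - np.log(np.sum(prod * np.exp(T)))

T_opt = np.log(p_xz / prod)                   # optimal critic: bound is tight
T_bad = T_opt + np.array([[0.5, -0.5],        # perturbed critic: bound is
                          [0.0,  0.3]])       # strictly below the true MI

print(mi, dv_bound(T_opt), dv_bound(T_bad))
```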

These are however quite weak: any distribution-free lower bound \(B(x_{1:N},\delta)\) on \(\mathbb{ H }[p(x)]\), computed only from samples \(x_{1:N}\) and required to be correct with probability at least \(1-\delta\), satisfies \(B(x_{1:N},\delta)\leq \log(2kN^2)\), so certifying a large entropy requires exponentially many samples. Since \(I(X;Z)=H(X)-H(X|Z)\leq H(X)\) for discrete \(X\), this hardness carries over to lower-bounding the relevant MI.

We can get better lower bounds by relaxing the black-box restriction. One way is to avoid lower-bounding intractable entropies (like the marginal entropy in VAEs, or posterior entropies), or equivalently to assume that one of the marginals (e.g. the prior in VAEs) is known. Avoiding lower bounds on entropies and instead relying on upper bounds on the conditional entropy \(\mathbb{ H }[p(z|x)]\) gives the Barber-Agakov bound. A multisample lower bound also exists, where taking \(K=0\) for the multisample part recovers the Barber-Agakov bound.

Upper bounds on MI can be obtained using the Importance-Weighted Autoencoder (IWAE) lower bound on \(\log p(x)\).