Mutual Information

For (discrete) random variables \(X,Y\) with joint probability mass function (pmf) \(p_{XY}\) and marginal pmfs \(p_X,p_Y\), their Mutual Information (MI) is defined as \[I(X;Y) = \sum_{x,y} p_{XY}(x,y)\log \Big(\frac{p_{XY}(x,y)}{p_X(x)p_Y(y)}\Big)=\mathbb{E}_{p_{XY}}\log \Big[\frac{p_{XY}(x,y)}{p_X(x)p_Y(y)}\Big]\] Note that MI is symmetric: \(I(X;Y)=I(Y;X)\).
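As a concrete illustration, here is a minimal numpy sketch (the joint pmf below is just an arbitrary illustrative example) computing \(I(X;Y)\) directly from a tabulated joint pmf:

#+begin_src python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in nats, from a joint pmf tabulated as p_xy[x, y]."""
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal p_X(x)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal p_Y(y)
    mask = p_xy > 0                         # convention: 0 log 0 = 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask])))

# Arbitrary illustrative joint pmf with dependent X and Y
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
print(mutual_information(p_xy))             # ~0.193 nats; same value for I(Y;X)
#+end_src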

As \(H(X)=I(X;X)\), the Shannon Entropy \(H\) is sometimes called the self-information of \(X\).

We also have the following formulations for \(I(X;Y)\), assuming all entropies and conditional entropies are finite (to avoid \(\infty - \infty\)): \[ I(X;Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)=H(X)+H(Y)-H(X,Y)\]
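For instance, the first identity follows directly from the definition by factoring \(p_{XY}(x,y)=p_Y(y)\,p_{X|Y}(x|y)\): \[ I(X;Y)=\mathbb{E}_{p_{XY}}\log \Big[\frac{p_{X|Y}(x|y)}{p_X(x)}\Big] = -\mathbb{E}_{p_X}\log p_X(x) + \mathbb{E}_{p_{XY}}\log p_{X|Y}(x|y) = H(X)-H(X|Y)\] and the other two follow by symmetry and by expanding \(H(X|Y)=H(X,Y)-H(Y)\).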

The last formula is reminiscent of additive set functions \(\mu\), for which \(\mu(A\cap B)=\mu(A)+\mu(B)-\mu(A\cup B)\). This analogy extends further:

| Information theory      | Set theory                          |
|-------------------------+-------------------------------------|
| \(I\) or \(H\)          | \(\mu\) (an additive set function)  |
| \(,\) (joint)           | \(\cup\)                            |
| \(;\)                   | \(\cap\)                            |
| \(\mid\) (conditioning) | \(-\), i.e. \(A-B=A\cap B^C\)       |

For random variables \(X,Y,Z\), the conditional mutual information between \(X\) and \(Y\) given \(Z\) is \[I(X;Y|Z)=\sum_{x,y,z}p_{XYZ}(x,y,z)\log \Big(\frac{p(x,y|z)}{p(x|z)p(y|z)}\Big) = \mathbb{E}_{p_{XYZ}}\log \Big[ \frac{p(x,y|z)}{p(x|z)p(y|z)}\Big]\]

Even though conditioning never increases entropy, i.e. \(H(Y|X)\leq H(Y)\), the analogous statement does not hold for mutual information: in general \(I(X;Y|Z)\nleq I(X;Y)\).

Decomposing the joint distribution, we can rewrite this as \(I(X;Y|Z)=\sum_z p(z)I(X;Y|Z=z)\).
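As a concrete demonstration of both the decomposition and the failure of \(I(X;Y|Z)\leq I(X;Y)\), here is a minimal sketch of the classic counterexample (the construction with \(X,Y\) independent fair bits and \(Z=X\oplus Y\) is just one illustrative choice), giving \(I(X;Y)=0\) but \(I(X;Y|Z)=1\) bit:

#+begin_src python
import numpy as np

# X, Y independent fair bits; Z = X XOR Y
p_xyz = np.zeros((2, 2, 2))
for x in range(2):
    for y in range(2):
        p_xyz[x, y, x ^ y] = 0.25

def mi(p_xy):
    """I(X;Y) in bits from a 2-D joint pmf."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    m = p_xy > 0
    return float(np.sum(p_xy[m] * np.log2(p_xy[m] / (p_x * p_y)[m])))

i_xy = mi(p_xyz.sum(axis=2))                                   # I(X;Y)   = 0
i_xy_z = sum(p_xyz[:, :, z].sum() * mi(p_xyz[:, :, z] / p_xyz[:, :, z].sum())
             for z in range(2))                                # I(X;Y|Z) = 1
print(i_xy, i_xy_z)
#+end_src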

The chain rule for mutual information states that \(I(X_1,X_2,\cdots,X_n;Y)=\sum_{i=1}^n I(X_i;Y|X_1,\cdots,X_{i-1})\). Similarly, the conditional chain rule is obtained by conditioning both sides on another random variable \(Z\): \(I(X_1,\cdots,X_n;Y|Z)=\sum_{i=1}^n I(X_i;Y|X_1,\cdots,X_{i-1},Z)\).

An important inequality here is the data processing inequality which states that given a Markov Chain \(X\to Y \to Z\), \(I(X;Y)\geq I(X;Z)\) i.e. processing data after receiving it cannot increase information.

We also have the following convexity of mutual information result: for fixed \(p(x)\), \(I(X;Y)\) is a convex functional of \(p(y|x)\). The concavity of mutual information is the complementary statement: for fixed \(p(y|x)\), \(I(X;Y)\) is a concave functional of \(p(x)\).

All the Shannon Entropy measures can be written as special cases of conditional mutual information. To see this, let \(\Phi\) be a constant random variable; then \(H(X)=I(X;X|\Phi)\), \(H(X|Y)=I(X;X|Y)\), and \(I(X;Y)=I(X;Y|\Phi)\). This fact is useful when defining information measures in terms of signed measures.

Estimation

[Notes from here] The following black-box lower bounds on MI start from its definition as a KL divergence (a sketch of how they are evaluated follows the list):

  • Nguyen-Wainwright-Jordan (NWJ) Lower Bound
  • Donsker-Varadhan (DV) Lower Bound
  • InfoNCE Lower Bound
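As a sketch of how such bounds are evaluated in practice, suppose we already have a critic \(f(x,y)\) (typically a trained network; the critic and the function names below are assumptions for illustration) and a \(K\times K\) score matrix with entry \((i,j)\) equal to \(f(x_i,y_j)\) over \(K\) pairs \((x_i,y_i)\) drawn from the joint:

#+begin_src python
import numpy as np
from scipy.special import logsumexp

def infonce_bound(scores):
    """InfoNCE: log K + average log-softmax of the paired score along each row."""
    K = scores.shape[0]
    log_probs = np.diag(scores - logsumexp(scores, axis=1, keepdims=True))
    return log_probs.mean() + np.log(K)

def dv_bound(scores):
    """Donsker-Varadhan: E_p[f] - log E_q[e^f], with q ~ p(x)p(y)."""
    K = scores.shape[0]
    joint = np.diag(scores)                   # scores on samples from p(x, y)
    off = scores[~np.eye(K, dtype=bool)]      # off-diagonal pairs approximate p(x)p(y)
    return joint.mean() - (logsumexp(off) - np.log(off.size))

def nwj_bound(scores):
    """Nguyen-Wainwright-Jordan: E_p[f] - E_q[e^(f - 1)]."""
    K = scores.shape[0]
    joint = np.diag(scores)
    off = scores[~np.eye(K, dtype=bool)]
    return joint.mean() - np.exp(off - 1.0).mean()
#+end_src

In practice the critic is trained by maximizing one of these objectives, and the maximized value is reported as the MI estimate.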

These are, however, quite weak: any distribution-free lower bound \(B(x_{1:N},\delta)\) on the entropy \(\mathbb{H}[p(x)]\) that is computed only from samples \(x_{1:N}\) and is correct with probability at least \(1-\delta\) satisfies \(B(x_{1:N},\delta)\leq \log(2kN^2)\), so certifying a large value requires exponentially many samples. Since \(I(X;Z)=H(X)-H(X|Z)\leq H(X)\) for discrete \(X\), with equality when \(Z=X\), any such lower bound on MI is also a lower bound on entropy, and the same hardness applies to estimating the relevant MI.

We can get better lower bounds by relaxing the black-box restriction. This can be done by not lower-bounding intractable entropies (like the marginal entropy in VAEs or posterior entropies), or equivalently by assuming one of the marginals (e.g. the prior in VAEs) to be known. Avoiding lower bounds on entropies and instead relying on upper bounds on the conditional entropy \(\mathbb{H}[p(z|x)]\) gives the Barber-Agakov bound. A multisample lower bound also exists; taking \(K=0\) for the multisample part recovers the Barber-Agakov bound.
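A minimal sketch of the Barber-Agakov bound under these assumptions (the entropy of one marginal, written \(H(Z)\) here, is known, and \(q(z|x)\) is any tractable variational conditional; both inputs below are assumed to be supplied by the surrounding model): \(I(X;Z)\geq H(Z)+\mathbb{E}_{p(x,z)}[\log q(z|x)]\).

#+begin_src python
import numpy as np

def barber_agakov_bound(entropy_z, log_q_z_given_x):
    """BA lower bound: I(X;Z) >= H(Z) + E_{p(x,z)}[log q(z|x)].

    entropy_z       -- H(Z), assumed known (e.g. the prior entropy in a VAE)
    log_q_z_given_x -- array of log q(z_i | x_i) evaluated on joint samples,
                       for some tractable variational conditional q(z|x)
    """
    return entropy_z + np.mean(log_q_z_given_x)
#+end_src

The gap of this bound is \(\mathbb{E}_{p(x)}\,\mathrm{KL}\big(p(z|x)\,\|\,q(z|x)\big)\), so it is tight when \(q(z|x)\) equals the true conditional.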

Upper bounds on MI can be obtained using the Importance-Weighted Autoencoder (IWAE) lower bound on \(\log p(x)\).
