# Shannon Entropy

Given a (discrete) random variable \(X\) with probability mass function (pmf) \(p\) and sample space/alphabet \(\mathcal{X}\), the **Shannon entropy** \(H(X)\) of \(X\) is defined to be
\[ H(X) = -\sum_{x \in \text{supp}(p)}p(x)\log(p(x))\]
where \(\text{supp}(p)\) denotes the support of \(p\). The units of \(H(X)\) depend on the base of the logarithm: for base 2 the units are bits, and for base \(e\) they are nats.
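As a concrete sketch of the definition (the function name and interface here are my own, not from any particular library), the following computes \(H(X)\) from a pmf given as a list of probabilities, skipping terms outside \(\text{supp}(p)\):

```python
import math

def entropy(pmf, base=2.0):
    """Shannon entropy of a discrete pmf given as an iterable of
    probabilities; terms with p(x) = 0 (outside supp(p)) are skipped."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# A fair coin has entropy 1 bit; a fair die has log2(6) bits.
print(entropy([0.5, 0.5]))   # -> 1.0
print(entropy([1/6] * 6))    # ≈ 2.585 bits
```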

Note that \(H(X)\) is invariant to relabelling of the outcome space, i.e. given two (discrete) random variables \(X,Y\) whose pmfs \(p_X,p_Y\) take the same multiset of values (one is a permutation of the other), we still have \(H(X)=H(Y)\).

Also, \(H(X)\leq \log(|\mathcal{X}|)\) with equality iff \(X\) is uniform over \(\mathcal{X}\). This directly implies that if \(\mathcal{X}\) is finite then \(H(X)\) is also finite. However, there are distributions with an infinite sample space but finite entropy, e.g. \(p_X(i)=2^{-i}\) for \(i \in \{1,2,\dots\}\), whose entropy is 2 bits.
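As a quick numerical sanity check of this example (a sketch, not part of the definition): since \(-\log_2 2^{-i} = i\), the entropy is \(\sum_{i\geq 1} i\,2^{-i} = 2\) bits, and a truncated sum converges there rapidly:

```python
import math

# Truncated entropy sum for p(i) = 2**-i, i = 1, 2, ...
# Each term is 2**-i * i since -log2(2**-i) = i; the series sums to 2.
H = -sum(2.0**-i * math.log2(2.0**-i) for i in range(1, 60))
print(H)  # ≈ 2.0 bits
```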

The **joint entropy** \(H(X,Y)\) of a pair of (discrete) random variables \(X,Y\) with joint pmf \(p_{X,Y}\) is defined as
\[ H(X,Y) = -\sum_{(x,y) \in \text{supp}(p_{X,Y})}p_{X,Y}(x,y)\log(p_{X,Y}(x,y))\]

The **conditional entropy** \(H(Y|X)\) is defined as
\[ H(Y|X) = -\sum_{(x,y) \in \text{supp}(p_{X,Y})}p_{X,Y}(x,y)\log(p_{Y|X}(y|x))\]
By decomposing the joint distribution, this can be rewritten as \(H(Y|X) = \sum_{x \in \text{supp}(p_X)}p_X(x)H(Y|X=x)\). Similarly, we can write \(H(Y|X,Z)=\sum_{z\in \text{supp}(p_Z)}p_Z(z)H(Y|X,Z=z)\).
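As a sketch (the joint pmf below is hypothetical, chosen only for illustration), the following computes \(H(Y|X)\) both directly from the definition and via the decomposition \(\sum_{x}p_X(x)H(Y|X=x)\), and checks that the two routes agree:

```python
import math

def H(probs):
    """Entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint pmf p_{X,Y} on a 2x2 alphabet (illustrative values).
joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}
pX = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}

# Definition: H(Y|X) = -sum p(x,y) log2 p_{Y|X}(y|x), with
# p_{Y|X}(y|x) = p_{X,Y}(x,y) / p_X(x).
HYX_def = -sum(p * math.log2(p / pX[x]) for (x, _), p in joint.items())

# Decomposition: H(Y|X) = sum_x p_X(x) * H(Y | X = x).
HYX_dec = sum(pX[x] * H([joint[(x, y)] / pX[x] for y in (0, 1)])
              for x in (0, 1))

print(abs(HYX_def - HYX_dec) < 1e-12)  # True: both routes agree
```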

An important inequality here is **Fano's inequality**, which states that given two random variables \(X,\hat{X}\) defined over the same outcome space \(\mathcal{X}\), we have \(H(X|\hat{X})\leq H_b(p(X\neq \hat{X}))+p(X\neq \hat{X})\log (|\mathcal{X}|-1)\). Here, \(H_b(p) = -p \log(p) -(1-p)\log(1-p)\) is the binary entropy function. Intuitively, if \(\hat{X}\) is an estimate for \(X\), a smaller error probability \(p(X\neq \hat{X})\) forces \(H(X|\hat{X})\) to be smaller.
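To make the bound concrete, here is a small sketch evaluating the right-hand side for a hypothetical estimator (the alphabet size and error probability are assumed values, not from the text):

```python
import math

def Hb(p):
    """Binary entropy in bits; Hb(0) = Hb(1) = 0 by convention."""
    return 0.0 if p in (0.0, 1.0) else (
        -p * math.log2(p) - (1 - p) * math.log2(1 - p))

# Assumed example: |X| = 4 and error probability P(X != Xhat) = 0.1.
# Fano then caps the conditional entropy at Hb(0.1) + 0.1 * log2(3).
Pe, alphabet_size = 0.1, 4
bound = Hb(Pe) + Pe * math.log2(alphabet_size - 1)
print(bound)  # ≈ 0.627 bits
```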

Importantly, \(H(X|Y)=0\) iff \(X\) is a function of \(Y\). Also, \(H(Y|X)\leq H(Y)\) with equality iff \(X,Y\) are independent, i.e. conditioning does not increase uncertainty.

We also have the following **concavity of entropy** result: given two random variables \(X_1,X_2\) with pmfs \(p_1,p_2\), let \(X\) be a random variable with the mixture pmf \(p = \lambda p_1 + (1-\lambda)p_2\) where \(0\leq \lambda \leq 1\); then \(H(X)\geq \lambda H(X_1)+(1-\lambda)H(X_2)\).
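A quick numerical sketch of this inequality (the two pmfs and the mixing weight below are assumed, illustrative values):

```python
import math

def H(probs):
    """Entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two illustrative pmfs on the same 3-letter alphabet, mixed with lambda = 0.4.
p1 = [0.7, 0.2, 0.1]
p2 = [0.1, 0.1, 0.8]
lam = 0.4
mix = [lam * a + (1 - lam) * b for a, b in zip(p1, p2)]

lhs = H(mix)                              # entropy of the mixture
rhs = lam * H(p1) + (1 - lam) * H(p2)     # average of the entropies
print(lhs >= rhs)  # True: concavity holds
```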

The entropy function is continuous at \(p\) if \(\lim_{p' \to p}H(p')=H(\lim_{p' \to p} p')=H(p)\); equivalently, if \(\forall \epsilon>0 \: \exists \delta>0\) such that \(|H(p)-H(q)|<\epsilon\) for all \(q \in \mathbb{P}^{|\mathcal{X}|}\) satisfying \(V(p,q)=\sum_{x \in \mathcal{X}}|p(x)-q(x)|<\delta\), where \(\mathbb{P}^{|\mathcal{X}|}\) denotes the set of all distributions over \(\mathcal{X}\) and \(V\) is the variational distance. On a finite alphabet \(\mathcal{X}\), entropy is continuous at every \(p\).
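A small numerical sketch of this continuity (the base pmf and perturbation sizes are assumed values): shrinking the variational distance \(V(p,q)\) shrinks \(|H(p)-H(q)|\) along with it.

```python
import math

def H(probs):
    """Entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def V(p, q):
    """Variational (L1) distance between two pmfs on the same alphabet."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Perturb an illustrative pmf by ever-smaller amounts.
p = [0.5, 0.3, 0.2]
for eps in (0.1, 0.01, 0.001):
    q = [0.5 + eps, 0.3 - eps, 0.2]
    print(V(p, q), abs(H(p) - H(q)))
```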

The **chain rule for entropy** states that for (discrete) random variables \(X_1,\dots,X_n\), \(H(X_1,X_2,\cdots,X_n) = \sum_{i=1}^n H(X_i|X_1,\cdots,X_{i-1})\). Similarly, the **conditional chain rule for entropy** simply conditions both sides on another random variable \(Y\): \(H(X_1,\cdots,X_n|Y) = \sum_{i=1}^n H(X_i|X_1,\cdots,X_{i-1},Y)\).
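The chain rule can be verified numerically; in this sketch the three-variable joint pmf is a hypothetical example, and each conditional entropy is computed by averaging over conditioning values rather than by telescoping, so the check is not circular:

```python
import math

def H(probs):
    """Entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint pmf over three binary variables (X1, X2, X3);
# the values are illustrative only.
joint = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.10, (0, 1, 0): 0.15, (0, 1, 1): 0.05,
    (1, 0, 0): 0.05, (1, 0, 1): 0.15, (1, 1, 0): 0.10, (1, 1, 1): 0.20,
}

def marginal(keep):
    """Marginal pmf over the coordinate indices in `keep`."""
    out = {}
    for xs, p in joint.items():
        key = tuple(xs[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

def cond_entropy(target, given):
    """H(X_target | X_given) = sum_g p(g) * H(pmf of X_target given g)."""
    pg = marginal(given)
    pj = marginal(given + (target,))
    return sum(pv * H([pj[g + (t,)] / pv for t in (0, 1) if g + (t,) in pj])
               for g, pv in pg.items())

# Chain rule: H(X1,X2,X3) = H(X1) + H(X2|X1) + H(X3|X1,X2).
chain = cond_entropy(0, ()) + cond_entropy(1, (0,)) + cond_entropy(2, (0, 1))
print(abs(H(joint.values()) - chain) < 1e-12)  # True: the two sides agree
```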