Kullback-Leibler (KL) Divergence

The Kullback-Leibler (KL) divergence (also called relative entropy or information divergence) between two discrete distributions \(p,q\) defined over a common sample space \(\mathcal{X}\) is \[ D_{KL}(p||q)=\sum_x p(x)\log \Big( \frac{p(x)}{q(x)} \Big) = \mathbb{E}_p \Big[ \log \frac{p(x)}{q(x)} \Big]\] Here the summation is taken over \(\text{supp}(p)=\{x \in \mathcal{X} : p(x)>0\}\), and we adopt the convention \(c\log\Big(\frac{c}{0}\Big)=\infty\) for \(c>0\). This convention implies that if \(D_{KL}(p||q)<\infty\) then \(p(x)>0 \implies q(x)>0\), and thus \(\text{supp}(p)\subseteq \text{supp}(q)\).
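As a concrete illustration of the definition, here is a minimal Python sketch (the function name, the dictionary representation, and the choice of base-2 logarithms are my own) that computes \(D_{KL}(p||q)\) following the support convention above.

#+begin_src python
import math

def kl_divergence(p, q):
    """D_KL(p||q) in bits, for discrete distributions given as dicts x -> probability.

    The sum runs over supp(p); if p(x) > 0 but q(x) = 0 the divergence is infinite,
    matching the convention c*log(c/0) = infinity.
    """
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue  # terms outside supp(p) contribute nothing
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf  # supp(p) is not contained in supp(q)
        total += px * math.log2(px / qx)
    return total

# Example: a fair coin versus a biased coin.
p = {"H": 0.5, "T": 0.5}
q = {"H": 0.25, "T": 0.75}
print(kl_divergence(p, q))  # ~0.2075 bits
print(kl_divergence(p, p))  # 0.0, since the two distributions are equal
#+end_src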

The KL divergence is not a metric since it is not symmetric and does not satisfy the triangle inequality.
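For a concrete illustration of the asymmetry (the distributions here are chosen purely for illustration), take \(p=(\tfrac{1}{2},\tfrac{1}{2})\) and \(q=(\tfrac{1}{4},\tfrac{3}{4})\) on a binary alphabet. With base-2 logarithms, \[ D_{KL}(p||q)=\tfrac{1}{2}\log\tfrac{1/2}{1/4}+\tfrac{1}{2}\log\tfrac{1/2}{3/4}\approx 0.2075, \qquad D_{KL}(q||p)=\tfrac{1}{4}\log\tfrac{1/4}{1/2}+\tfrac{3}{4}\log\tfrac{3/4}{1/2}\approx 0.1887, \] so \(D_{KL}(p||q)\neq D_{KL}(q||p)\).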

A very important inequality is that \(D_{KL}(p||q)\geq 0\), with equality iff \(p=q\); this is called the Gibbs inequality or divergence inequality.
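One standard argument (a sketch, working with the natural logarithm, which only rescales \(D_{KL}\) by a positive constant and so does not affect its sign): using \(\ln t\leq t-1\), with equality iff \(t=1\), applied to \(t=\frac{q(x)}{p(x)}\), \[ \sum_{x\in\text{supp}(p)} p(x)\ln\frac{p(x)}{q(x)} = -\sum_{x\in\text{supp}(p)} p(x)\ln\frac{q(x)}{p(x)} \geq \sum_{x\in\text{supp}(p)} p(x)\Big(1-\frac{q(x)}{p(x)}\Big) = 1-\sum_{x\in\text{supp}(p)} q(x) \geq 0, \] with equality throughout iff \(q(x)=p(x)\) on \(\text{supp}(p)\) and \(\sum_{x\in\text{supp}(p)}q(x)=1\), i.e. iff \(p=q\).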

Another important inequality is the log-sum inequality, which states that given positive numbers \(a_1,a_2,\cdots\) and non-negative numbers \(b_1,b_2,\cdots\) such that \(\sum_i a_i <\infty\) and \(0<\sum_i b_i <\infty\), we have \[\sum_i a_i \log \Big( \frac{a_i}{b_i} \Big) \geq \Big(\sum_i a_i \Big)\log\Big(\frac{\sum_i a_i}{\sum_i b_i}\Big),\] which can be proved using the Gibbs inequality.
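To see how it follows from the Gibbs inequality (a sketch; assume all \(b_i>0\), since any term with \(a_i>0\) and \(b_i=0\) makes the left-hand side infinite): write \(A=\sum_i a_i\) and \(B=\sum_i b_i\), and define distributions \(p_i=a_i/A\) and \(q_i=b_i/B\). Then \[ \sum_i a_i\log\frac{a_i}{b_i} = A\sum_i p_i\log\frac{Ap_i}{Bq_i} = A\,D_{KL}(p||q) + A\log\frac{A}{B} \geq A\log\frac{A}{B} = \Big(\sum_i a_i\Big)\log\Big(\frac{\sum_i a_i}{\sum_i b_i}\Big). \]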

The KL divergence can also be related to the variational distance between two distributions via Pinsker's inequality, which states that \(D_{KL}(p||q)\geq \frac{1}{2\ln(2)}V^2(p,q)\), where \(V(p,q)=\sum_x |p(x)-q(x)|\) is the variational (total variation) distance, which is symmetric in \(p,q\). Consequently, if either of \(D_{KL}(p||q)\) or \(D_{KL}(q||p)\) is small, then so is \(V(p,q)\). In particular, given a sequence of distributions \(q_k\), if either \(D_{KL}(p||q_k)\) or \(D_{KL}(q_k||p)\) tends to 0 as \(k\to \infty\), then \(V(p,q_k)\) converges to 0 as well, i.e. convergence in KL divergence is stronger than convergence in variational distance.
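As a quick numerical sanity check of Pinsker's inequality in the form above (a sketch; the helper names and the randomly drawn test distributions are my own), the following compares \(D_{KL}(p||q)\) in bits against \(\frac{1}{2\ln(2)}V^2(p,q)\) for random pairs of distributions on a small alphabet.

#+begin_src python
import math
import random

def kl_bits(p, q):
    # D_KL(p||q) in bits; assumes supp(p) is contained in supp(q)
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def variational(p, q):
    # V(p,q) = sum_x |p(x) - q(x)|
    return sum(abs(px - qx) for px, qx in zip(p, q))

def random_distribution(n):
    # normalize uniform random weights to get a point on the simplex
    w = [random.random() for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

random.seed(0)
for _ in range(10_000):
    p = random_distribution(5)
    q = random_distribution(5)
    # Pinsker bound: D_KL(p||q) >= V(p,q)^2 / (2 ln 2), up to floating-point slack
    assert kl_bits(p, q) >= variational(p, q) ** 2 / (2 * math.log(2)) - 1e-12
print("Pinsker bound held on all sampled pairs")
#+end_src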
