Lost Relatives of the Gumbel Trick - 2017
Details
Title : Lost Relatives of the Gumbel Trick Author(s): Balog, Matej and Tripuraneni, Nilesh and Ghahramani, Zoubin and Weller, Adrian Link(s) : https://proceedings.mlr.press/v70/balog17a.html
Rough Notes
The main setting is the problem of sampling from a discrete probability distribution and evaluatings its normalizing constant. Let \(p\) be a distribution over the discrete sample space \(\mathcal{X}\) provided in terms of its potential function \(\phi : \mathcal{X} \to [-\infty, \infty)\) where \(p(\mathbf{x}) = e^{\phi(\mathbf{x})}/Z\), and \(Z\) is called the partition function, and \(p\) is called the Gibbs distribution on \(\mathcal{X}\) associated with the potential \(\phi\).
The contributions are:
- A family of tricks that give better approximations to \(Z\) via averaging Gumbel permutations in different spaces.
- New upper and lower bounds on the partition function of a discrete graphical model computable via low-rank pertubations alongside sequential samplers for the Gibbs distributions.
- Discussion on a simpler analytic form of the Gumbel trick.
The Gumbel trick: Consider \(N\) independent exponential clocks which start simultaneously and clock \(j\) rings after a random time \(T_j \sim \text{Exp}(\lambda_j)\). Then:
- The time until some clock rings has distribution \(\text{Exp}(\sum_{j=1}^N\lambda_j)\).
- The probability of clock \(j\) ringing first is proportional to \(\lambda_j\).
Considering \(N\) competing exponential clocks \(\{T_x\}_{x \in \mathcal{X}}\) with \(\lambda_x = \tilde{p}(x)\) where \(\tilde{p}\) is the unnormalized density. Then:
- \(\min_{x\in \mathcal{X}} \{T_x\} \sim \text{Exp}(Z)\)
- \(\text{argmin}_{x \in \mathcal{X}} \{T_x\}} \sim p\)
The Gumbel trick involves applying \(g(x) = -\ln x - c\) to both equations above. When \(g\) is applied to an \(\text{Exp}(\lambda)\) random variable, we get a \(\text{Gumbel}(-c + \ln \gamma)\) distribution which can be represented as \(\ln \lambda+ \gamma\) where \(\gamma \sim \text{Gamma}(-c)\).
Define \(\{\gamma(x)\}_{x \in \mathcal{X}} \sim \text{Gumbel}(-c)\). Then:
- \(\max_{x\in \mathcal{X}} \{\phi(x)+\gamma(x)\} \sim \text{Gumbel}(-c +\ln Z)\)
- \(\text{argmax}_{x\in \mathcal{X}}\{\phi(x)+\gamma(x)\} \sim p\)
Where \(\phi(x) = \ln \gamma_x = \ln \tilde{p}(x)\). Since \(\text{Gumbel}(-c + \ln Z)\) has mean \(\ln Z\), the log partition function admits a consistent estimator via sampling.
(#TODO Actually make an extra note of the Gumbel trick and read Bach's blog post).
This paper then proceeds to apply different \(g\) which transforms exponential distributions to other distributions with known means. As the original exponential distribution has rate \(Z\), the transformed distribution will have mean \(f(Z)\) for some \(f\). Further transformations of these \(f(Z)\) give biased estimators for other transformations of \(Z\), including \(Z\) itself. See Table 1 for a summary of this.
Bounds on \(\ln Z\) can be achieved via the sum-unary pertubation MAP value random variable.