Equilibrium Computation and the Foundations of Deep Learning - 2021
Details
Title: Equilibrium Computation and the Foundations of Deep Learning
Author(s): Algorithmic Learning Theory
Link(s): https://www.youtube.com/watch?v=GpaCWKlOMig
Rough Notes
#NOTE The previous note's heading does not match the current video title.
Motivating question: how is it that modern ML can beat humans at Go, yet still cannot handle merging onto highways? An oversimplified view of modern ML is that its remarkable progress comes from advances in deep learning models, combined with well-chosen learning objectives and gradient-based optimization, alongside the availability of powerful hardware and big data. Empirically, in many learning settings gradient descent and its variants discover local minima that generalize well. Going from this, the motivation of the talk is the study of systems which, down the road, will interact with our learning systems and act in the same environment. From an optimization standpoint, agents using gradient descent in an environment with other agents also using gradient descent have a hard time converging in practice (according to the speaker). Conflicts occur even with just 2 learning agents, where the agents control variables \(x_t,y_t\) and one aims to minimize \(f(x,y)\) over \(x\) while the other aims to maximize it over \(y\); accordingly, one agent uses gradient descent while the other uses gradient ascent (or variants of these, respectively). Naive Gradient-Descent-Ascent (GDA) runs into problems when used to train Generative Adversarial Networks (GANs); the presenter shows examples of GANs trained on MNIST and on a finite mixture model whose components are placed equidistantly on a circle.
GDA on \(f(x,y) = x\cdot y\), with objective \(\min_x \max_y f(x,y)\), sheds light on the problems of GDA. Here the min-max equilibrium is \((x^*,y^*)=(0,0)\): if one player does not play 0, the other can push their variable towards \(\pm \infty\) to improve \(f\) according to their own objective. The GDA dynamics are \(x_{t+1} \leftarrow x_t - \eta y_t, \: y_{t+1} \leftarrow y_t + \eta x_t\), and running GDA produces a trajectory that spirals away from \((0,0)\). In short, we get training oscillations/garbage solutions even in 2-agent zero-sum games with convex-concave, low-dimensional objectives, and even when the function is perfectly known. This is not good news for real-world cases where the objective is nonconvex-nonconcave, high-dimensional, and learned from data.
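A minimal simulation of this (my own sketch, not from the talk; the step size `eta` and iteration counts are illustrative assumptions) showing the GDA iterates on \(f(x,y)=x\cdot y\) spiralling away from the equilibrium at the origin:

```python
import numpy as np

def gda(x0, y0, eta=0.1, steps=100):
    """Simultaneous gradient descent-ascent on f(x, y) = x * y:
    the x-player descends grad_x f = y, the y-player ascends grad_y f = x."""
    x, y = x0, y0
    traj = [(x, y)]
    for _ in range(steps):
        x, y = x - eta * y, y + eta * x  # simultaneous update
        traj.append((x, y))
    return np.array(traj)

traj = gda(x0=1.0, y0=1.0)
# The squared distance from the equilibrium (0, 0) grows by a factor
# of exactly (1 + eta^2) per step, so the trajectory spirals outward:
print(np.linalg.norm(traj, axis=1)[[0, 25, 50, 100]])
```

A short calculation confirms the spiral: \(\Vert(x_{t+1},y_{t+1})\Vert^2 = (x_t-\eta y_t)^2+(y_t+\eta x_t)^2 = (1+\eta^2)\Vert(x_t,y_t)\Vert^2\), so the distance from the origin grows at every step for any \(\eta>0\).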
The focus from here on is the comparison of \(\min_x f(x),\ x \in S\) and \(\min_x \max_y f(x,y),\ (x,y) \in S\), where \(f\) is Lipschitz and L-smooth (i.e. \(\nabla f\) is L-Lipschitz), the constraint set \(S\) is convex and compact, and for the min-max problem \(f\) is convex in \(x\) and concave in \(y\). First-order methods can find minima and min-max equilibria in a number of steps/queries to \(f\) or \(\nabla f\) that is polynomial in \(1/\epsilon, L, \text{diam}(S)\), where a minimum is defined as \(x^*\) such that \(f(x^*)\leq f(x)+\epsilon, \forall x \in S\), and a min-max equilibrium as \((x^*,y^*)\) such that \(f(x^*,y)-\epsilon \leq f(x^*,y^*), \forall y : (x^*,y)\in S\: \& \: f(x^*,y^*)\leq f(x,y^*) + \epsilon, \forall x:(x,y^*)\in S\). So in this regime the training oscillations of min-max optimization are a feature of the particular optimization procedure, not a byproduct of some computational intractability inherent to the problem.
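To make the "polynomially many first-order queries" claim concrete in the plain minimization case, here is a hedged sketch (my own illustration, not from the talk; the quadratic objective, unit-ball constraint, and 1/L step size are illustrative assumptions) of projected gradient descent over a compact convex set:

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def projected_gd(grad_f, x0, L, steps):
    """Projected gradient descent with the standard 1/L step size."""
    x = x0
    for _ in range(steps):
        x = project_ball(x - grad_f(x) / L)
    return x

# Example: f(x) = ||x - c||^2 is L-smooth with L = 2; its minimum over
# the unit ball is the projection of c onto the ball, here [1, 0].
c = np.array([2.0, 0.0])
x_eps = projected_gd(grad_f=lambda x: 2.0 * (x - c),
                     x0=np.zeros(2), L=2.0, steps=100)
print(x_eps)  # approaches [1, 0], an epsilon-minimum over the ball
```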
When \(f\) is nonconvex (resp. not convex-concave), finding global minima (resp. global min-max equilibria) is NP-hard. Local minima and local min-max equilibria can be defined by only requiring the defining inequalities to hold for points within a \(\delta\)-ball around the candidate solution. For local minimization, \(\delta \leq \sqrt{2\epsilon/L}\) is enough for first-order methods to find an \((\epsilon,\delta)\)-local minimum; for larger \(\delta\), existence still holds but finding one becomes NP-hard. For local min-max, existence holds for \(\delta \leq \sqrt{2\epsilon/L}\), but the complexity of finding solutions was unknown (resolved by the theorem below). So the training oscillations here could be due to computational intractability of the problem itself.
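A quick sanity check on the \(\delta \leq \sqrt{2\epsilon/L}\) threshold (my own one-line derivation from L-smoothness; the notes state the threshold without proof): if \(\nabla f(x^*) = 0\), then for all \(x\) with \(\|x - x^*\| \leq \delta\),
\[ f(x) \geq f(x^*) + \langle \nabla f(x^*), x - x^*\rangle - \frac{L}{2}\|x - x^*\|^2 = f(x^*) - \frac{L}{2}\|x - x^*\|^2 \geq f(x^*) - \epsilon, \]
where the last inequality uses \(\delta \leq \sqrt{2\epsilon/L}\). So every exact stationary point is automatically an \((\epsilon,\delta)\)-local minimum in this regime, which is why first-order methods that find (approximate) stationary points suffice.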
From this motivation, recall the minimax theorem of von Neumann: given \(X,Y\) compact and convex subsets of \(n\)- and \(m\)-dimensional Euclidean space respectively, and \(f:X\times Y \to \mathbb{R}\) a continuous, convex-concave function, \(\min_{x\in X}\max_{y\in Y}f(x,y)=\max_{y\in Y}\min_{x\in X}f(x,y)\), and the min-max optimal point \((x,y)\) is essentially unique. Min-max points are equilibria of zero-sum games where the min player pays the max player \(f(x,y)\). The minimax theorem is equivalent to strong convex programming duality; in fact, min-max solutions can be found via convex programming and vice-versa (von Neumann, Dantzig 1947 and Adler IJGT'13).

The connection to online learning comes from (Blackwell'56, Hannan'57, etc.): if over multiple rounds \(t=1,\cdots,T\) the min player updates their strategy \(x_t\) using any no-regret learning algorithm, and simultaneously the max player updates their strategy \(y_t\) using any no-regret learning algorithm, then their joint behaviour \((x_t,y_t)_t\) will "converge to equilibrium", where convergence means that the average of the trajectory approaches the equilibrium, i.e. \(\frac{1}{T}\sum_{t\in[T]}(x_t,y_t) \to (x^*,y^*)\). (Examples of no-regret learning algorithms: multiplicative-weights-update, follow-the-regularized-leader, follow-the-perturbed-leader, gradient descent.) Recall, however, that even for \(f(x,y)=x\cdot y\) the trajectory itself spirals out, while its average still converges to (0,0). Metaphorically, it is like the moon orbiting the Earth: the average position of the moon converges to the Earth, but the moon itself never does.

From this physics viewpoint, a trajectory spiralling away from the target can be interpreted as dynamics with the wrong momentum, pushing you further away from the target. Daskalakis, Ilyas, Syrgkanis, Zeng ICLR'18 suggest an approach called "optimistic" gradient descent to correct this (the existing literature calls the general idea "optimistic learning methods"). Such momentum-correcting methods achieve asymptotic last-iterate convergence for many algorithms. Fast last-iterate convergence rates in the constrained convex-concave case are unknown and remain an open problem; any positive result (last-iterate or not) when \(f\) is not convex-concave is an even larger open problem.
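A minimal sketch of the optimistic correction (my own illustration under assumptions: the standard "twice the current gradient minus the previous gradient" form of optimistic GDA from the literature, with illustrative step size and iteration count), contrasting the last iterates of GDA and optimistic GDA on \(f(x,y)=x\cdot y\):

```python
import numpy as np

def run(opt, x0=1.0, y0=1.0, eta=0.1, steps=1000):
    """Two-player dynamics on f(x, y) = x * y: the x-player descends
    grad_x f = y and the y-player ascends grad_y f = x."""
    x, y = x0, y0
    px, py = x0, y0  # previous iterate, used by the optimistic update
    for _ in range(steps):
        if opt == "gda":
            # plain simultaneous gradient descent-ascent
            nx, ny = x - eta * y, y + eta * x
        else:
            # optimistic GDA: step with 2*(current grad) - (previous grad)
            nx = x - eta * (2 * y - py)
            ny = y + eta * (2 * x - px)
        px, py = x, y
        x, y = nx, ny
    return np.hypot(x, y)  # distance of the last iterate from (0, 0)

print("GDA  last-iterate distance:", run("gda"))   # grows unboundedly
print("OGDA last-iterate distance:", run("ogda"))  # decays toward 0
```

The optimistic term cancels the outward-pushing momentum, so the last iterate itself contracts toward the equilibrium rather than only the time-average.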
A modern theorem (Daskalakis-Skoulakis-Zampetakis'20): computing \((\epsilon,\delta)\) min-max equilibria for \(\delta \leq \sqrt{2\epsilon/L}\) is PPAD-complete, with the corollary that any algorithm (first-order, second-order, etc.) takes super-polynomial time unless P=PPAD. PPAD lies between P and NP, and characterizes problems in topology and game theory such as computing Brouwer fixed points of Lipschitz functions and Nash equilibria in general-sum normal-form games, both of which are PPAD-complete, i.e. as hard as any problem in PPAD.
The speaker draws a (personal) philosophical corollary: we cannot base multi-agent deep learning on the current paradigm (optimization of some objective function with gradient descent or its variants), since gradient descent will run into an intrinsic intractability barrier; i.e. minimizing a single objective with gradient descent is not the right optimization formulation when we have two or more agents with different objectives. Instead, we have to do more work in modelling the setting, choosing the agent model, deciding on meaningful optimization objectives and solutions, and designing the learning/optimization algorithm. And to do this, we will need some form of interaction with domain experts.
In conclusion:
- Min-max optimization and equilibrium computation are intimately related to the foundations of game theory, mathematical programming and online learning theory, and have profound applications in statistics, complexity theory and many other fields.
- Applications in ML pose challenges due to the dimensionality and non-convexity of the problems (as well as the entanglement of decisions with learning).
- There are intractability results for finding first-order local min-max solutions/fixed points of GDA dynamics, and there are many important open problems as well, e.g. identifying special cases whose structure enables fast convergence to min-max equilibria, as in Daskalakis-Foster-Golowich 2021.