Is There a Framework for Deep Learning in Multi-Agent Settings? - 2022

Details

Title : Is There a Framework for Deep Learning in Multi-Agent Settings? Author(s): Simons Institute Link(s) : https://www.youtube.com/watch?v=kJlQBxsj4ys

Rough Notes

Recent Machine Learning (ML) breakthroughs from an optimization viewpoint involves the objective $\min_{x \in \mathbb{R}^d} l(x)$ with:

Very high $d$.
Loss function $l$ being non-convex.
Loss function only accessible through $l(x),\nabla l(x)$ queries.
Loss function being Lipschitz, differentiable, smooth ie. has Lipshitz gradient.

Gradient Descent (GD) is guaranteed to efficiently compute local minima even with non-convex $l$, and empirically the local minima are good enough. So much so that there is a debate whether scale is all we need. In the multi-agent learning setting, it is very clear that scale is not all we need.

Multi-agent learning happens explicitly in many Reinforcement Learning (RL) applications e.g. self-driving cars, co-operative video games, and implicitly in Deep Generative Models (DGMs) and training classifiers robust to adversarial attacks, where the multi-agent learning formulation helps one obtain a learning goal that is otherwise a single-agent learning goal.

Consider DGMs for example. Their goal is to find a neural network $G_x$ with parameters $x$ such that $G_x(z)$ (where $z$ is from a noise distribution e.g. isotropic Gaussian) produces "interesting samples" from a distribution of interest, for e.g. a distribution over celebrity faces. A game theoretic formulation helps us achieve this learning goal in Generative Adversarial Networks (GANs). The setup consists of a 2-player zero-sum game between a player tuning the parameters $x$ of a generator, and another player tuning the parameters $y$ of another neural network called the discriminator. The payoff function is setup such that at the equilibrium, the generator is generating good images. The generator's goal is to take "boring samples" and create "interesting samples" (samples are often images), and the discriminators goal is to take fake or real samples and distinguish the real and fake samples. For e.g. in Wasserstein GANs the loss function is $l_G(x,y) = -l_D(x,y) = \mathbb{E}_{z\sim P_{real}}[D_y(z)] - \mathbb{E}_{z \sim \mathcal{N}(0,I)}[D_y(G_x(z))]$. The hope is that at the equilbrium, the generator is completely fooling the discriminator. $l_G,l_D$ are non-convex in $x,y$ respectively, which are both very high-dimensional, and such games may have no equilibria. The motivation question for this talk is "What does game theory even predict in this case?"

In practice, the GAN objectives are optimized using Simulataneous GD i.e. \[ x_{t+1} = x_t - \eta \cdot \nabla_x l_G(x_t,y_t)\] \[y_{t+1} = y_t -\eta \cdot \nabla_y l_D(x_t,y_t)\] Empirically this does not work out of the box like how single-agent GD works out of the box. Some problems include mode collapse and mode cycling, and visually bad samples generated on datasets like MNIST (??, ????)

So we ask ourselves if there is a generic optimization framework for multi-agent deep learning, and what are meaningful optimization targets in this setting, and whether they are practically attainable.

Consider applications with multiple agents, each agent takes a high-dimensional decision, and has a loss function that depends on their action and the actions of other agents..It is possible their are constraints on the strategies of all agents (e.g. limited supply constrains maximum sales). Similar to the single-agent setting, the loss functions are non-convex (for the decision of each agent), and is accessible through queries to $l_i(x),\nabla_x l_i(x)$, and are Lipschitz, differentiable and smooth i.e. their gradient is Lipschitz. We call this setting smooth non-convex games. (Non-concave in game theory literature). The challenge is that Nash equilibria do not exist in non-convex games e.g. consider a two-player zero-sum game with loss function $l_1(x_1,x_2)=-l_2(x_1,x_2)=(x_1-x_2)^2$, since if at a pair of points $(x_1^t, x_2^t)$ we have $x_1^t\neq x_2^t$, then the strategy for $x_1$ is $x_1^{t+1}=x_2$ and if $x_1^t=x_2^t$ then $x_2^{t+1} = c \neq x_1^t$ i.e. it goes as far away as possible and thus this simple game has no Nash equilibria.

The question now is, what are meaningful optimization targets in this setting and are they practically attainable? By meaningul we mean that it should exist, and verifiable with the information they have about the loss functions, and by practically attainable we mean efficiently reached via a GD like (or similar light-weight) method.

One weak optimization target we might be interested in is the Local Nash Equilibrium, which is a point (i.e. collection of strategies) $x^* = (x_1^*,\cdots,x_n^*) \in \mathcal{S}$ such that for each player $i, x_i^*$ is a local minimum of $l_i(x_i;x_{-i}^*)$ w.r.t $x_i$, that is, if the other players keep their strategies at $x^*_{-i}$ we allow player $i$ to locally optimize their utility function.

The weakest variant of this is First-Order Local Nash Equilibrium. This means from agent $i$'s viewpoint this means $x_i^*$ is the best response to $x_{-i}^*$ as far as first-order Taylor approximations can tell. If the constraint set $\mathcal{S}$ is convex and compact, a first-order local NE must exist. However the GAN dynamics from before suggest that practically speaking it is a struggle to solve. One theoretical result is that even in two-player zero-sum smooth non-convex games, any methods accessing the $l_i$'s via value and gradient value queries need exponentially many (in dimension and/or $1/ \epsilon$) such queries to compute even and $\epsilon$ approximate first-order local NE. In fact, any method needs super-polynomial-time (in dimension and/or $1/\epsilon$) to compute an $\epsilon$ local NE unless PPAD=P. Here PPAD is the complexity class in between P,NP (assuming P$≠$NP) and problems such as computing approximate Brouwer fixed poins of Lipschitz functions, and mixed NE in normal-form (convex) games are PPAD-complete i.e. they are in PPAD and no easier than any problem in PPAD.

To understand why the jump from single-agent non-convex minimization to two-player zero-sum non-convex games makes things harder, consider the following problem of taking steps in horizontal and vertical steps. The single-agent, at every improvement step, gets a lowe objective value thus making progress towards a local minimum, and if the objective is bounded from below, the value along $\epsilon$ improving path reveals information about the distance from the end of the path i.e. gives memory/information gain (this is the only reason it converges to local minimum in poly($1/\epsilon)$ steps). In the two-player zero-sum case, if one player moves horizontally and the other vertically, we could have better-response paths being cyclic, and querying the objective value along non-cyclic $\epsilon$ better response path does not reveal any information about how far the end of the path is. Intuitively this can be turned to an intractability proof by hiding an exponentially long better-response path within some ambient space such that it is not easy to find local NE (plus many other details).

The philosophical corollary in the speakers opinion is that we cannot base multi-agent DL on a single-agent paradigm, and instead, we need:

A lot more work on modeling the setting.
Choosing the learning model.
Deciding what are meaningful solution concepts.
Designing the learning/optimization algorithm.

The way forward in the speakers opinion involves:

Going beyond worst-case analysis of equilibrium computation algorithms.
Looking into games with special structure.
Non-convex optimization based (as opposed to min-max/fixed point-based) solution concepts.