No-Regret Learning in Extensive Form Games

Details

Title : No-Regret Learning in Extensive Form Games Author(s): Amy Greenwald

Rough Notes

AI performs well in games of perfect information, e.g. Go, and more recently in games of imperfect information, e.g. poker, largely thanks to an algorithm called Counterfactual Regret Minimization (CFR), which minimizes regret in Extensive-Form Games (EFGs). This talk is about extending these results from zero-sum games like poker to non-zero-sum games. Converting an EFG (which is in tree form) to its Normal-Form Game (NFG) representation (via payoff matrices) can lead to an exponential blowup.

The learning model consists of a learner \(i\) whose current policy is \(\pi_i^t\). This policy, alongside the other players' policies \(\pi_{-i}^t\), is fed to the game, which outputs utilities \(u_k(\cdot,\pi_{-k}^t)\) for every player \(k\). From the learner's perspective, whatever the other players are doing, together with the game itself, constitutes the environment.
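As a rough illustration of this loop, here is a minimal Python sketch; the names UniformLearner, run, and utility_for_round are mine, not from the talk.

#+begin_src python
# Minimal sketch of the learning loop above. UniformLearner, run, and
# utility_for_round are illustrative names, not from the talk.
import numpy as np

class UniformLearner:
    """Placeholder learner that always plays the uniform policy over n actions."""
    def __init__(self, n):
        self.n = n

    def policy(self):
        return np.full(self.n, 1.0 / self.n)    # pi_i^t

    def observe(self, utilities):
        pass                                    # a real learner would update here

def run(learner, utility_for_round, T):
    """Each round t the learner commits to pi_i^t; the environment (the game
    plus the other players' pi_{-i}^t) returns the utility of each action."""
    history = []
    for t in range(T):
        pi_t = learner.policy()
        u_t = utility_for_round(t)              # vector u_i(., pi_{-i}^t)
        learner.observe(u_t)
        history.append((pi_t, u_t))
    return history

# Example: 3 actions, random utilities standing in for the environment.
rng = np.random.default_rng(0)
history = run(UniformLearner(3), lambda t: rng.uniform(size=3), T=10)
#+end_src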

Minimizing regret, or being rational in hindsight, means achieving \(\frac{1}{T}\sum_{t=1}^T u_i(\pi_i^t,\pi_{-i}^t)\geq \max_{\phi \in \Phi}\frac{1}{T}\sum_{t=1}^T u_i(\phi(\pi_i^t),\pi_{-i}^t) - o(1)\), where \(\phi\) is a deviation from the learner's policy, which amounts to doing something other than its policy at time \(t\), \(\Phi\) is a class of deviations, and \(o(1)\) is some leeway that vanishes as \(T\) grows.

No-external-regret learning means competing with the best fixed action in hindsight, i.e. the best path through the whole tree. No-internal-regret learning means that, when playing an action, the player does not wish they had played another action: instead of comparing against all fixed actions, the player conditions on what they were actually doing, and in those moments does not wish they had been doing something else.
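A small sketch of how these two quantities can be computed from a play history, assuming full feedback (the whole utility vector \(u_t\) is observed each round) and expected utilities; the function names are mine.

#+begin_src python
# Sketch of external vs. internal regret from a play history, assuming full
# feedback: policies[t] = pi_t, utilities[t] = u_t (utility of each action).
import numpy as np

def external_regret(policies, utilities):
    """max over fixed actions a of (1/T) sum_t [u_t(a) - <pi_t, u_t>]."""
    pi, u = np.asarray(policies), np.asarray(utilities)
    realized = np.sum(pi * u, axis=1)          # <pi_t, u_t> each round
    return u.mean(axis=0).max() - realized.mean()

def internal_regret(policies, utilities):
    """max over action pairs (a, b) of (1/T) sum_t pi_t(a) * (u_t(b) - u_t(a)):
    the gain from having switched to b whenever the policy played a."""
    pi, u = np.asarray(policies), np.asarray(utilities)
    T = pi.shape[0]
    cross = (pi.T @ u) / T                     # cross[a, b] = (1/T) sum_t pi_t(a) u_t(b)
    own = (pi * u).mean(axis=0)                # own[a]      = (1/T) sum_t pi_t(a) u_t(a)
    return (cross - own[:, None]).max()

rng = np.random.default_rng(1)
pi = rng.dirichlet(np.ones(3), size=100)       # 100 rounds, 3 actions
u = rng.uniform(size=(100, 3))
print(external_regret(pi, u), internal_regret(pi, u))
#+end_src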

In NFGs, both no-external-regret and no-internal-regret learning algorithms exist; the former converge to minimax equilibrium in zero-sum NFGs, and the latter converge to correlated equilibrium in non-zero-sum NFGs.

Considering kinds of deviations other than the internal and external ones led to the definition of no-\(\Phi\)-regret learning, which converges to the set of \(\Phi\)-equilibria for every \(\Phi\). No-internal-regret learning is the strongest form of no-\(\Phi\)-regret learning.

In (, ) the authors present a no-external-regret minimizing algorithm for zero-sum EFGs, called Counterfactual Regret Minimization.

Some other deviations in EFGs include behavioural deviations, which define Behavioural Correlated Equilibria (BCE), and causal deviations, which are behavioural deviations assuming reduced strategies and define (causal) Extensive-Form Correlated Equilibria (EFCE). These are studied mainly because we cannot efficiently minimize internal regret. In (, a) the authors develop a learning algorithm that minimizes causal regret and hence converges to the set of (causal) EFCE (work that received the NeurIPS 2020 best paper award).

Recent developments in no-regret learning for non-zero-sum EFGs include (, a), from which we know that in this setting:

  • No-internal-regret learning converges to correlated equilibrium.
  • No-external-regret learning converges to coarse correlated equilibrium.
  • No-internal and no-external regret can be defined along one continuum, no-\(\Phi\)-regret.
  • For certain choices of \(\Phi\), no-\(\Phi\)-regret learning algorithms exist, called Extensive-Form Regret Minimization (EFR).
  • Choosing \(\Phi\) to be the set of counterfactual deviations and causal deviations, EFR generalizes CFR and ICFR (Internal CFR) respectively.

A correlated equilibrium (CE) is a joint probability distribution \(D\) over the set of action profiles \(A\) such that for all \(i \in \text{Players}\) and \(a_i, a_i' \in A_i\), we have \[\mathbb{E}_{(a_i,a_{-i})\sim D}[u_i(a_i,a_{-i}) \mid a_i]\geq \mathbb{E}_{(a_i,a_{-i})\sim D}[u_i(a'_i,a_{-i}) \mid a_i]\] (#DOUBT That is, when each player receives a recommended action, then, conditioned on seeing that recommendation, the player can do a Bayesian update to infer what the opponent(s) would do, and finds it at least as good to play the recommended action as any other action).

A coarse correlated equilibrium (CCE) is a joint probability distribution \(D\) over the set of action profiles \(A\) such that for all \(i \in \text{Players}\) and \(a_i' \in A_i\), we have \[\mathbb{E}_{(a_i,a_{-i})\sim D}[u_i(a_i,a_{-i})]\geq \mathbb{E}_{(a_i,a_{-i})\sim D}[u_i(a'_i,a_{-i})]\] Here the deviation to \(a_i'\) is chosen before seeing the recommendation, so the expectation is unconditional.
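As a concrete illustration of the two definitions, here is a sketch (my own, for two-player games given as payoff matrices) that checks the CE and CCE conditions for a candidate joint distribution; the Chicken payoffs in the usage example are the standard textbook ones.

#+begin_src python
# Sketch: check the CE and CCE conditions for a two-player game with payoff
# matrices U1, U2 (rows = player 1's actions, columns = player 2's) and a
# candidate joint distribution D over action profiles. Names are mine.
import numpy as np

def is_correlated_eq(D, U1, U2, tol=1e-9):
    """CE: no player gains by deviating from any recommended action, after
    conditioning on that recommendation."""
    n1, n2 = D.shape
    for a1 in range(n1):
        for b1 in range(n1):
            if np.sum(D[a1, :] * (U1[b1, :] - U1[a1, :])) > tol:
                return False
    for a2 in range(n2):
        for b2 in range(n2):
            if np.sum(D[:, a2] * (U2[:, b2] - U2[:, a2])) > tol:
                return False
    return True

def is_coarse_correlated_eq(D, U1, U2, tol=1e-9):
    """CCE: deviations are chosen before seeing the recommendation, so the
    comparison is against the unconditional expectation under D."""
    marg1, marg2 = D.sum(axis=1), D.sum(axis=0)
    return (np.max(U1 @ marg2) <= np.sum(D * U1) + tol and
            np.max(marg1 @ U2) <= np.sum(D * U2) + tol)

# Chicken (Dare = 0, Chicken = 1): mixing equally over (D, C) and (C, D)
# is a CE, and therefore also a CCE.
U1 = np.array([[0, 7], [2, 6]])
U2 = np.array([[0, 2], [7, 6]])
D = np.array([[0.0, 0.5], [0.5, 0.0]])
print(is_correlated_eq(D, U1, U2), is_coarse_correlated_eq(D, U1, U2))  # True True
#+end_src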

A deviation is a function \(\phi : A \to A\). \(\Phi_{\text{SWAP}}\) is the set of all \(n^n\) action transformations, where \(n\) is the number of actions. Deviations can be thought of as stochastic matrices acting on the policy: with \(n\) actions, the external deviation to action \(k\) is the matrix \(A \in \mathbb{R}^{n \times n}\) such that \(A\pi = e_k\) for all \(\pi \in \Delta^{n-1}\), and the internal deviation from \(k\) to \(k+1\) is the matrix \(A \in \mathbb{R}^{n\times n}\) such that \(A\pi = [\pi_1,\cdots,0,\pi_k+\pi_{k+1},\cdots,\pi_n]^T\) for all \(\pi \in \Delta^{n-1}\), where the \(0\) sits in position \(k\) and \(\pi_k+\pi_{k+1}\) in position \(k+1\).
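A small sketch of these two matrices in code (the helper names are mine):

#+begin_src python
# Sketch of the two deviation matrices above (helper names are mine).
import numpy as np

def external_deviation(n, k):
    """A with A @ pi = e_k for every pi in the simplex: all mass goes to k."""
    A = np.zeros((n, n))
    A[k, :] = 1.0
    return A

def internal_deviation(n, k, j):
    """A that moves all probability from action k onto action j and leaves
    every other action untouched."""
    A = np.eye(n)
    A[k, k] = 0.0
    A[j, k] = 1.0
    return A

pi = np.array([0.2, 0.5, 0.3])
print(external_deviation(3, 0) @ pi)     # e_0: all mass on action 0
print(internal_deviation(3, 1, 2) @ pi)  # [0.2, 0.0, 0.8]: action 1's mass moved to 2
#+end_src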

One example of a regret-minimizing algorithm here is the Regret Matching algorithm.
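A minimal sketch of Regret Matching (Hart and Mas-Colell) in the full-feedback, expected-utility setting; the class and the usage example are mine.

#+begin_src python
# Sketch of Regret Matching (Hart and Mas-Colell) with full feedback.
import numpy as np

class RegretMatching:
    def __init__(self, n):
        self.n = n
        self.cum_regret = np.zeros(n)            # cumulative regret per action

    def policy(self):
        """Play proportionally to positive cumulative regret; uniform if none."""
        pos = np.maximum(self.cum_regret, 0.0)
        return pos / pos.sum() if pos.sum() > 0 else np.full(self.n, 1.0 / self.n)

    def observe(self, utilities):
        """utilities[a] = payoff action a would have earned this round."""
        expected = self.policy() @ utilities
        self.cum_regret += utilities - expected  # regret of not having played each a

# Usage: rock-paper-scissors against a fixed opponent mix (0 = rock, 1 = paper,
# 2 = scissors); the policy converges to paper, the best response to this mix.
U = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])  # row player's payoffs
opponent = np.array([0.5, 0.25, 0.25])
rm = RegretMatching(3)
for t in range(5000):
    rm.observe(U @ opponent)                     # utility of each row action
print(rm.policy())                               # ~[0, 1, 0]
#+end_src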

One more ingredient for the EFG setting is the concept of time-selection regret minimization. Here, we can think of \(m\) time-selection functions (experts), each assigning a weight \(w^t\) to every round \(t\) and collected in a set \(W\), and the objective is \(\forall w \in W, \ \frac{1}{T}\sum_{t=1}^T w^t u_i(\pi_i^t,\pi_{-i}^t)\geq \max_{\phi \in \Phi}\frac{1}{T}\sum_{t=1}^T w^t u_i(\phi(\pi_i^t),\pi_{-i}^t) - o(1)\).
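A sketch of measuring time-selection regret, restricted here to external deviations (fixed actions) for simplicity; representing \(W\) as an \((m, T)\) weight matrix is my own choice.

#+begin_src python
# Sketch of time-selection regret restricted to external deviations (fixed
# actions). W is an (m, T) matrix of per-round weights w^t, one row per
# time-selection function.
import numpy as np

def time_selection_external_regret(policies, utilities, W):
    """For each w in W: max_a (1/T) sum_t w^t * [u_t(a) - <pi_t, u_t>]."""
    pi, u, W = np.asarray(policies), np.asarray(utilities), np.asarray(W)
    T = pi.shape[0]
    realized = np.sum(pi * u, axis=1)                  # <pi_t, u_t>
    benefit = (W @ u - (W @ realized)[:, None]) / T    # shape (m, n)
    return benefit.max(axis=1)                         # one regret value per w

rng = np.random.default_rng(2)
pi = rng.dirichlet(np.ones(3), size=50)
u = rng.uniform(size=(50, 3))
W = np.vstack([np.ones(50),                            # all rounds equally
               (np.arange(50) >= 25).astype(float)])   # only the later rounds
print(time_selection_external_regret(pi, u, W))
#+end_src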

One formalism for reasoning about a single player in EFGs is the Partially Observable History Process (POHP) (, a).

Open problems:

  • Can EFR help us solve Stratego, Hanabi, Diplomacy?
  • How can we achieve performance at scale?
  • What is the largest class of EFG deviations for which we can efficiently learn correlated equilibria?
