Jakob Foerster - Zero-Shot (Human-AI) Coordination (in Hanabi) and Ridge Rider - 2021
Details
Title: Jakob Foerster - Zero-Shot (Human-AI) Coordination (in Hanabi) and Ridge Rider
Author(s): DeepMind ELLIS UCL CSML Seminar Series
Link(s): https://www.youtube.com/watch?v=Sy2Z7alDgAE
Rough Notes
Jakob's research interests related to this work include Theory of Mind (ToM).
The human-AI collaboration problem here is modelled using the Decentralized Partially Observable Markov Decision Process (Dec-POMDP) framework - all agents share the same reward.
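As a rough illustration (my own sketch, not from the talk), here is a minimal hypothetical two-agent Dec-POMDP interface: each agent receives its own partial observation of the hidden state, while all agents receive the same shared reward.

```python
from dataclasses import dataclass
from typing import List
import random

@dataclass
class DecPOMDPStep:
    observations: List[int]  # one private (partial) observation per agent
    reward: float            # a single reward shared by all agents
    done: bool

class ToyDecPOMDP:
    """Hypothetical two-agent Dec-POMDP: each agent sees only part of the state."""

    def __init__(self, horizon: int = 10):
        self.horizon = horizon

    def reset(self) -> DecPOMDPStep:
        self.t = 0
        self.state = random.randint(0, 3)
        return self._result(reward=0.0)

    def step(self, actions: List[int]) -> DecPOMDPStep:
        self.t += 1
        # Shared reward: paid iff the joint action matches the hidden state.
        reward = float(sum(actions) % 4 == self.state)
        self.state = random.randint(0, 3)
        return self._result(reward)

    def _result(self, reward: float) -> DecPOMDPStep:
        # Agent 0 observes the low bit of the state, agent 1 the high bit.
        obs = [self.state % 2, self.state // 2]
        return DecPOMDPStep(obs, reward, self.t >= self.horizon)
```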
Hanabi is a cooperative card game that involves ToM reasoning. Policies learned via self-play perform well only when paired with agents from the same training run; they break down when paired with independently trained partners.
Zero-Shot Coordination (ZSC): train two teams separately under a training strategy agreed on beforehand, then test with cross-play, where the test-time team mixes agents from the two independently trained teams.
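A minimal sketch of the cross-play protocol, with hypothetical `train_team` and `evaluate` functions standing in for actual training runs and the shared return \(J\); the point is that independent runs can converge to different arbitrary conventions, so cross-play scores drop.

```python
import random

def train_team(seed: int):
    """Hypothetical stand-in for an independent self-play training run."""
    random.seed(seed)
    # Each run may land on a different arbitrary convention.
    return {"convention": random.choice(["red-means-play", "red-means-discard"])}

def evaluate(policy_a, policy_b) -> float:
    """Hypothetical stand-in for the shared return J(pi_a, pi_b)."""
    return 25.0 if policy_a["convention"] == policy_b["convention"] else 5.0

team_1 = train_team(seed=0)
team_2 = train_team(seed=1)

self_play_score = evaluate(team_1, team_1)   # within-team: high
cross_play_score = evaluate(team_1, team_2)  # mixed team: often much lower
print(self_play_score, cross_play_score)
```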
Key idea: Symmetries - mappings that leave the Dec-POMDP unchanged. Viewing the Dec-POMDP as a tree, a symmetry maps states, observations, and actions to new labels such that the tree is unchanged. Symmetries can also be applied to policies, and they are the key to learning zero-shot coordination policies, moving from self-play to other-play.
Self-play: maximize \(J(\pi, \pi)\) over \(\pi\). Other-play: maximize \(\mathbb{E}_{\phi \sim p(\Phi)}[J(\pi, \phi(\pi))]\) over \(\pi\), where \(p(\Phi)\) is the uniform distribution over the symmetry group \(\Phi\).
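A sketch of both objectives in Python, assuming a hypothetical finite symmetry group and an `apply_symmetry` relabeling function supplied by the caller; the expectation over the uniform \(p(\Phi)\) is estimated by Monte Carlo sampling.

```python
import random

def self_play_objective(J, pi):
    # J(pi, pi): both players use the same policy.
    return J(pi, pi)

def other_play_objective(J, pi, symmetries, apply_symmetry, num_samples=100):
    # Monte Carlo estimate of E_{phi ~ p(Phi)} [ J(pi, phi(pi)) ],
    # with p(Phi) uniform over the (finite) symmetry group.
    total = 0.0
    for _ in range(num_samples):
        phi = random.choice(symmetries)          # phi ~ Uniform(Phi)
        total += J(pi, apply_symmetry(phi, pi))  # partner plays the relabeled policy
    return total / num_samples
```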
In Hanabi, each symmetry is a permutation of the colors.
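For concreteness, a minimal sketch of applying one such color permutation to a Hanabi-style observation; the (color, rank) encoding here is hypothetical, not the talk's actual representation.

```python
import itertools

COLORS = ["red", "green", "blue", "white", "yellow"]

# The symmetry group over Hanabi colors: all 5! = 120 relabelings.
SYMMETRIES = list(itertools.permutations(range(len(COLORS))))

def permute_colors(phi, observation):
    """Relabel every (color, rank) card under the permutation phi."""
    return [(phi[color], rank) for color, rank in observation]

obs = [(0, 1), (2, 5)]             # hypothetical encoding: a red 1 and a blue 5
phi = SYMMETRIES[-1]               # one element of the group: reverse the colors
print(permute_colors(phi, obs))    # -> [(4, 1), (2, 5)]: same game tree, new labels
```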
Using the insight that symmetries lead to repeated eigenvalues of the Hessian, they also introduce a method for ZSC based on the Ridge Rider optimization algorithm, which finds diverse solutions by following eigenvectors of the Hessian ("ridges").
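As a rough illustration (my own simplification with finite-difference derivatives, not the authors' implementation), the sketch below branches at a saddle point along each Hessian eigenvector with a negative eigenvalue and follows that ridge until its curvature turns non-negative; the toy loss's symmetry induces the repeated eigenvalue mentioned above, so the branched ridges are equivalent.

```python
import numpy as np

def grad(f, x, eps=1e-5):
    """Finite-difference gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

def hessian(f, x, eps=1e-4):
    """Finite-difference Hessian of f at x (symmetrized)."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros_like(x); e[i] = eps
        H[:, i] = (grad(f, x + e) - grad(f, x - e)) / (2 * eps)
    return 0.5 * (H + H.T)

def ridge_rider(f, x0, alpha=0.05, steps=100):
    """Branch on each negative-curvature eigenvector of the Hessian at the
    saddle x0, then follow that 'ridge' by repeatedly re-diagonalizing the
    Hessian and stepping along the matching eigenvector."""
    eigvals, eigvecs = np.linalg.eigh(hessian(f, x0))
    endpoints = []
    for lam, v in zip(eigvals, eigvecs.T):
        if lam >= 0:
            continue                                    # only descend along ridges
        x, ridge = x0 + alpha * v, v
        for _ in range(steps):
            vals, vecs = np.linalg.eigh(hessian(f, x))
            i = int(np.argmax(np.abs(vecs.T @ ridge)))  # stay on the same ridge
            if vals[i] >= 0:
                break                                   # this ridge has bottomed out
            ridge = vecs[:, i] if vecs[:, i] @ ridge >= 0 else -vecs[:, i]
            x = x + alpha * ridge
        for _ in range(200):
            x = x - 0.05 * grad(f, x)                   # finish with gradient descent
        endpoints.append(x)
    return endpoints

# Toy loss with a rotational symmetry: at the saddle (0, 0) the Hessian has a
# repeated negative eigenvalue, so the branched ridges are equivalent.
f = lambda x: (x[0] ** 2 + x[1] ** 2 - 1) ** 2
print(ridge_rider(f, np.array([0.0, 0.0])))
```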