Inverse Game Theory for Stackelberg Games: The Blessing of Bounded Rationality - 2022
Details
Title : Inverse Game Theory for Stackelberg Games: The Blessing of Bounded Rationality
Author(s) : Wu, Jibang and Shen, Weiran and Fang, Fei and Xu, Haifeng
Link(s) : https://www.youtube.com/watch?v=yKWBpUHA2eo
Rough Notes
Making notes to learn more about Stackelberg games. The speaker has apparently helped increase Baidu's revenue by 5%.
The work here is about learning the payoff matrix of a boundedly rational player whose equilibrium responses are observed.
Game theory: Given the game (e.g. the payoff matrix), predict players' behaviours/responses. However, the observed behaviour is only a proxy for what we are often really interested in, such as customer preferences or the utility of some hacker.
Inverse game theory: Given equilibrium behaviours, what game parameters can induce such behaviours (with respect to some solution concept)?
Inverse Stackelberg game:
- Two agents: a leader (who commits to a strategy according to their payoff) and a follower (who responds to the leader taking into account their own payoff). E.g. the leader could be a defender protecting a security network, and the follower could be hacker(s).
In this work, the leader is assumed to be able to choose any mixed strategy \(x\), and the follower uses a quantal response, i.e. the probability of action \(j\) is \(y_j = e^{\lambda x^T V_j} / \sum_{k} e^{\lambda x^T V_k}\), where \(V_j\) is the follower's payoff vector for action \(j\) (indexed by the leader's actions). This is meant to capture the follower's bounded rationality \((\lambda > 0)\).
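As a minimal numerical sketch of the quantal response (my own illustration, not from the talk; the payoff matrix `V`, leader strategy `x`, and `lam` below are made-up values):

```python
import numpy as np

def quantal_response(x, V, lam):
    """Follower's quantal (logit) response to the leader's mixed strategy x.

    x   : leader mixed strategy, shape (m,)  (m leader actions)
    V   : follower payoff matrix, shape (m, n); column j is V_j
    lam : rationality parameter lambda > 0 (larger = closer to best response)
    """
    scores = lam * (x @ V)      # lambda * x^T V_j for each follower action j
    scores -= scores.max()      # numerical stability; the softmax is shift-invariant
    expd = np.exp(scores)
    return expd / expd.sum()    # y_j = e^{lambda x^T V_j} / sum_k e^{lambda x^T V_k}

# Made-up example: 2 leader actions, 3 follower actions
V = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])
x = np.array([0.6, 0.4])
print(quantal_response(x, V, lam=2.0))
```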
The research question is: Can the leader recover \(V\) by querying the follower's response to chosen (mixed) strategies \(x\)?
In the quantal response model, computing the optimal leader strategy is difficult but recovering the follower payoff is easy. In the best response model (the follower's strategy is deterministic), computing the optimal leader strategy is easy but recovering the follower payoff is difficult.
In the quantal response model, the follower's strategy is invariant to row-wise translations of \(V\) by a constant, which introduces an identifiability issue. The logit distance is used so that \(V\) can be recovered uniquely.
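To see why (my own expansion of the point above): adding a constant \(c_i\) to every entry of row \(i\) of \(V\) shifts each score \(x^T V_j\) by the same amount \(x^T c\), so the quantal response is unchanged:
\[
\frac{e^{\lambda x^T (V_j + c)}}{\sum_k e^{\lambda x^T (V_k + c)}}
= \frac{e^{\lambda x^T V_j}\, e^{\lambda x^T c}}{\sum_k e^{\lambda x^T V_k}\, e^{\lambda x^T c}}
= \frac{e^{\lambda x^T V_j}}{\sum_k e^{\lambda x^T V_k}} = y_j .
\]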
\(m\) linearly independent queries are enough to recover \(V\), where each query \(x\) returns a mixed strategy \(y\). This is done by minimizing the cross entropy between the estimated mixed strategy \(\tilde{y}\) (computed with the current estimate \(\tilde{V}\)) and the observed \(y\).
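A hedged sketch of this recovery step under exact observations (my own reconstruction, not the authors' code): fit \(\tilde{V}\) by gradient descent on the cross entropy between the quantal response it induces and the observed mixed strategies. The game size, queries, learning rate and iteration count below are made up.

```python
import numpy as np

def quantal_response(x, V, lam):
    s = lam * (x @ V)
    s -= s.max()
    e = np.exp(s)
    return e / e.sum()

def recover_V(queries, responses, lam, m, n, lr=0.5, iters=5000):
    """Estimate the follower payoff matrix by minimizing the cross entropy
    between the quantal response induced by V_tilde and the observed mixed
    strategies. V is only identified up to row-wise translations."""
    V_tilde = np.zeros((m, n))
    for _ in range(iters):
        grad = np.zeros_like(V_tilde)
        for x, y in zip(queries, responses):
            y_tilde = quantal_response(x, V_tilde, lam)
            # gradient of the softmax cross entropy w.r.t. V_tilde
            grad += lam * np.outer(x, y_tilde - y)
        V_tilde -= lr * grad / len(queries)
    return V_tilde

# Made-up ground truth; m = 2 linearly independent queries for a 2 x 3 game
rng = np.random.default_rng(0)
V_true = rng.normal(size=(2, 3))
lam = 2.0
queries = [np.array([0.7, 0.3]), np.array([0.2, 0.8])]
responses = [quantal_response(x, V_true, lam) for x in queries]
V_hat = recover_V(queries, responses, lam, m=2, n=3)

# Compare up to the row-wise translation ambiguity: column differences should agree
print((V_true - V_true[:, [0]]).round(3))
print((V_hat - V_hat[:, [0]]).round(3))
```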
More complicated setting: we observe a sampled action instead of the whole mixed strategy \(y\). One idea is to use MLE to find the \(\tilde{V}\) that maximizes the probability of the observed actions; however, this is difficult both to optimize and to bound the error of. A better idea is to query the same point \(x\) multiple times to estimate the mixed strategy, then use the cross-entropy optimization approach mentioned above. They give a PAC-style bound on how close the learnt mixed strategy is to the true one, and another bound on how closely \(V\) can be recovered given the approximate follower strategy.
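For the sampled-action setting, a small illustrative sketch (again not the paper's exact procedure): query the same \(x\) many times, estimate \(y\) by empirical frequencies, and then feed the estimate into the cross-entropy recovery above. The `sample_action` follower below is a made-up stand-in for the real query access.

```python
import numpy as np

def empirical_mixed_strategy(sample_action, x, num_samples, n):
    """Query the same leader strategy x repeatedly and estimate the follower's
    mixed strategy by empirical action frequencies (only sampled actions are seen)."""
    counts = np.zeros(n)
    for _ in range(num_samples):
        counts[sample_action(x)] += 1
    return counts / num_samples

# Made-up follower: samples actions from the quantal response of a hidden payoff matrix
rng = np.random.default_rng(1)
V_hidden = rng.normal(size=(2, 3))
lam = 2.0

def sample_action(x):
    s = lam * (x @ V_hidden)
    y = np.exp(s - s.max())
    y /= y.sum()
    return rng.choice(3, p=y)

x = np.array([0.7, 0.3])
y_hat = empirical_mixed_strategy(sample_action, x, num_samples=10_000, n=3)
print(y_hat)  # feed such estimates into the cross-entropy recovery sketched above
```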