# Model-Based Multi-agent Reinforcement Learning for AI Assistants - 2023

## Details

- Title: Model-Based Multi-agent Reinforcement Learning for AI Assistants
- Author(s): Çelikok, Mustafa Mert
- Link(s): https://aaltodoc.aalto.fi/handle/123456789/120725

## Rough Notes

Thesis focuses on Model-Based Multi-Agent Reinforcement Learning (MARL) in settings where humans and AI need to collaborate, and the end goal is to augment human intelligence rather than replace it. To do this, the AI must infer a model of its human partner (e.g. their goals). Data scarcity is a key problem in human-AI collaboration, and hence the choice of model space is important. Ideas from cognitive science and behavioural economics are used to address this issue.

Addresses types of tasks where:

- AI should learn human preferences from human feedback.
- AI must teach conceptual knowledge to the human to assist them. (#ASK What does conceptual knowledge mean here).
- AI should infer cognitive bounds and biases of humans to improve their [I assume the human's] decisions.

Thesis structure is as follows: Motivating questions (MQs) (which are open problems) are presented, and solutions are looked into by formulating and investigating specific research questions (RQs).

### MQs and RQs

#### MQ1 : Can we develop practically plausible human-AI interaction protocols which enable better collaboration by allowing the AI to infer a model of the human and augment human intelligence?

- RQ1.1 Can a teacher-learner interaction protocol improve human-AI collaboration - and how can this be implemented computationally?

Relevant work for this considers 2 scenarios - where the human agent is treated as the teacher, and as the learner. The overall framework is machine teaching.

- RQ1.2 Can an AI assistant learn to augment a human's intelligence, when the human has the authority to override its actions?

Relevant work for this proposes a supervisor-agent protocol where the human has the authority to override an action of the AI. This protocol is a superset of the machine teaching framework mentioned in RQ1.1.

#### MQ2 : What does the AI need to assume of the human so that it can infer a model of them?

- RQ2.1 In the teacher-learner interaction protocol, what plausible assumptions about the human are needed?

Relevant work models the human partner as a mixture of 2 behavioural models : a fixed distribution based on human preferences, and a decision-making agent who plans into the future to actively steer the AI.

- RQ2.2 In the supervisor-agent interaction protocol, what plausible assumptions about the human are needed?

Relevant work here models the human as a decision-maker who follows a policy conditioned on their internal state which is maintained using their subjective model of the world.

#### MQ3 : What are the theoretical limitations of the Bayesian MARL for human-AI collaboration?

- RQ3.1 How does the underlying mathematical structure of the task affect the convergence rate of the beliefs of the 2 agents when they model the task differently, and how does this difference in beliefs affect collaboration?

Relevant work here introduces the belief alignment problem: the human and the AI may have different beliefs about the state of the world, and these beliefs may never come close if the 2 agents model the world differently. This can lead to a failure to collaborate.

3 important dimensions are identified for building models of humans:

- Modelling goals of the humans, which may be tacit.
- Modelling humans as bounded-rational.
- Modelling the case where humans may model the AI system and engage in recursive reasoning.

### Model-based approaches for Human-AI collaboration

Given a Markov decision process (MDP), an optimality criterion OPT gives a preference ordering over policies. Combining an MDP with an optimality criterion results in a **Markov decision problem**.

Discounting has uses other than giving convergence in infinite-horizon problems.
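As one concrete instance (an assumption here; these notes do not record which criterion the thesis adopts), the expected discounted return criterion with discount factor \(\gamma \in [0,1)\) ranks policies by their value functions:

\[
V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s\right],
\qquad
\pi \succeq_{\text{OPT}} \pi' \iff V^{\pi}(s) \geq V^{\pi'}(s) \ \text{for all } s.
\]

If rewards are bounded, \(|r(s,a)| \leq r_{\max}\), then \(|V^{\pi}(s)| \leq r_{\max}/(1-\gamma)\), which is the convergence role of discounting in the infinite-horizon case.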

Model-based RL aims to model the state transition dynamics \(p(\cdot|s,a)\) and reward function \(r(s,a)\) of the environment. Some methods blur the line - Deep Q-learning with experience replay can be thought of as using stored state-transition data to learn an empirical model of the transition dynamics and reward function.
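A minimal sketch of what "an empirical model" could mean for a small tabular problem, assuming transitions are stored as `(s, a, r, s_next)` tuples. The function name `fit_empirical_model` and the toy buffer are hypothetical and not taken from the thesis.

```python
from collections import defaultdict

def fit_empirical_model(transitions):
    """Estimate p(s'|s,a) and r(s,a) by counting stored (s, a, r, s_next) tuples."""
    next_counts = defaultdict(lambda: defaultdict(int))
    reward_sums = defaultdict(float)
    visits = defaultdict(int)
    for s, a, r, s_next in transitions:
        next_counts[(s, a)][s_next] += 1
        reward_sums[(s, a)] += r
        visits[(s, a)] += 1
    # Relative frequencies give the maximum-likelihood transition estimate.
    p_hat = {
        sa: {s2: c / visits[sa] for s2, c in nxt.items()}
        for sa, nxt in next_counts.items()
    }
    # Sample means give the reward estimate.
    r_hat = {sa: reward_sums[sa] / visits[sa] for sa in visits}
    return p_hat, r_hat

# Hypothetical replay buffer for illustration only.
replay_buffer = [
    (0, "right", 0.0, 1),
    (0, "right", 0.0, 1),
    (0, "right", 0.0, 0),
    (1, "right", 1.0, 1),
]
p_hat, r_hat = fit_empirical_model(replay_buffer)
```

Here `p_hat[(0, "right")]` assigns relative frequency 2/3 to state 1 and 1/3 to state 0. A point estimate like this carries no notion of uncertainty, which is what the Bayesian treatment adds.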

Bayesian model-based RL starts by placing a prior on \(p(\cdot|s,a)\) and sometimes over \(r(s,a)\) - a Bayes-adaptive MDP (BAMDP) is a model which once solved, gives us an optimal policy that balances exploration and exploitation. The BAMDP agent has a distribution over the transition dynamics \(P(p|h)\) - this results in an augmented transition dynamics \(p^+\) which uses \(P(p|h)\) to model the one-step transition probabilities (which were modelled by \(p(\cdot|s,a)\) in classical MDPs).
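A minimal sketch of the augmented dynamics \(p^+\) for a tabular problem, assuming an independent symmetric Dirichlet prior over each \(p(\cdot|s,a)\) (a common choice, but an assumption here). Under that prior, the history \(h\) is summarised by transition counts, and the one-step predictive probability is the posterior mean. The class name `DirichletBelief` is hypothetical.

```python
class DirichletBelief:
    """Belief P(p | h) over tabular dynamics, one Dirichlet per (s, a) pair."""

    def __init__(self, states, prior_count=1.0):
        self.states = list(states)
        self.prior_count = prior_count  # symmetric Dirichlet pseudo-count (assumed)
        self.counts = {}                # (s, a) -> {s_next: observed count}

    def update(self, s, a, s_next):
        """Posterior update after observing one real transition."""
        row = self.counts.setdefault((s, a), {})
        row[s_next] = row.get(s_next, 0) + 1

    def predictive(self, s, a):
        """p^+(s' | s, a, h): posterior-mean probability of each next state."""
        row = self.counts.get((s, a), {})
        total = sum(row.values()) + self.prior_count * len(self.states)
        return {
            s2: (row.get(s2, 0) + self.prior_count) / total
            for s2 in self.states
        }

belief = DirichletBelief(states=[0, 1])
before = belief.predictive(0, "a")      # uniform before any data
for observed_next in (1, 1, 1):
    belief.update(0, "a", observed_next)
after = belief.predictive(0, "a")       # shifted towards state 1
```

With two states and a pseudo-count of 1, three observed transitions into state 1 move the predictive probability of state 1 from 1/2 to 4/5.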

A Markov decision problem represents a decision-making task which, when solved, gives a policy that achieves the optimality criterion, whereas a BAMDP represents a joint learning and decision-making task. (#ASK Model-based solutions to MDPs also involve simultaneous learning and decision-making right?).

The BAMDP agent only knows \(p^+\) but not the true transition dynamics \(p\). BAMDP states include posteriors over the dynamics, and computing Q-functions involves averaging over all possible future trajectories and the corresponding posterior updates - the resulting actions balance the learning and reward maximization objectives. Such policies are called **Bayes-optimal**.

Bayes-adaptive Monte Carlo Planning (BAMCP) is a sample-based anytime planning algorithm for solving BAMDPs, based on Monte-Carlo Tree Search (MCTS). It uses a sampling model (i.e. a simulator). It involves 2 procedures : search and simulate. At the start of each simulation, a transition dynamics model \(\tilde{p}\sim P(p|h)\) is sampled and used to simulate transitions. Tree nodes are maintained for augmented states (which represent the current environment state and the agent's epistemic uncertainty over the environment dynamics). The sampling of \(\tilde{p}\) (called root sampling) is the only difference compared to normal MCTS. After the search procedure terminates, the root node contains the estimated Q-function values (where the states are the augmented states). These are used to choose an action, which is performed in the real environment; a real transition is then observed and the augmented state is updated accordingly.
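The sketch below illustrates only the root sampling idea, and is deliberately not full BAMCP: it has no search tree and no UCT action selection, and uses a uniformly random rollout policy after the first action. It assumes tabular dynamics with independent Dirichlet posteriors summarised by counts; all function names and the toy task are hypothetical.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one probability vector from Dirichlet(alphas) via normalised Gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_model(counts, n_states, n_actions, rng, prior=1.0):
    """Root sampling: draw a full transition model p~ from the posterior P(p | h)."""
    return {
        (s, a): sample_dirichlet(
            [counts.get((s, a, s2), 0) + prior for s2 in range(n_states)], rng
        )
        for s in range(n_states)
        for a in range(n_actions)
    }

def simulate(model, reward, state, first_action, n_actions, depth, gamma, rng):
    """Discounted return of one rollout inside the sampled model."""
    total, discount, action = 0.0, 1.0, first_action
    for _ in range(depth):
        total += discount * reward(state, action)
        probs = model[(state, action)]
        state = rng.choices(range(len(probs)), weights=probs)[0]
        action = rng.randrange(n_actions)  # random rollout policy (simplification)
        discount *= gamma
    return total

def search(counts, reward, state, n_states, n_actions,
           n_sims=2000, depth=10, gamma=0.95, seed=0):
    """Estimate root Q-values by averaging rollouts over freshly sampled models."""
    rng = random.Random(seed)
    returns = [[] for _ in range(n_actions)]
    for i in range(n_sims):
        model = sample_model(counts, n_states, n_actions, rng)  # new p~ per simulation
        action = i % n_actions  # try root actions in turn instead of UCT
        returns[action].append(
            simulate(model, reward, state, action, n_actions, depth, gamma, rng)
        )
    q_root = [sum(r) / len(r) for r in returns]
    best = max(range(n_actions), key=lambda a: q_root[a])
    return best, q_root

# Hypothetical task: 2 states, 2 actions, reward 1 whenever the agent is in state 1.
# History so far: action 0 kept the agent in state 0, action 1 moved it to state 1.
counts = {(0, 0, 0): 20, (0, 1, 1): 20}
best_action, q_root = search(
    counts, lambda s, a: 1.0 if s == 1 else 0.0,
    state=0, n_states=2, n_actions=2,
)
```

After acting in the real environment, the observed transition would be added to `counts`, which plays the role of updating the augmented state before the next call to `search`.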

In POMDPs, where only noisy observations of the state drawn from \(O(o|s,a)\) are available, we maintain a state belief \(b(\cdot|h_t)\), which is a distribution over the state space. Bayes-adaptive POMDPs (BAPOMDPs) hence do not observe states, meaning there may be multiple models of the transition dynamics consistent with the agent's history of action-observation pairs. The augmented state for BAPOMDPs now includes a distribution over the transition dynamics and one over the observation probabilities. Note that we are not guaranteed to converge to the true transition probabilities even in the limit of infinite data.
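A minimal sketch of the state belief update for a POMDP with known dynamics, i.e. the standard Bayes filter \(b'(s') \propto O(o|s',a)\sum_s p(s'|s,a)\,b(s)\). It assumes the observation probability is conditioned on the arrival state \(s'\) (conventions differ on this point), and the two-state listening task is hypothetical.

```python
def belief_update(belief, action, observation, transition, observation_model):
    """Return the posterior state belief after taking `action` and seeing `observation`."""
    states = list(belief)
    unnormalised = {}
    for s_next in states:
        # Prediction step: push the current belief through the transition dynamics.
        predicted = sum(
            transition[(s, action)].get(s_next, 0.0) * belief[s] for s in states
        )
        # Correction step: weight by the likelihood of the observation.
        likelihood = observation_model[(s_next, action)].get(observation, 0.0)
        unnormalised[s_next] = likelihood * predicted
    normaliser = sum(unnormalised.values())
    if normaliser == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: v / normaliser for s, v in unnormalised.items()}

# Hypothetical task: hidden state is "L" or "R"; listening does not change the state
# and reports the correct side with probability 0.85.
transition = {("L", "listen"): {"L": 1.0}, ("R", "listen"): {"R": 1.0}}
observation_model = {
    ("L", "listen"): {"hear_L": 0.85, "hear_R": 0.15},
    ("R", "listen"): {"hear_L": 0.15, "hear_R": 0.85},
}
prior = {"L": 0.5, "R": 0.5}
posterior = belief_update(prior, "listen", "hear_L", transition, observation_model)
```

In the Bayes-adaptive case, `transition` and `observation_model` are themselves uncertain, so the belief has to range over them as well as over the state.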

In Bayes-adaptive Partially-observable Monte Carlo Planning (BAPOMCP), the root sampling procedure samples an augmented state, which now includes distributions over the transition dynamics and the observation probabilities.

In MARL, we should consider whether:

- Agents are independent and self-interested or not.
- The setting is general-sum (individual rewards are different but not orthogonal), zero-sum (competitive and orthogonal) or fully-cooperative (all agents have the same reward thus they should co-operate).

Partially-observable stochastic games (POSGs) generalize POMDPs to the multi-agent setting - where we now have separate action spaces, observation dynamics and reward functions for each of the \(N\) agents. The agents all interact with the same environment and thus there is 1 transition dynamics function, which depends on the joint actions of all the agents. If an agent could predict the behaviour of the other agents, the POSG reduces to a POMDP.
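A minimal sketch of that reduction for two agents, assuming the other agent's policy is known and depends only on the current state (a simplification; in general it would depend on that agent's own history). Marginalising the joint dynamics over the other agent's action gives single-agent dynamics \(p_i(s'|s,a_i) = \sum_{a_{-i}} \pi_{-i}(a_{-i}|s)\, p(s'|s,a_i,a_{-i})\). The function name and the door task are hypothetical.

```python
def marginalise_other_agent(joint_transition, other_policy):
    """Fold a known policy of the other agent into the joint transition dynamics."""
    single_agent = {}
    for (s, a_i, a_other), next_dist in joint_transition.items():
        weight = other_policy[s].get(a_other, 0.0)
        row = single_agent.setdefault((s, a_i), {})
        for s_next, prob in next_dist.items():
            row[s_next] = row.get(s_next, 0.0) + weight * prob
    return single_agent

# Hypothetical task: a door opens only if both agents push at the same time.
joint_transition = {
    ("closed", "push", "push"): {"open": 1.0},
    ("closed", "push", "wait"): {"closed": 1.0},
    ("closed", "wait", "push"): {"closed": 1.0},
    ("closed", "wait", "wait"): {"closed": 1.0},
}
other_policy = {"closed": {"push": 0.7, "wait": 0.3}}
single_agent = marginalise_other_agent(joint_transition, other_policy)
```

From the protagonist's point of view, pushing now opens the door with probability 0.7, and the other agent has become part of the environment.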

In MARL, there is no consensus on what a good optimality criterion is. We may borrow solution concepts from game theory, such as requiring the joint policy to achieve a Nash equilibrium (which can be a poor choice). Game-theoretic approaches take an objective perspective, whereas the subjective perspective looks at the multi-agent system from the point of view of the (protagonist) agent.

Bayesian Best-response models explicitly model the behaviour of agent \(-i\) (agents other than the protagonist agent \(i\)) - specifically by modelling the internal state \(I_{-i}\) and other parts of the model \(f_{-i}\) (called the frame of the agent) which are static.
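A heavily simplified sketch of the Bayesian best-response idea, assuming a single-shot matrix game and a finite set of candidate models of the other agent, each reduced to a fixed action distribution. This ignores the internal state \(I_{-i}\) and treats each candidate as a static frame; all names and numbers are hypothetical.

```python
def update_model_belief(belief, observed_action, candidate_policies):
    """Bayes rule over candidate models of the other agent, given one observed action."""
    unnormalised = {
        m: belief[m] * candidate_policies[m].get(observed_action, 0.0) for m in belief
    }
    normaliser = sum(unnormalised.values())
    if normaliser == 0.0:
        raise ValueError("observed action is impossible under every candidate model")
    return {m: v / normaliser for m, v in unnormalised.items()}

def expected_payoff(my_action, belief, candidate_policies, payoff):
    """Expected payoff of my_action, averaging over models and the other's actions."""
    return sum(
        belief[m] * prob * payoff[(my_action, other_action)]
        for m in belief
        for other_action, prob in candidate_policies[m].items()
    )

def best_response(belief, candidate_policies, payoff, my_actions):
    """Action maximising expected payoff under the current belief."""
    return max(
        my_actions,
        key=lambda a: expected_payoff(a, belief, candidate_policies, payoff),
    )

# Hypothetical coordination game: payoff 1 if both agents pick the same side.
candidate_policies = {
    "prefers_left": {"left": 0.9, "right": 0.1},
    "prefers_right": {"left": 0.1, "right": 0.9},
}
payoff = {
    ("left", "left"): 1.0, ("left", "right"): 0.0,
    ("right", "left"): 0.0, ("right", "right"): 1.0,
}
prior = {"prefers_left": 0.5, "prefers_right": 0.5}
posterior = update_model_belief(prior, "left", candidate_policies)
choice = best_response(posterior, candidate_policies, payoff, ["left", "right"])
```

After observing one "left" action, the posterior puts 0.9 on `prefers_left`, and the best response is to play "left".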

(#TODO Continue from Contribution section onwards).