Curiosity in Multi-Agent Reinforcement Learning - 2019
Details
Title: Curiosity in Multi-Agent Reinforcement Learning
Author(s): Schäfer, Lukas and Albrecht, Stefano
Link(s):
Rough Notes
Chapter 1
Some issues in Multi-Agent Reinforcement Learning (MARL):
- Partial observability.
- Reward sparsity.
See the cited work on exploration. In this work, agents are trained to compute intrinsic rewards. Three definitions of intrinsic reward are applied to value-based and policy gradient MARL methods to study their impact on exploration, on both competitive and cooperative tasks, as well as on modified variants of those tasks with partial observability and sparse rewards.
Chapter 2
In MARL, the outcome of each agent's actions depends on the actions of all the other agents, which makes things more challenging than in single-agent RL. Similarly, the credit assignment problem becomes more difficult, as rewards must be attributed to agents as well as to actions.
The main framework is the Partially Observable Stochastic Game (POSG): a state space \(\mathbf{ S }\), a state transition function \(P\), a discount factor \(\gamma\), a set of agents \(I=\{1,\cdots,N\}\), a joint action space \(\mathbf{ A }=A_1\times\cdots\times A_N\), a joint observation space \(\mathbf{ \Omega }=\Omega_1 \times \cdots \times \Omega_N\), an observation function \(\mathbf{ O }:\mathbf{ S }\times\mathbf{ A }\times\mathbf{ \Omega }\to [0,1]\), and a reward function \(R_i:\mathbf{ S }\times\mathbf{ A }\times\mathbf{ S }\to\mathbb{ R }\) for each agent, where the reward functions are often identical in cooperative tasks.
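For reference, a standard way to write each agent's objective in this framework (not verbatim from the thesis) is the expected discounted return under the joint policy \(\pi=(\pi_1,\cdots,\pi_N)\):

\[
J_i(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t R_i(s_t, \mathbf{a}_t, s_{t+1})\right],
\]

where \(\mathbf{a}_t \in \mathbf{ A }\) is the joint action taken at time \(t\).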
Amongst value-function-based methods, Independent Q-Learning has each agent learn its own Q-function while treating the other agents as part of the environment; this ignores the resulting non-stationarity of the environment and can lead to suboptimal performance in cooperative games. Amongst policy gradient methods, Multi-Agent Deep Deterministic Policy Gradient (MADDPG), an extension of Deep Deterministic Policy Gradient (DDPG), trains a critic and an actor network for each agent. The critics (Q-functions) are centralized, i.e. they take in observations and actions from all agents during training, whereas during execution each agent acts using only its own actor, conditioned on its local observation.
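A minimal sketch of this centralized-training, decentralized-execution setup in PyTorch is below; module and variable names are hypothetical and not taken from the thesis, it is only meant to illustrate where the centralized critic enters.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

class MADDPGAgent:
    """One agent: decentralized actor pi_i(o_i), centralized critic Q_i(o, a)."""
    def __init__(self, obs_dim, act_dim, joint_obs_dim, joint_act_dim, gamma=0.95):
        self.actor = nn.Sequential(mlp(obs_dim, act_dim), nn.Tanh())         # o_i -> a_i
        self.target_actor = nn.Sequential(mlp(obs_dim, act_dim), nn.Tanh())
        self.critic = mlp(joint_obs_dim + joint_act_dim, 1)                  # (o, a) -> Q_i
        self.target_critic = mlp(joint_obs_dim + joint_act_dim, 1)
        self.gamma = gamma

def critic_loss(agent_i, agents, batch):
    """TD loss for agent i's centralized critic on a replay-buffer batch.

    batch['obs'][j], batch['act'][j], batch['next_obs'][j]: (B, dim_j) tensors
    for agent j; batch['rew_i'], batch['done']: (B, 1) tensors.
    """
    joint_obs = torch.cat(batch['obs'], dim=-1)
    joint_act = torch.cat(batch['act'], dim=-1)
    with torch.no_grad():
        # Target actions come from each agent's decentralized target actor,
        # but the target critic scores the joint next observation-action pair.
        next_joint_obs = torch.cat(batch['next_obs'], dim=-1)
        next_joint_act = torch.cat(
            [a.target_actor(o) for a, o in zip(agents, batch['next_obs'])], dim=-1)
        q_next = agent_i.target_critic(torch.cat([next_joint_obs, next_joint_act], dim=-1))
        target = batch['rew_i'] + agent_i.gamma * (1.0 - batch['done']) * q_next
    q = agent_i.critic(torch.cat([joint_obs, joint_act], dim=-1))
    return nn.functional.mse_loss(q, target)
```

During execution each agent only evaluates `agent.actor(o_i)` on its local observation; the centralized critics and target networks are used solely during training.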
Intrinsic motivation refers to activities pursued for their inherent satisfaction rather than for their consequences - this leads to exploratory behaviour. [TODO From Page 17]
Previous notes
Three definitions of intrinsic reward are applied to value-based and policy gradient MARL methods. Curiosity introduces a slight disturbance to training on the original tasks, and the same also occurs under partial observability, where joint and decentralized curiosity do not assist exploration. However, curiosity leads to good improvements in training stability and performance for policy gradient MARL when training with sparse rewards.
MARL is the extension of RL to multiple agents; the objective is still to learn a policy that maximizes expected future cumulative reward.
Some challenges in MARL include:
- Exploration-exploitation trade-off - agents need to explore the environment to guarantee eventually finding the optimal policy, but exploration sacrifices short-term rewards.
- Credit assignment problem - identifying which agents and actions are responsible for received rewards.
Intrinsic motivation concerns activities pursued for their inherent satisfaction rather than their consequences, and such motivation leads to exploratory behaviour.
Methods that replicate curiosity for efficient exploration in RL are often a form of reward shaping.
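As a concrete (hypothetical) example of such reward shaping, the sketch below computes an ICM-style intrinsic reward from the prediction error of a learned forward model and adds it to the extrinsic reward; the names, the encoder, and the scaling factor `eta` are illustrative assumptions, not the thesis's specific intrinsic-reward definitions.

```python
import torch
import torch.nn as nn

class ForwardModel(nn.Module):
    """Predicts the next observation embedding from the current embedding and action."""
    def __init__(self, emb_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim + act_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, emb_dim))

    def forward(self, emb, act):
        return self.net(torch.cat([emb, act], dim=-1))

def shaped_reward(forward_model, encoder, obs, act, next_obs, extrinsic_r, eta=0.1):
    """Reward shaping: r = r_ext + eta * r_int, where r_int is the forward-model
    prediction error (a poorly predicted transition yields a larger curiosity bonus)."""
    with torch.no_grad():
        pred_next = forward_model(encoder(obs), act)
        r_int = 0.5 * (pred_next - encoder(next_obs)).pow(2).sum(dim=-1, keepdim=True)
    return extrinsic_r + eta * r_int
```

The forward model (and encoder) would be trained alongside the policy, so the bonus shrinks as transitions become predictable and exploration shifts toward less-visited parts of the environment.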