Model-Based Methods in Reinforcement Learning

Details

Title: Model-Based Methods in Reinforcement Learning
Author(s): Igor Mordatch, Jessica Hamrick
Link(s): https://sites.google.com/view/mbrl-tutorial

Rough Notes

This is from an ICML tutorial on model-based reinforcement learning (RL).

Outline

  • Introduction and Motivation
  • Problem Statement
  • What is a "model"?
  • What is model-based control?
  • Model-based control in the loop
  • What else can models be used for?
  • What's missing from model-based methods?
  • Conclusion

Introduction and Motivation

Model-based reasoning is important for robotics, safety, human-AI interaction, games, scientific applications, and operations research.

Problem Statement

Sequential decision making loop: A cyclic graph between the environment and the agent. The agent takes an action, and the environment changes state and returns a reward. The goal is for the agent to find a policy that maximizes long-term rewards.

This means that we have data \(\mathcal{D}=\{s_t,a_t,r_{t+1},s_{t+1}\}_{t=0}^T\) representing the interaction between the agent and the environment up to some time \(T\).

Model-free approaches learn the policy directly from this data (\(\mathcal{D}\to \pi\)), while model-based approaches first learn a model (#DOUBT model of what?) then use it to learn the policy (\(\mathcal{D}\to f\to \pi\)).

The model is defined to be a representation that explicitly encodes knowledge about the structure of the environment and task. This could include:

  • Transition/dynamics model: \(s_{t+1} = f_s(s_t,a_t)\)
  • Reward model: \(r_{t+1}=f_r(s_t,a_t)\)
  • Inverse transition/dynamics model: \(a_t = f_s^{-1}(s_t,s_{t+1})\)
  • Distance model \(d_{ij} = f_d(s_i,s_j)\)
  • Future returns model: \(G_t = Q(s_t,a_t)\) or \(G_t = V(s_t)\)
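
As a concrete (made-up) illustration of learning such models from \(\mathcal{D}\), here is a minimal sketch that fits a linear transition model \(f_s\) and reward model \(f_r\) by least squares; the toy data and the linear form are assumptions made for brevity, not the tutorial's method.

#+begin_src python
import numpy as np

# Hypothetical logged data D = {(s_t, a_t, r_{t+1}, s_{t+1})}; here we fabricate it.
T, state_dim, act_dim = 500, 3, 1
S = np.random.randn(T, state_dim)
A = np.random.randn(T, act_dim)
S_next = S + 0.1 * A + 0.01 * np.random.randn(T, state_dim)  # unknown "true" dynamics
R = -(S ** 2).sum(axis=1) + 0.05 * np.random.randn(T)        # unknown "true" reward

# Fit linear models s_{t+1} ~ [s_t, a_t] W_s and r_{t+1} ~ [s_t, a_t] . w_r
X = np.hstack([S, A])
W_s, *_ = np.linalg.lstsq(X, S_next, rcond=None)
w_r, *_ = np.linalg.lstsq(X, R, rcond=None)

def f_s(s, a):
    """Learned transition model: predicted next state."""
    return np.concatenate([s, a]) @ W_s

def f_r(s, a):
    """Learned reward model: predicted reward."""
    return float(np.concatenate([s, a]) @ w_r)
#+end_src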

The model can be used to:

  • Simulate the environment.
  • Assist the learning algorithm.
  • Boost the policy.

Model-based methods are more data-efficient, adapt better to changing rewards and dynamics, and explore better, whereas model-free approaches are often better in the asymptotic reward regime and require less computation at deployment.

What is a "model"?

The states of the model depend on the states/observations/latent states of the system being modeled; for some physics-based systems these are known and the model can be obtained via system identification (#DOUBT They mention using system identification but do not explain what it is). If the system is more complex, a learned approach such as a neural network also works; GNNs have been used successfully here.

In the real world we often only have access to (higher-dimensional) observations that depend on the underlying state; in this case we want our model to capture the relation between the low-dimensional state and the observation. Some models instead aim to predict the next hidden (latent) state given the current observation. There are also recurrent value models that predict the value for each step into the future, which is useful since in many cases we only care about the values of states, not the states themselves.
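
As a rough sketch of the interface such observation/latent-state models expose (the names and linear placeholder weights below are hypothetical; training is omitted):

#+begin_src python
import numpy as np

class LatentModel:
    """Toy latent-state model: encode an observation into a low-dimensional latent
    state, predict latent dynamics, and predict values directly from latents.
    Weights are random placeholders; in practice they would be learned."""

    def __init__(self, obs_dim, latent_dim, act_dim):
        rng = np.random.default_rng(0)
        self.W_enc = rng.normal(size=(obs_dim, latent_dim))               # encoder
        self.W_dyn = rng.normal(size=(latent_dim + act_dim, latent_dim))  # latent dynamics
        self.w_val = rng.normal(size=latent_dim)                          # value head

    def encode(self, obs):
        return obs @ self.W_enc                     # observation -> latent state

    def step(self, z, a):
        return np.concatenate([z, a]) @ self.W_dyn  # latent transition

    def value(self, z):
        return float(z @ self.w_val)                # predicted value of a latent state
#+end_src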

Trade-off tables at 29:22 and 32:37.

What is model-based control?

One use of a model is to generate experience and apply model-free approaches to these samples. An example algorithm here is Dyna, which updates Q-values using both real and model-generated transitions. Models are also used in policy learning to generate "rollouts" spanning multiple steps.
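
A minimal tabular Dyna-Q sketch, assuming a hypothetical gym-style environment with discrete, hashable states; the hyperparameters are made up:

#+begin_src python
import random
from collections import defaultdict

def dyna_q(env, episodes=100, n_planning=10, alpha=0.1, gamma=0.95, eps=0.1):
    """Dyna-Q: Q-learning on real transitions plus extra updates on transitions
    replayed from a learned (here simply memorized) model."""
    Q = defaultdict(float)          # Q[(s, a)]
    model = {}                      # model[(s, a)] = (r, s_next), deterministic memo
    actions = list(range(env.action_space.n))
    greedy = lambda s: max(actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = random.choice(actions) if random.random() < eps else greedy(s)
            s_next, r, done, _ = env.step(a)
            # (1) direct RL update from the real transition
            target = r + (0.0 if done else gamma * Q[(s_next, greedy(s_next))])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (2) model learning: remember the observed transition
            model[(s, a)] = (r, s_next)
            # (3) planning: n extra updates from model-generated transitions
            for _ in range(n_planning):
                (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
                ptarget = pr + gamma * Q[(ps_next, greedy(ps_next))]
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s_next
    return Q
#+end_src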

An example of end-to-end learning in RL is Policy Gradient (PG), which performs gradient-based optimization of a parametric policy, e.g. via sample-based approaches like the REINFORCE algorithm. Model-based approaches can be incorporated into PG by replacing the real transitions and rewards with our models and differentiating through the rollout via Back-Propagation-Through-Time (BPTT). This is deterministic and thus lower variance than REINFORCE, but it is prone to local minima and can suffer from exploding/vanishing gradients.
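
To make BPTT through a model concrete, here is a toy sketch (all numbers and the linear forms are made up, not from the tutorial): a linear policy on a 1-D linear model is improved by rolling the model forward and backpropagating the return through time by hand.

#+begin_src python
# Toy BPTT sketch: dynamics s' = a_dyn*s + b_dyn*a, reward r = -(s^2 + c*a^2),
# linear policy a = theta*s. Roll the model forward for H steps, backpropagate
# the return through time to get d(return)/d(theta), then do gradient ascent.
a_dyn, b_dyn, c = 0.9, 0.5, 0.1
H, lr, theta = 20, 0.05, 0.0

for _ in range(200):
    # forward pass through the model (no environment interaction)
    s, ret, states, actions = 1.0, 0.0, [], []
    for t in range(H):
        a = theta * s
        ret += -(s * s + c * a * a)
        states.append(s)
        actions.append(a)
        s = a_dyn * s + b_dyn * a
    # backward pass: reverse-mode accumulation of d(ret)/d(theta)
    ds, dtheta = 0.0, 0.0     # d(ret)/d(s_{t+1}) and accumulated d(ret)/d(theta)
    for t in reversed(range(H)):
        s_t, a_t = states[t], actions[t]
        da = -2.0 * c * a_t + ds * b_dyn            # d(ret)/d(a_t)
        dtheta += da * s_t                          # a_t = theta * s_t
        ds = -2.0 * s_t + ds * a_dyn + da * theta   # d(ret)/d(s_t)
    theta += lr * dtheta                            # gradient ascent on the return

print("learned feedback gain:", theta)
#+end_src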

In the taxonomy of model-based approaches, there is background vs. decision-time planning. Background planning learns how to act in any situation; it relates more to habit and a fast type of thinking. Decision-time planning aims to find the best action sequence only for the current situation; it relates more to improvisation and a slow type of thinking. Relevant table at 13:38.

The distinction between discrete and continuous actions is also important here: for background planning there is not much of a difference, but for decision-time planning there is. For example, Monte-Carlo Tree Search (MCTS) in discrete action spaces keeps track of Q-values and the number of times each state has been visited, and repeatedly applies expansion (of nodes according to the policy), evaluation (of the long-term value of each node, e.g. via Monte-Carlo rollouts), and backup (propagating the Q-values and visit counts back to the parent nodes). In contrast, trajectory optimization in continuous action spaces initializes an action sequence from a guess, then repeatedly performs expansion (executing the action sequence in the model to get a state sequence), evaluation (scoring the trajectory as a whole), and backpropagation to get the relevant gradients. Within the continuous-action, decision-time planning paradigm there is a further split between shooting methods and collocation methods. The Cross-Entropy Method (CEM) and PI\({}^2\) are commonly used to escape local optima.
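
A minimal shooting-style planner using the Cross-Entropy Method, assuming learned models f_s and f_r like those sketched earlier; the hyperparameters are made up:

#+begin_src python
import numpy as np

def cem_plan(s0, f_s, f_r, horizon=15, pop=64, elites=8, iters=5, act_dim=1):
    """CEM trajectory optimization: sample action sequences, roll them through
    the model, keep the best ones, and refit the sampling distribution."""
    mean = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(iters):
        # expansion: sample candidate action sequences
        cand = mean + std * np.random.randn(pop, horizon, act_dim)
        returns = np.zeros(pop)
        # evaluation: roll each sequence through the model and sum predicted rewards
        for i in range(pop):
            s = s0
            for t in range(horizon):
                returns[i] += f_r(s, cand[i, t])
                s = f_s(s, cand[i, t])
        # refit: update the sampling distribution towards the elite sequences
        elite = cand[np.argsort(returns)[-elites:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mean   # planned open-loop action sequence
#+end_src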

Model-based control in the loop

Gathering data is in a way a cyclic problem: a bad policy leads to bad experience, which leads to a bad model, which in turn leads back to a bad policy, and so on. There are also stability-related problems.

The models, however, are just that - models.

  • We don't have experience everywhere.
  • There are function approximation errors.
  • Small errors in the model propagate and compound.
  • The planner may exploit the model errors.
  • Longer model rollouts are less reliable.

Some solutions here include not committing to a plan but continually replanning. In general, planning conservatively helps: keep model rollouts short, maintain a distribution over models and plan for the average or worst case, and/or plan to stay close to states where the model is certain (implicitly, by staying close to the past policy, or explicitly, by penalizing visits to unknown regions).
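
Not committing to the plan is just receding-horizon (MPC-style) control: replan from the current state at every step and execute only the first action. A sketch, assuming a gym-style env and a planner such as the cem_plan sketched earlier:

#+begin_src python
def mpc_rollout(env, plan_fn, steps=200):
    """Receding-horizon control: replan at every step, execute only the first action."""
    s = env.reset()
    total = 0.0
    for _ in range(steps):
        plan = plan_fn(s)                    # e.g. lambda s: cem_plan(s, f_s, f_r)
        s, r, done, _ = env.step(plan[0])    # commit only to the first action
        total += r
        if done:
            break
    return total
#+end_src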

Regarding model uncertainty, there are two types of uncertainty:

  • Epistemic uncertainty: relates to a lack of knowledge about the world, is a distribution over beliefs, is reducible by gathering more data, and changes with learning.
  • Aleatoric uncertainty (risk): the inherent stochasticity of the world, is a distribution over outcomes, cannot be reduced with more observations, and is static, i.e. does not change with learning.

Currently, ensembles are popular for uncertainty quantification in this context.
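
A common recipe, sketched here with linear models for brevity (the bootstrap-ensemble setup is an assumption, not necessarily the tutorial's exact method): train K models on bootstrap resamples of the data and read off epistemic uncertainty as the disagreement between their predictions.

#+begin_src python
import numpy as np

def fit_ensemble(S, A, S_next, k=5, seed=0):
    """Fit K linear transition models on bootstrap resamples of the data."""
    rng = np.random.default_rng(seed)
    X = np.hstack([S, A])
    members = []
    for _ in range(k):
        idx = rng.integers(0, len(X), size=len(X))     # bootstrap resample
        W, *_ = np.linalg.lstsq(X[idx], S_next[idx], rcond=None)
        members.append(W)
    return members

def predict_with_uncertainty(members, s, a):
    """Mean prediction plus disagreement (epistemic uncertainty) across members."""
    x = np.concatenate([s, a])
    preds = np.stack([x @ W for W in members])
    return preds.mean(axis=0), preds.var(axis=0)       # variance ~ epistemic uncertainty
#+end_src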

Can we combine background planning and decision-time planning? There are some approaches to this:

  • Distillation: Generate decision-time plans for multiple situations and distill them into a policy/value function. Gather the trajectories and corresponding rewards and treat distillation as a supervised learning problem. If the learnt policies themselves suffer from compounding errors, we can create new decision-time plans from states visited by the policy and add these trajectories to the distillation data - this is the DAgger algorithm. There may also be inconsistent plans - to avoid this we can feed the policy being distilled to the planner generating the trajectories for distillation, and add an extra term to the optimization objective that encourages plans similar to the policy being distilled (#DOUBT Here are we using the same policy being distilled or is there something like a double Q network). A sketch of the distillation loop appears after this list.
  • Terminal value functions: Append a learned value for the terminal state of the rollout to avoid myopic behaviour.
  • Planning as policy improvement.
  • Implicit planning: Put a planner inside the policy network and train end-to-end. Many examples of this around 26:00.
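
The DAgger-style distillation loop mentioned above, sketched at a high level; plan_from, fit_policy, and rollout_states are hypothetical helpers standing in for the decision-time planner, the supervised learner, and policy execution:

#+begin_src python
def distill(plan_from, fit_policy, rollout_states, init_states, iters=5):
    """DAgger-style distillation: plan from states visited by the current policy,
    add the (state, planned action) pairs to the dataset, and refit the policy."""
    dataset = []                               # accumulated (state, action) pairs
    states = list(init_states)
    policy = None
    for _ in range(iters):
        for s in states:
            dataset.append((s, plan_from(s)))  # supervision = decision-time plan
        policy = fit_policy(dataset)           # supervised learning on all data so far
        states = rollout_states(policy)        # states the distilled policy actually visits
    return policy
#+end_src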

What else can models be used for?

Models of the world can be used for:

  • Exploration.
  • Hierarchical reasoning.
  • Adaptivity and Generalization.
  • Representation Learning.

(Also reasoning about other agents, dealing with partial observability, language understanding, commonsense reasoning and much more).

Having a model of the world gives us a way to "reset" things. Resetting could be done while executing the policy itself, by saving "interesting" states to revisit and explore further. One could also reach a terminal state and then explore backwards from there; this is called backward exploration.
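
A rough sketch of exploration with resets; get_state()/set_state() are assumptions about the simulator or model (not part of the standard gym API), and the "interesting states" criterion here is just random retention:

#+begin_src python
import random

def reset_based_exploration(env, steps=1000, rollout_len=20, keep_prob=0.05):
    """Keep an archive of saved states and restart short exploratory rollouts from them."""
    env.reset()
    archive = [env.get_state()]
    for _ in range(steps // rollout_len):
        env.set_state(random.choice(archive))         # jump back to an archived state
        for _ in range(rollout_len):
            _, _, done, _ = env.step(env.action_space.sample())
            if done:
                break
            if random.random() < keep_prob:           # heuristically keep some visited states
                archive.append(env.get_state())
    return archive
#+end_src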

We could also use a pre-defined intrinsic reward (curiosity-based exploration), which encourages agents to visit unknown states. Planning to explore: use the disagreement between forward model predictions as an intrinsic reward. Goal-directed exploration involves learning a density model over states, sampling goals from this model, and then training a policy to achieve the imagined goals.
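
Planning to explore can reuse the ensemble sketched earlier: the intrinsic reward is the disagreement of the forward models, and the planner maximizes it (a rough sketch; combining it with the task reward is omitted).

#+begin_src python
def intrinsic_reward(members, s, a):
    """Exploration bonus: total variance of the ensemble's next-state predictions."""
    _, var = predict_with_uncertainty(members, s, a)   # from the earlier ensemble sketch
    return float(var.sum())

# e.g. plan for curiosity with the CEM planner sketched earlier:
#   plan = cem_plan(s0, f_s, lambda s, a: intrinsic_reward(members, s, a))
#+end_src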

An example of hierarchical reasoning is Task And Motion Planning (TAMP), which plans symbolically at the task level and performs motion planning at the low level jointly; this enables solving very long-horizon problems, e.g. OpenAI's Rubik's Cube solver.

Regarding representation learning, one could treat model learning as an auxiliary optimization objective. Plannable representations also help, i.e. learning embeddings of states in which it is easier to plan.

Often we may wish to adapt to changes in rewards and/or dynamics - having an explicit model means we can adapt the planner and/or the model directly, which is faster than adapting the policy (#DOUBT Why).

What's missing from model-based approaches?

Humans are the ultimate model-based reasoners, capable of motor control, language comprehension, pragmatics, theory of mind, decision making, intuitive physics, scientific reasoning, and creativity (some references at 20:38 of Part 4).

Some themes emerge:

  • Compositionality: We understand the world in terms of objects and relations between them, we can break down the world into pieces and manipulate them in our imagination.
  • Causality: We can reason about things that have never happened and will never happen (counterfactual reasoning).
  • Incompleteness: We can come up with ideas, understand if they are wrong, improve on the initial incomplete model and repeat.
  • Adaptivity: Make use of previous knowledge extremely efficiently.
  • Efficiency: Learn from extremely few samples.
  • Abstraction: We can abstract states at different levels (table vs. floor, kitchen vs. bedroom, home vs. office, etc.) and the relations between them, alongside temporal abstraction (reasoning at different timescales).

Conclusion

We covered what it means to have a model of the world, where the model fits into the picture, the landscape of model-based methods, practical considerations, where else models can be used, and where there is room for improvement.
