John Tsitsiklis (MIT): "The Shades of Reinforcement Learning" - 2019
Details
Title: John Tsitsiklis (MIT): "The Shades of Reinforcement Learning"
Author(s): MIT Institute for Data, Systems, and Society
Link(s): https://www.youtube.com/watch?v=OmpzeWym7HQ
Rough Notes
Reinforcement learning (RL) is a methodology for dealing with stochastic control problems: there is some state \(s_t\) which is fed to a controller \(\pi\) that chooses a control \(a_t\sim \pi(\cdot\mid s_t)\); the dynamics then produce the next state (possibly with noise), and the goal is to choose \(\pi\) so as to minimize the expected cost over some time horizon.
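A minimal sketch of this loop (the linear dynamics, quadratic cost, and proportional policy below are illustrative placeholders, not anything from the talk):

```python
import numpy as np

def dynamics(s, a, rng):
    # Next state = current state + action + Gaussian noise (toy linear dynamics).
    return s + a + 0.1 * rng.standard_normal()

def cost(s, a):
    # Quadratic stage cost penalizing both state deviation and control effort.
    return s**2 + 0.1 * a**2

def policy(s, rng):
    # A stochastic policy pi(. | s): a noisy proportional controller.
    return -0.5 * s + 0.05 * rng.standard_normal()

def rollout(horizon=50, s0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    s, total = s0, 0.0
    for _ in range(horizon):
        a = policy(s, rng)          # a_t ~ pi(. | s_t)
        total += cost(s, a)         # accumulate stage cost
        s = dynamics(s, a, rng)     # state transition with noise
    return total

# The expected cost is estimated by averaging rollouts; the control problem
# is to pick the policy that makes this number as small as possible.
print(np.mean([rollout(seed=i) for i in range(100)]))
```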
The classical approach to this started with Dynamic Programming (DP): when you are considering an action \(a\), you think about the current stage cost, the next state, and the cost incurred starting from that next state. Bellman used this intuition to characterize the optimal value function recursively. However, people soon realized that solving these problems exactly is difficult in practice, due in particular to the curse of dimensionality.
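For concreteness, in the infinite-horizon discounted-cost setting (a standard form assumed here, not spelled out in the talk), the recursion reads
\[
V^*(s) = \min_{a}\; \mathbb{E}\big[\, c(s,a) + \gamma\, V^*(s') \,\big], \qquad s' \sim P(\cdot \mid s, a),
\]
where \(c(s,a)\) is the stage cost and \(\gamma \in (0,1)\) a discount factor; the curse of dimensionality comes from having to represent and update \(V^*\) over the entire state space.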
During the mid 80s to 90s, people in AI wanted to make agents that learn from interactions. The speaker and his colleague realized, after reading the classic Sutton & Barto textbook, that the RL used in the AI field was approximate DP, and wrote a book called Neuro-Dynamic Programming (DP with function approximation). There was also the field of adaptive control, focusing on online learning with an unknown system; however, this is hard even for linear systems. The field of robust control grew out of adaptive control.
These days, RL is one of the hottest fields, with things like AlphaGo and AlphaZero doing incredible things after burning a few forests. Some practical uses include Google using RL to control data center cooling. In these examples, there is only some computational simulator (e.g. a Go game engine) where costs and rewards are in Monopoly money.
The actual difference between online and offline RL is whether or not the costs are incurred in the real world, since simulators can, for example, still play a role in online RL.
Model-free RL means different things to different people. The speaker believes that being model-free is a terrible situation to be in rather than a virtue, and that we should try to avoid it or get around it. The questions here are:
- Given a model/simulator - should one use or ignore it?
- Given physical/historical data - should one learn/refine a model or not?
Alexander Madry et al. (2018): "we may need to move beyond the current benchmark-centric evaluation methodology". Take well-constructed problems (e.g. LQG, inventory control), compare RL against exact methods on small, solvable instances, and then scale up.
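A hedged sketch of that workflow (the two-state MDP, costs, and learning-rate settings below are made up for the example, not from the talk): exact value iteration gives ground-truth Q-values on a tiny instance, against which a tabular Q-learning run can be checked before moving to larger problems.

```python
import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9
# P[s, a, s'] = transition probability, C[s, a] = expected stage cost.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])
C = np.array([[1.0, 2.0],
              [0.5, 0.3]])

# --- Exact solution via value iteration on the Bellman recursion ---
V = np.zeros(n_states)
for _ in range(1000):
    Q_exact = C + gamma * P @ V          # Q[s, a] = c(s, a) + gamma * E[V(s')]
    V = Q_exact.min(axis=1)

# --- Tabular Q-learning from sampled transitions (cost minimization) ---
rng = np.random.default_rng(0)
Q = np.zeros((n_states, n_actions))
s = 0
for t in range(200_000):
    # Epsilon-greedy exploration around the current greedy (min-cost) action.
    a = rng.integers(n_actions) if rng.random() < 0.1 else Q[s].argmin()
    s_next = rng.choice(n_states, p=P[s, a])
    alpha = 0.05
    Q[s, a] += alpha * (C[s, a] + gamma * Q[s_next].min() - Q[s, a])
    s = s_next

print("exact Q:\n", Q_exact)
print("learned Q:\n", Q)
```

On an instance this small the two tables should agree closely, which is exactly the kind of sanity check the quote is arguing for before going to bigger, unsolvable problems.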