Ben Eysenbach Thesis Defense - 2023

Details

Title: Ben Eysenbach Thesis Defense
Author(s): Ben Eysenbach
Link(s): https://www.youtube.com/watch?v=ODVaWSKCzTM

Rough Notes

This thesis looks at reinforcement learning (RL) from a probabilistic perspective. RL is hard because feedback (rewards) is limited:

  • Feedback arrives many steps in the future.
  • Learning from high-dimensional data is hard with limited feedback.
  • Rewards are hard to specify and measure.

Self-Supervised Learning (SSL) approaches learn from limited feedback. Within SSL, contrastive learning methods learn representations so that similar inputs (e.g. faces) have similar representations.

Thesis in 5 words: A foundation for self-supervised RL.

Outline of talk:

  • Skills: using data to define desired outcomes.
  • Planning: inferring how to reach desired outcomes.
  • Robustness and generalization.

Goal-Conditioned Reinforcement Learning involves specifying a goal state rather than a hand-crafted reward function. Given a dataset of videos and actions (e.g. from a robotic task of moving an item to a specific area), we can learn a representation that encodes information about the environment and use it to extract skills, \(\pi(a|s,g)\).

  • Representation learning here uses a state-action encoder \(\phi(s,a)\) and a goal encoder \(\psi(g)\) (which takes a future frame from the same video); both map into a shared representation space. The goal encoder is sometimes given frames from a different video. Contrastive learning can then be used to push representations of frames from the same video close together and representations from different videos apart. Encoding information about actions as well is what differentiates this from prior similar work.
  • The representations can then be used for action selection by choosing \(a_i\) such that \(\phi(s,a_i)\) is closest to the goal representation \(\psi(g)\) (see the sketch after this list).
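A minimal PyTorch sketch of how the critic and action selection described above could look. The network sizes, the InfoNCE-style loss layout, and scoring a small batch of candidate actions are my assumptions for illustration, not details taken from the talk.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveCritic(nn.Module):
    """State-action encoder phi(s, a) and goal encoder psi(g), mapping into a shared representation space."""

    def __init__(self, obs_dim, action_dim, repr_dim=64):
        super().__init__()
        self.phi = nn.Sequential(  # phi(s, a): state-action encoder
            nn.Linear(obs_dim + action_dim, 256), nn.ReLU(), nn.Linear(256, repr_dim))
        self.psi = nn.Sequential(  # psi(g): goal / future-frame encoder
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, repr_dim))

    def forward(self, obs, action, goal):
        sa_repr = self.phi(torch.cat([obs, action], dim=-1))
        g_repr = self.psi(goal)
        return sa_repr, g_repr


def contrastive_loss(critic, obs, action, future_obs):
    """InfoNCE-style loss: the positive for (s_i, a_i) is a future frame from the
    same trajectory; future frames from other trajectories in the batch are negatives."""
    sa_repr, g_repr = critic(obs, action, future_obs)       # (B, d), (B, d)
    logits = sa_repr @ g_repr.T                              # (B, B) similarity matrix
    labels = torch.arange(obs.shape[0], device=obs.device)   # matching pairs sit on the diagonal
    return F.cross_entropy(logits, labels)


def select_action(critic, obs, candidate_actions, goal):
    """Pick the candidate action a_i whose phi(s, a_i) is closest to psi(g)."""
    n = candidate_actions.shape[0]
    sa_repr, g_repr = critic(obs.expand(n, -1), candidate_actions, goal.expand(n, -1))
    scores = (sa_repr * g_repr).sum(dim=-1)                  # phi(s, a_i)^T psi(g)
    return candidate_actions[scores.argmax()]
```

In this sketch, training would alternate between sampling (state, action, future frame) triples from the videos and minimizing `contrastive_loss`; at evaluation time `select_action` scores a set of sampled candidate actions against the goal encoding.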

Overall, this means that goal-conditioned RL can be written as a self-supervised RL problem. Key takeaways:

  • Users specify desired outcomes via data, not reward functions.
  • Policies can be learnt from online interaction or from previously available data.
  • There are strong theoretical connections with reward maximization.
  • The representation learning part is aligned with the downstream task.

However, goal-conditioned RL fails to solve long-horizon tasks. This leads to planning.

Planning methods require a symbolic representation of the world. The key idea here is to use the learnt skills to build a graph, which is exactly the structure planning methods need. Following the video example, the nodes of this graph are sampled images, and the edge weights come from the learnt distance derived from \(\phi(s,a)^T\psi(g)\). This method can be interpreted as probabilistic inference (see the ICLR 2022 paper). The approach is exciting because it provides a recipe for combining planning and learning.
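A rough sketch of the graph construction and sub-goal planning. This is illustrative only: converting critic scores into edge costs via a sigmoid and negative log, the pruning threshold, and the use of networkx are assumptions of this sketch, not details from the talk.

```python
import numpy as np
import networkx as nx

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def build_planning_graph(critic_score, sampled_states, max_cost=10.0):
    """Nodes are sampled observations; edge (i, j) is weighted by how hard the
    learnt critic thinks it is to reach state j from state i.

    critic_score(s, g) -> float is assumed to return something like
    phi(s, a)^T psi(g) evaluated at the policy's action.
    """
    graph = nx.DiGraph()
    n = len(sampled_states)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            score = critic_score(sampled_states[i], sampled_states[j])
            cost = -np.log(max(sigmoid(score), 1e-6))  # high similarity -> low traversal cost
            if cost < max_cost:                        # prune edges the skill probably cannot traverse
                graph.add_edge(i, j, weight=cost)
    return graph

def plan_subgoals(graph, sampled_states, start_idx, goal_idx):
    """Shortest path through the graph yields a sequence of sub-goals; the
    goal-conditioned policy pi(a | s, g) is then run between consecutive sub-goals."""
    path = nx.shortest_path(graph, source=start_idx, target=goal_idx, weight="weight")
    return [sampled_states[i] for i in path]
```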

Future work:

  • Learning skills for navigating language, e.g. the sequence of language instructions required to solve a task.
  • Finding hierarchical structure, i.e. learning skills that interact with other skills (see the ICLR 2023 paper; also "Some history of the hierarchical Bayesian methodology" (Good, 1980)).
  • Viewing RL as a generative model.
