Reinforcement Learning: An Introduction - 2018
Details
Title: Reinforcement Learning: An Introduction
Author(s): Sutton, Richard S. and Barto, Andrew G.
Link(s): http://incompleteideas.net/book/the-book-2nd.html
Rough Notes
Chapter 1: Introduction
Reinforcement learning (RL) is a field that studies the problem of decision making under uncertainty, where the decision maker (agent) interacts with an environment to maximize cumulative reward.
Besides the agent, the main elements are:
- The agent's policy.
- A reward signal, representing short term desirability of certain states and actions.
- A value function, representing long term desirability of states.
- Optionally, a model of the environment, used for planning i.e. any method that makes decisions by considering possible future scenarios before they are experienced.
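The elements above fit together in a simple interaction loop: the policy picks an action, the environment returns a reward and next state, and value estimates summarize long-term desirability. A minimal sketch, assuming a toy two-state environment and a random policy (all names and dynamics here are illustrative, not from the book):

```python
import random

def toy_env_step(state, action):
    """Hypothetical environment: action 1 taken in state 1 earns reward 1."""
    reward = 1.0 if (state == 1 and action == 1) else 0.0
    next_state = random.choice([0, 1])  # transitions are random in this toy
    return next_state, reward

def random_policy(state):
    # A policy maps states to actions; here it ignores the state entirely.
    return random.choice([0, 1])

state, total_reward = 0, 0.0
for t in range(100):
    action = random_policy(state)
    state, reward = toy_env_step(state, action)
    total_reward += reward
```

A value function would be learned on top of this loop by averaging the returns observed from each state.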
Efficient estimation of the value function is the most important component in almost all RL algorithms in this book.
Exploration-exploitation dilemma: In P3 Paragraph 1, they say the exploration-exploitation dilemma is unsolved - is there some proof somewhere saying this is impossible (or in NP etc.) in the general case? Also what about claims from the Bayesian RL community that they automatically balance exploration and exploitation?
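One common heuristic for trading off exploration and exploitation (treated in Chapter 2) is epsilon-greedy action selection. A minimal sketch on a hypothetical 2-armed bandit (the payoff probabilities and epsilon value are illustrative):

```python
import random

random.seed(0)

def pull(arm):
    # Hypothetical bandit: arm 1 pays off more often than arm 0.
    p = [0.3, 0.7][arm]
    return 1.0 if random.random() < p else 0.0

epsilon = 0.1
q = [0.0, 0.0]   # sample-average action-value estimates
counts = [0, 0]

for t in range(10000):
    if random.random() < epsilon:
        arm = random.randrange(2)                 # explore: random arm
    else:
        arm = max(range(2), key=lambda a: q[a])   # exploit: greedy arm
    r = pull(arm)
    counts[arm] += 1
    q[arm] += (r - q[arm]) / counts[arm]          # incremental average update
```

With enough pulls, q[1] should approach 0.7 and q[0] approach 0.3, so the greedy choice converges on the better arm while epsilon keeps a trickle of exploration going.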
Among the examples in Section 1.2, the short description of planning as "anticipating possible replies and counterreplies" is intuitive.
The reward signal defines the goal of the reinforcement learning problem (#DOUBT Is this by definition?). Rewards are received immediately upon entering a state, which is why the long-term value of a state matters more than its immediate reward - states represent the information available to the agent about the environment.
This book focuses on RL methods that learn while interacting with the environment, unlike evolutionary methods such as genetic algorithms and simulated annealing, which evaluate whole static policies by their overall outcomes and do not exploit the structure of policies as functions from states to actions. P11 Paragraph 1 details the difference.
P17 Paragraph 2 mentions what Minsky called the "basic credit assignment problem for complex reinforcement learning systems": how to distribute credit for success among the many decisions that were involved in producing it. All of the methods in this book are, in a sense, directed toward solving this problem.
Chapter 2: Multi-armed Bandits
Typos
- (P2) In the paragraph about supervised learning, line 6, "object" to "objective".