Reinforcement learning (RL)
A Markov Decision Process (MDP) with unknown dynamics, i.e. unknown state transition and reward functions, is a reinforcement learning problem. Learning through trial and error and the notion of delayed reward are key features of RL problems.
There are two main problems in RL:
- The prediction problem, i.e. estimating the value function for a given policy (a TD(0) sketch follows this list).
- The control problem, i.e. finding an optimal policy.
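As a concrete illustration of the prediction problem, here is a minimal tabular TD(0) policy-evaluation sketch. It assumes a Gym-style environment interface (`reset()` returning a state, `step(action)` returning `(next_state, reward, done)`) and a `policy` callable; these names are illustrative assumptions, not any specific library's API.

```python
from collections import defaultdict

def td0_evaluate(env, policy, episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0) policy evaluation: estimate V^pi for a fixed policy.

    Assumes an illustrative Gym-like interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done).
    """
    V = defaultdict(float)  # value estimate for each visited state
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD(0) update: move V(s) toward the bootstrapped target r + gamma * V(s')
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```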
Below are some general methods to approach RL problems:
- Model-based methods: estimate the MDP dynamics, then apply standard MDP solution methods.
- Model-free methods: learn directly from trial-and-error experience, without estimating the dynamics (a tabular Q-learning sketch follows this list).
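As an example of a model-free method, below is a minimal sketch of tabular Q-learning with an $ε$-greedy behavior policy, again assuming the same illustrative Gym-like environment interface as above.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: learn Q(s, a) from experience, with no model of the MDP.

    Assumes an illustrative Gym-like interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done); `actions` lists the available actions.
    """
    Q = defaultdict(float)  # Q[(state, action)] value estimates

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)                 # explore
        return max(actions, key=lambda a: Q[(state, a)])  # exploit

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(state)
            next_state, reward, done = env.step(action)
            # Off-policy target: bootstrap from the greedy action in the next state
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

A greedy policy can then be read off as \(\pi(s) = \arg\max_a Q(s, a)\), which is the implicit policy mentioned under value-based methods below.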
Other taxonomies include:
- On-policy vs. off-policy learning. On-policy methods evaluate and improve the same policy that is used to select actions, which can be thought of as learning on the job: the agent always acts with the best policy it currently has. Off-policy methods maintain two policies: a behavior policy that generates the experience (e.g. some other agent's policy or an $ε$-greedy policy) and a target policy that is being learned to be optimal. This lets the agent learn the optimal policy without ever executing it, i.e. by following suboptimal policies in the meantime (see the SARSA vs. Q-learning comparison after this list).
- Episodic tasks, which have a finite horizon \(T\) (often a random variable), versus non-episodic/continuing tasks, which continue without limit.
- Value-based methods, which learn state/action-value functions that define an implicit policy; policy-based methods, which learn an explicit policy without relying on an intermediate value function; and actor-critic methods, which learn and use both an explicit policy and a value function.
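To make the on-policy/off-policy distinction concrete, compare the one-step updates of SARSA (on-policy) and Q-learning (off-policy). SARSA bootstraps from the action actually taken by the behavior policy, while Q-learning bootstraps from the greedy action regardless of which action the behavior policy takes:

\[ \text{SARSA: } Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right] \]

\[ \text{Q-learning: } Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] \]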
Resources
- Reinforcement Learning FAQ by Rich Sutton.
- Reinforcement learning taxonomy by Steve Brunton.
- OpenAI Spinning Up in Deep RL.
- Hugging Face Deep Reinforcement Learning course.