Reinforcement Learning (Aalto University ELEC-E8125 2022)

Content from the 2022 offering, except for the slides on Optimal Control, which were from 2021.

Content

Markov Decision Processes

Reading: Chapter 2-2.3, 2.5-2.6, 3-3.8.

An agent interacting with an environment to achieve some goals needs to plan its actions, and this plan should take into account surprises/unknowns. We can formalize a rule for taking actions as a mapping from observations \(o\) of the current environment to a distribution over possible actions \(a\), called a policy \(\pi(a|o)\).

To evaluate how good the plan is, we could define success as reaching a particular state, taking into account how the environment changes as the plan is executed. Note that not all plans that reach a goal are equally good.

#ASK Does success in paragraph above mean reaching the terminal state?

To evaluate the quality of a plan, the goal(s) can involve immediate and/or cumulative rewards, and the rewards are typically defined specifically for the problem at hand.

Once we have a policy \(\pi\) and a reward function \(R\), we can evaluate the quality of \(\pi\) by starting from a random state, running \(\pi\), computing the return of the resulting trajectory, and averaging the returns over many such trajectories.
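A minimal sketch of this rollout-based evaluation, assuming hypothetical env and policy objects with a reset()/step() interface (these names are not from the course material):

  import numpy as np

  def estimate_return(env, policy, n_episodes=100, gamma=0.99, max_steps=200):
      """Monte-Carlo estimate of the expected (discounted) return of `policy`."""
      returns = []
      for _ in range(n_episodes):
          s = env.reset()
          G, discount = 0.0, 1.0
          for _ in range(max_steps):
              a = policy(s)                 # sample an action from pi(a|s)
              s, r, done = env.step(a)      # assumed step() signature
              G += discount * r
              discount *= gamma
              if done:
                  break
          returns.append(G)
      return np.mean(returns)               # average return over trajectories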

Planning (i.e. sequential decision making) can be viewed as optimizing a policy with respect to the expected return.

Some things that make RL hard:

  • We need to define states which are based on the environment.
  • State dynamics (i.e. how actions affect the state, or effects of actions) need to be learned and are often stochastic.
  • Rewards are often very delayed (credit assignment problem).
  • Rewards may need to be learnt.
  • Rewards may be difficult to choose/formulate.

There is a trade-off between learning (exploration) and maximizing rewards (exploitation).

The RL problem is as follows: determine \(\pi^*(a|s)\) such that \(\pi^* = \text{argmax}_{\pi}\mathbb{E}[\sum_t r_t]\).

Assuming we know the state dynamics, the problem can be formalized as a Markov Decision Process (MDP), consisting of an observable variable representing the environment \(Z=S\), a Markov transition function \(\mathbb{P}(S_{t+1}|S_t,A_t)\) defining how \(S\) evolves, and a reward function \(R_t(S_t,A_t)\); assuming a finite horizon \(T\), the solution is \(a^*_{1,\cdots,T}=\text{argmax}_{a_1,\cdots,a_T}\sum_{t=1}^T r_t\). When we do not know the transition function, we have the RL problem, and if the observable variable is given by an observation model \(P(Z_t|S_t,A_t)\) we have a Partially Observable Markov Decision Process (POMDP).

Consider solving a simple shortest path problem via an MDP, where the cost of a trajectory \(\tau=(a_1,\cdots,a_K)\) is \(L(\tau)=\sum_{k=1}^K l(s_k,a_k) + l_F(s_{K+1})\), with \(l_F(s)\) equal to 0 if \(s\in S_G\) (\(S_G\) is a goal set) and infinity otherwise. Here, the goal is to find \(\min_\tau L(\tau)\). To solve this, we can make use of Bellman's Principle of Optimality, which characterizes an optimal policy. Specifically, "An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision."

Here, assume we have the optimal policy; then the corresponding cost-to-go after starting from some state \(s_k\), denoted \(G_k^*(s_k)\), satisfies \(G^*_k(s_k)=\min_{a_k,\cdots,a_K}\sum_{i=k}^K l(s_i,a_i)+l_F(s_{K+1})\). We can write this as \(G^*_k(s_k)=\min_{a_k,\cdots,a_K} l(s_k,a_k)+G_{k+1}^*(s_{k+1})\), and since \(s_{k+1}=f(s_k,a_k)\), where \(f\) is the state transition function, we can rewrite this only as a function of \(a_k\): \(G_k^*(s_k)=\min_{a_k}l(s_k,a_k)+G_{k+1}^*(f(s_k,a_k))\). Running this iteration until the cost function is stationary results in the Bellman equation, in which the cost function associated with the optimal plan, \(G^*\), appears on both sides of the equation. Knowing \(G^*\) allows us to extract the optimal actions trivially, one greedy step at a time.
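A minimal sketch of this backward recursion on a made-up toy problem (5 states on a line, goal set \(S_G=\{4\}\), unit step cost, absorbing goal); the problem instance is illustrative only, not from the course:

  import math

  states = range(5)
  actions = (-1, +1)
  S_G = {4}                                   # goal set
  K = 10                                      # horizon (long enough to reach the goal)

  def f(s, a):                                # deterministic transition function
      return s if s in S_G else min(max(s + a, 0), 4)

  def l(s, a):                                # per-step cost (free once in the goal set)
      return 0.0 if s in S_G else 1.0

  # Terminal cost l_F: 0 inside the goal set, infinity otherwise
  G = {s: (0.0 if s in S_G else math.inf) for s in states}

  # Bellman backward recursion: G_k(s) = min_a [ l(s, a) + G_{k+1}(f(s, a)) ]
  for _ in range(K):
      G = {s: min(l(s, a) + G[f(s, a)] for a in actions) for s in states}

  print(G)   # cost-to-go, i.e. number of steps from each state to the goal
  # The optimal action at s is argmin_a l(s, a) + G(f(s, a))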

Value-based methods in discrete domains

Reading: Chapter 5-5.4, 6-6.5.

Increasing complexity for the value function:

  1. Markov Process (no actions, no rewards).
  2. Markov Reward Process (no actions, with rewards). Each state has some scalar value denoting the reward for coming to that state. Accumulated rewards are defined to be \(G_t =\sum_{k=0}^H \gamma^k r_{t+k+1}\). The corresponding Bellman equation can be obtained by simply expanding the definition of \(V(s)=\mathbb{E}[G_t|S_t=s]=\mathbb{E}[R_{t+1}+\gamma V(S_{t+1})|S_t=s]\) (where I assume the expectation is over the state transition distribution); see the sketch after this list for solving this equation directly.
  3. Markov Decision Process (MDP) with deterministic policy \(\pi(s)\).
  4. MDP with probabilistic policy.
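Because the Bellman equation of a finite Markov Reward Process is linear, \(V = R + \gamma P V\), the value function can be computed in closed form. A minimal sketch on a made-up 3-state chain (the numbers are illustrative only):

  import numpy as np

  P = np.array([[0.5, 0.5, 0.0],    # P[s, s'] = transition probabilities
                [0.0, 0.5, 0.5],
                [0.0, 0.0, 1.0]])   # state 2 is absorbing
  R = np.array([1.0, 2.0, 0.0])     # expected immediate reward in each state
  gamma = 0.9

  # Bellman equation V = R + gamma * P V  =>  (I - gamma * P) V = R
  V = np.linalg.solve(np.eye(3) - gamma * P, R)
  print(V)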

The action-value function is defined to be \(Q(s,a)=\mathbb{E}_{\pi}[G_t|S_t=s,A_t=a]\).

Optimal state-value function is \(V^*(s)=\max_\pi V_\pi(s)\) and optimal action-value function is \(Q^*(s,a)=\max_\pi Q_\pi(s,a)\).

From these optimal value functions, we can extract the optimal policy, specifically, by taking an argmax over the action space.

Value iteration uses the state-value function to reach the optimal state-value function by starting from some initial value function and iteratively updating it as \(V^*_{i+1}(s)=\max_a \left[R(s,a)+\gamma \sum_{s'}\mathbb{P}(S_{t+1}=s'|S_t=s,A_t=a)V_i^*(s')\right]\) until convergence. As \(i\to \infty\), \(V^*_i\to V^*\).
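A minimal sketch of value iteration on a tiny hand-made MDP (2 states, 2 actions); the transition probabilities and rewards are illustrative only:

  import numpy as np

  gamma = 0.9
  # P[a, s, s'] = P(S_{t+1} = s' | S_t = s, A_t = a)
  P = np.array([[[0.9, 0.1],
                 [0.2, 0.8]],
                [[0.1, 0.9],
                 [0.8, 0.2]]])
  R = np.array([[1.0, 0.0],         # R[s, a] = immediate reward
                [0.0, 2.0]])
  n_states = R.shape[0]

  V = np.zeros(n_states)
  for _ in range(1000):
      # Q[s, a] = R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s')
      Q = R + gamma * np.einsum('asn,n->sa', P, V)
      V_new = Q.max(axis=1)                   # Bellman optimality backup
      delta = np.max(np.abs(V_new - V))
      V = V_new
      if delta < 1e-8:                        # stop once (approximately) stationary
          break

  pi_star = Q.argmax(axis=1)                  # greedy (optimal) policy
  print(V, pi_star)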

Suppose now we are given some policy \(\pi\) and want to evaluate its value (the policy evaluation problem). We can iteratively get from \(V_0 \to \cdots \to V_\pi\) by updating \(V_{i+1}(s)=\sum_a\pi(a|s)\left(R(s,a)+\gamma \sum_{s'}\mathbb{P}(S_{t+1}=s'|S_t=s,A_t=a)V_i(s')\right)\).

If we are given some policy \(\pi\), it can be improved by evaluating \(V_\pi\) and then acting greedily with respect to \(V_\pi\). Alternating these two steps is called policy iteration.
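A minimal sketch of policy iteration on the same kind of tiny MDP as above (shapes \(P[a,s,s']\) and \(R[s,a]\); the numbers are illustrative only), using an exact linear solve for the policy evaluation step:

  import numpy as np

  gamma = 0.9
  P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[a, s, s']
                [[0.1, 0.9], [0.8, 0.2]]])
  R = np.array([[1.0, 0.0],                   # R[s, a]
                [0.0, 2.0]])
  n_states = R.shape[0]

  pi = np.zeros(n_states, dtype=int)          # start from an arbitrary policy
  while True:
      # Policy evaluation: solve V = R_pi + gamma * P_pi V exactly
      P_pi = P[pi, np.arange(n_states)]       # P_pi[s, s'] under policy pi
      R_pi = R[np.arange(n_states), pi]
      V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
      # Policy improvement: act greedily with respect to V_pi
      Q = R + gamma * np.einsum('asn,n->sa', P, V)
      pi_new = Q.argmax(axis=1)
      if np.array_equal(pi_new, pi):          # stop when the policy is stable
          break
      pi = pi_new

  print(pi, V)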

Monte-Carlo and Temporal Difference

Recall that RL involves solving an MDP whose transition dynamics and reward function are unknown.

In the RL setting, we can perform policy evaluation by simply running the policy, gathering the returns, and updating \(V(s)=\frac{S(s)}{N(s)}\), where \(S(s)\) is the sum of returns observed from state \(s\) and \(N(s)\) is the number of times state \(s\) has been visited - note this requires a finite horizon in order to compute the return itself.
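A minimal sketch of this (every-visit) Monte-Carlo policy evaluation, again assuming hypothetical env and policy objects with a reset()/step() interface:

  from collections import defaultdict

  def mc_evaluate(env, policy, n_episodes=1000, gamma=1.0):
      S = defaultdict(float)                  # sum of returns observed from each state
      N = defaultdict(int)                    # number of visits per state
      V = {}
      for _ in range(n_episodes):
          # Run one (finite) episode so that returns are well defined
          episode = []                        # list of (state, reward) pairs
          s, done = env.reset(), False
          while not done:
              a = policy(s)
              s_next, r, done = env.step(a)
              episode.append((s, r))
              s = s_next
          # Work backwards through the episode to accumulate returns
          G = 0.0
          for s, r in reversed(episode):
              G = r + gamma * G
              S[s] += G
              N[s] += 1
              V[s] = S[s] / N[s]
      return V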

Beyond the episodic setting, we can think of rewriting the MC average above as a running average. Recall that the running average of measurements \(G_t\) can be computed as \(V(s_t)=(1-\alpha)V(s_t)+\alpha G_t=V(s_t)+\alpha(G_t-V(s_t))\), where as we collect more values of \(G_t\), \(V(s_t)\) converges to their average. If we assume that after taking an action at time \(t\) while in state \(s_t\), the resulting state and reward \(s_{t+1},r_{t+1}\) are such that \(r_{t+1}+\gamma V(s_{t+1})\) is a good approximation to \(G_t\) (an approximation, since there is uncertainty in the environment), then we can update the value function via \(V(s_t)=V(s_t)+\alpha(r_{t+1}+\gamma V(s_{t+1}) -V(s_t))\). This is called Temporal Difference (TD) learning; specifically, this is TD(0). Note that we do not need to assume finite episodes.
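A minimal sketch of tabular TD(0) policy evaluation, with the same hypothetical env/policy interface as above:

  from collections import defaultdict

  def td0_evaluate(env, policy, n_steps=10000, alpha=0.1, gamma=0.99):
      V = defaultdict(float)
      s = env.reset()
      for _ in range(n_steps):
          a = policy(s)
          s_next, r, done = env.step(a)
          # Move V(s) towards the bootstrapped target r + gamma * V(s')
          target = r + (0.0 if done else gamma * V[s_next])
          V[s] += alpha * (target - V[s])
          s = env.reset() if done else s_next
      return V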

We can mix the ideas of MC and TD by approximating \(G_t\) in TD(0) by \(G_t^\lambda = (1-\lambda)\sum_{k=1}^\infty \lambda^{k-1}G_t^{(k)}\), where \(G_t^{(k)}\) is the \(k\)-step return obtained by unrolling the trajectory for \(k\) steps instead of 1. This is called TD(\(\lambda\)). Note however that since the sum in \(G_t^{\lambda}\) runs over \(k\) to infinity, this approach requires complete episodes.

To extend to non-episodic tasks, we can make use of eligibility traces. An eligibility trace for a state \(s\) is \(E_t(s)=\gamma \lambda E_{t-1}(s)+\mathbb{I}(S_t=s)\), which measures how much a particular state has been visited recently on average. Eligibility traces can be used to update the state-value function as \(V(s)=V(s)+\alpha E_t(s)(r_{t+1}+\gamma V(s_{t+1}) - V(s_t))\), which is called Backward-TD(\(\lambda\)). If we did not recently visit \(s\), the eligibility trace value \(E_t(s)\) is zero or close to zero, meaning \(V(s)\) is barely updated; on the other hand, if \(E_t(s)>0\), the value function is updated for such states according to the step taken at time \(t\), meaning the state values of recently visited states are also updated instead of only the state value of the current state.
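A minimal sketch of backward-view TD(\(\lambda\)) with accumulating eligibility traces (written here with an episodic loop for simplicity), under the same hypothetical env/policy interface:

  from collections import defaultdict

  def td_lambda_evaluate(env, policy, n_episodes=500,
                         alpha=0.1, gamma=0.99, lam=0.9):
      V = defaultdict(float)
      for _ in range(n_episodes):
          E = defaultdict(float)              # eligibility traces, reset each episode
          s, done = env.reset(), False
          while not done:
              a = policy(s)
              s_next, r, done = env.step(a)
              delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
              E[s] += 1.0                     # mark the current state as eligible
              # Update all recently visited states in proportion to their traces
              for state in list(E):
                  V[state] += alpha * delta * E[state]
                  E[state] *= gamma * lam     # decay the trace
              s = s_next
      return V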

Given we can estimate state-value functions using TD or MC approaches, we would want to obtain a policy. In the MC case, we can perform greedy policy improvement \(a=\text{argmax}_a Q(s,a)\) by estimating the action-value function with MC (greedy improvement over the state-value function alone would require a model of the transition dynamics) - this is called Monte-Carlo Policy Iteration. To ensure exploration, the policy can follow an ε-greedy approach.
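A minimal sketch of ε-greedy action selection from a tabular Q, stored here as a dict mapping (state, action) pairs to values (an assumed representation):

  import random

  def epsilon_greedy(Q, s, actions, eps=0.1):
      # With probability eps explore uniformly, otherwise exploit argmax_a Q(s, a)
      if random.random() < eps:
          return random.choice(actions)
      return max(actions, key=lambda a: Q.get((s, a), 0.0))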

Applying TD to the action-value function is called SARSA, which converges under GLIE (Greedy in the Limit with Infinite Exploration) conditions with ε-greedy exploration. Eligibility traces in SARSA(\(\lambda\)) are defined over state-action pairs, measuring how recently both a state and an action were visited; the update otherwise directly replaces the state-value function with the action-value function.
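A minimal sketch of tabular SARSA with ε-greedy exploration (a hypothetical env with a reset()/step() interface and a discrete action list):

  import random
  from collections import defaultdict

  def sarsa(env, actions, n_episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
      Q = defaultdict(float)                  # Q[(s, a)]

      def policy(s):                          # epsilon-greedy behaviour
          if random.random() < eps:
              return random.choice(actions)
          return max(actions, key=lambda a: Q[(s, a)])

      for _ in range(n_episodes):
          s, done = env.reset(), False
          a = policy(s)
          while not done:
              s_next, r, done = env.step(a)
              a_next = policy(s_next)
              # On-policy target uses the action actually chosen next: Q(S', A')
              target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
              Q[(s, a)] += alpha * (target - Q[(s, a)])
              s, a = s_next, a_next
      return Q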

Up until now we have considered on-policy methods, which act with the same policy as the one being learnt, i.e. the agent "learns on the job". We could instead use another policy (called the behaviour policy) to act while learning about the optimal policy (called the target policy); this is called off-policy learning. For example, we can learn from observing other agents, or we can learn about the optimal policy while using an exploratory policy.

An example of an off-policy method is Q-learning. It differs from SARSA in that the last A in SARSA, i.e. the \(A_{t+1}\) in the \(Q(S_{t+1},A_{t+1})\) term, is replaced by \(\max_a Q(S_{t+1},a)\), which does not depend on the policy that would have suggested \(A_{t+1}\).
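A minimal sketch of tabular Q-learning; compared with the SARSA sketch above, only the target changes (a max over actions instead of the next action actually taken):

  import random
  from collections import defaultdict

  def q_learning(env, actions, n_episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
      Q = defaultdict(float)                  # Q[(s, a)]
      for _ in range(n_episodes):
          s, done = env.reset(), False
          while not done:
              # Behaviour policy: epsilon-greedy with respect to the current Q
              if random.random() < eps:
                  a = random.choice(actions)
              else:
                  a = max(actions, key=lambda x: Q[(s, x)])
              s_next, r, done = env.step(a)
              # Off-policy target: max_a Q(S', a), independent of the behaviour policy
              best_next = max(Q[(s_next, x)] for x in actions)
              target = r + (0.0 if done else gamma * best_next)
              Q[(s, a)] += alpha * (target - Q[(s, a)])
              s = s_next
      return Q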

Function approximation

Reading: Chapter 9-9.3, 10-10.1.

Tabular value-based approaches do not scale to problems with large state spaces, which motivates approximating the value function.

Policy gradient

Reading: Chapter 13-13.3.

Actor-critic

Reading: Chapter 13.5, 13.7.

Optimal control

Model-based Reinforcement Learning

Reading: Chapter 8-8.2.

Interleaved Learning and Planning

Exploration and Exploitation

Partially Observable Markov Decision Processes

Reading: From "Brief Introduction to MDPs" to "General Form of a POMDP solution".
