# Topics in Reinforcement Learning (Arizona State University CSE691)

## Introduction

Taught by Dimitri Bertsekas. 2024 version, see website.

## Content

### History and motivation

- Offline and online aspects of both AlphaZero and TD-Gammon are discussed, specifically the importance of truncated rollout, which was first detailed in (, ).
- Course aims to provide a unifying framework for Reinforcement learning (RL) (AI community), Approximate Dynamic Programming (DP) (Optimization and Operations research community), Model Predictive Control (MPC) (control systems community) and discrete optimization (Algorithms community).
- In principle, exact DP can be used in many optimization problems, but suffers from the curse of dimensionality and the need for a mathematical model. Approximate DP/RL overcomes this via (function) approximation and simulation (i.e. using a computer model instead of a mathematical model).
Terminology:

RL DP/Control Reward Cost Agent Controller Action Decision/Control Environment Dynamical system (Discrete time systems, ODEs, PDEs etc) Learning Solving DP-related problem via simulation Self-learning/self-play Solving DP-related problem via simulation-based policy iteration Planning vs. Learning Solving DP-related problem with model-based vs. model-free simulation Transition model \(p(s,a,s')\) Discrete-time system equation \(x_{t+1}=f(x_t, u_k, w_k)\) Also, OR community uses \(p_{ij}(u)\) for transition probability from state \(i\) to \(j\) from action \(u\).

### Finite-Horizon Deterministic Optimal Control Model

Consider a deterministic system \(x_{k+1}=f_k(x_k,u_k)\). Assuming we start and end at states \(x_0,x_N\), taking a sequence of controls \(u_0,\cdots,u_{N-1}\) results in a chain graph with the nodes being states and the edges being the controls. Each transition has a cost, the cost of doing control \(u_k\) from state \(x_k\), denoted \(g_k(x_k,u_k)\) - this cost is incurred at stage \(k\). The total cost is \(g_N(x_N) + \sum_{k=0}^{N-1}g_k(x_k,u_k)\).

If we assume a finite number of states and controls, all possibilities can be represented as a Directed Acyclic Graph (DAG), with an extra node \(t\) representing an artificial terminal node, and all edges going into it are labelled with the terminal costs \(g_N(x_N)\). **Choosing a control that minimizes the total cost is equivalent to choosing a shortest path problem in this DAG.**

### Principle of optimality

Let \(u^*_0,\cdots,u^*_k,\cdots,u^*_{N-1}\) be an optimal control sequence, and assume we have arrived at state \(x^*_k\) after control \(u^*_k\). The **tail subproblem** is the problem of starting at \(x_k^*\) and finding a control that minimizes the total cost-to-go (i.e. total cost starting from \(x^*_k\)). The principle of optimality states that the tail of an optimal sequence (i.e. the control \(u^*_k,\cdots,u^*_{N-1}\)) is optimal for the tail subproblem.

Notice that if we are in state \(x^*_k\) where we can take possible controls \(u_k,u'_k,u''_k\) which result in next states \(x_{k+1},x'_{k+1},x''_{k+1}\), and if we happened to solve the tail subproblem that starts at \(x_{k+1}=f_k(x_k,u_k)\), (i.e. we know \(J^*_{k+1}(x_{k+1}), J^*_{k+1}(x'_{k+1}), J^*_{k+1}(x''_{k+1})\)) we can work backwards to get the cost of \(x^*_k\) by choosing the best of \(u_k,u'_k,u''_k\). The starting point is the cost for all terminal states \(x_N\) which are \(J^*_N(x_N)\). This gives rise to DP.

### Dynamic Programming (DP)

Given the terminal costs \(J^*_N(x_N)=g(x_N)\) for each \(x_N\), and \(k=0,\cdots,N-1\), we have:

\[ J^*_k(x_k) = \min_{u_k\in U_k(x_n)} g_k(x_k,u_k) + J^*_{k+1}(f_k(x_k,u_k)) \]

Which is one form of the Bellman equation. This gives the optimal cost \(J^*(x_0)\) at the final step, from which we can find the optimal control by working forward.

Any finite state discrete optimization problem (even with a sequential structure) can be solved by DP. In this case, we start from an artificial initial \(s\), and stage \(k\) will have state \(x_k = (u_0,\cdots,u_{k-1})\), and the terminal is \(G(u_0,\cdots,u_{N-1})\). This method however suffers from an exponential blowup of the state space quite quickly - it however makes sense for approximate DP/approximation in value space.

Equivalently, one can use **Q-factors**, where

\[ Q^*_k(x_k,u_k)=g_k(x_k,u_k) + J^*_{k+1}(f_k(x_k,u_k)) \],

and the optimal cost function can be recovered by

\[ J^*_k(x_k) = \min_{u_k\in U_k(x_k)} Q^*_k(x_k,u_k) \]

Where the Bellman equation is now:

\[ Q^*_k(x_k,u_k) = g_k(x_k,u_k) + \min_{u_{k+1}\in U_{k+1}(f_k(x_k,u_k))}Q^*_{k+1}(f_k(x_k,u_k),u_{k+1}) \]

### General Discrete Optimization

Minimize \(G(u)\) subject to \(u\in U\).

Each solution \(u\) has \(N\) components, corresponding to \(N\) stages. Define the states \(x_k = (u_0, \cdots, u_{k-1})\), and let \(x_0=s\) be an artificial initial state. Define the terminal cost to be \(G(u)\) and everything else to be 0. This formulation is helpful to think about approximate DP/approximation in value space.

### Next Word Prediction

Given a prompt, the current text window \(x_k\), choose the next work \(u_k\) to get the next text window \(x_{k+1}\) and so on. \(x_k\) and \(x_{k+1}\) are n-word strings differing by the single word \(u_k\) [#DOUBT Why are both states n-words?]. The state dynamics is \(x_{k+1} = f(x_k,u_k)\) which is deterministic, and the cost function \(G(x_N)\) encodes the quality of the final text string. Existing LLM models (GPTs) can be viewed as a heuristic/suboptimal control generation method, called a policy, or a base heuristic. The initial state \(x_0\) is the user supplied prompt.

### Approximate DP/Approximation in value space

The optimal cost function \(J^*_k\) is very hard to compute, so use an approximation \(\tilde{J}_k\) (i.e. **offline training**).

To find the sequence of optimal controls, start with

\[ \tilde{u}_0 \in \text{argmin}_{u_0\in U_0(x_0)}[g_0(x_0,u_0) + \tilde{J}_1(f_0(x_0,u_0))] \]

This gives the next state \(\tilde{x}_1 = f_0(x_0,\tilde{u}_0)\). From here go forward for \(k=1,\cdots,N-1\) (**online play**):

\[ \tilde{u}_k \in \text{argmin}_{u_k\in U_k(x_k)}[g_k(\tilde{x}_k,u_k)+\tilde{J}_{k+1}(f_k(\tilde{x}_k,u_k))] \]

An important question now is how to compute \(\tilde{J}_{k+1}(x_{k+1})\). This can be done via:

- Offline problem approximation, where \(\tilde{J}\) is the optimal cost function of a simpler problem, computed offline using exact DP.
- Parametric approximation, e.g. a neural network fitted on training data.
- Rollout with a heuristic.

### Rollout with a heuristic

An approach to find an approximate cost function \(\tilde{J}_k\). When we reach \(x_k\), we compute the next states \(x_{k+1}=f_k(x_k,u_k)\) for all possible \(u_k\). From each of these next possible states \(x_{k+1}\), run a heuristic to get a heuristic cost \(H_{k+1}(x_{k+1})\), then take the control \(\tilde{u}_k\) that minimizes the heuristic Q-factor \(g_k(x_k,u_k)+H_{k+1}(x_{k+1})\).

For example, in the travelling salesman problem, we can use the nearest neighbour heuristic (in general many heuristics are greedy heuristics).

[#DOUBT Is there a relation between this and A*-search?]

Rollout can also be seen as applying one-step of Newton's method starting from the cost function of the base heuristic.

### Stochastic DP

Now consider the case where state transition dynamics are not determinstic, i.e. we have \(x_{k+1}=f_k(x_k,u_k,w_k)\) where \(w_k\) is some random disturbance. Now the cost function becomes an expectation.

Let \(\pi = \{\mu_0,\cdots,\mu_{N-1}\}\) where \(\mu_k\) (called closed-loop controls/feedback policy) are now functions that map a state \(x_k\) to a control \(u_k\). This makes the problem harder - instead of find a control \(u_k\) we are now searching for a function \(\mu_k\). The problem is now to minimize over all possible \(\pi\) the cost function:

\[ J_\pi(x_0) = \mathbb{ E }[g_N(x_N)+\sum_{k=0}^{N-1}g_k(x_k,\mu_k(x_k),w_k)] \]

The optimal cost function is now \(J^*(x_0)=\min_\pi J_\pi(x_0)\).

The stochastic DP algorithm computes \(J^*_k(x_k)\) using the cost-to-go similar to deterministic DP, with the addition of taking an expectation wrt \(w_k\) at timestep \(k\). The optimal policy can be implemented in 2 ways:

- Get \(\mu^*_k\) simultaneously with \(J^*_k\) by minimizing \(u^*_k=\mu^*_k(x_k)\). Requires space for storing the policy. [#DOUBT]
- Sequentially move forward (i.e. online) and find the minimal \(u^*_k\) via the Bellman equation asumming \(J^*_{k+1}\) is known. Requires computing the expectation for all possible controls \(u_k\) and then minimizing them. Thus takes more time.

### Approximate DP/Approximation in value space in Stochastic environments

Online play: At state \(x_k\), perform \(\min_{u_k}\mathbb{ E }[g_k(x_k,u_k,w_k) + \tilde{J}_{k+1}(x_{k+1})]\). There are 3 approximations which can be designed independently of each other:

- Finding the approximate cost-to-go \(\tilde{J}_{k+1}\).
- Use certainty equivalence, i.e. pretend the problem is deterministic and set the random variable to a fixed value.
- Problem approximation, replace \(\tilde{J}_{k+1}\) with the (exact) optimal cost of a simpler problem.
- Rollout, Model Predictive Control.
- Parametric approximations e.g. using Neural Networks.
- Aggregations - replace the original problem with that of a smaller dimension.

- The expected value, which may require full on MCMC.
- Adaptive simulation, i.e. simulate more accurate controls that are more promising. e.g. Monte-Carlo Tree Search (MCTS), which can be used in offline training and online play.

- Minimization over \(u_k\), which could be in an infinite dimensional space.
- Discretize the problem.
- Selective minimization - select out a subset of promising controls e.g. via some heuristic.

The art and practice of Reinforcement Learning (RL) is mostly about different methods to deal with the above approximations.

One example is **multi-step look-ahead**:
At state \(x_k\), perform

\[ \min_{u_k,\mu_{k+1},\cdots,\mu_{k+l-1}}\mathbb{ E }_{w}[g_k(x_k,u_k,w_k)+\sum_{m=k+1}^{k+l-1}g_m(x_m,\mu_m(x_m),w_m)+\tilde{J}_{k+1}(x_{k+1})] \]

i.e. do look-ahead minimization over the first \(l\) steps. This is a DP problem with horizon less than \(N\). Here, even though the minimization is done over multiple iterations, only \(u_k\) is used in the true environment, which then gives \(x_{k+1}\), and the process is repeated.

Value space approximation methods are primarily an online play method, where offline training can be optionally used to construct the cost function approximation for one-step/multi-step look-ahead. Policy space approximations may be optionally used to provide a policy for the online rollout.

See also Lecture 2 Slide 19/34 for illustration of truncated rollout.

### Approximation in policy space

Considers a parametric form for the policy \(\tilde{\mu_k}(\cdot,\theta_k)\). Some methods used for optimization/offline training include:

- Random search (not a good idea, but still an option).
- Policy gradients.
- Classification.

Note that the parameter is learnt offline. Once \(\theta_k\) is known, the online computations are faster as we do not have to do all of the computations needed in value space alternatives.

However, once the policy is fixed we cannot adapt it online to a changing environment i.e. there is no Newton step/online replanning.

### Combining policy space and value space approximations

Given \(\tilde{J}_{k+1}\), use an approximate model \(\tilde{\mu}_k\) trained on pairs \((x_k^s,u_k^s)\) with \(u_k^s=\tilde{\mu}_k(x_k^s)\) (#DOUBT Why is it \(\tilde{\mu}_k\) there) which are collected in an offline phase. \(\tilde{\mu}_k\) can be a neural network is fit via a least squares loss.

Approximation in value space is mainly an online play method, which can optionally use cost function approximations constructed from offline training for one-step/multi-step lookahead.

Approximation in policy space is mainly an offline training method, which can optionally be used to give a policy for online rollout.

[#DOUBT Difference between multistep lookahead and online rollout?]

See also Lecture 2 Slide 24/34 which is not noted here.

### Linear Quadratic Problems

The system in 1D is:

\[ x_{k+1} = ax_k+bu_k+w_k \]

Where \(w_k\) has 0 mean. The cost over \(N\) stages is:

\[ qx_N^2 + \sum_{k=0}^{N-1}(ax_k^2+ru_k^2) \]

The DP algorithm starts with \(J^*(x_N)=qx_N^2\), computes optimal cost function as:

\[ J^*_k(x_k) = \min_{u_k}\mathbb{ E}_{w_k}[qx_k^2 + ru_k^2 + J^*_{k+1}(ax_k+bu_k+w_k)] \]

There is a closed form solution where \(J^*_k(x_k)=K_kx_k^2 + \text{constant}\), \(\mu^*_k(x_k)=L_kx_k\) where \(K_k,L_k\) can be computed in closed form. The policy only requires that \(w_k\) has 0 mean (called the **certainty equivalence** property - we can pretend \(w_k\) is deterministic and fixed at its mean and we will get the same optimal policy). Results also generalize beyond 1D.

When generalized to the case where the state \(x_k\) is observed through linear measurements \(y_k = C_kx_k + D_kw_k\), we have the Kalman Filter.

Lecture 3 [2023 version] Slide 18/35 shows a lot of intuition on two-step look-ahead, truncated rollout in the value iteration perspective.

In the policy iteration perspective, rollout is a single Newton iteration. Optimistic policy iteration consists of repeated truncated rollout iterations.

Newton step interpretation of approximation in value space generalizes very broadly. Importantly:

- Newton's method for solving the min Riccati equation is Newton's methods for solving the min Bellman equation.
- Approximation in value space is a single Newton iteration enhanced by multipstep lookahead (if any) and by truncated rollout (if any).
- Rollout is a single Newton iteration starting from the cost function of the (stable) base policy.
- Exact policy iteration is a full Newton's method.
- Multi-step look-ahead and truncated rollout enhance the stability properties of the policy computed by approximation in value space.

### Infinite horizon problems

Assume the stochastic optimal control problem, we now have:

\[ J_\pi(x_0) = \lim_{N\to \infty}\mathbb{ E }_{w_k}[\sum_{k=1}^{N-1}\alpha^k g(x_k,\mu_k(x_k),w_k)] \]

where \(\alpha\in(0,1]\) is called the discounting factor, which is mathematically convenient to make this value finite. Problems with \(\alpha=1\) typically involve a special cost-free termination state \(t\), where the goal is to reach \(t\) with minimum expected cost.

Some important results:

- \(J^*(x)=\lim_{k\to\infty}J_k(x)\) where \(J_k(x) = \min_{u\in U(x)}\mathbb{ E }_w[g(x,u,w) + \alpha J_{k-1}(f(x,u,w))]\) with \(J_0(x)=0\).
- \(J^*\) satisfies the Bellman equations (all analysis in infinite-horizon problems revolves around this).
- Optimality condition - if \(\mu^*(x)\) minimizes the Bellman equation \(\forall \: x\), it is an optimal policy that depends only at the state and not the stage. Such a policy is called stationary.

Two main algorithms here are Policy Iteration (policy evaluation then policy improvement) and Value Iteration.

### Value Iteration

Generate finite-horizon optimal cost function sequence \(\{J_k\}\):

\[ J_k(x) = \min_{u\in U(x)}\mathbb{ E }_w[g(x,u,w)+\alpha J_{k-1}(f(x,u,w))] \]

Where the starting point \(J_0\) is arbitrary.

For linear quadratic problems, value iteration can be visualized nicely by plotting the Riccati equation to see a hill-climbing style movement to the solution (Lecture 2 Slide 33/34).

### Policy Iteration

Generate policy sequence \(\{\mu_k\}\) and their cost functions \(\{J_{\mu_k}\}\): Start with a policy \(\mu\) and generate the new policy \(\tilde{\mu}\)

**Policy evaluation**which computes \(J_\mu\) for the (base) policy \(\mu\). This can be done by:- Solving the Bellman equation for \(\mu\).
- Online Monte Carlo.
- Temporal Difference (TD) methods.

**Policy improvement**step, which computes the improved (rollout) policy \(\tilde{\mu}\) using one-step look-ahead minimization:\[ \tilde{\mu}(x) = \text{argmin}_{u\in U(x)}\mathbb{ E }_w[g(x,u,w)+\alpha J_\mu(f(x,u,w))] \]

Policy iteration is much faster than value iteration, and gives the optimal policy in the limit.

Faster than Value Iteration. Can be viewed as Newton's method for solving Bellman's equation.

### Approximate Policy Iteration

First, generate data from some base policy \(\mu\).

- In the policy evaluation step: Use value data to train a value network which does approximation in value space.
- In the policy improvement step: Use policy data to train a policy network which does approximation in policy space.

Then update the base policy using the policy from the policy improvement step after doing some rollout.

### Formulating DP problems

Informally, define the controls, then the stages alongside the information at each stage, then the states. Define the state as something that summarizes the past for the future, i.e. knowing the current state makes all the past irrelevant.

State augmentation - for e.g. if we have \(x_{k+1} = f_k(x_k,x_{k-1},u_k, u_{k-1},w_k)\), we can include additional state variables \(y_k=x_{k-1}, s_k=u_{k-1}\) and we have the new dynamics:

\[ x_{k+1} = f_k(x_k,y_k,u_k,s_k,w_k)\\ y_{k+1}=x_k\\ s_{k+1}=u_k \]

The new DP algorithm now has an optimal cost function \(J^*(x_k,y_k,s_k)\). Other examples for state augmentation include cases where forecasts of future uncertainty are relevant.

### Multi-agent problems

In this problem setting, there are multiple agents collecting and sharing information selectively with each other and with an environment - agent \(i\) applies decision \(u^i\) sequentially. There is a distinct problem structures:

- Classical information pattern: Agents are fully cooperative, fully sharing and never forgetting information - this can be treated by DP.
- Nonclassical information pattern: Agents are partially sharing the information, and may be antagonistic - this is hard to treat by DP. (Lecturer says this is extremely hard, and became more feasible recently with better computation power and function approximation methods etc.)

We will consider the classical information pattern for now. Here, the decision/control has \(m\) components corresponding to \(m\) agents. First, we need to deal with the exponential size of the search/control space, which can be done by a reformulation.

Toy example: 15 spiders in a grid who can take 5 moves (move to any of the nominal directions or stay still), with the goal being to catch 3 flies in minimum time. At each time step, there are \(\approx 5^{15}\) actions. We can reduce this to \(5\times 15\) without losing much - the idea is to break down the control such that one spider moves at a time.

The reformulation is:

- Augment the state space to start from \(x\) and for each successive control from agent \(k\), let \((x,u^1,\cdots,u^k)\) be the new state. After all \(m\) agents perform a control we get the new (random) transition \(\bar{x}=f(x,u,w)\) and (random) cost \(g(x,u,w)\). This gives us \(m-1\) additional cost functions \(J^1(x,u^1),\cdots,J^{m-1}(x,u^1,\cdots,u^{m-1})\). This allows for more efficient rollout (one agent at a time). Only one state and its successors are looked at each stage during online play. Most importantly, the one-step look-ahead branching factor is reduced from \(n^m\) to \(nm\) where \(m\) is the number of choices for agent \(u^i\).

### Partial Observations/Partially Observable Markov Decision Processes (POMDPs)

Toy example - parking deadline, at each step the car moves backward or forward, and we need to decide to park at stop \(m\) for cost \(c(m)\) if its available. If we do not park by time \(N\) there is a large cost \(C\). We only see if the current space is free or not, and parking spots may be taken/emptied at each time step randomly. How do we represent states? The belief state i.e. conditional probabilities of the free/taken status of all the parking spots conditioned on all the observations so far.

For problems with partial observations, we reformulate the DP algorithm by using belief states \(b_k\). Given observation (i.e. partial/noisy state) \(z_{k+1}\), we generate a belief state \(b_{k+1} = F_k(b_k, u_k, z_{k+1})\) with some belief estimator \(F_k\) - this gives a cost \(\hat{g}(b_k,u_k)\). The belief state is fed to a controller, and a new observation is received and the loop continues. The new DP algorithm gives:

\[ J^*(b_k) = \min_{u_k\in U_k} \hat{g}_k(b_k,u_k) + \mathbb{ E }_{z_{k+1}}[J^*_{k+1}(F_k(b_k,u_k,z_{k+1}))|b_k,u_k] \]

### Adaptive Control

Deals with problems where model parameters change with time.

Example: Cruise control, where velocity is \(x_{k+1}=ax_k+bu_k\) where friction \(a<1\) and \(b>0\) depends on the road, passengers etc. \(a,b,\bar{x}\) change all the time, and may be measured with error. Some possibilities:

- Ignore parameter changes and design a robust controller that works for a broad range of parameters.
- Try to estimate the parameters and use them to modify the controller. E.g. Online replanning by optimization, online replanning by rollout with a base policy and estimate the cost values with current parameter estimates.
- View the adaptive control problem as a POMDP and use approximate solution methods for them.

#### Robust control

Proportion-Integral-Derivation (PID) Control - does not require a model of the system. There is a cost to pay for not adaptively changing the parameters.

#### Online replanning by optimization (Indirect Adaptive Control)

Two phases - System identification and controller reoptimization. Can be time consuming.

#### Online replanning by rollout/approximation in value space

Use a robust policy as a base policy for rollout - no controller reoptimization which is faster. Then introduce new parameter estimates in the lookahead minimization and the rollout. Continue to use the same robust/base policy, and possibly recalculate it in the background.

#### POMDP approach to adaptive control

This is a more principled approach. Suppose we have a deterministic system \(x_{k+1}=f_k(x_k,\theta_k,u_k)\) with \(\theta_k\) unknown. Introducing an augmented state \((x_k,\theta_k)\) which is partially observed, we can have a belief state \(b_{k,i} = p(\theta_k|I_k)\) with \(I_k=(x_0,\cdots,x_k,u_0,\cdots,u_{k-1})\) being the information vector. We then have the Bellman equation:

\[ J^*_k(x_k) = \min_{u_k}\sum_{i=1}^{m}b_{k,i}(g_k(x_k,\theta_k^i, u_k)+J^*_{k+1}(I_k, f(x_k,\theta^i_k, u_k),u_k) \]

Here we can use a value space approximation \(\tilde{J}^i(f(x_k, \theta_k^i,u_k))\) that only depends on the new state. Minimizing over \(u_k\) gives a one-step look-ahead policy.

POMDP approach also takes action such that future observations may be better instead of just estimating parameters and minimizing the cost.

Toy example: Wordle game. \(\theta\) is the unknown mystery word. \(x_k\) is the list of possible mystery words given the guesses so far. \(u_k\) is the list of possible guess words. Let \((x_k,\theta)\) be the augmented state. We can apply rollout with one of the several base heuristics.

### Model Predictive Control (MPC)

The core problem of control theory is to control a system around a reference point and path planning i.e. follow a given trajectory within the state constraints.

This can be viewed as approximation in value space, where we minimize the cost function over the next \(l\) stages (impose a finite horizon) while requiring \(x_{k+l}=0\) (common control objective). After we minimize the cost function over the \(l\) stages, we apply the first control of the minimizing sequence and discard the others. **This is rollout with a base heuristic being the minimizer that that drives \(x_{k+l}\) to 0 in \(l\) steps**. It is well suited for online replanning.

MPC with terminal cost approximation involves a non-negative terminal cost \(G_{k+l}\) in the \(l\) stage MPC problem instead of driving \(x_{k+l}\) to 0. **This can be viewed as approximation in value space with multi-step look-ahead**. Truncated rollout with some base policy can also be introduced. MPC policy is stable when \(G\) is in the region of stability - truncated rollouts help here.

MPC often involves state constraints where we must have \(x_k\in X\). There is no guarantee that this problem has a feasible solution for all initial states \(x_k\in X\) - this is particularly true for unstable systems.

Toy example of state constrained MPC: Dynamics are \(x_{k+1}=2x_k+u_k\) with \(|u_k|<1\) and \(X=\{x: |x|\leq \beta\}\). If $β > 1$the state constraint cannot be satisfied for any \(x_0\in X\). The state constraint can only be satisfied in the set \(\hat{X}=\{x : |x|\leq 1\}\) - such sets are called invariant.

A set \(X\) is said to be invariant if for each \(x\in X\) there is a control \(u\in U\) such that \(f(x,u)\in X\). This begs the question - how do we compute an invariant subset of a given constraint set? This is necessarily an offline calculation and cannot be easily performed during online play. Given \(X\) there is a largest possible invariant set which can be computed in the limit with a value iteration like algorithm.

### Rollout for Deterministic DP

Finite horizon (w/ finite states) problems can be transformed to infinite horizon by let the new state be \((X_1,\cdots,X_N)\), which when written as a collection of columns, we have an edge going from each \(X_i\) in column \(j\) to \(X_{i+1}\) in column \(j+1\), with \(X_N\) going to a terminal state \(t\). This is an important conceptual idea.

Rollout is a special case of approximation in value space, where we find:

\[ \min_{u_k,\mu_{k+1},\cdots,\mu_{k+l-1}}\mathbb{ E }[g_k(x_k,u_k,w_k) + \sum_{i=k+1}^{k+l-1}g_i(x_i,\mu_i(x_i), w_i) + \tilde{J}_{k+l}(x_{k+l})] \]

Here, \(\tilde{J}_{k+l}(x_{k+l})\) is the cost function of some policy or heuristic. The policy **used** for rollout is called the **base policy**, and the policy **obtained** by look-ahead minimization is called the **rollout policy**.

Rollout is important as it gives important options for cost function approximation for value space methods, it is also the basic building block of the fundamental policy iteration algorithm and approximate variants.

Rollout is:

- The RL method that is easiest to understand and apply.
- It is by far the most reliable, and is very general (applies to deterministic/stochastic/finite and infinite horizon problems).
- As a special case of approximation in value space it relates to Newton's method.
- Deals well with online replanning and is a useful alternative to reoptimization in adaptive control.
- Relates to MPC and can be used to improve the stability of MPC schemes.
- Truncated rollout can be combined with many of the RL methods used in practice.

Recall deterministic rollout:

- Given current state \(x_k\):
- For each possible next control \(u_k\), generate a new Q-factor \(\tilde{Q}_k(x_k,u_k) = g_k(x_k,u_k) + H_{k+1}(f_k(x_k,u_k))\) using some base heuristic \(H_{k+1}\) for the cost starting at \(x_{k+1}\).
- Choose the control \(u_k\) with the smaller Q-factor value.

Cost improvement in rollout is not automatic - special conditions must be met for the rollout policy to be no worse than the basic heuristic.

A base heuristic is sequentially consistent if in a given state it chooses the control that depends only on that state i.e. a heuristic that generates \(\{x_k,\cdots,x_N\}\) will generate \(\{x_{k+1},\cdots,x_N\}\) when starting from \(x_{k+1}\). Greedy heuristics are sequentially consistent. We will focus on a less restrictive condition: sequential improvement i.e. cost of the rollout policy is less than or equal to cost of the base heuristic.

Simplified rollout: Assuming sequential improvement, choose any \(u_k\) that satisfies \(\tilde{Q}(x_k,u_k)=g_k(x_k,u_k)+H_{k+1}(f_k(x_k,u_k))\leq H_k(x_k)\). This saves computation.

Rollout with superheuristic: Use multiple heuristics and use the Q-factor of the best heuristic - requires all heuristics to be sequentially improving.

Fortified rollout: [Lecture 5 Slide 26/30] The idea is at each step, follow the best trajectory computed thus far. At state \(x_k\) store both the permanent rollout trajectory \(\bar{P}_k =\{x_0,u_0,\cdots,u_{k-1},x_k\}\) and the tentative best trajectory \(\bar{T}_k = \bar{P}_k \cup \{\bar{u}_k,\bar{x}_{k+1},\cdots,\bar{u}_{N-1},\bar{x}_N\}\). Reject the minimum Q-factor choice \(\tilde{u}_k\) if its complete trajectory is more costly than the current tentative best, else choose \(\tilde{u}_k\) and update \(\bar{T}_{k+1}\). (#DOUBT Why keep a tentative trajectory - is it there just for a reference?).

Mode-free rollout: By model-free we mean we have no equations for the model. Here we do not know \(G\) and/or the constraint sets \(U_k\). Instead we have a base heuristic, and a human who can rank any 2 complete solutions without assigning values to them. Deterministic rollout can be used here.

Example: Rollout with an expert - can be applied to RNA folding (, a). Given a sequence of nucleotides (molecules of type A,C,G,U), fold it in an "interesting" way - we want to pair nucleotides that result in an interesting structure. Biologists do not agree on what the best folding is, and often imperfect software that compares complete folding is used. At each step, go one step forward from the current pairing, where at each nucleotide in the sequence we can (a) close the pairing (b) do nothing (c) open the pairing. There is a base heuristic which generates a complete folding from partial foldings. Two complete foldings can be compared by the expert software. There is no explicit cost function here.

(#TODO Start from Lecture 6 and continue)