Topics in Reinforcement Learning (Arizona State University CSE691)

Introduction

Taught by Dimitri Bertsekas. 2024 version, see website.

Content

History and motivation

  • Offline and online aspects of both AlphaZero and TD-Gammon are discussed, specifically the importance of truncated rollout, which was first detailed in (, ).
  • The course aims to provide a unifying framework for Reinforcement Learning (RL) (AI community), Approximate Dynamic Programming (DP) (optimization and operations research community), Model Predictive Control (MPC) (control systems community), and discrete optimization (algorithms community).
  • In principle, exact DP can be used in many optimization problems, but suffers from the curse of dimensionality and the need for a mathematical model. Approximate DP/RL overcomes this via (function) approximation and simulation (i.e. using a computer model instead of a mathematical model).
  • Terminology:

    | RL                             | DP/Control                                                               |
    |--------------------------------+--------------------------------------------------------------------------|
    | Reward                         | Cost                                                                     |
    | Agent                          | Controller                                                               |
    | Action                         | Decision/Control                                                         |
    | Environment                    | Dynamical system (discrete-time systems, ODEs, PDEs, etc.)               |
    | Learning                       | Solving a DP-related problem via simulation                              |
    | Self-learning/self-play        | Solving a DP-related problem via simulation-based policy iteration       |
    | Planning vs. learning          | Solving a DP-related problem with model-based vs. model-free simulation  |
    | Transition model \(p(s,a,s')\) | Discrete-time system equation \(x_{k+1}=f_k(x_k,u_k,w_k)\)              |

    Also, OR community uses \(p_{ij}(u)\) for transition probability from state \(i\) to \(j\) from action \(u\).

Finite-Horizon Deterministic Optimal Control Model

Consider a deterministic system \(x_{k+1}=f_k(x_k,u_k)\). Starting at state \(x_0\) and ending at state \(x_N\), applying a sequence of controls \(u_0,\cdots,u_{N-1}\) traces out a chain graph whose nodes are the states and whose edges are the controls. Each transition has a cost: the cost of applying control \(u_k\) at state \(x_k\), denoted \(g_k(x_k,u_k)\), is incurred at stage \(k\). The total cost is \(g_N(x_N) + \sum_{k=0}^{N-1}g_k(x_k,u_k)\).

If we assume a finite number of states and controls, all possibilities can be represented as a Directed Acyclic Graph (DAG), with an extra node \(t\) representing an artificial terminal node; all edges going into it are labelled with the terminal costs \(g_N(x_N)\). Choosing a control sequence that minimizes the total cost is equivalent to solving a shortest path problem on this DAG.

Principle of optimality

Let \(u^*_0,\cdots,u^*_k,\cdots,u^*_{N-1}\) be an optimal control sequence, and suppose that following it we arrive at state \(x^*_k\) at stage \(k\). The tail subproblem is the problem of starting at \(x_k^*\) and finding a control sequence that minimizes the total cost-to-go (i.e. the total cost starting from \(x^*_k\)). The principle of optimality states that the tail of an optimal sequence (i.e. the controls \(u^*_k,\cdots,u^*_{N-1}\)) is optimal for the tail subproblem.

Notice that if we are at state \(x^*_k\), where we can take possible controls \(u_k,u'_k,u''_k\) leading to next states \(x_{k+1},x'_{k+1},x''_{k+1}\), and we have already solved the tail subproblems starting at these next states (i.e. we know \(J^*_{k+1}(x_{k+1}), J^*_{k+1}(x'_{k+1}), J^*_{k+1}(x''_{k+1})\)), then we can work backwards to get the optimal cost at \(x^*_k\) by choosing the best of \(u_k,u'_k,u''_k\). The starting point of this backward recursion is the terminal cost \(J^*_N(x_N)=g_N(x_N)\) for each terminal state \(x_N\). This gives rise to DP.

Dynamic Programming (DP)

Given the terminal costs \(J^*_N(x_N)=g_N(x_N)\) for each \(x_N\), for \(k=N-1,\cdots,0\) (proceeding backward) we have:

\[ J^*_k(x_k) = \min_{u_k\in U_k(x_k)} \left[ g_k(x_k,u_k) + J^*_{k+1}(f_k(x_k,u_k)) \right] \]

which is one form of the Bellman equation. At the end of the backward recursion we obtain the optimal cost \(J^*_0(x_0)\), from which the optimal control sequence can be found by working forward.
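
A minimal sketch of this backward recursion in Python for a finite deterministic problem; the containers and the functions states, controls, f, g, g_N are illustrative placeholders, not anything defined in the lecture.

  # Minimal backward DP sketch for a finite-horizon deterministic problem.
  # states[k] lists the states at stage k, controls(k, x) gives U_k(x),
  # f(k, x, u) is the system function, and g(k, x, u), g_N(x) are the costs;
  # all of these are placeholders supplied by the user.

  def backward_dp(states, controls, f, g, g_N, N):
      """Return cost-to-go tables J[k][x] and the minimizing controls mu[k][x]."""
      J = [dict() for _ in range(N + 1)]
      mu = [dict() for _ in range(N)]
      for x in states[N]:
          J[N][x] = g_N(x)                      # terminal condition J*_N(x_N) = g_N(x_N)
      for k in range(N - 1, -1, -1):            # backward recursion k = N-1, ..., 0
          for x in states[k]:
              best_u, best_cost = None, float("inf")
              for u in controls(k, x):          # candidate controls in U_k(x_k)
                  cost = g(k, x, u) + J[k + 1][f(k, x, u)]
                  if cost < best_cost:
                      best_u, best_cost = u, cost
              J[k][x], mu[k][x] = best_cost, best_u
      return J, mu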

Any finite-state discrete optimization problem (even one without a sequential structure) can be solved by DP. In this case, we start from an artificial initial state \(s\), stage \(k\) has state \(x_k = (u_0,\cdots,u_{k-1})\), and the terminal cost is \(G(u_0,\cdots,u_{N-1})\). This method, however, quickly suffers from an exponential blowup of the state space; it nevertheless makes sense as a starting point for approximate DP/approximation in value space.

Equivalently, one can use Q-factors, where

\[ Q^*_k(x_k,u_k)=g_k(x_k,u_k) + J^*_{k+1}(f_k(x_k,u_k)) \]

and the optimal cost function can be recovered by

\[ J^*_k(x_k) = \min_{u_k\in U_k(x_k)} Q^*_k(x_k,u_k) \]

Where the Bellman equation is now:

\[ Q^*_k(x_k,u_k) = g_k(x_k,u_k) + \min_{u_{k+1}\in U_{k+1}(f_k(x_k,u_k))}Q^*_{k+1}(f_k(x_k,u_k),u_{k+1}) \]

General Discrete Optimization

Minimize \(G(u)\) subject to \(u\in U\).

Each solution \(u\) has \(N\) components, corresponding to \(N\) stages. Define the states \(x_k = (u_0, \cdots, u_{k-1})\), and let \(x_0=s\) be an artificial initial state. Define the terminal cost to be \(G(u)\) and all intermediate stage costs to be 0. This formulation is helpful to think about approximate DP/approximation in value space.

Next Word Prediction

Given a prompt, the current text window \(x_k\), choose the next word \(u_k\) to get the next text window \(x_{k+1}\), and so on. \(x_k\) and \(x_{k+1}\) are n-word strings differing by the single word \(u_k\) [#DOUBT Why are both states n-words?]. The state dynamics \(x_{k+1} = f(x_k,u_k)\) are deterministic, and the cost function \(G(x_N)\) encodes the quality of the final text string. Existing LLMs (GPTs) can be viewed as a heuristic/suboptimal control generation method, called a policy, or a base heuristic. The initial state \(x_0\) is the user-supplied prompt.

Approximate DP/Approximation in value space

The optimal cost function \(J^*_k\) is very hard to compute, so we use an approximation \(\tilde{J}_k\), typically constructed by offline training.

To find the sequence of optimal controls, start with

\[ \tilde{u}_0 \in \text{argmin}_{u_0\in U_0(x_0)}[g_0(x_0,u_0) + \tilde{J}_1(f_0(x_0,u_0))] \]

This gives the next state \(\tilde{x}_1 = f_0(x_0,\tilde{u}_0)\). From here go forward for \(k=1,\cdots,N-1\) (online play):

\[ \tilde{u}_k \in \text{argmin}_{u_k\in U_k(\tilde{x}_k)}[g_k(\tilde{x}_k,u_k)+\tilde{J}_{k+1}(f_k(\tilde{x}_k,u_k))] \]

An important question now is how to compute \(\tilde{J}_{k+1}(x_{k+1})\). This can be done via:

  • Offline problem approximation, where \(\tilde{J}\) is the optimal cost function of a simpler problem, computed offline using exact DP.
  • Parametric approximation, e.g. a neural network fitted on training data.
  • Rollout with a heuristic.
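
Whichever of these is used to construct \(\tilde{J}\), the online play loop itself is simple. Here is a minimal sketch in Python; the functions f, g, controls, and J_tilde are illustrative placeholders (not from the lecture), with J_tilde standing in for an offline-trained approximation.

  # One-step lookahead (approximation in value space) for a deterministic problem.
  # f, g, controls, and the approximation J_tilde are user-supplied placeholders.

  def online_play(x0, f, g, controls, J_tilde, N):
      """Generate a trajectory by minimizing g_k + J_tilde_{k+1} at each stage."""
      x, trajectory = x0, []
      for k in range(N):
          u = min(controls(k, x),
                  key=lambda u: g(k, x, u) + J_tilde(k + 1, f(k, x, u)))
          trajectory.append((x, u))
          x = f(k, x, u)                 # move to the next state and repeat
      return trajectory, x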

Rollout with a heuristic

An approach to find an approximate cost function \(\tilde{J}_k\). When we reach \(x_k\), we compute the next states \(x_{k+1}=f_k(x_k,u_k)\) for all possible \(u_k\). From each of these next possible states \(x_{k+1}\), run a heuristic to get a heuristic cost \(H_{k+1}(x_{k+1})\), then take the control \(\tilde{u}_k\) that minimizes the heuristic Q-factor \(g_k(x_k,u_k)+H_{k+1}(x_{k+1})\).

For example, in the travelling salesman problem, we can use the nearest neighbour heuristic (in general many heuristics are greedy heuristics).
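
To make the travelling salesman example concrete, here is a minimal rollout sketch in Python using nearest neighbour completion as the base heuristic; the distance matrix and function names are illustrative, a sketch of the scheme described above rather than any implementation from the lecture.

  # Rollout for a small TSP instance, with nearest neighbour as the base heuristic.
  # dist is a matrix of pairwise distances; city 0 is the (arbitrary) start.

  def nn_cost_to_go(dist, current, unvisited, home):
      """Base heuristic H: nearest-neighbour completion cost starting at `current`,
      visiting every city in `unvisited`, and returning to `home`."""
      cost, cur, remaining = 0.0, current, set(unvisited)
      while remaining:
          nxt = min(remaining, key=lambda c: dist[cur][c])
          cost += dist[cur][nxt]
          cur = nxt
          remaining.remove(nxt)
      return cost + dist[cur][home]

  def rollout_tsp(dist):
      """Rollout policy: at each step choose the next city minimizing g_k + H_{k+1}."""
      n = len(dist)
      tour, unvisited = [0], set(range(1, n))
      while unvisited:
          best = min(unvisited,
                     key=lambda c: dist[tour[-1]][c]                        # g_k(x_k, u_k)
                     + nn_cost_to_go(dist, c, unvisited - {c}, tour[0]))    # H_{k+1}(x_{k+1})
          tour.append(best)
          unvisited.remove(best)
      return tour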

[#DOUBT Is there a relation between this and A*-search?]

Rollout can also be seen as applying one step of Newton's method starting from the cost function of the base heuristic.

Stochastic DP

Now consider the case where the state transition dynamics are not deterministic, i.e. we have \(x_{k+1}=f_k(x_k,u_k,w_k)\) where \(w_k\) is a random disturbance. The cost function now becomes an expectation.

Let \(\pi = \{\mu_0,\cdots,\mu_{N-1}\}\), where the \(\mu_k\) (called feedback policies, or closed-loop controls) are now functions that map a state \(x_k\) to a control \(u_k\). This makes the problem harder: instead of finding a control \(u_k\) we are now searching for a function \(\mu_k\). The problem is now to minimize over all possible \(\pi\) the cost function:

\[ J_\pi(x_0) = \mathbb{ E }[g_N(x_N)+\sum_{k=0}^{N-1}g_k(x_k,\mu_k(x_k),w_k)] \]

The optimal cost function is now \(J^*(x_0)=\min_\pi J_\pi(x_0)\).

The stochastic DP algorithm computes \(J^*_k(x_k)\) using the cost-to-go similar to deterministic DP, with the addition of taking an expectation wrt \(w_k\) at timestep \(k\). The optimal policy can be implemented in 2 ways:

  • Get \(\mu^*_k\) simultaneously with \(J^*_k\) by storing, for each \(x_k\), the minimizing control \(u^*_k=\mu^*_k(x_k)\) during the backward pass (see the sketch after this list). Requires space for storing the policy. [#DOUBT]
  • Sequentially move forward (i.e. online) and find the minimizing \(u^*_k\) via the Bellman equation, assuming \(J^*_{k+1}\) is known. Requires computing the expectation for all possible controls \(u_k\) and then minimizing over them, and thus takes more time.
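
A minimal sketch of the stochastic DP backward pass, implementing the first option above by storing the minimizing control at each state; the containers, f, g, g_N, controls, and the finite disturbance distribution w_dist are illustrative placeholders.

  # Backward DP for a stochastic problem with a finite disturbance distribution.
  # w_dist(k) yields (w, prob) pairs; f, g, g_N, controls are user-supplied.

  def stochastic_backward_dp(states, controls, f, g, g_N, w_dist, N):
      """Return cost-to-go tables J[k][x] and a stored policy mu[k][x]."""
      J = [dict() for _ in range(N + 1)]
      mu = [dict() for _ in range(N)]
      for x in states[N]:
          J[N][x] = g_N(x)
      for k in range(N - 1, -1, -1):
          for x in states[k]:
              def q(u):   # expected Q-factor: E_w[ g_k(x,u,w) + J*_{k+1}(f_k(x,u,w)) ]
                  return sum(p * (g(k, x, u, w) + J[k + 1][f(k, x, u, w)])
                             for w, p in w_dist(k))
              u_star = min(controls(k, x), key=q)
              J[k][x], mu[k][x] = q(u_star), u_star
      return J, mu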

Approximate DP/Approximation in value space in Stochastic environments

Online play: At state \(x_k\), perform \(\min_{u_k}\mathbb{ E }[g_k(x_k,u_k,w_k) + \tilde{J}_{k+1}(x_{k+1})]\). There are 3 approximations which can be designed independently of each other:

  1. Finding the approximate cost-to-go \(\tilde{J}_{k+1}\).
    • Use certainty equivalence, i.e. pretend the problem is deterministic and set the random variable to a fixed value.
    • Problem approximation, replace \(\tilde{J}_{k+1}\) with the (exact) optimal cost of a simpler problem.
    • Rollout, Model Predictive Control.
    • Parametric approximations e.g. using Neural Networks.
    • Aggregation - replace the original problem with one of smaller dimension.
  2. The expected value, which may require extensive Monte Carlo simulation.
    • Adaptive simulation, i.e. simulate more accurately the controls that appear more promising, e.g. Monte-Carlo Tree Search (MCTS), which can be used in offline training and online play.
  3. Minimization over \(u_k\), which could be in an infinite dimensional space.
    • Discretize the problem.
    • Selective minimization - select out a subset of promising controls e.g. via some heuristic.

The art and practice of Reinforcement Learning (RL) is mostly about different methods to deal with the above approximations.

One example is multi-step look-ahead: At state \(x_k\), perform

\[ \min_{u_k,\mu_{k+1},\cdots,\mu_{k+l-1}}\mathbb{ E }_{w}[g_k(x_k,u_k,w_k)+\sum_{m=k+1}^{k+l-1}g_m(x_m,\mu_m(x_m),w_m)+\tilde{J}_{k+l}(x_{k+l})] \]

i.e. do look-ahead minimization over the first \(l\) steps. This is a DP problem with horizon shorter than \(N\). Even though the minimization is over \(l\) stages, only \(u_k\) is applied in the true environment, which then gives \(x_{k+1}\), and the process is repeated.

Approximation in value space is primarily an online play method, where offline training can optionally be used to construct the cost function approximation for one-step/multi-step look-ahead. Policy space approximations may optionally be used to provide a policy for the online rollout.

See also Lecture 2 Slide 19/34 for illustration of truncated rollout.

Approximation in policy space

Considers a parametric form for the policy \(\tilde{\mu}_k(\cdot,\theta_k)\). Some methods used for optimization/offline training include:

  • Random search (not a good idea, but still an option).
  • Policy gradients.
  • Classification.

Note that the parameter is learnt offline. Once \(\theta_k\) is known, the online computations are fast, since we do not have to do all of the computations needed by value space alternatives.

However, once the policy is fixed we cannot adapt it online to a changing environment i.e. there is no Newton step/online replanning.

Combining policy space and value space approximations

Given \(\tilde{J}_{k+1}\), train an approximate policy \(\tilde{\mu}_k\) on pairs \((x_k^s,u_k^s)\) with \(u_k^s=\tilde{\mu}_k(x_k^s)\) (#DOUBT Why is it \(\tilde{\mu}_k\) there), which are collected in an offline phase. \(\tilde{\mu}_k\) can be a neural network fit via a least squares loss.

Approximation in value space is mainly an online play method, which can optionally use cost function approximations constructed from offline training for one-step/multi-step lookahead.

Approximation in policy space is mainly an offline training method, which can optionally be used to give a policy for online rollout.

[#DOUBT Difference between multistep lookahead and online rollout?]

See also Lecture 2 Slide 24/34 which is not noted here.

Linear Quadratic Problems

The system in 1D is:

\[ x_{k+1} = ax_k+bu_k+w_k \]

Where \(w_k\) has 0 mean. The cost over \(N\) stages is:

\[ qx_N^2 + \sum_{k=0}^{N-1}(qx_k^2+ru_k^2) \]

The DP algorithm starts with \(J^*_N(x_N)=qx_N^2\) and computes the optimal cost function as:

\[ J^*_k(x_k) = \min_{u_k}\mathbb{ E}_{w_k}[qx_k^2 + ru_k^2 + J^*_{k+1}(ax_k+bu_k+w_k)] \]

There is a closed form solution where \(J^*_k(x_k)=K_kx_k^2 + \text{constant}\), \(\mu^*_k(x_k)=L_kx_k\) where \(K_k,L_k\) can be computed in closed form. The policy only requires that \(w_k\) has 0 mean (called the certainty equivalence property - we can pretend \(w_k\) is deterministic and fixed at its mean and we will get the same optimal policy). Results also generalize beyond 1D.
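
A minimal sketch of the resulting scalar Riccati recursion, using the standard closed-form expressions for \(K_k\) and \(L_k\) (these formulas are assumed, not transcribed from the slides).

  # Scalar Riccati recursion for x_{k+1} = a x_k + b u_k + w_k with stage cost
  # q x_k^2 + r u_k^2 and terminal cost q x_N^2 (standard textbook formulas).

  def riccati_1d(a, b, q, r, N):
      """Return K_k (so J*_k(x) = K_k x^2 + const) and L_k (so mu*_k(x) = L_k x)."""
      K = [0.0] * (N + 1)
      L = [0.0] * N
      K[N] = q                                                 # terminal condition
      for k in range(N - 1, -1, -1):
          L[k] = -a * b * K[k + 1] / (r + b**2 * K[k + 1])     # optimal linear feedback gain
          K[k] = q + a**2 * r * K[k + 1] / (r + b**2 * K[k + 1])
      return K, L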

When generalized to the case where the state \(x_k\) is observed through linear measurements \(y_k = C_kx_k + D_kw_k\), we have the Kalman Filter.

Lecture 3 [2023 version] Slide 18/35 gives a lot of intuition about two-step look-ahead and truncated rollout from the value iteration perspective.

In the policy iteration perspective, rollout is a single Newton iteration. Optimistic policy iteration consists of repeated truncated rollout iterations.

Newton step interpretation of approximation in value space generalizes very broadly. Importantly:

  • Newton's method for solving the Riccati equation is Newton's method for solving the Bellman equation, specialized to the linear quadratic case.
  • Approximation in value space is a single Newton iteration, enhanced by multistep lookahead (if any) and by truncated rollout (if any).
  • Rollout is a single Newton iteration starting from the cost function of the (stable) base policy.
  • Exact policy iteration is a full Newton's method.
  • Multi-step look-ahead and truncated rollout enhance the stability properties of the policy computed by approximation in value space.

Infinite horizon problems

Assume the stochastic optimal control problem, we now have:

\[ J_\pi(x_0) = \lim_{N\to \infty}\mathbb{ E }_{w_k}[\sum_{k=0}^{N-1}\alpha^k g(x_k,\mu_k(x_k),w_k)] \]

where \(\alpha\in(0,1]\) is called the discounting factor, which is mathematically convenient to make this value finite. Problems with \(\alpha=1\) typically involve a special cost-free termination state \(t\), where the goal is to reach \(t\) with minimum expected cost.

Some important results:

  • \(J^*(x)=\lim_{k\to\infty}J_k(x)\) where \(J_k(x) = \min_{u\in U(x)}\mathbb{ E }_w[g(x,u,w) + \alpha J_{k-1}(f(x,u,w))]\) with \(J_0(x)=0\).
  • \(J^*\) satisfies the Bellman equations (all analysis in infinite-horizon problems revolves around this).
  • Optimality condition - if \(\mu^*(x)\) attains the minimum in the Bellman equation for all \(x\), it is an optimal policy; it depends only on the state and not on the stage. Such a policy is called stationary.

Two main algorithms here are Policy Iteration (policy evaluation then policy improvement) and Value Iteration.

Value Iteration

Generate finite-horizon optimal cost function sequence \(\{J_k\}\):

\[ J_k(x) = \min_{u\in U(x)}\mathbb{ E }_w[g(x,u,w)+\alpha J_{k-1}(f(x,u,w))] \]

Where the starting point \(J_0\) is arbitrary.
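
A minimal value iteration sketch for a finite-state, finite-action discounted problem. The transition-probability data structure P (mapping a state and control to a list of (probability, next state, cost) triples) is an illustrative choice, equivalent to the \(f(x,u,w)\) form for finite disturbance spaces.

  # Value iteration for a finite-state, finite-action discounted problem.
  # P[x][u] is a list of (prob, next_state, cost) triples; alpha is the discount factor.

  def value_iteration(P, alpha, tol=1e-8, max_iters=10_000):
      """Iterate J <- T J until the sup-norm change drops below tol."""
      J = {x: 0.0 for x in P}                          # arbitrary starting point J_0
      for _ in range(max_iters):
          J_new = {
              x: min(sum(p * (cost + alpha * J[y]) for p, y, cost in P[x][u])
                     for u in P[x])
              for x in P
          }
          if max(abs(J_new[x] - J[x]) for x in P) < tol:
              return J_new
          J = J_new
      return J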

For linear quadratic problems, value iteration can be visualized nicely by plotting the Riccati equation to see a hill-climbing style movement to the solution (Lecture 2 Slide 33/34).

Policy Iteration

Generate a policy sequence \(\{\mu_k\}\) and their cost functions \(\{J_{\mu_k}\}\): start with a policy \(\mu\) and generate an improved policy \(\tilde{\mu}\) via two steps:

  • Policy evaluation which computes \(J_\mu\) for the (base) policy \(\mu\). This can be done by:
    • Solving the Bellman equation for \(\mu\).
    • Online Monte Carlo.
    • Temporal Difference (TD) methods.
  • Policy improvement step, which computes the improved (rollout) policy \(\tilde{\mu}\) using one-step look-ahead minimization:

    \[ \tilde{\mu}(x) = \text{argmin}_{u\in U(x)}\mathbb{ E }_w[g(x,u,w)+\alpha J_\mu(f(x,u,w))] \]

Policy iteration converges much faster than value iteration, and gives the optimal policy in the limit. It can be viewed as Newton's method for solving Bellman's equation.
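
A minimal policy iteration sketch under the same illustrative transition data structure as the value iteration sketch above; policy evaluation is done here by fixed-point iteration rather than by solving the linear system directly.

  # Policy iteration for a finite-state, finite-action discounted problem.
  # Reuses P[x][u] = [(prob, next_state, cost), ...] from the value iteration sketch.

  def evaluate_policy(P, mu, alpha, tol=1e-10):
      """Compute J_mu by iterating the (linear) Bellman equation for the policy mu."""
      J = {x: 0.0 for x in P}
      while True:
          J_new = {x: sum(p * (cost + alpha * J[y]) for p, y, cost in P[x][mu[x]])
                   for x in P}
          if max(abs(J_new[x] - J[x]) for x in P) < tol:
              return J_new
          J = J_new

  def policy_iteration(P, alpha):
      """Alternate policy evaluation and one-step lookahead policy improvement."""
      mu = {x: next(iter(P[x])) for x in P}            # arbitrary initial policy
      while True:
          J_mu = evaluate_policy(P, mu, alpha)
          mu_new = {x: min(P[x], key=lambda u: sum(p * (cost + alpha * J_mu[y])
                                                   for p, y, cost in P[x][u]))
                    for x in P}
          if mu_new == mu:                             # no change: mu is optimal
              return mu, J_mu
          mu = mu_new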

Approximate Policy Iteration

First, generate data from some base policy \(\mu\).

  • In the policy evaluation step: Use value data to train a value network which does approximation in value space.
  • In the policy improvement step: Use policy data to train a policy network which does approximation in policy space.

Then update the base policy using the policy from the policy improvement step after doing some rollout.

Formulating DP problems

Informally, define the controls, then the stages alongside the information at each stage, then the states. Define the state as something that summarizes the past for the future, i.e. knowing the current state makes all the past irrelevant.

State augmentation - for example, if we have \(x_{k+1} = f_k(x_k,x_{k-1},u_k, u_{k-1},w_k)\), we can include additional state variables \(y_k=x_{k-1}\), \(s_k=u_{k-1}\), which gives the new dynamics:

\[ x_{k+1} = f_k(x_k,y_k,u_k,s_k,w_k)\\ y_{k+1}=x_k\\ s_{k+1}=u_k \]

The new DP algorithm now has an optimal cost function \(J^*_k(x_k,y_k,s_k)\). Other examples of state augmentation include cases where forecasts of future uncertainty are relevant.
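
A minimal sketch of how this augmentation can be carried out in code; f_orig and the tuple layout are illustrative, not from the lecture.

  # State augmentation sketch: dynamics that depend on (x_k, x_{k-1}, u_k, u_{k-1})
  # are wrapped into standard form by carrying (x_k, y_k, s_k) = (x_k, x_{k-1}, u_{k-1}).

  def augmented_f(k, state, u, w, f_orig):
      """One step of the augmented system; f_orig(k, x, x_prev, u, u_prev, w) is the original dynamics."""
      x, y, s = state                       # y = previous state, s = previous control
      x_next = f_orig(k, x, y, u, s, w)
      return (x_next, x, u)                 # (x_{k+1}, y_{k+1} = x_k, s_{k+1} = u_k)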

Multi-agent problems

In this problem setting, there are multiple agents collecting and sharing information selectively with each other and with an environment; agent \(i\) applies decision \(u^i\) sequentially. There are two distinct problem structures:

  • Classical information pattern: Agents are fully cooperative, fully sharing and never forgetting information - this can be treated by DP.
  • Nonclassical information pattern: Agents are partially sharing the information, and may be antagonistic - this is hard to treat by DP. (Lecturer says this is extremely hard, and became more feasible recently with better computation power and function approximation methods etc.)

We will consider the classical information pattern for now. Here, the decision/control has \(m\) components corresponding to \(m\) agents. First, we need to deal with the exponential size of the search/control space, which can be done by a reformulation.

Toy example: 15 spiders on a grid, each of which can take 5 moves (move in one of the four directions or stay still), with the goal of catching 3 flies in minimum time. At each time step, there are \(\approx 5^{15}\) joint actions. We can reduce this to \(5\times 15\) without losing much - the idea is to break down the control so that one spider moves at a time.

The reformulation is:

  • Augment the state space: starting from \(x\), for each successive agent's control let \((x,u^1,\cdots,u^i)\) be an intermediate state. After all \(m\) agents choose a control we get the (random) transition \(\bar{x}=f(x,u,w)\) and (random) cost \(g(x,u,w)\). This gives us \(m-1\) additional cost functions \(J^1(x,u^1),\cdots,J^{m-1}(x,u^1,\cdots,u^{m-1})\). This allows for more efficient rollout (one agent at a time). Only one state and its successors are looked at per stage during online play. Most importantly, the one-step look-ahead branching factor is reduced from \(n^m\) to \(nm\), where \(n\) is the number of choices available to each agent and \(m\) is the number of agents.

Partial Observations/Partially Observable Markov Decision Processes (POMDPs)

Toy example - parking with a deadline: at each step the car moves backward or forward, and we need to decide whether to park at spot \(m\) for cost \(c(m)\) if it is available. If we do not park by time \(N\) there is a large cost \(C\). We only see whether the current spot is free or not, and parking spots may be taken/emptied randomly at each time step. How do we represent states? The belief state, i.e. the conditional probabilities of the free/taken status of all the parking spots, conditioned on all the observations so far.

For problems with partial observations, we reformulate the DP algorithm using belief states \(b_k\). Given an observation (i.e. a partial/noisy measurement of the state) \(z_{k+1}\), we generate a belief state \(b_{k+1} = F_k(b_k, u_k, z_{k+1})\) with some belief estimator \(F_k\); this stage incurs a cost \(\hat{g}_k(b_k,u_k)\). The belief state is fed to a controller, a new observation is received, and the loop continues. The new DP algorithm gives:

\[ J^*_k(b_k) = \min_{u_k\in U_k} \left[ \hat{g}_k(b_k,u_k) + \mathbb{ E }_{z_{k+1}}[J^*_{k+1}(F_k(b_k,u_k,z_{k+1}))\mid b_k,u_k] \right] \]
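
A minimal sketch of the belief estimator \(F_k\) via Bayes' rule, assuming finite state and observation spaces; the trans and obs callables are illustrative placeholders for the transition and observation models.

  # Belief-state update b_{k+1} = F(b_k, u_k, z_{k+1}) via Bayes' rule.
  # trans(x, u) -> {x': prob} and obs(x', u) -> {z: prob} are user-supplied models.

  def belief_update(b, u, z, trans, obs):
      """Return the new belief over states after applying u and observing z."""
      new_b = {}
      for x, p in b.items():
          for x_next, p_t in trans(x, u).items():
              new_b[x_next] = new_b.get(x_next, 0.0) + p * p_t * obs(x_next, u).get(z, 0.0)
      total = sum(new_b.values())
      return {x: p / total for x, p in new_b.items()}   # normalize (assumes z has positive prob)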

Adaptive Control

Deals with problems where model parameters change with time.

Example: cruise control, where the velocity evolves as \(x_{k+1}=ax_k+bu_k\), with friction coefficient \(a<1\) and gain \(b>0\) depending on the road, the passengers, etc. \(a,b,\bar{x}\) change all the time, and may be measured with error. Some possibilities:

  • Ignore parameter changes and design a robust controller that works for a broad range of parameters.
  • Try to estimate the parameters and use them to modify the controller, e.g. online replanning by optimization, or online replanning by rollout with a base policy, estimating the cost values using the current parameter estimates.
  • View the adaptive control problem as a POMDP and use approximate solution methods for them.

Robust control

Proportional-Integral-Derivative (PID) control does not require a model of the system. There is a cost to pay for not adaptively changing the parameters.

Online replanning by optimization (Indirect Adaptive Control)

Two phases - System identification and controller reoptimization. Can be time consuming.

Online replanning by rollout/approximation in value space

Use a robust policy as a base policy for rollout - no controller reoptimization is needed, which is faster. Then introduce the new parameter estimates in the lookahead minimization and the rollout. Continue to use the same robust/base policy, and possibly recalculate it in the background.

POMDP approach to adaptive control

This is a more principled approach. Suppose we have a deterministic system \(x_{k+1}=f_k(x_k,\theta_k,u_k)\) with \(\theta_k\) unknown, taking one of the values \(\theta^1,\cdots,\theta^m\). Introducing an augmented state \((x_k,\theta_k)\) which is partially observed, we can form a belief state \(b_{k,i} = P(\theta_k=\theta^i\mid I_k)\), with \(I_k=(x_0,\cdots,x_k,u_0,\cdots,u_{k-1})\) being the information vector. We then have the Bellman equation:

\[ J^*_k(I_k) = \min_{u_k}\sum_{i=1}^{m}b_{k,i}\left(g_k(x_k,\theta_k^i, u_k)+J^*_{k+1}(I_k, f_k(x_k,\theta^i_k, u_k),u_k)\right) \]

Here we can use a value space approximation \(\tilde{J}^i(f(x_k, \theta_k^i,u_k))\) that only depends on the new state. Minimizing over \(u_k\) gives a one-step look-ahead policy.

The POMDP approach also takes actions that improve future observations, instead of just estimating parameters and minimizing the cost.

Toy example: the Wordle game. \(\theta\) is the unknown mystery word. \(x_k\) is the list of possible mystery words given the guesses so far. \(u_k\) is the guess, chosen from the list of allowed guess words. Let \((x_k,\theta)\) be the augmented state. We can apply rollout with one of several base heuristics.

Model Predictive Control (MPC)

The core problems of control theory are regulation, i.e. keeping a system around a reference point, and path planning, i.e. following a given trajectory while respecting the state constraints.

This can be viewed as approximation in value space, where we minimize the cost function over the next \(l\) stages (imposing a finite horizon) while requiring \(x_{k+l}=0\) (a common control objective). After minimizing the cost function over the \(l\) stages, we apply the first control of the minimizing sequence and discard the others. This is rollout with the base heuristic being the minimizer that drives \(x_{k+l}\) to 0 in \(l\) steps. It is well suited for online replanning.
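
A minimal brute-force sketch of this receding-horizon loop; the dynamics f, stage cost g, finite candidate control set, and terminal penalty are illustrative placeholders (a real MPC implementation would solve the \(l\)-stage problem with an optimizer rather than by enumeration).

  # Receding-horizon MPC sketch: at each step, search over l-stage control sequences
  # from a finite candidate set, apply the first control of the best sequence, repeat.

  from itertools import product

  def mpc_step(x, f, g, candidate_controls, l, terminal_penalty):
      """Return the first control of the best l-stage sequence from state x."""
      best_u0, best_cost = None, float("inf")
      for seq in product(candidate_controls, repeat=l):
          cost, xk = 0.0, x
          for u in seq:
              cost += g(xk, u)
              xk = f(xk, u)
          cost += terminal_penalty(xk)      # e.g. a large penalty unless x_{k+l} is ~0
          if cost < best_cost:
              best_u0, best_cost = seq[0], cost
      return best_u0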

MPC with terminal cost approximation involves a non-negative terminal cost \(G_{k+l}\) in the \(l\) stage MPC problem instead of driving \(x_{k+l}\) to 0. This can be viewed as approximation in value space with multi-step look-ahead. Truncated rollout with some base policy can also be introduced. MPC policy is stable when \(G\) is in the region of stability - truncated rollouts help here.

MPC often involves state constraints where we must have \(x_k\in X\). There is no guarantee that this problem has a feasible solution for all initial states \(x_k\in X\) - this is particularly true for unstable systems.

Toy example of state constrained MPC: the dynamics are \(x_{k+1}=2x_k+u_k\) with \(|u_k|\leq 1\) and \(X=\{x: |x|\leq \beta\}\). If \(\beta > 1\), the state constraint cannot be satisfied for all \(x_0\in X\). The state constraint can only be satisfied starting within the set \(\hat{X}=\{x : |x|\leq 1\}\) - such sets are called invariant.

A set \(X\) is said to be invariant if for each \(x\in X\) there is a control \(u\in U\) such that \(f(x,u)\in X\). This raises the question: how do we compute an invariant subset of a given constraint set? This is necessarily an offline calculation and cannot be easily performed during online play. Given \(X\), there is a largest possible invariant set, which can be computed in the limit with a value-iteration-like algorithm.

Rollout for Deterministic DP

Finite horizon (with finitely many states) problems can be transformed to infinite horizon problems by taking as the new state space the union of the stage-wise state spaces \(X_1,\cdots,X_N\): writing these as a collection of columns, there is an edge from each state in column \(k\) to its possible successors in column \(k+1\), with the states in \(X_N\) going to a terminal state \(t\). This is an important conceptual idea.

Rollout is a special case of approximation in value space, where we find:

\[ \min_{u_k,\mu_{k+1},\cdots,\mu_{k+l-1}}\mathbb{ E }[g_k(x_k,u_k,w_k) + \sum_{i=k+1}^{k+l-1}g_i(x_i,\mu_i(x_i), w_i) + \tilde{J}_{k+l}(x_{k+l})] \]

Here, \(\tilde{J}_{k+l}(x_{k+l})\) is the cost function of some policy or heuristic. The policy used for rollout is called the base policy, and the policy obtained by look-ahead minimization is called the rollout policy.

Rollout is important because it provides useful options for cost function approximation in value space methods; it is also the basic building block of the fundamental policy iteration algorithm and its approximate variants.

Rollout is:

  • The RL method that is easiest to understand and apply.
  • It is by far the most reliable, and is very general (applies to deterministic/stochastic/finite and infinite horizon problems).
  • As a special case of approximation in value space it relates to Newton's method.
  • Deals well with online replanning and is a useful alternative to reoptimization in adaptive control.
  • Relates to MPC and can be used to improve the stability of MPC schemes.
  • Truncated rollout can be combined with many of the RL methods used in practice.

Recall deterministic rollout:

  • Given current state \(x_k\):
  • For each possible next control \(u_k\), generate a new Q-factor \(\tilde{Q}_k(x_k,u_k) = g_k(x_k,u_k) + H_{k+1}(f_k(x_k,u_k))\) using some base heuristic \(H_{k+1}\) for the cost starting at \(x_{k+1}\).
  • Choose the control \(u_k\) with the smallest Q-factor value.

Cost improvement in rollout is not automatic - special conditions must be met for the rollout policy to be no worse than the basic heuristic.

A base heuristic is sequentially consistent if at a given state it chooses a control that depends only on that state, i.e. a heuristic that generates \(\{x_k,\cdots,x_N\}\) starting from \(x_k\) will generate \(\{x_{k+1},\cdots,x_N\}\) when starting from \(x_{k+1}\). Greedy heuristics are sequentially consistent. We will focus on a less restrictive condition, sequential improvement: \(\min_{u_k\in U_k(x_k)}[g_k(x_k,u_k)+H_{k+1}(f_k(x_k,u_k))]\leq H_k(x_k)\) for all \(x_k\), which guarantees that the cost of the rollout policy is no worse than the cost of the base heuristic.

Simplified rollout: Assuming sequential improvement, choose any \(u_k\) that satisfies \(\tilde{Q}(x_k,u_k)=g_k(x_k,u_k)+H_{k+1}(f_k(x_k,u_k))\leq H_k(x_k)\). This saves computation.

Rollout with superheuristic: Use multiple heuristics and use the Q-factor of the best heuristic - requires all heuristics to be sequentially improving.

Fortified rollout: [Lecture 5 Slide 26/30] The idea is, at each step, to follow the best trajectory computed so far. At state \(x_k\), store both the permanent rollout trajectory \(\bar{P}_k =\{x_0,u_0,\cdots,u_{k-1},x_k\}\) and the tentative best trajectory \(\bar{T}_k = \bar{P}_k \cup \{\bar{u}_k,\bar{x}_{k+1},\cdots,\bar{u}_{N-1},\bar{x}_N\}\). Reject the minimum Q-factor choice \(\tilde{u}_k\) if its complete trajectory is more costly than the current tentative best; otherwise choose \(\tilde{u}_k\) and update \(\bar{T}_{k+1}\). (#DOUBT Why keep a tentative trajectory - is it there just for a reference?)

Model-free rollout: By model-free we mean we have no equations for the model. Here we do not know \(G\) and/or the constraint sets \(U_k\). Instead we have a base heuristic, and a human who can rank any 2 complete solutions without assigning values to them. Deterministic rollout can be used here.

Example: Rollout with an expert - can be applied to RNA folding (, a). Given a sequence of nucleotides (molecules of type A,C,G,U), fold it in an "interesting" way - we want to pair nucleotides that result in an interesting structure. Biologists do not agree on what the best folding is, and often imperfect software that compares complete folding is used. At each step, go one step forward from the current pairing, where at each nucleotide in the sequence we can (a) close the pairing (b) do nothing (c) open the pairing. There is a base heuristic which generates a complete folding from partial foldings. Two complete foldings can be compared by the expert software. There is no explicit cost function here.

(#TODO Start from Lecture 6 and continue)
