Bellman Equations
Recall the intuition that if we denote \(G_t\) to be the return, i.e. the (possibly discounted) sum of future rewards, then we have the recursive relation \(G_t = R_{t+1} + \gamma G_{t+1}\).
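To see this, expand the return (here in the convention of cite:sutton-2018-reinf, where \(R_{t+1}\) is the reward that follows the action taken at time \(t\)):
\[
G_t = \sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1} = R_{t+1} + \gamma\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+2} = R_{t+1} + \gamma G_{t+1}.
\]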
For Markov Decision Processes (MDPs), the Bellman equations are as follows (written out explicitly after the list).
- Bellman equation for state-value function (Equation 3.14, cite:sutton-2018-reinf)
- Bellman equation for action-value function
- Bellman optimality equation for state-value function (the Bellman equation specialized to the optimal value function, written without reference to any particular policy \(\pi\); p. 63, cite:sutton-2018-reinf)
- Bellman optimality equation for action-value function
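Written out explicitly in the notation of cite:sutton-2018-reinf (assuming a finite MDP with four-argument dynamics \(p(s',r\mid s,a)\), policy \(\pi(a\mid s)\), and discount factor \(\gamma\)), these are:
\begin{align*}
v_\pi(s) &= \sum_{a}\pi(a\mid s)\sum_{s',r}p(s',r\mid s,a)\bigl[r + \gamma\, v_\pi(s')\bigr],\\
q_\pi(s,a) &= \sum_{s',r}p(s',r\mid s,a)\Bigl[r + \gamma\sum_{a'}\pi(a'\mid s')\,q_\pi(s',a')\Bigr],\\
v_*(s) &= \max_{a}\sum_{s',r}p(s',r\mid s,a)\bigl[r + \gamma\, v_*(s')\bigr],\\
q_*(s,a) &= \sum_{s',r}p(s',r\mid s,a)\Bigl[r + \gamma\max_{a'}q_*(s',a')\Bigr].
\end{align*}
For a fixed policy the first two equations are linear in the values and can be solved exactly for small state spaces; the two optimality equations are nonlinear because of the \(\max\).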
They can be derived via:
- Backward reasoning for deterministic dynamics, as was done in Topics in Reinforcement Learning (Arizona State University CSE691).
- Computing \(\mathbb{E}_{A_t,R_{t+1},S_{t+1},A_{t+1},\dots}[G_{t+1}\mid S_t=s]\) and using the Markov property, where \(G_{t+1}=\sum_{i=t+1}^{\infty}\gamma^{i-(t+1)}R_{i+1}\) (see the sketch after this list).
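A minimal sketch of the second route for the state-value case, using the law of total expectation and the notation assumed above:
\begin{align*}
v_\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t = s]
          = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]\\
         &= \sum_{a}\pi(a\mid s)\sum_{s',r}p(s',r\mid s,a)\Bigl[r + \gamma\,\mathbb{E}_\pi[G_{t+1}\mid S_{t+1}=s']\Bigr]
          = \sum_{a}\pi(a\mid s)\sum_{s',r}p(s',r\mid s,a)\bigl[r + \gamma\, v_\pi(s')\bigr],
\end{align*}
where the Markov property justifies \(\mathbb{E}_\pi[G_{t+1}\mid S_t=s, A_t=a, R_{t+1}=r, S_{t+1}=s'] = \mathbb{E}_\pi[G_{t+1}\mid S_{t+1}=s'] = v_\pi(s')\). The other three equations follow the same pattern.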