Bellman Equations
Recall the intuition that if we denote \(G_t\) to be the return, i.e. the (possibly discounted) sum of future rewards, then we have the recursive relation \(G_t = R_{t+1} + \gamma G_{t+1}\).
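To see this, expand the return (here in the convention of cite:sutton-2018-reinf, where \(R_{t+1}\) is the reward that follows the action taken at time \(t\)):
\[
G_t = \sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1} = R_{t+1} + \gamma\sum_{k=0}^{\infty}\gamma^{k}R_{t+k+2} = R_{t+1} + \gamma G_{t+1}.
\]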
For Markov Decision Processes (MDPs), the Bellman equations are as follows (written out explicitly after the list).
- Bellman equation for state-value function (Equation 3.14, cite:sutton-2018-reinf)
- Bellman equation for action-value function
- Bellman optimality equation for state-value function (the Bellman equation specialized to the optimal value function, written without reference to any particular policy \(\pi\); p. 63, cite:sutton-2018-reinf)
- Bellman optimality equation for action-value function
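Written out explicitly in the notation of cite:sutton-2018-reinf (assuming a finite MDP with four-argument dynamics \(p(s',r\mid s,a)\), policy \(\pi(a\mid s)\), and discount factor \(\gamma\)), these are:
\begin{align*}
v_\pi(s) &= \sum_{a}\pi(a\mid s)\sum_{s',r}p(s',r\mid s,a)\bigl[r + \gamma\, v_\pi(s')\bigr],\\
q_\pi(s,a) &= \sum_{s',r}p(s',r\mid s,a)\Bigl[r + \gamma\sum_{a'}\pi(a'\mid s')\,q_\pi(s',a')\Bigr],\\
v_*(s) &= \max_{a}\sum_{s',r}p(s',r\mid s,a)\bigl[r + \gamma\, v_*(s')\bigr],\\
q_*(s,a) &= \sum_{s',r}p(s',r\mid s,a)\Bigl[r + \gamma\max_{a'}q_*(s',a')\Bigr].
\end{align*}
For a fixed policy the first two equations are linear in the values and can be solved exactly for small state spaces; the two optimality equations are nonlinear because of the \(\max\).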
They can be derived via:
- Backward reasoning for deterministic dynamics, as was done in Topics in Reinforcement Learning (Arizona State University CSE691).
- Computing \(\mathbb{E}_{A_t,R_{t+1},S_{t+1},A_{t+1},\dots}[G_{t+1}\mid S_t=s]\) and using the Markov property, where \(G_{t+1}=\sum_{i=t+1}^{\infty}\gamma^{i-(t+1)}R_{i+1}\) (see the sketch after this list).
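A minimal sketch of the second route for the state-value case, using the law of total expectation and the notation assumed above:
\begin{align*}
v_\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t = s]
          = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]\\
         &= \sum_{a}\pi(a\mid s)\sum_{s',r}p(s',r\mid s,a)\Bigl[r + \gamma\,\mathbb{E}_\pi[G_{t+1}\mid S_{t+1}=s']\Bigr]
          = \sum_{a}\pi(a\mid s)\sum_{s',r}p(s',r\mid s,a)\bigl[r + \gamma\, v_\pi(s')\bigr],
\end{align*}
where the Markov property justifies \(\mathbb{E}_\pi[G_{t+1}\mid S_t=s, A_t=a, R_{t+1}=r, S_{t+1}=s'] = \mathbb{E}_\pi[G_{t+1}\mid S_{t+1}=s'] = v_\pi(s')\). The other three equations follow the same pattern.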