Markov Decision Processes (MDPs)
A Markov Decision Process (MDP) is a 5-tuple \((S,A,P_a,R_a,\gamma)\) where
- \(S\) is the state space.
- \(A\) is the action space.
- \(P_a(s,s') = \mathbb{P}(S_{t+1}=s' \mid S_t=s, A_t=a)\), with \(P: S \times A \times S \rightarrow [0,1]\), is the probability of transitioning to the next state \(s'\) from the current state \(s\) under action \(a\); the dependence on only the current state and action is the Markov property.
- \(R_a(s) = \mathbb{E}[R_{t+1} \mid S_t=s, A_t=a]\), with \(R: S \times A \rightarrow \mathbb{R}\), is the expected immediate reward for taking action \(a\) in the current state \(s\). More generally, the reward may also depend on the next state \(S_{t+1}\).
- \(\gamma \in [0,1]\) is the discount factor for rewards.
An MDP is finite if the state, action and reward spaces are finite.
The above definition is from David Silver's UCL RL course.
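To make the definition concrete, here is a minimal sketch of a finite MDP as plain Python dictionaries; the state and action names, probabilities, and rewards below are made up purely for illustration.

```python
# A tiny finite MDP: two states, two actions, tabular dynamics and rewards.
S = ["s0", "s1"]          # state space
A = ["stay", "move"]      # action space
gamma = 0.9               # discount factor

# P[(s, a)][s2] = probability of transitioning to s2 from s under action a
P = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# R[(s, a)] = expected immediate reward for taking action a in state s
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): 1.0,
    ("s1", "stay"): 2.0,
    ("s1", "move"): 0.5,
}
```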
In Sutton & Barto (p. 48), the MDP is pictured as a feedback loop between the agent and the environment: at time \(t\) the agent receives a state \(s_t\) and reward \(r_t\) from the environment, then generates an action \(a_t\). The MDP and agent thus give rise to a trajectory \(s_0, a_0, r_1, s_1, a_1, r_2, s_2, \cdots\). The dynamics are defined as the distribution \(p(s', r \mid s, a) = \mathbb{P}(S_{t}=s', R_{t}=r \mid S_{t-1}=s, A_{t-1}=a)\) over next states and rewards given the current state and the action taken, from which one derives the state-transition probabilities and the expected rewards for state-action pairs and state-action-next-state triples via the product rule and marginalization.
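For reference, these derived quantities follow directly from the dynamics by marginalization and the product rule (writing \(r(s,a,s')\) for the expected reward of a state-action-next-state triple, a notation introduced here):

\[ P_a(s,s') = \sum_r p(s', r \mid s, a), \qquad R_a(s) = \sum_{s'} \sum_r r\, p(s', r \mid s, a), \qquad r(s,a,s') = \frac{\sum_r r\, p(s', r \mid s, a)}{P_a(s,s')}. \]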
An MDP is considered solved if we have an optimal policy \(\pi^*(a \mid s) = \mathbb{P}(a_t=a \mid s_t=s)\) that maximizes the expected discounted sum of future rewards \(G = \sum_{t=0}^\infty \gamma^t r_{t+1}\), where \(r_t\) is the reward received after the action taken at time \(t-1\), i.e. \(a_{t-1}\); this indexing convention follows Sutton & Barto (p. 48).
For any MDP, there always exists a deterministic optimal policy.
Denote by \(G_t\) the reward accumulated from time \(t\) onward, i.e. \(G_t = \sum_{k=0}^H \gamma^k R_{t+k+1}\), where \(H\) is the horizon, i.e. the number of remaining steps, which could be infinite. Note the recursion \(G_t = R_{t+1} + \gamma G_{t+1}\). For finite \(H\), the agent-environment interaction breaks into subsequences called episodes, e.g. trips through a maze; \(H\) itself may be a random variable.
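As a small sketch (my own illustration, not from either source), the return and its recursion for a finite episode can be computed from the observed rewards \(r_1, \dots, r_H\) as follows:

```python
def discounted_return(rewards, gamma):
    """Compute G_0 = sum_k gamma^k * r_{k+1} over a finite episode.

    `rewards` is the list [r_1, r_2, ..., r_H] observed after each action.
    Works backwards from the end of the episode (where G = 0) using the
    recursion G_t = r_{t+1} + gamma * G_{t+1}.
    """
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# Example: a three-step episode with gamma = 0.9
print(discounted_return([1.0, 0.0, 2.0], 0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```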
There are two important value functions for MDPs, defined below.
- State-value function: the expected return when following policy \(\pi\) from state \(s\), \(V_\pi(s)=\mathbb{E}_{\pi}[G_t \mid S_t=s]\). Intuition: the reward function maps a state-action pair to a reward value. Suppose we are at state \(s\), so we will take an action \(a\sim \pi(a\mid s)\). If for now the policy is deterministic, the immediate reward is \(R(s, \pi(s))\), and \(V_\pi(s)\) adds up these immediate rewards, discounted by \(\gamma\), along the trajectory the policy induces.
- Action-value function (Q-function): the expected return when following policy \(\pi\) from state \(s\) after taking action \(a\), \(Q_\pi(s,a)=\mathbb{E}_\pi[G_t \mid S_t=s, A_t=a]\). (A Monte Carlo sketch of estimating both quantities is given after this list.)
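The following is a minimal Monte Carlo sketch of these definitions (my own illustration, not from the sources), assuming the tabular `S`, `A`, `P`, `R`, `gamma` from the earlier snippet and a stochastic policy stored as `policy[s] = {action: probability}`; the finite `horizon` truncation approximates the infinite sum when \(\gamma < 1\).

```python
import random

def rollout_return(s, policy, P, R, gamma, horizon=100, first_action=None):
    """Sample one trajectory starting from s under `policy` and return its
    (truncated) discounted return G. Forcing `first_action` turns this into
    a sample for Q_pi(s, a) instead of V_pi(s)."""
    G, discount = 0.0, 1.0
    for t in range(horizon):
        if t == 0 and first_action is not None:
            a = first_action
        else:
            actions, probs = zip(*policy[s].items())
            a = random.choices(actions, weights=probs)[0]
        G += discount * R[(s, a)]            # expected immediate reward
        discount *= gamma
        next_states, p = zip(*P[(s, a)].items())
        s = random.choices(next_states, weights=p)[0]
    return G

def mc_value(s, policy, P, R, gamma, n=10_000, first_action=None):
    """Monte Carlo estimate of V_pi(s), or Q_pi(s, a) if first_action is set."""
    return sum(rollout_return(s, policy, P, R, gamma, first_action=first_action)
               for _ in range(n)) / n
```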
Importantly, we can write each in terms of the other.
\[ V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t=s] = \sum_g g\, p(g \mid s) = \sum_g \sum_a g\, p(g, a \mid s) \] \[ = \sum_g \sum_a g\, p(g \mid s, a)\, p(a \mid s) = \sum_a p(a \mid s) \sum_g g\, p(g \mid s, a) = \sum_a \pi(a \mid s)\, Q_\pi(s,a) \]
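Going the other way (a standard identity, included here for completeness): conditioning on the next state and using the Markov property,

\[ Q_\pi(s,a) = R_a(s) + \gamma \sum_{s'} P_a(s,s')\, V_\pi(s'). \]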
The optimal state-value function \(V^*(s)\) and the optimal action-value function \(Q^*(s,a)\) are the pointwise maxima of these functions over the policy space. All optimal policies achieve the optimal state-value and action-value functions.
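For completeness (standard identities, not spelled out in the text above), the optimal value functions are related by

\[ V^*(s) = \max_a Q^*(s,a), \qquad \pi^*(s) = \arg\max_a Q^*(s,a), \]

where the greedy policy \(\pi^*\) on the right is one of the deterministic optimal policies mentioned earlier.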
One key tool for solving MDPs is the set of Bellman equations, which express the state/action value functions and the optimal state/action value functions recursively in terms of themselves.
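As one illustration of how the Bellman optimality equation is used, here is a minimal value-iteration sketch (my own sketch, not an algorithm discussed above) over the tabular `S`, `A`, `P`, `R`, `gamma` representation from the first snippet; the tolerance `tol` is an arbitrary choice.

```python
def q_value(s, a, V, P, R, gamma):
    """One-step lookahead: R(s,a) + gamma * sum_{s'} P(s'|s,a) * V(s')."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())

def value_iteration(S, A, P, R, gamma, tol=1e-8):
    """Repeatedly apply the Bellman optimality backup
    V(s) <- max_a [R(s,a) + gamma * sum_{s'} P(s'|s,a) V(s')]
    until the values stop changing, then read off a greedy policy."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            v_new = max(q_value(s, a, V, P, R, gamma) for a in A)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # A greedy policy w.r.t. the converged values is deterministic and optimal.
    pi = {s: max(A, key=lambda a: q_value(s, a, V, P, R, gamma)) for s in S}
    return V, pi

# Using the tabular MDP from the first snippet:
# V_star, pi_star = value_iteration(S, A, P, R, gamma)
```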