Markov Decision Processes (MDPs)
A Markov Decision Process (MDP) is a 5-tuple \((S,A,P_a,R_a,\gamma)\) where
- \(S\) is the state space.
- \(A\) is the action space.
- \(P_a(s,s') = \mathbb{P}(S_{t+1}=s'|S_t=s,A_t=a)\), where \(P: S \times A \times S \rightarrow [0,1]\), is the probability of transitioning to the next state \(s'\) given the current state \(s\) under action \(a\); it depends only on the current state and action, i.e. it obeys the Markov property.
- \(R_a(s) = \mathbb{E}[R_{t+1}|S_t=s,A_t=a]\), where \(R: S \times A \rightarrow \mathbb{R}\), is the expected immediate reward received after taking action \(a\) in the current state \(s\).
- \(\gamma \in [0,1]\) is the discount factor for rewards.
An MDP is finite if the state, action and reward spaces are finite.
The above definition is from David Silver's UCL RL course.
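As a concrete illustration of this definition, here is a minimal sketch, assuming a made-up two-state, two-action finite MDP, that stores \(P_a\) and \(R_a\) as tables keyed by state-action pairs (all names and numbers below are hypothetical):

```python
# A minimal sketch of a finite MDP as plain data structures.
# All states, actions, probabilities and rewards are made up for illustration.

states = ["s0", "s1"]
actions = ["left", "right"]
gamma = 0.9  # discount factor

# P[(s, a)] maps each next state s' to P_a(s, s') = P(S_{t+1}=s' | S_t=s, A_t=a).
P = {
    ("s0", "left"):  {"s0": 0.8, "s1": 0.2},
    ("s0", "right"): {"s0": 0.1, "s1": 0.9},
    ("s1", "left"):  {"s0": 0.7, "s1": 0.3},
    ("s1", "right"): {"s0": 0.0, "s1": 1.0},
}

# R[(s, a)] is the expected immediate reward R_a(s) for taking action a in state s.
R = {
    ("s0", "left"): 0.0, ("s0", "right"): 1.0,
    ("s1", "left"): 2.0, ("s1", "right"): 0.0,
}

# Sanity check: the transition probabilities out of every (s, a) pair sum to 1.
for (s, a), row in P.items():
    assert abs(sum(row.values()) - 1.0) < 1e-9, (s, a)
```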
In Sutton & Barto (p. 48), the MDP is imagined as a feedback loop between the agent and environment, where at time \(t\) the agent receives a state \(s_t\) and reward \(r_t\) from the environment, then generates action \(a_t\). Namely, the MDP and agent give rise to a trajectory \(s_0, a_0, r_1, s_1, a_1, r_2, s_2, \cdots\). The dynamics are defined as the distribution \(\mathbb{P}(S_{t}, R_{t}|S_{t-1}=s,A_{t-1}=a)\) over next states and rewards given the current state and action taken, from which one derives the state-transition probabilities and the expected rewards for state-action pairs and state-action-next-state triples via the product rule and marginalization.
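To make the product rule and marginalization concrete, the sketch below starts from a made-up four-argument dynamics table \(p(s', r|s, a)\) and recovers the state-transition probabilities \(p(s'|s,a) = \sum_r p(s',r|s,a)\) and the expected rewards \(r(s,a) = \sum_{s',r} r\, p(s',r|s,a)\):

```python
# Sketch: derive p(s'|s,a) and r(s,a) from the dynamics p(s', r | s, a).
# The dynamics table below is made up for illustration.
from collections import defaultdict

# dynamics[(s, a)] maps each pair (s', r) to p(s', r | s, a).
dynamics = {
    ("s0", "right"): {("s1", 1.0): 0.9, ("s0", 0.0): 0.1},
    ("s0", "left"):  {("s0", 0.0): 0.8, ("s1", 0.5): 0.2},
}

transition = defaultdict(dict)        # transition[(s, a)][s'] = p(s' | s, a)
expected_reward = defaultdict(float)  # expected_reward[(s, a)] = r(s, a)

for (s, a), table in dynamics.items():
    for (s_next, r), prob in table.items():
        # Marginalize out the reward to get the state-transition probability.
        transition[(s, a)][s_next] = transition[(s, a)].get(s_next, 0.0) + prob
        # Accumulate r * p(s', r | s, a) to get the expected immediate reward.
        expected_reward[(s, a)] += r * prob

print(transition[("s0", "right")])       # {'s1': 0.9, 's0': 0.1}
print(expected_reward[("s0", "right")])  # 0.9
```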
An MDP is considered solved if we have an optimal policy \(\pi^*(a|s) = \mathbb{P}(a_t=a|s_t=s)\) that maximizes the expected discounted return \(\mathbb{E}\left[\sum_{t=0}^\infty \gamma^t r_{t+1}\right]\), where \(r_t\) is the reward received after the action taken at time \(t-1\), i.e. \(a_{t-1}\); this indexing convention is from Sutton & Barto (p. 48).
For any finite MDP, there always exists a deterministic optimal policy.
Denote by \(G_t\) the return, i.e. the rewards accumulated from time \(t\): \(G_t = \sum_{k=0}^H \gamma^k R_{t+k+1}\), where \(H\) is the horizon, i.e. the number of steps, which could be infinite. Note that \(G_t = R_{t+1} + \gamma G_{t+1}\). For finite \(H\), the agent-environment interaction breaks into subsequences called episodes, e.g. trips through a maze; \(H\) itself could be a random variable.
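As a quick sanity check on the recursion \(G_t = R_{t+1} + \gamma G_{t+1}\), the sketch below (with a made-up reward sequence) computes the returns of a finite episode backward and compares them to the direct discounted sums:

```python
# Sketch: compute returns G_t backward over a finite episode and check the
# recursion G_t = R_{t+1} + gamma * G_{t+1}.  Rewards are made up for illustration.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]  # R_1, R_2, R_3, R_4 for an episode with H = 4

# Backward recursion: G_H = 0, then G_t = R_{t+1} + gamma * G_{t+1}.
H = len(rewards)
G = [0.0] * (H + 1)
for t in reversed(range(H)):
    G[t] = rewards[t] + gamma * G[t + 1]

# Direct definition: G_t = sum_k gamma^k R_{t+k+1}.
for t in range(H):
    direct = sum(gamma**k * rewards[t + k] for k in range(H - t))
    assert abs(G[t] - direct) < 1e-12

print(G[0])  # the return from the start of the episode: 3.349
```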
There are two important value functions for MDPs, defined below.
- State-value function: The expected cumulative reward when following policy \(\pi\) starting from state \(s\), \(V_\pi(s)=\mathbb{E}_{\pi}[G_t|S_t=s]\).
- Action-value function (Q-function): The expected cumulative reward when starting from state \(s\), taking action \(a\), and thereafter following policy \(\pi\), \(Q_\pi(s,a)=\mathbb{E}_\pi[G_t|S_t=s,A_t=a]\).
Importantly, we can write each in terms of the other.
\[
\begin{aligned}
V_\pi(s) = \mathbb{E}_\pi[G_t|S_t=s] &= \sum_g g\,p(g|s) = \sum_g \sum_a g\,p(g,a|s) = \sum_g \sum_a g\,p(g|s,a)\,p(a|s) \\
&= \sum_a p(a|s) \sum_g g\,p(g|s,a) = \sum_a \pi(a|s)\,Q_\pi(s,a)
\end{aligned}
\]
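A minimal numerical sketch of this identity, with a made-up policy and made-up Q-values for a single fixed state \(s\):

```python
# Sketch: V_pi(s) as the policy-weighted average of Q_pi(s, a) over actions.
# The policy and Q-values below are made up for illustration.
policy = {"left": 0.4, "right": 0.6}  # pi(a | s) for a fixed state s
Q = {"left": 1.0, "right": 3.0}       # Q_pi(s, a) for the same state s

# V_pi(s) = sum_a pi(a|s) * Q_pi(s, a)
V = sum(policy[a] * Q[a] for a in policy)
print(V)  # 0.4 * 1.0 + 0.6 * 3.0 = 2.2
```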
The optimal state-value function \(V^*(s)\) and optimal action-value function \(Q^*(s,a)\) are the maxima of all such functions over the policy space. All optimal policies achieve the optimal state-value and action-value functions.
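In symbols, \(V^*(s) = \max_\pi V_\pi(s)\) and \(Q^*(s,a) = \max_\pi Q_\pi(s,a)\); in particular, \(V^*(s) = \max_a Q^*(s,a)\).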
One key tool for solving MDPs is the Bellman equations, which express the state/action value functions and the optimal state/action value functions in terms of themselves.
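For instance, combining the definitions above, the Bellman expectation equation for the state-value function takes the form
\[ V_\pi(s) = \sum_a \pi(a|s)\left( R_a(s) + \gamma \sum_{s'} P_a(s,s')\, V_\pi(s') \right), \]
and analogous equations hold for \(Q_\pi\) and for the optimal value functions.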