# Markov Decision Processes (MDPs)

A Markov Decision Process (MDP) is a 5-tuple \((S, A, P_a, R_a, \gamma)\) where

- \(S\) is the **state space**.
- \(A\) is the **action space**.
- \(P_a(s, s') : S \times A \times S \rightarrow [0,1]\), with \(P_a(s, s') = \mathbb{P}(S_{t+1}=s' \mid S_t=s, A_t=a)\), is the **transition probability** of the next state \(s'\) given the current state \(s\) under action \(a\); it obeys the Markov property.
- \(R_a(s) : S \times A \rightarrow \mathbb{R}\), with \(R_a(s) = \mathbb{E}[R_{t+1} \mid S_t=s, A_t=a]\), is the (expected) immediate **reward** for taking action \(a\) in state \(s\).
- \(\gamma \in [0,1]\) is the **discount factor** for future rewards.

An MDP is finite if the state, action and reward spaces are finite.
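A finite MDP can be written down directly as tables. Below is a minimal sketch of the 5-tuple above as plain Python data structures, using a hypothetical two-state, two-action example (all numbers are illustrative, not from the text):

```python
# A hypothetical finite MDP (S, A, P_a, R_a, gamma) as plain Python tables.
S = ["s0", "s1"]
A = ["stay", "move"]
gamma = 0.9

# P[a][s][s'] = P(S_{t+1} = s' | S_t = s, A_t = a); each row is a distribution.
P = {
    "stay": {"s0": {"s0": 1.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 1.0}},
    "move": {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
}

# R[a][s] = E[R_{t+1} | S_t = s, A_t = a]
R = {
    "stay": {"s0": 0.0, "s1": 1.0},
    "move": {"s0": 0.1, "s1": 0.0},
}

# Sanity check: every transition row sums to 1, as a probability must.
for a in A:
    for s in S:
        assert abs(sum(P[a][s].values()) - 1.0) < 1e-9
```

Representing \(P_a\) and \(R_a\) as nested dicts keyed by action then state makes the finiteness of the MDP explicit: solving it reduces to arithmetic over these tables.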

The above definition is from David Silver's UCL RL course.

In Sutton & Barto (p. 48), the MDP is pictured as a feedback loop between the agent and the environment: at time \(t\) the agent receives a state \(s_t\) and reward \(r_t\) from the environment, then generates an action \(a_t\). The MDP and agent thus give rise to a **trajectory** \(s_0, a_0, r_1, s_1, a_1, r_2, s_2, \cdots\). The **dynamics** are defined as the distribution \(\mathbb{P}(S_t, R_t \mid S_{t-1}=s, A_{t-1}=a)\) over states and rewards given the previous state and action; from it one derives the state-transition probabilities and the expected rewards for state-action pairs and state-action-next-state triples via the product rule and marginalization.

An MDP is considered solved if we have an **optimal policy** \(\pi^*(a \mid s) = \mathbb{P}(a_t=a \mid s_t=s)\) that maximizes the expected sum of discounted future rewards \(G = \sum_{t=0}^\infty \gamma^t r_{t+1}\), where \(r_t\) is the reward received **after the action taken at time** \(t-1\), i.e. \(a_{t-1}\); this indexing convention is from Sutton & Barto (p. 48). For any MDP, there always exists a deterministic optimal policy.

Denote by \(G_t\) the return accumulated from time \(t\), i.e. \(G_t = \sum_{k=0}^H \gamma^k R_{t+k+1}\), where \(H\) is the horizon (the number of steps), which could be infinite. Note the recursion \(G_t = R_{t+1} + \gamma G_{t+1}\). For finite \(H\), the agent-environment interaction breaks into subsequences called **episodes**, e.g. trips through a maze; \(H\) itself could be a random variable. There are two important value functions for MDPs, defined below.
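The return and its recursion can be sketched numerically. The reward sequence below is hypothetical; the code computes \(G_t\) for a finite horizon and checks that \(G_t = R_{t+1} + \gamma G_{t+1}\) holds:

```python
# Sketch: the discounted return G_t = sum_k gamma^k R_{t+k+1} over a finite
# horizon, plus a check of the recursion G_t = R_{t+1} + gamma * G_{t+1}.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]  # r_{t+1}, r_{t+2}, ... (hypothetical episode)

def discounted_return(rewards, gamma):
    """Return sum over k of gamma**k * rewards[k]."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += gamma**k * r
    return g

G_t = discounted_return(rewards, gamma)        # return from time t
G_t1 = discounted_return(rewards[1:], gamma)   # return from time t+1
# The recursion: G_t = R_{t+1} + gamma * G_{t+1}
assert abs(G_t - (rewards[0] + gamma * G_t1)) < 1e-12
```

The recursion is what makes dynamic-programming solutions possible: the return from \(t\) depends on the future only through the return from \(t+1\).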

**State-value function**: the expected return under policy \(\pi\) starting from state \(s\), \(V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t=s]\).

**Action-value function** (Q-function): the expected return under policy \(\pi\) starting from state \(s\) and taking action \(a\), \(Q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t=s, A_t=a]\).

Importantly, we can write each in terms of the other.

\[ V_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t=s] = \sum_g g\, p(g \mid s) = \sum_g \sum_a g\, p(g, a \mid s) = \sum_g \sum_a g\, p(g \mid s, a)\, p(a \mid s) = \sum_a p(a \mid s) \sum_g g\, p(g \mid s, a) = \sum_a \pi(a \mid s)\, Q_\pi(s, a) \]
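The final identity \(V_\pi(s) = \sum_a \pi(a \mid s)\, Q_\pi(s,a)\) is just a probability-weighted average, which a one-liner makes concrete. The policy probabilities and Q-values below are hypothetical:

```python
# Sketch of V_pi(s) = sum_a pi(a|s) * Q_pi(s, a) for a single state s,
# with hypothetical policy probabilities and Q-values.
pi_s = {"stay": 0.3, "move": 0.7}   # pi(a|s): must sum to 1
Q_s = {"stay": 1.0, "move": 2.0}    # Q_pi(s, a)

# The state value is the expectation of Q over the policy's action choice.
V_s = sum(pi_s[a] * Q_s[a] for a in pi_s)
assert abs(V_s - (0.3 * 1.0 + 0.7 * 2.0)) < 1e-12  # 1.7
```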

The optimal state-value function \(V^*(s)\) and optimal action-value function \(Q^*(s,a)\) are the maxima of all such functions over the policy space. All optimal policies achieve the optimal state-value and action-value functions.

One key tool for solving MDPs is the set of Bellman equations, which express the state/action value functions and the optimal state/action value functions in terms of themselves.
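The self-referential character of these equations already suggests a solution method: treat the Bellman expectation equation \(V_\pi(s) = \sum_a \pi(a \mid s)\,[\,R_a(s) + \gamma \sum_{s'} P_a(s,s')\,V_\pi(s')\,]\) as a fixed-point update and iterate. Below is a minimal sketch of this iterative policy evaluation on a hypothetical two-state MDP under a uniform random policy (all numbers are illustrative):

```python
# Sketch: iterative policy evaluation via the Bellman expectation equation
#   V(s) <- sum_a pi(a|s) * ( R_a(s) + gamma * sum_s' P_a(s, s') * V(s') )
# on a hypothetical 2-state, 2-action MDP.
gamma = 0.9
S = ["s0", "s1"]
A = ["stay", "move"]
P = {
    "stay": {"s0": {"s0": 1.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 1.0}},
    "move": {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.8, "s1": 0.2}},
}
R = {"stay": {"s0": 0.0, "s1": 1.0}, "move": {"s0": 0.1, "s1": 0.0}}
pi = {s: {"stay": 0.5, "move": 0.5} for s in S}  # uniform random policy

V = {s: 0.0 for s in S}
for _ in range(1000):  # fixed-point iteration; converges since gamma < 1
    V = {
        s: sum(
            pi[s][a] * (R[a][s] + gamma * sum(P[a][s][sp] * V[sp] for sp in S))
            for a in A
        )
        for s in S
    }

# At the fixed point, one more Bellman backup leaves V (essentially) unchanged.
V_check = {
    s: sum(
        pi[s][a] * (R[a][s] + gamma * sum(P[a][s][sp] * V[sp] for sp in S))
        for a in A
    )
    for s in S
}
assert all(abs(V_check[s] - V[s]) < 1e-6 for s in S)
```

The iteration converges because the Bellman backup is a \(\gamma\)-contraction; replacing the policy average with a max over actions would give the analogous update for \(V^*\).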