Partially Observable Markov Decision Processes (POMDPs)
A Partially Observable Markov Decision Process (POMDP) is a 7-tuple \((S, A, O, P_a, R_a, Z_a, \gamma)\), where
- \(S\) is the state space.
- \(A\) is the action space.
- \(O\) is the observation space.
- \(P_a(s,s') : S \times A \times S \rightarrow [0,1] = \mathbb{P}(S_{t+1}=s' \mid S_t=s, A_t=a)\) is the transition probability of moving to the next state \(s'\) from the current state \(s\) under action \(a\); it obeys the Markov property.
- \(R_a(s) : S \times A \rightarrow \mathbb{R} = \mathbb{E}[R_{t+1} \mid S_t=s, A_t=a]\) is the immediate (or expected immediate) reward for taking action \(a\) in state \(s\).
- \(Z_a(s',o) : S \times A \times O \rightarrow [0,1] = \mathbb{P}(O_{t+1}=o \mid S_{t+1}=s', A_t=a)\) is the observation model: the probability of receiving observation \(o\) after action \(a\) leads to the new state \(s'\).
- \(\gamma \in [0,1]\) is the discount factor for rewards.
A POMDP is thus an MDP whose state is hidden from the agent, or equivalently a Hidden Markov Model (HMM) augmented with actions and rewards.
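To make the tuple concrete, here is a minimal sketch in Python of a generic POMDP as a plain data structure, with nested dictionaries standing in for \(P_a\), \(R_a\), and \(Z_a\). The class, field names, and the toy numbers at the end are illustrative only and do not come from any existing library.

```python
import random

class POMDP:
    """Minimal POMDP container mirroring the tuple (S, A, O, P_a, R_a, Z_a, gamma)."""

    def __init__(self, states, actions, observations, transition, reward, observation, gamma):
        self.states = states              # S
        self.actions = actions            # A
        self.observations = observations  # O
        self.transition = transition      # transition[s][a][s'] = P(s' | s, a)
        self.reward = reward              # reward[s][a] = E[R | s, a]
        self.observation = observation    # observation[a][s'][o] = P(o | s', a)
        self.gamma = gamma                # discount factor

    def step(self, s, a):
        """Sample (s', o, r): next state, observation, and reward for action a in state s."""
        next_states = list(self.transition[s][a].keys())
        s_next = random.choices(next_states, weights=list(self.transition[s][a].values()))[0]
        obs = list(self.observation[a][s_next].keys())
        o = random.choices(obs, weights=list(self.observation[a][s_next].values()))[0]
        r = self.reward[s][a]
        return s_next, o, r

# Toy example (made-up numbers): two states, one action, two observations.
toy = POMDP(
    states=["s0", "s1"],
    actions=["stay"],
    observations=["low", "high"],
    transition={"s0": {"stay": {"s0": 0.9, "s1": 0.1}},
                "s1": {"stay": {"s1": 1.0}}},
    reward={"s0": {"stay": 0.0}, "s1": {"stay": 1.0}},
    observation={"stay": {"s0": {"low": 0.8, "high": 0.2},
                          "s1": {"low": 0.3, "high": 0.7}}},
    gamma=0.95,
)
s_next, o, r = toy.step("s0", "stay")
```

Note that `step` returns the sampled next state only for bookkeeping; an agent interacting with the POMDP would be shown just the observation \(o\) and the reward \(r\), since the state itself is hidden.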