Policy Iteration

When using value iteration to solve an MDP, one can notice that the optimal action for each state often becomes fixed well before the value function itself converges.

Policy iteration therefore alternates between policy evaluation (computing the value function of the current policy) and policy improvement (updating the policy by a greedy one-step look-ahead on that value function, which has converged for the current policy but is not yet optimal), until the policy stops changing.
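
A minimal Python sketch of this loop, under assumed conventions not taken from this note: the model is given as P[s][a], a list of (probability, next_state, reward) triples, and gamma is the discount factor.

#+begin_src python
import numpy as np

def policy_iteration(P, n_states, n_actions, gamma=0.9, eval_tol=1e-8):
    # Arbitrary starting point: all-zero values, action 0 in every state.
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)

    while True:
        # Policy evaluation: sweep the Bellman expectation backup until
        # V approximates the value of the current policy.
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = sum(p * (r + gamma * V[s2])
                            for p, s2, r in P[s][policy[s]])
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < eval_tol:
                break

        # Policy improvement: act greedily (one-step look-ahead)
        # with respect to the evaluated value function.
        stable = True
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in range(n_actions)]
            best_a = int(np.argmax(q))
            if best_a != policy[s]:
                stable = False
            policy[s] = best_a

        if stable:  # policy unchanged, so it is optimal
            return policy, V
#+end_src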

The Policy Improvement Theorem provides the mathematical grounding for why this is not merely a heuristic.
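
In its standard form (the notation here, with action-value function \(Q^{\pi}\) and state-value function \(V^{\pi}\) for a deterministic policy \(\pi\), follows the usual textbook convention and is not defined elsewhere in this note), the theorem states that if a policy \(\pi'\) is at least as good as \(\pi\) under one-step look-ahead, then it is at least as good in every state:

\[
Q^{\pi}(s, \pi'(s)) \;\ge\; V^{\pi}(s) \quad \forall s
\;\;\Longrightarrow\;\;
V^{\pi'}(s) \;\ge\; V^{\pi}(s) \quad \forall s .
\]

The greedy policy produced by the improvement step satisfies the premise by construction, so each iteration can only keep the policy equally good or make it better.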

When the policy stops changing, the value function no longer changes between iterations either; at that point it satisfies the Bellman Optimality Equation, which means the value function, and therefore the converged policy, is optimal.
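
For reference, the Bellman optimality equation (written here with transition probabilities \(p(s', r \mid s, a)\) and discount factor \(\gamma\), which are assumed rather than defined in this note) reads:

\[
V^{*}(s) \;=\; \max_{a} \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, V^{*}(s') \,\bigr].
\]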

For the initial step, one can start with an arbitrary value function (e.g. zero for all states) and an arbitrary policy (e.g. a random action, or the same action, in every state), as in the sketch above.
