SARSA
Applies Temporal-Difference (TD) learning to the Q-function, with ε-greedy exploration for policy improvement. At each time step, update the Q-function by: \(Q(S_t, A_t) \leftarrow Q(S_t,A_t) + \alpha(R_t + \gamma Q(S_{t+1},A_{t+1}) - Q(S_t,A_t))\)
SARSA is an on-policy method.
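A minimal sketch of tabular SARSA with ε-greedy exploration, assuming an illustrative environment interface (`env.reset()` and `env.step(a)` returning `(next_state, reward, done)`) and hypothetical hyperparameter values:

```python
# Minimal tabular SARSA sketch; env interface and hyperparameters are assumptions.
import numpy as np

def epsilon_greedy(Q, state, n_actions, epsilon):
    """Random action with probability epsilon, otherwise the greedy action."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(Q[state]))

def sarsa(env, n_states, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, n_actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            # On-policy: the bootstrapped action A_{t+1} is the one actually taken next.
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon)
            td_target = reward + gamma * Q[next_state, next_action] * (not done)
            Q[state, action] += alpha * (td_target - Q[state, action])
            state, action = next_state, next_action
    return Q
```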
SARSA converges to the optimal Q-function if the policy is Greedy in the Limit with Infinite Exploration (GLIE) and the step sizes satisfy the Robbins-Monro conditions \(\sum_t \alpha_t = \infty\) and \(\sum_t \alpha^2_t < \infty\).
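For example, the harmonic schedule \(\alpha_t = 1/t\) satisfies both step-size conditions: \(\sum_{t=1}^{\infty} 1/t = \infty\) while \(\sum_{t=1}^{\infty} 1/t^2 = \pi^2/6 < \infty\).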
SARSA(\(\lambda\)) applies TD(\(\lambda\))-style updates to the Q-function, using eligibility traces to propagate each TD error to recently visited state-action pairs (see the sketch below).
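A minimal sketch of one SARSA(\(\lambda\)) step with accumulating eligibility traces, reusing the illustrative Q-table layout from the sketch above; `lam` stands for the trace-decay parameter \(\lambda\) and all names are assumptions:

```python
# One SARSA(lambda) step with accumulating eligibility traces (illustrative sketch).
# Q and E are NumPy arrays of shape (n_states, n_actions).
import numpy as np

def sarsa_lambda_step(Q, E, state, action, reward, next_state, next_action,
                      alpha=0.1, gamma=0.99, lam=0.9, done=False):
    # TD error for the current transition.
    delta = reward + gamma * Q[next_state, next_action] * (not done) - Q[state, action]
    # Accumulating trace: bump the eligibility of the visited (state, action) pair.
    E[state, action] += 1.0
    # Every Q entry is updated in proportion to its eligibility, then traces decay.
    Q += alpha * delta * E
    E *= gamma * lam
    return Q, E

# The traces are reset at the start of each episode, e.g. E = np.zeros_like(Q).
```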