Q-learning

Unlike the SARSA algorithm, Q-learning is off-policy: it updates its estimate of the action-value function (Q-function) toward the value of the greedy action in the next state, regardless of which action the behavior policy actually takes:

\(Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha\bigl(R_{t+1}+\gamma \max_a Q(S_{t+1},a) - Q(S_t,A_t)\bigr)\)

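As a minimal sketch, the update above can be implemented as tabular Q-learning with an epsilon-greedy behavior policy. The environment interface (`env.reset()`, `env.step(action)` returning a state index, reward, and done flag) and the hyperparameter values below are assumptions for illustration, not part of the original text.

#+begin_src python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning sketch.

    Assumes env.reset() returns an integer state index and
    env.step(action) returns (next_state, reward, done).
    """
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behavior policy: explore with probability epsilon.
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # Off-policy target: reward plus discounted greedy (max) value
            # at the next state, zeroed out at terminal states.
            td_target = reward + gamma * np.max(Q[next_state]) * (not done)
            # Move Q(S_t, A_t) toward the target by step size alpha.
            Q[state, action] += alpha * (td_target - Q[state, action])
            state = next_state
    return Q
#+end_src

The key difference from SARSA is in the target: SARSA would use Q[next_state, next_action] for the action actually selected, whereas Q-learning uses the maximum over actions at the next state.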