Q-learning

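Unlike SARSA, which bootstraps from the action the agent actually takes next, Q-learning updates its action-value function (Q-function) toward the value of the greedy action in the next state. The target policy is therefore greedy with respect to the current Q estimates, regardless of which behavior policy selects the actions: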

\[
Q(S_t, A_t) = Q(S_t, A_t) + \alpha \left( R_t + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right)
\]
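
The update rule can be sketched as a single tabular step. This is a minimal illustration, not a reference implementation: the function name `q_learning_update`, the dictionary-based Q-table, the `done` flag for terminal states, and the example states and actions are all assumptions made for this sketch.

```python
from collections import defaultdict


def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9, done=False):
    """Apply one tabular Q-learning update and return the new estimate.

    Q maps (state, action) pairs to values. The `done` flag is an
    assumption of this sketch: a terminal next state contributes no
    bootstrap value.
    """
    # Greedy bootstrap: the best estimated value in the next state,
    # independent of the action the agent will actually take there.
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    # Temporal-difference error: target minus current estimate.
    td_error = reward + gamma * best_next - Q[(state, action)]
    Q[(state, action)] += alpha * td_error
    return Q[(state, action)]


# Hypothetical two-state, two-action example.
Q = defaultdict(float)          # unseen pairs default to 0.0
Q[("s1", "right")] = 1.0
new_value = q_learning_update(
    Q, "s0", "left", reward=0.5, next_state="s1",
    actions=["left", "right"], alpha=0.5, gamma=0.9,
)
print(new_value)                # approximately 0.7
```

Here the target is 0.5 + 0.9 · 1.0 = 1.4, the previous estimate is 0, and the step size of 0.5 moves the estimate halfway to the target. A SARSA step would differ only in replacing the maximum over actions with the value of the action actually chosen in the next state.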
