Softmax Exploration

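A method for choosing actions given a Q-function. The policy is defined as \(\pi(a|s) = \frac{\exp(Q(s,a)/\tau)}{\sum_{a'}\exp(Q(s,a')/\tau)}\), where \(\tau > 0\) is a temperature parameter. As \(\tau \to 0\) the policy approaches greedy action selection, and as \(\tau \to \infty\) it approaches a uniform distribution over actions.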
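
As a minimal sketch of the definition above, the following Python computes the softmax policy for a single state and samples an action from it. The function names `softmax_policy` and `sample_action`, the list-of-floats representation of \(Q(s,\cdot)\), and the default `tau=1.0` are assumptions made for illustration only. Subtracting the maximum Q-value before exponentiating is a standard numerical-stability step that does not change the resulting probabilities.

```python
import math
import random


def softmax_policy(q_values, tau=1.0):
    """Return softmax action probabilities for one state.

    q_values: list of Q(s, a) estimates, one per action (illustrative layout).
    tau: temperature, must be positive.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    # Shift by the max so the largest exponent is 0; the shift cancels
    # in the ratio, so the probabilities are unchanged.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(q_values, tau=1.0, rng=random):
    """Sample an action index according to the softmax policy."""
    probs = softmax_policy(q_values, tau)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With a small `tau` nearly all probability mass falls on the action with the highest Q-value, while a large `tau` spreads it almost evenly, so the temperature controls how much the agent explores.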

To ensure that exploration reduces uncertainty, we can apply the principle of optimism in the face of uncertainty.
