Softmax Exploration
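A method for choosing actions given a Q-function. The policy is defined as \(\pi(a|s) = \frac{\exp(Q(s,a)/\tau)}{\sum_{a'}\exp(Q(s,a')/\tau)}\), where \(\tau > 0\) is a temperature parameter. As \(\tau \to 0\) the policy approaches greedy action selection, and as \(\tau \to \infty\) it approaches uniform random selection.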
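The formula can be illustrated with a minimal Python sketch. It assumes a discrete action set and that the Q-values for a single state are given as a plain list. The function names `softmax_policy` and `sample_action` are hypothetical, chosen only for this example. Subtracting the maximum Q-value before exponentiating is a standard numerical-stability step that does not change the resulting distribution.

```python
import math
import random


def softmax_policy(q_values, tau=1.0):
    """Return pi(a|s) for one state, given that state's Q-values.

    q_values: list of Q(s, a) for each action a (assumed discrete).
    tau: temperature; must be positive.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    # Shifting by the max leaves the ratios unchanged but prevents overflow in exp.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(q_values, tau=1.0, rng=random):
    """Draw an action index according to the softmax distribution."""
    probs = softmax_policy(q_values, tau)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With a small \(\tau\), `softmax_policy([1.0, 2.0, 3.0], tau=0.01)` places almost all probability on the last action, while a large \(\tau\) such as `tau=1000` yields a nearly uniform distribution.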
To ensure that exploration reduces uncertainty, we can apply the principle of optimism in the face of uncertainty.