Upper Confidence Bound (UCB)
A rule for selecting an arm \(a_t\) in Multi-Armed Bandits (MABs): \(a_t = \text{argmax}_{a_i} \left[ \hat{Q}_t(a_i) + \sqrt{\frac{\alpha \ln t}{N_t(a_i)}} \right]\), where \(\hat{Q}_t\) is the empirical estimate of the action-value function and the second term is an optimism bonus. The bonus decreases with \(N_t(a_i)\), the number of times \(a_i\) has been pulled by time \(t\), and grows slowly with \(t\), so arms that are promising or under-explored keep getting selected.
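The rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming Bernoulli rewards and \(\alpha = 2\) (the coefficient used by classic UCB1); the function names `ucb_select` and `run_ucb` are my own. Arms that have never been pulled are given an infinite bonus so each is tried at least once before the formula applies.

```python
import math
import random

def ucb_select(q_hat, counts, t, alpha=2.0):
    """Return the arm maximizing Q_t(a) + sqrt(alpha * ln(t) / N_t(a)).

    Unpulled arms get an infinite score so they are tried first.
    """
    best_arm, best_score = 0, float("-inf")
    for a, (q, n) in enumerate(zip(q_hat, counts)):
        score = float("inf") if n == 0 else q + math.sqrt(alpha * math.log(t) / n)
        if score > best_score:
            best_arm, best_score = a, score
    return best_arm

def run_ucb(true_means, steps, alpha=2.0, seed=0):
    """Run UCB on a Bernoulli bandit with the given arm means."""
    rng = random.Random(seed)
    k = len(true_means)
    q_hat = [0.0] * k   # empirical action-value estimates
    counts = [0] * k    # N_t(a): pull counts per arm
    for t in range(1, steps + 1):
        a = ucb_select(q_hat, counts, t, alpha)
        reward = 1.0 if rng.random() < true_means[a] else 0.0
        counts[a] += 1
        q_hat[a] += (reward - q_hat[a]) / counts[a]  # incremental mean
    return q_hat, counts
```

After enough steps, the pull counts concentrate on the best arm while suboptimal arms are still sampled occasionally (at a logarithmic rate), which is the behavior the bonus term is designed to produce.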