Contextual Bandits

Contextual bandits (also called associative search) generalize the Multi-Armed Bandit (MAB) setting: alongside the reward \(R_t\) at time \(t\), the agent also observes a state \(S_t\), called the context, which can be thought of as the situation the agent is currently in. In the casino analogy, imagine that instead of a single row of slot machines (the MAB setting), the problem consists of multiple rows of slot machines, each row with its own colour. You see which row you are in before you pull an arm, and after pulling it you are assigned to another row at random, where once again you get to see which row you are in.
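To make the analogy concrete, here is a minimal sketch of such an environment in Python. The class name, the uniform random switching between rows, and the Gaussian payouts are all illustrative assumptions, not part of any standard library:

#+begin_src python
import numpy as np

class ColouredRowsCasino:
    """Hypothetical contextual-bandit environment: several rows (contexts)
    of slot machines, each row with its own arm payout distribution."""

    def __init__(self, n_rows=3, n_arms=5, seed=0):
        self.rng = np.random.default_rng(seed)
        # True mean payout of each arm in each row (unknown to the agent).
        self.means = self.rng.normal(0.0, 1.0, size=(n_rows, n_arms))
        self.n_rows, self.n_arms = n_rows, n_arms
        self.row = self.rng.integers(n_rows)  # current context S_t, visible to the agent

    def step(self, arm):
        # The reward depends on both the current row and the chosen arm.
        reward = self.rng.normal(self.means[self.row, arm], 1.0)
        # Afterwards the agent is assigned to a new row at random.
        self.row = self.rng.integers(self.n_rows)
        return reward
#+end_src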

The agent thus has to learn how the contexts and rewards are intertwined, so that it can use the context information to select the best action.
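One simple way to do this, sketched below as a continuation of the environment above, is to keep a separate action-value table for each context and act \(\varepsilon\)-greedily with respect to the table of the context currently observed. The function name, step count, and \(\varepsilon\) value are illustrative choices:

#+begin_src python
def run_epsilon_greedy(env, steps=10_000, epsilon=0.1, seed=1):
    """Epsilon-greedy with one action-value estimate per (context, arm) pair,
    updated by incremental sample averages."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_rows, env.n_arms))   # Q[s, a]: estimated value of arm a in context s
    N = np.zeros((env.n_rows, env.n_arms))   # visit counts
    total = 0.0
    for _ in range(steps):
        s = env.row                          # observe the context before acting
        if rng.random() < epsilon:
            a = rng.integers(env.n_arms)     # explore: random arm
        else:
            a = int(np.argmax(Q[s]))         # exploit: best arm for this context so far
        r = env.step(a)
        N[s, a] += 1
        Q[s, a] += (r - Q[s, a]) / N[s, a]   # incremental sample-average update
        total += r
    return Q, total / steps

env = ColouredRowsCasino()
Q, avg_reward = run_epsilon_greedy(env)
print(avg_reward)
#+end_src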

This problem is also called associative search, since the agent has to search for the best actions and associate them with the contexts in which they work best. One application is internet ad placement: the web pages are the states, the ads shown to the user are the actions, the user is the environment, and the reward is the payment per click.

Similar to MABs, the selected actions here influence only the immediate reward. Once the actions are allowed to influence not just the reward but also the next context, we have the full reinforcement learning problem.
