# Reinforcement Learning in Recommender Systems: Some Challenges - 2019

## Details

Title: Reinforcement Learning in Recommender Systems: Some Challenges

Author(s): Simons Institute

Link(s): https://www.youtube.com/watch?v=X3uozqaNCYE

## Rough Notes

We want recommender systems (RSs) that are more user-centric: (informally) systems that really understand a user's genuine needs, goals, tasks, and preferences; that do a good job of attending to those needs subject to the preferences; that understand the task; and that interact with the user in a natural, unobtrusive, and transparent manner.

The author's thinking is mainly driven by:

- Understanding, estimating, eliciting, reacting to, and influencing **user latent state** (e.g. satisfaction, transient vs. persistent preferences, needs, interests, patterns of activity, etc.).
- Natural interaction with users (e.g. multi-turn, multi-modal, dialogue, etc.).
- Acting in the user's best interests (user preferences, behavioural processes governing user decision making and reasoning, etc.).

Some challenges in user-facing RL (e.g. RSs) include:

- Scaling to multi-user scenarios, combinatorial action spaces, etc.
- Idiosyncratic nature of actions, i.e. stochastic action sets and dynamic slate sizes (a slate is the list of recommendations shown).
- User latent state with a high degree of unobservability and stochasticity. Preferences, personalities, activities, exogenous factors, behavioural effects, etc. all add up to long-horizon, low-SNR POMDPs.
- Ecosystem effects/factors such as incentives, ecosystem dynamics, fairness, etc.

This talk mainly focuses on stochastic action sets, combinatorial slates, and user learning.

Consider the case where the set of actions an agent can take is governed by some exogenous stochastic process outside the agent's control; these are called stochastic action sets. This happens, for example, in ad auctions and RSs. To construct an MDP for this problem, two approaches first come to mind:

- Embed the action set into the state space. This, however, leads to an exponential blowup of the state space, since we need a copy of every state for every possible subset of the actions.
- Let the actions at every state be the set of mappings that take the revealed action set \(A_s\) and tell us which action to play from \(A_s\). This, however, leads to a factorial blowup in the action space.
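
To make the two blowups concrete, a small counting sketch (the base state/action counts are illustrative, not from the talk):

```python
from math import comb, prod

def embedded_state_count(num_states: int, num_actions: int) -> int:
    # Approach 1: fold the realized action set into the state, so every
    # state needs a copy for each non-empty subset of the base actions.
    return num_states * (2 ** num_actions - 1)

def selector_action_count(num_actions: int) -> int:
    # Approach 2: an "action" is a selector f with f(A_s) in A_s for every
    # non-empty revealed set A_s; there are prod_k k^C(n,k) such selectors.
    return prod(k ** comb(num_actions, k) for k in range(1, num_actions + 1))

for n in (3, 5, 8):
    print(n, embedded_state_count(100, n), selector_action_count(n))
```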

The MDPs resulting from the two approaches above do not have desirable space complexity. Despite this, the optimal policies and value functions are provably compact, and polynomial-time algorithms exist for solving them.

In the space of user learning, there is evidence of very slow user learning and adaptation (e.g. Figure 6 of the referenced paper). Such problems have very low SNR: latent state (e.g. satisfaction) has a very noisy impact on user engagement, and latent state moves slowly and persistently, resulting in a significant cumulative effect over long horizons.

For slate RSs, an MDP formulation is:

- States \(S\) - user features, user history, contextual features
- Actions \(A\) - possible slates of recommendations
- Reward \(R\) - immediate engagement with the selected recommendation
- Discount rate \(\gamma\) - to reflect target/expected session length
- Objective - maximize cumulative user engagement over a session
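
One common way to pick \(\gamma\) for a target session length is to match the effective horizon \(1/(1-\gamma)\); this is a standard heuristic, not something prescribed in the talk:

```python
def gamma_for_session_length(expected_length: float) -> float:
    # A discount of gamma gives an effective planning horizon of
    # about 1 / (1 - gamma) steps, so a target session length L
    # suggests gamma = 1 - 1/L. (Heuristic; not from the talk.)
    return 1.0 - 1.0 / expected_length

print(gamma_for_session_length(20))  # sessions of ~20 interactions
```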

The combinatorics of slates makes the problem challenging: there are \(\binom{N}{k} \cdot k!\) possible \(k\)-slates from \(N\) items. Ideally, we would want to decompose the value of a slate into the values of its constituent items. Such a decomposition is hard, since the presence of some items on the slate affects the user's response to, and hence the value of, other items. The value of a slate also depends on the user choice model, which could be Luce-Sheppard (MNL), Plackett-Luce, cascade, etc.
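
For concreteness, \(\binom{N}{k} \cdot k!\) is just the number of ordered \(k\)-permutations of \(N\) items; the candidate-pool size below is illustrative:

```python
from math import comb, factorial, perm

N, k = 1000, 5  # e.g. 1000 candidate items, slates of size 5

# N-choose-k orderings: C(N, k) * k! equals N! / (N - k)!
assert comb(N, k) * factorial(k) == perm(N, k)

print(perm(N, k))  # ~10^15 ordered 5-slates: enumeration is hopeless
```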

SlateQ approaches this problem by decomposing slate Q-values into item Q-values. What is learnt here are conditional-on-click, item-wise Q-values; the Q-value of a slate is derived from (and also feeds into) the item-level Q-values. The model makes the following assumptions:

- The user selects one item from the slate (including the null item \(\emptyset\), representing no selection).
- State transitions and reward depend only on the selected item.
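
Under these assumptions, the slate Q-value decomposes as a choice-probability-weighted sum of conditional-on-click item Q-values. A minimal sketch assuming an MNL-style choice model; the scores and Q-values are illustrative placeholders for learned models:

```python
import math

def mnl_choice_probs(scores, null_score=0.0):
    # MNL user choice over the slate's items plus a null (no-click) item.
    expv = [math.exp(s) for s in scores]
    denom = sum(expv) + math.exp(null_score)
    return [v / denom for v in expv]

def slate_q_value(item_qs, scores, null_score=0.0):
    # Q(s, slate) = sum_i P(click i | s, slate) * Qbar(s, i), where Qbar
    # is the conditional-on-click item Q-value; the null item contributes
    # zero, since no selection yields no engagement.
    probs = mnl_choice_probs(scores, null_score)
    return sum(p * q for p, q in zip(probs, item_qs))

# Two equally attractive items, each with long-term value 1.0:
print(slate_q_value([1.0, 1.0], [0.0, 0.0]))  # 2/3: null gets mass 1/3
```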

To account for the combinatorial slate space, the optimization is formulated as a fractional mixed-integer program, which can be relaxed to a linear program; this is fast enough at training time. For real-time deployment, a greedy approximation is used.
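
A sketch of what serving-time greedy selection might look like: rank candidates by choice weight times conditional-on-click item Q-value and keep the top \(k\). This is a simplified heuristic stand-in for the exact optimization; `item_q` and `score` are assumed learned models:

```python
import math

def greedy_slate(candidates, item_q, score, k):
    # Rank by MNL choice weight exp(score(i)) times the item's
    # conditional-on-click Q-value; keep the top k. A simplified
    # serving-time heuristic, not the exact LP solution.
    def key(i):
        return math.exp(score(i)) * item_q(i)
    return sorted(candidates, key=key, reverse=True)[:k]

# Toy example: three candidates with hand-picked item Q-values.
qs = {0: 1.0, 1: 3.0, 2: 2.0}
print(greedy_slate([0, 1, 2], qs.get, lambda i: 0.0, k=2))  # [1, 2]
```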

In summary, some (more) challenges in user-interactive RL:

- Modeling user latent state.
- Behavioural phenomena, e.g. anchoring, framing, choice biases, etc.
- Preference modeling, elicitation, dialogue.
- Ecosystem effects - fairness, long-term health etc.