Zero-Shot Assistance in Sequential Decision Problems - 2022

Details

Title : Zero-Shot Assistance in Sequential Decision Problems
Author(s): De Peuter, Sebastiaan and Kaski, Samuel
Link(s) : http://arxiv.org/abs/2202.07364

Rough Notes

There is a human agent, an AI agent (also called the AI assistant), and a previously unseen sequential decision task with no explicit reward. The AI agent's role is to provide advice to the human agent. In the main problem statement, only the human agent takes actions in the environment.

Humans face novel decision problems, e.g. in design tasks. In these tasks the final solution is unknown a priori, but the human knows in principle how to solve the problem. Human decisions in these tasks are driven by a goal, which can be encoded as a reward function (#DOUBT); however, this reward function is known only to the human, who is in general unable to define it explicitly.

This paper explores creating AI agents to assist human agents in these novel decision problems. The AI agent's goal is to improve the human agent's cumulative reward relative to the human's effort. (#DOUBT How is human effort measured?)

Some difficulties:

  • Cannot infer reward functions from prior tasks since each task is novel.
  • The human agent has biases they are not aware of, which therefore cannot be communicated to the AI agent.
  • Due to the novel nature of the task, both the reward function and the biases have to be inferred online.

The AI agent's advice is of the form: "Have you considered doing action \(a\)?". The human agent makes the final decision.

This paper looks at 2 concrete problems:

  • Planning a day trip.
  • Inventory management with stochastic demand.

AIAD significantly improves outcomes:

  • compared to fully automated solutions, where the human puts in no effort (#ASK What is the reward function here, since full automation needs a reward function?);
  • and further when the assistant infers and accounts for the human's biases.

(#NOTE Looks like (, ) introduces assistance as a decision problem and presents new and relevant MDP variants.)

The human agent solves an infinite-horizon MDP, where instead of a reward function we have a class of parametrized reward functions \(\{R_\omega\}_{\omega\in \Omega}\).
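
As a purely illustrative sketch (my assumption, not the paper's): the reward family could, for instance, be linear in state features \(\phi(s)\), with \(\omega\) acting as preference weights.

  import numpy as np

  # Illustrative sketch only: one possible parametrized reward family
  # {R_omega}, assumed linear in a state feature vector phi(s).
  def make_reward(omega):
      def R(phi_s):
          # omega: preference weights; phi_s: features of state s
          return float(np.dot(omega, phi_s))
      return R

  # Hypothetical example: features = (scenery, walking distance)
  R = make_reward(np.array([1.0, -0.5]))
  print(R(np.array([0.8, 2.0])))  # reward of a state with these features (≈ -0.2)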

The AI agent aims to "maximize the cumulative discounted reward obtained by the agent" through its advice.

For the AI agent to plan, we assume it has a model \(\hat{\pi}(a|s,a';\theta,\omega)\) of the human agent's fixed policy after receiving advice \(a'\). This model depends on the bias parameters \(\theta \in \Theta\), which model the human agent's biases, and on \(\omega \in \Omega\), which parametrizes the human agent's reward function.

The AI agent's decision problem is modelled as a Generalized Hidden Parameter Markov Decision Process (GHP-MDP) (, a). A key difference is the transition function \(\mathcal{T}_{\omega,\theta}(s_{t+1}|s_t,a'_t) = \sum_{a_t}\hat{\pi}(a_t|s_t,a'_t;\omega,\theta)\,T(s_{t+1}|s_t,a_t)\), where \(T\) is the transition function of the MDP solved by the human agent.
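
A minimal sketch of this marginalized transition, assuming small discrete state and action spaces, an environment transition array T[s, a, s'], and a callable pi_hat standing in for \(\hat{\pi}\) (these names are mine, not the paper's):

  import numpy as np

  def marginal_transition(T, pi_hat, s, a_advice, omega, theta):
      # T_{omega,theta}(. | s, a') = sum_a pi_hat(a | s, a'; omega, theta) * T(. | s, a)
      n_actions = T.shape[1]
      probs = np.array([pi_hat(a, s, a_advice, omega, theta)
                        for a in range(n_actions)])
      return probs @ T[s]  # distribution over next states, shape (n_states,)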

One challenge here is that the AI agent does not know the true \(\theta,\omega\). For each observed transition \((s_t,a_t', s_{t+1})\), we can compute its likelihood under different parameter values in \(\Omega\times \Theta\) and update a posterior distribution over them.
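
A hedged sketch of this online inference step, assuming a discrete grid of candidate \((\omega,\theta)\) pairs and the same T and pi_hat as above (the grid-based belief is my simplification):

  import numpy as np

  def update_posterior(posterior, candidates, T, pi_hat, s, a_advice, s_next):
      # Likelihood of the observed transition (s, a', s_next) under each
      # candidate: sum_a pi_hat(a | s, a'; omega, theta) * T(s_next | s, a).
      likelihoods = np.array([
          sum(pi_hat(a, s, a_advice, omega, theta) * T[s, a, s_next]
              for a in range(T.shape[1]))
          for (omega, theta) in candidates
      ])
      unnormalized = posterior * likelihoods
      return unnormalized / unnormalized.sum()  # updated belief over (omega, theta)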

Finding optimal policies here involves planning over belief distributions over MDPs (each \((\omega,\theta)\) pair defines a single MDP) while also accounting for how this distribution changes as we act (#DOUBT the "we" in "as we act" refers to the AI agent, right?).
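
One simple way to act under such a belief is posterior sampling: draw a \((\omega,\theta)\) hypothesis from the current posterior, plan in the corresponding MDP, give the resulting advice, observe, and update. This is a simplification under my assumptions, not necessarily the paper's planner, since it ignores the value of the information gained from the human's responses.

  import numpy as np

  def advise_step(posterior, candidates, plan_in_mdp, s, rng=None):
      # Thompson-sampling-style simplification: sample one (omega, theta)
      # hypothesis from the belief and plan as if it were the true MDP.
      if rng is None:
          rng = np.random.default_rng()
      omega, theta = candidates[rng.choice(len(candidates), p=posterior)]
      # plan_in_mdp is assumed to return advice a' for state s in the MDP
      # defined by (omega, theta).
      return plan_in_mdp(s, omega, theta)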

The human agent is modelled as \(p(a|u)\propto \exp(\beta u(a))\,p(a)\) for some utility \(u\) (chosen to be the Q-function). Given advice \(a'\), the probability that the human agent changes their action from \(a\) to \(a'\) is
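
A small sketch of the Boltzmann-rational choice model above, assuming a tabular Q-function Q[s, a] as the utility and an explicit action prior p(a); the switching probability given advice is not written out in these notes, so only the base choice model is shown:

  import numpy as np

  def human_policy(Q, s, beta, prior):
      # p(a | u) ∝ exp(beta * u(a)) * p(a), with u(a) = Q[s, a]
      logits = beta * Q[s] + np.log(prior)
      logits -= logits.max()  # subtract max for numerical stability
      probs = np.exp(logits)
      return probs / probs.sum()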

To plan over this GHP-MDP,

(#DOUBT I do not understand the mechanism which makes knowledge of the human agent's bias enable advice which improves the final cumulative reward.)

(#NOTE See if this is related to (, a).)
