Reinforcement Learning From Human Feedback (RLHF)

One shortcoming of RLHF in Large Language Models (LLMs) pointed out by Sergey Levine is that current work focuses more on the reward-modelling aspect than on the longer-horizon nature of dialogue with LLMs. For example, when negotiating for something we have an end goal in mind, so we would definitely not want to act myopically; the same holds for the 20 questions game. A sketch of this contrast follows below.
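
To make the contrast concrete, here is a minimal sketch, assuming a hypothetical reward_model callable that scores a single (context, response) pair the way a learned RLHF reward model would. It compares the per-turn objective that standard RLHF optimizes with a discounted return over a whole dialogue, which is what a non-myopic, multi-turn view would target.

#+begin_src python
from typing import Callable, List, Tuple

# reward_model is a hypothetical stand-in for a learned RLHF reward model:
# it scores one (context, response) pair in isolation.

def myopic_objective(
    reward_model: Callable[[str, str], float],
    context: str,
    response: str,
) -> float:
    """Score a single turn on its own; this is what per-turn RLHF optimizes."""
    return reward_model(context, response)

def multi_turn_return(
    reward_model: Callable[[str, str], float],
    dialogue: List[Tuple[str, str]],  # sequence of (context, response) turns
    gamma: float = 0.99,              # discount factor over dialogue turns
) -> float:
    """Discounted sum of per-turn rewards over the whole dialogue.

    Optimizing this return, rather than each turn independently, is what
    would let a policy accept a low-reward turn now (e.g. a probing question
    in 20 questions) in exchange for a better outcome later.
    """
    return sum(
        (gamma ** t) * reward_model(ctx, resp)
        for t, (ctx, resp) in enumerate(dialogue)
    )
#+end_src

The sketch only illustrates the objective being maximized; how the long-horizon return would actually be estimated and optimized for an LLM policy is exactly the part Levine argues gets less attention than reward modelling.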
