Why do we need RLHF? Imitation, Inverse RL, and the role of reward
This is a cross-post from my Substack to get some feedback from the alignment community. I have condensed some sections to get the message across. It's rarely questioned in the mainstream why we need RLHF in the first place. If we can do supervised fine-tuning (SFT) sufficiently well, then can...
This reminds me of the paper Chris linked as well. I think there's very solid evidence on the relationship between the kind of meta-learning Transformers undergo and Bayesian inference (e.g., see this, this, and this). The main question I have been thinking about is what a state is for language, and how such a state could be useful if it is discovered in this way. For state-based RL/control tasks this seems relatively straightforward (e.g., see this and this), but it is much less clear for more abstract tasks. It'd be great to hear your thoughts!
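To make the contrast concrete, here is a minimal sketch of the common naive framing in which the "state" for language is just the token prefix and generation is a sparse-reward MDP. Everything here (the `TokenMDP` class, the toy `reward_fn`, the tiny vocabulary) is illustrative, not something proposed in the comment or the post; the open question is whether a more abstract, discovered state representation could do better than this prefix-based one.

```python
# A minimal sketch (illustrative only): the naive framing where the state for
# language is the token prefix, the action is the next token, and the reward
# arrives only when the sequence ends. Names (TokenMDP, reward_fn) are assumed.

import random
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class TokenMDP:
    """Token-level MDP: state = prefix of tokens, action = next token."""
    vocab: List[str]
    reward_fn: Callable[[List[str]], float]  # scores a finished sequence
    max_len: int = 10
    prefix: List[str] = field(default_factory=list)

    def reset(self) -> Tuple[str, ...]:
        self.prefix = []
        return tuple(self.prefix)

    def step(self, action: str) -> Tuple[Tuple[str, ...], float, bool]:
        self.prefix.append(action)
        done = action == "<eos>" or len(self.prefix) >= self.max_len
        # Sparse reward: only the completed sequence is scored.
        reward = self.reward_fn(self.prefix) if done else 0.0
        return tuple(self.prefix), reward, done


if __name__ == "__main__":
    # Toy reward: prefer shorter sequences that terminate with <eos>.
    env = TokenMDP(
        vocab=["a", "b", "<eos>"],
        reward_fn=lambda toks: 1.0 / len(toks) if toks[-1] == "<eos>" else 0.0,
    )
    state = env.reset()
    done = False
    while not done:
        action = random.choice(env.vocab)  # random policy as a stand-in
        state, reward, done = env.step(action)
    print(state, reward)
```

In this framing the "state" grows without bound and carries no abstraction, which is exactly why it feels unsatisfying compared to the compact states available in control tasks.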