Why do we need RLHF? Imitation, Inverse RL, and the role of reward
This is a cross-post from my Substack to get some feedback from the alignment community. I have condensed some sections to get the message across. It's rarely questioned in the mainstream why we need RLHF in the first place. If we can do supervised fine-tuning (SFT) sufficiently well, then can...
This reminds me of the paper Chris linked as well. I think there's very solid evidence on the relationship between the kind of meta-learning Transformers undergo and Bayesian inference (e.g., see this, this, and this). The main question I have been thinking about is what a state is for language, and how such a state could be useful if it is discovered in this way. For state-based RL/control tasks this seems relatively straightforward (e.g., see this and this), but it is much less clear for more abstract tasks. It'd be great to hear your thoughts!
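To make the contrast concrete, here is a minimal sketch of the common naive framing in which the "state" for language is just the token prefix and generation is a sparse-reward MDP. Everything here (the `TokenMDP` class, the toy `reward_fn`, the tiny vocabulary) is illustrative, not something proposed in the comment or the post; the open question is whether a more abstract, discovered state representation could do better than this prefix-based one.

```python
# A minimal sketch (illustrative only): the naive framing where the state for
# language is the token prefix, the action is the next token, and the reward
# arrives only when the sequence ends. Names (TokenMDP, reward_fn) are assumed.

import random
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class TokenMDP:
    """Token-level MDP: state = prefix of tokens, action = next token."""
    vocab: List[str]
    reward_fn: Callable[[List[str]], float]  # scores a finished sequence
    max_len: int = 10
    prefix: List[str] = field(default_factory=list)

    def reset(self) -> Tuple[str, ...]:
        self.prefix = []
        return tuple(self.prefix)

    def step(self, action: str) -> Tuple[Tuple[str, ...], float, bool]:
        self.prefix.append(action)
        done = action == "<eos>" or len(self.prefix) >= self.max_len
        # Sparse reward: only the completed sequence is scored.
        reward = self.reward_fn(self.prefix) if done else 0.0
        return tuple(self.prefix), reward, done


if __name__ == "__main__":
    # Toy reward: prefer shorter sequences that terminate with <eos>.
    env = TokenMDP(
        vocab=["a", "b", "<eos>"],
        reward_fn=lambda toks: 1.0 / len(toks) if toks[-1] == "<eos>" else 0.0,
    )
    state = env.reset()
    done = False
    while not done:
        action = random.choice(env.vocab)  # random policy as a stand-in
        state, reward, done = env.step(action)
    print(state, reward)
```

In this framing the "state" grows without bound and carries no abstraction, which is exactly why it feels unsatisfying compared to the compact states available in control tasks.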