All of Tomek Korbak's Comments + Replies

Fair point, I'm using "compositional" in an informal sense different from the one in formal semantics, closer to what I called "trivial compositionality" in this paper. But I'd argue it's not totally crazy to call such preference models compositional, and that compositionality here still bears some resemblance to Montague's account of compositionality as a homomorphism: basically, you have get_total_score(response) == sum([get_score(attribute) for attribute in decompose(response)])
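
A runnable toy version of that identity (the attribute names, decomposition, and scorers below are hypothetical placeholders, not anything from the paper):

```python
def decompose(response: str) -> dict[str, str]:
    # Pretend each attribute is judged on the full response text.
    return {"helpfulness": response, "harmlessness": response, "honesty": response}

ATTRIBUTE_SCORERS = {
    "helpfulness": lambda text: 0.7,   # placeholder per-attribute preference models
    "harmlessness": lambda text: 0.9,
    "honesty": lambda text: 0.8,
}

def get_score(attribute: str, text: str) -> float:
    return ATTRIBUTE_SCORERS[attribute](text)

def get_total_score(response: str) -> float:
    # "Trivially compositional": the total score is just the sum over decomposed parts.
    return sum(get_score(attr, text) for attr, text in decompose(response).items())

print(get_total_score("some response"))  # 2.4 with the placeholder scorers above
```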

In practice I think using a trained reward model (as in RLHF), not fixed labels, is the way forward. Then the cost of acquiring the reward model is the same as in RLHF; the difference is primarily that PHF typically needs many more calls to the reward model than RLHF.

Thanks, I found the post quite stimulating. Some questions and thoughts:

  1. Is LLM dynamics ergodic? I.e., is the time average along a trajectory equal to the average page vector?

  2. One potential issue with this formalisation is that you always assume a prompt of fixed size (so you need to introduce artificial "null tokens" if the prompt is shorter) and you don't give special treatment to the token <|endoftext|>. For me, it would be more intuitive to consider LLM dynamics in terms of finite, variable-length, token-level Markov chains (until <|endoftext|>

... (read more)
Cleo Nardo:
1. Almost certainly ergodic in the limit, but highly periodic due to English grammar.
2. Yep, just for convenience.
3. Yep.
4. Temp = 0 would give exactly absorbing states.
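
To make point 1 concrete, here is a toy illustration (a small hypothetical Markov chain, not the post's actual model) of what "time average equals the average page vector" means: the empirical visit frequencies along one long trajectory match the stationary distribution.

```python
import numpy as np

P = np.array([[0.1, 0.9, 0.0],
              [0.5, 0.0, 0.5],
              [0.3, 0.3, 0.4]])   # hypothetical transition matrix

# Ensemble average: the stationary distribution (left eigenvector of P for eigenvalue 1).
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi /= pi.sum()

# Time average: empirical visit frequencies along a single long trajectory.
rng = np.random.default_rng(0)
state, counts = 0, np.zeros(3)
for _ in range(100_000):
    counts[state] += 1
    state = rng.choice(3, p=P[state])

print(pi, counts / counts.sum())   # the two should approximately agree for an ergodic chain
```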

I don't remember where I saw that, but something as dumb as subtracting the embedding of <|bad|> might even work sometimes.
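
A rough sketch of what that intervention could look like (the model path, prompt, and the assumption that the tokenizer contains <|good|>/<|bad|> control tokens are all placeholders; whether this helps at all is an empirical question):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("path/to/phf-model")          # placeholder path
model = AutoModelForCausalLM.from_pretrained("path/to/phf-model")

prompt = tok("You are such a", return_tensors="pt")

with torch.no_grad():
    bad_vec = model.get_input_embeddings().weight[tok.convert_tokens_to_ids("<|bad|>")]
    prompt_emb = model.get_input_embeddings()(prompt.input_ids)

    # Next-token logits with and without removing the "bad" direction from the prompt.
    plain_logits = model(inputs_embeds=prompt_emb).logits[0, -1]
    steered_logits = model(inputs_embeds=prompt_emb - bad_vec).logits[0, -1]

print(tok.decode([int(plain_logits.argmax())]),
      tok.decode([int(steered_logits.argmax())]))
```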

That's a good point. But if you're using a distilled, inference-bandwidth-optimised RM, annotating your training data might only take a fraction of the compute needed for pretraining.

Also, the cost of annotation is constant and can be amortized over many training runs. PHF shares an important advantage of offline RL over online RL approaches (such as RLHF): being able to reuse feedback annotations across experiments. If you already have an annotated dataset, running a hyperparameter sweep on it is as cheap as standard pretraining, and in contrast with RLHF you don't need to recompute rewards.

For filtering it was the best-scoring 25% of documents, so (keeping the total token budget fixed) we effectively trained for 4 epochs on the retained data.

(We had different thresholds for filtering and conditional training; note that we filter at the document level but condition at the sentence level.)
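
For concreteness, a toy sketch of what sentence-level conditioning annotation could look like (the sentence splitter, scoring function, and threshold below are hypothetical stand-ins for a trained reward model and the actual pipeline):

```python
import re

def split_into_sentences(document: str) -> list[str]:
    # Naive sentence splitter, for illustration only.
    return [s for s in re.split(r"(?<=[.!?])\s+", document) if s]

def annotate(document: str, score_fn, threshold: float) -> str:
    """Prepend <|good|>/<|bad|> to each sentence based on a reward score.

    `score_fn` stands in for a trained reward model; `threshold` would be chosen
    so that roughly the ~5% safest sentences get <|good|>.
    """
    tagged = []
    for sentence in split_into_sentences(document):
        tag = "<|good|>" if score_fn(sentence) >= threshold else "<|bad|>"
        tagged.append(tag + sentence)
    return " ".join(tagged)

# Usage with a dummy scorer:
print(annotate("This is fine. This is not.", score_fn=lambda s: len(s) % 2, threshold=1))
```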

Good question! We're not sure. The fact that PHF scales well with dataset size might provide weak evidence that it would scale well with model size too.

I'm guessing that poison-pilling the <|bad|> sentences would have a negative effect on the <|good|> capabilities as well?

That would be my guess too.

Have you tested the AI's outputs when run in <|bad|> mode instead of <|good|> mode?

We did; the LM tends to generate toxic text when conditioned on <|bad|>. Though we tended to use risk-averse thresholds, i.e. we used <|good|> for only the ~5% safest sentences and <|bad|> for the remaining 95%. So <|bad|> is not bad all the time.

Here it would be helpful to know what the AI produces when prompted by <|bad|>.

That's a good point. We haven't systematically investigated differences in capabilities between <|goo... (read more)

Evan R. Murphy:
Sounds like a good approach. How do you go about doing this?
Logan Riggs:
Is it actually true that you only trained on 5% of the dataset for filtering (I’m assuming training for 20 epochs)?

I really liked the post, and the agenda of improving safety through generative modelling is close to my heart.

we begin an online phase of its training: the agent starts acting in its environment and generating new task completions, which are recorded and fed back into the decision transformer as new training data

But you still need online access to the MDP (i.e. the reward function and transition function), don't you? And it's access to the MDP that drives novelty and improvement. If you were just sampling whole trajectories from the model (asking the model itsel... (read more)

Sam Marks:
Yep, that's right! This was what I meant by "the agent starts acting in its environment" in the description of an ODT. So to be clear, during each timestep in the online phase, the ODT looks at a partial trajectory $g_1, o_1, a_1, \ldots, g_{t-1}, o_{t-1}, a_{t-1}, g_t, o_t$ of rewards-to-go, observations, and actions; then selects an action $a_t$ conditional on this partial trajectory; and then the environment provides a new reward $r_t$ (so that $g_{t+1} = g_t - r_t$) and observation $o_{t+1}$. Does that make sense? Thanks for the reference to the Levine paper! I might have more to say after I get a chance to look at it more closely.
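
A minimal sketch of that online loop in code (the `model.act`, `env.reset`, and `env.step` interfaces are hypothetical; this just mirrors the bookkeeping above, not the actual ODT implementation):

```python
def collect_online_episode(model, env, target_return, max_steps=1000):
    trajectory = []                      # [(g_t, o_t, a_t), ...]
    g, o = target_return, env.reset()
    for _ in range(max_steps):
        a = model.act(trajectory, g, o)  # condition on the partial trajectory plus (g_t, o_t)
        o_next, r, done = env.step(a)    # environment supplies reward r_t and observation o_{t+1}
        trajectory.append((g, o, a))
        g, o = g - r, o_next             # rewards-to-go update: g_{t+1} = g_t - r_t
        if done:
            break
    return trajectory                    # recorded completion, fed back in as new training data
```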

Thanks for sharing your thoughts, I found these remarks extremely insightful!

It seems like the ideal way forward is to more accurately capture what you actually care about, then optimize that; staying close to the original distribution feels like more of a hack to me. It seems like you view the original distribution of webtext as more principled or fundamental than I do, but I'm not sure what accounts for that difference.

A reply that comes to mind is that maybe being grounded in human knowledge, reasoning rules and values represented in web text has inher... (read more)

Do you think these insights would generalise to the case where the language model may be interacting with some system during this fine-tuning phase? For example, if it generates queries to an external search engine or API, or has dialogue with a human, then the optimal policy is no longer equivalent to just generating the correct output distribution, as it now also involves environment observations.

That's a good point and helps to make a distinction between generative models and policies. In the interactive case, your policy pi(a|s) is a conditional distr... (read more)

RobertKirk:
I feel like it wasn't made clear in the post/paper that this was the motivation. You state that distribution collapse is bad without really justifying it (in my reading). Clearly distributional collapse to a degenerate bad output is bad, and it will also often stall learning, so it is bad from an optimisation perspective as well (as it makes exploration much less likely), but this seems different from distributional collapse to the optimal output.

For example, when you say I think I just disagree, in the case where we're considering LLMs as a model for future agent-like systems that we will have to align, which to me is the reason they're useful for alignment research. If there's a normative claim that diversity is important, then you should just have that in your objective/algorithm.

I think the reason KL divergence is included in RLHF is an optimisation hack to make sure it does well. Maybe that's revealed that for alignment we actually wanted the Bayesian posterior distribution you describe rather than just the optimal distribution according to the reward function (i.e. a hardmax rather than a softmax on the reward over trajectories), although that seems to be an empirical question whether that was our preference all along or it's just useful in the current regime.

I'm glad you found our post insightful!

I'm not sure what the best energy allocation between modelling and inference is here. I think, however, that the modelling part is more neglected (the target distribution is rarely even considered as something that can be written down and analysed). Moreover, designing good target distributions can be quite alignment-specific, whereas designing algorithms for inference in probabilistic graphical models is an extremely generic research problem, so we can expect progress there anyway.

I expect that in the current regime (only optimizing the policy a small amount), any method that does a reasonable job of maximizing reward while controlling how much the policy changes can be made to work in practice

Yes, that seems plausible. Though as you said, most methods that only change the policy a bit (early stopping, clipping in PPO) do that via implicit KL penalties and can still be seen as updating a prior.
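
For reference, the standard identity behind "updating a prior": the KL-penalized objective used (implicitly or explicitly) in RLHF has a closed-form optimum that is exactly a Bayesian-style update of the pretrained distribution $\pi_0$,

$$\pi^* \;=\; \arg\max_{\pi}\ \mathbb{E}_{x\sim\pi}\big[r(x)\big] \;-\; \beta\,\mathrm{KL}\big(\pi\,\|\,\pi_0\big), \qquad \pi^*(x) \;\propto\; \pi_0(x)\,\exp\!\big(r(x)/\beta\big).$$

As $\beta \to 0$ this tends to a hardmax on reward; finite $\beta$ gives the softmax/posterior form discussed above.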

there would be an exploration-exploitation trade-off, which is something that the RL perspective may again offer insight into.

Definite... (read more)