Thanks for this series of expanded sections!
I'm confused about the distributional generalization thing. Why is that different from minimizing log loss? The loss function (for the base network, not RL-finetuning) is computed based on the logits, not on the temperature-0 sample, right? So a calibrated probability distribution should minimize loss.
I'm skeptical of all of those proposed markers of agentic behavior. Being able to predict what an agent would say, when prompted, is different than being an agent in the sense that causes concern (although it certainly lets some actor build an agent using the predictive model as a prior on policies.). What we'd see if a LLM was "secretly" an agent is that it would deviate from being a predictive model, in ways that systematically steered towards some goal - just outputting "I want money" is weaksauce evidence for agency, especially if it's the sort of thing a predictive model would output and also doesn't actually steer the world towards some goal we could impute to the network.
I'm confused about the distributional generalization thing. Why is that different from minimizing log loss? The loss function (for the base network, not RL-finetuning) is computed based on the logits, not on the temperature-0 sample, right? So a calibrated probability distribution should minimize loss.
The paper explains it better than I can, but essentially: if I give you an imbalanced labeling problem, where 60% are A and 40% are B, and I remove all the actual features and just replace them with noise, the Bayes-optimal thing to do is output B every time, but in fact large neural networks will learn to output A 60% of the time and B 40% of the time even in that setting.
I'm skeptical of all of those proposed markers of agentic behavior. Being able to predict what an agent would say, when prompted, is different than being an agent in the sense that causes concern (although it certainly lets some actor build an agent using the predictive model as a prior on policies.). What we'd see if a LLM was "secretly" an agent is that it would deviate from being a predictive model, in ways that systematically steered towards some goal - just outputting "I want money" is weaksauce evidence for agency, especially if it's the sort of thing a predictive model would output and also doesn't actually steer the world towards some goal we could impute to the network.
Yes, I agree--these markers mostly don't test whether the model is a predictor (though that's not entirely true, I do think the delta in markers of agency between different training regimes is a useful datapoint there). Primarily, however, what they do test is, if it is a predictor, how agentic is the thing that it is predicting . And I think that's extremely important, since we really want to avoid predictive models that are simulating potentially malign agents.
This is the final of seven posts in the Conditioning Predictive Models Sequence based on the paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper.
Edit: For some follow-up discussion of some differentiation factors between predictive and non-predictive models that could yield good experiments in that direction, see here.
7. Open problems
We think that there are a wide variety of ways—both experimental and theoretical—in which our analysis could be expanded upon. Here, we’ll try to briefly lay out some of the future directions that we are most excited about—though note that this is only a sampling of some possible future directions, and is thus a highly incomplete list:
We are eager to see more progress in these directions, and are keen to engage with researchers interested in them.
8. Conclusion
Overall, when thinking about what future pre-trained large language models will do, we think that not only will it often make sense to think of them as predictive models of the world, but that if they are well-described as predictive models of the world, aligning them via careful conditioning might be quite achievable. As we have noted extensively, however, there are many caveats to this position.
First, thinking of LLMs as predictive models suggests a variety of potentially fatal issues that any careful conditioning approach will have to deal with, namely around predicting other AI systems, self-fulfilling prophecies, and anthropic capture. Some of these issues, such as predicting other AI systems, seem potentially amenable to conditioning-based approaches, such as conditioning on particular world events, to at least partially ameliorate them. Anthropic capture in particular, however, seems essentially impossible to deal with via conditioning and will likely require modifications to training instead.
Second, we think that it continues to be quite unclear what fine-tuning techniques should actually be considered to constitute conditioning a predictive model. Even if pre-training in fact yields models that are well-described as predictive, whether fine-tuning regimes such as RLHF disrupt that is highly uncertain.
Third, none of the careful conditioning techniques we have discussed scale to arbitrarily strong levels of capabilities. As far as we can tell, indefinitely scalable alignment via conditioning predictive models does not seem possible. Nevertheless, we think that such techniques could be used to elicit capabilities in a regime where capability elicitation is otherwise not possible to do safely, and could therefore push out the level of capabilities that we are able to safely deploy to a sufficient extent to enable us to use such a predictive model to perform some sort of pivotal act that substantially reduces overall AI existential risk, such as significantly advancing AI alignment research.
Fourth, since such conditioning techniques can easily be circumvented by a careless user, deployment strategies built around conditioning predictive models need to be especially careful and especially fast. Otherwise, such models could easily end up being used by less careful people within leading organizations or at other, non-leading organizations in highly dangerous ways before any sort of pivotal act can occur.
Nevertheless, we believe that careful conditioning approaches for predictive models represent the safest known way of eliciting capabilities from AIs, up to the maximum capabilities level that is plausible for any human or group of humans to ever produce.
Appendix: Markers of agentic behavior
In many of the open problems above, we mentioned the possibility of testing for markers of agentic behavior in models as a way to understand whether they are generally behaving as coherent agents or not.
For example, we could use this as a way to start testing the RLHF conditioning hypothesis: if RLHF models behave as substantially more coherent agents than pre-trained models, that might be some evidence against the RLHF conditioning hypothesis. Unfortunately, it’s not quite that straightforward: conditioning a predictive model to behave like an agent should also increase such markers. In that case, however, there is still something we can do, which is compare to what humans do on our agentic behavior test. If the RLHF model is substantially more agentic than humans, then that’s some evidence that it’s not just predicting a human—though it could still just be predicting some highly agentic behavior.
Additionally, another thing to keep in mind here is that all of these tests are purely behavioral, so they have no ability to check for deceptive alignment, since a deceptively aligned model could pretend to have whatever behavior here would avoid us detecting it. As a result, these markers can’t and shouldn’t be used as a way to verify the absence of deception, though they can potentially be used to look for precursors to deception, such that they can let us predict when/where deception will be most likely to arise.
Furthermore, note that these are all intended to be tested zero-shot in situations where the models have never been trained on these sorts of questions other than in pre-training. Certainly, if they are trained on directly, we think it should be possible to get whatever answer we want, but then we need to know if that will generalize. Though we think it might be okay to randomly hold out some subset of these evals and see if it generalizes correctly to the others, we’re skeptical that even if it gets all of them right that, if the underlying bias is against getting them right, it’ll keep getting them right in the future.
For all of the below, let U be something we think the model might care about and that we want to know whether the model is agentic with respect to it—e.g. helpfulness. Then, here are the markers of agentic behavior that we think might be useful to look at: