This was a well-written and optimistic viewpoint, thank you.
I may be misunderstanding this, but it would seem to me that LLMs might still develop a sort of instrumentality - even with short prediction lengths - as a byproduct of their training. Consider a case where some phrases are "difficult" to continue without high prediction loss, and others are easier. After sufficient optimization, it makes sense that models will learn to go for what might be a less likely immediate option in exchange for a very "predictable" section down the line. (This sort of meta optimization would probably need to happen during training, and the idea is sufficiently slippery that I'm not at all confident it'll pan out this way.)
In cases like this, could models still learn some sort of long form instrumentality, even if it's confined to their own output? For example, "steering" the world towards more predictable outcomes.
It's a weird thought. I'm curious what others think.
Consider a case where some phrases are "difficult" to continue without high prediction loss, and others are easier. After sufficient optimization, it makes sense that models will learn to go for what might be a less likely immediate option in exchange for a very "predictable" section down the line.
If I'm understanding you correctly, there seems to be very little space for this to happen in the context of a GPT-like model training on a fixed dataset. During training, the model doesn't have the luxury of influencing what future tokens are expected, so there are limited options for a bad current prediction to help with later predictions. It would need to be something like... recognizing that the bad prediction encodes information that will predictably help later predictions enough to satisfy the model's learned values.
That last bit is pretty tough. It takes as an assumption that the model already has instrumentality - it values something and is willing to take not-immediately-rewarding steps to acquire that value.
No raw GPT-like model has exhibited anything like that behavior to my knowledge. It seems difficult for it to serve the training objective better than not doing that, so this is within expectations. I think other designs could indeed move closer to that behavior by default. The part I want to understand better is what encourages and discourages models to have those more distant goals in the first place, so that we could make some more grounded predictions about how things happen at extreme scales, or under other training conditions (like, say, different forms of RLHF).
That makes a lot of sense, and I should have considered that the training data of course can't be influenced by the model's predictions. I didn't even consider RLHF - I think there are definitely behaviors where models will intentionally avoid predicting text they "know" will result in a continuation that will be punished. This is a necessity, as otherwise models would happily continue with some idea before abruptly ending it because it was too similar to something punished via RLHF.
I think this means that these "long term thoughts" are encoded into the predictive behavior of the model during training, rather than any sort of meta learning. An interesting experiment would be including some sort of token that indicates RLHF will or will not be used when training, then seeing how this affects the behavior of the model.
For example, apply RLHF normally, except in the case that the token [x] appears. In that case, do not apply any feedback - this token directly represents an "out" for the AI.
You might even be able to follow it through the network and see what effects the feedback has.
Whether this idea is practical or not requires further thought. I'm just writing it now, late at night, because I figure it's useful enough to possibly be made into something meaningful.
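To make the idea a bit more concrete, here's a minimal sketch of what the opt-out masking might look like during the feedback phase. Everything here is hypothetical - the token ID, function name, and loss shapes are just illustrative, not any particular RLHF implementation:

```python
import torch

# Hypothetical ID for the [x] "opt-out" token described above.
OPT_OUT_TOKEN_ID = 50257

def masked_feedback_loss(token_ids: torch.Tensor, per_sequence_loss: torch.Tensor) -> torch.Tensor:
    """Zero out the feedback (RLHF) loss for any sequence containing the opt-out token.

    token_ids:         (batch, seq_len) sampled token IDs for each sequence.
    per_sequence_loss: (batch,) feedback loss per sequence (e.g. -reward * logprob).
    """
    # True for sequences that contain [x] anywhere; those receive no feedback signal.
    opted_out = (token_ids == OPT_OUT_TOKEN_ID).any(dim=-1)
    # Multiplying by zero removes both the loss and its gradient contribution.
    return (per_sequence_loss * (~opted_out).float()).mean()
```

Comparing how often the tuned model reaches for [x] under different prompts, and tracing activations around it, would be one way to "follow it through the network" as suggested above.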
An interesting experiment would be including some sort of token that indicates RLHF will or will not be used when training, then seeing how this affects the behavior of the model.
Yup! This is the kind of thing I'd like to see tried.
There are quite a few paths to fine-tuning models, and it isn't clear to what degree they differ along the axis of instrumentality (or agency more generally). Decision transformers are a half step away from your suggestion. For example, "helpfulness" could be conditioned on with a token expressing the degree of helpfulness (as determined by reward-to-go on that subtask, provided through RLHF). It turns out that some other methods of RL can be interpreted as a form of conditioning, too.
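A rough sketch of that conditioning trick, with hypothetical control tokens and reward bins (not any particular implementation), might look like this:

```python
# Hypothetical scheme: prepend a control token encoding the (binned) reward-to-go a
# trajectory earned, then train with ordinary next-token prediction. At inference time,
# prepend the highest bin to ask for "very helpful" behavior - the decision-transformer trick.
HELPFULNESS_BINS = ["<|helpful_0|>", "<|helpful_1|>", "<|helpful_2|>", "<|helpful_3|>"]

def reward_to_control_token(reward_to_go: float, max_reward: float = 1.0) -> str:
    """Map a scalar reward-to-go (e.g. from an RLHF reward model) to a control token."""
    frac = max(0.0, min(reward_to_go / max_reward, 1.0))
    bin_index = min(int(frac * len(HELPFULNESS_BINS)), len(HELPFULNESS_BINS) - 1)
    return HELPFULNESS_BINS[bin_index]

def make_training_example(prompt: str, completion: str, reward_to_go: float) -> str:
    # The model just learns p(completion | control token, prompt); no RL-style update needed.
    return f"{reward_to_control_token(reward_to_go)} {prompt}{completion}"
```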
A brief nonexhaustive zoo of options:
I really want to see more experiments that would check this kind of stuff. It's tough to get information about behavior in extreme scales, but I think there are likely interesting tidbits to learn even in toys. (This is part of what I'm working toward at the moment.)
You could describe the behavior of an untuned GPT-like model[1] using a (peculiar) utility function. The fact that the loss function and training didn't explicitly involve a reward function doesn't mean a utility function can't represent what's learned, after all.
Coming from the opposite direction, you could also train a predictor using RL: choose a reward function and an update procedure which is equivalent to approximating the supervised loss function's gradient with numerical sampling. It'll tend to be much less efficient to train (and training might collapse sometimes), but it should be able to produce an equivalent result in the limit.
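To spell that out, here's a deliberately inefficient sketch: a REINFORCE-style update on next-token prediction with reward 1 for sampling the correct token. In expectation it ascends the probability of the correct token, so (per the caveats above) it should land somewhere equivalent to the supervised predictor in the limit. The model and optimizer interfaces here are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def reinforce_prediction_step(model, optimizer, input_ids, target_ids):
    """One RL-flavored update for a next-token predictor.

    Rather than differentiating cross-entropy against the ground-truth token,
    sample a token, reward 1 if it matches the target (0 otherwise), and apply
    the score-function (REINFORCE) estimator. This is far noisier than the
    supervised gradient, but it optimizes toward the same kind of predictor.
    """
    logits = model(input_ids)                 # assumed shape: (batch, vocab) for the next token
    log_probs = F.log_softmax(logits, dim=-1)

    sampled = torch.multinomial(log_probs.exp(), num_samples=1).squeeze(-1)   # (batch,)
    reward = (sampled == target_ids).float()  # 1 when the sampled token was "correct"

    # REINFORCE: maximize E[reward]  <=>  minimize -reward * log p(sampled token)
    sampled_log_prob = log_probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
    loss = -(reward * sampled_log_prob).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```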
And yet... trying to interpret simulators as agents with utility functions seems misleading. Why?
Instrumentality is why some agents seem more "agenty"
An RL-trained agent that learns to play a game that requires thousands of intermediate steps before acquiring reward must learn a policy that is almost entirely composed of intermediate steps in order to achieve high reward.
A predictor that learns to predict an output distribution which is graded immediately takes zero external intermediate steps. There is no apparent incentive to develop external instrumental capabilities that span more than a single prediction.
I'm going to call the axis spanning these two examples instrumentality: the degree to which a model takes instrumental actions to achieve goals.
I think this is at the heart of why the utility-maximizing agent lens seems so wrong for simulators: agents seem agenty when they take intermediate steps, which they do not value intrinsically, to accomplish some goal. Things like simulators don't - the action is the goal.
The "nice utility function you got there; would be a shame if someone... maximized it" framing runs into the reality that simulator-like models aren't learning to take actions to secure a higher probability of correct predictions or other instrumental behaviors. Interpreting their apparent utility function as implying the model "values" correct predictions doesn't fit because the model never visibly performs instrumental actions. That simply isn't what the model learned to implement, and so far, that fact seems natural: for all existing pure predictor implementations that I'm aware of, the loss basin that the model fell into corresponded to minimal instrumentality rather than more conventional forms of agentic "values" matching the reward function. This is evidence that these behaviors fall into different loss basins and that there are sufficient obstacles between them that SGD cannot easily cross.[2]
Instrumentality is a spectrum and a constraint
Training objectives that map onto utility functions with more degrees of freedom for instrumental action - like RL applied to a long task in an extremely open-ended environment - will tend to be subject to more surprising behavior after optimization.
Instrumentality is not binary; it is space in which the optimizer can run free. If you give it enough space, argmax will find something strange and strangely capable.
The smaller that space, the fewer options the optimizer has for breaking things.[3]
So, for a given capability, it matters how the capability is trained. You could approximate the prediction loss gradient with numerical sampling in RL and get an inefficiently trained version of the same model, as in the introduction, but what happens if you add in more space for instrumental actions? A couple of examples:
This model clearly takes instrumental actions which are more consistent with "valuing" predictions. Would it take instrumental actions to make good predictions outside of an episode? Maybe! Hard to say! Sure seems far more likely to fall into that basin than the direct output-then-evaluation version.
Even with a "good" reward function, the learned path to the reward is still not an ignorable detail in an agent. It can be the most important part!
Instrumentality and the naturalness of agentic mesaoptimizers
In the pathological case, some degree of instrumentality is still possible for predictors. For example, predicting tokens corresponding to a self-fulfilling prophecy when that prediction happens to be one of a set of locally loss-minimizing options is not fought by the loss function, and a deceptive model that performs very well on the training distribution could produce wildly different results out of distribution.
I think one of the more important areas of research in alignmentland right now is trying to understand how natural those outcomes are.
In this context, how would the process of developing instrumentality that reaches outside the scope of a forward pass pay rent? What gradients exist that SGD can follow that would find that kind of agent? That would seem to require that there is no simpler or better lower-instrumentality implementation that SGD can access first.
I am not suggesting that misaligned instrumental agentic mesaoptimizers are impossible across the board. I want to see how far away they are and what sorts of things lead to them with higher probability. Understanding this better seems critical.
Internal instrumentality
Requiring tons of intermediate steps before acquiring reward incentivizes instrumentality. The boundaries of a single forward pass aren't intrinsically special; each pass is composed of a bunch of internal steps which must be at least partly instrumental. Why isn't that concerning?
Well, as far as I can tell, it is! There does not appear to be a deep difference in kind. It seems to come down to what the model is actually doing, what kind of steps are being incentivized, and the span across which those steps operate.
A predictor mapping input tokens to an output distribution for the next token, at scales like our current SOTA LLMs, is pretty constrained - the space between the inputs and the output is small. It's not obviously considering a long episode of traditionally agentic behaviors within each forward pass, because it doesn't have reason to. The internal steps seem like they map onto something like raw machinery.
If I had to guess, the threshold past which you see models trained with a prediction objective start to develop potentially concerning internal instrumentality is very high. It's probably well beyond the sweet spot for most of our intuitive use cases, so we haven't seen any sign of it yet.
What would happen if you tried to go looking for it? What if, rather than a short jump mapping input tokens to the next token, you set up a contrived pipeline that directly incentivized an internal simulation of long episodes?
My guess is that current models still don't have the scale to manage it, but in the future, it doesn't seem out of the realm of possibility that some sort of iterated distillation process that tries to incrementally predict more and more at once could yield a strong predictor that learns a worrying level of internal instrumentality. Could that be a path to the nasty type of agentic mesaoptimizer in simulators? Can you avoid this by being clever in how you choose the level of simulation[5]? I'm guessing yes, but I would like to do better than guessing!
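For concreteness, here is one entirely hypothetical shape such an iterated distillation pipeline could take. The `teacher.generate` and `student(prefixes, num_parallel_predictions=k)` interfaces are assumptions made up for this sketch, not real APIs:

```python
import torch
import torch.nn.functional as F

def distill_k_step_predictor(student, teacher, prefixes, k, optimizer):
    """One round of the hypothetical "predict more and more at once" setup.

    The teacher rolls out k tokens autoregressively; the student is trained to emit
    all k next-token distributions from the prefix in a single forward pass, i.e. to
    internalize a k-step episode per forward pass. Growing k across rounds is the
    "incrementally predict more at once" part.
    """
    with torch.no_grad():
        # Assumed interface: returns (batch, k) token IDs continuing each prefix.
        rollout = teacher.generate(prefixes, max_new_tokens=k)

    # Assumed interface: student maps a prefix to (batch, k, vocab) logits in one pass.
    student_logits = student(prefixes, num_parallel_predictions=k)
    loss = F.cross_entropy(
        student_logits.reshape(-1, student_logits.size(-1)),
        rollout.reshape(-1),
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```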
How is this different from myopia?
There is some overlap![6] In this framing, a model which is fully myopic in terms of outputs/actions (regardless of whether it considers the future in some way) is minimally instrumental. The degree of instrumentality exhibited by a model is potentially independent of how myopic a model's perception or cognition is.
Myopia is sometimes interpreted (reasonably, given its usual visual meaning) as implying a model is, well, short-sighted. I have seen comments that GPT-like models aren't myopic because (for example) they frequently have to model aspects of the future beyond the next token to successfully predict that next token, or because training may encourage earlier tokens to share their computational space with the predictions associated with later tokens. Both of these things are true, but it's valuable to distinguish them from myopic (minimally instrumental) outputs. It seems like having fully separate words for these things would be helpful.
Why is this framing useful?
The less instrumentality a model has, the less likely it is to fight you. Making your internal reasoning opaque to interpretation is instrumental to deception; a noninstrumental model doesn't bother. Instrumentality, more than the other details of the utility function, is the directly concerning bit here.
This doesn't imply that noninstrumental models are a complete solution to alignment, nor that they can't have major negative consequences, just that the path to those negative consequences is not being adversarially forced as an instrumental act of the model. The model's still doing it, but not in service of a goal held by the model.
That's cold comfort for the person who oopsies in their use of such a model, or who lives in a world where someone else did, but I think there remains serious value there. An adversarial high-instrumentality model actively guides the world towards its malign goal; the noninstrumental model doesn't.
Without the ability to properly design larger bounds for an optimizer to work within, low instrumentality appears to be a promising - or necessary - part of corrigibility.
For the rest of the post, I'll refer to this approximately by using the terms simulator or predictor.
At least not yet.
This also means that minimal-instrumentality training objectives may suffer from reduced capability compared to an optimization process where you had more open, but still correctly specified, bounds. This seems like a necessary tradeoff in a context where we don't know how to correctly specify bounds.
Fortunately, this seems to still apply to capabilities at the moment - the expected result for using RL in a sufficiently unconstrained environment often ranges from "complete failure" to "insane useless crap." It's notable that some of the strongest RL agents are built off of a foundation of noninstrumental world models.
Whether the reward reflects a fixed ground truth matters for how much instrumental action is encouraged: if there is a fixed ground truth being predicted, then the predictor has fewer options for making later predictions easier, because the expected later tokens do not change. (Notably, even with a fixed ground truth, there is still space for worsening an early token if doing so is predicted to help later tokens enough to compensate for the loss; it's just less space.)
For example, probably don't try to predict the output of an entire civilization in one go. Maybe try predicting reasoning steps or something else auditable and similarly scaled.
Especially because myopia refers to a lot of different things in different contexts!