Status: Vague, sorry. The point seems almost tautological to me, and yet also seems like the correct answer to the people going around saying “LLMs turned out to be not very want-y, when are the people who expected 'agents' going to update?”, so, here we are.
Okay, so you know how AI today isn't great at certain... let's say "long-horizon" tasks? Like novel large-scale engineering projects, or writing a long book series with lots of foreshadowing?
(Modulo the fact that it can play chess pretty well, which is longer-horizon than some things; this distinction is quantitative rather than qualitative and it’s being eroded, etc.)
And you know how the AI doesn't seem to have all that much "want"- or "desire"-like behavior?
(Modulo, e.g., the fact that it can play chess pretty well, which indicates a certain type of want-like behavior in the behaviorist sense. An AI's ability to win no matter how you move is the same as its ability to reliably steer the game-board into states where you're check-mated, as though it had an internal check-mating “goal” it were trying to achieve. This is again a quantitative gap that’s being eroded.)
Well, I claim that these are more-or-less the same fact. It's no surprise that the AI falls down on various long-horizon tasks and that it doesn't seem all that well-modeled as having "wants/desires"; these are two sides of the same coin.
Relatedly: to imagine the AI starting to succeed at those long-horizon tasks without imagining it starting to have more wants/desires (in the "behaviorist sense" expanded upon below) is, I claim, to imagine a contradiction—or at least an extreme surprise. Because the way to achieve long-horizon targets in a large, unobserved, surprising world that keeps throwing wrenches into one's plans, is probably to become a robust generalist wrench-remover that keeps stubbornly reorienting towards some particular target no matter what wrench reality throws into its plans.
This observable "it keeps reorienting towards some target no matter what obstacle reality throws in its way" behavior is what I mean when I describe an AI as having wants/desires "in the behaviorist sense".
I make no claim about the AI's internal states and whether those bear any resemblance to the internal state of a human consumed by the feeling of desire. To paraphrase something Eliezer Yudkowsky said somewhere: we wouldn't say that a blender "wants" to blend apples. But if the blender somehow managed to spit out oranges, crawl to the pantry, load itself full of apples, and plug itself into an outlet, then we might indeed want to start talking about it as though it has goals, even if we aren’t trying to make a strong claim about the internal mechanisms causing this behavior.
If an AI causes some particular outcome across a wide array of starting setups and despite a wide variety of obstacles, then I'll say it "wants" that outcome “in the behaviorist sense”.
Why might we see this sort of "wanting" arise in tandem with the ability to solve long-horizon problems and perform long-horizon tasks?
Because these "long-horizon" tasks involve maneuvering the complicated real world into particular tricky outcome-states, despite whatever surprises and unknown-unknowns and obstacles it encounters along the way. Succeeding at such problems just seems pretty likely to involve skill at figuring out what the world is, figuring out how to navigate it, and figuring out how to surmount obstacles and then reorient in some stable direction.
(If each new obstacle causes you to wander off towards some different target, then you won’t reliably be able to hit targets that you start out aimed towards.)
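A toy sketch of that parenthetical, under loudly stated assumptions: the grid world, the random "shove" standing in for an obstacle, and every name and parameter below (`reaches_target`, `success_rate`, `shove_prob`, and so on) are hypothetical illustrations, not a model of any real AI system. It compares an agent that re-aims at its original target after every shove with one that adopts a new random aim each time it is shoved.

```python
import random


def sign(x: int) -> int:
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def reaches_target(reorients: bool, rng: random.Random,
                   steps: int = 200, shove_prob: float = 0.1) -> bool:
    """Run one episode; return True if the agent ever stands on the target."""
    target = (rng.randint(-20, 20), rng.randint(-20, 20))
    pos = (rng.randint(-20, 20), rng.randint(-20, 20))
    aim = target  # both kinds of agent start out aimed at the target
    for _ in range(steps):
        if rng.random() < shove_prob:
            # A "wrench": the world shoves the agent a few cells off course.
            pos = (pos[0] + rng.randint(-5, 5), pos[1] + rng.randint(-5, 5))
            if not reorients:
                # The drifting agent wanders off toward some different target.
                aim = (rng.randint(-20, 20), rng.randint(-20, 20))
        # Move one cell (diagonal moves allowed) toward the current aim.
        pos = (pos[0] + sign(aim[0] - pos[0]), pos[1] + sign(aim[1] - pos[1]))
        if pos == target:
            return True
    return False


def success_rate(reorients: bool, episodes: int = 1000, seed: int = 0) -> float:
    """Fraction of episodes in which the agent reaches the original target."""
    rng = random.Random(seed)
    hits = sum(reaches_target(reorients, rng) for _ in range(episodes))
    return hits / episodes


if __name__ == "__main__":
    print("keeps reorienting:", success_rate(True))
    print("wanders off:      ", success_rate(False))
```

In this toy setup the reorienting agent should reach its target in nearly every episode, while the drifting one mostly succeeds only when no shove arrives in time or when it stumbles onto the target by accident. The sketch only shows the behaviorist pattern in miniature and says nothing about internal mechanisms.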
If you're the sort of thing that skillfully generates and enacts long-term plans, and you're the sort of planner that sticks to its guns and finds a way to succeed in the face of the many obstacles the real world throws your way (rather than giving up or wandering off to chase some new shiny thing every time a new shiny thing comes along), then the way I think about these things, it's a little hard to imagine that you don't contain some reasonably strong optimization that strategically steers the world into particular states.
(Indeed, this connection feels almost tautological to me, such that it feels odd to talk about these as distinct properties of an AI. "Does it act as though it wants things?" isn’t an all-or-nothing question, and an AI can be partly goal-oriented without being maximally goal-oriented. But the more the AI’s performance rests on its ability to make long-term plans and revise those plans in the face of unexpected obstacles/opportunities, the more consistently it will tend to steer the things it's interacting with into specific states—at least, insofar as it works at all.)
The ability to keep reorienting towards some target seems like a pretty big piece of the puzzle of navigating a large and complex world to achieve difficult outcomes.
And this intuition is backed up by the case of humans: it's no mistake that humans wound up having wants and desires and goals—goals that they keep finding clever new ways to pursue even as reality throws various curveballs at them, like “that prey animal has been hunted to extinction”.
These wants and desires and goals weren’t some act of a god bequeathing souls into us; this wasn't some weird happenstance; having targets like “eat a good meal” or “impress your friends” that you reorient towards despite obstacles is a pretty fundamental piece of being able to eat a good meal or impress your friends. So it's no surprise that evolution stumbled upon that method, in our case.
(The implementation specifics in the human brain—e.g., the details of our emotional makeup—seem to me like they're probably fiddly details that won’t recur in an AI that has behaviorist “desires”. But the overall "to hit a target, keep targeting it even as you encounter obstacles" thing seems pretty central.)
The above text vaguely argues that doing well on tough long-horizon problems requires pursuing an abstract target in the face of a wide array of real-world obstacles, which involves doing something that looks from the outside like “wanting stuff”. I’ll now make a second claim (supported here by even less argument): that the wanting-like behavior required to pursue a particular training target X, does not need to involve the AI wanting X in particular.
For instance, humans find themselves wanting things like good meals and warm nights and friends who admire them. And all those wants added up in the ancestral environment to high inclusive genetic fitness. Observing early hominids from the outside, aliens might have said that the humans are “acting as though they want to maximize their inclusive genetic fitness”; when humans then turn around and invent birth control, it’s revealed that they were never actually steering the environment toward that goal in particular, and instead had a messier suite of goals that correlated with inclusive genetic fitness, in the environment of evolutionary adaptedness, at that ancestral level of capability.
Which is to say, my theory says “AIs need to be robustly pursuing some targets to perform well on long-horizon tasks”, but it does not say that those targets have to be the ones that the AI was trained on (or asked for). Indeed, I think the actual behaviorist-goal is very unlikely to be the exact goal the programmers intended, rather than (e.g.) a tangled web of correlates.
A follow-on inference from the above point is: when the AI leaves training, and it’s tasked with solving bigger and harder long-horizon problems in cases where it has to grow smarter than ever before and develop new tools to solve new problems, and you realize finally that it’s pursuing neither the targets you trained it to pursue nor the targets you asked it to pursue—well, by that point, you've built a generalized obstacle-surmounting engine. You've built a thing that excels at noticing when a wrench has been thrown in its plans, and at understanding the wrench, and at removing the wrench or finding some other way to proceed with its plans.
And when you protest and try to shut it down—well, that's just another obstacle, and you're just another wrench.
So, maybe don't make those generalized wrench-removers just yet, until we do know how to load proper targets in there.
I agree with the main point of the post. But I specifically disagree with what I see as an implied assumption behind the remark about a "quantitative gap": namely, that the gap is such that the ability to play chess better would correlate with being higher in the most relevant quantity.
Something that chooses good chess moves can be seen as "wanting" its side to do well within the chess game context. But that does not imply anything at all outside of that context. If it's going to be turned off unless it makes a particular next move, it doesn't have to take that into account. It can just play the best chess move regardless, and ignore the out-of-context info about being shut down.
LLMs aren't trained directly to achieve results in a real-world context. They're trained:
To be sure, at least item 1 above would eventually result in selecting outputs to achieve results if taken to the limit of infinite computing power, etc., and in the same limit item 2 would result in humans being mind-controlled.
But both these items naturally better reward the LLM for appearing agentic than for actually being agentic (being agentic = actually choosing outputs based on their effect on the future of the real world). The reward for actually being agentic, up to the point that it is agentic enough to subvert the training regime, is entirely downstream of the reward for appearance of agency.
Thus, I tend to expect the appearance of agency in LLMs to be Goodharted and discount apparent evidence accordingly.
Other people look at the same evidence and think an LLM might, by contrast, be even more agentic than the evidence suggests, due to strategic deception. And to be sure, at some agency level you might get consistent strategic deception to lower the apparent agency level.
But I think more like: at the agency level I've already discounted it down to, it really doesn't look likely that it would engage in strategic deception to consistently lower its apparent agency level. Yes, I'm aware of, e.g., the recent paper reporting that LLMs engage in strategic deception. But they are doing what looks like strategic deception when presented with a pretend, text-based scenario. This is fully compatible with them following story-logic as they learned it from training. Just as a chess AI doesn't have to care about anything outside the chess context, the LLM doesn't have to care about anything outside the story-logic context.
To be sure, story-logic by itself could still be dangerous. Any real-world effect could be obtained by story-logic within a story with intricate enough connections to the real world, and in some circumstances it wouldn't have to be that intricate.
And in this sense - the sense that some contexts are bigger and tend to map onto real-world dangerous behaviour better than others - the gap can indeed be quantitative. It's just that it's a different dimension of variation in agency from the ability to select the best actions in a particular context.
I'm not convinced that LLMs are currently selecting actions to affect the future within any context larger than this story-level context. That context is a large enough domain to carry some risk (in particular, I'm concerned with the ability to write code to help make a new or modified AI targeting a larger context), but I think it is still likely well short of causing an LLM to choose actions to take over the world (and indeed, well short of making it particularly good at solving long-term tasks in general) without it making that new or modified AI first.