I feel like a lot of the objections around agency are answered by the Clippy scenario, and gwern's other essay on the topic, Tool AIs want to be Agent AIs. The AGI need not start with any specific goal or agency. However, the moment it starts executing a prompt that requires it to exhibit agency or goal-directed behavior, it will. And at that point, unless the goal is set up such that the agent pursues it in a manner compatible with the continued existence of humanity over the long term, humanity is doomed. Crafting a goal in this manner and making sure that the AGI pursues this goal and no others are both very difficult tasks individually. Together, they are nigh impossible. Thus, with very high likelihood, the moment the AGI either receives or discovers a prompt that requires it to behave like an agent, humanity is doomed.
I agree that AGIs need to possess a world model, but I disagree that we will be able to distinguish an AI that possesses a world model from an AI that "merely" knows word associations. The internals of an AI are opaque, despite the best efforts of interpretability research to shine light on the giant inscrutable matrices. An AI with a world model, I predict, won't look much different from an AI without a world model. Maybe some weights will be different, and some update functions will have changed. Will we be able to point to any specific weight or combination of weights and say, "Aha, the AI has developed a world model!" Probably not, no more than we can look at any specific set of neurons in the human brain and say, "Aha, there lies the seat of consciousness!"
Given the two points above, we may not be able to tell when any given AI passes the threshold to becoming an AGI. And once an AI has passed the threshold, we won't necessarily be able to control which prompt causes the AI to begin simulating an agent. Given those two, I fail to see why we shouldn't behave as if AGI is on a short timeline. After all, if one is approaching a cliff from an unknown distance in the darkness, the wise thing to do is not to assume that the cliff is still miles away and stride boldly into the unknown. Instead it behooves us to probe carefully, trying to determine whether there's solid ground or empty space ahead.
However, the moment it starts executing a prompt that requires it to exhibit agency or goal-directed behavior, it will.
This seems to make a jump from "the prompt requires agency to execute well" to "the AI develops the cognitive capability for agency"?
I read Sarah's point as being that current AIs are fundamentally incapable of having agency (as she defines it). If that's the case, it doesn't matter whether the prompt requires agency to be executed well: the AI will simply fail to execute it well.
This seems to make a jump from “the prompt requires agency to execute well” to “the AI develops the cognitive capability for agency”?
In my scenario the AI already has the cognitive capability for agency. It's just that the capability is latent until the right prompt causes it to be expressed. We've seen early examples of this with ChatGPT, where, if you ask it to plan something or think about adversarial scenarios, it will demonstrate agent-ish behavior.
My point is that while current AIs are probably incapable of having agency, future AIs probably will have that capability. Furthermore, we may not be able to tell the difference between an AI that is capable of building a world model and engaging in long-term goal-directed behavior and the current AI systems that mostly aren't capable of either.
This post made me pretty sad, because I think it focuses on precisely the wrong parts of the AI doom case (that AGI is near), and concedes all of the points she shouldn't concede (that alignment is hard).
If I were like Sarah and thought that alignment is super hard but AGI is a long way off, I would probably still consider myself a "doomer" and would be very worried.
Agency requires reasoning about the consequences of one’s actions. "I need to do such-and-such, to get to my goal." This requires counterfactual, causal reasoning.
Have you ever tried to explain the difference between correlation and causation to someone who didn't understand it? I'm not convinced that this is even something humans innately have, rather than some higher-level correction by systems that do that.
A computer chess engine trained exclusively on one format for representing the game would generally not be able to transfer its knowledge to a different format.
You can hook a chess-playing network up to a vision network and have it play chess using images of boards - it's not difficult. Perhaps a better example is that language models can be easily coupled to image models to get prompted image generation. You can also translate between language pairs that didn't have direct translations in the training data.
thus we do not know how to build machines that can pursue goals coherently and persistently
This post seems rather specific to LLMs for how much it's trying to generalize; I think there's been more progress on that than Sarah seems to realize.
Have you ever tried to explain the difference between correlation and causation to someone who didn't understand it?
Understanding a high-level description of an abstract concept is different from having the lower-level cognitive machinery that can apply the concept intuitively; you can have one without the other (this goes in both directions). One classic example is that catching a flying ball with your hand requires your brain to do something like solving a set of differential equations in order to predict the path of the ball... but this doesn't imply that people who are good at catching balls would be any good at solving explicit sets of differential equations. (Nor that people who are good at math would be any good at catching balls, for that matter.)
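To make the analogy concrete: ignoring air resistance, the ball's flight obeys something like

$$\ddot{x} = 0, \qquad \ddot{y} = -g \quad\Longrightarrow\quad x(t) = x_0 + v_x t, \qquad y(t) = y_0 + v_y t - \tfrac{1}{2} g t^2,$$

and the catcher's brain has to implicitly work out when and where $y(t)$ comes back down to hand height, without ever representing these equations explicitly.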
'Have you ever tried to explain the difference between correlation and causation to someone who didn't understand it? I'm not convinced that this is even something humans innately have, rather than some higher-level correction by systems that do that.'
You are outside and feel wind on your face. In front of you, you can see trees swaying in the wind. Did the swaying of the trees cause the wind? Or did the wind cause the trees to sway?
The cat bats at a moving toy. Usually he misses it. If he hits it, it usually makes a noise, but not always. The presence of the noise is more closely correlated with the cat successfully hitting the toy than his batting at it is. But did the noise cause the cat to hit the toy, or did the cat's batting at the toy cause the hit?
The difference between correlation and causation is something we humans have a great sense for, so these questions seem really stupid. But they're actually very challenging to answer using only observations (without being able to intervene).
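Here is a minimal sketch of that asymmetry, assuming a toy linear model in which wind drives sway (the coefficients and variable names are made up for illustration). From observational data alone the correlation is symmetric between the two variables, but interventions are not:

```python
import numpy as np

rng = np.random.default_rng(0)

def observe(n=10_000):
    """Observational data generated by: wind causes sway (plus noise)."""
    wind = rng.normal(size=n)
    sway = 0.8 * wind + 0.2 * rng.normal(size=n)
    return wind, sway

wind, sway = observe()
# Correlation is symmetric, so it says nothing about which way causation runs.
print(np.corrcoef(wind, sway)[0, 1])   # ~0.97
print(np.corrcoef(sway, wind)[0, 1])   # the same number

# Interventions break the symmetry.
def shake_the_trees(n=10_000):
    """do(sway := 3): wind is unaffected, because wind does not depend on sway."""
    wind = rng.normal(size=n)
    sway = np.full(n, 3.0)
    return wind, sway

def turn_on_a_giant_fan(n=10_000):
    """do(wind := 3): sway shifts, because sway does depend on wind."""
    wind = np.full(n, 3.0)
    sway = 0.8 * wind + 0.2 * rng.normal(size=n)
    return wind, sway

print(shake_the_trees()[0].mean())      # ~0.0: forcing the trees to sway leaves the wind alone
print(turn_on_a_giant_fan()[1].mean())  # ~2.4: forcing the wind moves the trees
```

An observer who can only watch the system sees the same joint statistics either way; only the ability to force one variable and check what happens to the other reveals the causal direction.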
I was talking about time-shifted correlation vs. causation. That's what people get confused about.
Mark Teixeira wears 2 different socks when playing baseball. That's because he did that once and things went better. Why do you think he does that?
You can hook a chess-playing network up to a vision network and have it play chess using images of boards - it's not difficult.
I think you have to be careful here. In this setup, you have two different AIs: a vision network that classifies images, a chess AI that plays chess, and presumably some connecting code that translates the output of the vision network into a format suitable for the chess player.
I think what Sarah is referring to is that if you tried to hook the images up directly to the chess engine, it wouldn't be able to figure them out, because reading images is not something it was trained to do.
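As a rough sketch of what that "connecting code" might look like: the vision side below is a hypothetical placeholder (`board_image_to_fen` is not a real model), while the chess side uses the real python-chess library driving a UCI engine such as Stockfish, assumed to be installed:

```python
import chess
import chess.engine

def board_image_to_fen(image_path: str) -> str:
    """Hypothetical vision network: photo of a board -> FEN string."""
    raise NotImplementedError("stand-in for a trained image-to-FEN model")

def move_from_photo(image_path: str, engine_path: str = "stockfish") -> chess.Move:
    fen = board_image_to_fen(image_path)   # output of the vision network
    board = chess.Board(fen)               # glue: translate FEN into the chess side's input format
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    try:
        result = engine.play(board, chess.engine.Limit(time=0.1))
    finally:
        engine.quit()
    return result.move
```

Neither component understands the other's representation; all of the translation lives in the glue, which is Sarah's point about transfer.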
I honestly think of specialised models not as brains in their own right, but as cortexes. Pieces of a brain. But you can obviously hook them up together to do all sorts of things (for example, a multimodal LLM could take an image of a board and turn it into a series of coordinates and piece names). The one caveat is that these models would all exist one level below the emergent simulacra that have actual agency. They're the book or the operator or the desk in the Chinese Room. But it's the Room as a whole that is intelligent and agentic.
Or in other words: our individual neurons don't optimise for world-referenced goals either. Their goal is just "fire if stimulated so-and-so".
Yes, and networks of sensory neurons apparently minimize prediction error in a way similar to an LLM doing next-word prediction, but with neurons also minimizing prediction error across hierarchies. They are obviously not agents individually, but they combine into one.
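A toy illustration of "minimizing prediction error across hierarchies", in the spirit of predictive-coding models: everything here is a made-up, linear, two-level example (the sizes, weights, and learning rate are arbitrary), not a claim about how real neurons are wired:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-level linear predictive-coding sketch. Each level tries to predict the
# activity of the level below it, and every level's state is nudged to reduce
# the prediction errors around it.
W1 = 0.5 * rng.normal(size=(8, 4))   # level 1 predicts the sensory input from r1
W2 = 0.5 * rng.normal(size=(4, 2))   # level 2 predicts r1 from r2

x = rng.normal(size=8)               # a sensory input
r1, r2 = np.zeros(4), np.zeros(2)    # internal states of the two levels
lr = 0.05

def errors(r1, r2):
    e0 = x - W1 @ r1                 # bottom-level prediction error
    e1 = r1 - W2 @ r2                # between-level prediction error
    return e0, e1, 0.5 * (e0 @ e0 + e1 @ e1)

print("initial error:", errors(r1, r2)[2])
for _ in range(200):
    e0, e1, _ = errors(r1, r2)
    r1 += lr * (W1.T @ e0 - e1)      # reduce error below, honor the prediction from above
    r2 += lr * (W2.T @ e1)           # reduce error at the level below
print("settled error:", errors(r1, r2)[2])
```

No individual unit is pursuing a world-referenced goal; each level just chips away at local prediction error, yet the stack as a whole settles into a representation of the input.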
Really great post.
The two parts that stand out most to me are the Causality and Ontology Change sections.
Regarding Causality, I agree that there will be little investment into robotics as a mechanism for intervening on the world and building causal models. However, I don't see why practicing on videogames wouldn't produce this sort of interventionist data, or why AIs wouldn't learn causal models from it. And it doesn't seem that expensive to create the data. It's already happened a bunch with AIs trained on multiplayer videogames, and later on it will get cheaper, and overall investment will continue to increase by orders of magnitude. So I expect this will happen.
Regarding Ontology Change, I currently think of this argument as a variant of "The AI needs to understand and model itself well enough to be able to understand when its concepts are mistaken. To build this sort of self-aware AGI requires lots of new insights about how agents work that we do not have. Therefore we cannot build one."
From an x-risk reduction perspective, I try to think of the worst case. Even if I believe that machine learning systems will not find programs with the sorts of self-reflective properties an agent needs in order to notice that its concepts are mistaken (which I am honestly uncertain about), I still think that 15 years from now the world will be massively overrun with machine learning, the smartest minds in the world will be excited about AGI, and the top researchers will be working on things related to this question. This is my main counterargument to a lot of arguments that we're not there yet theoretically: I expect investment into AI research over the next decade to increase massively relative to today.
However, the last four paragraphs of the post are a direct response to this, and they feel notably hopeful to me.
So, the kind of advance we’re worried about must come from the rare maverick dreamer types who have their sights fixed on a distant vision of “true” AGI and are willing to spend years scribbling in the wilderness to get there.
Such an advance is of course not impossible -- but it’s a very different threat model from the armies of machine learning researchers and engineers making rapid incremental progress on deep neural nets because they are immediately rewarded with professional success for doing so.
You could probably find all the intellectually generative “AI dreamer” types and talk to them individually -- those sorts of people tend to share their ideas in writing.
If the lines of communication remain open -- if the current AI debate doesn’t tribalize to the point that “pro-AI” and “anti-AI” factions hate each other and can’t have friendly discussions -- then it might be remarkably tractable to just, y’know, persuade a handful of individuals that they should maybe not work too hard to get the world to take notice of their theoretical ideas.
This is easily the best article against near-term AGI that I have read so far. The point about cross-ontology goal robustness is especially good.
This is a good post from Sarah Constantin explaining why her expected timeline to agentic AGI is long (> 10 years). Topics discussed include whether LLMs and other current major research directions will endow AIs with adequate world models, causal inference, and goal robustness across ontological shifts.