Expectation calibrator: freeform draft, posting partly to practice lowering my own excessive standards.
So, a few months back, I finally got around to reading Nick Bostrom's Superintelligence, a major player in the popularization of AI safety concerns. Lots of the argument was stuff I'd internalized a long time ago, reading the sequences and even living with AI alignment researchers for a few months. But I wanted to see the circa-2015 risk case laid out as thoroughly as possible, partly as an excursion to learn more about my own subculture, and partly to see if I'd missed anything crucial to why all of us consider future advanced AI systems so dangerous.
But for the most part, the main effect of reading that book was actually to finally shatter a certain story I'd been telling myself about the AI doom scenario, the cracks in which had been starting to show for several months prior. Reading Superintelligence for myself made me notice just how much of AGI X-risk concern grew out of an extremely specific model of how advanced AI was actually going to work, one which had received very little supporting evidence (and even some countervailing evidence) since the rise of large language models.
Of course, I'm talking about how, when it comes to explaining why AI systems are risky, Bostrom's book leans hard on expected utility maximization (like Yudkowsky's sequences before it). Bostrom does step into discussions about other risks from time to time, for example pointing out wireheading as a potential failure mode of reinforcement learning systems (which are importantly not AIXI-style utility maximizers and pose an at least slightly different set of risks). However, most of the book's discussion of AI risk frames the AI as having a certain set of goals from the moment it's turned on, and ruthlessly pursuing those to the best of its ability.
It'd be easy to get sidetracked here with a debate about whether or not current AI systems are on track to pose that exact kind of x-risk. Indeed, modern deep learning systems don't have explicit utility functions; instead they're trained with loss functions, which are plausibly quite different in their dynamics.
We could talk about MIRI's Risks from Learned Optimization paper and its arguments that deep learning systems could plausibly develop explicit optimization algorithms anyway. We could try to analyze that argument in more depth, or evaluate the empirical evidence for whether or not that's actually going on in current deep learning systems. And I do have a lot of thoughts to share on precisely that topic. But that's not what I want to focus on right now. Instead, I want to make a point about the genealogy of this debate. What Bostrom's book made me realize was that so much of AI x-risk concern came from a time and place where it was seemingly vastly more difficult to even imagine what general intelligence might look like, if not some kind of utility maximizer.
Case-in-point: Bostrom's discussion of x-risks from so-called tool AIs being used as software engineers. This is a particularly valuable example because modern AI systems, like Claude 3.5 Sonnet, are already quite damn good at software engineering, so we actually have a reality against which to compare some of Bostrom's speculations.
Here's the vision Bostrom lays out for how AGI-enhanced software engineering might look.
> With advances in artificial intelligence, it would become possible for the programmer to offload more of the cognitive labor required to figure out how to accomplish a given task. In an extreme case, the programmer would simply specify a formal criterion of what counts as success and leave it to the AI to find a solution. To guide its search, the AI would use a set of powerful heuristics and other methods to discover structure in the space of possible solutions. It would keep searching until it found a solution that satisfied the success criterion. The AI would then either implement the solution itself or (in the case of an oracle) report the solution to the user.
Notice that Bostrom imagines having to specify a formal criterion for what counts as a solution to one's programming problem. This seems like a clear relic of a culture in which explicit utility maximizers were by far the most conceivable form advanced AI systems could take, but that's largely changed by now. You can actually just use natural language to outline the code you want a Claude-like AI system to write, and it will do so, with a remarkable intuitive knack for "doing what you mean" (in contrast to the extreme rigidity of computer programs from decades past).
And this isn't trivial in terms of its implications for Bostrom's threat models. It actually rules out language models causing the first kind of catastrophe Bostrom imagines as following from AGI software engineers. Here he is discussing the user-specified formal criterion as roughly a kind of utility function, which an AI software engineer might malignantly maximize.
> There are (at least) two places where trouble could then arise. First, the superintelligent search process might find a solution that is not just unexpected but radically unintended. This could lead to a failure of one of the types discussed previously (“perverse instantiation,” “infrastructure profusion,” or “mind crime”). It is most obvious how this could happen in the case of a sovereign or a genie, which directly implements the solution it has found. If making molecular smiley faces or transforming the planet into paperclips is the first idea that the superintelligence discovers that meets the solution criterion, then smiley faces or paperclips we get.
Bostrom's second threat model is that the AI would paperclip the world in the process of trying to produce a solution, for example to get rid of sources of interference, rather than as a side effect of whatever its solution actually is. This would also not come about as a result of a language model relentlessly striving to fulfill a user's simple, natural-language request for a certain kind of computer program, contra Bostrom's vision. If deep learning systems could be made to ruthlessly paperclip the universe, it would be as a result of some other failure-mode, like wireheading, mesa-optimization, or extremely malicious prompt engineering. It wouldn't follow from the user mis-specifying an explicit utility function in the course of trying to use a language model normally.
The reason I'm bringing all of this up isn't that I think it blows up the AI risk case entirely. It's possible that current AI systems have their own, hidden "explicit goals" which will become dangerous eventually. And even if they don't, it's at least feasible that we might lose control of advanced AI systems for other reasons. However, the point I want to make is that many extremely high p(doom)s seem to have derived from a memeplex in which AGI that didn't look like a von Neumann-rational paperclip maximizer was barely even conceivable, a condition which no longer holds; this should correspond to weaker fears of human extinction. To paint a clearer picture of what I mean, let's go over some reasons to think that current deep learning systems might work differently than utility maximizers.
Firstly, modern AI researchers have developed at least partial stories for what might be going on inside deep learning systems. For instance, take the concept of word embeddings. At the very beginning of the algorithm run by language models, they turn each word from their input into a unique vector, embedded in a high-dimensional space. It turns out that, after the model has been trained a bit, these embeddings encode the meaning of each word in a quite intelligible manner. For example, if you take the vector for "uncle," subtract the "male" vector from it, and then add the "female" vector, you'll end up with a vector very close to the vector for "aunt".
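To make that concrete, here's a minimal sketch of the vector arithmetic, assuming the gensim library and its downloadable GloVe embeddings (any reasonably trained embedding set should show a similar pattern):

```python
# A rough sketch of word-vector arithmetic, assuming gensim's downloadable
# GloVe embeddings; exact neighbors vary with the embedding set you load.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first run

# "uncle" - "male" + "female" should land near "aunt"
print(vectors.most_similar(positive=["uncle", "female"], negative=["male"], topn=3))
```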
We also have some idea of what the later layers of a GPT-like neural network might be doing, as they have these word vectors mathematically dance with each other. Lots of those later layers are so-called "self-attention blocks", which perform a mathematical operation designed to let each word-vector the model is processing "soak up" meaning from previous words in the input (as encoded in the numbers of their own word-vectors), for example to help the model tell what person a pronoun refers to, or to help it infer the extra, unique meaning that words like "Bill" and "Clinton" should have when they show up next to each other. The model does this kind of thing over and over again as it sequentially processes the text you fed into it, and it could plausibly account for much of its general intelligence, without invoking utility maximization.
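For readers who want the "soaking up" operation spelled out, here is a bare-bones sketch of a single causal self-attention head in plain numpy. It's a simplification of my own (one head, no multi-head projections, no layer norm or residual stream), not the exact computation any particular model runs:

```python
import numpy as np

def causal_self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) stack of word vectors; W_q/W_k/W_v: learned projections."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how relevant is each word to each other word?
    scores[np.triu(np.ones_like(scores, dtype=bool), k=1)] = -np.inf  # only look at earlier words
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: attention weights per word
    return weights @ V                               # each word soaks up a weighted mix of the others
```

Each row of the output is one word-vector's "update" from everything before it; a real transformer adds that back into the word's running representation and repeats the trick across many layers and heads.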
(By the way, for those who haven't seen them, I highly recommend 3blue1brown's recent videos on how transformer-based LLMs work. One of their strengths is that they give this kind of "why it might work" story for many details of the transformer architecture, more than I'm covering here.)
Lastly, some dedicated interpretability research into (very small) transformers has uncovered some basic strategies they use for next-word prediction. Anthropic's paper on so-called induction heads is one example: they show that at one point in the training process, their very small transformer models learned to start checking if the current string they were completing showed up earlier in the context window, and assuming that string would be completed the same way again.[1] Anthropic has another paper arguing that a scaled-up, fuzzier version of this process may be a major strategy used by larger transformer models.
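To illustrate the behaviour (not the mechanism; real induction heads are learned circuits in the network's weights, not hand-written code), here's a toy version of that heuristic written out by hand:

```python
def induction_guess(tokens):
    """Toy illustration of the induction-head heuristic: if the most recent token
    appeared earlier in the context, guess it will be followed by the same token
    that followed it last time."""
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan backwards through the context
        if tokens[i] == last:
            return tokens[i + 1]              # repeat the earlier continuation
    return None                               # no earlier match; fall back to other strategies

print(induction_guess(["Mr", "Dursley", "was", "proud", "to", "say", "Mr"]))  # -> "Dursley"
```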
And notably, like the other examples I've talked about here, this strategy does not look very much like AIXI-style "searching over worlds I might be in, and over actions I might take in each of those worlds, to maximize expected value per my well-defined utility function."
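For contrast, the decision rule the classic picture has in mind looks, very roughly, something like this (a loose paraphrase on my part, glossing over AIXI's actual formalism):

$$a^{*} = \arg\max_{a} \sum_{w} P(w \mid \text{observations}) \, U(\mathrm{outcome}(a, w))$$

where $w$ ranges over hypotheses about which world the agent is in and $U$ is its fixed utility function. Nothing in the induction-head story involves an explicit search of this form.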
Again, I'm not bringing up these glimmers of interpretability in an effort to prove that current, advanced deep learning systems don't implement something like mesa-optimization, or that no future deep learning system would do so, or that all future AI systems will be deep learning-based at all. There's just not enough information to tell yet, and it's very important to be careful, considering the disastrous consequences such systems could have if we accidentally deployed them.
No, the real reason I bring this up is to point out that in the 10 years since Superintelligence was published, and in the 20 years since MIRI was founded, it has become vastly more conceivable that a generally intelligent AI system might work by methods other than AIXI-like expected utility maximization. If AI safety research were just starting up today, the probabilities we'd naturally assign to advanced AI systems working like utility maximizers would probably be a lot lower than they were in SL4-ish spaces 20 years ago. We have acquired alternate, partial models of how advanced AI might work, so we don't have to anchor so hard on AIXI-inspired models when making predictions (even if their extreme danger still means they merit a lot of attention). In short, we ought to try seeing the alignment problem with fresh eyes.
This was the position which emerged from the ashes of my old views while reading Superintelligence. More specifically, it was the position I'd been lying to myself to avoid seeing for months; its emergence felt like a newborn phoenix amid the necrotic tissue of my past view. I'd spent lots of self-deceptive energy trying to muster up an inner story of why language models were likely to be extremely dangerous, in order to justify the high p(doom) I'd picked up from the sequences and wanted to maintain, both to avoid embarrassment and for fear of losing my ingroup status among rationalists if I became enough of an AI optimist.
(Hell, I was afraid of becoming someone who was adamant about merely ensuring AI safety despite already thinking it was likely to be achieved, e.g. someone who focuses on preventing the future from being eaten by other AI failure modes, such as deep learning systems choosing to wirehead and then tiling the universe with computro-hedonium.)[2]
The loss-of-ingroup-status didn't end up happening, really, since I still have a bunch of values and interests in common with most rationalists. But that didn't stop the process of learning to let myself lower my p(doom) from feeling a little bit like I was leaving a religion. It's possible that other rationalists' strong predictions of doom come from their having better evidence than me, or being better at interpreting it. But I worry that at least in some cases, there's also some sunk-cost fallacy going on, or fear of lonely dissent, or some other biases like that. I think both of those particular ones were present in my case. If you can relate, I want to tell you that it's okay to reconsider your beliefs.
(For the sake of completeness, I'll note that there's probably some wishful thinking and contrarianism powering my current outlook. But I think there's a true, meaningful update in here too.)
Anyway, AI safety is important enough that I want to keep working on it, despite my relative optimism. With any luck, this update on my part might even make me a more motivated researcher, and perhaps a more convincing communicator of the fact that alignment poses some legitimate challenges.
Overall, this arc of mine has felt like a kind of shadow integration, a practice which is supposed to have its benefits.
- ^
I'd like to note the rapid improvement this "insight" caused in the models' performance. From the outside, one might see the loss curve going down very fast and think "oh god oh fuck it's FOOMING, it has become a MESA-OPTIMIZER". But Anthropic's analysis gives a very concrete example of something else that might be going on there. This may be relevant for dampening future risk forecasts.
- ^
I'm especially interested in safety research that treats language models as at least somewhat neuromorphic or brain-like. Here's a whole essay I wrote on that topic; it explores analogies like predictive learning in ML and predictive processing in humans, RL in both types of systems, and consciousness as a type of context window.
If deep learning systems are sufficiently neuromorphic, we might be able to import some insights from human alignment into AI safety research. For example, why don't most humans like wireheading? If we figured that out, it might help us ensure deep learning systems strive to avoid doing so themselves.
This is an interesting historical perspective... But it's not really what the fundamental case for AGI doom routes through. In particular: AGI doom is not about "AI systems", as such.
AGI doom is, specifically, about artificial generally intelligent systems capable of autonomously optimizing the world the way humans can, and who are more powerful at this task than humans. The AGI-doom arguments do not necessarily have anything to do with the current SoTA ML models.
Case in point: A manually written FPS bot is technically "an AI system". However, I think you'd agree that the AGI-doom arguments were never about this type of system, despite it falling under the broad umbrella of "an AI system".
Similarly, if a given SoTA ML model architecture fails to meet the definition of "a generally intelligent system capable of autonomously optimizing the world the way humans can", then the AGI doom is not about it. The details of its workings, therefore, have little to say, one way or another, about the AGI doom.
Why are the AGI-doom concerns extended to the current AI-capabilities research, then, if the SoTA models don't fall under said concerns? Well, because building artificial generally intelligent systems is something the AGI labs are specifically and deliberately trying to do. Inasmuch as the SoTA models are not the generally intelligent systems that are within the remit of the AGI-doom arguments, and are instead some other type of systems, the current AGI labs view this as their failure that they're doing their best to "fix".
And this is where the fundamental AGI-doom arguments – all these coherence theorems, utility-maximization frameworks, et cetera – come in. At their core, they're claims that any "artificial generally intelligent system capable of autonomously optimizing the world the way humans can" would necessarily be well-approximated as a game-theoretic agent. Which, in turn, means that any system that has the set of capabilities the AI researchers ultimately want their AI models to have, would inevitably have a set of potentially omnicidal failure modes.
In other words: The set of AI systems defined by "a generally intelligent world-optimization-capable agent", and the set of AI systems defined by "the subject of fundamental AGI-doom arguments", is the same set of systems. You can't have the former without the latter. And the AI industry wants the former; therefore, the arguments go, it will unleash the latter on the world.
While, yes, the current SoTA models are not subjects of the AGI doom arguments, that doesn't matter, because the current SoTA models are incidental research artefacts that are produced on AI industry's path to building an AGI. The AGI-doom arguments apply to the endpoint of that process, not the messy byproducts.
So any evidence we uncover about how the current models are not dangerous the way AGI-doom arguments predict AGIs to be dangerous, is just evidence that they're not AGI yet. It's not evidence that AGI would not be dangerous. (Again: FPS bots' non-dangerousness isn't evidence that AGI would be non-dangerous.)
(I'd written some more about this topic here. See also gwern's Why Tool AIs Want to Be Agent AIs for more arguments regarding why AI research's endpoint would be an AI agent, instead of something as harmless and compliant as the contemporary models.)
Counterarguments to AGI-doom arguments that focus on pointing to the SoTA models, as such, miss the point. Actual counterarguments would instead find some way to argue that "generally intelligent world-optimizing agents" and "subjects of AGI-doom arguments" are not the exact same type of system; that you can, in theory, have the former without the latter. I have not seen any such argument, and the mathematical noose around them is slowly tightening (uh, by which I mean: their impossibility may be formally provable).
Re treating utility as the optimization target: I think this isn't, properly speaking, an embedded agency problem, but rather an empirical problem of what the first AIs that automate everything will look like algorithmically. There are algorithms that can be embedded in reality and that do optimize the utility/reward, like MCTS, and TurnTrout limits the post to the model-free policy gradient case, like PPO and REINFORCE.
TurnTrout is correct to point out that not all RL algorithms optimize for the reward, and reward isn't what the agent optimizes for by ...