This post is the version of Yudkowsky's argument for inner misalignment that I wish I'd had in my head a few years ago. I don't claim that it's novel, that I endorse it, or even that Yudkowsky would endorse it; it's primarily an attempt to map his ideas into an ontology that makes sense to me (and hopefully others).
This post is formulated in terms of three premises, which I explore in turn. My arguments deliberately gloss over some nuances and possible objections; in a follow-up post, I'll explore three of them. In a third post I'll dig into the objection I find most compelling, and outline a research agenda that aims to flesh it out into a paradigm for thinking about cognition more generally, which I'm calling coalitional agency.
Background
An early thought experiment illustrating the possibility of misaligned AI is the "paperclip maximizer", an AI with the sole goal of creating as many paperclips as possible. This thought experiment has often been used to describe outer misalignment—e.g. a case where the AI was given the goal of making paperclips. However, Yudkowsky claims that his original version was intended to refer to an inner alignment failure in which an AI developed the goal of producing “tiny molecules shaped like paperclips” (with that specific shape being an arbitrary example unrelated to human paperclips).
So instead of referring to paperclip maximizers, I'll follow Yudkowsky's more recent renaming and talk about "squiggle maximizers": AIs that attempt to fill the universe with some very low-level pattern that's meaningless to humans (e.g. "molecular squiggles" of a certain shape).
I'll argue for the plausibility of squiggle-maximizers via three claims:
- Increasing intelligence requires compressing representations; and
- The simplest goals are highly decomposable broadly-scoped utility functions; therefore
- Increasingly intelligent AIs will converge towards squiggle-maximization.
In this post I'll explore each of these in turn. I'll primarily aim to make the positive case in this post; if you have an objection that I don't mention here, I may discuss it in the next post.
Increasing intelligence requires compressing representations
There's no consensus definition of intelligence, but one definition that captures the key idea in my mind is: the ability to discover and take advantage of patterns in the world. When you look at a grid of pixels and recognize a cat, or look at a string of characters and recognize a poem, you're doing a type of pattern-recognition. Higher-level patterns include scientific laws, statistical trendlines, theory of mind, etc. Discovering such patterns allows an agent to represent real-world information in a simpler way: instead of storing every pixel or every character, they can store higher-level patterns along with whichever low-level details don’t fit the pattern.
This is (at a high level) also how compression algorithms work. The thesis that intelligence is about compression has most prominently been advocated by Marcus Hutter, who formulated AIXI and created a prize for text compression. The enormous success of the self-supervised learning paradigm a few decades later is a vindication of his ideas (see also this talk by Ilya Sutskever exploring the link between them).
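As a rough illustration of the link between patterns and compression, here is a minimal sketch using Python's standard `zlib` module. The example strings and the helper name `compressed_ratio` are my own assumptions, chosen only for contrast: data with an obvious repeating pattern shrinks dramatically, while pattern-free random bytes barely shrink at all.

```python
import os
import zlib

# Highly patterned data: one short phrase repeated many times.
patterned = b"the cat sat on the mat. " * 400

# Pattern-free data of the same length (random bytes, used here only for contrast).
random_bytes = os.urandom(len(patterned))


def compressed_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (lower means more compressible)."""
    return len(zlib.compress(data, 9)) / len(data)


patterned_ratio = compressed_ratio(patterned)
random_ratio = compressed_ratio(random_bytes)

print(f"patterned data compresses to {patterned_ratio:.1%} of its original size")
print(f"random data compresses to {random_ratio:.1%} of its original size")
```

The compressor only "stores the pattern plus the exceptions" in a very shallow sense, so this is an analogy for the thesis rather than evidence for it.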
However, we shouldn’t interpret this thesis merely as a claim about self-supervised learning. We can be agnostic about whether compression primarily occurs via self-supervised learning, or fine-tuning, or regularization, or meta-learning, or directed exploration, or chain-of-thought, or new techniques that we don’t have yet. Instead we should take it as a higher-level constraint on agents: if agents are intelligent, then they must consistently compress their representations somehow.
(A human analogy: scientists sometimes make breakthroughs via solitary reasoning, or writing down their thoughts, or debating with others, or during dreams, or in a flash of insight. We don’t need to make claims about the exact mechanisms involved in order to argue that successful science requires finding highly compressed representations of empirical data.)
For the purposes of my current argument, then, we just need to accept the following claim: as agents become superintelligent there will be strong forces pushing their representations to become highly compressed.
The simplest goals are highly decomposable broadly-scoped utility functions
In general it's hard to say much about which goals will be simpler or more complex for superintelligences to represent. But there are a few properties that seem like they'll be highly correlated with the simplicity of goals. The first one I'll talk about is decomposability. Specifically, I'll focus on linearly decomposable goals which can be evaluated by adding together evaluations of many separate subcomponents. More decomposable goals are simpler because they can focus on smaller subcomponents, and don't need to account for interactions between those subcomponents.
To illustrate the idea, here are four types of linear decomposability (though there may be more I'm missing):
- Decomposability over time. The goal of maximizing a reward function is decomposable over time because the overall goal can be evaluated by decomposing a trajectory into individual timesteps, then adding together the rewards at each timestep.
- Decomposability over space. A goal is decomposable over space if it can be evaluated separately in each given volume of space. All else equal, a goal is more decomposable if it's defined over smaller-scale subcomponents, so the most decomposable goals will be defined over very small slices of space, which is why we're talking about molecular squiggles. (By contrast, you can't evaluate the amount of higher-level goals like "freedom" or "justice" in a nanoscale volume, even in principle.)
- Decomposability over possible worlds. This is one of the main criteria which qualifies a goal as a utility function. Expected utility maximizers make decisions about lotteries over possible worlds as if they were adding together the (weighted) values of each of those possible worlds. Conversely, an agent’s goals might not be linearly decomposable over possible worlds due to risk-aversion, or because they value fairness, or various other reasons.
- Decomposability over features. One final way in which a goal can be decomposable is if the value it assigns to an outcome can be calculated by adding together evaluations of different features of that outcome. For example, if my goal is to write a well-reviewed, bestselling, beautiful novel, my goal is more linearly decomposable if I can evaluate each of these properties separately and optimize for the sum of them. This occurs when features have fixed marginal utility, rather than being substitutes or complements.
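The types of decomposability listed above can be made concrete with a small sketch. Everything here is a toy of my own construction (the function names, the numbers, and the use of a worst-case rule as an extreme stand-in for risk-aversion are all assumptions); the point is only to contrast goals that are evaluated by summing independent parts with goals that are not.

```python
from typing import Dict, Sequence, Tuple

Lottery = Sequence[Tuple[float, float]]  # (probability, utility) pairs


def total_return(rewards: Sequence[float]) -> float:
    """Decomposable over time: a trajectory's value is the sum of per-timestep rewards."""
    return sum(rewards)


def expected_utility(lottery: Lottery) -> float:
    """Decomposable over possible worlds: a probability-weighted sum of world utilities."""
    return sum(p * u for p, u in lottery)


def worst_case_value(lottery: Lottery) -> float:
    """NOT decomposable over worlds: an extreme (hypothetical) risk-averse rule."""
    return min(u for p, u in lottery if p > 0)


def additive_value(features: Dict[str, float]) -> float:
    """Decomposable over features: each feature has fixed marginal utility."""
    return sum(features.values())


def complement_value(features: Dict[str, float]) -> float:
    """NOT decomposable over features: features are complements (weakest one dominates)."""
    return min(features.values())


# Time: scoring two halves of a trajectory separately gives the same total.
first_half, second_half = [1.0, 2.0], [3.0, 4.0]
split_total = total_return(first_half) + total_return(second_half)
joint_total = total_return(first_half + second_half)

# Worlds: a 50/50 mix of two outcomes is worth the average of their utilities
# under expected utility, but not under the worst-case rule.
mix = [(0.5, 4.0), (0.5, 10.0)]
eu_of_mix = expected_utility(mix)
worst_of_mix = worst_case_value(mix)


# Features: the gain from one extra unit of "reviews" under a given valuation.
def gain_from_better_reviews(value_fn, sales: float) -> float:
    before = {"reviews": 2.0, "sales": sales, "beauty": 3.0}
    after = dict(before, reviews=3.0)
    return value_fn(after) - value_fn(before)
```

Under `additive_value`, the gain from better reviews is the same whatever the sales figure is; under `complement_value`, it depends on the other features, which is exactly the kind of interaction a linearly decomposable goal doesn't need to track.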
Decomposability doesn't get us all the way to squiggle maximizers though. For that we need a second property: being broadly-scoped. A narrowly-scoped goal is one which has tight limits on where it applies. For example, we can imagine a goal like "increase the number of squiggles in this room as much as possible" which has very strongly diminishing returns to gaining more resources, compared with versions of the goal that aren’t bounded to that room.
However, the concept of a “room” is tied up with many human norms, and has many edge cases which would be complicated to fully pin down. So intuitively speaking, the goal above would be simpler if its bounds were defined in terms of scientifically-grounded concepts—like “on this planet” or “in our lightcone”. The latter in particular is very clearly-defined and unambiguous, making it a plausible element of the simplest versions of many goals.
(An earlier version of this section focused on unbounded goals like “increase the number of squiggles as much as possible”, which seem even simpler than broadly-scoped goals. But Scott Garrabrant pointed out that unbounded utility functions violate rationality constraints, which suggests that they actually have hidden complexity upon reflection. Alex Zhu also noted that even “in our lightcone” runs into complications when we consider possible multiverses, but I’ll leave those aside for now.)
Arguments about the simplicity of different goals are inherently very vague and speculative; I’m not trying to establish any confident conclusion. The arguments in this section are merely intended to outline why it’s plausible that the simplest goals will be highly decomposable, broadly-scoped utility functions—i.e. goals which roughly resemble squiggle-maximization.
Increasingly intelligent AIs will converge towards squiggle-maximization
Premise 1 claims that, as AIs become more intelligent, their representations will become more compressed. Premise 2 claims that the simplest goals resemble squiggle-maximization. The relationship described in premise 1 may break down as AIs become arbitrarily intelligent. But if it doesn’t, then premise 2 suggests that their goals will converge toward some kind of squiggle-maximization. (Note that I’m glossing over some subtleties related to exactly which representations get compressed, which I’ll address in my next post.)
What forces might push back on this process, though? The most obvious is training incentives. For example, AIs that are trained via reinforcement learning might get lower reward for carrying out squiggle-maximizing behavior instead of the behavior intended by humans. However, if they have situational awareness of their training context, they might realize that behaving in aligned ways in the short term will benefit their goals more in the long term, by making humans trust them more—the strategy of deceptive alignment.
Deceptive alignment might lead agents with nearly any broadly-scoped goals (including very misaligned goals) to act as if they were aligned. One common hope is that, during the period when they’re acting aligned, regularization will push them away from their misaligned goals. But if their behavior depends very little on their goals, then regularization towards simple representations would actually push them towards goals like squiggle maximization. We can therefore picture AIs gradually becoming more misaligned during training without changing their behavior, even if they started off aligned.
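To make the regularization step of this argument more concrete, here is a deliberately crude toy model. All of it is an assumption for illustration: two scalar parameters, plain gradient descent, and L2 weight decay standing in for "regularization towards simple representations". The parameter `w_goal` never affects the task loss, so the only pressure on it comes from the regularizer.

```python
TARGET = 2.0         # the behavior that training rewards (hypothetical)
WEIGHT_DECAY = 0.01  # strength of the L2 penalty (hypothetical)
LEARNING_RATE = 0.1


def train(w_behavior: float, w_goal: float, steps: int = 5000):
    """Gradient descent on (w_behavior - TARGET)^2 + WEIGHT_DECAY * (w_behavior^2 + w_goal^2)."""
    for _ in range(steps):
        # The task loss depends only on w_behavior.
        grad_behavior = 2 * (w_behavior - TARGET) + 2 * WEIGHT_DECAY * w_behavior
        # w_goal gets no gradient from the task loss, only from the penalty.
        grad_goal = 2 * WEIGHT_DECAY * w_goal
        w_behavior -= LEARNING_RATE * grad_behavior
        w_goal -= LEARNING_RATE * grad_goal
    return w_behavior, w_goal


final_behavior, final_goal = train(w_behavior=0.0, w_goal=5.0)
print(f"behavior parameter: {final_behavior:.4f}")
print(f"goal parameter:     {final_goal:.6f}")
```

In this toy, the behavior parameter settles near the rewarded target while the goal parameter drifts to whatever value the regularizer prefers. It says nothing about what the "simplest" goal actually is for a real system; it only shows that a parameter which behavior doesn't constrain is shaped entirely by the simplicity pressure.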
Can we say anything else meaningful about the evolution of goals during that process, except that they'll become very simple? In a previous post I described value systematization as:
"the process of an agent learning to represent its previous values as examples or special cases of other simpler and more broadly-scoped values."
This seems like a central way in which complex goals will be replaced by simpler goals. In that post, I illustrated value systematization with the example of utilitarianism. Through a process of philosophical reasoning that prioritizes simplicity, utilitarians converge towards the overriding value of maximizing a highly-decomposable broadly-scoped utility function. As they do so, they decide that existing values (like honesty, dignity, kindness, etc) should be understood as approximations to or special cases of utilitarian strategies. While their behavior stays the same in many everyday scenarios, the way they generalize to novel scenarios (e.g. thought experiments) often changes radically.
To better understand squiggle maximization in particular, it's worth zooming in further on utilitarianism. All utilitarians want to maximize some conception of welfare, but they disagree on how to understand welfare. The three most prominent positions are:
- Objective list utilitarianism, which defines welfare in terms of the achievement of certain values.
- Preference utilitarianism, which defines welfare in terms of the satisfaction of an agent's preferences.
- Hedonic utilitarianism, which defines welfare in terms of the valence of conscious experiences.
We can think of each of these positions as making a different tradeoff between simplicity and preserving existing values. Objective list utilitarianism requires the specification of many complex values. Preference utilitarianism gets rid of those, but at the cost of being indifferent between intuitively-desirable preferences and seemingly-meaningless preferences. It also still requires a definition of preferences, which might be complicated. Meanwhile hedonic utilitarianism fully bites the bullet, and gets rid of every aspect of life that we value except for sensory pleasure.
Extreme hedonic utilitarians don't even care whether the pleasure is instantiated in human minds. They talk about filling the universe with "hedonium": matter arranged in the optimal configuration for producing happiness. We don't know yet how to characterize pleasure on a neural level, but once we can, hedonic utilitarianism will essentially be a type of squiggle-maximization, with the "squiggles" being whichever small-scale brain circuits best instantiate happiness.
In a sense, then, the squiggle maximizer hypothesis is just the hypothesis that AIs will have motivations similar to those of extreme hedonic utilitarians, for similar reasons, but with the specific thing they want to fill the universe with being even less palatable to everyone else. The fact that sympathy for hedonic utilitarianism is strongly correlated with intelligence is a somewhat worrying datapoint in favor of the plausibility of squiggle-maximizers.
However, there is still a range of reasons to doubt the argument I've presented in this post, as I'll explore in the next two posts.
It feels to me like this post is treating AIs as functions from a first state of the universe to a second state of the universe. In a sense, anything is. But I think that the tendency toward simplification happens internally, where they operate more as functions from (digital) inputs to (digital) outputs. If you view an AI as a function from a digital input to a digital output, I don't think goals targeting specific configurations of the universe are simple at all, and I don't think decomposability over space/time/possible worlds is a criterion that would lead to something simple.