A simple case for extreme inner misalignment

Richard_Ngo

Understanding systematization

83 A simple case for extreme inner misalignment

by Richard_Ngo

13th Jul 2024

AI Alignment Forum

8 min read

83 Ω 35

This post is the version of Yudkowsky's argument for inner misalignment that I wish I'd had in my head a few years ago. I don't claim that it's novel, that I endorse it, or even that Yudkowsky would endorse it; it's primarily an attempt to map his ideas into an ontology that makes sense to me (and hopefully others).

This post is formulated in terms of three premises, which I explore in turn. My arguments deliberately gloss over some nuances and possible objections; in a follow-up post, I'll explore three of them. In a third post I'll dig into the objection I find most compelling, and outline a research agenda that aims to flesh it out into a paradigm for thinking about cognition more generally, which I'm calling coalitional agency.

Background

An early thought experiment illustrating the possibility of misaligned AI is the "paperclip maximizer", an AI with the sole goal of creating as many paperclips as possible. This thought experiment has often been used to describe outer misalignment—e.g. a case where the AI was given the goal of making paperclips. However, Yudkowsky claims that his original version was intended to refer to an inner alignment failure in which an AI developed the goal of producing “tiny molecules shaped like paperclips” (with that specific shape being an arbitrary example unrelated to human paperclips).

So instead of referring to paperclip maximizers, I'll follow Yudkowsky's more recent renaming and talk about "squiggle maximizers": AIs that attempt to fill the universe with some very low-level pattern that's meaningless to humans (e.g. "molecular squiggles" of a certain shape).

I'll argue for the plausibility of squiggle-maximizers via three claims:

Increasing intelligence requires compressing representations; and
The simplest goals are highly decomposable broadly-scoped utility functions; therefore
Increasingly intelligent AIs will converge towards squiggle-maximization.

In this post I'll explore each of these in turn. I'll primarily aim to make the positive case in this post; if you have an objection that I don't mention here, I may discuss it in the next post.

Increasing intelligence requires compressing representations

There's no consensus definition of intelligence, but one definition that captures the key idea in my mind is: the ability to discover and take advantage of patterns in the world. When you look at a grid of pixels and recognize a cat, or look at a string of characters and recognize a poem, you're doing a type of pattern-recognition. Higher-level patterns include scientific laws, statistical trendlines, theory of mind, etc. Discovering such patterns allows an agent to represent real-world information in a simpler way: instead of storing every pixel or every character, they can store higher-level patterns along with whichever low-level details don’t fit the pattern.

This is (at a high level) also how compression algorithms work. The thesis that intelligence is about compression has most prominently been advocated by Marcus Hutter, who formulated AIXI and created a prize for text compression. The enormous success of the self-supervised learning paradigm a few decades later is a vindication of his ideas (see also this talk by llya Sutskever exploring the link between them).

However, we shouldn’t interpret this thesis merely as a claim about self-supervised learning. We can be agnostic about whether compression primarily occurs via self-supervised learning, or fine-tuning, or regularization, or meta-learning, or directed exploration, or chain-of-thought, or new techniques that we don’t have yet. Instead we should take it as a higher-level constraint on agents: if agents are intelligent, then they must consistently compress their representations somehow.

(A human analogy: scientists sometimes make breakthroughs via solitary reasoning, or writing down their thoughts, or debating with others, or during dreams, or in a flash of insight. We don’t need to make claims about the exact mechanisms involved in order to argue that successful science requires finding highly compressed representations of empirical data.)

For the purposes of my current argument, then, we just need to accept the following claim: as agents become superintelligent there will be strong forces pushing their representations to become highly compressed.

The simplest goals are highly decomposable broadly-scoped utility functions

In general it's hard to say much about which goals will be simpler or more complex for superintelligences to represent. But there are a few properties that seem like they'll be highly correlated with the simplicity of goals. The first one I'll talk about is decomposability. Specifically, I'll focus on linearly decomposable goals which can be evaluated by adding together evaluations of many separate subcomponents. More decomposable goals are simpler because they can focus on smaller subcomponents, and don't need to account for interactions between those subcomponents.

To illustrate the idea, here are four types of linear decomposability (though there may be more I'm missing):

Decomposability over time. The goal of maximizing a reward function is decomposable over time because the overall goal can be evaluated by decomposing a trajectory into individual timesteps, then adding together the rewards at each timestep.
Decomposability over space. A goal is decomposable over space if it can be evaluated separately in each given volume of space. All else equal, a goal is more decomposable if it's defined over smaller-scale subcomponents, so the most decomposable goals will be defined over very small slices of space—hence why we're talking about molecular squiggles. (By contrast, you can't evaluate the amount of higher-level goals like "freedom" or "justice" in a nanoscale volume, even in principle.)
Decomposability over possible worlds. This is one of the main criteria which qualifies a goal as a utility function. Expected utility maximizers make decisions about lotteries over possible worlds as if they were adding together the (weighted) values of each of those possible worlds. Conversely, an agent’s goals might not be linearly decomposable over possible worlds due to risk-aversion, or because they value fairness, or various other reasons.
Decomposability over features. One final way in which a goal can be decomposable is if the value it assigns to an outcome can be calculated by adding together evaluations of different features of that outcome. For example, if my goal is to write a well-reviewed, bestselling, beautiful novel, my goal is more linearly decomposable if I can evaluate each of these properties separately and optimize for the sum of them. This occurs when features have fixed marginal utility, rather than being substitutes or complements.

Decomposability doesn't get us all the way to squiggle maximizers though. For that we need a second property: being broadly-scoped. A narrowly-scoped goal is one which has tight limits on where it applies. For example, we can imagine a goal like "increase the number of squiggles in this room as much as possible" which has very strongly diminishing returns to gaining more resources, compared with versions of the goal that aren’t bounded to that room.

However, the concept of a “room” is tied up with many human norms, and has many edge cases which would be complicated to fully pin down. So intuitively speaking, the goal above would be simpler if its bounds were defined in terms of scientifically-grounded concepts—like “on this planet” or “in our lightcone”. The latter in particular is very clearly-defined and unambiguous, making it a plausible element of the simplest versions of many goals.

(An earlier version of this section focused on unbounded goals like “increase the number of squiggles as much as possible”, which seem even simpler than broadly-scoped goals. But Scott Garrabrant pointed out that unbounded utility functions violate rationality constraints, which suggests that they actually have hidden complexity upon reflection. Alex Zhu also noted that even “in our lightcone” runs into complications when we consider possible multiverses, but I’ll leave those aside for now.)

Arguments about the simplicity of different goals are inherently very vague and speculative; I’m not trying to establish any confident conclusion. The arguments in this section are merely intended to outline why it’s plausible that the simplest goals will be highly decomposable, broadly-scoped utility functions—i.e. goals which roughly resemble squiggle-maximization.

Increasingly intelligent AIs will converge towards squiggle-maximization

Premise 1 claims that, as AIs become more intelligent, their representations will become more compressed. Premise 2 claims that the simplest goals resemble squiggle-maximization. The relationship described in premise 1 may break down as AIs become arbitrarily intelligent—but if it doesn’t, then premise 2 suggests that their goals will converge toward some kind of squiggle-maximization. (Note that I’m eliding over some subtleties related to which representations exactly get compressed, which I’ll address in my next post.)

What forces might push back on this process, though? The most obvious is training incentives. For example, AIs that are trained via reinforcement learning might get lower reward for carrying out squiggle-maximizing behavior instead of the behavior intended by humans. However, if they have situational awareness of their training context, they might realize that behaving in aligned ways in the short term will benefit their goals more in the long term, by making humans trust them more—the strategy of deceptive alignment.

Deceptive alignment might lead agents with nearly any broadly-scoped goals (including very misaligned goals) to act as if they were aligned. One common hope is that, during the period when they’re acting aligned, regularization will push them away from their misaligned goals. But if their behavior depends very little on their goals, then regularization towards simple representations would actually push them towards goals like squiggle maximization. We can therefore picture AIs gradually becoming more misaligned during training without changing their behavior, even if they started off aligned.

Can we say anything else meaningful about the evolution of goals during that process, except that they'll become very simple? In a previous post I described value systematization as

the process of an agent learning to represent its previous values as examples or special cases of other simpler and more broadly-scoped values.

This seems like a central way in which complex goals will be replaced by simpler goals. In that post, I illustrated value systematization with the example of utilitarianism. Through a process of philosophical reasoning that prioritizes simplicity, utilitarians converge towards the overriding value of maximizing a highly-decomposable broadly-scoped utility function. As they do so, they decide that existing values (like honesty, dignity, kindness, etc) should be understood as approximations to or special cases of utilitarian strategies. While their behavior stays the same in many everyday scenarios, the way they generalize to novel scenarios (e.g. thought experiments) often changes radically.

To better understand squiggle maximization in particular, it's worth zooming in further on utilitarianism in more detail. All utilitarians want to maximize some conception of welfare, but they disagree on how to understand welfare. The three most prominent positions are:

Objective list utilitarianism, which defines welfare in terms of the achievement of certain values.
Preference utilitarianism, which defines welfare in terms of the satisfaction of an agent's preferences.
Hedonic utilitarianism, which defines welfare in terms of the valence of conscious experiences.

We can think of each of these positions as making a different tradeoff between simplicity and preserving existing values. Objective list utilitarianism requires the specification of many complex values. Preference utilitarianism gets rid of those, but at the cost of being indifferent between intuitively-desirable preferences and seemingly-meaningless preferences. It also still requires a definition of preferences, which might be complicated. Meanwhile hedonic utilitarianism fully bites the bullet, and gets rid of every aspect of life that we value except for sensory pleasure.

Extreme hedonic utilitarians don't even care whether the pleasure is instantiated in human minds. They talk about filling the universe with "hedonium": matter arranged in the optimal configuration for producing happiness. We don't know yet how to characterize pleasure on a neural level, but once we can, hedonic utilitarianism will essentially be a type of squiggle-maximization, with the "squiggles" being whichever small-scale brain circuits best instantiate happiness.

In a sense, then, the squiggle maximizer hypothesis is just the hypothesis that AIs will have similar motivations as extreme hedonic utilitarians, for similar reasons, but with the specific thing they want to fill the universe with being even less palatable to everyone else. The fact that sympathy for hedonic utilitarianism is strongly correlated with intelligence is a somewhat worrying datapoint in favor of the plausibility of squiggle-maximizers.

However, there are still a range of reasons to doubt the argument I've presented in this post, as I'll explore in the next two posts.

Squiggle Maximizer (formerly "Paperclip maximizer")Inner AlignmentAI

Frontpage

83 Ω 35

Value systematization: how values become coherent (and misaligned)

48 comments102 karma

A more systematic case for inner misalignment

4 comments31 karma

Mentioned in

31A more systematic case for inner misalignment

New Comment

41 comments, sorted by

top scoring

Click to highlight new comments since: Today at 11:03 AM

[-]Eliezer Yudkowsky7moΩ102619

The part where squiggles are small and simple is unimportant. They could be bigger and more complicated, like building giant mechanical clocks. The part that matters is that squiggles/paperclips are of no value even from a very cosmopolitan and embracing perspective on value.

[-]Eliezer Yudkowsky7moΩ112613

Actually, to slightly amend that: The part where squiggles are small is a more than randomly likely part of the prediction, but not a load-bearing part of downstream predictions or the policy argument. Most of the time we don't needlessly build our own paperclips to be the size of skyscrapers; even when having fun, we try to do the fun without vastly more resources, than are necessary to that amount of fun, because then we'll have needlessly used up all our resources and not get to have more fun. We buy cookies that cost a dollar instead of a hundred thousand dollars. A very wide variety of utility functions you could run over the outside universe will have optima around making lots of small things because each thing scores one point, and so to score as many points as possible, each thing is as small as it can be as still count as a thing. Nothing downstream depends on this part coming true and there are many ways for it to come false; but the part where the squiggles are small and molecular is an obvious kind of guess. "Great giant squiggles of nickel the size of a solar system would be no more valuable, even from a very embracing and cosmopolitan perspective on value" is the loadbearing part.

[-]Richard_Ngo6moΩ330

The part that matters is that squiggles/paperclips are of no value even from a very cosmopolitan and embracing perspective on value.

I agree that the particular type of misaligned goal is not crucial. I'm thinking of molecular squiggles as an unusually clean type of misalignment to make arguments about, because it's very clear that they're not valuable. If you told me that molecular squiggles weren't a central example of a goal that you think a misaligned superintelligence might have, then I'd update, but it sounds like your statements are consistent with this.

A very wide variety of utility functions you could run over the outside universe will have optima around making lots of small things because each thing scores one point, and so to score as many points as possible, each thing is as small as it can be as still count as a thing. Nothing downstream depends on this part coming true and there are many ways for it to come false; but the part where the squiggles are small and molecular is an obvious kind of guess.

If you had to put a rough number on how likely it is that a misaligned superintelligence would primarily value "small molecular squiggles" versus other types of misaligned goals, would it be more like 1000:1 or 1:1 or 1000:1 or something else?

They could be bigger and more complicated, like building giant mechanical clocks.

Clocks are not actually very complicated; how plausible is it on your model that these goals are as complicated as, say, a typical human's preferences about how human civilization is structured?

[-]Eliezer Yudkowsky6moΩ12218

If you had to put a rough number on how likely it is that a misaligned superintelligence would primarily value "small molecular squiggles" versus other types of misaligned goals, would it be more like 1000:1 or 1:1 or 1000:1 or something else?

Value them primarily? Uhhh... maybe 1:3 against? I admit I have never actually pondered this question before today; but 1 in 4 uncontrolled superintelligences spending most of their resources on tiny squiggles doesn't sound off by, like, more than 1-2 orders of magnitude in either direction.

Clocks are not actually very complicated; how plausible is it on your model that these goals are as complicated as, say, a typical human's preferences about how human civilization is structured?

It wouldn't shock me if their goals end up far more complicated than human ones; the most obvious pathway for it is (a) gradient descent turning out to produce internal preferences much faster than natural selection + biological reinforcement learning and (b) some significant fraction of those preferences being retained under reflection. (Where (b) strikes me as way less probable than (a), but not wholly forbidden.) The second most obvious pathway is if a bunch of weird detailed noise appears in the first version of the reflective process and then freezes.

[-][anonymous]7mo208

In this post I'll explore each of these in turn. I'll primarily aim to make the positive case in this post; if you have an objection that I don't mention here, I may discuss it in the next post.

This is good to know, and I expect that this doesn't mean you want commenters to withhold their objections until after you discuss a section of them in your next post. I personally find that writing out confusions and disagreements at the very moment they appear is most useful instead of letting time pass by and attention to this topic to go by without them getting resolved.

Anyway, onto the object-level.

Premise 1 claims that, as AIs become more intelligent, their representations will become more compressed. Premise 2 claims that the simplest goals resemble squiggle-maximization. The relationship described in premise 1 may break down as AIs become arbitrarily intelligent—but if it doesn’t, then premise 2 suggests that their goals will converge toward some kind of squiggle-maximization. (Note that I’m eliding over some subtleties related to which representations exactly get compressed, which I’ll address in my next post.)

This is the key ^[1] paragraph in this post, and frankly, I don't get it. Or rather, I think I get it, but it seems like straightforwardly false in my reading of it, so I put serious probability on me having misunderstood it completely.

This reasoning seems to assume^[2] that the "goal" of the AI is part of its "representations," so that when the representations get more compressed, so does the goal. Put aside for now the (in my view rather compelling) case that the belief/goal distinction probably doesn't ultimately make sense as something that carves reality at the joints into coherent, distinct concepts. The much more serious problem is that there is an implicit and rather subtle equivocation between different meanings of "representations" that makes either this inference or Premise 1 invalid.

The "representations," in the relevant sense that makes Premise 1 worth taking seriously, are object-level, positive rather than normative internal representations of the underlying territory. But the "goal" lies in another, separate magisterium. Yes, it refers to reality, so when the map approximating reality changes, so does its description. But the core of the goal does not, for it is normative rather than positive; it simply gets reinterpreted, as faithfully as possible, in the new ontology. This is one of the essences of the Orthogonality Thesis, is it not ^[3]? That the goal is independent (i.e., orthogonal, implying uncorrelated) of the factual beliefs about reality,

Put differently, the mapping from the initial ontology to the final, more "compressed" ontology does not shrink the representation of the goal before or after mapping it; it simply maps it. If it all (approximately) adds up to normality, meaning that the new ontology is capable of replicating (perhaps with more detail, granularity, or degrees of freedom) the observations of the old one ^[4], I expect the "relative measure" of the goal representation to stay approximately ^[5] the same. And more importantly, I expect the "inverse transformation" from the new ontology to the old one to map the new representation back to the old one (since the new representation is supposed to be more compressed, i.e. informationally richer than the old one, in mathematical terms I would expect the preimage of the new representation to be approximately the old one).

I don't see the value systematization post as really bringing much relevant explanatory or argumentative power about this matter, either, perhaps because of its still-speculative nature?

^{^}
Or rather, one of the key
^{^}
I have a very hard time making heads or tails of it if it doesn't assume what I talk about next.
^{^}
Well, actually, Steven Byrnes recently described the Orthogonality Thesis in his "Claim" in what seemed to me like a much more restricted, weaker form that made it more like the "Nonlinearity Thesis", or the "Weak-correlation-at-most + tails coming apart thesis".
^{^}
Such as how the small-mass, low-velocity limit of General Relativity replicates standard Newtonian mechanics.
^{^}
I say "approximately" because of potential issues due to stuff analogous to Wentworth's "Pointers problem" and the way in which some (presumably small) parts of the goal in its old representation might be entirely incoherent and impossible to rescue in the new one.

[-]Richard_Ngo7mo7-2

I expect that this doesn't mean you want commenters to withhold their objections until after you discuss a section of them in your next post

That's right, please critique away. And thanks for the thoughtful comment!

This reasoning seems to assume^[2] that the "goal" of the AI is part of its "representations,"

An AI needs to represent its goals somehow; therefore its goals are part of its representations. But this is just a semantic point; I dig into your substantive criticism next.

The "representations," in the relevant sense that makes Premise 1 worth taking seriously, are object-level, positive rather than normative internal representations of the underlying territory. But the "goal" lies in another, separate magisterium. Yes, it refers to reality, so when the map approximating reality changes, so does its description. But the core of the goal does not, for it is normative rather than positive; it simply gets reinterpreted, as faithfully as possible, in the new ontology.

This is one of the objections I'll be covering in the next post. But you're right to call me out on it, because I am assuming that goal representations also become compressed. Or, to put it another way: I am assuming that the pressure towards simplicity described in premise 1 doesn't distinguish very much between goal representations and concept representations.

Why? Well, it's related to a claim you make yourself: that "the belief/goal distinction probably doesn't ultimately make sense as something that carves reality at the joints into coherent, distinct concepts". In other words, I don't think there is a fully clean distinction between goal representations and belief/concept representations. (In humans it often seems like the goal of achieving X is roughly equivalent to a deep-rooted belief that achieving X would be good, where "good" is a kinda fuzzy predicate that we typically don't look at very hard.) And so, given this, when I postulate a pressure to simplify representations my default assumption is that this will apply to both types of representations—as it seems to in my own brain, which often tries very hard to simplify my moral goals in a roughly analogous way to how it tries to simplify my beliefs.

This is one of the essences of the Orthogonality Thesis, is it not ^[3]? That the goal is independent (i.e., orthogonal, implying uncorrelated) of the factual beliefs about reality,

The orthogonality thesis was always about what agents were possible, not which were likely. It is consistent with the orthogonality thesis to say that increasingly intelligent agents have a very strong tendency to compress their representations, and that this tends to change their goals towards ones which are simple in their new ontologies (although it's technically possible that they retain goals which are very complex in their new ontologies). Or, to put it another way: the orthogonality thesis doesn't imply that goals and beliefs are uncorrelated (which is a very extreme claim—e.g. it implies that superintelligences are just as likely to have goals related to confused concepts like phlogiston or souls as humans are).

[-][anonymous]7mo154

I don't think there is a fully clean distinction between goal representations and belief/concept representations

Alright, this is making me excited for your upcoming post.

In humans it often seems like the goal of achieving X is roughly equivalent to a deep-rooted belief that achieving X would be good, where "good" is a kinda fuzzy predicate that we typically don't look at very hard

I like this framing a lot. I like it so much, in fact, that I intend to use it in my upcoming long post compiling arguments against the theoretical soundness and practical viability of CEV.

The orthogonality thesis was always about what agents were possible, not which were likely. [...] the orthogonality thesis doesn't imply that goals and beliefs are uncorrelated

This is related to what I wrote in footnote 3 of my previous comment to you. But anyway, would you agree that this is an absolutely terrible naming of this concept? When I type in "orthogonal" in Google, the very first thing that pops up is a list of definitions containing "[2] STATISTICS (of variates) statistically independent." And even if people aren't meant to be familiar with this definition, the most common and basic use of orthogonal, namely in linear algebra, implies that two vectors are not only linearly independent, but also that they are as far from pointing in a "related" direction as mathematically possible!

It completely boggles the mind that "orthogonality" was the word chosen as the encoding of these ideas.

Anyway, I left the most substantive and important comment for last.

And so, given this, when I postulate a pressure to simplify representations my default assumption is that this will apply to both types of representations—as it seems to in my own brain, which often tries very hard to simplify my moral goals in a roughly analogous way to how it tries to simplify my beliefs.

The thing about this is that you don't seem to be currently undergoing the type of ontological crisis or massive shift in capabilities that would be analogous to an AI getting meaningfully more intelligent due to algorithmic improvements or increased compute or data (if you actually are, godspeed!)

So would you argue that this type of goal simplification and compression happens organically and continuously even in the absence of such a "phase transition"? I have a non-rigorous feeling that this argument would prove too much by implying more short-term modification of human desires than we actually observe in real life.

Relatedly, would you say that your moral goals are simpler now than they were, say, back when you were a child? I am pretty sure that the answer, at least for me, is "definitely not," and that basically every single time I have grown "wiser" and had my belief system meaningfully altered, I came out of that process with a deeper appreciation for the complexity of life and for the intricacies and details of what I care about.

[-]rif a. saurous7mo2814

I'm generally confused by the argument here.

As we examine successively more intelligent agents and their representations, the representation of any particular thing will perhaps be more compressed, but also and importantly, more intelligent agents represent things that less intelligent agents don't represent at all. I'm more intelligent than a mouse, but I wouldn't say I have a more compressed representation of differential calculus than a mouse does. Terry Tao is likely more intelligent than I am, likely has a more compressed representation of differential calculus than I do, but he also has representations of a bunch of other mathematics I can't represent at all, so the overall complexity of his representations in total is plausibly higher.

Why wouldn't the same thing happen for goals? I'm perfectly willing to say I'm smarter than a dog and a dog is smarter than a paramecium, but it sure seems like the dog's goals are more complex than the paramecium's, and mine are more complex than the dog's. Any given fixed goal might have a more compressed representation in the more intelligent animal (I'm not sure it does, but that's the premise so let's accept it), but the set of things being represented is also increasing in complexity across organisms. Driving the point home, Terry Tao seems to have goals of proving theorems I don't even understand the statement of, and these seem like complex goals to me.

So overall I'm not following from the premises to the conclusions. I wish I could make this sharper. Help welcome.

[-][anonymous]7mo20

I think what you're saying just makes a lot of sense, honestly.

I'd suspect one possible counterargument is that, just like how more intelligent agents with more compressed models can more compactly represent complex goals, they are also capable of drawing ever-finer distinctions that allow them to identify possible goals that have very short encodings in the new ontology, but which don't make sense at all as stand-alone, mostly-coherent targets in the old ontology (because it is simply too weak to represent them). So it's not just that goals get compressed, but also that new possible kinds of goals (many of them really simple) get added to the game.

But this process should also allow new goals to arise that have ~ any arbitrary encoding length in the new ontology, because it should be just as easy to draw new, subtle distinctions inside a complex goal (which outputs a new medium- or large-complexity goal) as it would be inside a really simple goal (which outputs the type of new super-small-complexity goal that the previous paragraph talks about). So I don't think this counterargument ultimately works, and I suspect it shouldn't change our expectations in any meaningful way.

[-]MinusGix7mo10

I'm skeptical of the naming being bad, it fits with that definition and the common understanding of the word. The Orthogonality Thesis is saying that the two qualities of goal/value are not necessarily related, which may seem trivial nowadays but there used to be plenty of people going "if the AI becomes smart, even if it is weird, it will be moral towards humans!" through reasoning of the form "smart -> not dumb goals like paperclips". There's structure imposed on what minds actually get created, based on what architectures, what humans train the AI on, etc. Just as two vectors can be orthogonal in R^2 while the actual points you plot in the space are correlated.

[-][anonymous]7mo10

it fits with that definition

With what definition? The one most applicable here, dealing with random variables (relative to our subjective uncertainty), says "random variables that are independent". Independence implies uncorrelation, even if the converse doesn't hold.

Just as two vectors can be orthogonal in R^2 while the actual points you plot in the space are correlated.

This is totally false as a matter of math if you use the most common definition of orthogonality in this context. I do agree that what you are saying could be correct if you do not think of orthogonality that way and instead simply look at it in terms of the entries of the vectors, but then you enter the realm of trying to capture the "goals" and "beliefs" as specific Euclidean vectors, and I think that isn't the best idea for generating intuition because one of the points of the Orthogonality thesis seems to be to instead abstract away from the specific representation you choose for intelligence and values (which can bias you one way or another) and to instead focus on the broad, somewhat-informal conclusion.

[-]MinusGix7mo10

it fits with that definition

Ah, I rewrote my comment a few times and lost what I was referencing. I originally was referencing the geometric meaning (as an alternate to your statistical definition), two vectors at a right angle from each other.

But the statistical understanding works from what I can tell? You have your initial space with extreme uncertainty, and the orthogonality thesis simply states that (intelligence, goals) are not related — you can pair some intelligence with any goal. They are independent of each other at this most basic level. This is the orthogonality thesis. Then, in practice, you condition your probability distribution over that space with your more specific knowledge about what minds will be created, and how they'll be created. You can consider this as giving you a new space, moving probability around. As an absurd example: if height/weight of creatures were uncorrelated in principal, but then we update on "this is an athletic human", then in that new distribution they are correlated! This is what I was trying to get at with my R^2 example, but apologies that I was unclear since I was still coming at it from a frame of normal geometry. (Think, each axis is an independent normal distribution but then you condition on some knowledge that restricts them such that they become correlated)

I agree that it is an informal argument and that pinning it down to very detailed specifics isn't necessary or helpful at this low-level, I'm merely attempting to explain why orthogonality works. It is a statement about the basic state of minds before we consider details, and they are orthogonal there; because it is an argumentative response to assumptions about "smart -> not dumb goals".

[-]simon7mo98

It feels to me like this post is treating AIs as functions from a first state of the universe to a second state of the universe. Which in a sense, anything is... but, I think that the tendency to simplification happens internally, where they operate more as functions from (digital) inputs to (digital) outputs. If you view an AI as a function from an digital input to a digital output, I don't think goals targeting specific configurations of the universe are simple at all and don't think decomposability over space/time/possible worlds are criteria that would lead to something simple.

[-]Alexander Gietelink Oldenziel7mo60

I'm usually a skeptic of the usefulness of this kind of speculation but I found this a good read. I am particularly intrigued hy the suggestion of decomposability of goals.

[-]Richard_Ngo7mo20

Thanks! Yeah, I personally find it difficult to strike a balance between "most speculation of this kind is useless" and "occasionally it's incredibly high-impact" (e.g. shapes the whole field of alignment, like the concept of deceptive alignment did).

My guess is that the work which falls into the latter category is most often threat modeling, because that's a domain where there's no way to approach it except speculation.

[-]Rene de Visser7mo50

Surely the point of compression is that what you are compressing is preserved. i.e. the uncompressed version is roughly reproduced. Better compression means you preserve the important aspects while using less space.

Shouldn't the goal be preserved by the compression? I don't get this post at all.

[-]the gears to ascension7mo31

The structure of past data is preserved when creating a compressor. Future data is only constrained by smoothness.

[-]Rene de Visser7mo30

Wouldn't a goal be mainly "past" data though? Though I guess the application of the goal depends on recognizing features when trying to apply it. I guess it depends how far out of context/distribution one is trying to apply the goal in the future.

[-]Seth Herd6mo20

I agree 100%. This post is basically arguing that greater intelligence will get its goals more wrong in future versions. That would be dumber, not smarter.

The post frames the hypothesis as "greater intelligence compresses more" without hugely arguing that's true and inevitable. I think the premise is simply false. Better compression is an element of greater intelligence up to some point (useful abstract representations that aid thinking with limited computational resources), but not further beyond that point with any necessity.

[-]Daniel Kokotajlo7moΩ451

This doesn't sound like an argument Yudkowsky would make, though it seems to have some similar concepts. And it's interesting food for thought regardless -- thanks! Looking forward to the rest of the series.

[-]Steven Byrnes7moΩ220

This doesn't sound like an argument Yudkowsky would make

Yeah, I can’t immediately find the link but I recall that Eliezer had a tweet in the past few months along the lines of: If ASI wants to tile the universe with one thing, then it wipes out humanity. If ASI wants to tile the universe with sixteen things , then it also wipes out humanity.

My mental-model-of-Yudkowsky would bring up “tiny molecular squiggles” in particular for reasons a bit more analogous to the CoastRunners behavior (video)—if any one part of the motivational system is (what OP calls) decomposable etc., then the ASI would find the “best solution” to maximizing that part. And if numbers matter, then the “best solution” would presumably be many copies of some microscopic thing.

[-]Richard_Ngo7moΩ680

I don't argue at any point that ASIs will have a single goal. The argument goes through equally well if it has many. The question is why some of those goals are of the form "tile the universe with squiggles" at all. That's the part I'm addressing in this post.

[-]SCP7moΩ130

Two nitpicks and a reference:

an agent’s goals might not be linearly decomposable over possible worlds due to risk-aversion

Risk aversion doesn't violate additive separability. E.g., for we always get $E [u (x)] = \sum_{i} p_{i} x_{i}^{a}$ whether $a = 1$ (risk neutrality) or $a = 1 / 2$ (risk aversion). Though some alternatives to expected utility, like Buchak's REU theory, can allow certain sources of risk aversion to violate separability.

when features have fixed marginal utility, rather than being substitutes

Perfect substitutes have fixed marginal utility. E.g., $v (x, y) = x + 2 y$ always has marginal utilities of 1 and 2.

I'll focus on linearly decomposable goals which can be evaluated by adding together evaluations of many separate subcomponents. More decomposable goals are simpler

There's an old literature on separability in consumer theory that's since been tied to bounded rationality. One move that's made is to grant weak separability accross goups of objects---features---to rationalise the behaviour of optimising accross groups first, and within groups second. Pretnar et al (2021) describe how this can arise from limited cognitive resources.

[-]Leon Lang7mo30

Thanks for the post!

a. How exactly do 1 and 2 interact to produce 3?

I think the claim is along the lines of "highly compressed representations imply simple goals", but the connection between compressed representations and simple goals has not been argued, unless I missed it. There's also a chance that I simply misunderstand your post entirely.

b. I don't agree with the following argument:

Decomposability over space. A goal is decomposable over space if it can be evaluated separately in each given volume of space. All else equal, a goal is more decomposable if it's defined over smaller-scale subcomponents, so the most decomposable goals will be defined over very small slices of space—hence why we're talking about molecular squiggles. (By contrast, you can't evaluate the amount of higher-level goals like "freedom" or "justice" in a nanoscale volume, even in principle.)

The classical ML-algorithm that evaluates features separately in space is a CNN. That doesn't mean that features in CNNs look for tiny structures, though: The deeper into the CNN you are, the more complex the features get. Actually, deep CNNs are an example of what you describe in argument 1: The features in later layers of CNNs are highly compressed, and may tell you binary information such as "is there a dog", but they apply to large parts of the input image.

Therefore, I'd also expect that what an AGI would care about are ultimately larger-scale structures since the AGI's features will nontrivially depend on the interaction of larger parts in space (and possibly time, e.g. if the AGI evaluates music, movies, etc.).

c. I think this leaves the confusion why philosophers end up favoring the analog of squiggles when they become hedonic utilitarians. I'd argue that the premise may be false since it's unclear to me how what philosophers say they care about ("henonium") connects with what they actually care about (e.g., maybe they still listen to complex music, build a family, build status through philosophical argumentation, etc.)

[-]Richard_Ngo7mo3-4

I think the claim is along the lines of "highly compressed representations imply simple goals", but the connection between compressed representations and simple goals has not been argued, unless I missed it. There's also a chance that I simply misunderstand your post entirely.

Hmm, maybe I should spell it out more explicitly. But basically, by "simple goals" I mean "goals which are simple to represent", i.e. goals which have highly compressed representations; and if all representations are becoming simpler, then the goal representations (as a subset of all representations) are also becoming simpler. (Note that I'll elaborate further on the relationship between goal representations and other representations in my next post.)

Actually, deep CNNs are an example of what you describe in argument 1: The features in later layers of CNNs are highly compressed, and may tell you binary information such as "is there a dog", but they apply to large parts of the input image.

This is largely my fault since I haven't really defined "representation" very clearly, but I would say that the representation of the concept of a dog should be considered to include e.g. the neurons representing "fur", "mouth", "nose", "barks", etc. Otherwise if we just count "dog" as being encoded in a single neuron, then every concept encoded in any neuron is equally simple, which doesn't seem like a useful definition.

(To put it another way: the representation is the information you need to actually do stuff with the concept.)

c. I think this leaves the confusion why philosophers end up favoring the analog of squiggles when they become hedonic utilitarians. I'd argue that the premise may be false since it's unclear to me how what philosophers say they care about ("henonium") connects with what they actually care about (e.g., maybe they still listen to complex music, build a family, build status through philosophical argumentation, etc.)

I agree that most people who say they are hedonic utilitarians are not 100% committed to hedonic utilitarianism. But I still think it's very striking that they at least somewhat care about making hedonium. I claim this provides an intuition pump for how AIs might care about squiggles too.

[-]Leon Lang7mo30

Thanks for the answer!

But basically, by "simple goals" I mean "goals which are simple to represent", i.e. goals which have highly compressed representations

It seems to me you are using "compressed" in two very different meanings in part 1 and 2. Or, to be fairer, I interpret the meanings very differently.

I try to make my view of things more concrete to explain:

Compressed representations: A representation is a function from observations of the world state $O$ (or sequences of such observations) into a representation space $R$ of "features". That this is "compressed" means (a) that in $R$ , only a small number of features are active at any given time and (b) that this small number of features is still sufficient to predict/act in the world.

Goals building on compressed representations: A goal is a (maybe linear) function $U : R \to R$ from the representation space into the real numbers. The goal "likes" some features and "dislikes" others. (Or if it is not entirely linear, then it may like/dislike some simple combinations/compositions of features)

It seems to me that in part 2 of your post, you view goals as compositions $U \circ f : O \to R$ . Part 1 says that $f$ is highly compressed. But it's totally unclear to me why the composition $U \circ f$ should then have the simplicity properties you claim in part 2, which in my mind don't connect with the compression properties of $f$ as I just defined them.

A few more thoughts:

The notion of "simplicity" in part $2$ seems to be about how easy it is to represent a function -- i.e., the space of parameters with which the function $U \circ f$ is represented is simple in your part 2.
The notion of "compression" in part 1 seems to be about how easy it is to represent an input -- i.e., is there a small number of features such that their activation tells you the important things about the input?
These notions of simplicity and compression are very different. Indeed, if you have a highly compressed representation $f$ as in part 1, I'd guess that $f$ necessarily lives in a highly complex space of possible functions with many parameters, thus the opposite of what seems to be going on in part 2.

This is largely my fault since I haven't really defined "representation" very clearly, but I would say that the representation of the concept of a dog should be considered to include e.g. the neurons representing "fur", "mouth", "nose", "barks", etc. Otherwise if we just count "dog" as being encoded in a single neuron, then every concept encoded in any neuron is equally simple, which doesn't seem like a useful definition.
(To put it another way: the representation is the information you need to actually do stuff with the concept.)

I'm confused. Most of the time, when seeing a dog, most of what I need is actually just to know that it is a "dog", so this is totally sufficient to do something with the concept. E.g., if I walk on the street and wonder "will this thing bark?", then knowing "my dog neuron activates" is almost enough.

I'm confused for a second reason: It seems like here you want to claim that the "dog" representation is NOT simple (since it contains "fur", "mouth", etc.). However, the "dog" representation needs lots of intelligence and should thus come along with compression, and if you equate compression and simplicity, then it seems to me like you're not consistent. (I feel a bit awkward saying "you're not consistent", but I think it's probably good if I state my honest state of mind at this moment).

To clarify my own position, in line with my definition of compression further above: I think that whether representation is simple/compressed is NOT a property of a single input-output relation (like "pixels of dog gets mapped to dog-neuron being activated"), but instead a property of the whole FUNCTION that maps inputs to representations. This function is compressed if for any given input, only a small number of neurons in the last layer activate, and if these can be used (ideally in a linear way) for further predictions and for evaluating goal-achievement.

I agree that most people who say they are hedonic utilitarians are not 100% committed to hedonic utilitarianism. But I still think it's very striking that they at least somewhat care about making hedonium. I claim this provides an intuition pump for how AIs might care about squiggles too.

Okay, I agree with this, fwiw. :) (Though I may not necessarily agree with claims about how this connects to the rest of the post)

[-]Seth Herd6mo20

Making representations simpler even when that makes them worse at their job is not more intelligent. Yes, on the mammalian scale smarter minds compress more in many ways. That doesn't mean yet smarter minds will keep doing it more even when it makes them worse at achieving their goals, and is not necessary since they have adequate storage to keep the less-compressed and therefore more accurate and useful represenations.

This is a noble project, but I'm afraid the premise is simply false.

[-]Benjy Forstadt7mo10

The vast majority of philosophers definitely do not favor maximizing the amount of hedonium. Pure hedonistic utilitarianism is a relatively rare minority view. I don’t think we should try to explain how people end up with specific idiosyncratic philosophical views by this kind of high-level analysis…

[-]Vladimir_Nesov7moΩ120

I think the FDT dictum of treating an agent like an abstract algorithm rather than any given physical instance of it ("I am an algorithm") extends to treating goals as about the collective abstract consequences of behavior of abstract algorithms (other algorithms, that are not necessarily the agent) rather than of any given incarnation of those algorithms or consequences in any given incarnation, such as the physical consequences of running algorithms on computers in a physical world.

In this ontology, goals are not about optimizing configurations of the world, they are about optimizing behaviors of abstract algorithms or optimizing properties of mathematical structures. Physically, this predicts computronium (to run acausal interactions with all the abstract things, in order to influence their properties and behaviors) and anti-predicts squiggles or any such focus on the physical form of what's going on, other than efficiency at accessing more computation.

[-]tslarm6mo10

Meanwhile hedonic utilitarianism fully bites the bullet, and gets rid of every aspect of life that we value except for sensory pleasure.

I think the word 'sensory' should be removed; hedonic utilitarianism values all pleasures, and not all pleasures are sensory.

I'm not raising this out of pure pedantry, but because I think this phrasing (unintentionally) plays into a common misconception about ethical hedonism.

[-]Steven6mo10

Thanks for the summary. Why should this be true?

The fact that sympathy for hedonic utilitarianism is strongly correlated with intelligence is a somewhat worrying datapoint in favor of the plausibility of squiggle-maximizers.

Embracing positive sensory experience due to higher human levels of intelligence implies a linearity that I don’t think is true among other animals. Are chimps more hedonic utilitarian than ants than bacteria? Human intelligence is too narrow for this to be evidence of what something much smarter would do

[-]quila7mo10

Note that I’m eliding over some subtleties related to which representations exactly get compressed, which I’ll address in my next post

Request: also define 'representation' in your next post. I keep seeing humans writing about concepts and representations as if they're necessary to intelligence, and it's not clear to me exactly what they mean by those terms (but I suspect they think of their underlying cognition as being based on those.).

Also, an explanation of why you expect a trained ASI to necessarily have terminal values to begin with (if you do) would be helpful. E.g., what of training criteria where being a world-valuing-agent does not increase performance? E.g., where the AI is just repeatedly answering one particular class of questions - why would it be an agent, rather than {the structure of the process which produces the answer to that class of questions}, that an agent would also use if it wanted to answer a question in that class, but which itself is decomposable from agency and in a trivial way simpler than {itself plus agentic structure}?

This post is the version of Yudkowsky's argument for inner misalignment that I wish I'd had in my head a few years ago. I don't claim that it's novel, that I endorse it, or even that Yudkowsky would endorse it; it's primarily an attempt to map his ideas into an ontology that makes sense to me (and hopefully others).

I'd like to read Yudkowsky's current endorsed beliefs about this, if they're available online anywhere.

[-]mishka7mo10

What forces might push back on this process, though?

One might ponder encouraging diversity (in goals, and in representations).

Diversity should be quite adaptive, although I am not sure if it is more productive to think in terms of a population of agents, rather than in terms of a single agent.

[-]Richard_Ngo7moΩ200

Curious who just strong-downvoted and why.

[-]habryka7moΩ7174

Early stage votes are pretty noisy (and I think have been getting noisier over time, which is somewhat of a proxy of polarization, which makes me sad).

[-][anonymous]7mo56

By the way, is there a principled ^[1] reason why LW doesn't have agree/disagree votes on posts? I think this post would be a prime example of something users here might want to upvote + disagree-vote.

There seem to have been a lot of requests (1, 2, 3, 4, etc) from users for this feature over the past year or so. Raemon has mentioned a benefit of this in the past, and you yourself have extolled the benefits of agree/disagree voting for a long time.

^{^}
As in, not related to implementation details or the LW team not having had time to deal with this, etc.

[-]habryka7mo120

Yeah, the principled reason (though I am not like super confident of this) is that posts are almost always too big and have too many claims in them to make a single agree/disagree vote make sense. Inline reacts are the intended way for people to express agreement and disagreement on posts.

I am not super sure this is right, but I do want to avoid agreement/disagreement becoming disconnected from truth values, and I think applying them to elements that clearly don't have a single truth value weakens that connection.

[-]Seth Herd6mo40

It makes me happy to see you say polarization makes you sad. Sorry :) I agree that it's all too easy to lose rational discussion to polarization, and I live in fear of seeing LessWrong and the alignment community become polarized.

Your point about them not fully making sense makes sense. But despite that, I think agree/disagree votes on posts might help prevent that from happening. Receiving a large downvote on a high-effort or dearly-valued post is emotionally painful, and creates a bit of hostility/negative emotion toward the downvoter. Since you don't know who it was, you wind up thinking of it as "the opposition to my position". This hostility tempts you to downvote high effort or dearly valued posts stating similar opposing views. Now they're irritated/hostile back. Polarization results. Even if that polarization is on specific issues rather than lumping many views together, it's almost as bad.

It looks to me like agree/disagree on comments serves remarkably well to allow people to express agreement or disagreement with what they perceive as the main point, while still acknowledging the effort someone put into clarifying and arguing for that point. Many comments also make multiple points, so you're guessing at what each agree vote really means, and that's okay. It still serves the important purpose of keeping upvote/downvate from being used for agree/disagree. I think people are often posts voting post as agree or disagree rather than their contribution to the conversation. If you're quite sure a post is wrong in its main point, you might argue that it's not contributing to our collective knowledge but instead confusing it. And it's attractive to register your opinion of agreeing without writing a comment just to say "yeah". But we don't want to be voting on the truth (in most cases wehre it's not clear cut), but rather discussing it, and awarding congratulations (upvotes) for people that are trying to advance the discussion, even when their logic or their methods happen to be partly mistaken.

[-][anonymous]7mo42

Alright, this makes sense. I will say, though, that at least in my experience on this site, inline reacts seem to be used much more often in comments than they are in posts, even though comments have agree/disagree voting enabled. And, at least for me, using them in posts just kind of feels... weird, clunky? compared to using them in comments. I don't really know the right word for it, it just feels aesthetically off for reasons I can't quite explain.

I could, of course, simply be wrong about the factual claim here.

[-]habryka7mo53

Yeah, I am not super happy with the UI for inline reacts in posts, both for reading and for writing them. It's been on my to-do list for a while to improve them.

[-]Seth Herd6mo40

The inline reactions are totally different in that they de-anonymize your feedback. It's both rational and emotionally instinctive to not want someone irritated at you specifically because you disagree with them. In comments you can at least try to be extra nice and appreciative when you express disagreement to offset that downside. But that's a lot more work than clicking a button. I think that's sometimes why the anonymous big downvote is used. It takes real thought and discipline to reserve it for fair, rational "the discussion would be much better if you hadn't written this" responses. And I'm not sure how many people even try to do that.

Moderation Log