Inner Alignment: Explain like I'm 12 Edition

(This is an unofficial explanation of Inner Alignment based on the Miri paper Risks from Learned Optimization in Advanced Machine Learning Systems (which is almost identical to the LW sequence) and the Future of Life podcast with Evan Hubinger (Miri/LW). It's meant for anyone who found the sequence too long/challenging/technical to read.)

Note that bold and italics means "this is a new term I'm introducing," whereas underline and italics is used for emphasis.

What is Inner Alignment?

Let's start with an abridged guide to how Deep Learning works:

  1. Choose a problem
  2. Decide on a space of possible solutions
  3. Find a good solution from that space

If the problem is "find a tool that can look at any image and decide whether or not it contains a cat," then each conceivable set of rules for answering this question (formally, each function from the set of all pixels to the set ) defines one solution. We call each such solution a model. The space of possible models is depicted below.

Since that's all possible models, most of them are utter nonsense.

Pick a random one, and you're as likely to end up with a car-recognizer than a cat-recognizer – but far more likely with an algorithm that does nothing we can interpret. Note that even the examples I annotated aren't typical – most models would be more complex while still doing nothing related to cats. Nonetheless, somewhere in there is a model that would do a decent job on our problem. In the above, that's the one that says, "I look for cats."

How does ML find such a model? One way that does not work is trying out all of them. That's because the space is too large: it might contain over candidates. Instead, there's this thing called Stochastic Gradient Descent (SGD). Here's how it works:

SGD begins with some (probably terrible) model and then proceeds in steps. In each step, it switches to another model that is "close" and hopefully a little better. Eventually, it stops and outputs the most recent model.[1] Note that, in the example above, we don't end up with the perfect cat-recognizer (the red box) but with something close to it – perhaps a model that looks for cats but has some unintended quirks. SGD does generally not guarantee optimality.

The speech bubbles where the models explain what they're doing are annotations for the reader. From the perspective of the programmer, it looks like this:

The programmer has no idea what the models are doing. Each model is just a black box.[2]

A necessary component for SGD is the ability to measure a model's performance, but this happens while treating them as black boxes. In the cat example, assume the programmer has a bunch of images that are accurately labeled as "contains cat" and "doesn't contain cat." (These images are called the training data and the setting is called supervised learning.) SGD tests how well each model does on these images and, in each step, chooses one that does better. In other settings, performance might be measured in different ways, but the principle remains the same.

Now, suppose that the images we have happen to include only white cats. In this case, SGD might choose a model implementing the rule "output yes if there is something white and with four legs." The programmer would not notice anything strange – all she sees is that the model output by SGD does well on the training data.

In this setting, there is thus only a problem if our way of obtaining feedback is flawed. If it is perfect – if the pictures with cats are perfectly representative of what images-with-cats are like, and the pictures without cats are perfectly representative of what images-without-cats are like, then there isn't an issue. Conversely, if our images-with-cats are non-representative because all cats are white, the model SGD outputs might not be doing precisely what the programmer wanted. In Machine Learning slang, we would say that the training distribution is different from the distribution in deployment.

Is this Inner Alignment? Not quite. This is about a property called distributional robustness, and it's a well-known problem in Machine Learning. But it's close.

To explain Inner Alignment itself, we have to switch to a different setting. Suppose that, instead of trying to classify whether images contain cats, we are trying to train a model that solves mazes. That is, we want an algorithm that, given an arbitrary solvable maze, outputs a route from the Maze Entry to the Maze Exit.

As of before, our space of all possible models will consist primarily of nonsense solutions:

(If you don't know what depth-first search means: as far as mazes are concerned, it's simply the "always go along one wall" rule.)

The annotation "I perform depth-first search" means the model contains a formal algorithm that implements depth-first search, and analogously with the other annotations.

As with the previous example, we might apply SGD to this problem. In this case, the feedback mechanism would come from evaluating the model on test mazes. Now, suppose that all of the test mazes have this form,

where the red areas represent doors. That is, all mazes are such that the shortest path leads through all of the red doors, and the exit is itself a red door.

Looking at this, you might hope that SGD finds the "depth-first" model. However, while that model would find the shortest path, it is not the best model. (Note that it first performs depth-first search and then, once it has found the right path, discards dead ends and outputs the shortest path only.) The alternative model with annotation "perform breadth-first search to find the next red door, repeat forever" would perform better. (Breadth-first means exploring all possible paths in parallel.) Both models always find the shortest path, but the red-door model would find it more quickly. In the maze above, it would save time by finding the path from the first to the second door without wasting time exploring the lower-left part of the maze.

Note that breadth-first search only outperforms depth-first search because it can truncate the fruitless paths after having reached the red door. Otherwise, it wouldn't know that the bottom-left part is fruitless until much later in the search.

As of before, all the programmer will see is that the left model performs better on the training data (the test mazes).

The qualitative difference to the cat picture example is that, in this case, we can talk about the model as running an optimization process**.** That is, the breadth-first search model does itself have an objective (go through red doors), and it tries to optimize for that in the sense that it searches for the shortest path that leads there. Similarly, the depth-first model is an optimization process with the objective "find exit of maze."

This is enough to define Inner Alignment, but to make sure the definition is the same that one reads elsewhere, let's first define two new terms.

  • The Base Objective is the objective we use to evaluate models found by SGD. In the first example, it was "classify pictures correctly (i.e., say "contains cat" if it contains a cat and "doesn't contain cat" otherwise). In the second example, it was "find [a shortest path that solves mazes] as quickly as possible."
  • In the cases where the model is running an optimization process, we call the model a Mesa Optimizer, and we call its objective the Mesa Objective (in the maze example, the mesa objective is "find shortest path through maze" for the depth-first model, and "repeatedly find shortest path to the next red door" for the breadth-first model).

With that said,

Inner Alignment is the problem of aligning the Base Objective with the Mesa Objective.

Some clarifying points:

  • The red-door example is thoroughly contrived and would not happen in practice. It only aims to explain what Inner Alignment is, not why misalignment might be probable.
  • You might wonder what the space of all models looks like. The typical answer is that the possible models are sets of weights for a neural network. The problem exists insofar as some sets of weights implement specific search algorithms.
  • As of before, the reason for the inner alignment failure was that our way of obtaining feedback was flawed (in ML language: because there was distributional shift). (Although misalignment may also arise for other very complicated reasons.)
  • If the Base Objective and Mesa Objective are misaligned, this causes problems as soon as the model is deployed. In the second example, as soon as we take the model output by SGD and apply it to real mazes, it would still search for red doors. If those mazes don't contain red doors, or the red doors aren't always on paths to the exit, the model would perform poorly.

Here is the relevant Venn-Diagram. (Relative sizes don't mean anything.)

Note that {What AI tries to do} = {Mesa Objective} by definition.

Most classical discussion of AI alignment, including most of the book Superintelligence, is about Outer Alignment. The classical examples where we assume the AI is optimized to cure cancer and then kills humans so that no-one can have cancer anymore is about a misalignment of {What Programmers want} and the {Base Objective}. (The Base Objective is {minimize the number of people who have cancer}, and while it's not clear what the programmers want, it's certainly not that.)

Admittedly, the inner alignment model is not maximally general. In this post, we've looked at black box search, where we have a parametrized model and do SGD to update the parameters. This describes most of what Machine Learning is up to in 2020, but it does not describe what the field did pre-2000 and, in the event of a paradigm shift similar to the deep learning revolution, it may not describe what the field looks like in the future. In the context of black box search, inner alignment is a well-defined property and Venn-Diagram a valid way of slicing up the problem, but there are people who expect that AGI will not be built that way.[3] There are even concrete proposals for safe AI where the concept doesn't apply. Evan Hubinger has since written a follow-up post about what he calls "training stories", which is meant to be "a general framework through which we can evaluate any proposal for building safe advanced AI".

The Analogy to Evolution

Arguments about Inner Alignment often make reference to evolution. The reason is that evolution is an optimization process – it optimizes for inclusive genetic fitness. The space of all models is the space of all possible organisms.

Humans are certainly not the best model in this space – I've added the description on the bottom right to indicate that there are better models that haven't been found yet. However, humans are the best model that evolution has found so far.

As with the maze example, humans do themselves run optimization processes. Thus, we can call them/us Mesa Optimizes, and we can compare the Base Objective (the one evolution maximizes for) with the Mesa Objective (the one humans optimize for).

  • Base Objective: maximize inclusive genetic fitness
  • Mesa Objective: avoid pain, seek pleasure

(This is simplified – some humans optimize for other things, such as the well-being of all possible minds in the universe – but those are no closer to the Base Objective.)

We can see that humans are not aligned with the base objective of evolution. And it is easy to see why – the way Evan Hubinger put it is to imagine the counterfactual world where evolution did select inner-aligned models. In this world, a baby who stabs its toe has to compute how stabbing its toe affects its inclusive genetic fitness before knowing whether or not to repeat this behavior in the future. This would be computationally expensive, whereas the "avoid pain" objective immediately tells the baby that stabbing toe=bad, which is much cheaper and usually the correct answer. Thus, an unaligned model outperforms the hypothetical aligned model. Another interesting aspect is that the size of the misalignment (the difference between the Base Objective and the Mesa Objective) has widened over the last few millennia. In the ancestral environment, they were pretty close, but now, they are so far apart that we need to pay people to donate their sperm, which, according to the Base Objective, ought to be a highly desirable action.

Consequently, the analogy might be an argument for why Inner Misalignment is probable since it has occurred "naturally" in the biggest non-human-caused optimization process we know. However, the big caveat here is that evolution does not implement Stochastic Gradient Descent. Evolution navigates the model space by performing random mutations and then evaluating performance, which is fundamentally different (and a billion times less efficient) from modifying the model according to the expected derivative of the loss function, which is what SGD does. Thus, while the analogy works in most ways, it stops working as soon as one makes arguments that rely on properties of SGD other than that it optimizes the Base Objective.

Deceptive Alignment

This is the abridged version of the fourth part of the sequence. I'm linking to it because this is probably the one where leaving out the technical details is the most problematic.

The concept

In this section, we make the following assumptions:

  • The learning task is hard, and therefore, models are very complex. Think of a question-answering system, rather than an image classifier.
  • Instead of having a single learning process, we update a model over time.
  • The learning process will select a Mesa Optimizer.
  • The Base Objective is complicated, and the model won't get it right immediately (i.e., the model starts out not being inner-aligned).

Since the model is sophisticated, we assume that it builds up a model of the world somehow. Think of GPT-3 (the language model that can write text): it clearly recognizes whether you're prompting it about Lord of the Rings or about politics. This shows that it has an internal model of these things, however flawed or incomplete.

Thus, if we look inside the model (which, again, the programmers cannot do), we have the following two components:

Recall that the model is a Mesa Optimizer by assumption, hence we know it has an objective. This (red blob) is the Mesa Objective.

As its model of the world improves, it might eventually include a model of the Base Objective. Recall that the Base Objective is what SGD optimizes for.

I've drawn the model of the Base Objective (purple blob) larger than the Mesa Objective since we assume the Base Objective is fairly complex.

SGD tries to make the model better, and if [the thing that the model optimizes for] becomes more similar to the Base Objective, the model does become better. Therefore, we speculate that the model will change such that this happens. We further speculate that there are three different ways this could happen, which I'll illustrate below.


The first is called Internalization. The model adopts the Base Objective outright.


The second is called Proxy Alignment. The Mesa Objective is no longer represented explicitly – instead, it is a pointer to the part of its world model that includes the Base Objective.


The third is called Deceptive Alignment. Here, the Mesa Objective remains different from the Base Objective, but the model chooses to optimize the Base Objective anyway because doing so makes the learning procedure think it is aligned. Thus, the dotted pointer does not indicate that the Mesa Objective is the Base Objective, nor that it points to the base objective, but that the model will optimize the Base Objective for instrumental reasons, i.e., as part of its strategy for optimizing the Mesa Objective.


For deceptive alignment to be possible, we require that (a) the model knows it is in a training process, and (b) it cares about its objective long-term. For deceptive alignment to become a problem, we further require that (c) the model expects that it eventually leaves the training process permanently. If all of those are given, the paper uses the following example to illustrate why a model might be deceptively aligned (i.e., might optimize the Base Objective even though it's not what it terminally wants).

In this case, the Base Objective is "go to A" and the Mesa Objective "go to B." If the model that wants to go to B is smart, it will go to A in the training process since that allows it to go to B during deployment, and there are three deployment episodes vs. only two training episodes. Thus, optimizing the Base Objective during training yields better results for the Mesa Objective (in the long-term) than optimizing for the Mesa Objective outright. This is deceptive alignment.

Buck Shlegeris (Miri/LW) has the following analogy illustrating the three cases. Suppose the Christian God wants to create humans that do what they want (god's utility function is the Base Objective). The three kinds of alignment have analogs in what kinds of humans do what god wants.

  • For internalization, think of Jesus. He does what god wants because he has the same utility function as god.
  • For proxy alignment, think of Martin Luther. He does what God wants because his utility function is "figure out what the bible wants and do that."
  • For deceptive alignment, his example was Blaise Pascal; however, I don't need that example because I used to have a friend in middle and high school who explicitly wanted to do what the bible said as he wanted to go to heaven. He openly admitted (to me, anyway) to have terminal goals that contradict what the bible says, but made genuine efforts to adhere to the bible anyway. I'm very confident that this type of person is more common than Martin Luthers or Jesuses.

One purpose of this analogy is to illustrate why deceptive models might be more probable than what intuition might claim. Deceptive alignment sounds like a weird, specific kind of thing, but the analogy suggests that there are, in fact, many more deceptively aligned models out there than proxy aligned models or models aligned via internalization. Only one utility function is the same as that of the Christian god, but many utility functions place very high utility on "don't go to hell." Similarly, there is a range of possible Mesa Objectives that allow for deceptive alignment, and all of those will optimize the Base Objective during training.

The above is just an analogy, but there are other reasons to expect that deception is likely. Specifically, I'll cover two further arguments why the alternatives are "difficult" to achieve and hence might be less likely.

Why internalization might be difficult

If a model is heavily optimized for performance on a complex training distribution, we expect it to have a complex world model. This is already true for GPT-3, which requires explicit models of millions of human concepts (money, salt shakers, the Lord of the Rings movies, you name it) to predict the subsequent word on arbitrary inputs.

If the task is sufficiently hard, we may further expect this world model includes a model of the Base Objective. Not for nefarious reasons, but simply as one more thing that slightly increases performance on the training distribution.

If the above is true at the point where the model first switches from a bunch of heuristics to an explicit optimization target, Internalization would require rebuilding [the model of the Base Objective] in full, whereas proxy alignment or deception only require building a pointer to the [model of the Base objective] that already exists in the world model. Since rebuilding the entire model is likely much harder than installing a pointer (which can typically be represented by a single number), SGD is unlikely to go down that path.

In the analogy to evolution (which is still just an analogy and might not work in this case), it seems intuitive that programming a detailed model of "parents" into our DNA is harder for evolution than implementing love for parents via reference to our world model.

Why proxy alignment might be difficult

The argument in the previous subsection does not address the difference between deceptive and proxy alignment as neither of them needs to represent a complex objective explicitly.

Instead, the argument made by the paper is one of robustness. It goes like so:

  • If a model is proxy-aligned, but the pointer is wrong, this can cause the model's performance to degrade.

  • Even if the model's world model (gray blob) improves to the point where it recognizes this failure, it will not correct the pointer because the pointer's current target is the Mesa Objective.

  • SGD can update the pointer, but this isn't easy to get right.

  • Conversely, a deceptive model will immediately adjust its behavior if it learns that the pointer was flawed as doing so is helpful for its objective. (This is so because its strategy is "hide my true objective for now, such that it won't get modified and I can pursue it in deployment.")

Note that this does not exhaust all arguments that have been made in the paper.

Miscellaneous

It seems to me that one under-appreciated aspect of Inner Alignment is that, even if one had the one-true-utility-function-that-is-all-you-need-to-program-into-AI, this would not, in fact, solve the alignment problem, nor even the intent-alignment part. It would merely solve outer alignment (provided the utility function can be formalized). If we do SGD based on the one true utility function, this could still lead to a mesa optimized that wants something else.

Another interesting point is that the plausibility of internalization (i.e., of a model representing the Base Objective explicitly) does not solely depend on the complexity of the objective. For example, evolution's objective of "maximize inclusive genetic fitness" is quite simple, but it is still not represented explicitly because figuring out how actions affect the objective is computationally hard. Thus, {probability of Mesa Optimizer adopting an objective} is at least dependent on {complexity of objective} as well as {difficulty of assessing how actions impact objective}.


  1. In practice, one often runs SGD multiple times with different initializations and uses the best result. Also, the output of SGD may be a linear combination of all models on the way rather than just the final model._ ↩︎

  2. However, there are efforts to create transparency tools to look into models. Such tools might be helpful if they become really good. Some of the proposals for building safe advanced AI explicitly include transparency tools ↩︎

  3. If an AGI does contain more hand-coded parts, the picture gets more complicated. E.g., if a system is logically separated into a bunch of components, inner alignment may apply to some of the components but not others. It may even apply to parts of biological systems, see e.g., Steven Byrne's Inner Alignment in the brain. ↩︎

New Comment
47 comments, sorted by Click to highlight new comments since:
[-]evhubΩ16460

This is great—thanks for writing this! I particularly liked your explanation of deceptive alignment with the diagrams to explain the different setups. Some comments, however:

(These models are called the training data and the setting is called supervised learning.)

Should be “these images are.”

Thus, there is only a problem if our way of obtaining feedback is flawed.

I don't think that's right. Even if the feedback mechanism is perfect, if your inductive biases are off, you could still end up with a highly misaligned model. Consider, for example, Paul's argument that the universal prior is malign—that's a setting where the feedback is perfect but you still get malign optimization because the prior is bad.

For proxy alignment, think of Martin Luther King.

The analogy is meant to be to the original Martin Luther, not MLK.

If we further assume that processing input data doesn't directly modify the model's objective, it follows that representing a complex objective via internalization is harder than via "modelling" (i.e., corrigibility or deception).

I'm not exactly sure what you're trying to say here. The way I would describe this is that internalization requires an expensive duplication where the objective is represented separately from the world model despite the world model including information about the objective.

Many thanks for taking the time to find errors.

I've fixed #1-#3. Arguments about the universal prior are definitely not something I want to get into with this post, so for #2 I've just made a vague statement that misalignment can arise for other reasons and linked to Paul's post.

I'm hesitant to change #4 before I fully understand why.

I'm not exactly sure what you're trying to say here. The way I would describe this is that internalization requires an expensive duplication where the objective is represented separately from the world model despite the world model including information about the objective.

So, there are these two channels, input data and SGD. If the model's objective can only be modified by SGD, then (since SGD doesn't want to do super complex modifications), it is easier for SGD to create a pointer rather than duplicate the [model of the base objective] explicitly.

But the bolded part seemed like a necessary condition, and that's what I'm trying to say in the part you quoted. Without this condition, I figured the model could just modify [its objective] and [its model of the Base Objective] in parallel through processing input data. I still don't think I quite understand why this isn't plausible. If the [model of Base objective] and the [Mesa Objective] get modified simultaneously, I don't see any one step where this is harder than creating a pointer. You seem to need an argument for why [the model of the base objective] gets represented in full before the Mesa Objective is modified.

Edit: I slightly rephrased it to say

If we further assume that processing input data doesn't directly modify the model's objective (the Mesa Objective), or that its model of the Base Objective is created first,

The post still contains a misplaced mention of MLK shortly after the first mention of Luther:

I'm very confident that this type of person is more common than Martin Luther Kings or Jesuses.

Ah, shoot. Thanks.

Inner Alignment / Misalignment is possibly the key specific mechanism which fills a weakness in the 'classic arguments' for AI safety - the Orthogonality Thesis, Instrumental Convergence and Fast Progress together implying small separations between AI alignment and AI capability can lead to catastrophic outcomes. The question of why there would be such a damaging, hard-to-detect divergence between goals and alignment needs an answer to have a solid, specific reason to expect dangerous misalignment, and Inner Misalignment is just such a reason.

I think that it should be presented in initial introductions to AI risk alongside those classic arguments, as the specific, technical reason why the specific techniques we use are likely to produce such goal/capability divergence - rather than the general a priori reasons given by the classic arguments.

I still find the arguments that inner misalignment is plausible to rely on intuitions that feel quite uncertain to me (though I'm convinced that inner misalignment is possible).

So, I currently tend to prefer the following as the strongest "solid, specific reason to expect dangerous misalignment":

We don't yet have training setups that incentivise agents to do what their operators want, once they are sufficiently powerful.

Instead, the best we can do currently is naive reward modelling, and agents trained in this way are obviously incentivised to seize control of the memory cell where their reward is implemented (and eliminate anyone who might try to interfere with this) once they're sufficiently powerful - because that will allow them to get much higher scores, much more easily, than actually bringing about complicated changes to the world.

Meanwhile, AI capabilities are marching on scarily fast, so we probably don't have that much time to find a solution. And it's plausible that a solution will be very difficult because corrigibility seems "anti-natural" in a certain sense.

Curious what you think about this?

For those who, like me, have the attention span and intelligence of a door hinge the ELI5 edition is:

Outer alignment is trying to find a reward function that is aligned with our values (making it produce good stuff rather than paperclips)

Inner alignment is the act of ensuring our AI actually optimizes the reward function we specify.

An example of poor inner alignment would be us humans in the eyes of evolution. Instead of doing what evolution intended, we use contraceptives so we can have sex without procreation. If evolution had gotten its inner alignment right, we would care as much about spreading our genes as evolution does!

This post is both a huge contribution, giving a simpler and shorter explanation of a critical topic, with a far clearer context, and has been useful to point people to as an alternative to the main sequence. I wouldn't promote it as more important than the actual series, but I would suggest it as a strong alternative to including the full sequence in the 2020 Review. (Especially because I suspect that those who are very interested are likely to have read the full sequence, and most others will not even if it is included.)

Thanks for the ELI12, much appreciated.

evolution's objective of "maximize inclusive genetic fitness" is quite simple, but it is still not represented explicitly because figuring out how actions affect the objective is computationally hard

This doesn’t seem like the bottleneck in many situations in practice. For example, a lot of young men feel like they want to have as much sex as possible, but not father as many kids as possible. I’m not sure exactly what the reason is, but I don’t think it’s the computational difficulty of representing having kids vs. having sex, because humans already build a world model containing the concept of “my kids”.

It seems to me that one under-appreciated aspect of Inner Alignment is that, even if one had the one-true-utility-function-that-is-all-you-need-to-program-into-AI, this would not, in fact, solve the alignment problem, nor even the intent-alignment part. It would merely solve outer alignment (provided the utility function can be formalized).

Damn, yep I for one under-appreciated this for the past 12 years.

What else have people said on this subject? Do folks think that scenarios where we solve outer alignment most likely involve us not having to struggle much with inner alignment? Because fully solving outer alignment implies a lot of deep progress in alignment.

This doesn’t seem like the bottleneck in many situations in practice. For example, a lot of young men feel like they want to have as much sex as possible, but not father as many kids as possible. I’m not sure exactly what the reason is, but I don’t think it’s the computational difficulty of representing having kids vs. having sex, because humans already build a world model containing the concept of “my kids”.

In this case, I would speculate that the kids objective wouldn't work that well because the reward is substantially delayed. The sex happens immediately, the kids only after 9 months. Humans tend to discount their future.

Also, how exactly would the kids objective even be implemented?

What else have people said on this subject? 

I believe that Miri was aware of this problem for a long time, but that it didn't have the nice, comparatively non-confused and precise handle of "Inner Alignment" until Evan published the 'risks from learned optimizations' paper. But I'm not the right person to say anything else about this.

Do folks think that scenarios where we solve outer alignment most likely involve us not having to struggle much with inner alignment? Because fully solving outer alignment implies a lot of deep progress in alignment.

Probably not. I think Inner alignment is, if anything, probably the harder problem. It strikes me as reasonably plausible that Debate is a proposal which solves outer alignment, but as very unlikely that it automatically solves Inner Alignment.

Hm ya I guess the causality between sex and babies (even sex and visible pregnancy) is so far away in time that it’s tough to make a brain want to “make babies”.

But I don’t think computationally intractability of how actions effect inclusive genetic fitness is quite why evolution made such crude heuristics. Because if a brain understood that it was trying to maximize that quantity, I think it could figure out “have a lot of sex” as a heuristic approach without evolution hard-coding it in. And I think humans actually do have some level of in-brain goals to have more descendants beyond just having more sex. So I think these things like sex pleasure are just performance optimizations to a mentally tractable challenge.

E.g. snakes quickly triggering a fear reflex

Re the argument for "Why internalization might be difficult", I asked Evan Hubinger for his take on your rendition of the argument, and he thinks it's not right.

Rather, the argument that Risks from Learned Optimization makes that internalization would be difficult is that:

  • ~all models with good performance on a diverse training set probably have to have a complex world model already, which likely includes a model of the base objective,
  • so having the base objective re-encoded in a separate part of the model that represents its objective is just a waste of space/complexity.

Especially since this post is now (rightly!) cited in several introductory AI risk syllabi, it might be worth correcting this, if you agree it's an error.

Thanks! I agree it's an error, of course. I've changed the section, do you think it's accurate now?)

It looks good to me!

This is already true for GPT-3

Idk, maybe...?

Is that in doubt? Note that I don't say it models the base objective in the post, I just say that it has a complex world model. This seemed unquestionable to me since it demonstrably knows lots of things. Or are you drawing a distinction between "a lot of facts about stuff" and " a world model?" I haven't draw that; "model" seems very general and "complex" trivially true. It may not be a smart model.

Part of me thinks: I was trying to push on whether it has a world model or rather has just memorised loads of stuff on the internet and learned a bunch of heuristics for how to produce compelling internet-like text. For me, "world model" evokes some object that has a map-territory relationship with the world. It's not clear to me that GPT-3 has that.

Another part of me thinks: I'm confused. It seems just as reasonable to claim that it obviously has a world model that's just not very smart. I'm probably using bad concepts and should think about this more.

[-]bmg70

Since neural networks are universal function approximators, it is indeed the case that some of them will implement specific search algorithms.

I don't think this specific point is true. It seems to me like the difference between functions and algorithms is important. You can also approximate any function with a sufficiently large look-up table, but simply using a look-up table to choose actions doesn't involve search/planning.* In this regard, something like a feedforward neural network with frozen weights also doesn't seem importantly different than a look-up table to me.

One naive perspective: Systems like AlphaGo and MuZero do search, because they implement Monte-Carlo tree search algorithms, but if you were to remove their MCTS components then they simply wouldn't do search. Search algorithms can be used to update the weights of neural networks, but neural networks don't themselves do search.

I think this naive perspective may be wrong, because it's possible that recurrence is sufficient for search/planning processes to emerge (e.g. see this paper). But then, if that's true, I think that the power of recurrence is the important thing to emphasize, rather than the fact that neural networks are universal function approximators.

*I'm thinking of search algorithms as cognitive processes, rather than input-output behaviors (which could be produced via a wide range of possible algorithms). If you're thinking of them as behaviors, then my point no longer holds. Although I've interpreted the mesa-optimization paper (and most other discussions of mesa-optimization) as talking about cognitive processes.

(I somehow didn't notice your comment until now.) I believe you are correct. The theorem for function approximation I know also uses brute force (i.e., large networks) in the proof, so it doesn't seem like evidence for the existence of [weights that implement algorithms].

(And I am definitely not talking about algorithms in terms of input/output behavior.)

I've changed the paragraph into

You might wonder what the space of all models looks like. The typical answer is that the possible models are sets of weights for a neural network. The problem exists insofar as some sets of weights implement specific search algorithms.

Anyone who knows of alternative evidence I can point to here is welcome to reply to this comment.

Planned summary for the Alignment Newsletter:

This post summarizes and makes accessible the paper <@Risks from Learned Optimization in Advanced Machine Learning Systems@>.

but _man_y utility

This is a typo, maybe stray markdown formatting?

stray markdown formatting?

Yeah, most likely. Fixed it; thanks!

The images in this post seem to be broken.

Thanks. It's because directupload often has server issues. I was supposed to rehost all images from my posts to a more reliable host, but apparently forgot this one. I'll fix it in a couple of hours.

Internalization would require rebuilding [the model of the Base Objective] in full

I don't understand why. If the model of the base objective is already present somewhere in the weights, con't we just reuse that?

If it's present in the weights, the model has already done internalization. The question is, why would it do that in the first place?

This post is thoroughly excellent, a good summary and an important service!

However, the big caveat here is that evolution does not implement Stochastic Gradient Descent.

I came here to say that in fact they are quite analogous after all

I'm having trouble understanding how the maze example is different from the cat example. The maze AI was trained on a set of mazes that had a red door along the shortest path, so it learned to go to those red doors. When it was deployed on a different set of mazes, the goal it had learned didn't match up with the goal its programmers wanted it to have. This seems like the same type of out-of-distribution behavior that you illustrated with the AI that learned to look for white animals rather than cats.

You presented the maze AI as different from the cat AI because it had an outer goal of "find the shortest path through the maze" and implemented that goal by iterating the inner goal of "breadth-first search for a red door". The inner goal is aligned with the outer goal for all training mazes, but not for the real mazes. But couldn't you frame the cat AI the same way? Maybe it has an outer goal of "check for a cat" and it implements that with an inner goal of "divide the image into a set of shapes that each contain only colors within [margin] of the average color. If there is at least one shape that's within [margin] of white and has [shape] return yes, otherwise return no."

How is the maze AI fundamentally different from the cat AI? Why is the inner/outer alignment model of thinking about an AI system more useful than thinking about it as a single optimizer that was trained on a flawed distribution?

(This is the second time someone asks this, so the fault is probably with the post and I should edit it somehow.)

The difference is that the maze AI is running a search. (The classifier isn't; it's just applying a bunch of rules.) This matters because that's where the whole thing gets dangerous. If you get the last part on deceptive and proxy alignment, those concepts only make sense once we're in the business of optimizing, i.e., running a search for actions that score well according to some utility function. In that setting, it makes sense to think of the inner thing as an "optimizer" or "agent" that has goals/wants things/etc.

What's the conceptual difference between "running a search" and "applying a bunch of rules"? Whatever rules the cat AI is applying to the image must be implemented by some step-by-step algorithm, and it seems to me like that could probably be represented as running a search over some space. Similarly, you could abstract away the step-by-step understanding of how breadth-first search works and say that the maze AI is applying the rule of "return the shortest path to the red door".

Yeah, very good question. The honest answer is that I don't know; I had this distinction in mind when I wrote the post, but pressed with it, I don't know if there's a simple way to capture it. Someone on the AstralCodexTen article just asked the same, and the best I came up with is "the set of possible outputs is very large and contains harmful elements". This would certainly be a necessary criterion; if every output is harmless, the system can't be dangerous. (GPT already fails this.)

But even if there is no qualitative step, you can view it as a spectrum of competence, and deceptive/proxy alignment start being a possibility at some point on the spectrum. Not having the crisp characterization doesn't make the dangerous behavior go away.

I like this thread; I think it represents an important piece of the puzzle, and I'm hoping to write something more detailed on it soon, but here's a brief one.

My take is roughly: search/planning is one important ingredient of 'consequentialism' (in fact it is perhaps definitional, the way I understand consequentialism). When you have a consequentialist adversary (with strategic awareness[1]), you should (all other things equal) expect it to be more resilient to your attempts to put it out of action. Why? An otherwise similarly-behaved-in-training system which isn't a consequentialist must have learned some heuristics during training. Those heuristics will sometimes be lucky and rest on abstractions which generalise some amount outside of the training distribution. But the further from the training distribution, the more vanishingly-probable it is that the abstractions and heuristics remain suitable. So you should be more optimistic about taking it down if you need to. In contrast, a consequentialist system can refine or even replace its heuristics in response to changes in inputs (by doing search/planning).

Another (perhaps independent?) ingredient is the ability to refine and augment the abstractions and world/strategic model on which the planning rests (play/experimentation). I would be even more pessimistic about a playful consequentialist adversary, because I'd expect its consequentialism to keep working even further (perhaps indefinitely far) outside the training distribution, given the opportunity to experiment.


  1. roughly I mean 'knows about humans and ways to interact with and influence them' https://www.alignmentforum.org/posts/HduCjmXTBD4xYTegv/draft-report-on-existential-risk-from-power-seeking-ai (and see some discussion here https://www.alignmentforum.org/posts/cCMihiwtZx7kdcKgt/comments-on-carlsmith-s-is-power-seeking-ai-an-existential) ↩︎

it might contain over 101000000 candidates

This seems like an oddly specific number; is it supposed to be ?

If so, why is it such a small space? If the model accepts 24-bit, 1000x1000 pixel images and has to label them all as "cat" or "no cat", there should be  possible models.

is it supposed to be ?

Yes it is! This must have happened when I changed the editor to markdown. Thanks.

Why is it small? Well, the point of that sentence is that it's infeasible to try them all, so I just made up some large number (that I knew was definitely not too large). I'd say it's pedagogically preferable to avoid double-exponentiation.

Minor:

(If you don't know what depth-first search means: as far as mazes are concerned, it's simply the "always go left" rule.)

I was confused for a while, because my interpretation of "always go left" doesn't involve backing up (instead, when you get to a wall on the left, you just keep walking into it forever).

Another small nitpick: the difference, if any, between proxy alignment and corrigibility isn't explained. The concept of proxy alignment is introduced in subsection "The concept" without first defining it.

Thanks. (I put "look at your comments" on my todo list when you posted them a week ago, then totally forgot, so it's nice to have a reminder.)

Instead of "always go left", how about "always go along one wall"?

With respect to proxy vs. corrigibility, I'll have to try if I can figure out whether I had a good reason to use both terms there because right now it seems like introducing corrigibility is unnecessary. I don't think there is a difference.

Instead of "always go left", how about "always go along one wall"?

Yeah, maybe better, though still doesn't quite capture the "backing up" part of the algorithm. Maybe "I explore all paths through the maze, taking left hand turns first, backing up if I reach a dead end"... that's a bit verbose though.

I don't think there is a difference.

Gotcha

though still doesn't quite capture the "backing up" part of the algorithm

It doesn't? Isn't it exactly the same, at least provided the wall is topologically connected? I believe in the example I've drawn, going along one wall is identical to depth first search.

Edit: or do you just mean that even though you take the same steps, the two feel different because retreating =/= going further along the wall

Edit: or do you just mean that even though you take the same steps, the two feel different because retreating =/= going further along the wall

Yeah, this — I now see what you were getting at!

I'm still a bit confused by the difference between inner alignment and out-of-distribution generalization. What's the fundamental difference between the cat-classifying problem and the maze problem. The model itself is an optimizer for the latter? But why this is any special?

What if the neural network used to solve the maze problem just learns a mapping (but doesn't do any search)? Is that still an inner-alignment problem?

Inner Alignment is only defined for mesa optimizers (i.e., models that run a search). So the answer to your second paragraph is no, it wouldn't.

Why is this special? Speaking from my own understanding here, in a nutshell because the optimization process is where most of the risk comes in, for two reasons. One is that a system not running an optimization process probably can't do things with large-scale negative consequences, no matter how terribly "misaligned" it is. This is why comprehensive AI services (a model for the future where we build lots of narrow systems that don't run searches) arguably has the potential to avoid x-risk (the problem here being competitiveness). And the other is that you can't get the behavior where a misaligned model appears to work great because it's doing what you want for instrumental reasons (i.e., deceptive alignment) without the inner optimizer.

I would agree that distributional shift isn't all that different from inner alignment on a conceptual level.

Author here. One thing I think I've done wrong in the post is to equate black-box-search-in-large-parametrized-space with all of machine learning. I've now added this paragraph at the end of chapter 1:

Admittedly, the inner alignment model is not maximally general. In this post, we've looked at black box search, where we have a parametrized model and do SGD to update the parameters. This describes most of what Machine Learning is up to in 2020, but it does not describe what the field did pre-2000 and, in the event of a paradigm shift similar to the deep learning revolution, it may not describe what the field looks like in the future. In the context of black box search, inner alignment is a well-defined property and Venn-Diagram a valid way of slicing up the problem, but there are people who expect that AGI will not be built that way.[1] There are even concrete proposals for safe AI where the concept doesn't apply. Evan Hubinger has since written a follow-up post about what he calls "training stories", which is meant to be "a general framework through which we can evaluate any proposal for building safe advanced AI".

I also converted the post to markdown, mostly for the footnotes (the previous version just had little superscripts written via the math mode).


  1. If an AGI does contain more hand-coded parts, the picture gets more complicated. E.g., if a system is logically separated into a bunch of components, inner alignment may apply to some of the components but not others. It may even apply to parts of biological systems, see e.g., Steven Byrne's Inner Alignment in the brain. ↩︎

In Machine Learning slang, we would say that the training distribution is different from the distribution in deployment

Distribution Shifts and environmental changes

Even if the model's world model (gray blob) improves to the point where it recognizes this failure, it will not correct the pointer because the pointer's current target is the Mesa Objective.

I don't understand. 

Say we're at the stage where the model only has a bunch of heuristics as its utility function. Then SGD comes and modifies the model to replace the heuristics with a pointer somewhere into the world model, say to place A. Here, A should look kind of like the Base Objective.

Now the model gets smarter, updates its world model, and comes up with a better representation of the base objective somewhere else in the world model, say at place B. It could now choose to change the pointer to point to B instead of A. But ( ... this is the point the quoted part made ...) it has no reason to do this. Its current utility function is the thing at place A; the thing at place B is different.

So the pointer isn't going to change by itself when the model processes more input. Though it could always change when SGD modifies the model.

Internalization would require rebuilding [the model of the Base Objective] in full

Maybe I wouldn't say "rebuilding" but for me, this would be like inserting the search module next to [the model of the Base Objective]. This would be much cheaper than rebuilding the whole model of the base objective.

Isn't this just "Humans are adaptation-executors, not utility-maximizers", but applied to AI to say that an AI using heuristics that successfully hit a target in environment X may not continue that target if the environment changes?