Many approaches to AI alignment require making assumptions about what humans want. On a first pass, it might appear that inner alignment is a sub-component of AI alignment that doesn't require making these assumptions. This is because if we define the problem of inner alignment to be the problem of how to train an AI to be aligned with arbitrary reward functions, then a solution would presumably have no dependence on any particular reward function. We could imagine an alien civilization solving the same problem, despite using very different reward functions to train their AIs.

Unfortunately, the above argument fails because aligning an AI with our values requires giving the AI extra information that is not encoded directly in the reward function (under reasonable assumptions). The argument for my thesis is subtle, and so I will break it into pieces.

First, I will more fully elaborate what I mean by inner alignment. Then I will argue that the definition implies that we can't come up with a full solution without some dependence on human values. Finally, I will provide an example, in order to make this discussion less abstract.

Characterizing inner alignment

In the last few posts I wrote (1, 2), I attempted to frame the problem of inner alignment in a way that wasn't too theory-laden. My concern was that the previous characterization depended on a particular training outcome in which the AI uses an explicit outer loop to evaluate strategies based on an explicit internal search.

In the absence of an explicit internal objective function, it is difficult to formally define whether an agent is "aligned" with the reward function that is used to train it. We might therefore define alignment as the ability of our agent to perform well on the test distribution. However, if the test set is sampled from the same distribution as the training data, this definition is equivalent to the performance of a model in standard machine learning, and we haven't actually defined the problem in a way that adds clarity.

What we really care about is whether our agent performs well on a test distribution that doesn't match the training environment. In particular, we care about the agent's performance during real-world deployment. We can estimate this real-world performance ahead of time by giving the agent a test distribution artificially selected to reflect important aspects of the real world more closely than the training distribution does (e.g. by using relaxed adversarial training).

To distinguish the typical robustness problem from inner alignment, we evaluate the agent on this testing distribution by observing its behavior and evaluating it very negatively if it does something catastrophic (defined as something so bad we'd prefer the agent to fail completely). This information is used to iterate on future versions of the agent. An inner-aligned agent is therefore defined as an agent that avoids catastrophes during testing.
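As a rough illustration of this evaluation procedure, here is a minimal sketch in Python (the function names, the penalty value, and the idea of passing in the episode runner and catastrophe predicate are my own assumptions, not part of any particular proposal):

```python
def evaluate_on_test_distribution(agent_policy, test_envs, run_episode, is_catastrophic):
    """Run the agent on a test distribution chosen to stress real-world
    situations, and score any episode we would consider catastrophic very
    negatively. The `is_catastrophic` predicate is exactly the value-laden
    judgement discussed later in this post."""
    CATASTROPHE_PENALTY = -1e6  # "so bad we'd prefer the agent to fail completely"
    outcomes = []
    for env in test_envs:
        trajectory = run_episode(agent_policy, env)
        penalty = CATASTROPHE_PENALTY if is_catastrophic(trajectory) else 0.0
        outcomes.append((env, penalty))
    return outcomes
```

The results of a loop like this are then used to iterate on the next version of the agent.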

The reward function doesn't provide enough information

Since reward functions are defined as mappings from state-action pairs to real numbers, the agent doesn't actually get enough information from the reward function alone to infer what good performance means on the test. This is because the test distribution contains states that were not available in the training distribution.
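To make this concrete, here is a minimal sketch (the states, actions, and reward values are made up): the training signal pins down a reward only on the state-action pairs that actually occurred during training, and is silent about deployment-only states.

```python
from typing import Dict, Tuple

State, Action = str, str

# Rewards observed during training, stored as a simple lookup table.
observed_rewards: Dict[Tuple[State, Action], float] = {
    ("red_flagged_pad", "land"): 1.0,
    ("empty_ground", "land"): -1.0,
}

def reward_from_training_data(state: State, action: Action) -> float:
    """Return the reward if this (state, action) pair was seen during training.
    For unseen pairs the training signal simply has nothing to say, and any
    answer requires extrapolation of some kind."""
    key = (state, action)
    if key not in observed_rewards:
        raise KeyError(f"No training data for {key}: extrapolation required")
    return observed_rewards[key]

# A deployment-only state, e.g. a red crater with no flags, never appeared in
# training, so nothing in the data determines what its reward "should" be:
# reward_from_training_data("red_crater_no_flags", "land")  -> KeyError
```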

Therefore, no matter how much the agent learns about the true reward function during training, it must perform some implicit extrapolation of the reward function to what we intended, in order to perform well on the test we gave it.

We can visualize this extrapolation by imagining asking a supervised learner to predict outputs for inputs beyond the range covered by its training set. It is forced to guess what rule determines the function's behavior outside of that range.
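The analogy can be made concrete with a small sketch (the data and the choice of model classes are arbitrary): two models that fit the same training range about equally well can disagree sharply outside it.

```python
import numpy as np

# Toy data: inputs confined to [0, 1], labels follow a simple noisy trend there.
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=50)
y_train = x_train + 0.05 * rng.normal(size=50)

# Two hypotheses that both fit the training range about equally well...
linear = np.polyfit(x_train, y_train, deg=1)
cubic = np.polyfit(x_train, y_train, deg=3)

# ...but will generally give different answers once we leave that range.
x_test = 5.0  # far outside the training range
print(np.polyval(linear, x_test), np.polyval(cubic, x_test))
```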

One might assume that we could simply use simplicity as the criterion for extrapolation. Perhaps we could say, formally, that the simplest possible reward function consistent with the values observed during training is the "true reward" function that we will use to test the agent. The problem of inner alignment would then reduce to the problem of creating an agent that can infer this true reward function from data and then perform well according to it in general environments. Framing the problem like this would minimize dependence on human values.

There are a number of problems with that framing, however. To start, there are boring problems associated with using simplicity to extrapolate the reward function: any notion of simplicity is language-dependent, simplicity measures are often uncomputable, and the universal prior is malign. Beyond these (arguably minor) issues, there's a deeper issue that forces us to make assumptions about human values in order to ensure inner alignment.

Since we assumed that the training environment is necessarily different from the testing environment, we cannot possibly provide the agent, during training, with information about every scenario we consider catastrophic. Therefore, the metric we use to judge the agent's success during testing is not captured by the training data alone. We must introduce additional information about what we consider catastrophic. This information comes in the form of our own preferences, since we prefer the agent to fail in some ways rather than others.

It's also important to note that if we actually did provide the agent with the exact same data during training as it would experience during deployment, this is equivalent to simply letting the agent learn in the real world, and there would be no difference between training and testing. Since we normally assume providing such a perfect environment is either impossible or unsafe, the considerations in that case become quite different.

An example

I worry my discussion was a bit too abstract to be useful, so I'll provide a specific example to make my thinking concrete. Consider the lunar lander example that I provided in the last post.

To reiterate, we train an agent to land on a landing pad, but during training there is a perfect correlation between whether a patch of ground is painted red and whether it is a real landing pad.

During deployment, if the "true" factor determining whether a patch of ground is a landing pad is whether it is enclosed by flags, and some faraway crater happens to be painted red, then the agent might veer off into the crater rather than landing on the landing pad.

Since there is literally not enough information during training to infer which property correctly determines whether a patch of ground is a landing pad, the agent is forced to guess whether it's the flags or the red paint. It's not exactly clear what the "simplest" inference is here, but it's coherent to imagine that "red paint determines whether something is a landing pad" is the simplest inference.
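To make the ambiguity explicit, here is a minimal sketch (the feature names are my own stand-ins): two rules that classify every training example identically, yet disagree on a deployment-only case.

```python
from dataclasses import dataclass

@dataclass
class Patch:
    painted_red: bool
    enclosed_by_flags: bool

# Training distribution: the two features are perfectly correlated.
training_patches = [Patch(True, True), Patch(False, False)]

# Two hypotheses about what makes something a landing pad.
def red_rule(p: Patch) -> bool:
    return p.painted_red

def flag_rule(p: Patch) -> bool:
    return p.enclosed_by_flags

# Both rules agree on every training patch...
assert all(red_rule(p) == flag_rule(p) for p in training_patches)

# ...but disagree on a deployment-only case: a red crater with no flags.
red_crater = Patch(painted_red=True, enclosed_by_flags=False)
print(red_rule(red_crater), flag_rule(red_crater))  # True False
```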

As humans, we might prefer the flags to be the true determinant, since that resonates more with what we think a landing pad should be; whether something is painted red is not nearly as compelling to us.

The important point is to notice that our judgement here is determined by our preferences, not by something the agent could have learned during training via some value-neutral inference. The agent must make further assumptions about human preferences in order to consistently perform well during testing.


  1. You might wonder whether we could define catastrophe in a completely value-independent way, sidestepping this whole issue. This is the approach implicitly assumed by impact measures. However, if we want to avoid all types of situations where we'd prefer the system fail completely, I think this will require a different notion of catastrophe than "something with a large impact." Furthermore, we would not want to penalize systems for having a large positive impact.
Comments

I (low-confidence) think that there might be a "choose two" wrt impact measures: large effect, no ontology, no/very limited value assumptions. I see how we might get small good effects without needing a nice pre-specified ontology or info. about human values (AUP; to be discussed in upcoming Reframing Impact posts). I also see how you might have a catastrophe-avoiding agent capable of large positive impacts, assuming an ontology but without assuming a lot about human preferences.

I know this isn't saying why I think this yet, but I'd just like to register this now for later discussion.

I also see how you might have a catastrophe-avoiding agent capable of large positive impacts, assuming an ontology but without assuming a lot about human preferences.

I find this interesting but I'd be surprised if it were true :). I look forward to seeing it in the upcoming posts.

That said, I want to draw your attention to my definition of catastrophe, which I think is different than the way most people use the term. I think most broadly, you might think of a catastrophe as something that we would never want to happen even once. But for inner alignment, this isn't always helpful, since sometimes we want our systems to crash into the ground rather than intelligently optimizing against us, even if we never want them to crash into the ground even once. And as a starting point, we should try to mitigate these malicious failures much more than the benign ones, even if a benign failure would have a large value-neutral impact.

A closely related notion to my definition is the term "unacceptable behavior" as Paul Christiano has used it. This is the way he has defined it,

In different contexts, different behavior might be acceptable and it’s up to the user of these techniques to decide. For example, a self-driving car trainer might specify: Crashing your car is tragic but acceptable. Deliberately covering up the fact that you crashed is unacceptable.

It seems like if we want to come up with a way to avoid these types of behavior, we simply must use some dependence on human values. I can't see how to consistently separate acceptable failures from non-acceptable ones except by inferring our values.

It seems like if we want to come up with a way to avoid these types of behavior, we simply must use some dependence on human values. I can't see how to consistently separate acceptable failures from non-acceptable ones except by inferring our values.

I think people should generally be a little more careful about saying "this requires value-laden information". First, while a certain definition may seem to require it, there may be other ways of getting the desired behavior, perhaps through reframing. Building an AI which only does small things should not require the full specification of value, even though it seems like you have to say "don't do all these bad things we don't like"!

Second, it's always good to check "would this style of reasoning lead me to conclude solving the easy problem of wireheading is value-laden?":

This isn't an object-level critique of your reasoning in this post, but more that the standard of evidence is higher for this kind of claim.

I think it's not that the reward function is insufficient, it's the deeper problem that the situation is literally undefined. Can you explain why you think there _IS_ a "true" factor? Not "can a learning system find it", but "is there something to find"? If all known real examples have flags, flatness, and redness 100% correlated, there is no real preference for which one to use in the (counterfactual) case where they diverge. This isn't sampling error or bias, it's just not there.

I'll note that we are using the term "human values" as if all humans had the same values. Even in fairly trivial cases, humans can differ in what tradeoffs they'll accept. E.g., Adam gets food at a convenience store because it's convenient, Beth goes to Whole Foods for healthy* foods, and Chad goes to Walmart because he's cheap. All of them value convenience, nutrition, and cost, but to varying degrees.

*And with varying levels of information and disinformation about the actual nutritional needs of their bodies.

Can you explain why you think there _IS_ a "true" factor

Apologies for the miscommunication, but I don't think there really is an objectively true factor. It's true to the extent that humans say that it's the true reward function, but I don't think it's a mathematical fact. That's part of what I'm arguing. I agree with what you are saying.

Is your point mostly centered around there being no single correct way to generalize to new domains, but humans have preferences about how the AI should generalize, so to generalize properly, the AI needs to learn how humans want it to do generalization?

The above sentence makes lots of sense to me, but I don't see how it's related to inner alignment (it's just regular alignment), so I feel like I'm missing something.

Is your point mostly centered around there being no single correct way to generalize to new domains, but humans have preferences about how the AI should generalize, so to generalize properly, the AI needs to learn how humans want it to do generalization?

Pretty much, yeah.

The above sentence makes lots of sense to me, but I don't see how it's related to inner alignment

I think there are a lot of examples of this phenomenon in AI alignment, but I focused on inner alignment for two reasons:

  • There's a heuristic that a solution to inner alignment should be independent of human values, and this argument rebuts that heuristic.
  • The problem of inner alignment is pretty much the problem of how to get a system to properly generalize, which makes "proper generalization" fundamentally linked to the idea.

Suppose that in building the AI, we make an explicitly computable, hardcoded value function. For instance, if you want the agent to land between the flags, you might write an explicit, hardcoded function that returns 1 if the lander is between a pair of yellow triangles, else 0.

In standard machine learning, information is lost because you have a full value function but only train the network on evaluations of that function at a finite number of points.

Suppose I don't want the lander to land on the astronaut, who is wearing a blue spacesuit. I write code that says that any time there is a blue pixel below the lander, the utility is -10.

Suppose that there are no astronauts in the training environment, in fact nothing blue whatsoever. A system trained with an architecture that relies only on the utility of what it sees during training would not know this rule. A system that can take the code and read it would spot this information, but might not care about it. A system that generates potential actions, predicts what the screen would look like if it took those actions, and then sends that prediction to the hardcoded utility function, with automatic shutdown if the utility is negative, would avoid this problem.
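A minimal sketch of the setup described in this comment (the pixel format, the notion of "blue", and the exact shutdown rule are all illustrative assumptions rather than a claim about any particular implementation):

```python
import numpy as np

def hardcoded_utility(screen: np.ndarray, lander_row: int) -> float:
    """Return -10 if any pixel below the lander is blue, else 0.
    `screen` is assumed to be an (H, W, 3) RGB array, and "blue" is taken to
    mean the blue channel dominates the other two."""
    below = screen[lander_row + 1:]  # everything under the lander
    blue = (below[..., 2] > below[..., 0]) & (below[..., 2] > below[..., 1])
    return -10.0 if np.any(blue) else 0.0

def choose_action(candidate_actions, predict_screen, lander_row):
    """Generate candidate actions, predict the resulting screen for each one,
    score the predictions with the hardcoded utility function, and shut down
    (return None) if every candidate is scored negatively."""
    scored = [(a, hardcoded_utility(predict_screen(a), lander_row))
              for a in candidate_actions]
    acceptable = [(a, u) for a, u in scored if u >= 0]
    if not acceptable:
        return None  # automatic shutdown
    return max(acceptable, key=lambda au: au[1])[0]
```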

If, hypothetically, I can take any programmed function f: observations -> reward and make a machine learning system that optimizes that function, then inner alignment has been solved.