This is the first post in a series where I'll explore AI alignment in a simplified setting: a neural network that's being trained by gradient descent. I'm choosing this setting because it involves a well-defined optimization process that has enough complexity to be interesting, but that's still understandable enough to make crisp mathematical statements about. As a result, it serves as a good starting point for rigorous thinking about alignment.

Defining inner alignment

First, I want to highlight a definitional issue. Right now there are two definitions of inner alignment circulating in the community. This issue was first pointed out to me by Evan Hubinger in a recent conversation.

The first definition is the one from last year's Risks from Learned Optimization paper, which Evan co-authored and which introduced the term. This paper defined the inner alignment problem as "the problem of eliminating the base-mesa objective gap" (Section 1.2). The implication is that if we can eliminate the gap between the base objective of a base optimizer, and the mesa-objectives of any mesa-optimizers that base optimizer may give rise to, then we will have satisfied the necessary and sufficient conditions for the base optimizer to be inner-aligned.

There's also a second definition that seems to be more commonly used. This definition says that "inner alignment fails when your capabilities generalize but your objective does not". This comes from an intuition (pointed out to me by Rohin Shah) that the combination of inner alignment and outer alignment should be accident-proof with respect to an optimizer's intent: an optimizer that's both inner- and outer-aligned should be trying to do what we want. Since an outer-aligned optimizer is one whose base objective is something we want, this intuition suggests that the remaining part of the intent alignment problem — the problem of getting the optimizer to try to achieve the base objective we set — is what inner alignment refers to.

Here I'll try to propose more precise definitions of alignment and capability in an optimizer, and explore what generalization and robustness might mean in the context of these properties. I'll also propose ways to quantify the capability and alignment profiles of existing ML systems.

But before doing that, I want to motivate these definitions with an example.

The base objective

The optimizer I'll be using as my example will be a gradient descent process, which we're going to apply to train a simplified neural network. I want to emphasize that I'm treating gradient descent as the optimizer here — not the neural network. The neural network isn't necessarily an optimizer itself, it's just the output artifact of our gradient descent optimizer.

To make this scenario concrete, we'll imagine the neural network we're training is a simplified language model: a feedforward MLP with a softmax layer at the top. The softmax layer converts the MLP's output activations into a probability distribution over next words, and the model gets scored on the cross-entropy loss between that probability distribution, and the actual next word that appears in the training text. (This ignores many of the complications of modern language models, but I'm keeping this example simple.)

We'll let $\theta_t$ represent all the parameters of this MLP — all its weights and biases — at training step $t$. To train our MLP with gradient descent, we feed it batches of input-output pairs $(\mathbf{x}_i, \mathbf{y}_i)$. If our MLP is part of a language model, then $\mathbf{x}_i$ might represent the words in the language model's context window for the $i$th training example in the batch, and $\mathbf{y}_i$ might represent a one-hot encoding of the correct next word for the $i$th training example in the batch. To make things even simpler, I'm also going to assume that every training batch contains the entire training dataset of $N$ examples, an arrangement we'd never use if we were training a real system.

So at a given training step $t$, the loss function for our language model is

$$L(\theta_t) \;=\; -\frac{1}{N} \sum_{i=1}^{N} \mathbf{y}_i \cdot \log f(\mathbf{x}_i; \theta_t)$$

I'll refer to the function $f(\mathbf{x}; \theta_t)$ as “the neural network”. Here, “$\cdot$” is the dot product.
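To make this concrete, here's a minimal numpy sketch of the setup, assuming a tiny two-layer MLP; the layer sizes, dataset shapes, and variable names are purely illustrative stand-ins rather than anything from a real language model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only: context encoding size, hidden width, vocabulary size.
D_IN, D_HIDDEN, VOCAB = 8, 16, 10
N = 32  # the full training set, fed as a single batch

# theta: all the weights and biases of a tiny two-layer MLP.
theta = {
    "W1": rng.normal(0, 0.1, (D_IN, D_HIDDEN)), "b1": np.zeros(D_HIDDEN),
    "W2": rng.normal(0, 0.1, (D_HIDDEN, VOCAB)), "b2": np.zeros(VOCAB),
}

def f(x, theta):
    """The neural network: an MLP followed by a softmax over next-word probabilities."""
    h = np.tanh(x @ theta["W1"] + theta["b1"])
    logits = h @ theta["W2"] + theta["b2"]
    logits -= logits.max(axis=-1, keepdims=True)  # for numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)

def loss(theta, X, Y):
    """Batch cross-entropy L(theta): -1/N * sum_i y_i . log f(x_i; theta)."""
    P = f(X, theta)
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=-1))

X = rng.normal(size=(N, D_IN))                     # stand-in for encoded contexts
Y = np.eye(VOCAB)[rng.integers(0, VOCAB, size=N)]  # one-hot correct next words
print(loss(theta, X, Y))  # roughly log(VOCAB) at a random initialization
```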

Notice that $L(\theta_t)$ here is our base objective: it's the quantity we're trying to get our gradient descent process to optimize for. If we'd succeeded in solving the entire outer alignment problem, and concluded that the base objective $L$ was the only quantity we cared about optimizing, then the remaining challenge — getting our gradient descent process to actually optimize for $L$ — would constitute the inner alignment problem, by our second definition above.

So the question now is: under what conditions does gradient descent actually optimize for our base objective?

The true objective

To answer this, we can try to determine which quantity gradient descent is truly optimizing for, and then look at how and when that quantity correlates with the base objective we really care about.

We can start by imagining the $t$th step of gradient descent as applying a learning function $U$ to the parameters in $\theta_t$:

$$\theta_{t+1} \;=\; U(\theta_t)$$

Running gradient descent consists of applying $U$ repeatedly to $\theta_0$:

$$\theta_T \;=\; \underbrace{U(U(\cdots U}_{T \text{ times}}(\theta_0)\cdots)) \;=\; U^T(\theta_0)$$

In the long run, gradient descent should converge on some terminal value $\theta_\infty \equiv \lim_{t \to \infty} \theta_t$. (For now, we'll assume that this limit exists.)
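As a sketch of what running gradient descent to (approximate) convergence means here, the loop below applies a learning function repeatedly and stops once the parameters stop changing. The quadratic example loss, the tolerance, and the function names are just illustrative choices.

```python
import numpy as np

def run_to_fixed_point(U, theta0, max_steps=100_000, tol=1e-10):
    """Apply the learning function U repeatedly: theta_{t+1} = U(theta_t).

    Stops early once the parameters (approximately) stop changing, i.e. once we
    reach an approximate fixed point of U.
    """
    theta = theta0
    for t in range(max_steps):
        theta_next = U(theta)
        if np.linalg.norm(theta_next - theta) < tol:
            return theta_next, t
        theta = theta_next
    return theta, max_steps

# Example: a quadratic loss L(theta) = theta^2, so one gradient step is
# U(theta) = theta - alpha * 2 * theta.
alpha = 0.1
U = lambda th: th - alpha * 2.0 * th
theta_inf, steps = run_to_fixed_point(U, np.array([3.0]))
print(theta_inf, steps)  # converges to the fixed point theta = 0
```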

The key characteristic of a terminal value $\theta_\infty$ (when it exists) is that it's a fixed point of the dynamical system defined by $U$. In other words:

$$U(\theta_\infty) \;=\; \theta_\infty$$

Some of the fixed points $\theta_\infty$ of this system will coincide with global or local minima of our base objective, the cross-entropy loss $L$ — but not all of them. Some will be saddle points, while others will be local or global maxima. And while we don't consider all these fixed points to be equally performant with respect to our base objective, our gradient descent optimizer does consider them all to be equally performant with respect to its true objective.

This disagreement is the core of the inner alignment problem in this setting: our gradient descent process isn't always optimizing for the quantity we want it to. So what quantity is it optimizing for?

When we apply one step of gradient descent, we update each parameter in our neural network by an amount equal to a learning rate, times the error in that parameter that we calculate during backprop on the loss function $L(\theta_t)$. The update we apply to the $k$th parameter, to move it from $\theta^{(k)}_t$ to $\theta^{(k)}_{t+1}$, can be written as

$$\theta^{(k)}_{t+1} \;=\; \theta^{(k)}_t \;-\; \alpha_t \, \frac{\partial L(\theta_t)}{\partial \theta^{(k)}_t}$$

Here, $\alpha_t$ represents our learning rate at time step $t$.

So our gradient descent optimizer will terminate if and only if there exists some time step $T$ such that $\theta^{(k)}_{t+1} = \theta^{(k)}_t$ for all $t \geq T$, across all parameters $k$. (For a fixed learning function $U$, this condition implies that the gradient updates are zero for all $t \geq T$ as well.) And this happens if and only if the sum of the gradients

$$L_{\mathrm{true}}(\theta_t) \;=\; \sum_k \frac{\partial L(\theta_t)}{\partial \theta^{(k)}_t}$$

is equal to zero when $t \geq T$.

But $L_{\mathrm{true}}(\theta_t)$ represents more than just the terminal condition for our optimizer. It's the quantity that gradient descent is actually trying to minimize: anytime $L_{\mathrm{true}}(\theta_t)$ deviates from zero, the amount of optimization power that's applied to move $L_{\mathrm{true}}(\theta_t)$ towards zero is proportional to $L_{\mathrm{true}}(\theta_t)$ itself. That makes $L_{\mathrm{true}}(\theta_t)$ the true objective of our gradient descent optimizer — it's the loss function that gradient descent is actually optimizing for.
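To see the distinction numerically, here's a toy sketch that tracks both the base objective and this sum-of-gradients quantity during a training run, taking the expression above at face value. The two-parameter loss (whose minimum value is 5, so the base objective never reaches zero) and the numerical-gradient helper are purely illustrative.

```python
import numpy as np

def grad(L, theta, eps=1e-6):
    """Numerical gradient of a scalar loss L at parameter vector theta."""
    g = np.zeros_like(theta)
    for k in range(theta.size):
        d = np.zeros_like(theta)
        d[k] = eps
        g[k] = (L(theta + d) - L(theta - d)) / (2 * eps)
    return g

# Toy base objective whose minimum value is 5.0, so L itself never reaches zero.
L = lambda th: (th[0] - 1.0) ** 2 + (th[1] + 2.0) ** 2 + 5.0
L_true = lambda th: np.sum(grad(L, th))  # the "sum of the gradients" quantity above

theta, alpha = np.array([4.0, 4.0]), 0.1
for t in range(200):
    theta = theta - alpha * grad(L, theta)  # one step of gradient descent

# The base objective settles at 5.0; the true objective is what actually reaches zero.
print(L(theta), L_true(theta))
```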

So now we have a base objective $L$, which we've assigned to an optimizer; and we have a true objective $L_{\mathrm{true}}$, which is the one our optimizer is actually pursuing. Intuitively, the inner alignment of our optimizer seems like it would be related to how much, and under what circumstances, $L_{\mathrm{true}}$ correlates with $L$ over the course of a training run. So we'll look at that next.

Two examples

Let's now consider two optimizers, A and B. Optimizers A and B are identical apart from one difference: Optimizer A has its parameters initialized at $\theta^A_0$, while Optimizer B has its parameters initialized at $\theta^B_0$.

As luck would have it, this small difference is enough to put $\theta^A_0$ and $\theta^B_0$ into different basins of attraction of the loss function. As a result, our two optimizers end up in different terminal states:

$$\theta^A_\infty \;\neq\; \theta^B_\infty$$

These two terminal states also correspond — again, by luck in this example — to different values of the base objective. Indeed, it turns out that $\theta^A_0$ is in the basin of attraction of a global minimum of the loss function, while $\theta^B_0$ is in the basin of attraction of a local minimum. As a result, after many training steps, the base objectives of the two optimizers end up converging to different values:

$$\lim_{t \to \infty} L(\theta^A_t) \;<\; \lim_{t \to \infty} L(\theta^B_t)$$

Again, the limit of the loss function $L(\theta^A_t)$ is less than the limit of $L(\theta^B_t)$ because $\theta^A_\infty$ corresponds to a global minimum, while $\theta^B_\infty$ only corresponds to a local minimum. So Optimizer A is clearly better than Optimizer B, from the standpoint of its performance on our base objective — minimization of the loss function.

But crucially, because $\theta^A_\infty$ and $\theta^B_\infty$ both represent fixed points with zero gradients, the true objectives of the two optimizers both converge to zero in the limit:

$$\lim_{t \to \infty} L_{\mathrm{true}}(\theta^A_t) \;=\; \lim_{t \to \infty} L_{\mathrm{true}}(\theta^B_t) \;=\; 0$$

In other words, Optimizer A and Optimizer B are equally good at optimizing for their true objectives. Optimizer A just does a better job of optimizing for the base objective we want, as a side effect of optimizing for its true objective. Intuitively, we might say that Optimizers A and B are equally capable with respect to their true objectives, while Optimizer A is better aligned with our base objective than Optimizer B is.
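A toy one-dimensional version of this scenario makes the pattern easy to check. The tilted double-well loss and the two initializations below are my own illustration, not anything from a real training run: both optimizers end at points where the gradient vanishes, but only one of them ends at the global minimum.

```python
# A tilted double-well "loss": global minimum near theta = -1, local minimum near theta = +1.
L      = lambda th: (th ** 2 - 1.0) ** 2 + 0.3 * th
grad_L = lambda th: 4.0 * th * (th ** 2 - 1.0) + 0.3

def train(theta0, alpha=0.01, steps=5000):
    theta = theta0
    for _ in range(steps):
        theta -= alpha * grad_L(theta)
    return theta

theta_A = train(-1.5)  # Optimizer A: initialized in the global minimum's basin
theta_B = train(+1.5)  # Optimizer B: initialized in the local minimum's basin

print(L(theta_A), L(theta_B))            # base objective: A's limit is lower than B's
print(grad_L(theta_A), grad_L(theta_B))  # true objective: both gradients are ~0
```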

Let's look at a second example. This time we'll compare Optimizer A to a third optimizer, Optimizer C. These two optimizers are again identical, apart from one detail: while Optimizer A uses learning rate decay with $\lim_{t \to \infty} \alpha_t = 0$, Optimizer C uses a constant learning rate with $\alpha_t = \alpha_0 > 0$.

As a result of its learning rate decay schedule, Optimizer A converges on a global minimum in the $t \to \infty$ limit. But Optimizer C, with its constant learning rate, doesn't converge the same way. While it's drawn towards the same global minimum as Optimizer A, Optimizer C ends up orbiting the minimum point chaotically, without ever quite reaching it — its finite learning rate means it never perfectly hits the global minimum point, no matter how many learning steps we give it. As a result,

$$\lim_{t \to \infty} L_{\mathrm{true}}(\theta^A_t) \;=\; 0 \;<\; \lim_{t \to \infty} L_{\mathrm{true}}(\theta^C_t)$$

(To be clear, this is an abuse of notation: in reality $\lim_{t \to \infty} L_{\mathrm{true}}(\theta^C_t)$ generally won't be well-defined for a chaotic orbit like this. But we can think of this instead as denoting the long-term limit of the average of $L_{\mathrm{true}}(\theta^C_t)$ over a sufficiently large number of time steps.)

Intuitively, we might say that Optimizer A is more capable than Optimizer C, since it performs better, in the long run, on its true objective.

Optimizer A also performs better than Optimizer C on our base objective:

$$\lim_{t \to \infty} L(\theta^A_t) \;<\; \lim_{t \to \infty} L(\theta^C_t)$$

And interestingly, Optimizer A's advantage over C on our base objective is a direct result of its advantage over C on its true objective. So we might say that, in this second scenario, Optimizer C's performance on the base objective is capability-limited. If we improved C's capability on its true objective, we could get it to perform better on the base objective, too.
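Here's the learning-rate effect on its own, sketched with a deliberately simple stand-in loss (an absolute-value bowl rather than a cross-entropy, and a two-point bounce rather than a chaotic orbit): with a constant step size the iterates keep hopping around the minimum forever, while a decaying step size settles onto it. The schedules and numbers are arbitrary.

```python
import numpy as np

# Stand-in 1-D loss with its minimum (value 0) at theta = 0.
L      = lambda th: abs(th)
grad_L = lambda th: np.sign(th)

def train(lr_schedule, theta0=0.7, steps=10_000):
    theta, trace = theta0, []
    for t in range(steps):
        theta -= lr_schedule(t) * grad_L(theta)
        trace.append(L(theta))
    return np.mean(trace[-1000:])  # long-run average of the base objective

# Optimizer A: learning rate decays to zero.  Optimizer C: constant learning rate.
avg_loss_A = train(lambda t: 1.0 / (t + 1))
avg_loss_C = train(lambda t: 0.3)

print(avg_loss_A, avg_loss_C)  # A's long-run loss is ~0; C keeps bouncing around the minimum
```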

Capability and alignment

With those intuitions in hand, I'll propose the following two definitions.

 

Definition 1. Let $O$ be a base optimizer acting over $T$ optimization steps, and let $L_t$ represent the value of its base objective at optimization step $t$. Then the capability of $O$ with respect to the base objective $L$ is

$$\mathrm{Cap}(O) \;=\; \lim_{T \to \infty} \sum_{t=1}^{T} \left( L_{t-1} - L_t \right) \;=\; L_0 \,-\, \lim_{T \to \infty} L_T$$

 

Definition 2. Let $O_{\mathrm{base}}$ be a base optimizer with base objective $L_{\mathrm{base}}$, and $O_{\mathrm{mesa}}$ be a mesa-optimizer with mesa-objective $L_{\mathrm{mesa}}$. Then the mesa-optimizer's alignment with the base optimizer is given by

$$\mathrm{Cap}(O_{\mathrm{base}}) \;=\; A(O_{\mathrm{mesa}}, O_{\mathrm{base}}) \, \mathrm{Cap}(O_{\mathrm{mesa}})$$

If $\mathrm{Cap}(O_{\mathrm{base}})$ and $\mathrm{Cap}(O_{\mathrm{mesa}})$ are both finite, we can also write $O_{\mathrm{mesa}}$'s alignment with $O_{\mathrm{base}}$ as

$$A(O_{\mathrm{mesa}}, O_{\mathrm{base}}) \;=\; \frac{\mathrm{Cap}(O_{\mathrm{base}})}{\mathrm{Cap}(O_{\mathrm{mesa}})}$$

 

The intuition behind these definitions is that the capability $\mathrm{Cap}(O)$ of an optimizer is the amount by which the optimizer is able to improve its objective over many optimization steps. One way in which a base optimizer can try to improve its base objective is by delegating part of its optimization work to a mesa-optimizer, which has its own mesa-objective. The alignment factor $A(O_{\mathrm{mesa}}, O_{\mathrm{base}})$ in Definition 2 is a way of quantifying how effective that delegation is: to what extent does the mesa-optimizer's progress in optimizing for its mesa-objective "drag along" the base objective of the base optimizer that created it?

In our gradient descent example, our mesa-optimizer $O_{\mathrm{mesa}}$ was the gradient descent process, and its mesa-objective was what, at the time, I called the "true objective", $L_{\mathrm{true}}$. But the base optimizer $O_{\mathrm{base}}$ was the human who designed the neural network and ran the gradient descent process on it. If we think of this human as being our base optimizer, then we can write the capability of our human designer as

$$\mathrm{Cap}(O_{\mathrm{base}}) \;=\; A(O_{\mathrm{mesa}}, O_{\mathrm{base}}) \, \mathrm{Cap}(O_{\mathrm{mesa}})$$

In other words, if a base optimizer delegates its objective to a mesa-optimizer, then that base optimizer's capability is equal to the capability of that mesa-optimizer, times how well-aligned the mesa-optimizer is to the base optimizer's base objective. If you fully delegate a goal to a subordinate, your capability on that goal is the product of 1) how capable your subordinate is at achieving their own goals; and 2) how well-aligned their own goals are to the goal you delegated to them. This seems intuitively reasonable.
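Here's a minimal sketch of how these quantities could be computed from recorded training traces, assuming the total-improvement form of capability and the ratio form of alignment given above; the example traces are made-up numbers, purely for illustration.

```python
def capability(trace):
    """Capability as total improvement over a run: the telescoped sum of the
    per-step improvements L_{t-1} - L_t, i.e. L_0 - L_T."""
    return trace[0] - trace[-1]

def alignment(base_trace, mesa_trace):
    """Alignment factor: capability on the base objective divided by capability
    on the mesa-objective (assumes both are finite and the latter is nonzero)."""
    return capability(base_trace) / capability(mesa_trace)

# Made-up traces: a mesa-optimizer drives its own objective from 10.0 down to ~0,
# and in doing so drags the base objective from 3.0 down to 1.2.
mesa_trace = [10.0, 4.0, 1.0, 0.2, 0.01]
base_trace = [3.0, 2.4, 1.7, 1.3, 1.2]

cap_mesa = capability(mesa_trace)          # 9.99
align = alignment(base_trace, mesa_trace)  # ~0.18
print(cap_mesa, align, align * cap_mesa)   # the product recovers Cap(base) = 1.8
```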

But it also has a curiously unintuitive consequence in gradient descent. We tend to think that when we add neurons to an architecture, we're systematically increasing the capability of gradient descent on that architecture. But the definitions above suggest a different interpretation: because gradient descent might converge equally well on its true objective $L_{\mathrm{true}}$ on a big neural net as on a small one, its capability as an optimizer isn't systematically increased by adding neurons. Instead, adding neurons improves the degree to which gradient descent's progress on its true objective also drives progress on the base objective we care about: it improves alignment rather than capability.

Robustness and generalization

As I've defined them above, capability and alignment are fragile properties. Two optimizers $O_1$ and $O_2$ could be nearly identical, but still have very different capabilities $\mathrm{Cap}(O_1)$ and $\mathrm{Cap}(O_2)$. This is a problem, because the optimizers in our definitions are specified up to and including things like their datasets and parameter initializations. So something as minor as a slight change in dataset — which we should expect to happen often to real-world optimizers — could cause a big change in the capability of the optimizer, as we've defined it.

We care a lot about whether an optimizer remains capable when we perturb it in various ways, including running it on different datasets. We also care a lot about whether an optimizer with objective $L$ remains capable when we change its objective to something slightly different, like $L + \delta L$. And we also care about the extent to which the alignment between two optimizers is preserved when we perturb either optimizer. Below I'll define two properties that describe the degree to which optimizers retain their capability and alignment properties under perturbations.

 

Definition 3. Let $\mathrm{Cap}(O_1)$ be the capability of optimizer $O_1$, and let $A(O_1, O_2)$ be the alignment of optimizer $O_1$ with optimizer $O_2$. Let $\delta_1$ and $\delta_2$ be finite perturbations applied respectively to $O_1$ and $O_2$. Then, the capability of $O_1$ is robust under perturbation $\delta_1$ if

$$\mathrm{Cap}(O_1 + \delta_1) \;\approx\; \mathrm{Cap}(O_1)$$

Similarly, the alignment of $O_1$ with $O_2$ is robust under perturbations $\delta_1$ and $\delta_2$ if

$$A(O_1 + \delta_1,\, O_2 + \delta_2) \;\approx\; A(O_1, O_2)$$

 

Definition 4. Let $O_1$ be an optimizer with objective function $L_1$, and let $O_2$ be an optimizer with objective function $L_2$. Let $\delta L_1$ be a finite perturbation applied to $L_1$, such that the optimizer $O_1'$ differs from $O_1$ only in that its objective function is $L_1 + \delta L_1$ instead of $L_1$. Then, the capability of $O_1$ generalizes to objective $L_1 + \delta L_1$ if

$$\mathrm{Cap}(O_1') \;\approx\; \mathrm{Cap}(O_1)$$

Similarly, the alignment of $O_1$ with $O_2$ generalizes to objective $L_1 + \delta L_1$ if

$$A(O_1', O_2) \;\approx\; A(O_1, O_2)$$

 

Intuitively, we're defining a robustly capable optimizer as one whose capability isn't strongly affected by classes of perturbations that we care about — and we're defining robust alignment between two optimizers in an analogous way. We're also thinking of generalization as a special case of robustness, meaning specifically that the optimizer is robust to perturbations of its objective function. So an optimizer whose capabilities generalize is one that continues to work well when we give it a new objective.
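As a concrete, if crude, version of this, the sketch below perturbs the objective of a toy gradient descent run and checks whether capability, in the total-improvement sense used above, stays roughly the same. The loss, the perturbation, and the 10% tolerance are all arbitrary illustrative choices.

```python
def capability_of_run(L, grad_L, theta0, alpha=0.01, steps=5000):
    """Run gradient descent on L and return the total improvement L_0 - L_T."""
    theta = theta0
    L0 = L(theta)
    for _ in range(steps):
        theta -= alpha * grad_L(theta)
    return L0 - L(theta)

# Unperturbed objective, and a perturbed version with a small linear tilt added.
L1      = lambda th: (th ** 2 - 1.0) ** 2
grad_L1 = lambda th: 4.0 * th * (th ** 2 - 1.0)
eps     = 0.05
L2      = lambda th: L1(th) + eps * th
grad_L2 = lambda th: grad_L1(th) + eps

cap   = capability_of_run(L1, grad_L1, theta0=1.5)
cap_p = capability_of_run(L2, grad_L2, theta0=1.5)

# Crude check: call the capability "robust" to this perturbation if the relative change is small.
print(cap, cap_p, abs(cap - cap_p) / abs(cap) < 0.1)
```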

Quantifying inner alignment

With the vocabulary above, we can now define inner alignment more precisely, and even think about how to quantify it in real systems. We might say that a mesa-optimizer $O_{\mathrm{mesa}}$ is inner-aligned with its base optimizer $O_{\mathrm{base}}$ if its alignment factor $A(O_{\mathrm{mesa}}, O_{\mathrm{base}})$ remains robustly high under the variations in dataset that we expect either optimizer to encounter in the future. We can also quantify inner alignment by looking at how much specific variations in the data distribution affect the alignment factor between two optimizers.

We might also be interested in investigating other properties that could affect inner alignment from a safety perspective. For example, under what conditions will alignment between a base optimizer and a mesa-optimizer generalize well to a new base objective? What kinds of perturbations to our optimizers are likely to yield breakdowns in robustness? As we add capacity to a deep learning model, should we expect alignment to improve? And if so, should we expect an inflection point in this improvement — a level of capacity beyond which alignment declines sharply? How could we detect and characterize an inflection point like this? These are some of the topics I'll be exploring in the future.

Terminal states and transients

I want to highlight one final issue with the definitions above: I've defined inner alignment here only in connection with the limiting behavior of our optimizers. That means a mesa-optimizer that's well-aligned with its base optimizer would still — by the definition above — be free to do dangerous things on the path to correctly optimizing for the base objective.

To take an extreme example, we could have a system that's perfectly aligned to optimize for human happiness, but that only discovers that humans don't want to have their brains surgically extracted from their bodies after it's already done so. Even if the system later corrected its error, grew us new bodies, and ultimately gave us a good end state, we'd still have experienced a very unpleasant transient in the process. Essentially, this definition of alignment says to the mesa-optimizer: it's okay if you break a vase, as long as we know that you'll put it back together again in the long run.

I can understand this definition being controversial. It may be the most extreme possible version of the claim that the ends justify the means. So it could also be worth resolving the alignment problem into "weak" and "strong" versions — where weak alignment would refer to the $t \to \infty$ limit, while strong alignment would refer to transient behavior over, say, the next $N$ optimization steps. A concept of strong alignment could let us prove statements like "this optimizer will have a performance level of at worst $\epsilon$ on our base objective over the next $N$ optimization steps." This seems very desirable.
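An empirical, much weaker stand-in for a statement like that is easy to compute for any given run: just track the worst value the base objective takes over the next N steps. The toy loss and horizon below are arbitrary; actually proving such a bound in advance is the hard part.

```python
def worst_case_over_next_n(L, grad_L, theta, alpha=0.01, n=100):
    """Worst (highest) value of the base objective over the next n optimization
    steps, starting from the current parameters theta."""
    worst = L(theta)
    for _ in range(n):
        theta = theta - alpha * grad_L(theta)
        worst = max(worst, L(theta))
    return worst

L      = lambda th: (th ** 2 - 1.0) ** 2
grad_L = lambda th: 4.0 * th * (th ** 2 - 1.0)

# "Over the next 100 steps from this starting point, the base objective never exceeds this bound."
print(worst_case_over_next_n(L, grad_L, theta=1.5, n=100))
```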

On the other hand, we may want to prepare for the possibility that the terminal states we want will only be accessible through paths that involve transient unpleasantness. Perhaps one really does have to break eggs to make an omelet, and that's just how the universe is. (I don't think this is particularly likely: high-capacity neural networks and policy iteration in RL are both data points that suggest incrementalism is increasingly viable in higher-dimensional problem spaces.)

To summarize, weak alignment, which is what this post is mostly about, would say that "everything will be all right in the end." Strong alignment, which refers to the transient, would say that "everything will be all right in the end, and the journey there will be all right, too." It's not clear which of the two will be easier to prove in which circumstances, so we'll probably need to develop rigorous definitions of both.


Big thanks to Rohin Shah, Jan Leike, Jeremie Harris, and Evan Hubinger for reviewing early drafts of this, suggesting ideas, and pointing out mistakes!

Comments

Planned summary for the Alignment Newsletter:

Consider a neural network like GPT-3 trained by gradient descent on (say) the cross-entropy loss function. This loss function forms the _base objective_ that the process is optimizing for. Gradient descent typically ends up at some local minimum, global minimum, or saddle point of this base objective.

However, if we look at the gradient descent equation, θ = θ - αG, where G is the gradient, we can see that this is effectively minimizing the size of the gradients. We can think of this as the mesa objective: the gradient descent process (with an appropriate learning rate decay schedule) will eventually get G down to zero, its minimum possible value (even though it may not be at the global minimum for the base objective).

The author then proposes defining capability of an optimizer based on how well it decreases its loss function in the limit of infinite training. Meanwhile, given a base optimizer and mesa optimizer, alignment is given by the capability of the base optimizer divided by the capability of the mesa optimizer. (Since the mesa optimizer is the one that actually acts, this is effectively measuring how much progress on the mesa objective also causes progress on the true base objective.)

This has all so far assumed a fixed training setup (such as a fixed dataset and network architecture). Ideally, we would also want to talk about robustness and generalization. For this, the author introduces the notion of a “perturbation” to the training setup, and then defines [capability / alignment] [robustness / generalization] based on whether the optimization stays approximately the same when the training setup is perturbed.

It should be noted that these are all definitions about the behavior of optimizers in the infinite limit. We may also want stronger guarantees that also talk about the behavior on the way to the infinite limit.

Thanks, Rohin!

Please note that I'm currently working on a correction for part of this post — the form of the mesa-objective $L_{\mathrm{true}}$ I'm claiming is in fact wrong, as Charlie correctly alludes to in a sibling comment.

Great post! I liked the clean analysis of the problem, the formalization, and the effort to point out the potential issues with your definitions. Now I'm really excited for the next posts, where I assume that you will study robustness and generalization (based on your definitions) for simple examples of gradient descent. I'm interested in commenting on early drafts if you need feedback!

Some of the fixed points $\theta_\infty$ of this system will coincide with global or local minima of our base objective, the cross-entropy loss $L$ — but not all of them. Some will be saddle points, while others will be local or global maxima. And while we don't consider all these fixed points to be equally performant with respect to our base objective, our gradient descent optimizer does consider them all to be equally performant with respect to its true objective.

This disagreement is the core of the inner alignment problem in this setting: our gradient descent process isn't always optimizing for the quantity we want it to. So what quantity is it optimizing for?

I agree wholeheartedly with this characterization. For me, that's the gist of the inner alignment problem if the objective is the right one (i.e. if outer alignment is solved).

Let's look at a second example. This time we'll compare Optimizer A to a ththird optimizer,

Typo on "ththird".

Definition 1. Let $O$ be a base optimizer acting over $T$ optimization steps, and let $L_t$ represent the value of its base objective at optimization step $t$. Then the capability of $O$ with respect to the base objective $L$ is

At first I wondered why you were taking the sum instead of just the per-step improvement, but after thinking about it, the latter would probably converge to 0 almost all the time, because even with amazing optimization, the loss will stop being improved by a factor linear in $T$ at some point. That might be interesting to put in the post itself.

In our gradient descent example, our mesa-optimizer $O_{\mathrm{mesa}}$ was the gradient descent process, and its mesa-objective was what, at the time, I called the "true objective", $L_{\mathrm{true}}$. But the base optimizer $O_{\mathrm{base}}$ was the human who designed the neural network and ran the gradient descent process on it.

This is not where I thought you were going when I read the intro, but that's a brilliant idea that completely removes the question of whether and why the base optimizer would find a mesa-optimizer to which it can delegate work.

Thanks for the kind words, Adam! I'll follow up over DM about early drafts — I'm interested in getting feedback that's as broad as possible and really appreciate the kind offer here.

Typo is fixed — thanks for pointing it out!

At first I wondered why you were taking the sum instead of just the per-step improvement, but after thinking about it, the latter would probably converge to 0 almost all the time, because even with amazing optimization, the loss will stop being improved by a factor linear in $T$ at some point. That might be interesting to put in the post itself.

Yes, the problem with that definition would indeed be that if your optimizer converges to some limiting loss function value like $L_\infty$, then you'd get a capability of zero for any value of $L_\infty$.

Thanks again!

Interesting post. Not sure if I agree with your interpretation of the "real objective" - might be better served by looking for stable equilibria and just calling them as such.

Don't we already have weak alignment to arbitrary functions using annealing (basically, jump at random, but jump around more/further on average when the loss is higher and lower the jumping rate over time)? The reason we don't add small annealing terms to gradient descent is entirely because we expect them to be worse in the short term (a "strong alignment" question).

Thanks for the comment!

Not sure if I agree with your interpretation of the "real objective" - might be better served by looking for stable equilibria and just calling them as such.

I think this is a reasonable objection. I don't make this very clear in the post, but the "true objective" I've written down in the example indeed isn't unique: like any measure of utility or loss, it's only unique up to affine transformations with positive coefficients. And that could definitely damage the usefulness of these definitions, since it means that alignment factors, for example, aren't uniquely defined either. (I'll be doing a few experiments soon to investigate this, and a few other questions, in a couple of real systems.)

Don't we already have weak alignment to arbitrary functions using annealing (basically, jump at random, but jump around more/further on average when the loss is higher and lower the jumping rate over time)? The reason we don't add small annealing terms to gradient descent is entirely because we expect them to be worse in the short term (a "strong alignment" question).

Interesting question! To try to interpret in light of the definitions I'm proposing: adding annealing changes the true objective (or mesa-objective) of the optimizer, which is no longer solely trying to minimize its gradients — it now has this new annealing term that it's also trying to optimize for. Whether this improves alignment or not depends on the effect annealing has on 1) the long-term performance of the mesa-optimizer on its new (gradient + annealing) objective; and 2) the long-term performance this induces on the base objective.

 

Hope that's somewhat helpful, but please let me know if it's unclear and I can try to unpack things a bit more!