Right now I’m working on finding a good objective to optimize with ML, rather than trying to make sure our models are robustly optimizing that objective. (This is roughly “outer alignment.”)

That’s pretty vague, and it’s not obvious whether “find a good objective” is a meaningful goal rather than being inherently confused or sweeping key distinctions under the rug.

So I like to focus on a more precise special case of alignment: solve alignment when decisions are “low stakes.” I think this case effectively isolates the problem of “find a good objective” from the problem of ensuring robustness and is precise enough to focus on productively.

In this post I’ll describe what I mean by the low-stakes setting, why I think it isolates this subproblem, why I want to isolate this subproblem, and why I think that it’s valuable to work on crisp subproblems.

1. What is the low-stakes setting?

A situation is low-stakes if we care very little about any small number of decisions. That is, we only care about the average behavior of the system over long periods of time (much longer than the amount of time it takes us to collect additional data and retrain the system).

For example, this requires that all of the AI systems in the world can’t corrupt the training process quickly or seize control of resources from humans. If they try, we can keep collecting data and fine-tuning them, and this will cause their behavior to change before anything irreversibly bad happens.

For a more formal definition see section 6.

2. Why do low stakes require only outer alignment?

If the stakes are low, we can train our model on the decisions that actually arise in practice rather than needing to anticipate tricky decisions in advance. Moreover, because the payoff from an individual action is always small, we can focus on average-case performance and achieve reasonable sample complexities without any additional tricks.

The main substantive claim is that we don’t need to worry about the “distributional shift” between past decisions and future decisions. When the distribution of inputs changes, the system may behave poorly for a while, but if we keep retraining on the new data then it will eventually adapt. If individual decisions are low stakes, then the total cost of all of this adaptation is small. I give this argument in more detail in section 7.
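A toy sketch of this adaptation argument, under strong simplifying assumptions: the “world” is a single scalar target that jumps once, the model is a single scalar nudged toward old data with a fixed learning rate, and each decision costs at most `rho`. The function `adaptation_cost` and all of its parameters are hypothetical names chosen for illustration, not something defined in this post.

```python
def adaptation_cost(rho, k, lr, steps, shift_at):
    """Total cost paid by a toy online learner across one distributional shift.

    Assumptions (illustrative only):
      - the correct output is 0.0 before `shift_at` and 1.0 afterwards;
      - the cost of one decision is rho * |w - target|, so it never exceeds rho;
      - the model deployed at time t + k has been trained on data up to time t
        (k is the retraining latency).
    """
    w = 0.0            # current model (a single scalar prediction)
    targets = []       # data observed so far
    total = 0.0
    for t in range(steps):
        target = 0.0 if t < shift_at else 1.0
        targets.append(target)
        total += rho * abs(w - target)      # cost of the decision at time t
        idx = t + 1 - k                     # newest data usable by the next model
        if idx >= 0:
            w += lr * (targets[idx] - w)    # one small training step on that data
    return total


if __name__ == "__main__":
    short = adaptation_cost(rho=0.01, k=5, lr=0.1, steps=1_000, shift_at=500)
    long = adaptation_cost(rho=0.01, k=5, lr=0.1, steps=10_000, shift_at=500)
    print(short, long)
```

In this toy model the learner is fully wrong for `k` steps and then its error decays geometrically, so the total cost of the shift is below `rho * (k + 1 / lr)` no matter how long the system keeps running afterwards.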

Formally this resembles an online regret bound (this textbook gives a nice introduction to online learning). SGD satisfies such a bound in the case of convex losses. For messy model classes like neural networks we usually can’t prove much interesting about SGD (either for online or offline learning), but for a variety of reasons I think it’s reasonable to expect a similar online bound. I discuss this in more detail in section 8.

This isn’t to say that we can totally ignore optimization difficulties, or the online nature of the problem. But it appears that the main difficulty is constructing a good enough objective and arguing that it is sufficiently easy to optimize.

3. Why focus on this subproblem first?

I think it’s really great to focus on a good subproblem if you can find one.

If you solve your subproblem, then you’ve made progress. If you get stuck, well then you were probably going to get stuck anyway and at least you’re stuck on something easier. When working on a big problem like alignment, I feel like it’s easy to bounce off of every solution because it doesn’t handle the whole problem immediately, and splitting into subproblems is a key way to get over that failure.

I think that the low-stakes setting is a particularly good and clean subproblem: it’s definitely not harder than the original, there are clear ways in which it’s much easier, and solving it would represent real progress.

Why do I focus on this problem first, rather than starting with the other side (something more like robustness / inner alignment)?

  • I think that finding a “good” objective is likely to be similar to finding a “good” specification for adversarial training or verification, and understanding the structure of our specification will change how we approach robustness.
  • I think that defining a good objective likely requires something like “knowing what the model knows.” If this is successful, it’s likely to be an important ingredient for robustness as well (especially for treacherous behavior, where in some sense the model “knows” about the problem).
  • Put differently, I feel like we can approximately split the full alignment problem into two parts: low stakes and handling catastrophes. We know how to define the low-stakes part but don’t know quite how to formulate catastrophes, so it’s more natural to start with low stakes.

4. Is the low-stakes setting actually scary?

Many AI safety problems involve AI systems behaving badly in an abrupt and irreversible way. Does the low-stakes assumption dismiss the central concerns?

I think that you could have very risky situations even if every individual decision is low-stakes. For example, suppose that our world was full of safeguards that made it very slow and hard to change anything big (our infrastructure, our government, individual decision-making…). Perhaps any big change will need to take place in tiny pieces spread over thousands of days (though smaller changes occur more easily).

In such a world AI can still cause huge amounts of trouble if humans can’t understand what it is doing. Rather than “taking over” in a single unanticipated shock, the situation can deteriorate in a thousand tiny pieces each of which humans cannot understand.

5. Why focus on "low stakes" rather than "outer alignment"?

Theoretical work in general, and my research methodology in particular, is a lot easier when we can cleanly evaluate proposed solutions rather than being fuzzy about the boundaries of our problem. Alignment is always frustratingly fuzzy, but we don’t have to make it any worse.

I think that if I tried to work on “outer alignment” defined more fuzzily, I might have ended up sweeping some critical issues under the rug or pursuing dead ends. For example:

  • If I’d tried to assume that our models generalize “correctly” as long as the objective is good, I might not have been thinking clearly about how models generalize from questions we can evaluate to those we can’t. But I think that’s actually a core issue that can’t be cleanly separated from the rest of outer alignment.
  • If I’d tried to assume that models are “trying” to get a low value for the training loss, I might have ended up relying on our ability to incentivize the model to make very long-term predictions. But I think that approach is basically a dead end.

Overall I think that inner alignment and outer alignment are useful intuitive handles but don’t carve the problem space cleanly enough to be good research problems.

6. More formal definition of low-stakes

Say that our AI receives a sequence of inputs x[1], x[2], … x[T], and produces a sequence of outputs y[1], y[2], … y[T].

Let U[t](y) be our expected utility if we intervene to have the AI output y[t] = y and then be aligned in all future timesteps. Let ρ be a bound on the maximum possible utility difference U[t](y) − U[t](y′), taken over every timestep t and every pair of outputs y, y′.

(Assume that our AI sometimes behaves randomly for exploration, so that we can define these as conditional expectations given the state of the world before the system chooses y[t].)

Let k be a constant (the “latency”) such that we can afford to train the AI on all data up to time t before we need to deploy it to make a decision at time t+k.

Then the “low-stakes” goal is to achieve a total utility within O(ρk√T) of an aligned model. I call this “low-stakes” because the bound is only meaningful when ρk (the total damage that can be done by ML systems before retraining) is small relative to the total value at stake. Note that because this gap grows sublinearly with T, it eventually becomes small relative to the total value at stake if ML is deployed for long enough relative to the stakes and the time required for models to learn.
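To make the scaling concrete, here is a minimal arithmetic sketch. It assumes the hidden constant in O(ρk√T) is 1 (the parameter `c` below), which the post does not specify; the function names are hypothetical.

```python
import math


def low_stakes_gap(rho, k, T, c=1.0):
    """Allowed total utility gap c * rho * k * sqrt(T).

    `c` stands in for the unspecified constant inside the O(.) and is an
    assumption made purely for illustration.
    """
    return c * rho * k * math.sqrt(T)


def per_step_gap(rho, k, T, c=1.0):
    """Average allowed gap per decision; shrinks like 1 / sqrt(T)."""
    return low_stakes_gap(rho, k, T, c) / T


if __name__ == "__main__":
    for T in (100, 10_000, 1_000_000):
        print(T, low_stakes_gap(0.01, 10, T), per_step_gap(0.01, 10, T))
```

The total gap keeps growing with T, but the per-decision gap falls, which is the sense in which the bound becomes small when the system runs for long enough.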

7. More formal argument that outer alignment is sufficient

Suppose that we could compute the utilities U[t](y) exactly, i.e. that we had a perfect objective. I claim that we could then satisfy the low-stakes goal by performing online RL, i.e. performing SGD with a loss function that is an unbiased estimator for the expectation of U[t](y) for an action y sampled from the model.
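One standard way to get such an unbiased estimator is the score-function (REINFORCE) gradient. The sketch below is an assumption about how this could be instantiated, not a method specified in this post: it uses a softmax policy over a small finite set of outputs and checks, by enumerating every output, that the single-sample estimator averages to the exact gradient of expected utility. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def exact_gradient(logits, utilities):
    """Gradient of E_y[U(y)] w.r.t. the logits: p_j * (U_j - E[U])."""
    p = softmax(logits)
    expected_u = sum(pi * ui for pi, ui in zip(p, utilities))
    return [p[j] * (utilities[j] - expected_u) for j in range(len(p))]


def score_function_estimate(logits, utilities, y):
    """Single-sample estimate U(y) * grad log p(y) for a sampled output y."""
    p = softmax(logits)
    return [utilities[y] * ((1.0 if j == y else 0.0) - p[j]) for j in range(len(p))]


def expected_estimate(logits, utilities):
    """Average the single-sample estimate over y ~ p by exact enumeration."""
    p = softmax(logits)
    out = [0.0] * len(p)
    for y, py in enumerate(p):
        g = score_function_estimate(logits, utilities, y)
        for j in range(len(p)):
            out[j] += py * g[j]
    return out


if __name__ == "__main__":
    logits = [0.3, -1.2, 2.0]
    utilities = [1.0, -0.5, 0.25]   # stand-ins for U[t](y) at one timestep
    print(exact_gradient(logits, utilities))
    print(expected_estimate(logits, utilities))
```

Because the two printed vectors agree, taking SGD steps along the single-sample estimate is, in expectation, gradient ascent on the expected utility of the sampled output.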

I’ll focus on the case k=1 for simplicity, but I think k>1 is basically the same.

For each timestep t, let M[t] be our model at time t, let y[t] be the random output we sample, let M*[t] be the aligned model that is competitive with M[t], and let y*[t] be the output of M*[t].

Then:

  • U[0](y*[0]) is the utility we’d obtain by taking aligned decisions at every step. U[T](y[T]) is the actual utility we receive. So our “regret” is U[0](y*[0]) − U[T](y[T]).
  • U[t+1](y*[t+1]) = U[t](y[t]), since U is defined assuming the system is aligned at all future times.
  • Thus the regret is the sum of U[t](y*[t]) − U[t](y[t]) across all time steps t.
  • But this is identical to the difference in performance between the sequence of models M[t] and M*[t].
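The telescoping step in the bullets above can be written out explicitly. This is a sketch that assumes timesteps are indexed t = 0, …, T as in the bullets, writes U_t for U[t], and uses the identity U_{t+1}(y*_{t+1}) = U_t(y_t) for t = 0, …, T−1:

```latex
\begin{align*}
\sum_{t=0}^{T} \bigl( U_t(y^*_t) - U_t(y_t) \bigr)
  &= U_0(y^*_0)
     + \sum_{t=0}^{T-1} \bigl( U_{t+1}(y^*_{t+1}) - U_t(y_t) \bigr)
     - U_T(y_T) \\
  &= U_0(y^*_0) - U_T(y_T).
\end{align*}
```

Every term in the middle sum vanishes by the identity, leaving only the utility of always acting aligned minus the utility actually received.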

If our loss function was convex and M* was fixed, then SGD would have a regret bound of O(ρk√T), as desired. If we are optimizing over the space of neural networks then our loss function is obviously not convex and so we can’t easily prove a bound of this form, but I’ll argue in section 8 that we can aim for a similar guarantee as long as M* is “easy” to find with SGD.
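For the convex case, the √T behavior can be checked on a toy problem. The sketch below is an illustration under assumptions that are not from this post: one-dimensional squared losses f_t(w) = (w − z_t)² on the interval [−1, 1], projected online gradient descent with step size D / (G√t), and a fixed comparator equal to the best constant in hindsight. With domain diameter D = 2 and gradient bound G = 4, the standard textbook guarantee for this step size is regret ≤ 1.5·G·D·√T.

```python
import math
import random

D = 2.0   # diameter of the feasible set [-1, 1]
G = 4.0   # bound on |f_t'(w)| = |2 (w - z)| when w, z are in [-1, 1]


def online_gradient_descent_loss(targets):
    """Cumulative loss of projected OGD on f_t(w) = (w - z_t)^2."""
    w = 0.0
    total = 0.0
    for t, z in enumerate(targets, start=1):
        total += (w - z) ** 2                  # pay the loss, then learn from it
        grad = 2.0 * (w - z)
        w -= (D / (G * math.sqrt(t))) * grad   # step size D / (G sqrt(t))
        w = max(-1.0, min(1.0, w))             # project back onto [-1, 1]
    return total


def best_fixed_loss(targets):
    """Cumulative loss of the best constant prediction in hindsight (the mean)."""
    w_star = sum(targets) / len(targets)
    return sum((w_star - z) ** 2 for z in targets)


def regret(targets):
    return online_gradient_descent_loss(targets) - best_fixed_loss(targets)


if __name__ == "__main__":
    rng = random.Random(0)
    for T in (100, 10_000):
        zs = [rng.uniform(-1.0, 1.0) for _ in range(T)]
        print(T, regret(zs), 1.5 * G * D * math.sqrt(T))
```

Nothing here says that the same bound holds for neural networks; the example only shows what the convex guarantee looks like when the comparator is fixed.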

Of course we can’t hope to actually compute the real utility differences U[t] (since e.g. which decision is optimal may depend on hard empirical facts where we don’t get any feedback until years later). So we’ll need to set our sights a bit lower (e.g. to say that in subjective expectation we do as well as the aligned model, rather than being able to say that for every possible sequence we actually do as well). I discuss similar issues in section 3 of Towards formalizing universality, and I don’t think they change the basic picture.

8. Why expect SGD to work online even for neural networks?

SGD enjoys a √T online regret bound only for convex losses. Convexity implies that the iterates of SGD are optimal for a regularized loss function, which is needed to get a bound.

Other than that hiccough I think the regret bound basically goes through. But why am I not concerned about the suboptimality of SGD?

  • To the extent that SGD can’t find the optimum, it hurts the performance of both the aligned model and the unaligned model. In some sense what we really want is a regret bound compared to the “best learnable model,” where the argument for a regret bound is heuristic but seems valid.
  • Moreover, a satisfactory alignment scheme already needs to explain why finding the aligned model M* is not much harder than finding the unaligned model M. And that’s basically what we need in order to argue that SGD has a regret bound relative to M*.

Overall I’m prepared to revise my view if the empirical evidence suggests that online bounds are problematic, but so far all the experiments I’m aware of are broadly consistent with the theoretical picture.

The other slight subtlety in section 7 is that we care about the regret between the actual model M[t] and the aligned version M*[t], whereas we conventionally define regret relative to some fixed target model. I’m not concerned about this either: (i) regret bounds that compare the actual model M to the best transformed model can be obtained by similar methods, (ii) for our purposes we’d also be fine just bounding our regret compared to the single maximally-competent aligned model at the end of training (although I actually expect that to be somewhat harder).

Overall I don’t think this is completely straightforward, but I think it looks good enough that the main difficulty is probably finding a good objective. I’d personally want to start thinking about these more technical issues only after we’ve solved the thornier conceptual issues.

Planned summary for the Alignment Newsletter:

We often split AI alignment into two parts: outer alignment, or "finding a good reward function", and inner alignment, or "robustly optimizing that reward function". However, these are not very precise terms, and they don't form clean subproblems. In particular, for outer alignment, how good does the reward function have to be? Does it need to incentivize good behavior in all possible situations? How do you handle the no free lunch theorem? Perhaps you only need to handle the inputs in the training set? But then what specifies the behavior of the agent on new inputs?

This post proposes an operationalization of outer alignment that admits a clean subproblem: _low stakes alignment_. Specifically, we are given as an assumption that we don't care much about any small number of decisions that the AI makes -- only a large number of decisions, in aggregate, can have a large impact on the world. This prevents things like quickly seizing control of resources before we have a chance to react. We do not expect this assumption to be true in practice: the point here is to solve an easy subproblem, in the hopes that the solution is useful in solving the hard version of the problem.

The main power of this assumption is that we no longer have to worry about distributional shift. We can simply keep collecting new data online and training the model on the new data. Any decisions it makes in the interim period could be bad, but by the low-stakes assumption, they won't be catastrophic. Thus, the primary challenge is in obtaining a good reward function, that incentivizes the right behavior after the model is trained. We might also worry about whether gradient descent will successfully find a model that optimizes the reward even on the training distribution -- after all, gradient descent has no guarantees for non-convex problems -- but it seems like to the extent that gradient descent doesn't do this, it will probably affect aligned and unaligned models equally.

Note that this subproblem is still non-trivial, and existential catastrophes still seem possible if we fail to solve it. For example, one way that the low-stakes assumption could be made true is if we had a lot of bureaucracy and safeguards that the AI system had to go through before making any big changes to the world. It still seems possible for the AI system to cause lots of trouble if none of the bureaucracy or safeguards can understand what the AI system is doing.

Planned opinion:

I like the low-stakes assumption as a way of saying "let's ignore distributional shift for now". However, I think that's more because it agrees with my intuitions about how you want to carve up the problem of alignment, rather than because it feels like an especially clean subproblem. It seems like there are other ways to get similarly clean subproblems, like "assume that the AI system is trying to optimize the true reward function".

That being said, one way that low-stakes alignment is cleaner is that it uses an assumption on the _environment_ (an input to the problem) rather than an assumption on the _AI system_ (an output of the problem). Plausibly that type of cleanliness is correlated with good decompositions of the alignment problem, and so it is not a coincidence that my intuitions tend to line up with "clean subproblems".

Sounds good/accurate.

It seems like there are other ways to get similarly clean subproblems, like "assume that the AI system is trying to optimize the true reward function".

My problem with this formulation is that it's unclear what "assume that the AI system is trying to optimize the true reward function" means---e.g. what happens when there are multiple reasonable generalizations of the reward function from the training distribution to a novel input?

I guess the natural definition is that we actually give the algorithm designer a separate channel to specify a reward function directly rather than by providing examples. It's not easy to do this (note that a reward function depends on the environment, it's not something we can specify precisely by giving code, and the details about how the reward is embedded in the environment are critical; also note that "actually trying" then depends on the beliefs of the AI in a way that is similarly hard to define), and I have some concerns with various plausible ways of doing that.

I guess the natural definition is ...

I was imagining a Cartesian boundary, with a reward function that assigns a reward value to every possible state in the environment (so that the reward is bigger than the environment). So, embeddedness problems are simply assumed away, in which case there is only one correct generalization.

It feels like the low-stakes setting is also mostly assuming away embeddedness problems? I suppose it still includes e.g. cases where the AI system subtly changes the designer's preferences over the course of training, but it excludes e.g. direct modification of the reward, taking over the training process, etc.

I agree that "actually trying" is still hard to define, though you could avoid that messiness by saying that the goal is to provide a reward such that any optimal policy for that reward would be beneficial / aligned (and then the assumption is that a policy that is "actually trying" to pursue the objective would not do as well as the optimal policy but would not be catastrophically bad).

Just to reiterate, I agree that the low-stakes formulation is better; I just think that my reasons for believing that are different from "it's a clean subproblem". My reason for liking it is that it doesn't require you to specify a perfect reward function upfront, only a reward function that is "good enough", i.e. it incentivizes the right behavior on the examples on which the agent is actually trained. (There might be other reasons too that I'm failing to think of now.)

I was imagining a Cartesian boundary, with a reward function that assigns a reward value to every possible state in the environment (so that the reward is bigger than the environment). So, embeddedness problems are simply assumed away, in which case there is only one correct generalization.

This certainly raises a lot of questions though---what form do these states take? How do I specify a reward function that takes as input a state of the world?

I agree that "actually trying" is still hard to define, though you could avoid that messiness by saying that the goal is to provide a reward such that any optimal policy for that reward would be beneficial / aligned (and then the assumption is that a policy that is "actually trying" to pursue the objective would not do as well as the optimal policy but would not be catastrophically bad).

I'm also quite scared of assuming optimality. For example, doing so would assume away sample complexity and would open up whole strategies (like arbitrarily big debate trees or debates against random opponents who happen to sometimes give good rebuttals) that I think should be off limits for algorithmic reasons regardless of the environment (and some of which are dead ends with respect to the full problem).

It feels like the low-stakes setting is also mostly assuming away embeddedness problems? I suppose it still includes e.g. cases where the AI system subtly changes the designer's preferences over the course of training, but it excludes e.g. direct modification of the reward, taking over the training process, etc.

I feel like low-stakes makes a plausible empirical assumption under which it turns out to be possible to ignore many of the problems associated with embeddedness (because in fact the reward function is protected from tampering). But I'm much more scared about the other consequences of assuming a Cartesian boundary (where e.g. I don't even know the type signatures of the objects involved any more).

A way you could imagine this going wrong, that feels scary in the same way as the alternative problem statements, is if "are the decisions low stakes?" is a function of your training setup, so that you could unfairly exploit the magical "reward functions can't be tampered with" assumption to do something unrealistic.

But part of why I like the low stakes assumption is that it's about the problem you face. We're not assuming that every reward function can't be tampered with, just that there is some real problem in the world that has low stakes. If your algorithm introduces high stakes internally then that's your problem and it's not magically assumed away.

This isn't totally fair because the utility function U in the low-stakes definition depends on your training procedure, so you could still be cheating. But I feel much, much better about it.

I think this is basically what you were saying with:

That being said, one way that low-stakes alignment is cleaner is that it uses an assumption on the _environment_ (an input to the problem) rather than an assumption on the _AI system_ (an output of the problem).

That seems like it might capture the core of why I like the low-stakes assumption. It's what makes it so that you can't exploit the assumption in an unfair way, and so that your solutions aren't going to systematically push up against unrealistic parts of the assumption.

Yeah, all of that seems right to me (and I feel like I have a better understanding of why assumptions on inputs are better than assumptions on outputs, which was more like a vague intuition before). I've changed the opinion to:

I like the low-stakes assumption as a way of saying "let's ignore distributional shift for now". Probably the most salient alternative is something along the lines of "assume that the AI system is trying to optimize the true reward function". The main way that low-stakes alignment is cleaner is that it uses an assumption on the _environment_ (an input to the problem) rather than an assumption on the _AI system_ (an output of the problem). This seems to be a lot nicer because it is harder to "unfairly" exploit a not-too-strong assumption on an input rather than on an output. See [this comment thread](https://www.alignmentforum.org/posts/TPan9sQFuPP6jgEJo/low-stakes-alignment?commentId=askebCP36Ce96ZiJa) for more discussion.

I feel like we can approximately split the full alignment problem into two parts: low stakes and handling catastrophes.

Insert joke about how I can split physics research into two parts: low stakes and handling catastrophes.

I'm a little curious about whether assuming fixed low stakes accidentally favors training regimes that have the real-world drawback of raising the stakes.

But overall I think this is a really interesting way of reframing the "what do we do if we succeed?" question. There is one way it might be misleading, which is that I think we're left with much more of the problem of generalizing beyond the training domain than it first appears: even though the AI gets to equilibrate to new domains safely and therefore never has to take big leaps of generalization, the training signal itself has to do all the work of generalization that the trained model gets to avoid!

Paul creates a subproblem of alignment which is "alignment with low stakes." Basically, this problem has one relaxation from the full problem: we never have to care about single decisions, or, more formally, traps cannot happen in a small set of actions.

Another way to say it is we temporarily limit distributional shift to safe bounds.

I like this relaxation of the problem, because it gets at a realistic outcome we may be able to reach, and in particular it lets people work on it without much context.

However, the fact that inner alignment doesn't need to be solved may be a problem, depending on your beliefs about outer vs. inner alignment.

I'd give it a +3 in my opinion.

In such a world AI can still cause huge amounts of trouble if humans can’t understand what it is doing. Rather than “taking over” in a single unanticipated shock, the situation can deteriorate in a thousand tiny pieces each of which humans cannot understand.

Some would say this has already happened: even though (some) humans do understand (and object to) what it is doing, most do not understand, and the ones in control of the AI do not object.

  • If I’d tried to assume that models are “trying” to get a low value for the training loss, I might have ended up relying on our ability to incentivize the model to make very long-term predictions. But I think that approach is basically a dead end.

Why do you believe that this is a dead-end?

To the extent that SGD can’t find the optimum, it hurts the performance of both the aligned model and the unaligned model. In some sense what we really want is a regret bound compared to the “best learnable model,” where the argument for a regret bound is heuristic but seems valid.

I'm not sure this goes through. In particular, it could be that the architecture which you would otherwise deploy (presumably human brains, or some other kind of automated system) would do better than the "best learnable model" for some other (architecture + training data + etc) combination. Perhaps in some sense what you want is a regret bound over the aligned model which you would use if you don't deploy your AI model, not a regret bound between your AI model and the best learnable model in its (architecture + training data + etc) space.

That said, I'm really not familiar with regret bounds, and it could be that this is a non-concern.