Outer Alignment


Outer Alignment (also known as the reward misspecification problem) is the problem of specifying a reward function which captures human preferences. Outer alignment asks the question: "What will we train our model to do?" Note that this is meant in the narrow technical sense of selecting a reward function; wisely choosing a training target is a separate issue (see the list below).

Overall, outer alignment as a problem is intuitive enough to understand: is the specified loss function aligned with the intended goal of its designers? Implementing this in practice, however, is extremely difficult. Conveying the full “intention” behind a human request is tantamount to conveying the sum of all human values and ethics, which is difficult in part because human intentions are themselves not well understood. Additionally, since most models are designed as goal optimizers, they are susceptible to Goodhart’s Law: we may be unable to foresee the negative consequences of applying intense optimization pressure to a goal that otherwise looked well specified to humans.
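To make Goodhart’s Law concrete, here is a toy sketch (not from the original article; both reward functions are hypothetical illustrations): a proxy reward that agrees with the true objective under light optimization but diverges once the optimizer pushes hard on it.

```python
# A minimal sketch of Goodhart's Law under optimization pressure.
# The "true" objective and a hypothetical proxy reward agree for moderate
# values of the action x, but diverge once the optimizer pushes x hard.
import numpy as np

def true_reward(x):
    # What the designers actually care about: peaks at x = 1, then declines.
    return x - 0.5 * x**2

def proxy_reward(x):
    # The specified reward: monotonically increasing, so it looks aligned
    # on the range the designers checked (small x), but rewards extremes.
    return x

# Hill-climb on the proxy with increasing optimization pressure (more steps).
for steps in [1, 10, 100]:
    x = 0.0
    for _ in range(steps):
        x += 0.1  # each step increases the proxy reward
    print(f"steps={steps:4d}  proxy={proxy_reward(x):6.2f}  true={true_reward(x):6.2f}")
# With light optimization the true reward rises too; with heavy optimization
# the proxy keeps climbing while the true reward collapses.
```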

To solve the outer alignment problem, we would have to make progress on sub-problems such as specification gaming, value learning, and reward shaping/modeling. Some proposed solutions to outer alignment include scalable oversight techniques such as IDA, as well as adversarial oversight techniques such as debate.
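As an illustration of the reward modeling idea, here is a minimal, hypothetical sketch of learning a reward model from pairwise preference comparisons (a Bradley-Terry style model, as used in RLHF-like setups). All data and parameters are invented for the example.

```python
# A minimal sketch of reward modeling from pairwise human preferences
# (Bradley-Terry style). All names and data here are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: each "trajectory" is a feature vector; the (unknown) human
# preference is driven by true_w. We only observe pairwise comparisons.
true_w = np.array([2.0, -1.0, 0.5])
features = rng.normal(size=(200, 3))

def prefer(a, b):
    # Human labeler prefers the trajectory with higher true utility.
    return float(features[a] @ true_w > features[b] @ true_w)

pairs = [(rng.integers(200), rng.integers(200)) for _ in range(500)]
labels = np.array([prefer(a, b) for a, b in pairs])

# Fit a linear reward model w by gradient ascent on the Bradley-Terry
# log-likelihood: P(a preferred over b) = sigmoid(r(a) - r(b)).
w = np.zeros(3)
lr = 0.1
for _ in range(200):
    grad = np.zeros(3)
    for (a, b), y in zip(pairs, labels):
        diff = features[a] - features[b]
        p = 1.0 / (1.0 + np.exp(-(w @ diff)))
        grad += (y - p) * diff
    w += lr * grad / len(pairs)

print("learned reward weights:", np.round(w, 2))
# The learned weights recover the *direction* of true_w (reward models are
# only identified up to scale), which is what downstream RL would optimize.
```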

Outer Alignment vs. Inner Alignment

This is often taken to be separate from the inner alignment problem, which asks: How can we robustly aim our AI optimizers at any objective function at all?

Keep in mind that inner and outer alignment failures can occur together. They are not a strict dichotomy, and even experienced alignment researchers often cannot tell them apart, which indicates that classifying failures with these terms is fuzzy. Ideally, we would not think of inner and outer alignment as a binary split to be tackled separately, but as parts of a more holistic alignment picture that includes the interplay between the two.


  • Inner alignment: Does the model learn to do what you trained it to do?
    • Or does it create a mesa-optimizer?
    • Relevant example: a mesa-optimized goal of taking over the world includes, as a subgoal, performing appropriately during testing to avoid detection
  • Outer alignment: Does your training set actually train the model to do what you think you want?
    • Example: training a model to detect hidden equipment won't do what you want if every training example containing hidden equipment was photographed on a sunny day while all the negative examples were photographed on cloudy days (figuratively, though not literally, a thing that has happened); see the sketch after this list
  • Alignment target selection: This is a separate potential point of failure.
    • If you get inner alignment and outer alignment right, but you selected "make me a lot of money" as your alignment target, you will probably be unhappy with your overall efforts. You failed at wisely selecting an alignment target.
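The "sunny vs. cloudy" outer alignment example above can be made concrete with a toy classifier whose training data contains a spurious correlation. Everything here (features, parameters) is a hypothetical illustration.

```python
# A minimal sketch of the "sunny vs. cloudy" failure: the label is spuriously
# correlated with image brightness in the training set, so the model learns
# brightness rather than the intended feature.
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, correlated):
    # feature 0: the intended signal ("hidden equipment present")
    # feature 1: brightness; correlated with the label only in training
    y = rng.integers(0, 2, size=n)
    centered = y - 0.5
    signal = centered + 0.5 * rng.normal(size=n)       # weak, noisy true signal
    if correlated:
        brightness = centered + 0.1 * rng.normal(size=n)  # near-perfect proxy
    else:
        brightness = rng.normal(size=n)                   # uncorrelated at test time
    return np.column_stack([signal, brightness]), y

def fit_logistic(X, y, steps=500, lr=0.1):
    # Plain logistic regression via gradient ascent on the log-likelihood.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w += lr * X.T @ (y - p) / len(y)
    return w

X_train, y_train = make_data(1000, correlated=True)
X_test, y_test = make_data(1000, correlated=False)
w = fit_logistic(X_train, y_train)

def accuracy(X, y):
    return np.mean(((X @ w) > 0) == y)

print("weights (signal, brightness):", np.round(w, 2))
print("train accuracy:", accuracy(X_train, y_train), "test accuracy:", accuracy(X_test, y_test))
# The model leans on the brightness feature, so accuracy drops sharply once
# the sunny/cloudy correlation disappears.
```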