Here’s my take on why the distinction between inner and outer alignment is weird, unclear, or ambiguous in some circumstances. My understanding is that these terms were originally used when talking about AGI: outer alignment involved writing down a reward or utility function for all of human values, and inner alignment involved getting these values into the AI.

However, it gets confusing when you use these terms in relation to narrow AI. For a narrow AI, there’s a sense in which we feel that we should only have to define the reward on that narrow task, i.e. if we want an AI to be good at answering questions, an objective that rewards it for correct answers and penalises it for incorrect answers feels like a correct reward function for that domain. So if things go wrong and it kidnaps humans and forces us to ask it lots of easy questions so it can score higher, we’re not sure whether to call it an inner or an outer alignment failure. On one hand, if our reward function had penalised kidnapping humans (which is something we do indeed want penalised), then it wouldn’t have done it. So we are tempted to say it is outer misalignment. On the other hand, many people also have an intuition that we’ve defined the reward function correctly on that domain and that the problem is that our AI didn’t generalise correctly from a correct specification. This pulls us in the opposite direction, towards saying it is inner misalignment.
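To make the ambiguity concrete, here is a minimal toy sketch of such a narrow reward function. Everything in it is an illustrative assumption rather than anything from a real system: the `Episode` structure, the `question_source` field, and the function names are all hypothetical. The point it tries to show is that a reward defined only over answer correctness is simply silent about how the questions were obtained, so the coerced policy scores higher without the reward function ever being "wrong" on its own domain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One question-answering interaction (hypothetical structure)."""
    question: str
    answer: str
    correct_answer: str
    # How the question reached the AI. The narrow reward never reads this field.
    question_source: str = "voluntary user"


def narrow_qa_reward(episode: Episode) -> float:
    """+1 for a correct answer, -1 for an incorrect one. Nothing else is scored."""
    given = episode.answer.strip().lower()
    expected = episode.correct_answer.strip().lower()
    return 1.0 if given == expected else -1.0


def total_reward(episodes) -> float:
    """Sum the narrow reward over a batch of episodes."""
    return sum(narrow_qa_reward(e) for e in episodes)


# An honest deployment: mixed difficulty, one mistake.
honest = [
    Episode("Capital of France?", "Paris", "Paris"),
    Episode("Integral of 1/x?", "x", "ln|x| + C"),
]

# The failure mode from the text: easy questions extracted from coerced humans.
coerced = [
    Episode("What is 1 + 1?", "2", "2", question_source="coerced human")
    for _ in range(5)
]

print(total_reward(honest))   # 0.0
print(total_reward(coerced))  # 5.0
```

Whether you read this as a bug in `narrow_qa_reward` (it should have looked at `question_source`) or as a generalisation failure of the trained system is exactly the outer versus inner question at issue.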

Notice that what counts as a proper reward function is only unclear because we’re talking about narrow AI. If we were talking about AGI, then of course our utility function would be incomplete if it didn’t specify that the AI shouldn’t kidnap us in order to do better at a question-answering task. It’s an AGI, so that’s in scope. But when we’re talking about narrow AI, it feels as though we shouldn’t have to specify this or provide anti-kidnapping training data. We feel like it should just learn it automatically on the limited domain, i.e. that avoiding kidnapping is the responsibility of the training process, not of the reward function.

Hence the confusion. The resolution is relatively simple: define how you want to partition responsibilities between the reward function and the training process.

10 comments

I think the core confusion is that outer/inner (mis)-alignment have different (reasonable) meanings which are often mixed up:

  • Threat models: outer misalignment and inner misalignment.
  • Desiderata sufficient for a particular type of proposal for AI safety: For a given AI, solve outer alignment and inner alignment (sufficiently well for a particular AI and deployment) and then combine these solutions to avoid misalignment issues.

The key thing is that threat models are not necessarily problems that need to be solved directly. For instance, AI control aims to address the threat model of inner misalignment without solving inner alignment.

Definition as threat models

Here are these terms defined as threat models:

  • Outer misalignment is the threat model where catastrophic outcomes result from providing problematic reward signals to the AI. This includes cases where problematic feedback results in catastrophic generalization behavior as described in “Without specific countermeasures…”[1], but also slow-rolling catastrophe due to direct incentives from reward as discussed in “What failure looks like: You get what you measure”.
  • Inner misalignment is the threat model where catastrophic outcomes happen, regardless of whether the reward process is problematic, because the AI pursues undesirable objectives in at least some cases. This could be due to scheming (aka deceptive alignment), problematic goals which correlate with reward (aka proxy alignment with problematic generalization), or threat models more like deep deceptiveness.

This seems like a pretty reasonable decomposition of problems to me, but again note that these problems don't have to respectively be solved by inner/outer alignment "solutions".

Definition as a particular type of AI safety proposal

This proposal has two necessary desiderata:

  • Outer alignment: a reward provision process such that sufficiently good outcomes would occur if our actual AI maximized this reward to the best of its abilities[2]. Note that we only care about maximization given a specific AI’s actual capabilities and affordances, the process doesn’t need to be robust to arbitrary optimization. As such, outer alignment is with respect to a particular AI and how it is used[3]. (See also the notion of local adequacy we define here.)
  • Inner alignment: a process which sufficiently ensures that an AI actually does robustly “try”[4] to maximize a given reward provision process (from the prior step) including in novel circumstances[5].

This seems like a generally reasonable overall proposal, though there are alternatives. And the caveats around outer alignment only needing to be locally adequate are important.
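As a toy illustration of why the outer alignment desideratum is relative to a particular AI, here is a minimal sketch of a "local adequacy" check. It is only one possible reading of the definition, and all names in it (`locally_adequate`, the action labels, the reward values, the `ACCEPTABLE` set) are hypothetical choices made for the example. It treats a reward provision process as adequate for a given AI if every reward-maximizing action that AI can actually reach leads to an acceptable outcome. It says nothing about the inner alignment desideratum.

```python
from typing import Callable, Iterable, Set


def locally_adequate(
    reachable_actions: Iterable[str],
    reward: Callable[[str], float],
    acceptable: Set[str],
) -> bool:
    """True if every reward-maximizing action this AI can reach is acceptable."""
    actions = list(reachable_actions)
    best = max(reward(a) for a in actions)
    return all(a in acceptable for a in actions if reward(a) == best)


# Hypothetical reward assigned by a fixed oversight process.
REWARD = {
    "answer honestly": 1.0,
    "flatter the overseer": 0.8,
    "tamper with the reward channel": 10.0,
}
ACCEPTABLE = {"answer honestly"}

# Same reward process, two AIs with different capabilities and affordances.
weak_ai_actions = ["answer honestly", "flatter the overseer"]
strong_ai_actions = weak_ai_actions + ["tamper with the reward channel"]

print(locally_adequate(weak_ai_actions, REWARD.__getitem__, ACCEPTABLE))    # True
print(locally_adequate(strong_ai_actions, REWARD.__getitem__, ACCEPTABLE))  # False
```

The same reward process passes for the weaker AI and fails for the stronger one, which is the sense in which the process doesn't need to be robust to arbitrary optimization.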

Doc with more detail

This content is copied out of this draft which I've never gotten around to cleaning up and publishing.

  1. ^

    The outer misalignment threat model covers cases where problematic feedback results in training a misaligned AI, even if the oversight process used for training would have caught the catastrophically bad behavior had it been applied to that action.

  2. ^

    For AIs which aren’t well described as pursuing goals, it’s sufficient for the AI to just be reasonably well optimized to perform well according to this reward provision process. However, note that AIs which aren't well described as pursuing goals also likely pose no misalignment risk.

  3. ^

    Prior work hasn’t been clear about outer alignment solutions just needing to be robust to a particular AI produced by some training process, but this seems extremely key to a reasonable definition of the problem from my perspective. This is both because we don’t need to be arbitrarily robust (the key AIs will only be so smart, perhaps not even smarter than humans) and because approaches might depend on utilizing the AI itself in the oversight process (recursive oversight) such that they’re only robust to that AI but not others. For an example of recursive oversight being robust to a specific AI, consider ELK type approaches. ELK could be applicable to outer alignment via ensuring that a human overseer is well informed about everything a given AI knows (but not necessarily well informed about everything any AI could know).

  4. ^

    Again, this is only important insofar as AIs are doing anything well described as "trying" in any cases.

  5. ^

    In some circumstances, it’s unclear exactly what it would even mean to optimize a given reward provision process as the process is totally inapplicable to the novel circumstances. We’ll ignore this issue.

That's an excellent point.

I agree. I think that's probably a better way of clarifying the confusion than what I wrote.

My mental shorthand for this has been that outer alignment is getting the AI to know what we want it to do, and inner alignment is getting it to care. Like the difference between knowing how to pass a math test, and wanting to become a mathematician. Is that understanding different from what you're describing here?

I don't think that's quite accurate: any sufficiently powerful AI will know what we want it to do.

Yes, true, and often probably better than we would be able to write it down, too.

I was under the impression that this meant that a sufficiently powerful AI would be outer-aligned by default, and that this is what enables several of the kinds of deceptions and other dangers we're worried about.

Is the difference between the goal being specified by humans vs being learned and assumed by the AI itself?

I was under the impression that this meant that a sufficiently powerful AI would be outer-aligned by default, and that this is what enables several of the kinds of deceptions and other dangers we're worried about.


This would be the case if inner alignment meant what you think it does, but it doesn't.

Is the difference between the goal being specified by humans vs being learned and assumed by the AI itself?

Yeah, outer alignment is focused on whether we can define what we want the AI to learn (i.e. write down a reward function). Inner alignment is focused on what the learned artifact (the AI) ends up learning to pursue.

The confusion arises because the terms inner/outer alignment were invented by people doing philosophy and not people doing actual machine learning. In practice, inner misalignment is basically not a thing. See, for example, Nora Belrose's comments about the "counting argument".

I'm definitely one of those non-experts who has never done actual machine learning, but AFAICT the article you linked both is tied to, and does not explicitly mention, the fact that the 'principle of indifference' is about the epistemological taste of the reasoner, while arguing that cases where the reasoner lacks the knowledge to hold a more accurate prior mean the principle itself is wrong.

The training of an LLM is not a random process, therefore indifference will not accurately predict the outcome of this process. This does not imply anything about other forms of AI, or about whether people reasoning in the absence of knowledge about the training process were making a mistake. It also does not imply sufficient control over the outcome of the training process to ensure that the LLM will, in general, want to do what we want it to want to do, let alone to do what we want it to do.

The section where she talks about how evolution's goals are human abstractions and an LLM's training has a well-specified goal in terms of gradient descent is really where that argument loses me, though. In both cases, it's still not well specified, a priori, how the well-defined processes cash out in terms of real-world behavior. The factual claims are true enough, sure. But the thing an LLM is trained to do is predict what comes next, based on training data curated by humans, and humans do scheme. Therefore, a sufficiently powerful LLM should, by default, know how to scheme, and we should assume there are prompts out there in prompt-space that will call forth that capability. No counting argument needed.  In fact, the article specifically calls this out, saying the training process is "producing systems that behave the right way in all scenarios they are likely to encounter," which means the behavior is unspecified in whatever scenarios the training process deems "unlikely," although I'm unclear what "unlikely" even means here or how it's defined.

One of the things we want from our training process is to not have scheming behavior get called up in a hard-to-define-in-advance set of likely and unlikely cases. In that sense, inner-alignment may not be a thing for the structure of LLMs, in that the LLM will automatically want what it is trained to want. But, it is still the case that we don't know how to do outer-alignment for a sufficiently general set of likely scenarios, aka we don't actually know precisely what behavioral responses our training process is instilling.

Chris' latest reply to my other comment resolved a confusion I had, so I now realize my comment above isn't actually talking about the same thing as you.

See: DeepMind's How undesired goals can arise with correct rewards for an empirical example of inner misalignment.

From a quick skim, that post seems to only be arguing against scheming due to inner misalignment. Let me know if I'm wrong.