I think Evan's Clarifying Inner Alignment Terminology is quite clever; better optimized than it may at first appear. However, I do think there are a couple of things which don't work as well as they could:
- What exactly does the modifier "intent" mean?
- Based on how "intent alignment" is defined (basically, the optimal policy of its behavioral objective would be good for humans), capability robustness is exactly what it needs to combine with in order to achieve impact alignment. However, we could instead define "intent alignment" as "the optimal policy of the mesa-objective would be good for humans". In this case, capability robustness is not exactly what's needed; instead, what I'll provisionally call inner robustness (IE, strategies for achieving the mesa-objective generalize well) would be put in its place.
- (I find myself flipping between these two views, and thereby getting confused.)
- Furthermore, I would argue that the second alternative (making "intent alignment" about the mesa-objective) is more true to the idea of intent alignment. Making it about the behavioral objective turns it into a fact about the actual impact of the system, since "behavioral objective" is defined by looking at what the system actually accomplishes. But then, why the divide between intent alignment and impact alignment?
- Any definition where "inner alignment" isn't directly paired with "outer alignment" is going to be confusing for beginners.
- In Evan's terms, objective robustness is basically a more clever (more technically accurate and more useful) version of "the behavioral objective equals the outer objective", whereas inner alignment is "the mesa-objective equals the outer objective".
- (It's clear that "behavioral" is intended to imply generalization, here -- the implication of objective robustness is supposed to be that the objective is stable under distributional shift. But this is obscured by the definition, which does not explicitly mention any kind of robustness/generalization.)
- By making this distinction, Evan highlights the assumption that solving inner alignment will solve behavioral alignment: he thinks that the most important cases of catastrophic bad behavior are intentional (ie, come from misaligned objectives, either outer objective or inner objective).
- In contrast, the generalization-focused approach puts less emphasis on the assumption that the worst catastrophes are intentional -- which could be an advantage, if this assumption isn't so good!
- However, although I find the decomposition insightful, I dread explaining it to beginners in this way. I find that I would prefer to gloss over objective robustness and pretend that intent alignment simply factors into outer alignment and inner alignment.
- I also find myself constantly thinking as if inner/outer alignment were a pair, intuitively!
My current proposal would be the following:
- Re-define "intent alignment" to refer to the mesa-objective.
- Now, inner alignment + outer alignment directly imply intent alignment, provided that there is a mesa-objective at all (IE, assuming that there's an inner optimizer).
- This fits with the intuitive picture that inner and outer are supposed to be complementary!
- If we wish, we could replace or re-define "capability robustness" with "inner robustness", the robustness of pursuit of the mesa-objective under distributional shift.
- This is exactly what we need to pair with the new "intent alignment" in order to achieve impact alignment.
- However, this is clearly a narrower concept than capability robustness (it assumes there is a mesa-objective).
This is a complex and tricky issue, and I'm eager to get thoughts on it.
Relevant reading:
- Evan's post on the topic.
- The post which discusses Evan's as the "objective-focused approach", contrasting it with Rohin's "generalization-focused approach". My proposal would make the two diagrams more different from each other. I'm also interested in trying to merge the diagrams or otherwise "bridge the conceptual gap" between the two approaches.
As a reminder, here are Evan's definitions. Nested children are subgoals; it's supposed to be the case that if you can achieve all the children, you can achieve the parent.
- Impact Alignment: An agent is impact aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic.
  - Capability Robustness: An agent is capability robust if it performs well on its behavioral objective even in deployment/off-distribution.
  - Intent Alignment: An agent is intent aligned if the optimal policy for its behavioral objective is impact aligned with humans.
    - Outer Alignment: An objective function is outer aligned if all models that perform optimally on it in the limit of perfect training and infinite data are intent aligned.
    - Objective Robustness: An agent is objective robust if the optimal policy for its behavioral objective is impact aligned with the base objective it was trained under.
      - Inner Alignment: A mesa-optimizer is inner aligned if the optimal policy for its mesa-objective is impact aligned with the base objective it was trained under.
So we split impact alignment into intent alignment and capability robustness; we split intent alignment into outer alignment and objective robustness; and we achieve objective robustness through inner alignment.
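To spell out the intended "achieve all the children, achieve the parent" reading, here is the same hierarchy written as bare implications (my restatement, not notation from Evan's post):

```latex
% Evan's decomposition, read as "the children jointly suffice for the parent".
\begin{align*}
\text{Capability Robustness} \wedge \text{Intent Alignment}
  &\Rightarrow \text{Impact Alignment} \\
\text{Outer Alignment} \wedge \text{Objective Robustness}
  &\Rightarrow \text{Intent Alignment} \\
\text{Inner Alignment}
  &\Rightarrow \text{Objective Robustness}
\end{align*}
```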
Here's what my proposed modifications do:
- (Impact) Alignment
  - Inner Robustness: An agent is inner-robust if it performs well on its mesa-objective even in deployment/off-distribution.
  - Intent Alignment: An agent is intent aligned if the optimal policy for its mesa-objective is impact aligned with humans.
    - Outer Alignment
    - Inner Alignment
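Written the same way, the proposed hierarchy reads as follows (again just my restatement; the second implication assumes there is a mesa-objective at all, i.e. an inner optimizer):

```latex
% Proposed decomposition, with "intent alignment" re-defined via the mesa-objective.
% The second implication assumes a mesa-objective exists (i.e., there is an inner optimizer).
\begin{align*}
\text{Inner Robustness} \wedge \text{Intent Alignment}
  &\Rightarrow \text{(Impact) Alignment} \\
\text{Outer Alignment} \wedge \text{Inner Alignment}
  &\Rightarrow \text{Intent Alignment}
\end{align*}
```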
"Objective Robustness" disappears from this, because inner+outer gives intent-alignment directly now. This is a bit of a shame, as I think objective robustness is an important subgoal. But I think the idea of objective robustness fits better with the generalization-focused approach:
- Alignment
  - Outer Alignment: For this approach, outer alignment is re-defined to be only on-training-distribution (we could call it "on-distribution alignment" or something).
  - Robustness
    - Objective Robustness
      - Inner Alignment
    - Capability Robustness
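And the generalization-focused hierarchy, written as implications (my reading of the diagram in the linked post, with "Outer Alignment" here being the weaker, on-distribution version):

```latex
% Generalization-focused decomposition; "Outer Alignment" is the on-distribution version.
\begin{align*}
\text{Outer Alignment} \wedge \text{Robustness}
  &\Rightarrow \text{Alignment} \\
\text{Objective Robustness} \wedge \text{Capability Robustness}
  &\Rightarrow \text{Robustness} \\
\text{Inner Alignment}
  &\Rightarrow \text{Objective Robustness}
\end{align*}
```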
And it's fine for there to be multiple different subgoal hierarchies, since there may be multiple paths forward.
Update: having now thought more deeply about this, I no longer endorse my above comment.
While I think the reasoning was right, I got the definitions exactly backwards. To be clear, what I would now claim is:
Everything in the above comment then still goes through, except with these definitions reversed.
On the one hand, the "perfect IRL" definition of the behavioral objective seems more naturally consistent with the omnipotent experimenter setting in the IRL unidentifiability paper cited downthread. As far as I know, perfect IRL isn't defined anywhere other than by reference to this reward modelling paper, which introduces the term but doesn't define it either. But the omnipotent experimenter setting seems to capture all the properties implied by perfect IRL, and does so precisely enough that one can use it to make rigorous statements about the behavioral objective of a system in various contexts.
On the other hand, it's actually perfectly possible for a mesa-optimizer to have a mesa-objective that is inconsistent with its own actions under some subset of conditions (the key conceptual error I was making was in thinking this was not possible). For example, a human being is a mesa-optimizer from the point of view of evolution. A human being may have something like "maximize happiness" as their mesa-objective. And a human being may, and frequently does, do things that do not maximize for their happiness.
A few consequences of the above:
This seems like it puts these definitions on a more rigorous footing. It also starts to clarify in my mind the connection with the "generalization-focused approach" to inner alignment, since it suggests a procedure one might use in principle to find out whether a system is pursuing coherent utilities on some subset of distributions. ("When we do every experiment allowed by this subset of distributions, do we recover a nontrivial utility function or not?")
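As a purely illustrative toy, here is roughly what that check could look like for a finite set of options, assuming we could somehow elicit the system's binary choices from experiments on the allowed subset of distributions. The names `recover_utility`, `options`, and `prefer` are hypothetical stand-ins, not anything from the posts under discussion; the only real content is the standard fact that a finite set of strict pairwise preferences is representable by a utility function exactly when the revealed-preference graph has no cycles.

```python
from itertools import combinations

def recover_utility(options, prefer):
    """Toy check: do the system's revealed choices admit a (nontrivial) utility function?

    options: finite list of outcomes the restricted set of distributions lets us test.
    prefer(a, b): True if the system, offered {a, b} in some experiment, picks a over b.
    Both are hypothetical stand-ins for "every experiment allowed by this subset of
    distributions"; a real system would need far more careful elicitation.
    """
    # Build the strict revealed-preference graph from every allowed pairwise experiment.
    better_than = {a: set() for a in options}
    for a, b in combinations(options, 2):
        if prefer(a, b):
            better_than[a].add(b)
        elif prefer(b, a):
            better_than[b].add(a)
        # If neither is chosen, treat the pair as indifference and add no edge.

    # Depth-first search: a preference cycle means no utility function can rationalize the choices.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {a: WHITE for a in options}
    reverse_topo = []

    def visit(node):
        colour[node] = GREY
        for worse in better_than[node]:
            if colour[worse] == GREY:
                return False  # cycle: incoherent preferences
            if colour[worse] == WHITE and not visit(worse):
                return False
        colour[node] = BLACK
        reverse_topo.append(node)
        return True

    for a in options:
        if colour[a] == WHITE and not visit(a):
            return None  # no utility function fits these choices

    # Later positions in reverse_topo are preferred to earlier ones, so the index
    # itself is a utility consistent with every observed strict choice.
    utility = {a: i for i, a in enumerate(reverse_topo)}
    nontrivial = any(better_than[a] for a in options)
    return utility if nontrivial else {}  # {} signals "only the trivial (indifferent) utility"
```

On this toy reading, a returned dictionary means the choices observable on that subset of distributions look like coherent (ordinal) utility maximization, `{}` means only the trivial "indifferent to everything" utility fits, and `None` means no utility function fits at all.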
Would definitely be interested in getting feedback on these thoughts!