I think Evan's Clarifying Inner Alignment Terminology is quite clever; better optimized than it may at first appear. However, I do think there are a couple of things which don't work as well as they could:
- What exactly does the modifier "intent" mean?
  - Based on how "intent alignment" is defined (basically, that the optimal policy for its behavioral objective would be good for humans), capability robustness is exactly what it needs to be combined with in order to achieve impact alignment. However, we could instead define "intent alignment" as "the optimal policy for the mesa-objective would be good for humans". In this case, capability robustness is not exactly what's needed; instead, what I'll provisionally call inner robustness (IE, strategies for achieving the mesa-objective generalize well) would take its place. (I try to make this contrast concrete in a toy sketch just after this list.)
  - (I find myself flipping between these two views, and thereby getting confused.)
  - Furthermore, I would argue that the second alternative (making "intent alignment" about the mesa-objective) is more true to the idea of intent alignment. Making it about the behavioral objective turns it into a fact about the actual impact of the system, since "behavioral objective" is defined by looking at what the system actually accomplishes. But then, why the divide between intent alignment and impact alignment?
- Any definition where "inner alignment" isn't directly paired with "outer alignment" is going to be confusing for beginners.
  - In Evan's terms, objective robustness is basically a more clever (more technically accurate and more useful) version of "the behavioral objective equals the outer objective", whereas inner alignment is "the mesa-objective equals the outer objective".
  - (It's clear that "behavioral" is intended to imply generalization, here -- the implication of objective robustness is supposed to be that the objective is stable under distributional shift. But this is obscured by the definition, which does not explicitly mention any kind of robustness/generalization.)
  - By making this distinction, Evan highlights the assumption that solving inner alignment will solve behavioral alignment: he thinks that the most important cases of catastrophic bad behavior are intentional (ie, come from misaligned objectives, either outer objective or inner objective).
  - In contrast, the generalization-focused approach puts less emphasis on the assumption that the worst catastrophes are intentional -- which could be an advantage, if this assumption isn't so good!
  - However, although I find the decomposition insightful, I dread explaining it to beginners in this way. I find that I would prefer to gloss over objective robustness and pretend that intent alignment simply factors into outer alignment and inner alignment.
  - I also find myself constantly thinking as if inner/outer alignment were a pair, intuitively!
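Here is the toy sketch promised above, contrasting the two readings of "intent alignment" and the robustness notion each one needs in order to add up to impact alignment. This is my own illustration, not anything from Evan's post: the `Agent` fields and the two predicate names are hypothetical stand-ins for the informal definitions, left as stubs rather than anything you could actually evaluate.

```python
from dataclasses import dataclass
from typing import Callable

# An "objective" is just something that scores outcomes; the details don't matter here.
Objective = Callable[[str], float]


@dataclass
class Agent:
    behavioral_objective: Objective  # what the agent appears to pursue, read off from its behavior
    mesa_objective: Objective        # what an inner optimizer explicitly searches over (if any)


def optimal_policy_is_good_for_humans(objective: Objective) -> bool:
    """Stub for: 'the optimal policy for this objective is impact aligned with humans'."""
    raise NotImplementedError  # placeholder, not a real check


def pursues_well_off_distribution(agent: Agent, objective: Objective) -> bool:
    """Stub for: 'the agent keeps performing well on this objective off-distribution'."""
    raise NotImplementedError  # placeholder, not a real check


# Reading 1 (Evan's current definition): "intent alignment" is about the behavioral
# objective, so capability robustness is exactly the missing conjunct.
def impact_alignment_reading_1(agent: Agent) -> bool:
    intent_aligned = optimal_policy_is_good_for_humans(agent.behavioral_objective)
    capability_robust = pursues_well_off_distribution(agent, agent.behavioral_objective)
    return intent_aligned and capability_robust  # children jointly sufficient for the parent


# Reading 2 (the alternative): "intent alignment" is about the mesa-objective, so the
# missing conjunct is "inner robustness" instead: robust pursuit of the mesa-objective.
def impact_alignment_reading_2(agent: Agent) -> bool:
    intent_aligned = optimal_policy_is_good_for_humans(agent.mesa_objective)
    inner_robust = pursues_well_off_distribution(agent, agent.mesa_objective)
    return intent_aligned and inner_robust  # same shape, different objective throughout
```

The only difference between the two readings is which objective every conjunct quantifies over, which is (I think) exactly why I keep flipping between them.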
My current proposal would be the following:
- Re-define "intent alignment" to refer to the mesa-objective.
  - Now, inner alignment + outer alignment directly imply intent alignment, provided that there is a mesa-objective at all (IE, assuming that there's an inner optimizer). (I sketch the intended implication just after this list.)
  - This fits with the intuitive picture that inner and outer are supposed to be complementary!
- If we wish, we could replace or re-define "capability robustness" with "inner robustness", the robustness of pursuit of the mesa-objective under distributional shift.
  - This is exactly what we need to pair with the new "intent alignment" in order to achieve impact alignment.
  - However, this is clearly a narrower concept than capability robustness (it assumes there is a mesa-objective).
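Schematically, the intended implication is just transitivity. (Read this as the intended subgoal structure, glossing over the "optimal policy" and "limit of perfect training" qualifiers in the definitions; it is not meant as a theorem.)

$$\underbrace{\text{mesa} \rightarrow \text{base}}_{\text{inner alignment}} \;\wedge\; \underbrace{\text{base} \rightarrow \text{humans}}_{\text{outer alignment}} \;\Rightarrow\; \underbrace{\text{mesa} \rightarrow \text{humans}}_{\text{intent alignment, re-defined}}$$

Here "$A \rightarrow B$" abbreviates "optimizing $A$ leads to outcomes that $B$ endorses", and the whole thing is conditional on a mesa-objective existing in the first place.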
This is a complex and tricky issue, and I'm eager to get thoughts on it.
Relevant reading:
- Evan's post on the topic.
- The post which discusses Evan's terminology as the "objective-focused approach", contrasting it with Rohin's "generalization-focused approach". My proposal would make the two diagrams more different from each other. I'm also interested in trying to merge the diagrams or otherwise "bridge the conceptual gap" between the two approaches.
As a reminder, here are Evan's definitions. Nested children are subgoals; it's supposed to be the case that if you can achieve all the children, you can achieve the parent.
- Impact Alignment: An agent is impact aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic.
  - Capability Robustness: An agent is capability robust if it performs well on its behavioral objective even in deployment/off-distribution.
  - Intent Alignment: An agent is intent aligned if the optimal policy for its behavioral objective is impact aligned with humans.
    - Outer Alignment: An objective function is outer aligned if all models that perform optimally on it in the limit of perfect training and infinite data are intent aligned.
    - Objective Robustness: An agent is objective robust if the optimal policy for its behavioral objective is impact aligned with the base objective it was trained under.
      - Inner Alignment: A mesa-optimizer is inner aligned if the optimal policy for its mesa-objective is impact aligned with the base objective it was trained under.
So we split impact alignment into intent alignment and capability robustness; we split intent alignment into outer alignment and objective robustness; and we achieve objective robustness through inner alignment.
Here's what my proposed modifications do:
- (Impact) Alignment
  - Inner Robustness: An agent is inner-robust if it performs well on its mesa-objective even in deployment/off-distribution.
  - Intent Alignment: An agent is intent aligned if the optimal policy for its mesa-objective is impact aligned with humans.
    - Outer Alignment
    - Inner Alignment
"Objective Robustness" disappears from this, because inner+outer gives intent-alignment directly now. This is a bit of a shame, as I think objective robustness is an important subgoal. But I think the idea of objective robustness fits better with the generalization-focused approach:
- Alignment
  - Outer Alignment: For this approach, outer alignment is re-defined to be only on-training-distribution (we could call it "on-distribution alignment" or something).
  - Robustness
    - Objective Robustness
      - Inner Alignment
    - Capability Robustness
And it's fine for there to be multiple different subgoal hierarchies, since there may be multiple paths forward.
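For side-by-side comparison, here are the three subgoal trees as nested data. This is purely a restatement of the bullet diagrams above; each key's children are its subgoals, and the parenthetical labels are my own glosses.

```python
# Each dict maps a goal to its subgoals; an empty dict marks a leaf.

evans_decomposition = {
    "Impact Alignment": {
        "Capability Robustness": {},
        "Intent Alignment (about the behavioral objective)": {
            "Outer Alignment": {},
            "Objective Robustness": {
                "Inner Alignment": {},
            },
        },
    },
}

proposed_decomposition = {
    "(Impact) Alignment": {
        "Inner Robustness": {},
        "Intent Alignment (about the mesa-objective)": {
            "Outer Alignment": {},
            "Inner Alignment": {},
        },
    },
}

generalization_focused_decomposition = {
    "Alignment": {
        "Outer Alignment (on-distribution)": {},
        "Robustness": {
            "Objective Robustness": {
                "Inner Alignment": {},
            },
            "Capability Robustness": {},
        },
    },
}
```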
InfraBayes doesn't look for the regularity in reality that NNs are taking advantage of, agreed. But InfraBayes is exactly about "what kind of regularity assumptions can we realistically make about reality?" You can think of it as a reaction to the unrealistic nature of the regularity assumptions which Solomonoff induction makes. So it offers an answer to the question "what useful+realistic regularity assumptions could we make?"
The InfraBayesian answer is "partial models". IE, the idea that even if reality cannot be completely described by usable models, perhaps we can aim to partially describe it. This is an assumption about the world -- not all worlds can be usefully described by partial models. However, it's a weaker assumption about the world than usual. So it may not have presented itself as an assumption about the world in your mind, since perhaps you were thinking more of stronger assumptions.
If it's a good answer, it's at least plausible that NNs work well for related reasons.
But I think it also makes sense to try to get at the useful+realistic regularity assumptions from scratch, rather than necessarily making it all about NNs.
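To give a flavor of the "partial models" idea, here is a deliberately tiny toy of my own construction (not the actual InfraBayesian machinery, which works with infradistributions and a more careful maximin rule; all names and payoffs below are made up). A hypothesis that pins down only one aspect of the world, here the weather marginal, still supports decisions, because the agent can rank actions by their worst-case expected utility over every way of filling in the part the hypothesis stays silent on. For simplicity the sketch only ranges over completions where traffic is independent of weather.

```python
# Toy illustration of deciding under a "partial model" (my own toy, not the actual
# InfraBayesian formalism). The hypothesis pins down only the weather marginal and
# says nothing about traffic; actions are ranked by worst-case expected utility over
# every completion of the unspecified part.

P_SUNNY = 0.7  # the only claim the partial hypothesis makes about the world


def utility(action, weather, traffic):
    """Made-up payoffs for a made-up decision problem."""
    if action == "drive":
        return 5 if traffic == "light" else -2   # driving only cares about traffic
    return 3 if weather == "sunny" else -1       # walking only cares about weather


def expected_utility(action, p_sunny, p_light):
    """Expectation under one full completion (weather and traffic independent, for simplicity)."""
    total = 0.0
    for weather, pw in (("sunny", p_sunny), ("rainy", 1 - p_sunny)):
        for traffic, pt in (("light", p_light), ("heavy", 1 - p_light)):
            total += pw * pt * utility(action, weather, traffic)
    return total


def worst_case_value(action, grid=101):
    """Worst case over every traffic distribution the partial model allows (coarse grid)."""
    return min(expected_utility(action, P_SUNNY, k / (grid - 1)) for k in range(grid))


if __name__ == "__main__":
    for action in ("walk", "drive"):
        print(action, round(worst_case_value(action), 2))
    # walk:  1.8  (0.7 * 3 + 0.3 * (-1), independent of traffic)
    # drive: -2.0 (worst case is all-heavy traffic)
    # The hypothesis says nothing about traffic, yet it still licenses a decision:
    # "walk" comes with a guaranteed expected payoff under every completion.
```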