I've been poking at Evan's Clarifying Inner Alignment Terminology. His post gives two separate pictures (the objective-focused approach, which he focuses on, and the generalization-focused approach, which he mentions at the end). We can consolidate those pictures into one and-or graph as follows:

And-or graphs make explicit which subgoals are jointly sufficient, by drawing an arc between those subgoal lines. So, for example, this claims that intent alignment + capability robustness would be sufficient for impact alignment, but alternatively, outer alignment + robustness would also be sufficient. I've also added colors to make it a bit clearer.
The red represents what belongs entirely to the generalization-focused path. The yellow represents what belongs entirely to the objective-focused path. The blue represents everything else. (In this diagram, all the blue is on both paths, but that will not be the case in my next diagram.)
Note, in particular, that both paths seek outer alignment + objective robustness + capability robustness. According to the above picture, the disagreement between the two paths is only about how these sub-goals are best grouped together.
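To make the and-or semantics concrete, here is a minimal sketch (the encoding and names are mine, not Evan's): each goal maps to a list of subgoal-sets, any one of which is jointly sufficient for it.

```python
# Hypothetical encoding of the consolidated and-or graph: a goal maps to a
# list of alternative subgoal-sets ("or" over sets, "and" within a set).
GRAPH = {
    "impact alignment": [
        {"intent alignment", "capability robustness"},   # objective-focused route
        {"outer alignment", "robustness"},               # generalization-focused route
    ],
    "intent alignment": [{"outer alignment", "objective robustness"}],
    "robustness": [{"objective robustness", "capability robustness"}],
}

def derivable(goal, achieved, graph=GRAPH):
    """A goal holds if achieved directly, or if some sufficient set holds."""
    if goal in achieved:
        return True
    return any(all(derivable(sub, achieved, graph) for sub in subgoals)
               for subgoals in graph.get(goal, []))

# Both groupings of the same three leaf subgoals suffice for impact alignment:
leaves = {"outer alignment", "objective robustness", "capability robustness"}
assert derivable("impact alignment", leaves)
# Dropping a leaf breaks both routes:
assert not derivable("impact alignment", leaves - {"objective robustness"})
```

This is just a sanity check that, under this reading, the two paths really do bottom out in the same three leaves.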
But this doesn't seem to actually be true. Objective Robustness and Inner Alignment Terminology points out that, really, the two approaches want to define some of the terminology differently. My previous post on the subject suggests even more differences. Putting these things together, and with some other revisions, I suggest this revised joint graph:

The and-or graph here has been supplemented with double-headed arrows, which indicate a looser relationship of pseudo-equivalence (more on this later).
Definitions:
- Behavioral Alignment: This is just another way to say "impact alignment" that's more consistent with the rest of the terminology. Behavioral alignment means alignment in terms of what the system actually does. I don't want to delve into the definition of the term "alignment" itself in this post, so, that's about all I can say.
- Inner Robustness: This means that the mesa-objective is efficiently pursued under a wide range of circumstances (ie, including distributional shift). In other words: whatever the mesa-optimizer wants, it is broadly capable of achieving it.
- On-Distribution Alignment: Objective Robustness and Inner Alignment pointed out that the generalization-focused path re-defines "outer alignment" as "alignment on the training distribution" (so that we can then think of the rest of the alignment problem as a problem of generalization). I take this to mean both that the base objective is aligned on the training distribution, and that the behavior of the trained system is aligned on the training distribution. (One implies the other, if training succeeds.)
- Robustness: Performing well on the base objective in a wide range of circumstances.
- Intent Alignment: A model is intent-aligned if it has a mesa-objective, and that mesa-objective is aligned with humans. (Again, I don't want to get into exactly what "alignment" means.)
- Capability Robustness: As elsewhere, I define this as performing well on a behavioral objective even off-distribution. The system is highly capable at something, but we say nothing about what that thing is.
- Objective Robustness: The behavioral objective of the system is aligned with the base objective, even under distributional shift. [EDIT: I now want to re-define this as follows: the behavioral objective generalizes acceptably. See discussion under "red lines".]
- Inner Alignment: A system is inner-aligned if it has a mesa-objective, and that mesa-objective is aligned with the base objective.
- Outer Alignment: The base objective is aligned with humans.
Yellow Lines:
These lines represent the objective-centric approach. I think this rendering is more accurate than Evan's, primarily because my definition of intent alignment seems truer to Paul's original intention, and secondarily because inner alignment and outer alignment now form a nice pair.
- Inner Alignment + Outer Alignment ⇒ Intent Alignment: This is by transitivity of alignment. If the mesa-objective is aligned with the base objective, and the base objective is aligned with humans, then the mesa-objective will be aligned with humans.
- Intent Alignment + Inner Robustness ⇒ Behavioral Alignment: If something is intent-aligned, and also achieves its intent reliably, then it must be behaviorally aligned.
This path apparently implies building goal-oriented systems; all of the subgoals require that there actually is a mesa-objective. In contrast, I think researchers who identify with this path probably don't all think the end result would necessarily be goal-oriented. For example, my impression of what people mean by "solving the inner alignment problem" includes building systems which robustly avoid having inner optimizers at all. This is not well-represented by the proposed graph.
We could re-define "inner alignment" to mean "the mesa-objective aligns with the base objective, or the system lacks any mesa-objective" -- but this includes a lot of dumb things under "inner aligned", which seems intuitively wrong.
A closer term is acceptability, which could plausibly be defined as "not actively pursuing a misaligned goal". However, I was not sure how to put anything like this into the graph in a nice way.
Red Lines:
These lines represent the generalization-focused approach.
- Capability Robustness + Objective Robustness ⇒ Robustness: We perform well on the behavioral objective in a wide range of circumstances; and, the behavioral objective is aligned with the base objective in a wide range of circumstances; therefore, we perform well on the base objective in a wide range of circumstances.
- Robustness + On-Distribution Alignment ⇒ Behavioral Alignment: We perform well on the base objective in training, and we generalize well, therefore we perform well in general.
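The two red inferences chain together, which can be checked mechanically with a small forward-chaining sketch (a hypothetical encoding of mine, not part of the post's formalism):

```python
# Each rule: a goal together with its jointly sufficient premises.
RULES = {
    "robustness": {"capability robustness", "objective robustness"},
    "behavioral alignment": {"robustness", "on-distribution alignment"},
}

def closure(achieved, rules=RULES):
    """Forward-chain: keep firing rules until nothing new is derived."""
    achieved = set(achieved)
    changed = True
    while changed:
        changed = False
        for goal, premises in rules.items():
            if goal not in achieved and premises <= achieved:
                achieved.add(goal)
                changed = True
    return achieved

# The three generalization-focused subgoals suffice for behavioral alignment:
leaves = {"capability robustness", "objective robustness",
          "on-distribution alignment"}
assert "behavioral alignment" in closure(leaves)
```

Note that on-distribution alignment is only needed by the second rule, which matches the point below that it only constrains the base objective on the training distribution.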
This approach has some distinct advantages over the objective-focused approach. First, it does not assume the existence of inner optimizers at any point. It is possible that this approach could succeed without precisely defining "inner optimizer", identifying mesa-objectives and checking their alignment, or anything like that. Second, this approach can stand on the shoulders of existing statistical learning theory. If the whole problem boils down to generalization guarantees, then perhaps we just need to advance work on the same kinds of problems which machine learning has faced since its inception.
A subtlety here is that the base objective matters in two different ways. For "on-distribution alignment", we only care about how the base objective performs on the training data. This makes sense: that's the only way it affects training, so why would we care about correctly specifying outer alignment off-distribution? Instead, we rely on generalization to specify that part correctly. This seems like an advantage of the approach, because it greatly reduces the outer alignment problem.
However, objective robustness also depends on the base objective, and specifically depends on the off-distribution behavior of the base objective. This reflects the fact that to generalize correctly, the system does need to get information about the off-distribution base objective somehow. But how? In prosaic AI, only on-distribution behavior of the loss function can influence the end result.
I can see a few possible responses here.
- Double down on the "correct generalization" story: hope to somehow avoid the multiple plausible generalizations, perhaps by providing enough training data, or appropriate inductive biases in the system (probably both).
- Achieve objective robustness through other means. In particular, inner alignment is supposed to imply objective robustness. In this approach, inner-alignment technology provides the extra information to generalize the base objective appropriately.
Response #2 is consistent with how the generalization-focused path has been drawn by others; IE, it includes inner alignment as a subgoal of objective robustness. However, including this fully in the generalization-focused path seems unfortunate to me, because it adds mesa-objectives as a necessary assumption (since inner alignment requires them). Perhaps dealing directly with mesa-objectives is unavoidable. However, I would prefer to be agnostic about that for the time being.
EDIT:
I now think (to fix the above-mentioned problem, and to represent Rohin's view more accurately) we should re-define objective robustness as follows:
Objective Robustness: The behavioral objective generalizes acceptably.
The notion of "acceptable" is left purposefully open, but it should have two desirable properties:
- We should be happy with a model if we know it has high average-case performance (on training data) and we know it has acceptable performance generally. In other words, it should bridge the gap in the argument.
- Acceptability should not be too much of a burden. Hopefully, generalizing acceptably is easier than generalizing exactly correctly.
Blue Lines:
Inner Robustness and Capability Robustness
Inner robustness implies capability robustness, because we know there's a goal which the system performs well on in a broad variety of circumstances. (Inner robustness just tells us a bit more about what that goal is, while capability robustness doesn't care.)
Capability robustness sort of implies inner robustness, if we assume a degree of agency: it would be pretty strange for the system to robustly pursue some other goal than its mesa-objective.
However, these implications require the presence of an inner optimizer. In particular, capability robustness obviously won't imply inner robustness in the absence of one of those.
Inner Alignment and Objective Robustness
Evan argued that inner alignment implies objective robustness. This argument requires that the agent is capable enough that its behavioral objective will match its mesa-objective, even under distributional shift.
We could also argue in the other direction: if something is behaviorally aligned with the base objective in a broad variety of circumstances, then (again assuming sufficient agency), surely it must not have a misaligned objective.
Again, these implications only make sense if there is a mesa-objective.
On-Distribution Alignment and Outer Alignment
Outer alignment implies on-distribution alignment trivially. On-distribution doesn't imply outer alignment by any means, but the pseudo-equivalence here is because outer alignment doesn't matter beyond the influence of the base objective on training; so, at least for prosaic AI, outer alignment shouldn't matter beyond on-distribution alignment.
Equating Pseudo-Equivalences
If we collapse all the pseudo-equivalent subgoals, we get an and-or graph which looks quite similar to the one we started out with:

This makes clear that both approaches have an "inner alignment type thing", an "outer alignment type thing", and a "capability type thing"; they just define these things differently:
| | Objective-Focused Approach | Generalization-Focused Approach |
|---|---|---|
| Outer cluster: | Base objective is aligned with humans. | Base objective is aligned with humans on-distribution. |
| Inner cluster: | Mesa-objective exists and is aligned with base objective. | (Robust) behavioral objective exists and is aligned with base objective. |
| Capabilities cluster: | Achieves high performance on the mesa-objective, in a broad variety of situations. | Achieves high performance on behavioral objective in a broad variety of situations. |
| Grouping: | (outer + inner) + capabilities | outer + (inner + capabilities) |
This may be an easier way to remember my larger graph.
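The collapse can also be done mechanically: map each node to its cluster and rewrite the sufficient-sets (a hypothetical sketch; the cluster names and rule encoding are mine, not from the post):

```python
# Map each pseudo-equivalent node to a cluster representative.
CLUSTER = {
    "inner robustness": "capabilities",  "capability robustness": "capabilities",
    "inner alignment": "inner",          "objective robustness": "inner",
    "outer alignment": "outer",          "on-distribution alignment": "outer",
}

# The revised graph's and-edges (yellow path first, then red path).
RULES = [
    ("intent alignment", {"inner alignment", "outer alignment"}),
    ("behavioral alignment", {"intent alignment", "inner robustness"}),
    ("robustness", {"capability robustness", "objective robustness"}),
    ("behavioral alignment", {"robustness", "on-distribution alignment"}),
]

# Rewrite each premise via its cluster; nodes without a cluster stay as-is.
collapsed = {(goal, frozenset(CLUSTER.get(p, p) for p in premises))
             for goal, premises in RULES}

for goal, premises in sorted(collapsed, key=lambda r: (r[0], sorted(r[1]))):
    print(goal, "<-", " + ".join(sorted(premises)))
```

The output reproduces the two groupings from the table: (outer + inner) feeding intent alignment on one side, and (inner + capabilities) feeding robustness on the other.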
Other remarks:
Definition of "Alignment"
I've used the term "aligned" in several definitions where Evan used more nuanced phrases. For example, inner alignment:
- Evan: A mesa-optimizer is inner-aligned if the optimal policy for its mesa-objective is impact aligned with the base objective.
- Me: A system is inner-aligned if it has a mesa-objective, and that mesa-objective is aligned with the base objective.
Evan's definition seems more nuanced and useful. It puts some gears on the concept of alignment. It averts the mistake "aligned means equal" (if humans want to drink coffee, that should not imply that aligned robots want to drink coffee). It captures the idea that goal alignment has to do with high levels of performance (we don't want to label something as misaligned just because it makes dumb mistakes).
However, I'm not confident that the details of Evan's locutions are quite right. For example, should alignment be tested only in terms of the very best policy? This seems like a necessary condition, but not sufficient. If behavior is severely misaligned even for some very very high-performance policies (but technically sub-optimal), then the alignment isn't good enough; we don't expect training to find the very best policy.
So, I think it better to remain somewhat ambiguous for this post, and just say "aligned" without going further.
Other Nuances
Note that with my re-definition of "objective robustness", the generalization-focused path now implies achieving a weaker kind of alignment: the objective-focused approach achieves what we might call strong alignment, where the system is robustly pursuing aligned goals. The generalization-focused approach will be weaker (depending on how exactly "acceptability" gets defined), only guaranteeing that the resulting system doesn't do something terrible. (This weaker form of alignment seems very reasonable to me.)
This means we can split the top bubble into objective/generalization -focused versions, like the others. If we really want, we can also come up with split definitions of "robustness" and "intent alignment", so that the whole graph gets split, although this doesn't seem particularly useful.
The "task AI vs goal-directed AI" distinction deserves a mention. To some extent, the objective-focused approach is all about goal-directed AI, while the generalization-focused approach remains more agnostic. However, it could be that task-based systems still have mesa-objectives (EG "do what the user says"), just myopic ones. Part of inner alignment is then to ensure myopia.
Meta-thoughts on the graph/terminology.
Generally, I felt like if I had chosen more things to be careful about, I could have made the graph three times as big. It's tempting to try and map out all possible important properties and all possible approaches. However, the value of a map like this rapidly diminishes as the map gets larger. Which things to make perfectly clear vs muddy is a highly subjective choice. I would appreciate feedback on the choices I made, as this will inform my write-up-to-come. (This post will resemble the first major section of that write-up.)
Also, I'm not very committed to the terms I chose here. EG, using "behavioral alignment" rather than "impact alignment". I welcome alternate naming schemes.
I find myself thinking that objective robustness is actually what I mean by the inner alignment problem. Abergal voiced similar thoughts. But this makes it seem unfortunate that "inner alignment" refers specifically to the thing where there are mesa-optimizers. I'm not sure what to do about this.
If they are equivalent, then I feel like the obvious value of the work would make resource constraints go away?
However, thinking about raising money for it helps to convince me that the proposed linkage has "leaks".
Imagine timelines where they had Robustness & Intent Alignment (but there was no point where they had "Inner Robustness" or "On-Distribution Alignment"). Some of those timelines might have win conditions, and others might not. The imaginable failures work for me as useful intuition pumps.
I haven't managed to figure out a clean short response here, so I'll give you apologies and lots of words <3
...
If I were being laconic, I might try to restate what I think I noticed: BOTH "Inner Alignment" and "Objective Robustness" have in some deep sense solved the principal-agent problem...
...but only Inner Alignment has solved the "timeless multi-agent case", while Objective Robustness has solved the principal-agent problem for maybe only the user, only at the moment the user requests help?
(I can imagine the people who are invested in the yellow or the red options rejecting this for various reasons, but I think it would be interesting to hear the objections framed in terms of principals, agents, groups, contracts, explicable requests for an AGI, and how these could interact over time to foreclose the possibility of very high value winning conditions. Since my laconic response's best expected followup is more debate, it seems good to sharpen and clarify the point.)
...
Restating "the same insight" in a concretely behavioral form: I think that hypothetically, I would have a lot easier time explicitly and honestly pitching generic (non-altruistic, non-rationalist) investors on an "AGI Startup" if I was aiming for Robustness, rather than Intent Alignment.
The reason it would be easier: it enables the benefits to go disproportionately to the investors. Like, what if it turns out that disproportionate investor returns are not consistent with something like "the world's Coherent Extrapolated Volition (or whatever the latest synecdoche for the win condition is)". THEN, just request "pay out the investors and THEN with any leftovers do the good stuff". Easy peasy <3
That is, Robustness is easier to raise funds for, because it increases the pool of possible investors from "saints" to "normal selfish investors"...
...which feels like almost an accusation against some people, which is not exactly what I'm aiming for. (I'm not not aiming for that, but it's not the goal.) I'll try again.
...
Restating "again, and at length, and with a suggested modification to the diagram":
My intuitions suggest reversing the "coin vs paper" metaphor to make it very vivid and to make metal money be the real good stuff <3
(If you have not been studying blockchain economics by applying security mindset to protocol governance for a while, and kept generating things "not as good as gold" over and over and over, maybe this metaphor won't make sense to you. It works for ME though?)
I imagine an "Intent Alignment" that is Actually Good as being like 100 kg of non-radioactive gold.
You could bury it somewhere, and dig it up 1000 years later, and it would still be just what it is: an accurate theory of pragmatically realizable abstract goodness that is in perfect resonance with the best parts of millennia of human economic and spiritual and scientific history up to the moment it was produced.
(
Asteroid mining could change the game for actual gold? And maybe genetic engineering could change the question of values so maybe 150 years from now humans will be twisted demons?
But assuming no "historically unprecedented changes to our axiological circumstances" Intent Alignment and gold seem metaphorically similar to me (and similarly at risk as a standard able to function for human people as an age old meter stick for goodness in the face of technological plasticity and post-scarcity economics and wireheading and sybil attacks and so on).
)
Following this metaphorical flip: "Robustness" becomes the vending machine that will take any paper instruction, and banking credentials you wish to provide (for a bank that is part of Westphalian finance and says that you have credit).
If you pay enough to whoever owns a Robust machine, it'll give you almost anything...
...then the impedance mismatch could be thought of as a problem where the machine doesn't model the gold plates covered in the Thoughtful Wish as "valuable" (because the gold isn't held by a bank), though maybe it could work as an awkward and bulky set of instructions that aren't on paper... but then you could do a clever referential rewrite?
Thus, a simple way to reconcile these things would be for some rich/powerful person to come up, swipe a card to transfer 20 million Argentinian nuevo pesos (possibly printed yesterday?) and write the instructions "Do what that 100kg of gold that is stamped and shaped with relevant algorithms and inspiring poetry says to do."
Since Robustness will more or less safely-to-the-user do anything that can be done (like it won't fail to parse "that" in a sloppy and abusive way, for example, triggering on other gold and getting the instruction scrambled, or any of an infinity of other quibbles that could be abusively generated) it will work, right?
By hypothesis, it has "Objective Robustness" so it WILL robustly achieve any achievable goal (or fail out in some informative way if asked to make 1+2=4 or whatever).
So then TIME seems to be related to how the pesos and a paper instructions to follow the gold instructions could fail?
Suppose a Robust vending machine was first asked to create a singleton situation where an AGI exists, manages basically everyone, but isn't following any kind of Intent Aligned "Golden Plan" that is philosophically coherent n'stuff.
Since the gears spin very fast, the machine understands that a Golden Plan would be globally inconsistent with its own de facto control of "all the things" already in its relatively pedestrian and selfish way that serve the goals of the first person to print enough pesos, and so it would prevent any such Golden Plan from being carried out later.
To connect this longer version of a restatement with earlier/shorter restatements, recall the idea of solving the principal/agent problem in the "timeless multi-agent case"...
In the golden timelessly aligned case, if somehow in the future an actually better theory of "what an AGI should do" is discovered (and so we get "even more gold" in the coin/paper/vending machine metaphor), then Intent Alignment would presumably get out of the way and allow this progress to unfold in an essentially fair and wise way.
Robustness has no such guarantees. This may get at the heart of the inconsistency?
Compressing this down to a concrete suggestion to usefully change the diagram:
I think maybe you could add a 10th node, that was something like "A Mechanism To Ensure That Early Arriving Robustness Defers To Late Arriving Intent Alignment"?
(In some sense then, the thing that Robustness might lack is "corrigibility to high quality late-arriving Alignment updates"?)
I'm pretty sure that Deference is not traditionally part of Robustness as normally conceived, but also if such a thing somehow existed in addition to Robustness then I'd feel like: yeah, this is going to work and the three things (Deference, Robustness, and Intent Alignment) might be logically sufficient to guarantee the win condition at the top of the diagram :-)