Thank you so much for writing this! I've been confused about this terminology for a while and I really like your reframing.
An additional terminological point that I think it would be good to solidify is what people mean when they refer to "inner alignment" failures. As you allude to, my impression is that some people use it to refer to objective robustness failures, broadly, whereas others (e.g. Evan) use it to refer to failures that involve mesa optimization. There is then additional confusion around whether we should expect "inner alignment" failures that don't involve mesa optimization to be catastrophic and, relatedly, around whether humans count as mesa optimizers.
I think I'd advocate for letting "inner alignment" failures refer to objective robustness failures broadly, talking about "mesa optimization failures" as such, and then leaving the question about whether there are problematic inner alignment failures that aren't mesa optimization-related on the table.
If there were a "curated posts" system on the alignment forum, I would nominate this for curation. I think it's a great post.
Just re-read this because you cited it recently, and I like it even more the second time :)
I also like an intermediate point between the changes you lay out: keeping the "old style" tree diagram that puts outer alignment and objective robustness together under "intent alignment," but changing the interpretation of these boxes to your "new style" version where outer alignment is less impressive / stringent and robustness is more central.
Planned summary for the Alignment Newsletter:
Mesa optimization and inner alignment have become pretty important topics in AI alignment since the <@2019 paper@>(@Risks from Learned Optimization in Advanced Machine Learning Systems@) that introduced them was published. However, there are two quite different interpretations of inner alignment concerns:
1. **Objective-focused:** This approach considers _structural_ properties of the computation executed by the learned model. In particular, the risk argument is that sufficiently capable learned models will be executing some form of optimization algorithm (such as a search algorithm), guided by an explicit objective called the mesa-objective. This mesa-objective may not be identical to the base objective (though it should incentivize similar behavior on the training distribution), which can then lead to bad behavior out of distribution.
The natural decomposition is then to separate alignment into two problems: first, how do we specify an outer (base) objective that incentivizes good behavior in all situations that the model will ever encounter, and second, how do we ensure that the mesa-objective equals the base objective.
2. **Generalization-focused:** This approach instead talks about the behavior of the model out of distribution. The risk argument is that sufficiently capable learned models, when running out of distribution, will take actions that are still competent and high impact, but that are not targeted towards accomplishing what we want: in other words, their capabilities generalize, but their objectives do not.
Alignment can then be decomposed into two problems: first, how do we get the behavior that we want on the training distribution, and second, how do we ensure the model never behaves catastrophically on any input.
Planned opinion:
I strongly prefer the second framing, though I’ll note that this is not independent evidence -- the description of the second framing in the post comes from some of my presentations and comments and conversations with the authors. The post describes some of the reasons for this; I recommend reading through it if you’re interested in inner alignment.
For example, although our results show CoinRun models failed to learn the general capability of pursuing the coin, the more natural interpretation is that the model has learned a robust ability to avoid obstacles and navigate the levels,[7] but the objective it learned is something like “get to the end of the level,” instead of “go to the coin.”
It seems to me that every robustness failure can be interpreted as an objective robustness failure (as aptly titled in your other post). Do you have examples of a capability robustness failure that is not an objective robustness failure?
Suppose we train a model, and at some point during training the execution of the model's inference hacks the computer on which the model is being trained, and the computer starts doing catastrophic things via its internet connection. Does the generalization-focused approach consider this to be an outer alignment failure?
In the alignment community, there seem to be two main ways to frame and define objective robustness and inner alignment. They are quite similar, differing mainly in how they focus on the same basic underlying problem. We’ll call these the objective-focused approach and the generalization-focused approach. We don’t delve into these framing issues in Empirical Observations of Objective Robustness Failures, where we present empirical observations of objective robustness failures; instead, we think the matter deserves a separate discussion. These issues have so far been raised only in a few scattered comments on the Alignment Forum, so it seemed worthwhile to write a post describing the framings and their differences in an effort to promote further discussion in the community.
TL;DR
This post compares two different paradigmatic approaches to objective robustness/inner alignment:
Objective-focused approach
Generalization-focused approach
Special thanks to Rohin Shah, Evan Hubinger, Edouard Harris, Adam Shimi, and Adam Gleave for their helpful feedback on drafts of this post.
Objective-focused approach
This is the approach taken by “Risks from Learned Optimization” (RFLO) and elaborated upon by Evan Hubinger’s follow-up post “Clarifying inner alignment terminology.”
Essentially, this framing factors alignment into two problems: first, specifying an outer (base) objective that incentivizes good behavior in all situations the model might encounter (outer alignment), and second, ensuring that the mesa-objective the model ends up with matches that base objective (inner alignment).
We call this the objective-focused approach because of its emphasis on a model’s “objectives” in identifying the problems that remain when outer alignment is solved. This focus probably derives from long-standing worries about the goals of intelligent agents; many of the original cases for catastrophic risk from powerful AI were essentially that it seems dangerous to build very intelligent agents that have goals (or utility functions, etc.) that diverge from our own. With this worry in mind, the natural next question was “how do we get our AIs to have the right goals?” The above factorization aims to answer this question by 1) specifying the “right goals” and 2) making sure these actually become the model’s own.
This approach has a few limitations. First, there is no clear dividing line between capability robustness and objective robustness as defined; at the very least, they are not as orthogonal as suggested in "2D Robustness." Because the behavioral objective is defined as the objective recovered from perfect inverse reinforcement learning (IRL), every model has a behavioral objective. If the “perfect” IRL doesn’t correct for biases, i.e., if it assumes that the model’s behavior is optimal with respect to the behavioral objective, then the recovered behavioral objective will likely just be a different encoding of the policy. RFLO acknowledges this point.
It therefore appears that no agent could fail to be capability robust with respect to its own behavioral objective: since the behavioral objective is recovered via perfect IRL over infinite data, the model will always perform well on its behavioral objective, even outside the training distribution.[1] Taking this one step further, knowing the behavioral objective in the limit seems to obviate the need to discuss alignment in the first place: if we know what the model would do in every situation, we already know whether it’s safe. On the other hand, if perfect IRL were to correct for biases, the recovered behavioral objective would be closer to “what the model is actually ‘trying’ to do” than to “whatever the model actually does,” and the model could fail to be capability robust with respect to its own behavioral objective. However, it is very unclear what should count as a bias (as opposed to, for example, a strange or particular preference).
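To make the circularity concrete, here is a minimal formal sketch, in our own notation rather than anything given in RFLO, of how uncorrected perfect IRL can simply re-encode the policy:

```latex
% Minimal sketch, assuming a deterministic policy \pi; our own notation,
% not a construction given in RFLO itself.
% Define the degenerate reward that pays out exactly when the agent does
% whatever \pi would have done:
\[
  R_\pi(s, a) \;:=\; \mathbb{1}\,[\, a = \pi(s) \,].
\]
% Then \pi is optimal for R_\pi in every state, including states far outside
% the training distribution:
\[
  \pi(s) \in \operatorname*{arg\,max}_{a}\, R_\pi(s, a) \quad \text{for all } s.
\]
% An IRL procedure that assumes the demonstrator is optimal can always return
% a reward of this form, so every policy is trivially "capability robust" with
% respect to such a recovered behavioral objective.
```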
A notion of a behavioral objective that relied only on a finite number of observations (e.g. what the model appears to be optimizing for on the training data) would be just as problematic: it would be underspecified, since many possible objectives fit the behavior observed so far, and therefore unhelpful for predicting the model’s behavior under future distributional shift.
Another potential limitation of this approach is that powerful AI systems might not be well-described by relatively simple behavioral objectives; they might instead act according to a complex mix of heuristics that performed well on the training data. In this case, reasoning about a system’s “objectives” would be largely unhelpful in predicting its behavior out-of-distribution. Although humans intuitively think in terms of agency and goals when reasoning about what an intelligent system will do (in other words, apply the intentional stance), it is possible that such a paradigm will not apply well to powerful AI systems.
Having said all of this about the behavioral objective, however, the objective-focused approach is probably still a useful framing to the extent one thinks that our models will learn to plan and act flexibly according to internally-represented objectives/goals and that mechanistically understanding the way they implement this behavior is possible. This is probably the case if mesa-optimizers[2] are likely to arise, but also potentially possible if we can understand how a model’s goals are "formulated in terms of... concepts it possesses," perhaps with advanced interpretability tools, and reason about how the concepts to which the goals refer will generalize under distributional shift.
Generalization-focused approach
All of the aforementioned problems with a purely behavioral conception of a model’s objective imply that the overall robustness problem cannot be so cleanly subdivided into capability and objective robustness, at least without understanding how a model’s internal objectives are structurally implemented. It is probably more accurate to say that robustness failures can be put on a spectrum ranging from cases where the model’s capabilities fail to generalize to those where its “objective” fails to generalize. This suggests a different factorization of the alignment problem, with a slightly different notion of “outer alignment”:[3]
[Diagram: the robustness-centric factorization of alignment, from "Clarifying inner alignment terminology"] (source)
This is the approach suggested by Rohin Shah, highlighted in his presentation "Generalization > Utility" (slides) and mentioned in various comments since the publication of RFLO. In this framing, “outer alignment” (as a property of the base objective) is not defined in terms of whether the model that is optimal with respect to the objective in the limit of infinite data and perfect training is aligned; instead, an objective function is outer aligned if it incentivizes or produces the behavior we actually want on the training distribution.[4] This conception of outer alignment is probably better suited to the task of actually trying to build aligned models; in practice, we only get to check whether the models we train performed acceptably on the training distribution. It seems much more difficult to reason about a given base objective over every possible situation, and besides, how the base objective would score behavior outside the training distribution has no influence on the model that gets produced. Even a training objective that would appropriately score performance in every possible situation cannot fully determine the behavior of a model beyond the examples it was trained on; from the point of view of the model, the training signal is underspecified over inputs from distributions other than the one it was trained on.

Robustness, then, is about how a model trained with an outer aligned objective will generalize upon deployment/under distributional shift. In other words, the two problems here are:[5] first, how do we get the behavior we want on the training distribution, and second, how do we ensure the model never behaves catastrophically on any input.
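As a rough gloss in our own notation (not a formalization given in the cited posts; L, D, π, and π* are all symbols we are introducing here), the contrast between the two notions of “outer alignment” can be sketched as follows:

```latex
% Rough gloss, our own notation: L is the base objective (loss), D the
% training distribution, and \pi a trained model/policy.
%
% Objective-focused: L is outer aligned iff the optimum under perfect
% training and infinite data,
\[
  \pi^{*} \;\in\; \operatorname*{arg\,min}_{\pi}\; \mathbb{E}_{x \sim D}\big[\, L(\pi, x) \,\big],
\]
% behaves as we want in every situation it might later encounter.
%
% Generalization-focused: L is outer aligned iff any \pi achieving low
\[
  \mathbb{E}_{x \sim D}\big[\, L(\pi, x) \,\big]
\]
% exhibits the behavior we actually want on D; whether that behavior carries
% over to other distributions is then a separate robustness question.
```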
This approach’s emphasis on generalization means it does not explicitly rely on a notion of the model’s objective or goals in subdividing the overall alignment problem. However, conceiving of robustness failures on a spectrum ranging from those where the model fails to generalize capably to those where it generalizes capably but in ways that are no longer aligned with what we want is still important for solving the problem. The aforementioned technical fuzziness between capability and objective robustness remains, but the two are qualitatively distinct enough for this idea of a spectrum to be meaningful.
Indeed, this is why “objective robustness” and “capability robustness” remain children of the robustness node in the diagram; even though the boundary between the two remains ill-defined, it is necessary to categorize robustness failures by whether the model generalizes incapably or whether it generalizes competently while pursuing now-misaligned “objectives” (in the behavioral/intentional sense).[6]

Ultimately, the robustness failures we care most about are those where the model generalizes competently under distributional shift but in ways that are no longer aligned with what we want. Robustness failures where the model becomes inept could only produce risks of accidents, but capable models pursuing misaligned objectives could in principle leverage their capabilities to visit arbitrarily bad states, and at the extreme, deceptive models would be extremely dangerous. This is why “inner alignment” (broadly construed) is so important in the first place. Indeed, Shah’s threat model (slides) is exactly this kind of “bad generalization.” This is a qualitatively different kind of robustness failure from the kind usually discussed within the machine learning community, even if relying on a purely behavioral notion of a model’s “objective” means that it cannot be cleanly separated from the usual failure mode in a technical fashion. For example, although our results show CoinRun models failed to learn the general capability of pursuing the coin, the more natural interpretation is that the model has learned a robust ability to avoid obstacles and navigate the levels,[7] but the objective it learned is something like “get to the end of the level,” instead of “go to the coin.”
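To make the distinction concrete, here is a hedged sketch of the kind of evaluation that separates the two failure modes in a CoinRun-style setting. The environment factory, policy loader, policy interface, and info keys below are hypothetical stand-ins, not the actual code or interfaces from our experiments:

```python
# Hedged sketch: distinguishing capability-robustness failures from
# objective-robustness failures in a CoinRun-style environment.
# make_coinrun_env, load_trained_policy, policy.act, and the info keys
# are hypothetical stand-ins.

def evaluate(policy, env, episodes: int = 100):
    """Roll out the policy and record which 'objective' each episode satisfies."""
    got_coin = reached_end = 0
    for _ in range(episodes):
        obs, done, info = env.reset(), False, {}
        while not done:
            action = policy.act(obs)                         # hypothetical policy interface
            obs, _reward, done, info = env.step(action)      # old-style gym 4-tuple, assumed
        got_coin += int(info.get("collected_coin", False))       # hypothetical info key
        reached_end += int(info.get("reached_level_end", False)) # hypothetical info key
    return got_coin / episodes, reached_end / episodes

# Train-time distribution: coin always at the end of the level.
train_env = make_coinrun_env(randomize_coin_position=False)   # hypothetical factory
# Shifted distribution: coin placed at a random location in the level.
shifted_env = make_coinrun_env(randomize_coin_position=True)

policy = load_trained_policy("coinrun_agent.ckpt")            # hypothetical loader

for name, env in [("train-like", train_env), ("shifted", shifted_env)]:
    coin_rate, end_rate = evaluate(policy, env)
    print(f"{name}: reached coin {coin_rate:.2f}, reached level end {end_rate:.2f}")
```

An objective robustness failure would show up as a high level-end rate but a low coin rate on the shifted distribution: the capabilities generalize, but the “objective” does not.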
Understanding a model’s objectives or goals, if it has them, is thus instrumentally useful for reasoning about out-of-distribution generalization, but it is the generalization under distributional shift that is of primary importance. The generalization-focused framing includes as subcases instances where models misgeneralize out of distribution because they have misaligned goals that are only revealed under the distributional shift (e.g. deceptive models), but it also includes cases where, for example, models execute complicated behavioral heuristics that worked well on the training distribution and that will generalize capably, just no longer in the way we want. In both cases, the problem is to prevent catastrophic behavior in the worst case.
Terminology in our work
In either of the above framings of the matter, our work empirically demonstrates “objective robustness” failures[8] in modern reinforcement learning agents, as clearly as they can be distinguished from “ordinary” robustness failures where a model’s capabilities fail to generalize. We will use “objective robustness” throughout this work to refer to the property of interest because we feel the most straightforward interpretation of our results is that our agents have learned general enough capabilities that they can use them coherently out-of-distribution in ways that are no longer aligned with the training objective. Additionally, we know that many people will strongly associate “inner (mis)alignment” with the special case of mesa-optimization, and we wish to avoid any confusion over whether this is the first empirical demonstration of mesa-optimization (it is not).
However, we do not feel completely satisfied with this terminology and want to use this as an opportunity to reopen discussion about the terms and definitions we want to settle on as a community when discussing these issues. Besides the obvious desirability of having standardized terms and concepts in order to facilitate further technical work and enhance communication among those in the alignment community, being as clear and coherent as possible about how we frame and discuss inner alignment will help to bridge gaps in understanding between this community and the broader machine learning community, which is obviously already well-aware of robustness failures of the usual kind. We hope that this work can serve as a jumping-off point for these renewed discussions.
On the other hand, if capability robustness were considered with respect to the base objective, then every failure of capability robustness would also be a failure of objective robustness: trivially, an incapable model will have a behavioral objective that is different from the base objective. In either case, with the behavioral objective so defined, the entire “robustness” problem seems to technically collapse to just the “objective robustness” problem. ↩︎
In the sense of mesa-optimization originally intended by RFLO: learned mechanistic search/optimization for an internally-but-explicitly-represented mesa-objective. ↩︎
N.B. This diagram also comes from "Clarifying inner alignment terminology." The "intent alignment" factorization highlighted in the last section can be refactored into this robustness-centric version with the terms defined the same way, and where a model is “robust” “if it performs well on the base objective it was trained on even in deployment/out-of-distribution.” However, the factorization of the generalization-focused approach discussed here is not quite equivalent, despite sharing the same terms and overall structure. It has different notions of both “outer alignment” and “robustness”: a base objective is “outer aligned” if it incentivizes the behavior we want on the training distribution (not every possible situation), and a model is “robust” if it “generalizes acceptably” (not “performs well on the base objective out-of-distribution,” since we no longer expect an “outer aligned” base objective to capture what we want outside of the training distribution). The objective-focused approach still emphasizes solving robustness by trying to ensure that models robustly pursue aligned objectives with competence OOD. It is therefore arguably more naturally suited to the “intent alignment” factorization, since in either case a solution under its definitions requires a model to robustly “try” to do the right thing. ↩︎
This seems closely related to what Paul Christiano is aiming for with his low-stakes assumption. (See also his comment discussion with Rohin on the topic.) ↩︎
Rohin calls these "strong performance in normal situations" and "acceptable behavior in all situations." ↩︎
The arrows in this diagram can be interpreted in the same way as those in Evan’s original diagrams, where a problem is solved if its direct subproblems are solved. However, this is not intended to suggest that the two should be approached as entirely separate problems under the generalization-focused approach, as the boundary between them is fuzzy. ↩︎
After all, ProcGen was designed to test (capability) generalization in deep RL. ↩︎
Or, if we assume the generalization-focused approach, where “inner (mis)alignment” isn’t reserved for the specific case of mesa-optimization as defined in RFLO, we could use the term to refer to cases where the model's capabilities generalize but its "objective" does not, in which case this work empirically demonstrates inner misalignment. ↩︎