Discussion: Objective Robustness and Inner Alignment Terminology
In the alignment community, there seem to be two main ways to frame and define objective robustness and inner alignment. They are quite similar, differing mainly in how they focus on the same basic underlying problem. We’ll call these the objective-focused approach and the generalization-focused approach. We don’t delve into these framing issues in Empirical Observations of Objective Robustness Failures, where we present empirical observations of objective robustness failures; we think the matter is worth a separate discussion. These issues have been mentioned only infrequently, in a few comments on the Alignment Forum, so it seemed worthwhile to write a post describing the framings and their differences in an effort to promote further discussion in the community.

TL;DR

This post compares two paradigmatic approaches to objective robustness/inner alignment:

Objective-focused approach

* Emphasis: “How do we ensure our models/agents have the right (mesa-)objectives?”
* Outer alignment: “an objective function r is outer aligned if all models that perform optimally on r in the limit of perfect training and infinite data are intent aligned.”
* Outer alignment is a property of the training objective.

Generalization-focused approach

* Emphasis: “How will this model/agent generalize out-of-distribution?”
* Considering a model’s “objectives” or “goals,” whether behavioral or internal, is instrumentally useful for predicting OOD behavior, but what you ultimately care about is whether the model generalizes “acceptably.”
* Outer alignment: a model is outer aligned if it performs desirably on the training distribution.
* Outer alignment is a property of the tuple (training objective, training data, training setup, model).

Special thanks to Rohin Shah, Evan Hubinger, Edouard Harris, Adam Shimi, and Adam Gleave for their helpful feedback on drafts of this post.

Objective-focused approach

This is the approa