So, I certainly agree that pseudo-alignment is a type of robustness/distributional shift problem. In fact, I would describe “Risks from Learned Optimization” as a deep dive on a particular subset of robustness problems that might be particularly concerning from a safety standpoint. Thus, in that sense, whether it's really a “new” sort of robustness problem is less the point than the analysis that the paper presents of that robustness problem. That being said, I do think that at least the focus on mesa-optimization was fairly novel in cashing out the generalization failures we wanted to discuss in terms of the sorts of learned optimization processes that might exhibit them (as well as the discussion of deception, as you mention).
I don't understand what "safety properties of the base optimizer" could be, apart from facts about the optima it tends to produce.
I agree with that, and I think the sentence you're quoting there is meant for a different sort of reader, one with a less clear concept of ML. One way to interpret the passage that might help is that it's just saying that guarantees about global optima don't necessarily translate to local optima, or to the actual models you might find in practice.
But even without mesa-optimizers, cases of ML generalization failure often involve the latter, not just the former.
I also agree with this. I would describe my picture here as something like: Pseudo-aligned mesa-optimization ⊂ Objective generalization without capability generalization ⊂ Robustness problems. Given that picture, I would say that the pseudo-aligned mesa-optimizer case is the most concerning from a safety perspective, then generic objective generalization without capability generalization, then robustness problems in general. And I would argue that it makes sense to break it down in that way precisely because you get more concerning safety problems as you go narrower.
More detail on the capability vs. objective robustness picture is also available here and here.
I disagree with the framing that "pseudo-alignment is a type of robustness/distributional shift problem". This is literally true based on how it's defined in the paper, but I think in practice we should expect approximately aligned mesa-optimizers that do very bad things on-distribution (without being detected).
According to me (and at least some if not all of the authors of that paper disagree with me), the main point is highlighting the possibility of capabilities generalizing while objectives do not. I agree that this is a failure mode that we knew about before, but it's not one that people were paying much attention to. At the very least, when people said they worked on "robustness", they weren't distinguishing between capability failure vs. objective failure (though of course the line between these is blurry).
On the other hand, decade-plus-old arguments about the instrumental utility of good behavior while still dependent on humans have more or less the same format: seeing good behavior is better evidence of intelligence (capabilities generalizing) than it is of benevolence (goals 'generalizing').
The big difference is that the olde-style argument would be about actual agents being evaluated by humans, while the mesa-optimizers argument is about potential configurations of a reinforcement learner being evaluated by a reward function.
So, I think the other answers here are adequate, but not super satisfying. Here is my attempt.
The frame of "generalization failures" naturally primes me (and perhaps others) to think of ML as hunting for useful patterns, but instead fitting to noise. While pseudo-alignment is certainly a type of generalization failure, it has different connotations: that of a system which has "correctly learned" (in the sense of internalizing knowledge for its own use), but still does not perform as intended.
The mesa-optimizers paper defines inner optimizers as performing "search". I think there are some options here and we can define things slightly differently.
In Selection vs Control, I split "optimization" up into two types: "selection" (which includes search, and also weaker forms of selection such as mere sampling bias), and "control" (which implies actively steering the world in a direction, but doesn't always imply search, EG in the case of a thermostat).
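To make the distinction concrete, here is a minimal Python sketch (the thermostat framing comes from the distinction above, but the function names, numbers, and set point are my own illustrative choices): a controller steers the world toward a target with a fixed feedback rule and never considers alternatives, while a selector/searcher explicitly enumerates candidate actions and picks whichever scores best under an internal objective.

```python
# Control without search: a thermostat-style feedback rule. It reliably
# steers temperature toward the set point, but never enumerates or
# evaluates alternative actions.
def thermostat_step(temp, set_point=20.0):
    if temp < set_point:
        return temp + 1.0   # heat
    if temp > set_point:
        return temp - 1.0   # cool
    return temp             # already at the set point

# Selection/search: generate candidate actions and pick the one that
# scores best under an explicit internal objective.
def search_step(temp, set_point=20.0, candidates=(-1.0, 0.0, 1.0)):
    def objective(action):
        return -abs((temp + action) - set_point)  # closer to the set point is better
    return temp + max(candidates, key=objective)

temp_control = temp_search = 25.0
for _ in range(10):
    temp_control = thermostat_step(temp_control)
    temp_search = search_step(temp_search)
print(temp_control, temp_search)  # both settle at 20.0
```

Both processes end up steering the world in the same direction; only the second does anything recognizable as internal search, which is roughly the gap between the mesa-controllers and mesa-searchers discussed next.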
In mesa-search vs mesa-control, I applied this distinction to mesa-optimization, arguing that mesa-optimizers which do not use search could still present a danger.
Mesa-controllers are inner optimizers that reliably steer the world in a particular direction. These are distinguished from 'mere' generalization failures because generalization failures do not usually have such an impact. If we define pseudo-alignment in this way, you could say we are defining it by its impact. Clearly, more impactful generalization failures are more concerning. However, you might think it's a little weird to invent entirely new terminology for this case, rather than referring to it as "impactful generalization failures".
Mesa-searchers are inner optimizers characterized by performing internal search. You could say that they're clearly computing something coherent, just not what was desired (which may not be the case for 'mere' generalization failures). These are more clearly a distinct phenomenon, particularly if the intended behavior didn't involve search. (It would seem odd to call them "searching generalization failures" imho.) But the safety concerns are less directly obvious.
It's only when we put these two together that we have something both distinct and of safety concern. Looking for "impactful generalization failures" gives us relatively little to grab onto. But it's particularly plausible that mesa-searchers will also be mesa-controllers, because the machinery for complex planning is present. So, this combination might be particularly worth thinking about.
we'll see this appear mathematically in the definition of the property or in theorems about it, whether or not we have explicitly considered the possibility of mesa-optimizers. (I suppose the argument could be that some candidate safety properties implicitly assume no optimum is a mesa-optimizer, and thus appear to apply to all optima while not really doing so -- somewhat analogous to early notions of continuity which implicitly assumed away the Weierstrass function. But if so, I need a real example of such a case to convince me.)
I tend to agree with this line of thinking. IE, it seems intuitive to me that highly robust alignment technology would rely on arguments that don't explicitly mention inner optimization anywhere, because those failure modes are ruled out via the same general arguments which rule out other failure modes. However, it also seems plausible to me that it's useful to think about inner alignment along the way.
You wanted a convincing example. I think The Solomonoff Prior Is Malign could be such an example. Before becoming aware of this argument, it seemed pretty plausible to me that the Solomonoff prior described a kind of rational ideal for induction. This isn't a full "here's a safety argument that would go through if we assumed no-inner-optimizers", but in a parallel universe where we have access to infinite computation, it could be close to that. (EG, someone could argue that the chance of Solomonoff Induction resulting in generalization failures is very low, and then change their mind when they hear the Solomonoff-is-malign argument.)
Also, it seems highly plausible to me that inner alignment is a useful thing to have in mind for "less than highly robust" alignment approaches (approaches which seek to grab the lower-hanging fruit of alignment research, to create systems that are aligned in the worlds where achieving alignment isn't so hard after all). These approaches can, for example, employ heuristics which make it somewhat unlikely that inner optimizers will emerge. I'm not very interested in that type of alignment research, because it seems to me that alignment technology needs to be backed up with rather tight arguments in order to have any realistic chance of working; but it makes sense for some people to think about that sort of thing.
I have only just read the mesa optimizers paper, and I don't understand what it adds to the pre-existing picture that "ML can fail to generalize outside the train distribution and this is bad."
The discussion in the paper generally assumes a background distinction between "training" and "deployment" (or an equivalent distinction), and discusses models which succeed on the base objective during "training" but not during "deployment."
In the sections about "deception," this happens in a special way quite unlike the ordinary failures to generalize that we see in ML today (and arises under correspondingly exotic conditions). But, in cases other than "deception," the paper describes dynamics and outcomes that seem identical to the ordinary generalization problem in ML:
For example, the red-door/blue-door setting (trained where the target is a red door, tested where it is a blue door) is just a textbook generalization failure, which can happen with or without a mesa-optimizer.
Additionally, when the paper makes remarks that seem to be addressing my question, I find these remarks confused (or perhaps just confusing).
For instance, take the paper's remark about "safety properties of the base optimizer".
I don't understand what "safety properties of the base optimizer" could be, apart from facts about the optima it tends to produce. That is, I can't think of a property that would appear to confer "safety" until we consider the possibility of producing mesa-optimizers, and then stop appearing thus.
A safety property of an optimizer is some kind of assurance about the properties of the optima; if such a property only holds for a subset of optima (the ones that are not mesa-optimizers), we'll see this appear mathematically in the definition of the property or in theorems about it, whether or not we have explicitly considered the possibility of mesa-optimizers. (I suppose the argument could be that some candidate safety properties implicitly assume no optimum is a mesa-optimizer, and thus appear to apply to all optima while not really doing so -- somewhat analogous to early notions of continuity which implicitly assumed away the Weierstrass function. But if so, I need a real example of such a case to convince me.)
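To illustrate how I'd expect that to show up mathematically (my notation, not the paper's: write Theta* for the set of optima of the base objective, M for the optima that happen to be mesa-optimizers, and S for a safety predicate), the restricted property has a visibly different quantifier domain:

```latex
% My notation, not the paper's:
%   \Theta^{*} = \arg\min_{\theta} L_{\mathrm{base}}(\theta)   (the optima of the base objective)
%   M          = the subset of optima that implement mesa-optimizers
%   S(\theta)  = ``\theta is safe''
\begin{gather*}
\text{property over all optima:}\qquad \forall \theta \in \Theta^{*}.\; S(\theta)\\
\text{property over non-mesa-optimizers only:}\qquad \forall \theta \in \Theta^{*} \setminus M.\; S(\theta)
\end{gather*}
```

If a candidate property really only holds on the second, restricted domain, that restriction should be visible in its definition or in the theorems about it, whether or not anyone had mesa-optimizers in mind when writing it down.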
The following seems to offer a different answer to my question:
This seems to contrast two ways of failing on samples from outside the train distribution. Supposing a model has learned to "understand" train samples and to use that understanding to aim for a target, it can then either (a) fail to "understand" the new samples, so that its capabilities break down, or (b) "understand" them just as well but aim at the wrong target, so that its capabilities generalize while its objective does not.
But even without mesa-optimizers, cases of ML generalization failure often involve the latter, not just the former. A dumb classifier with no internal search, trained on the red-door setting described above, would "understand" the blue-door test data well enough to apply its internalized objective perfectly; even in this textbook-like case, exhibited by arbitrarily simple classifiers, the "capabilities generalize." This kind of problem is definitely bad; I just don't see what's new about it.
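To make the "dumb classifier" point concrete, here is a minimal sketch (the red-door/blue-door framing follows the setting above; the feature encoding, data sizes, and use of scikit-learn's LogisticRegression are my own illustrative choices): during training, "is red" and "is the target" coincide perfectly, so an ordinary classifier with no internal search latches onto color, and at test time it still reads the inputs perfectly well while confidently pursuing the wrong feature.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_doors(n, targets_are_red):
    """Each row describes a door as [is_red, is_blue, distance]; label is 1 if it is the target."""
    is_target = rng.integers(0, 2, size=n)
    is_red = is_target if targets_are_red else 1 - is_target
    is_blue = 1 - is_red
    distance = rng.uniform(0, 10, size=n)
    return np.column_stack([is_red, is_blue, distance]), is_target

# Training: the target is always a red door, so "red" and "target" are indistinguishable.
X_train, y_train = make_doors(1000, targets_are_red=True)
clf = LogisticRegression().fit(X_train, y_train)

# Deployment: the target is now a blue door.
X_test, y_test = make_doors(1000, targets_are_red=False)
print("train accuracy:", clf.score(X_train, y_train))  # close to 1.0
print("test accuracy:", clf.score(X_test, y_test))     # close to 0.0: it still reads the
                                                       # features fine, it just goes after red
```

Its capabilities (reading the door features) transfer without any trouble; the only thing that fails to generalize is which feature plays the role of the objective, and there is no mesa-optimizer anywhere in the picture.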