If I've understood it correctly, I think this is a really important point, so thanks for writing a post about it. This post highlights that mesa objectives and base objectives are typically going to be of different "types", because the base objective will typically be designed to evaluate things in the world as humans understand it (or as modelled by the formal training setup) whereas the mesa objective will be evaluating things in the AI's world model (or if it doesn't really have a world model, then more local things like actions themselves as opposed to their distant consequences).
Am I acting in bad faith?... Surely I "get what they mean"?
I'm certainly glad to see people suspending their sense of "getting it" when it comes to reference (aka pointers, aka representation) since I don't think we have solid foundations for these topics and I think they are core issues in AI alignment.
Thank you for writing this post. I think the issue goes even deeper. The reward function doesn't even have the type signature of "objective."
Nice post.
Therefore, either we can try to revise the framework slightly, essentially omitting the notions of robust alignment and 'internalization of the base objective' and focussing more on revised versions of 'proxy alignment' and 'approximate alignment' as descriptors of what is essentially the best possible situation in terms of alignment.
Have you seen Hubinger's more recent post, "More variations on pseudo-alignment"? It amends the list of pseudo-alignment types originally given in "Risks from Learned Optimization" to include a couple more.
Your claim above that the best we could hope for may be a form of proxy alignment or approximate alignment reminds me of the following pseudo-alignment type he introduced in that more recent post. In the description of this type, he also seems to agree with you that robust alignment is very difficult or "unstable" (though perhaps you go further in saying it's impossible):
Corrigible pseudo-alignment. In the paper, we defined corrigible alignment as the situation in which "the base objective is incorporated into the mesa-optimizer's epistemic model and [the mesa-optimizer's] objective is modified to 'point to' that information." We mostly just talked about this as a form of robust alignment—however, as I note in "Towards a mechanistic understanding of corrigibility," this is a very unstable operation, requiring you to get your pointer just right. Thus, I think it's better to talk about corrigible alignment as the class of possible relationships between the base and mesa-objectives defined by the model having some sort of pointer to the base objective, including both corrigible robust alignment (if the pointer is robust) and corrigible pseudo-alignment (if the pointer is to some sort of non-robust proxy). In particular, I think this distinction is fairly important to why deceptive alignment might be more likely than robust alignment, as it points at why robust alignment via corrigibility might be quite difficult (which is a point we made in the paper, but one which I think is made much clearer with this distinction).
Suppose I have two robots, one of which wants to turn the world into paperclips and the other of which wants to turn the world into staples.
Your argument here seems to extend to saying that we can't call these robots or their preferences "misaligned with each other," because robot A's preferences are used to search over actions of robot A, and vice versa for robot B.
I don't think that argument makes sense. The action spaces are different, but both robots are still trying to affect the same world and steer it in different directions. We could formalize this by defining for each robot a utility function over states of the world.
There is one important type signature point here, which is that robots are made of atoms and utility functions are not. The robots' utility functions don't live in the physical robots (they're not the right type of stuff), they actually live in our abstract model of the robots. This doesn't mean it's futile to compare things, though - it's fine to use abstractions within their domains of validity.
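One way to make the "utility function over states of the world" comparison concrete is sketched below. Everything in it is an illustrative assumption rather than something from the post: the world states are hypothetical (paperclips, staples) counts, the two utility functions simply count one item each, and `disagreement` is just one possible way to compare two utilities defined on the same state space.

```python
from itertools import combinations

# Hypothetical world states as (paperclips, staples) counts. Purely illustrative.
WORLD_STATES = [(0, 0), (10, 0), (0, 10), (5, 5), (8, 2)]

def u_paperclip(state):
    """Utility attributed to robot A: number of paperclips in the world state."""
    paperclips, _ = state
    return paperclips

def u_staple(state):
    """Utility attributed to robot B: number of staples in the world state."""
    _, staples = state
    return staples

def disagreement(u, v, states):
    """Fraction of state pairs that u and v rank in strictly opposite order."""
    pairs = list(combinations(states, 2))
    opposite = sum(
        1 for a, b in pairs
        if (u(a) - u(b)) * (v(a) - v(b)) < 0
    )
    return opposite / len(pairs)

if __name__ == "__main__":
    print(disagreement(u_paperclip, u_staple, WORLD_STATES))
```

The comparison is only well-typed because both utilities were given the same domain (world states) in our abstract model of the robots. Neither robot's action space appears anywhere in it.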
Note: I skimmed this post because I didn't want to think about maths right now.
Take a Q-value table for, I don't know, a 3-state MDP. Apply TD(0) a ton until you get convergence. How has this system not learnt its base objective? Similarly, train any system to completion. How has it not learnt its base objective? If learning the base objective is impossible, then I think I've got a contradiction on my hands.
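A minimal sketch of that thought experiment follows. The comment does not specify the MDP, the rewards, or which TD variant is meant, so the deterministic 3-state chain, the constants, and the Q-learning-style one-step TD update used here are all assumptions chosen for illustration.

```python
GAMMA = 0.9        # discount factor (assumed)
ALPHA = 0.5        # learning rate (assumed)
TERMINAL = 2       # state 2 ends the episode
ACTIONS = (0, 1)   # 0 = stay, 1 = move right

def step(state, action):
    """Deterministic toy dynamics: return (next_state, reward)."""
    next_state = state + 1 if action == 1 else state
    reward = 1.0 if next_state == TERMINAL else 0.0
    return next_state, reward

def train_q_table(sweeps=1000):
    """Sweep every (state, action) pair repeatedly with a one-step TD update."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in range(3)}
    for _ in range(sweeps):
        for s in (0, 1):  # the terminal state's values stay at zero
            for a in ACTIONS:
                s_next, r = step(s, a)
                target = r + GAMMA * max(q[s_next].values())
                q[s][a] += ALPHA * (target - q[s][a])
    return q

def greedy_policy(q):
    """Pick the highest-valued action in each non-terminal state."""
    return {s: max(ACTIONS, key=lambda a: q[s][a]) for s in (0, 1)}

if __name__ == "__main__":
    q = train_q_table()
    print(q, greedy_policy(q))
```

In this toy setting the converged table does reproduce the discounted return that the training signal defines, which is the sense of "learnt its base objective" the comment appeals to. Whether a table of numbers indexed by (state, action) has an objective of the same type as the base objective is the question the post raises.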
Your post seems to be focused more on pointing out a missing piece in the literature rather than asking for a solution to the specific problem (which I believe is a valuable contribution). Regardless, here is roughly how I would understand “what they mean”:
Let be the task space, the output space, the model space, our base objective, and the mesa objective of the model for input . Assume that there exists some map mapping internal objects to outputs by the model, such that .
Given this setup, how can we reconcile and ? Assume some distribution over the task space is given. Moreover, assume there exists a function mapping tasks to utility functions over outputs, such that . Then we could define a mesa objective as where if and otherwise we define as some very small number or (and replace by above). We can then compare and directly via some distance on the spaces and .
Why would such a function exist? In stochastic gradient descent, for instance, we are in fact evaluating models based on the outputs they produce on tasks distributed according to some distribution . Moreover, such a function should probably exist given some regularity conditions imposed on an arbitrary objective (inspired by the axioms of expected utility theory).
Why would a function exist? Some function connecting outputs to the internal search space has to exist because the model is producing outputs. In practice, the model might not optimize perfectly and thus might not always choose the argmax (potentially leading to suboptimality alignment), but this could probably still be accounted for somehow in this model. Moreover, could theoretically differ between different inputs, but again one could probably change definitions in some way to make things work.
If is a mesa-optimizer, then there should probably be some way to make sense of the mathematical objects describing mesa objective, search space, and model outputs as described above. Of course, how to do this exactly, especially for more general mesa-optimizers that only optimize objectives approximately, etc., still needs to be worked out more.
These issues of preferences over objects of different types (internal states, policies, actions, etc.) and how to translate between them are also discussed in the post Agents Over Cartesian World Models.
For m∈S such that m is a mesa-optimizer, let Σ be the space it optimizes over, and g : Σ → R be its utility function.
I know you said "which we need not notate", but I am going to say that for s∈S and x∈X, that s(x)∈A, and A is the space of actions (or possibly, s(x)∈A_x, and A_x is the space of actions available in the situation x).
(Though maybe you just meant that we need not notate separately from s, the map from X to A which s defines. In which case, I agree, and as such I'm writing instead of saying that something belongs to the function space . )
For to have its optimization over have any relevance, there has to be some connection between the chosen (chosen by m) , and .
So, the process by which m produces m(x) when given x, should involve the selected .
Moreover, the selection of the ought to depend on x in some way, as otherwise the choice of is constant each time, and can be regarded as just a constant value in how m functions.
So, it seems that what I said was should instead be either , or (in the latter case I suppose one might say )
Call the process that produces the action using the choice of by the name
(or more generally, ) .
is allowed to also use randomness in addition to and . I'm not assuming that it is a deterministic function. Though come to think of it, I'm not sure why it would need to be non-deterministic? Oh well, regardless.
Presumably whatever is being used to select , depends primarily (though not necessarily exclusively) on what s(x) is for various values of x, or at least on something which indicates things about that, as f is supposed to be for selecting systems which take good actions?
Supposing that, for the mesa-optimizer, the inner optimization procedure (which I don't have a symbol for) and the inner optimization goal (i.e. ) are separate enough, one could ask "what if we had m, except with replaced with , and looked at how the outputs of and differ, where and are selected (by m's optimizer) by optimizing for the goals , and respectively?".
Supposing that we can isolate the part of how f(s) depends on s which is based on what is or tends to be for different values of , then there would be a "how would differ if m used instead of ?".
If in place of would result in things which, according to how works, would be better, then it seems like it would make sense to say that isn't fully aligned with ?
Of course, what I just described makes a number of assumptions which are questionable:
The first of these is also connected to another potential flaw with what I said, which is, it seems to describe the alignment of the combination of (the optimizer m uses) along with , with , rather than just the alignment of with .
So, alternatively, one might say something about like, disregarding how the searching behaves and how it selects things that score well at the goal , and just compare how and tend to compare when and are generic things which score well under and respectively, rather than using the specific procedure that uses to find something which scores well under , and this should also, I think, address the issue of possibly not having a cleanly separable "how it optimizes for it" method that works for generic "what it optimizes for".
The second issue, I suspect to not really be a big problem? If we are designing the outer-optimizer, then presumably we understand how it is evaluating things, and understand how that uses the choices of for different .
I may have substantially misunderstood your point?
Or, was your point that the original thing didn't lay these things out plainly, and that it should have?
Ok, reading more carefully, I see you wrote
I can certainly imagine that it may be possible to add in details on a case-by-case basis or at least to restrict to a specific explicit class of base objectives and then explicitly define how to compare mesa-objectives to them.
and the other things right before and after that part, and so I guess something like "it wasn't stated precisely enough for the cases it is meant to apply to / was presented as applying as a concept more generally than made sense as it was defined" was the point, which I had sorta missed initially.
(I have no expertise in these matters; unless shown otherwise, assume that in this comment I don't know what I'm talking about.)
The observations I make here have little consequence from the point of view of solving the alignment problem. If anything, they merely highlight the essential nature of the inner alignment problem. I will reject the idea that robust alignment, in the sense described in Risks From Learned Optimization, is possible at all. And I therefore also reject the related idea of 'internalization of the base objective', i.e. I do not think it is possible for a mesa-objective to "agree" with a base-objective or for a mesa-objective function to be “adjusted towards the base objective function to the point where it is robustly aligned.” I claim that whenever a learned algorithm is performing optimization, one needs to accept that an objective which one did not explicitly design is being pursued. At present, I refrain from attempting to propose my own adjustments to the framework, or to build on the existing literature or to develop my own theory. I am certainly not against doing any of those things, but they are things to possibly be pursued later; none of them is the purpose of this post.
To make my main point, I will introduce only a bare minimum of mathematical notation. We will show that a mesa-objective always has a different type signature to a base objective and that the default assumption ought to be that there is no way to compare them in general and certainly no general way to interpret what it means for them to ‘agree’. Suppose that an optimizer is searching through a space S of systems. At this time, I do not want to attempt to unpack what it means to 'search', but, naively, we can imagine that there is an objective function f : S → R, which determines something that we might call the 'search criterion'. The idea of course is that the optimizer is a system that is 'searching' through the set S and judging different points according to the criterion that higher values of f are better.
In the background, there is some 'task' and naively we can think of this as being represented by a 'task space' X which consists of all of the different possible 'presentations' or 'instances' of the task. For example, perhaps the task is choosing the next move in a game of Go or the next action in a real-time strategy video game. In these examples, a given x∈X would represent a board position in Go, say, or a single snapshot of the game-state in the video game. Then, in general, given x∈X and s∈S, we can think that s(x) is the output of s on the task instance x or the action taken by s when presented with x (i.e. s(x) denotes the next board move in Go or the next action to be taken in the video game). So each element of S defines a map from the task space X to some kind of output space or space of possible actions, which we need not notate.
Now, it is possible that there exists m∈S which works in the following way: Whenever the output of m on an instance x of the task needs to be evaluated, i.e. whenever m(x) is computed, what happens is that m searches over another search space Σ and looks for elements that score highly according to some other objective function g : Σ → R. Whenever this is the case, we say that such an m∈S is a mesa-optimizer and that the original optimizer - the one that searches over S - is the base optimizer. Notice that in some way, elements of Σ must in turn correspond to outputs/actions, because given some x, the mesa-optimizer m conducts a search over Σ to determine what output m(x) is, but that is all just part of the internal workings of m and we need not 'know' or notate how this correspondence works.
In Risks From Learned Optimization, Hubinger et al. write:
So: The mesa-objective is the criterion that m is using in its search: It expresses the idea that higher values of g are better. And the base objective refers to the criterion that higher values of f are better.
Inner Alignment, Robust Alignment, and Pseudo Alignment
The domain of f is the space S - the space of systems that the base optimizer is searching over (and which can be represented mathematically as a space of functions, each of which is from X to the output or 'action' space). On the other hand, the domain of g is Σ. As mentioned above, we might want to think of Σ as corresponding to (a subset of) the output space, but either way, a priori, there is nothing to suggest that S and Σ are not different spaces. The two objective functions used as criteria in these searches have different domains and it is not clear how to compare them.
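The mismatch in domains can be sketched with explicit types. This is a toy illustration under assumptions of my own choosing, not a formalization from Risks From Learned Optimization: task instances and actions are integers, Σ is taken to be pairs (x, a), and the particular scoring rules inside `base_objective` and `mesa_objective` are hypothetical.

```python
from typing import Callable, Tuple

TaskInstance = int                           # stand-in for x in X
Action = int                                 # stand-in for an output s(x)
System = Callable[[TaskInstance], Action]    # an element s of S
Candidate = Tuple[TaskInstance, Action]      # stand-in for an element of Σ

TASKS = tuple(range(6))   # hypothetical finite task space
ACTIONS = (0, 1, 2)       # hypothetical action space

def base_objective(s: System) -> float:
    """f : S → R (toy version): fraction of tasks where s(x) == x % 3."""
    return sum(1.0 for x in TASKS if s(x) == x % 3) / len(TASKS)

def mesa_objective(sigma: Candidate) -> float:
    """g : Σ → R (toy version): prefers candidates whose action equals x % 2."""
    x, a = sigma
    return -float(abs(a - x % 2))

def mesa_optimizer(x: TaskInstance) -> Action:
    """A system m in S: computes m(x) by searching Σ for a high-g element."""
    candidates = [(x, a) for a in ACTIONS]
    best = max(candidates, key=mesa_objective)
    return best[1]   # the Σ-to-output correspondence, left unnotated in the post

if __name__ == "__main__":
    print(base_objective(mesa_optimizer))   # f applied to a system
    print(mesa_objective((3, 1)))           # g applied to an element of Σ
```

Here `base_objective(mesa_optimizer)` and `mesa_objective((3, 1))` both type-check, but applying f to an element of Σ, or g to a system, does not. Any statement that f and g 'agree' needs some additional bridging map that the two functions themselves do not supply.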
In Risks From Learned Optimization, it is written that "The problem posed by misaligned mesa-optimizers is... the gap between the base objective and the mesa-objective... We will call the problem of eliminating the base-mesa objective gap the inner alignment problem...". I think that they are absolutely right to point to the difference between the base objective and a mesa-objective as being the source of an important issue, but I find referring to it as a "gap", at least at the level of generality posited, to be somewhat misleading. We are not dealing with two objects that are in principle comparable but just so happen to be separated by a gap (a gap waiting to be narrowed by the correct clever idea, say). Instead, the difference, which is due to the different type signatures of the objective functions, is essential in character and rather means that they are, in general, incomparable.
Consider the definitions of robust alignment and pseudo alignment:
What might it possibly mean to have mesa-optimizers with mesa-objectives that "agree" with the base objective on past training data or "across distributions"? Again, the base objective refers to a criterion used to select between different systems. How can a mesa-objective, a criterion that a particular one of these systems uses to select between different actions, 'agree' or 'disagree' with it on any particular set of data or "across distributions"? Without further development of the framework, or further explanation, it's impossible to know precisely what this could mean. Robust alignment seems at best to be a very odd, extreme case (where somehow we have ended up with something like f = g and/or S = Σ?) and at worst simply impossible.
Later, Hubinger attempts to clarify the terminology in a separate post: Clarifying Inner Alignment Terminology. This attempt at clarification and increased rigour should obviously be encouraged, but it is immediately clear that some of the main definitions are still unsatisfactory. The last of the seven definitions is that of Inner Alignment:
This version of the definition seems to turn crucially on the notion that a policy could be "impact aligned" with the base objective. Let us turn to Hubinger's own definition of "Impact Alignment", from the same post, to find out what this means precisely:
It seems that we are only told what impact alignment means in the context of an agent and humanity. So we are still missing what seems to be the very core of this edifice: What does it really mean for a mesa-optimizer to be - in whatever is the appropriate sense of the word - 'aligned'? What could it mean for a mesa-objective to 'agree with' the base objective?
Internalization of the base objective
In the Deceptive Alignment post, the idea of “Internalization of the base objective” is introduced. Arguably this is the point at which one might expect the issues I have raised to be most fleshed out, because in highlighting the possibility of “internalization” of the base objective, i.e. that it is possible for a mesa-objective function to be “adjusted towards the base objective function to the point where it is robustly aligned,” there is an implicit claim that robust alignment really can occur. So to understand this phenomenon, we might look for an explanation as to how this occurs. But the ensuing analysis is somewhat weak and vague, to the point that it is almost just a restatement of the claim that it purports to explain:
I could try to give my own interpretation of what happens when information about the base objective “flows into” the learned algorithm “via” the optimization process, but I would be making something up that does not appear in the text. And what follows is really just a discussion of some possible ways by which the mesa-optimizer comes to be able to use information about the base objective (it could get information about the base objective directly via the base optimizer or it could get it from the task inputs). None of it goes towards alleviating the specific concerns laid out above and none of it really explains with any conviction how true “internalization” happens. Moreover, a footnote admits that in fact the two routes by which the mesa-optimizer may come to be able to use information about the base objective do not even neatly correspond to the dichotomy given by ‘internalization of the base objective’ vs. ‘modelling of the base objective’.
Conclusions
My observations here run counter to any argument which suggests it is possible to 'close the gap' between the base and mesa objectives. As stated above, this suggests that the inner alignment problem has an essential nature: I claim that whenever mesa-optimization occurs, one needs to accept that internally, there is pressure towards a goal which one did not explicitly design.
Of course a close reading of what has been said here really only shows that we cannot rely on the specific formalization I have used (though it may be no more than a few mathematical functions) while still maintaining the exact theoretical framework described in Risks From Learned Optimization. Therefore, either we can try to revise the framework slightly, essentially omitting the notions of robust alignment and 'internalization of the base objective' and focussing more on revised versions of 'proxy alignment' and 'approximate alignment' as descriptors of what is essentially the best possible situation in terms of alignment. Or, it may be the case that the fault is with my formalization and that what I claim are conceptual issues are little more than notational or mathematical curiosities. If the latter is indeed the case, then at the very least, we need to be explicit about whatever tacit assumptions have been made that imply that formalization along the lines I have outlined cannot provide a permissible analysis. For example, I can certainly imagine that it may be possible to add in details on a case-by-case basis or at least to restrict to a specific explicit class of base objectives and then explicitly define how to compare mesa-objectives to them. Perhaps those who object to my view will claim that this is what is really going on in people's minds and it's just that it has not been spelled out. However, at present, I believe that at the level of generality that Risks From Learned Optimization strives for, we simply cannot speak of mesa-objectives ‘agreeing with’ or even really of being ‘adjusted towards’ base objectives.
Remarks
As a final set of remarks, I wanted to briefly discuss the general attitude I have taken here. One might read this and think either that yes, this all seems reasonable, but since it is not about addressing the alignment problem at all, what was the point? Or perhaps one might think that it could all be avoided if only I were to make a more charitable reading of the Risks From Learned Optimization posts in the first place. Am I acting in bad faith?... Surely I "get what they mean"? Indeed, often I do feel like I can see or could guess what the authors are getting at. Why then, have I gone out of my way to take them at their word to such a great extent, just so I can point out inconsistencies?
I want to end by describing some general, if somewhat vague and half-baked, thoughts about this kind of theoretical/conceptual AI Alignment work and hopefully this will help to answer the above questions. In my humble opinion, one of the things that this type of work ought to be 'worried about' is that it exists in a kind of no-man's land between, on the one hand, more traditional academic work in fields like computer science and philosophy and, on the other hand, more 'mainstream' ML Safety, shall we say. For a while I have been wondering whether or not this kind of theoretical alignment work is doomed to remain in this no-man's land, propped up by a few success stories but mostly fed by a steady stream of informal arguments, futurological speculation, and 'hand-waving' on blogs and comment sections. I of course do not fully know, but here are a couple of things that have come to mind when trying to think about this. Firstly, when we have neither the luxury of mathematical proof nor the crutch of being backed up by working code and empirical results, it is even more important to subject arguments to a high level of scrutiny. There should be (and hopefully can be) a high bar of intellectual and academic rigour for theoretical/conceptual work in this area. It needs to strive to be as clean and clear as possible. And it's worth saying that one reason for this is so that it stands on its own two feet, so to speak, when interrogated outside of communities like this one. Secondly, I feel it is important that the best arguments and ideas we have - and good critiques of them - appear 'in the literature'.
I certainly don't advocate for a completely traditional model of dissemination and publication (there are many advantages to the Alignment Forum and the prevailing rationalist/EA/longtermist ecosystem and their ways of doing things) and of course many great ideas start out as hand-waving and speculation, but it will ultimately not be enough that some idea is 'generally known' in the online/EA alignment communities or can be put together by combing through comment sections and the minds of the relevant people if said idea is never really or cannot be 'written up' in a truly convincing way. As I've said, these remarks are not fully fleshed out and further discussion here doesn't really seem appropriate. For now, the idea was to explain some of my motivation for taking time to post something like this. All discussion and comments are welcome.