The Cake
Imagine that I want to bake a chocolate cake, and my sole goal in my entire lightcone and extended mathematical universe is to bake that cake. I care about nothing else. If the oven ends up a molten pile of metal ten minutes after the cake is done, if the leftover eggs are shattered and the leftover milk spilled, that’s fine. Baking that cake is my terminal goal.
In the process of baking the cake, I check my fridge and cupboard for ingredients. I have milk and eggs and flour, but no cocoa powder. Guess I’ll have to acquire some cocoa powder! Acquiring the cocoa powder is an instrumental goal: I care about it exactly insofar as it helps me bake the cake.
My cocoa-acquisition subquest is a very different kind of goal from my cake-baking quest. If the oven ends up a molten pile of metal shortly after the cocoa is acquired, if I shatter the eggs or spill the milk in my rush to the supermarket, then that’s a problem - a molten oven or shattered eggs or spilled milk would make it harder for me to bake the cake! More generally, in the process of acquiring cocoa powder, I want to not mess up other things which are helpful for making the cake. Unlike my terminal goal of baking a cake, my instrumental goal of acquiring cocoa powder comes with a bunch of implicit constraints about not making other instrumental subgoals much harder.
(If you’re already thinking “hmm, that sounds kinda like corrigibility”, then you have the right idea and that is indeed where we’re going with this.)
Generalizable takeaway: unlike terminal goals, instrumental goals come with a bunch of implicit constraints about not making other instrumental subgoals much harder.
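To make that takeaway a bit more concrete, here's a toy sketch of the difference. Everything in it (the function names, the numbers, the "effects" dictionaries) is made up purely for illustration; it's not a proposal for how a real agent actually works.

```python
# Toy sketch: how an instrumental goal differs from a terminal goal when
# scoring a candidate action. All names and numbers are illustrative
# assumptions, not a real agent design.

def terminal_goal_score(action_effects):
    # A terminal goal cares only about its own objective.
    return action_effects["cake_progress"]

def instrumental_goal_score(action_effects, other_subgoals):
    # An instrumental goal also carries implicit constraints: don't make the
    # *other* subgoals of the shared terminal goal harder.
    own_progress = action_effects["cake_progress"]
    collateral_damage = sum(
        action_effects["difficulty_added"].get(subgoal, 0.0)
        for subgoal in other_subgoals
    )
    return own_progress - collateral_damage

# Example: rushing to the supermarket gets the cocoa fast, but spills the milk.
rush_to_store = {
    "cake_progress": 1.0,
    "difficulty_added": {"use_milk": 0.75},
}
careful_trip = {
    "cake_progress": 0.9,
    "difficulty_added": {},
}

other_subgoals = ["use_milk", "use_oven"]
print(instrumental_goal_score(rush_to_store, other_subgoals))   # 0.25
print(instrumental_goal_score(careful_trip, other_subgoals))    # 0.9
print(terminal_goal_score(rush_to_store))                       # 1.0 -- the terminal goal happily rushes
```

The only point is the extra term: the instrumental version pays a cost for making sibling subgoals harder, and the terminal version doesn't.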
The Restaurant
Now imagine that I’m working as a chef in a big restaurant. My terminal goal is the restaurant’s long-term success; I care about nothing else. If the bombs drop, so long as the restaurant is still doing good business afterwards, I’ll be happy.
One day, a customer orders a fresh chocolate cake, and it falls to me to bake it. Now baking the cake is an instrumental goal.
One key difference from the previous example: in the restaurant, I don’t know all the things which future customers will order. I don’t know exactly which ingredients or tools will be needed tomorrow. So, in the process of baking the cake, I want to avoid wasting ingredients or destroying tools which might be useful for any of the dishes which future customers might order. My instrumental goal of baking a cake comes with a bunch of implicit constraints about not-making-harder a whole distribution of potential future instrumental subgoals.
Another key difference from the previous example: now there are multiple chefs, multiple subagents working on different instrumental subgoals. As part of the implicit constraints on my cake-baking, I need to not make their instrumental subgoals more difficult. And that notably brings in lots of informational constraints. For instance, if I use some eggs, I need to either put the rest of the eggs back in a location predictable to the other chefs, or I need to communicate to the other chefs where I left the eggs, so that they don’t have to spend time searching for the eggs later. So my instrumental goal of baking a cake comes with a bunch of constraints about being predictable to others, and/or making information about what I’m doing visible to others.
Generalizable takeaway: unlike terminal goals, instrumental goals come with implicit constraints about being predictable, making information about what one is doing visible, and not-making-harder a whole broad distribution of other possible instrumental goals.
… and now this sounds a lot like corrigibility.
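Sticking with toy sketches for a moment: the restaurant version adds two things to the earlier cake sketch, an expectation over a distribution of possible future subgoals, and a penalty for being unpredictable or illegible to the other chefs. As before, all the names, weights, and probabilities below are made-up illustrations rather than anything principled.

```python
# Toy extension of the earlier sketch for the restaurant setting: I don't know
# which subgoals will matter tomorrow, so I penalize *expected* damage over a
# distribution of possible future subgoals, plus an illegibility cost (e.g.
# leaving the eggs somewhere the other chefs can't predict). All names,
# weights, and probabilities are illustrative assumptions.

def restaurant_instrumental_score(action_effects, future_subgoal_dist,
                                  legibility_weight=1.0):
    own_progress = action_effects["own_progress"]
    expected_damage = sum(
        prob * action_effects["difficulty_added"].get(subgoal, 0.0)
        for subgoal, prob in future_subgoal_dist.items()
    )
    illegibility = action_effects["illegibility"]  # how hard I am to predict or query
    return own_progress - expected_damage - legibility_weight * illegibility

future_subgoal_dist = {"bake_cake": 0.25, "make_omelette": 0.5, "plate_dessert": 0.25}

hide_the_eggs = {
    "own_progress": 1.0,
    "difficulty_added": {"make_omelette": 0.5},
    "illegibility": 0.25,   # nobody knows where the eggs went
}
eggs_back_in_fridge = {
    "own_progress": 1.0,
    "difficulty_added": {},
    "illegibility": 0.0,
}

print(restaurant_instrumental_score(hide_the_eggs, future_subgoal_dist))       # 0.5
print(restaurant_instrumental_score(eggs_back_in_fridge, future_subgoal_dist)) # 1.0
```

Same progress on my own subgoal either way; the difference is entirely in the implicit constraints.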
Happy Instrumental Convergence?
Still sticking to the restaurant example: presumably many different instrumental goals in the restaurant require clean plates, empty counter space, and money. Those are all convergently instrumentally-useful resources within the restaurant.
Now, the way you might be used to thinking about instrumental convergence is roughly: “For lots of different goals in the restaurant, I need clean plates, empty counter space, and money. So, I might as well seize a bunch of those things upfront. Sure, that’ll screw over the other chefs, but I don’t care about that.” And that is how the reasoning might go if baking this one cake were a terminal goal.
But instrumental goals are different. If I’m the chef baking the cake as an instrumental goal, I instead reason: “For lots of different goals in the restaurant, a chef needs clean plates, empty counter space, and money. So, I should generally make sure those things are readily available to my fellow chefs as much as possible, so that they’ll be able to solve their own subproblems in service of our shared terminal goal. I’ll avoid using those resources up, and even make more of them available (by e.g. cleaning a counter top) whenever I have a relative advantage in doing so.”
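Here's the same contrast with toy numbers, treating total quality across dishes as a stand-in for the restaurant's success. The square-root "quality" function and the budget are arbitrary assumptions.

```python
# Two chefs share a fixed budget of some convergently-useful resource
# (ingredients, counter space, money); each dish's quality has diminishing
# returns; the shared terminal goal is total quality. Numbers are arbitrary.

import math

BUDGET = 10.0

def dish_quality(amount_used):
    return math.sqrt(amount_used)   # diminishing returns

# "Terminal-goal style": seize the convergently-useful resource upfront.
my_share_if_seizing = BUDGET

# "Instrumental-goal style": leave the resource available for the other chef.
# (An even split maximizes the total, since quality is concave and symmetric.)
my_share_if_sharing = BUDGET / 2

total_if_seizing = dish_quality(my_share_if_seizing) + dish_quality(BUDGET - my_share_if_seizing)
total_if_sharing = dish_quality(my_share_if_sharing) + dish_quality(BUDGET - my_share_if_sharing)

print(round(total_if_seizing, 2))   # 3.16 -- my cake is great, the restaurant suffers
print(round(total_if_sharing, 2))   # 4.47 -- better for the shared terminal goal
```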
I want to emphasize that this sort of reasoning should require no “special sauce”. It’s just a natural, implicit part of instrumental goals, as opposed to terminal goals.
One more interesting thing to highlight: so far, insofar as instrumental goals are corrigible, we've only talked about them being corrigible toward other instrumental subgoals of the same shared terminal goal. The chef pursuing the restaurant's success might be perfectly fine screwing over e.g. a random taxi driver in another city. But instrumental convergence potentially points towards general corrigibility.
Suppose, in the restaurant example, that clean plates, empty counter space, and money are the only significant convergently-instrumental goals. Then, in the restaurant environment, we get a natural notion of general corrigibility: if I just “try not to step on the toes” of instrumentally-convergent subgoals, then that will mostly keep me from stepping on the toes of most subgoals pursued by other restaurant-denizens, regardless of what our top-level goals are. The same strategy works for many different top-level goals in this restaurant, so it’s a generally corrigible strategy.
More generally, if I track instrumentally-convergent subgoals throughout the whole world, and generally "avoid stepping on the toes" of any of them... that would be a generally corrigible strategy.
And that unlocks the natural next jump.
All The Way Up
The natural next jump: do we even need the terminal goal at all? What if a mind’s top-level goals were the same “kind of thing” as instrumental goals more generally? Indeed, in some ways that would be a very natural structure for a general-purpose mind; it needs the ability to recursively handle instrumental subgoals anyway, so why have a whole extra different kind of goal at the top?
So long as instrumental convergence kicks in hard enough in the global environment, the mind can “try not to step on the toes” of instrumentally-convergent subgoals, and then that will mostly keep it from stepping on the toes of most other people's subgoals, regardless of the original terminal goal. So to build a generally corrigible system, we can imagine just dropping terminal goals altogether, and aim for an agent which is 'just' corrigible toward instrumentally-convergent subgoals.
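In the same toy style as the earlier sketches, "dropping the terminal goal" just means deleting the own-objective term and keeping only the implicit constraints. As always, the names and weights below are illustrative assumptions, not a design.

```python
# Toy sketch of an agent with no terminal-goal term at all: it scores actions
# purely by how little it steps on the toes of instrumentally-convergent
# subgoals, and how legible it stays while acting. Names/weights are assumptions.

def generally_corrigible_score(action_effects, convergent_subgoals,
                               legibility_weight=1.0):
    toe_stepping = sum(
        action_effects["difficulty_added"].get(subgoal, 0.0)
        for subgoal in convergent_subgoals
    )
    return -toe_stepping - legibility_weight * action_effects["illegibility"]

convergent_subgoals = {"clean_plates", "counter_space", "money"}
# Negative "difficulty_added" means the action makes that subgoal *easier*.
tidy_up = {"difficulty_added": {"counter_space": -0.5}, "illegibility": 0.0}
hoard_plates = {"difficulty_added": {"clean_plates": 0.75}, "illegibility": 0.25}

print(generally_corrigible_score(tidy_up, convergent_subgoals))        # 0.5
print(generally_corrigible_score(hoard_plates, convergent_subgoals))   # -1.0

# Note there's no "what do I terminally want?" term anywhere; the same scoring
# applies no matter what the top-level goal would have been.
```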
For AI purposes, this would be a much safer kind of agent. It would be an AI which naturally tries not to “step on other agents’ toes”, naturally behaves such that it doesn’t get in the way of other agents’ goals (and in particular humans’ goals). But unlike e.g. naive formulations of “low-impact” agents, such an AI would also actively try to behave in ways predictable and legible to other agents, and make sure that other agents can easily query information about its own behavior.
In short, it sounds like all the properties of corrigibility we always hoped for, all coming from a single coherent underlying concept (i.e. not thrown together ad-hoc), and therefore likely to also generalize in ways we like to properties we haven’t yet thought to ask for.
Research Threads
This concept of corrigibility immediately suggests lots of research approaches.
First, on the theory side, there’s the problem of fleshing out exactly what the “type signature” of an instrumental goal is, with all those implicit constraints. The main way one would tackle this problem would be (a toy instance of the first few steps is sketched just after the list):
- Pick some class of optimization problems, and a way to break it into apparent “subproblems”.
- Work through some examples to check that the sort of phenomena we’re interested in actually do show up for that class of optimization problems and notion of “subproblems”.
- Explicitly spell out the “implicit constraints” of the subproblems in this formulation.
- Repeat for other formulations, and look for the generalizable patterns in how the implicit constraints of subproblems are naturally represented. Operationalize those patterns.
- Look for positive arguments that this operationalization of the relevant patterns is “the unique right way” to formulate things - e.g. derivations from some simple desiderata, mediation in some class of algorithms, etc.
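As a (very) toy instance of the first few steps, here's one possible formulation. None of it comes from the post itself; the class of problems, the notion of "subproblem", and the representation of the implicit constraints are all just assumptions chosen for concreteness.

```python
# Toy instance of the first three steps, under assumed definitions:
#   - Class of optimization problems: getting a kitchen from one state to another.
#   - "Subproblem": one chunk of that, with things it relies on and things it changes.
#   - Implicit constraints, spelled out explicitly: don't disturb anything the
#     *other* subproblems rely on.

from dataclasses import dataclass

@dataclass
class Subproblem:
    name: str
    needs: set      # state/resources this subproblem relies on
    changes: set    # state/resources it deliberately changes

def implicit_constraints(all_subproblems, me):
    # Everything some other subproblem needs, which isn't something I'm
    # deliberately changing as part of my own subgoal: I must leave it intact.
    others_needs = set().union(*(s.needs for s in all_subproblems if s.name != me.name))
    return others_needs - me.changes

subproblems = [
    Subproblem("acquire_cocoa", needs={"money", "car"}, changes={"cocoa"}),
    Subproblem("mix_batter", needs={"cocoa", "eggs", "milk", "flour"}, changes={"batter"}),
    Subproblem("bake", needs={"batter", "oven", "cake_pan"}, changes={"cake"}),
]

print(implicit_constraints(subproblems, subproblems[0]))
# e.g. {'eggs', 'milk', 'flour', 'batter', 'oven', 'cake_pan'} (set order varies):
# the things the shopping trip must not mess up.
```

The interesting work is in checking whether this kind of explicit constraint set actually captures the phenomena from the examples above, and how it generalizes across other formulations.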
On the empirical side, one could try clever ways of training instrumental rather than terminal goals into a system. For instance, the restaurant example suggests training a system to work with many instances of itself or other systems in order to solve top-level goals in a reasonably general environment. Then, y’know… see what happens.
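As a sketch of what such a setup might minimally look like (the environment, the random stand-in policy, and the reward here are all placeholder assumptions; a real version would use actual trained models):

```python
# Minimal sketch of the kind of empirical setup described above: several copies
# of the same policy share an environment and are rewarded only for the *shared*
# top-level goal, so any subgoals they develop are instrumental rather than terminal.

import random

class SharedKitchen:
    """Toy shared environment: agents either consume or restock a shared resource."""
    def __init__(self, stock=5):
        self.stock = stock
        self.dishes_done = 0

    def step(self, action):
        if action == "use_resource" and self.stock > 0:
            self.stock -= 1
            self.dishes_done += 1
        elif action == "restock":
            self.stock += 1

    def top_level_reward(self):
        # Reward attaches to the shared goal (dishes served), not to any
        # individual agent's subtask.
        return self.dishes_done

class RandomPolicy:
    def act(self, observation):
        return random.choice(["use_resource", "restock", "wait"])

def run_episode(num_agents=3, steps=20):
    env = SharedKitchen()
    agents = [RandomPolicy() for _ in range(num_agents)]  # many instances of one policy
    for _ in range(steps):
        for agent in agents:
            env.step(agent.act(observation=env.stock))
    return env.top_level_reward()

print(run_episode())
```

The load-bearing feature is just that reward attaches only to the shared top-level goal, so behaviors like "leave the eggs where the others can find them" have to show up (or fail to show up) as learned instrumental behavior, rather than being rewarded directly.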
This is great.
Now, given that you're already talking about instrumental goals "trying not to step on each other's toes", what else would they need to deserve the name of "subagents"?
Note that the instrumental goal is importantly distinct from the subagent which pursues that instrumental goal. I think a big part of the insight in this post is to say "corrigibility is a property of instrumental goals, separate from the subagents which pursue those goals"; we can study the goals (i.e. problem factorization) rather than the subagents in order to understand corrigibility.