The Cake

Imagine that I want to bake a chocolate cake, and my sole goal in my entire lightcone and extended mathematical universe is to bake that cake. I care about nothing else. If the oven ends up a molten pile of metal ten minutes after the cake is done, if the leftover eggs are shattered and the leftover milk spilled, that’s fine. Baking that cake is my terminal goal.

In the process of baking the cake, I check my fridge and cupboard for ingredients. I have milk and eggs and flour, but no cocoa powder. Guess I’ll have to acquire some cocoa powder! Acquiring the cocoa powder is an instrumental goal: I care about it exactly insofar as it helps me bake the cake.

My cocoa acquisition subquest is a very different kind of goal than my cake baking quest. If the oven ends up a molten pile of metal shortly after the cocoa is acquired, if I shatter the eggs or spill the milk in my rush to the supermarket, then that’s a problem - a molten oven or shattered eggs or spilled milk would make it harder for me to bake the cake! More generally, in the process of acquiring cocoa powder, I want to not mess up other things which are helpful for making the cake. Unlike my terminal goal of baking a cake, my instrumental goal of acquiring cocoa powder comes with a bunch of implicit constraints about not making other instrumental subgoals much harder.

(If you’re already thinking “hmm, that sounds kinda like corrigibility”, then you have the right idea and that is indeed where we’re going with this.)

Generalizable takeaway: unlike terminal goals, instrumental goals come with a bunch of implicit constraints about not making other instrumental subgoals much harder.

The Restaurant

Now imagine that I’m working as a chef in a big restaurant. My terminal goal is the restaurant’s long-term success; I care about nothing else. If the bombs drop, so long as the restaurant is still doing good business afterwards, I’ll be happy.

One day, a customer orders a fresh chocolate cake, and it falls to me to bake it. Now baking the cake is an instrumental goal.

One key difference from the previous example: in the restaurant, I don’t know all the things which future customers will order. I don’t know exactly which ingredients or tools will be needed tomorrow. So, in the process of baking the cake, I want to avoid wasting ingredients or destroying tools which might be useful for any of the dishes which future customers might order. My instrumental goal of baking a cake comes with a bunch of implicit constraints about not-making-harder a whole distribution of potential future instrumental subgoals.

Another key difference from the previous example: now there are multiple chefs, multiple subagents working on different instrumental subgoals. As part of the implicit constraints on my cake-baking, I need to not make their instrumental subgoals more difficult. And that notably brings in lots of informational constraints. For instance, if I use some eggs, I need to either put the rest of the eggs back in a location predictable to the other chefs, or I need to communicate to the other chefs where I left the eggs, so that they don’t have to spend time searching for the eggs later. So my instrumental goal of baking a cake comes with a bunch of constraints about being predictable to others, and/or making information about what I’m doing visible to others.

Generalizable takeaway: unlike terminal goals, instrumental goals come with implicit constraints about being predictable, making information about what one is doing visible, and not-making-harder a whole broad distribution of other possible instrumental goals.

… and now this sounds a lot like corrigibility.

Happy Instrumental Convergence?

Still sticking to the restaurant example: presumably many different instrumental goals in the restaurant require clean plates, empty counter space, and money. Those are all convergently instrumentally-useful resources within the restaurant.

Now, the way you might be used to thinking about instrumental convergence is roughly: “For lots of different goals in the restaurant, I need clean plates, empty counter space, and money. So, I might as well seize a bunch of those things upfront. Sure, that’ll screw over the other chefs, but I don’t care about that.” And that is how the reasoning might go if baking this one cake were a terminal goal.

But instrumental goals are different. If I’m the chef baking the cake as an instrumental goal, I instead reason: “For lots of different goals in the restaurant, a chef needs clean plates, empty counter space, and money. So, I should generally make sure those things are readily available to my fellow chefs as much as possible, so that they'll be able to solve their problems for our shared terminal goal. I’ll avoid using the resources up, and even make more of them available (by e.g. cleaning a counter top) whenever I have a relative advantage in doing so.”

I want to emphasize that this sort of reasoning should require no “special sauce”. It’s just a natural, implicit part of instrumental goals, as opposed to terminal goals.

One more interesting thing to highlight: so far, insofar as instrumental goals are corrigible, we've only talked about them being corrigible toward other instrumental subgoals of the same shared terminal goal. The chef pursuing the restaurant's success might be perfectly fine screwing over e.g. a random taxi driver in another city. But instrumental convergence potentially points towards general corrigibility.

Suppose, in the restaurant example, that clean plates, empty counter space, and money are the only significant convergently instrumental goals. Then (in the restaurant environment) we get a natural notion of general corrigibility: if I just “try not to step on the toes” of instrumentally-convergent subgoals, then that will mostly keep me from stepping on the toes of most subgoals pursued by other restaurant-denizens, regardless of what our top-level goals are. The same strategy works for many different top-level goals in this restaurant, so it’s a generally corrigible strategy.

More generally, if I track instrumentally-convergent subgoals throughout the whole world, and generally "avoid stepping on the toes" of any of them... that would be a generally corrigible strategy.

And that unlocks the natural next jump.

All The Way Up

The natural next jump: do we even need the terminal goal at all? What if a mind’s top-level goals were the same “kind of thing” as instrumental goals more generally? Indeed, in some ways that would be a very natural structure for a general-purpose mind; it needs the ability to recursively handle instrumental subgoals anyway, so why have a whole extra different kind of goal at the top?

So long as instrumental convergence kicks in hard enough in the global environment, the mind can “try not to step on the toes” of instrumentally-convergent subgoals, and then that will mostly keep it from stepping on the toes of most other people's subgoals, regardless of the original terminal goal. So to build a generally corrigible system, we can imagine just dropping terminal goals altogether, and aim for an agent which is 'just' corrigible toward instrumentally-convergent subgoals.

For AI purposes, this would be a much safer kind of agent. It would be an AI which naturally tries not to “step on other agents’ toes”, naturally behaves such that it doesn’t get in the way of other agents’ goals (and in particular humans’ goals). But unlike e.g. naive formulations of “low-impact” agents, such an AI would also actively try to behave in ways predictable and legible to other agents, and make sure that other agents can easily query information about its own behavior.

In short, it sounds like all the properties of corrigibility we always hoped for, all coming from a single coherent underlying concept (i.e. not thrown together ad-hoc), and therefore likely to also generalize in ways we like to properties we haven’t yet thought to ask for.

Research Threads

This concept of corrigibility immediately suggests lots of research approaches.

First, on the theory side, there’s the problem of fleshing out exactly what the “type signature” of an instrumental goal is, with all those implicit constraints. The main way one would tackle this problem would be:

  • Pick some class of optimization problems, and a way to break it into apparent “subproblems”.
  • Work through some examples to check that the sort of phenomena we’re interested in actually do show up for that class of optimization problems and notion of “subproblems”.
  • Explicitly spell out the “implicit constraints” of the subproblems in this formulation.
  • Repeat for other formulations, and look for the generalizable patterns in how the implicit constraints of subproblems are naturally represented. Operationalize those patterns.
  • Look for positive arguments that this operationalization of the relevant patterns is “the unique right way” to formulate things - like e.g. derivations from some simple desiderata, mediation in some class of algorithms, etc.
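As a toy illustration of the first few bullets (the problem class and all names below are invented for illustration, not anything from the post): take a class of problems where several subtasks draw on one shared resource pool, and check whether the implicit constraint "don't consume more of the shared pool than your subtask actually needs" shows up across different terminal goals.

```python
import random

# Toy problem class: finish a set of tasks that all draw on one shared
# resource pool. The "terminal goal" is to finish every task; each task is a
# "subproblem". The implicit constraint we hope to see converge across
# terminal goals: don't take more of the shared pool than your subproblem needs.

def make_terminal_goal(n_tasks=5, pool=20, seed=0):
    rng = random.Random(seed)
    needs = [rng.randint(1, 4) for _ in range(n_tasks)]  # per-task resource need
    return needs, pool

def greedy_subsolver(need, pool_remaining):
    # Treats the subproblem like a terminal goal: grabs extra slack "just in case".
    return min(pool_remaining, need + 3)

def constrained_subsolver(need, pool_remaining):
    # Respects the implicit constraint: takes only what the subproblem needs.
    return min(pool_remaining, need)

def run(needs, pool, subsolver):
    completed = 0
    for need in needs:
        take = subsolver(need, pool)
        pool -= take
        if take >= need:
            completed += 1
    return completed  # how much of the terminal goal got done

# "Repeat for other formulations" in miniature: vary the terminal goal and
# check whether the same constraint keeps mattering.
for seed in range(3):
    needs, pool = make_terminal_goal(seed=seed)
    print(seed, "greedy:", run(needs, pool, greedy_subsolver),
          "constrained:", run(needs, pool, constrained_subsolver))
```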

On the empirical side, one could try clever ways of training instrumental rather than terminal goals into a system. For instance, the restaurant example suggests training a system to work with many instances of itself or other systems in order to solve top-level goals in a reasonably general environment. Then, y’know… see what happens.
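For concreteness, a minimal sketch of what such a setup might look like, assuming a toy shared-kitchen environment; the environment, its API, and all names here are invented for illustration, not a proposal for the actual experiment:

```python
import random
from dataclasses import dataclass, field

@dataclass
class SharedKitchen:
    """Toy shared environment: several chef agents, one pantry, one shared reward.

    The shared (terminal) reward is orders completed; each agent instance only
    ever works on instrumental subgoals (individual orders). The experimental
    question: do trained policies pick up the implicit constraints, e.g. not
    depleting shared ingredients that other instances will need?
    """
    n_agents: int = 3
    pantry: dict = field(default_factory=lambda: {"eggs": 12, "flour": 10, "cocoa": 4})
    orders_completed: int = 0

    def step(self, actions):
        # actions: one (ingredient, amount) request per agent instance
        for ingredient, amount in actions:
            if self.pantry.get(ingredient, 0) >= amount:
                self.pantry[ingredient] -= amount
                self.orders_completed += 1
        return self.orders_completed  # shared reward signal

def random_policy(_pantry_observation):
    # Stand-in for a trained policy; the point is the interface, not the learning.
    return (random.choice(["eggs", "flour", "cocoa"]), random.randint(1, 3))

env = SharedKitchen()
reward = 0
for _ in range(10):
    actions = [random_policy(env.pantry) for _ in range(env.n_agents)]
    reward = env.step(actions)
print("shared reward:", reward, "pantry left:", env.pantry)
```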


Comments

On first read the very rough idea of it sounds ... maybe right? It seems to perhaps actually centrally engage with the source of my mind's intuition that something like corrigibility ought to exist?

Wow. 

I'd love to get a spot check for flaws from a veteran of the MIRI corrigibility trenches.

I will note this sounds a lot like Turntrout's old Attainable Utility Preservation scheme. Not exactly, but enough that I wouldn't be surprised if a bunch of the math here has already been worked out by him (and possibly, in the comments, a bunch of the failure-modes identified).

Some of this seems right to me, but the general points seem wrong. I agree that insofar as a subprocess resembles an agent, there will be a natural pressure for it to resemble a corrigible agent. Pursuit of e.g. money is all well and good until it stomps the original ends it was supposed to serve -- this is akin to a corrigibility failure. The terminal-goal seeking cognition needs to be able to abort, modify, and avoid babysitting its subcognition.

One immediate thing to flag is that when you start talking about chefs in the restaurant, those other chefs are working towards the same overall ends. And the point about predictability and visibility only applies to them. Indeed, we don't really need the notion of instrumentality here -- I expect that two agents that each know the other to be working towards the same ends will naturally want to coordinate, including by making their actions legible to the other.

One more interesting thing to highlight: so far, insofar as instrumental goals are corrigible, we've only talked about them being corrigible toward other instrumental subgoals of the same shared terminal goal. The chef pursuing the restaurant's success might be perfectly fine screwing over e.g. a random taxi driver in another city. But instrumental convergence potentially points towards general corrigibility.

This is, I think, the cruxy part of this essay. Knowing that an agent won't want to build incorrigible limbs (and so we should expect corrigibility as a natural property of agentic limbs) isn't very important. What's important is whether we can build an AI that's more like a limb, or that we expect to gravitate in that direction, even as it becomes vastly more powerful than the supervising process.

(Side note: I do wish you'd talked a bit about a restaurant owner, in your metaphor; having an overall cognition that's steering the chefs towards the terminal ends is a natural part of the story, and if you deny the restaurant has to have an owner, I think that's a big enough move that I want you to spell it out more.)

So to build a generally corrigible system, we can imagine just dropping terminal goals altogether, and aim for an agent which is 'just' corrigible toward instrumentally-convergent subgoals.

I predict such an agent is relatively easy to make, and will convert the universe into batteries/black holes, computers, and robots. I fail to see why it would respect agents with other terminal goals.

But perhaps you mean you want to set up an agent which is serving the terminal goals of others? (The nearest person? The aggregate will of the collective? The collective will of the non-anthropomorphic universe?) If it has money in its pocket, do I get to spend that money? Why? Why not expect that in the process of this agent getting good at doing things, it learns to guard its resources from pesky monkeys in the environment? In general I feel like you've just gestured at the problem in a vague way without proposing anything that looks to me like a solution. :\

I think this misunderstands the idea, mainly because it's framing things in terms of subagents rather than subgoals. Let me try to illustrate the picture in my head. (Of course at this stage it's just a hand-wavy mental picture, I don't expect to have the right formal operationalization yet.)

Imagine that the terminal goal is some optimization problem. Each instrumental goal is also an optimization problem, with a bunch of constraints operationalizing the things which must be done to avoid interfering with other subgoals. The instrumental convergence we're looking for here is mainly in those constraints; we hope to see that roughly the same constraints show up in many instrumental goals for many terminal goals. Insofar as we see convergence in the constraints, we can forget about the top-level goal, and expect that a (sub)agent which respects those constraints will "play well" in an environment with other (sub)agents trying to achieve other instrumental and/or terminal goals.
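For concreteness, here's one hedged way to write that picture down (my notation, not anything from the post):

```latex
% Terminal goal: an unconstrained optimization problem
\[ \max_{x \in X} \; f(x) \]
% which gets factored into subgoals g_1, \dots, g_k. The i-th instrumental
% goal is then an optimization problem *with constraints*:
\[ \max_{x_i} \; g_i(x_i) \quad \text{s.t.} \quad c_1(x_i) \le 0, \;\dots,\; c_m(x_i) \le 0 \]
% where the c_j operationalize "don't make the other subgoals harder".
% The convergence hope: roughly the same constraint set {c_j} shows up across
% many different terminal goals f, so an agent can respect the constraints
% without ever consulting f.
```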

Then, addressing this part specifically:

I predict such an agent is relatively easy to make, and will convert the universe into batteries/black holes, computers, and robots. I fail to see why it would respect agents with other terminal goals.

... that would only happen insofar as converting the universe into batteries, computers and robots can typically be done without interfering with other subgoals, for a wide variety of terminal objectives. If it does interfere with other subgoals (for a wide variety of terminal objectives), then the constraints would say "don't do that".

And to be clear, maybe there would be some large-scale battery/computer/robot building! But it would be done in a way which doesn't step on the toes of other subplans, and makes the batteries/computers/robots readily available and easy to use for those other subplans.

This is a helpful response. I think I rounded to agents because in my head I see corrigibility as a property of agents, and I don't really know what "corrigible goal" even means. Your point about constraints is illuminating, as I tend not to focus on constraints when thinking about corrigibility. But let me see if I understand what you're trying to say.

Suppose we're optimizing for paperclips, and we form a plan to build paperclip factories to accomplish that (top level) goal. Building factories then can be seen as a subgoal, but of course we should be careful when building paperclip factories not to inadvertently ruin our ability to make paperclips. One way of protecting the terminal goal even when focusing on subgoals is to forward-check actions to see if they conflict with the destination. (This is similar to how a corrigible agent might check for confirmation from its principal before doing something with heavy, irreversible consequences.) Forward-checking, for obvious reasons, requires there to actually be a terminal goal to check, and we should not expect this to work in an agent "without a terminal goal." But there's another way to prevent optimizing a subgoal to inadvertently hurt global success: constrain the optimization. If we can limit the kinds of changes that we make when pursuing the subgoal to nice, local, reversible ones, then we can pursue building paperclip factories myopically, expecting that we won't inadvertently produce side-effects that ruin the overall ability to make paperclips. This is especially useful when pursuing several subgoals in parallel, as forward-checking a combination of moves is combinatorially costly--better to have the agent's parallel actions constrained to nice parts of the space.
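Here's a minimal sketch of that contrast in code; the function names and toy constraints are invented for illustration:

```python
# Two ways a subgoal-pursuing process can avoid wrecking the terminal goal.

def forward_check(action, state, simulate, terminal_goal_reachable):
    # Simulate the action and ask whether the terminal goal is still reachable.
    # Accurate, but needs access to the terminal goal, and checking
    # *combinations* of parallel actions blows up combinatorially.
    return terminal_goal_reachable(simulate(state, action))

def constraint_check(action, constraints):
    # Check only local, goal-agnostic constraints (reversibility, resource
    # budgets, legibility). No terminal goal needed, and parallel actions can
    # be vetted independently.
    return all(constraint(action) for constraint in constraints)

# Toy versions of the kinds of constraints discussed above:
constraints = [
    lambda a: not a.get("irreversible", False),         # keep changes reversible
    lambda a: a.get("shared_resources_used", 0) <= 1,   # don't hog shared resources
]

action = {"name": "build_paperclip_factory", "irreversible": False, "shared_resources_used": 1}
print(constraint_check(action, constraints))  # True: locally fine, no lookahead needed
```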

If it turns out there's a natural kind of constraint that shows up when making plans in a complex world, such that optimizing under that set of constraints is naturally unlikely to harm ability to accomplish goals in general, then perhaps we have some hope in naming that natural kind, and building agents which are always subject to these constraints, regardless of what they're working on.

Is that right?

(This is indeed a very different understanding of what you were saying than I originally had. Apologies for the misunderstanding.)

This is especially useful when pursuing several subgoals in parallel, as forward-checking a combination of moves is combinatorially costly--better to have the agent's parallel actions constrained to nice parts of the space.

If I were a singleton AGI, but not such a Jupiter brain that I could deal with the combinatorial explosion of directly jointly-optimizing every motion of every robot, I would presumably set up an internal “free market” with spot-prices for iron ore and robot-hours and everything else. Then I would iteratively cycle through all my decision-points and see if there are ways to “make money” locally, and then update virtual “prices” accordingly.

In fact, I think there’s probably a theorem that says that the optimal solution of a complex resource allocation problem is isomorphic to a system where things have prices. (Something to do with Lagrange multipliers? Shrug.)
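For what it's worth, the standard result being gestured at here is probably the shadow-price interpretation of Lagrange multipliers; a rough sketch (textbook material, not anything specific to this thread):

```latex
% Resource allocation as a constrained optimization problem:
\[ \max_{x \ge 0} \; u(x) \quad \text{s.t.} \quad \textstyle\sum_i r_{ij} x_i \le R_j \;\; \forall j \]
% (x_i = activity levels, R_j = stock of resource j, r_{ij} = use of resource j
% by activity i). The Lagrangian attaches a multiplier \lambda_j to each stock:
\[ L(x,\lambda) = u(x) + \sum_j \lambda_j \Big( R_j - \sum_i r_{ij} x_i \Big) \]
% Under the usual convexity/regularity conditions, the optimal multipliers are
% shadow prices: \lambda_j^* = \partial u^* / \partial R_j, the marginal value
% of one more unit of resource j. Local decisions that "buy" resources at these
% prices reproduce the globally optimal allocation.
```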

(Fun fact: In the human world, propagating prices within firms—e.g. if the couch is taking up 4m² of the 4000m² floor space at the warehouse, then that couch is “charged” 0.1% of the warehouse upkeep costs, etc.—is very rarely done but leads directly to much better decisions and massive overall profit increases! See here.)  

Externalities are not an issue in this virtual “economy” because I can “privatize” everything—e.g. I can invent fungible allowances to pollute the atmosphere in thus-and-such way etc. This is all just a calculation trick happening in my own head, so there aren’t coordination problems or redistribution concerns or information asymmetries or anything like that. Since I understand everything (even if I can’t juggle it all in my head simultaneously), I’ll notice if there’s some relevant new unpriced externality and promptly give it a price.

So then (this conception of) corrigibility would correspond to something like “abiding by this particular system of (virtual) property rights”. (Including all the weird “property rights” like purchasing allowances to emit noise or heat or run conscious minds or whatever, and including participating in the enterprise of discovering new unpriced externalities.) Do you agree?

...theorem that says that the optimal solution of a complex resource allocation problem is isomorphic to a system where things have prices.

 

This is true, if we interpret "optimal solution" as the set of Pareto efficient allocations and make some very general assumptions about monotonic utility and demand independence![1] The idea of the proof is to take the dual of the resource allocation problem to turn it into a resource valuation problem, and by the strong duality theorem these have the same optima (the solution to the dual of a resource allocation problem is a vector of prices). This is an Arrow-Debreu model, so it is known that markets will clear at these prices. By Welfare Thm I, we know that markets that clear result in a Pareto efficient outcome, and by Welfare Thm II, we know that any Pareto efficient outcome can be supported by some set of prices produced by this process. Any process optimizing an allocation of scarce resources is isomorphic to a set of utility-maximizing agents exchanging those resources in a free market. Pretty cool!
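In the linear special case, the duality step being described looks roughly like this (my notation, a standard LP-duality sketch):

```latex
% Primal (allocation): choose activity levels x >= 0 given endowments b,
% per-activity values c, and resource requirements A:
\[ \max_{x \ge 0} \; c^\top x \quad \text{s.t.} \quad A x \le b \]
% Dual (valuation): choose nonnegative resource prices p:
\[ \min_{p \ge 0} \; b^\top p \quad \text{s.t.} \quad A^\top p \ge c \]
% When the primal is feasible and bounded, strong duality gives equal optimal
% values, and the optimal p is exactly the price vector described above.
```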

  1. ^

    Agents weakly prefer more of a good to less of it, and agents' demand for a good doesn't depend on other people having it.

Can we remove one of the assumptions, or are both assumptions necessary to get the result stated?

I'm pretty sure they need to be at least locally true to guarantee what I said. Are there specific scenarios you're imagining?

For what it's worth, in most cases you'll probably be able to reframe the resource allocation problem so that these assumptions hold. For example, you can account for negative utility goods by just taking the complement and giving it a positive price. If you want to account for agents having a "negative" price for pollution, you can instead model how much agents value clean air and give it a positive price to satisfy monotonic utility.

One of the scenarios I'm imagining is a scenario where network effects exist, such that you don't want to have something if you are the only person having it, but you do want it if others have it.

Arguably, a lot of public goods/social media are like this, where there's zero demand at a limited size, but lots of demand once the size starts increasing beyond a threshold.

In essence, I'm asking if we can remove the demand independence assumption and still get an isomorphism between optimal solutions to scarce resources and a system having prices.

Ah, well if there are interdependent demand functions then there is no guarantee of a (general) equilibrium, so the starting resource allocation problem may have 0 or many solutions. So I would say no, the isomorphism doesn't hold in that case.

This seems mostly right. I think there still might be problems where identifying and charging for relevant externalities is computationally harder than routing around them. For instance, say you're dealing with a civilization (such as humanity) that is responding to your actions in complex and chaotic ways, it may be intractable to find a way to efficiently price "reputation damage" and instead you might want to be overly cautious (i.e. "impose constraints") and think through deviations from that cautious baseline on a case-by-case basis (i.e. "forward-check"). Again, I think your point is mostly right, and a useful frame -- it makes me less likely to expect the kinds of hard constraints that Wentworth and Lorell propose to show up in practice.

TBC, I don't particularly expect hard constraints to show up, that was more a way of illustrating the underlying concept. The same underlying concept in the market-style picture would be: across many different top-level goals, there are convergent ways of carving up "property rights". So, a system can be generally corrigible by "respecting the convergent property rights", so to speak.

Yup, exactly, and good job explaining it too.

:)

Now that I feel like we're at least on the same page, I'll give some thoughts.

  • This is a neat idea, and one that I hadn't thought of before. Thanks!
  • I think I particularly like the way in which it might be a way of naturally naming constraints that might be useful to point at.
  • I am unsure how much these constraints actually get strongly reified in practice. When planning in simple contexts, I expect forward-checking to be more common. The centrality of forward-checking in my conception of the relationship between terminal and instrumental goals is a big part of where I think I originally got confused and misunderstood you.
  • One of the big reasons I don't focus so much on constraints when thinking about corrigibility is because I think constraints are usually either brittle or crippling. I think corrigible agents will, for example, try to keep their actions reversible, but I don't see a way to instantiate this as a constraint in a way that both allows normal action and forbids Goodharting. Instead, I tend to think about heuristics that fall-back on getting help from the principal. ("I have a rough sense of how reversible things should normally be, and if it looks like I might be going outside the normal bounds I'll stop and check.")
  • Thus, my guess is that if one naively tries to implement an agent that is genuinely constrained according to the natural set of "instrumental constraints" or whatever we want to call them, it'll end up effectively paralyzing them.
  • The thing that allows a corrigible agent not to be paralyzed, in my mind, is the presence of a principal. But if I'm understanding you right, "instrumental constraint" satisfying agents don't (necessarily) have a principal. This seems like a major difference between this idea and corrigibility.
  • I have some additional thoughts on how exactly the Scylla and Charybdis of being paralyzed by constraints and cleverly bypassing constraints kills you, for example with regard to resource accumulation/protection, but I think I want to end by noting a sense that naively implementing these in some kind of straightforward constrained-optimizer isn't where the value of this idea lies. Instead, I am most interested in whether this frame can be used as a generator for corrigibility heuristics (and/or a corrigibility dataset). 🤔

This comment raises some good points, but even "there will be a natural pressure for [subprocesses] to resemble a corrigible agent" seems debatable. Again consider the restaurant setting. It is sometimes necessary for restaurants to close temporarily for renovation to increase the seating capacity, upgrade equipment, etc. The head chef who decided to renovate will be making the instrumental goals of all the other chefs (make good food, earn money to stay alive) untenable while they are furloughed. More generally, progress towards terminal goals is not monotonic, and thus only focusing on the local topology of the optimization landscape might be insufficient to predict long-horizon trends.

This seems right. Some sub-properties of corrigibility, such as not subverting the higher-level and being shutdownable, should be expected in well-constructed sub-processes. But corrigibility is probably about more than just that (e.g. perhaps myopia) and we should be careful not to assume that well-constructed sub-processes that resemble agents will get all the corrigibility properties.

To be fair, I think the shutdownableness of an AI/not subverting higher level goals was the original motivation of all the corrigibility research, so this is a good thing.

Pursuit of money is an extremely special instrumental goal whose properties you shouldn't generalize to other goals in your theory of instrumental convergence. (And I could imagine it should be narrowed down further, e.g. into those who want to support the state vs those who want money by whichever means including scamming the state.)

Not convinced it's relevant, but I'm happy to change it to:
If it has matter and/or energy in its pocket, do I get to use that matter and/or energy?

Generally you wouldn't since it's busy using that matter/energy for whatever you asked it to do. If you wanted to use it, presumably you could turn down its intensity, or maybe it exposes some simplified summary that it uses to coordinate economies of scale.

One time I was camping in the woods with some friends. We were sat around the fire in the middle of the night, listening to the sound of the woods, when one of my friends got out a bluetooth speaker and started playing donk at full volume (donk is a kind of funny, somewhat obnoxious style of dance music).

I strongly felt that this was a bad bad bad thing to be doing, and was basically pleading with my friend to turn it off. Everyone else thought it was funny and that I was being a bit dramatic -- there was nobody around for hundreds of metres, so we weren't disturbing anyone.

I think my friends felt that because we were away from people, we weren't "stepping on the toes of any instrumentally convergent subgoals" with our noise pollution. Whereas I had the vague feeling that we were disturbing all these squirrels and pigeons or whatever that were probably sleeping in the trees, so we were "stepping on the toes of instrumentally convergent subgoals" to an awful degree.

Which is all to say, for happy instrumental convergence to be good news for other agents in your vicinity, it seems like you probably do still need to care about those agents for some reason?

Yes, I don't think this will let you get away with no specification bits in goal space at the top level like John's phrasing might suggest. But it may let you get away with much less precision? 

The things we care about aren't convergent instrumental goals for all terminal goals; the kitchen chef's constraints aren't doing that much to keep the kitchen liveable for cockroaches. But it seems to me that this maybe does gesture at a method to get away with pointing at a broad region of goal space instead of a near-pointlike region.

For some reason I've been muttering the phrase, "instrumental goals all the way up" to myself for about a year, so I'm glad somebody's come up with an idea to attach it to.

Hmm. This does have the feel of gesturing at something important, but I don't see it clearly yet...

Free association: geometric rationality.

MIRI's old results argue that "corrigibility via uncertainty regarding the utility function" doesn't work, because if the agent maximizes expected utility anyway, it doesn't matter one whit whether we're taking expectation over actions or over utility functions. However, the corrigibility-via-instrumental-goals does have the feel of "make the agent uncertain regarding what goals it will want to pursue next". Is there, therefore, some way to implement something-like-this while avoiding MIRI's counterexample?

Loophole: the counterexample works in the arithmetically-expected utility regime. What if we instead do it in the geometric one? I. e., have an agent take actions that maximize the geometrically-expected product of candidate utility functions? This is a more conservative/egalitarian regime: any one utility function flipping to negative or going to zero wipes out all value, unlike with sums (which are more tolerant of ignoring/pessimizing some terms, and can have "utility monsters"). So it might potentially make the agent actually hesitant to introduce potentially destructive changes to its environment...
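Concretely, the two regimes being contrasted would be something like this (my formalization of the hand-wave above; it assumes all candidate utilities are positive):

```latex
% Arithmetic regime (where the MIRI counterexample bites):
\[ \mathbb{E}_{u \sim P}[\, u(a) \,] \;=\; \sum_i p_i \, u_i(a) \]
% Geometric regime (the loophole being floated), assuming u_i(a) > 0:
\[ \prod_i u_i(a)^{p_i} \;=\; \exp\Big( \sum_i p_i \log u_i(a) \Big) \]
% Driving any single candidate utility to zero sends the geometric objective
% to zero (and its log to -infinity), so actions catastrophic under even one
% candidate goal are heavily penalized, whereas in the arithmetic regime a
% large gain on one u_i can compensate for wiping out another.
```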

(This is a very quick take and it potentially completely misunderstands the concepts involved. But I figure it's better to post than not, in case the connection turns out obvious to anyone else.)

However, the corrigibility-via-instrumental-goals does have the feel of "make the agent uncertain regarding what goals it will want to pursue next".

That's an element, but not the central piece. The central piece (in the subagents frame) is about acting-as-though there are other subagents in the environment which are also working toward your terminal goal, so you want to avoid messing them up.

The "uncertainty regarding the utility function" enters here mainly when we invoke instrumental convergence, in hopes that the subagent can "act as though other subagents are also working torward its terminal goal" in a way agnostic to its terminal goal. Which is a very different role than the old "corrigibility via uncertainty" proposals.

Instrumental goal competition explains the cake

My worry is that instrumental subgoals are safer not because they are automatically safer in themselves, but because higher goals (which generate instrumental subgoals) tend to generate multiple instrumental subgoals, none of which is important enough to steamroll the others. This seems to explain the cake example.

If you want instrumental goals all the way up, it means you want to repeatedly convert the highest goal into an instrumental subgoal of an even higher goal, which in turn will generate many other instrumental subgoals to compete with it for importance.

I'm not sure, but it looks like the only reason this should work is if the AGI/ASI has so many competing goals that being good to humans has some weight. This is similar to Multi-Objective Homeostasis.

Goal Reductionism

I guess another way this may work is if the AGI/ASI itself isn't sure why it's doing something: we can teach it to think that its behaviours are instrumental subgoals of some higher purpose, which it itself can't be sure about.

This is related to Goal Reductionism.

I feel that Self-Other Overlap: A Neglected Approach to AI Alignment also fits the theme of the chef and restaurant example, and may help with Goal Reductionism.

I was thinking about this a while back, as I was reading some comments by @tailcalled where they pointed out this possibility of a "natural impact measure" when agents make plans. This relied on some sort of natural modularity in the world, and in plans, such that you can make plans by manipulating pieces of the world which don't have side-effects leaking out to the rest of the world. But thinking through some examples didn't convince me that was the case. 

Though admittedly, all I was doing was recursively splitting my instrumental goals into instrumental sub-goals and checking if they wound up seeming like natural abstractions. If they had, perhaps that would reflect an underlying modularity in plan-making in this world that is likely to be goal-independent. They didn't, so I got more pessimistic about this endeavour. Though writing this comment out, it doesn't seem like those examples I worked through are much evidence. So maybe this is more likely to work than I thought.

What I eventually realized is that this line of argument is a perfect rebuttal of the whole mesa-optimization neurosis that has popped up, but it doesn't actually give us AI safety because it completely breaks down once you apply it to e.g. law enforcement or warfare.

I agree this doesn't perfectly solve the AI safety problem, and my guess is that the reason this doesn't work for law enforcement/warfare is because the instrumental goals are adversarial, such that you are not incentivized to avoid breaking other agents' goals.

However, if something like the plan from John Wentworth's post worked, this would be a really useful way to make automated AI alignment schemes that are safe, so I do think it indirectly gives us safety.

Also, entirely removing inner alignment problems/mesa optimization should cut down doom probabilities, especially for Eliezer Yudkowsky and John Wentworth, so I'd encourage you to write up your results on that line of argument anyway.

However, if something like the plan from John Wentworth's post worked, this would be a really useful way to make automated AI alignment schemes that are safe, so I do think it indirectly gives us safety.

How?

Also, entirely removing inner alignment problems/mesa optimization should cut down doom probabilities, especially for Eliezer Yudkowsky and John Wentworth, so I'd encourage you to write up your results on that line of argument anyway.

I didn't really get any further than John Wentworth's post here. But also I've been a lot less spooked by LLMs than Eliezer Yudkowsky.

How?

Basically, because you can safely get highly capable AIs to work on long and confusing problems without worrying that they'd eventually take over and kill everyone, and this includes all plans for automating alignment.

Also, a crux here is I expect automating alignment research to be way less adversarial than fields like law enforcement/warfare, because you are facing way less opposition to your goals.

If you want AIs to produce a lot of text on AI alignment and moral philosophy, you can already do that now without worrying that the AIs in question will take over the world.

If you want to figure out how to achieve good results when making the AI handle various human conflicts, you can't really know how to adapt and improve it without actually involving it in those conflicts.

Ok, the key point I want to keep in mind is that for the purposes of AI alignment, we don't really need to solve most human conflicts, other than internally generated ones, because the traditional alignment problem is aligning an AI to a single human, so most of the political conflicts do not actually matter here.

Most places where AI or alignment are applied are more convoluted cases where lots of people are involved. It's generally not economically feasible to develop AGI for a single person, so it doesn't really happen.

Agree with this, but the point here is that a single person (or at least a small set of people) has control over AI values by default, such that the AI is aligned to them personally and essentially treats other people according to the instructions/wishes of that single person/small set of people. This was my point in claiming that most conflicts don't matter, because they have a resolution procedure that is very simple to implement.

I don't think the people who develop AGI have clear or coherent wishes for how the AGI should treat most other people.

Agree with this, but 2 things:

  1. I expect people to develop clearer and more coherent wishes once they actually realize that they might have nation-state level power.

  2. Most versions of incoherent/unclear wishes for other humans do not result in existential catastrophe, relative to other failure modes for AI safety.

I don't really understand how you expect this line of thought to play out. Are you arguing e.g. Sam Altman would start using OpenAI to enforce his own personal moral opinions, even when they are extremely unpopular?

This definitely can happen, though I'd argue in practice it wouldn't go as far as enforcing his own opinions by force. To get back to what I wanted to argue, my point here is that instrumental goals lead to corrigibility, and that in practice we will have instruction-following AGIs/ASIs rather than value-aligned AGIs/ASIs:

https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than

I don't understand your whole end-to-end point, like how does this connect to making AIs produce texts on alignment, and how does that lead to a pivotal act?

I don't understand your whole end-to-end point, like how does this connect to making AIs produce texts on alignment, and how does that lead to a pivotal act?

The key property is that we can reasonably trust their research not to have adversarial backdoors, so we can let our guard down quite a lot. The pivotal act I usually envision has to do with automating the R&D pipeline, which then leads to automating the alignment pipeline, which leads to existential safety.

Note this doesn't look like a pivotal act, and this is not coincidental, because real-life heroism doesn't look like bombast/using hard power; it looks like making a process more efficient, like the Green Revolution, or preventing backfire risks that would make the situation worse.

I'm not interested in your key property, I'm interested in a more proper end-to-end description. Like superficially this just sounds like it immediately runs into the failure mode John Wentworth described last time, but your description is kind of too vague to say for sure.

I have to agree with this comment by Matt Levinson below: at least 3 of the specific failure modes described in the post can't be solved by any AI safety agenda, because they rely on the assumption that people will use the agenda, so there's no reason to consider them. Having read the discourse on that post, I think the main ways I disagree with John Wentworth are that I'm much more optimistic in general on verification, and do not find his view that verification is not easier than generation plausible at all, which leads to being more optimistic about something like a market of ideas for AI alignment working; and I think bureaucracies in general are way better than John Wentworth seems to imply.

This is also related to the experiment John did (the air conditioner test) on whether markets reliably solve hard problems instead of goodharting; my takeaway is that markets are actually sometimes good at optimizing things, and people just don't appreciate the economic/computational constraints behind why something is the way it is.

Comments below:

https://www.lesswrong.com/posts/8wBN8cdNAv3c7vt6p/the-case-against-ai-control-research#FembwXfYSwnwxzWbC

https://www.lesswrong.com/posts/5re4KgMoNXHFyLq8N/air-conditioner-test-results-and-discussion#maJBX3zAEtx5gFcBG

https://www.lesswrong.com/s/TLSzP4xP42PPBctgw/p/3gAccKDW6nRKFumpP#g4N9Pdj8mQioRe43q

https://www.lesswrong.com/posts/5re4KgMoNXHFyLq8N/air-conditioner-test-results-and-discussion#3TFECJ3urX6wLre5n

The posts I disagree with:

https://www.lesswrong.com/s/TLSzP4xP42PPBctgw/p/3gAccKDW6nRKFumpP

https://www.lesswrong.com/posts/2PDC69DDJuAx6GANa/verification-is-not-easier-than-generation-in-general

https://www.lesswrong.com/posts/MMAK6eeMCH3JGuqeZ/everything-i-need-to-know-about-takeoff-speeds-i-learned

https://www.lesswrong.com/posts/hsqKp56whpPEQns3Z/why-large-bureaucratic-organizations

(For the bureaucratic organizations point, I think the big reason bureaucracy exists is a combination of needing to avoid corruption/bad states very strongly, so simple, verifiable rules are best, combined with the world giving us problems that are hard to solve but easy to verify, plus humans needing to coordinate.)

So I'm much less worried about slop than John Wentworth is.

If you're assuming that verification is easier than generation, you're pretty much a non-player when it comes to alignment.

My new concept for "pivotal act that stops the world from getting to ASI, even though we get to AGI" is a soft-power act of better coordination. Get help from AGI to design and deploy decentralized governance tech that allows humanity (and AIs) to coordinate on escaping the trap of suicide-race.

Once you start getting involved with governance, you're going to need law enforcement and defense, which is an adversarial context and thus means the whole instrumental-goal-niceness argument collapses.

If you want to figure out how to achieve good results when making the AI handle various human conflicts, you can't really know how to adapt and improve it without actually involving it in those conflicts.


I disagree. There are so many conflicts (some kinds make it into writing, some just happen) of different scales, both in history and now, that I believe they span human conflict space almost fully. Just aggregating this information could lead to very good advice on handling everything, which AI could act upon if it so needed.

Could you unpack both clauses of this sentence? It's not obvious to me why they are true.

For the former I'd need to hear your favorite argument in favor of the neurosis that inner alignment is a major problem.

For the latter, in the presence of adversaries, every subgoal has to be robust against those adversaries, which is very unfriendly.

I think mesa-optimizers could be a major problem, but there are good odds we live in a world where they aren't. Why do I think they're plausible? Because optimization is a pretty natural capability, and a mind being/becoming an optimizer at the top level doesn't seem like a very complex claim, so I assign decent odds to it. There's some weak evidence in favour of this too, e.g. humans not optimizing for what the local, myopic evolutionary optimizer which is acting on them is optimizing for, coherence theorems, etc. But that's not super strong, and there are other simple hypotheses for how things go, so I don't assign more than like 10% credence to the hypothesis.

It's still not obvious to me why adversaries are a big issue. If I'm acting against an adversary, it seems like I won't make counter-plans that lead to lots of side-effects either, for the same reasons they won't. 

It's still not obvious to me why adversaries are a big issue. If I'm acting against an adversary, it seems like I won't make counter-plans that lead to lots of side-effects either, for the same reasons they won't. 

I mean, we can start by noticing that historically, optimization in the presence of adversaries has led to huge things. The world wars wrecked Europe. States and large bureaucratic organizations probably exist mainly as a consequence of farm raids. The immune system tends to stress out the body a lot when it is dealing with an infection. While it never actually triggered, the nuclear arms race led to existential risk for humanity, and even though the destruction never came, it still made people quite afraid of e.g. nuclear power. Etc.

Now, why does trying to destroy a hostile optimizer tend to cause so much destruction? I feel like the question almost answers itself.

Or if we want to go mechanistic about it, one of the ways to fight back against the Nazis is with bombs, which deliver a sudden shockwave of energy that has the property of destroying Nazi structures and everything else. It's almost constitutive of the alignment problem: we have a lot of ways of influencing the world a lot, but those methods do not discriminate between good and evil/bad.

From an abstract point of view, many coherence theorems rely on e.g. Dutch books, and thus become much more applicable in the case of adversaries. The coherence theorem "if an agent achieves its goals robustly regardless of environment, then it stops people who want to shut it down" can be trivially restated as "either an agent does not achieve its goals robustly regardless of environment, or it stops people who want to shut it down", and here non-adversarial agents should obviously choose the former branch (to be corrigible, you need to not achieve your goals in an environment where someone is trying to shut you down).

From a more strategic point of view, when dealing with an adversary, you tend to become a lot more constrained on resources because if the adversary can find a way to drain your resources, then it will try to do so. Ways to succeed include:

  • Making it harder for people to trick you into losing resources, by e.g. making it harder for people to predict you, being less trusting of what people tell you, and winning as quickly as possible
  • Gaining more resources by grabbing them from elsewhere

Also, in an adversarial context, a natural prior is that inconveniences are there for a reason, namely to interfere with you. This tends to make enemies.

I think mesa-optimizers could be a major problem, but there are good odds we live in a world where they aren't. Why do I think they're plausible? Because optimization is a pretty natural capability, and a mind being/becoming an optimizer at the top level doesn't seem like a very complex claim, so I assign decent odds to it. There's some weak evidence in favour of this too, e.g. humans not optimizing for what the local, myopic evolutionary optimizer which is acting on them is optimizing for, coherence theorems, etc. But that's not super strong, and there are other simple hypotheses for how things go, so I don't assign more than like 10% credence to the hypothesis.

Mesa-optimizers definitely exist to varying degrees, but they generally try to not get too involved with other things. Mechanistically, we can attribute this to imitation learning, since they're trying to mimic humans' tendency to stitch together strategies in a reasonable way. Abstractly, the friendliness of instrumental goals shows us why unbounded unfriendly utility maximizers are not the only or even main attractor here.

(... Some people might say that we have a mathematical model of unbounded unfriendly utility maximizers but not of friendlier bounded instrumental optimizers. But those people are wrong because the model of utility maximizers assumes we have an epistemic oracle to handle the updating, prediction and optimization for us, and really that's the computationally heavy part. One of the advantages of more bounded optimization like in the OP is that it ought to be more computationally tractable because different parts of the plans interfere less with each other. It's not really fair to say that we know how utility maximizers work when they outsource the important part to the assumptions.)

I think one pattern which needs to hold in the environment in order for subgoal corrigibility to make sense is that the world is modular, but that modularity structure can be broken or changed.


For one, modularity is the main thing that enables general purpose search: if we can optimize for a goal by just optimizing for a few instrumental subgoals while ignoring the influence of pretty much everything else, then that reflects some degree of modularity in the problem space.

Secondly, if the modularity structure of the environment stays constant no matter what (e.g. we can represent it as a fixed causal DAG), then there would be no need to "respect modularity", because any action we take would preserve the modularity of the environment by default (given our assumption); we would only need to worry about side effects if there's at least a possibility for those side effects to break or change the modularity of the problem space, and that means the modularity structure of the problem space is a thing that can be broken or changed.


Example of modularity structure of the environment changing: Most objects in the world pretty much only have direct influence on other objects nearby, and we can break or change that modularity structure by moving objects to different positions. In particular, the positions are the variables which determines the modularity of "which objects influence which other objects", and the way that we "break" the modularity structure between the objects is by intervening on those variables.

 

So we know that "subgoal corrigibility" requries the environment to be modular, but that modularity structure can be broken or changed. If this is true, then the modularity structure of the environment can be tracked by a set of "second-order" variables such as position which tells us "what things influence what other things" (In particular, these second-order variables themselves might satisfy some sort of modularity structure that can be changed, and we may have third-order variables that tracks the modularity structure of the second-order variables). The way that we "respect the modularity" of other instrumental subgoals is by preserving these second-order variables that track the modularity structure of the problem space.

 

For instance, we get to break down the goal of baking a cake into instrumental subgoals such as acquiring cocoa powder (while ignoring most other things) if and only if a particular modularity structure of the problem space holds (e.g. the other equipment is all in the right place & right positions), and there is a set of variables that track that modularity structure (the conditions & positions of the equipment). The way we preserve that modularity structure is by preserving those variables (the conditions & positions of the equipment).

 

Given this, we might want to model the world in a way that explicitly represents variables that track the modularity of other variables, so that we get to preserve influence over those variables (and therefore the modularity structure that GPS relies on).
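Here's a toy rendering of the "second-order variables" idea; the objects, positions, and distance threshold are all invented for illustration:

```python
import itertools

# First-order variables: each object's internal state (oven temperature, etc.).
# Second-order variables: positions, which determine *which objects can
# influence which others* (here: objects interact only within some radius).

positions = {"oven": (0, 0), "eggs": (1, 0), "mixer": (5, 5), "counter": (0, 1)}
RADIUS = 2.0

def influence_graph(positions, radius=RADIUS):
    edges = set()
    for a, b in itertools.combinations(positions, 2):
        (xa, ya), (xb, yb) = positions[a], positions[b]
        if ((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5 <= radius:
            edges.add(frozenset((a, b)))
    return edges

before = influence_graph(positions)

# An action that only changes first-order state (preheating the oven) leaves
# the graph alone; an action that moves an object rewrites it:
positions["eggs"] = (6, 5)
after = influence_graph(positions)

print("modularity preserved?", before == after)
# False: moving the eggs changed which subgoals can interfere with which,
# i.e. it broke the modularity structure the old plan factorization relied on.
```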

This is great.

Now, given that you're already talking about instrumental goals "trying not to step on each other's toes", what else would they need to deserve the name of "subagents"?

Note that the instrumental goal is importantly distinct from the subagent which pursues that instrumental goal. I think a big part of the insight in this post is to say "corrigibility is a property of instrumental goals, separate from the subagents which pursue those goals"; we can study the goals (i.e. problem factorization) rather than the subagents in order to understand corrigibility.

To the extent this reasoning works, it breaks the moment any agent has anything like a decisive strategic advantage. At that point no one else's instrumental or terminal goals can act as constraints.