Max Harms

This seems mostly right. I think there still might be problems where identifying and charging for relevant externalities is computationally harder than routing around them. For instance, if you're dealing with a civilization (such as humanity) that responds to your actions in complex and chaotic ways, it may be intractable to price "reputation damage" efficiently, and you might instead want to be overly cautious (i.e. "impose constraints") and think through deviations from that cautious baseline on a case-by-case basis (i.e. "forward-check"). Again, I think your point is mostly right, and a useful frame -- it makes me less likely to expect the kinds of hard constraints that Wentworth and Lorell propose to show up in practice.

:)

Now that I feel like we're at least on the same page, I'll give some thoughts.

  • This is a neat idea, and one that I hadn't thought of before. Thanks!
  • I think I particularly like the way in which it might be a way of naturally naming constraints that might be useful to point at.
  • I am unsure how much these constraints actually get strongly reified in practice. When planning in simple contexts, I expect forward-checking to be more common. The centrality of forward-checking in my conception of the relationship between terminal and instrumental goals is a big part of where I think I originally got confused and misunderstood you.
  • One of the big reasons I don't focus so much on constraints when thinking about corrigibility is that I think constraints are usually either brittle or crippling. I think corrigible agents will, for example, try to keep their actions reversible, but I don't see a way to instantiate this as a constraint that both allows normal action and forbids Goodharting. Instead, I tend to think about heuristics that fall back on getting help from the principal. ("I have a rough sense of how reversible things should normally be, and if it looks like I might be going outside the normal bounds I'll stop and check.")
  • Thus, my guess is that if one naively tries to implement an agent that is genuinely constrained according to the natural set of "instrumental constraints" or whatever we want to call them, it'll end up effectively paralyzed.
  • The thing that allows a corrigible agent not to be paralyzed, in my mind, is the presence of a principal. But if I'm understanding you right, "instrumental constraint" satisfying agents don't (necessarily) have a principal. This seems like a major difference between this idea and corrigibility.
  • I have some additional thoughts on how exactly the Scylla and Charybdis of being paralyzed by constraints and cleverly bypassing constraints kills you, for example with regard to resource accumulation/protection, but I think I want to end by noting a sense that naively implementing these in some kind of straightforward constrained-optimizer isn't where the value of this idea lies. Instead, I am most interested in whether this frame can be used as a generator for corrigibility heuristics (and/or a corrigibility dataset). 🤔
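The "stop and check" fallback I describe above can be sketched as a toy policy. To be clear, every name and threshold here is a hypothetical illustration of the heuristic, not a proposal for how to actually build it:

```python
# Toy sketch of a fallback heuristic: instead of a hard reversibility
# constraint, the agent defers to its principal whenever the action it
# would otherwise take looks unusually irreversible.
# REVERSIBILITY_BOUND stands in for the agent's "rough sense of how
# reversible things should normally be" (value is arbitrary).

REVERSIBILITY_BOUND = 0.5


def choose_action(candidates, reversibility, ask_principal):
    """Pick the highest-value candidate action, stopping to check with
    the principal if that pick falls outside normal reversibility bounds."""
    best = max(candidates, key=lambda a: a["value"])
    if reversibility(best) < REVERSIBILITY_BOUND:
        # Outside normal bounds: stop and get help rather than proceed.
        return ask_principal(best)
    return best


# Example: a high-value but highly irreversible action triggers a check-in.
actions = [{"name": "small_edit", "value": 1.0},
           {"name": "delete_everything", "value": 5.0}]
rev = {"small_edit": 0.9, "delete_everything": 0.1}
picked = choose_action(actions, lambda a: rev[a["name"]],
                       ask_principal=lambda a: {"name": "ask_first"})
# picked["name"] == "ask_first"
```

The point of the sketch is that the "constraint" never hard-forbids anything; going outside normal bounds just routes the decision to the principal, which is why the agent isn't paralyzed.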
Max Harms

This is a helpful response. I think I rounded to agents because in my head I see corrigibility as a property of agents, and I don't really know what "corrigible goal" even means. Your point about constraints is illuminating, as I tend not to focus on constraints when thinking about corrigibility. But let me see if I understand what you're trying to say.

Suppose we're optimizing for paperclips, and we form a plan to build paperclip factories to accomplish that (top level) goal. Building factories can then be seen as a subgoal, but of course we should be careful when building paperclip factories not to inadvertently ruin our ability to make paperclips. One way of protecting the terminal goal even when focusing on subgoals is to forward-check actions to see if they conflict with the destination. (This is similar to how a corrigible agent might check for confirmation from its principal before doing something with heavy, irreversible consequences.) Forward-checking, for obvious reasons, requires there to actually be a terminal goal to check, and we should not expect this to work in an agent "without a terminal goal." But there's another way to prevent optimizing a subgoal from inadvertently hurting global success: constrain the optimization. If we can limit the kinds of changes that we make when pursuing the subgoal to nice, local, reversible ones, then we can pursue building paperclip factories myopically, without worrying that we'll produce side-effects that ruin our overall ability to make paperclips. This is especially useful when pursuing several subgoals in parallel, since forward-checking a combination of moves is combinatorially costly -- better to have the agent's parallel actions constrained to nice parts of the space.
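A minimal way to contrast the two strategies (the predicates here, `terminal_goal_ok` and `is_local_and_reversible`, are hypothetical stand-ins; this is a sketch of the distinction, not an implementation of either):

```python
# Toy contrast of the two strategies for protecting a terminal goal
# while pursuing a subgoal.

def forward_check(steps, terminal_goal_ok):
    # Protect the terminal goal by checking each candidate step against it.
    # Note: with k parallel subgoals, the check is really over combinations
    # of steps, which is where the combinatorial cost comes from.
    return [s for s in steps if terminal_goal_ok(s)]


def constrained_pursuit(steps, is_local_and_reversible):
    # Pursue the subgoal myopically, never consulting the terminal goal;
    # safety comes from only taking nice, local, reversible steps.
    return [s for s in steps if is_local_and_reversible(s)]


steps = ["buy_steel", "build_factory", "melt_down_existing_clips"]
ok = lambda s: s != "melt_down_existing_clips"         # conflicts with terminal goal
local = lambda s: s in ("buy_steel", "build_factory")  # constraint on step types
assert forward_check(steps, ok) == constrained_pursuit(steps, local)
```

In this toy case the two filters agree, but only the first one needed access to the terminal goal; the second gets the same safety from the shape of the allowed steps alone.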

If it turns out there's a natural kind of constraint that shows up when making plans in a complex world, such that optimizing under that set of constraints is naturally unlikely to harm ability to accomplish goals in general, then perhaps we have some hope in naming that natural kind, and building agents which are always subject to these constraints, regardless of what they're working on.

Is that right?

(This is indeed a very different understanding of what you were saying than I originally had. Apologies for the misunderstanding.)

This seems right. Some sub-properties of corrigibility, such as not subverting the higher-level process and being shutdownable, should be expected in well-constructed sub-processes. But corrigibility is probably about more than just that (e.g. perhaps myopia) and we should be careful not to assume that well-constructed sub-processes that resemble agents will get all the corrigibility properties.

Not convinced it's relevant, but I'm happy to change it to:
If it has matter and/or energy in its pocket, do I get to use that matter and/or energy?

Max Harms

Some of this seems right to me, but the general points seem wrong. I agree that insofar as a subprocess resembles an agent, there will be a natural pressure for it to resemble a corrigible agent. Pursuit of e.g. money is all well and good until it stomps the original ends it was supposed to serve -- this is akin to a corrigibility failure. The terminal-goal-seeking cognition needs to be able to abort, modify, and avoid babysitting its subcognition.

One immediate thing to flag is that when you start talking about chefs in the restaurant, those other chefs are working towards the same overall ends, and the point about predictability and visibility only applies to them. Indeed, we don't really need the notion of instrumentality here -- I expect two agents that each know the other is working towards the same ends to naturally want to coordinate, including by making their actions legible to each other.

One more interesting thing to highlight: so far, insofar as instrumental goals are corrigible, we've only talked about them being corrigible toward other instrumental subgoals of the same shared terminal goal. The chef pursuing the restaurant's success might be perfectly fine screwing over e.g. a random taxi driver in another city. But instrumental convergence potentially points towards general corrigibility.

This is, I think, the cruxy part of this essay. Knowing that an agent won't want to build incorrigible limbs (and so we should expect corrigibility as a natural property of agentic limbs) isn't very important. What's important is whether we can build an AI that's more like a limb, or that we expect to gravitate in that direction, even as it becomes vastly more powerful than the supervising process.

(Side note: I do wish you'd talked a bit about a restaurant owner in your metaphor; having an overall cognition that's steering the chefs towards the terminal ends is a natural part of the story, and if you deny that the restaurant has to have an owner, I think that's a big enough move that I want you to spell it out more.)

So to build a generally corrigible system, we can imagine just dropping terminal goals altogether, and aim for an agent which is 'just' corrigible toward instrumentally-convergent subgoals.

I predict such an agent is relatively easy to make, and will convert the universe into batteries/black holes, computers, and robots. I fail to see why it would respect agents with other terminal goals.

But perhaps you mean you want to set up an agent which is serving the terminal goals of others? (The nearest person? The aggregate will of the collective? The collective will of the non-anthropomorphic universe?) If it has money in its pocket, do I get to spend that money? Why? Why not expect that in the process of this agent getting good at doing things, it learns to guard its resources from pesky monkeys in the environment? In general I feel like you've just gestured at the problem in a vague way without proposing anything that looks to me like a solution. :\

Max Harms

Thanks for noticing the typo. I've updated that section to try and be clearer. LMK if you have further suggestions on how it could be made better.

Max Harms

That's an interesting proposal! I think something like it might be able to work, though I worry about details. For instance, suppose there's a Propagandist who gives resources to agents that brainwash their principals into having certain values. If "teach me about philosophy" comes with an influence budget, it seems critical that the AI doesn't spend that budget trading with the Propagandist, and instead does so in a more "central" way.

Still, the idea of instructions carrying a degree of approved influence seems promising.

Max Harms

Sure, let's talk about anti-naturality. I wrote some about my perspective on it here: https://www.alignmentforum.org/s/KfCjeconYRdFbMxsy/p/3HMh7ES4ACpeDKtsW#_Anti_Naturality__and_Hardness

More directly, I would say that general competence/intelligence is connected with certain ways of thinking. For example, modes of thinking that focus on tracking scarce resources and bottlenecks are generally useful. If we think about processes that select for intelligence, those processes are naturally[1] going to select for these ways of thinking. Some properties we might imagine a mind having, such as only thinking locally, are the opposite of this -- if we select for them, we are fighting the intelligence gradient. To say that a goal is anti-natural means that accomplishing that goal involves learning to think in anti-natural ways, and thus training a mind to have that goal is like swimming against the current; we should expect it to potentially break if the training process puts too much weight on competence compared to alignment. Minds with anti-natural goals are possible, but harder to produce using known methods, for the most part.

(AFAIK this is the way that Nate Soares uses the term, and I assume the way Eliezer Yudkowsky thinks about it as well, but I'm also probably missing big parts of their perspectives, and generally don't trust myself to pass their ITT.)

  1. ^

    The term "anti-natural" is bad in that it seems to be the opposite of "natural," but is not a general opposite of natural. While I do believe that the ways-of-thinking-that-are-generally-useful are the sorts of things that naturally emerge when selecting for intelligence, there are clearly plenty of things which the word "natural" describes besides these ways of thinking. The more complete version of "anti-natural" according to me would be "anti-the-useful-cognitive-strategies-that-naturally-emerge-when-selecting-for-intelligence" but obviously we need a shorthand term, and ideally one that doesn't breed confusion.

Max Harms

If I'm hearing you right, a shutdownable AI can have a utility function that (aside from considerations of shutdown) just gives utility scores to end-states as represented by a set of physical facts about some particular future time, and this utility function can be set up to avoid manipulation.

How does this work? Like, how can you tell by looking at the physical universe in 100 years whether I was manipulated in 2032?
