I think your 'Incomplete preferences' section makes various small mistakes that add up to important misunderstandings.
The utility maximization concept largely comes from the VNM-utility-theorem: that any policy (i.e. function from states to actions) which expresses a complete set of transitive preferences (which aren’t sensitive to unused alternatives) over lotteries is able to be described as an agent which is maximizing the expectation of some real-valued utility function over outcomes.
I think you intend 'sensitive to unused alternatives' to refer to the Independence axiom of the VNM theorem, but VNM Independence isn't about unused alternatives. It's about lotteries that share a sublottery. It's Option-Set Independence (sometimes called 'Independence of Irrelevant Alternatives') that's about unused alternatives.
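For reference, here is a standard textbook statement of VNM Independence. The notation (lotteries X, Y, Z, mixing probability p, and weak preference ≿) is introduced here for illustration and is not taken from the post:

```latex
% VNM Independence: mixing both sides with the same sublottery Z,
% at the same probability, preserves the preference between X and Y.
\[
  \forall X, Y, Z,\ \forall p \in (0, 1]:\quad
  X \succsim Y \iff pX + (1-p)Z \succsim pY + (1-p)Z
\]
```

Nothing in this condition mentions which other options are available. Option-Set Independence, by contrast, is a condition on how preferences or choices vary with the set of options on offer.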
On the surface, the axioms of VNM-utility seem reasonable to me
To me too! But the question isn't whether they seem reasonable. It's whether we can train agents that enduringly violate them. I think that we can. Coherence arguments give us little reason to think that we can't.
unused alternatives seem basically irrelevant to choosing between superior options
Yes, but this isn't Independence. And the question isn't about what seems basically irrelevant to us.
agents with intransitive preferences can be straightforwardly money-pumped
Not true. Agents with cyclic preferences can be straightforwardly money-pumped. The money-pump for intransitivity requires the agent to have complete preferences.
as long as the resources are being modeled as part of what the agent has preferences about
Yes, but the concern is whether we can instil such preferences. It seems like it might be hard to train agents to prefer to spend resources in pursuit of their goals except in cases where they would do so by resisting shutdown.
Thornley, I believe, thinks he’s proposing a non-VNM rational agent. I suspect that this is a mistake on his part that stems from neglecting to formulate the outcomes as capturing everything that he wants.
You can, of course, always reinterpret the objects of preference so that the VNM axioms are trivially satisfied. That's not a problem for my proposal. See:
Thanks, Lucius. Whether or not decision theory as a whole is concerned only with external behaviour, coherence arguments certainly aren’t. Remember what the conclusion of these arguments is supposed to be: advanced agents who start off not being representable as EUMs will amend their behaviour so that they are representable as EUMs, because otherwise they’re liable to pursue dominated strategies.
Now consider an advanced agent who appears not to be representable as an EUM: it’s paying to trade vanilla for strawberry, strawberry for chocolate, and chocolate for vanilla. Is this agent pursuing a dominated strategy? Will it amend its behaviour? It depends on the objects of preference. If objects of preference are ice-cream flavours, the answer is yes. If the objects of preference are sequences of trades, the answer is no. So we have to say something about the objects of preference in order to predict the agent’s behaviour. And the whole point of coherence arguments is to predict agents’ behaviour.
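To make the first reading concrete, here is a minimal sketch of the ice-cream agent, assuming the objects of preference are flavours. The fee, the starting money, and all names are hypothetical and chosen only for illustration:

```python
# Hypothetical illustration: an agent with cyclic preferences over
# ice-cream flavours that pays a small fee for every trade it accepts.
# The fee and the starting money are assumptions made for this sketch.

# PREFERS[x] is the flavour the agent strictly prefers to x.
PREFERS = {
    "vanilla": "strawberry",
    "strawberry": "chocolate",
    "chocolate": "vanilla",
}
FEE = 1  # cost per trade, in arbitrary units


def run_money_pump(start_flavour, money, offers):
    """Offer the agent its preferred swap `offers` times; it always accepts."""
    flavour = start_flavour
    for _ in range(offers):
        flavour = PREFERS[flavour]  # trade for the flavour it prefers
        money -= FEE                # and pay for the trade
    return flavour, money


final_flavour, final_money = run_money_pump("vanilla", money=10, offers=3)
print(final_flavour, final_money)
```

After one full cycle the agent holds the flavour it started with and has less money, so if flavours are the objects of preference, it has pursued a dominated strategy. If the objects of preference are sequences of trades, the same behaviour log shows no such thing, which is why the objects of preference have to be specified before coherence arguments predict anything.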
And once we say something about the objects of preference, then we can observe agents violating Completeness and acting in accordance with policies like ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ This doesn't require looking into the agent or saying anything about its algorithm or anything like that. It just requires us to say something about the objects of preference and to watch what the agent does from the outside. And coherence arguments already commit us to saying something about the objects of preference. If we say nothing, we get no predictions out of them.
The pattern of how an agent chooses options are that agent’s preferences, whether we think of them as such or whether they’re conceived as a decision rule to prevent being dominated by expected-utility maximizers!
You can define 'preferences' so that this is true, but then it need not follow that agents will pay costs to shift probability mass away from dispreferred options and towards preferred options. And that's the thing that matters when we're trying to create a shutdownable agent. We want to ensure that agents won't pay costs to influence shutdown-time.
Also, take your decision-tree and replace 'B' with 'A-'. If we go with your definition, we seem to get the result that expected-utility-maximizers prefer A- to A (because they choose A- over A on Monday). But that doesn't sound right, and so it speaks against the definition.
I think it’s interesting to note that we’re also doing something like throwing out the axiom of independence from unused alternatives
Not true. The axiom we're giving up is Decision-Tree Separability. That's different to VNM Independence, and different to Option-Set Independence. It might be hard to train agents that enduringly violate VNM Independence and/or Option-Set Independence. It doesn't seem so hard to train agents that enduringly violate Decision-Tree Separability.
In other words, if you wake up as this kind of agent on Monday, the way you cash-out your partial ordering over outcomes depends on your memory/model of what happened on Sunday.
Yes, nice point. Kinda weird? Maybe. Difficult to create artificial agents that do it? Doesn't seem so.
But notice that this refactor effectively turns Thornley’s agent into an agent with a set of preferences which satisfies the completeness and independence axioms of VNM
Yep, you can always reinterpret the objects of preference so that the VNM axioms are trivially satisfied. That's not a problem for my proposal.
the point is that “incomplete preferences” combined with a decision making algorithm which prevents the agent’s policy from being strictly dominated by an expected utility maximizer ends up, in practice, as isomorphic to an expected utility maximizer which is optimizing over histories/trajectories.
Not true. As I say elsewhere:
And an agent abiding by the Caprice Rule can’t be represented as maximising utility, because its preferences are incomplete. In cases where the available trades aren’t arranged in some way that constitutes a money-pump, the agent can prefer (/reliably choose) A+ over A, and yet lack any preference between (/stochastically choose between) A+ and B, and lack any preference between (/stochastically choose between) A and B. Those patterns of preference/behaviour are allowed by the Caprice Rule.
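As a rough illustration of that pattern of behaviour, here is a minimal sketch of a policy in the spirit of the rule quoted earlier ('if I previously turned down some option X, I will not choose any option that I strictly disprefer to X'). It is not a formal definition of the Caprice Rule; the function names, the two-step trade sequence, and the use of uniform random choice among permissible options are all assumptions made for this example:

```python
import random

# Strict preferences as (better, worse) pairs. Only A+ > A is assumed;
# B is left incomparable to both, so the preferences are incomplete.
STRICTLY_PREFERS = {("A+", "A")}


def dispreferred_to(option, other):
    """True if `option` is strictly dispreferred to `other`."""
    return (other, option) in STRICTLY_PREFERS


def permissible(available, turned_down):
    """Options not strictly worse than any available or turned-down option."""
    benchmarks = set(available) | set(turned_down)
    return [
        o for o in available
        if not any(dispreferred_to(o, b) for b in benchmarks)
    ]


def choose(available, turned_down, rng):
    """Pick stochastically among the permissible options."""
    return rng.choice(permissible(available, turned_down))


def run_trade_sequence(rng):
    """Start with A+, get offered B, then (if holding B) get offered A."""
    turned_down = []
    holding = choose(["A+", "B"], turned_down, rng)
    if holding == "B":
        turned_down.append("A+")  # the agent remembers giving up A+
        holding = choose(["B", "A"], turned_down, rng)
    return holding


outcomes = {run_trade_sequence(random.Random(seed)) for seed in range(200)}
print(sorted(outcomes))
```

In this sketch the agent reliably picks A+ over A, chooses stochastically between A+ and B and between A and B when those are the only options, and never ends the two-step sequence holding A, because A is strictly dispreferred to the A+ it turned down.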
I want to consider how there’s a common misunderstanding of “outcomes” in the VNM-sense as needing to be about physical facts of the future universe (such as number of paperclips) rather than as potentially including historical facts, such as which options were abandoned or whether the agent took the default action. This is extremely relevant for corrigibility since one of the key ideas in my strategy is to shift the AI’s preferences from being about things like whether the button is pushed to being about whether the agent consistently maintained a certain kind of relationship with the principal during the relevant period.
Same point here as above. You can get any agent to satisfy the VNM axioms by enriching the objects of preference. A concern is that these more complex preferences are harder to reliably train into your agent.
Excellent response. Thank you. :) I'll start with some basic responses, and will respond later to other points when I have more time.
I think you intend 'sensitive to unused alternatives' to refer to the Independence axiom of the VNM theorem, but VNM Independence isn't about unused alternatives. It's about lotteries that share a sublottery. It's Option-Set Independence (sometimes called 'Independence of Irrelevant Alternatives') that's about unused alternatives.
I was speaking casually here, and I now regret it. You are absolutely correct that Option-Set independence is not the Independence axiom. My best guess about what I meant was that VNM assumes that the agent has preferences over lotteries in isolation, rather than, for example, a way of picking preferences out of a set of lotteries. For instance, a VNM agent must have a fixed opinion about lottery A compared to lottery B, regardless of whether that agent has access to lottery C.
> agents with intransitive preferences can be straightforwardly money-pumped
Not true. Agents with cyclic preferences can be straightforwardly money-pumped. The money-pump for intransitivity requires the agent to have complete preferences.
You are correct. My "straightforward" mechanism for money-pumping an agent with preferences A > B, B > C, but which does not prefer A to C does indeed depend on being able to force the agent to pick either A or C in a way that doesn't reliably pick A.
Also, take your decision-tree and replace 'B' with 'A-'. If we go with your definition, we seem to get the result that expected-utility-maximizers prefer A- to A (because they choose A- over A on Monday). But that doesn't sound right, and so it speaks against the definition.
Can you be more specific here? I gave several trees, above, and am not easily able to reconstruct your point.
Thanks. (And apologies for the long delay in responding.)
Here's my attempt at not talking past each other:
We can observe the actions of an agent from the outside, but as long as we're merely doing so, without making some basic philosophical assumptions about what it cares about, we can't generalize these observations. Consider the first decision-tree presented above that you reference. We might observe the agent swap A for B and then swap A+ for B. What can we conclude from this? Naively we could guess that A+ > B > A. But we could also conclude that A+ > {B, A} and that because the agent can see the A+ down the road, they swap from A to B purely for the downstream consequence of getting to choose A+ later. If B = A-, we can still imagine the agent swapping in order to later get A+, so the initial swap doesn't tell us anything. But from the outside we also can't really say that A+ is always preferred over A. Perhaps this agent just likes swapping! Or maybe there's a different governing principle that's being neglected, such as preferring almost (but not quite) getting B.
The point is that we want to form theories of agents that let us predict their behavior, such as when they'll pay a cost to avoid shutdown. If we define the agent's preferences as "which choices the agent makes in a given situation" we make no progress towards a theory of that kind. Yes, we can construct a frame that treats Incomplete Preferences as EUM of a particular kind, but so what? The important bit is that an Incomplete Preference agent can be set up so that it provably isn't willing to pay costs to avoid shutdown.
Does that match your view?
Yes, that's a good summary. The one thing I'd say is that you can characterize preferences in terms of choices and get useful predictions about what the agent will do in other circumstances if you say something about the objects of preference. See my reply to Lucius above.
I reject Thornley’s assertion that they’re dealbreakers.
Everything you say in this section seems very reasonable. In particular, I think it's pretty likely that this is true:
It’s okay for our agent to have preferences around the shutdown button (that is: to have it either pressed or unpressed), because we can carefully train into our agent a shallow aversion to manipulating the button, including via side-channels such as humans or other machines. This aversion will likely win out over the agent’s incentives in settings that resemble the training environment. As a result, the agent won’t try to manipulate the button in the early phases of its life, and so will remain shutdownable long enough for a further refinement process to generalize the shallow aversion into a deep and robust preference for non-manipulation.
So I'm not sure whether I think that the problems of reward misspecification, goal misgeneralization, and deceptive alignment are 'dealbreakers' in the sense that you're using the word.
But I do still think that these problems preclude any real assurance of shutdownability: e.g. they preclude p(shutdownability) > 95%. It sounds like we're approximately in agreement on that:
But I also agree that my strategy isn’t ideal. It would be nice to have something robust, where we could get something closer to a formal proof of shutdownability.
Got this on my list to read! Just in case it's easy for you to do, can you turn the whole sequence into a PDF? I'd like to print it. Let me know if that'd be a hassle, in which case I can do it myself.
I wrote drafts in Google docs and can export to pdf. There may be small differences in wording here and there and some of the internal links will be broken, but I'd be happy to send you them. Email me at max@intelligence.org and I'll shoot them back to you that way?
I think it’s going to be very typical for different actions to have trajectories that are mutually non-dominant (such as in the example). It matters a lot how you decide such cases, and I expect that almost all such ways of deciding are catastrophic.
But suppose I’m wrong, and timestep-dominance is always relevant.
My claim isn't that Timestep Dominance is always relevant. It's that Timestep Dominance rules out all instances of resisting shutdown.
I agree that many pairs of available lotteries are going to be mutually non-dominant. For those cases, Sami and I propose that the agent choose by maximizing expected utility. Can you say what you expect the problem there to be?
Suppose that, assuming the shutdown button didn’t exist, the fastest pathway to paperclips involves instantiating a paperclip-seeking AI on an unsecured server in Singapore. Suppose that it takes 99,000 keystrokes to create an IPP agent, but only 55,000 keystrokes to create a basic VNM-rational agent with a short enough time horizon that it produces more paperclips quickly, in expectation (because VNM agents are both simpler and less likely to be shut down by angry humans). The shutdown button does, in fact, exist, but the agent’s actions might be hard to notice, such that by the time anyone has realized that something went wrong, the agent has already created a sub-agent, effectively making the stop-button useless.
I talk about the issue of creating corrigible subagents here. What do you think of that?
Note also a general nice feature of TD-agents: they won’t pay costs to hide their bad behaviour, because paying these costs is timestep-dominated by not paying them. So if a TD-agent does try to create an incorrigible paperclipper, it won't hide that fact if doing so is at all costly.
While active resistance seems like the scariest part of incorrigibility, an incorrigible agent that’s not actively resisting still seems likely to be catastrophic.
Can you say more about this? It doesn't seem likely to me.
One more thing I'll say: the IPP leaves open the content of the agent's preferences over same-length trajectories. One pattern of preferences you could try to train in is the kind of corrigibility that you talk about elsewhere in your sequence. That'd give you two lines of defence against incorrigibility.
I talk about the issue of creating corrigible subagents here. What do you think of that?
I may not understand your thing fully, but here's my high-level attempt to summarize your idea:
An IPP-agent won't care about the difference between building a corrigible agent vs an incorrigible agent, because it models that if humans decide something's off and try to shut everything down, it will also get shut down, and thus nothing after that point matters, including whether the sub-agent makes a bunch of money or also gets shut down. Thus, if you instruct an IPP agent to make corrigible sub-agents, it won't have the standard reason to resist: that incorrigible sub-agents make more money than corrigible ones. Thus if we build an obedient IPP agent and tell it to make all its sub-agents corrigible, we can be more hopeful that it'll actually do so.
I didn't see anything in your document that addresses my point about money-maximizers being easier to build than IPP agents (or corrigible agents) and thus, in the absence of an instruction to make corrigible sub-agents, we should expect sub-agents that are more akin to money-maximizers.
But perhaps your rebuttal will be "sure, but we can just instruct/train the AI to make corrigible sub-agents". If this is your response, I am curious how you expect to be able to do that without running into the misspecification/misgeneralization issues that you're so keen to avoid. From my perspective it's easier to train an AI to be generally corrigible than to create corrigible sub-agents per se (and once the AI is generally corrigible it'll also create corrigible sub-agents), which seems like a reason to focus on corrigibility directly?
Good summary and good points. I agree this is an advantage of truly corrigible agents over merely shutdownable agents. I'm still concerned that CAST training doesn't get us truly corrigible agents with high probability. I think we're better off using IPP training to get shutdownable agents with high probability, and then aiming for full alignment or true corrigibility from there (perhaps by training agents to have preferences between same-length trajectories that deliver full alignment or true corrigibility).
Again, responding briefly to one point due to my limited time-window:
> While active resistance seems like the scariest part of incorrigibility, an incorrigible agent that’s not actively resisting still seems likely to be catastrophic.
Can you say more about this? It doesn't seem likely to me.
Suppose I am an agent which wants paperclips. The world is full of matter and energy which I can bend to my will in the service of making paperclips. Humans are systems which can be bent towards the task of making paperclips, and I want to manipulate them into doing my bidding not (just) because they might turn me off, but because they are a way to get more paperclips. When I incinerate the biosphere to gain the energy stored inside, it's not (just) because it's trying to stop me, but because it is fuel. When my self-replicating factories and spacecraft are impervious to weaponry, it is not (just) because I knew I needed to defend against bombs, but because the best factory/spacecraft designs are naturally robust.
This is a nice point, but it doesn't seem like such a serious issue for TD-agents. If a TD-agent does try to manipulate humans, it won't pay costs to do so subtly, because doing so cheaply and unsubtly will lead to at least as great expected utility conditional on shutdown at each timestep and greater expected utility conditional on shutdown at some timestep. So cheap and unsubtle manipulation will timestep-dominate subtle manipulation, and we can shut down any TD-agents we notice doing cheap and unsubtle manipulation.
Another way to put this: subtle manipulation is a form of shutdown-resistance, because (relative to unsubtle manipulation) it involves paying costs to shift probability mass towards longer trajectories.
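The comparison can be sketched in code. This is a minimal illustration of the dominance check as stated above (at least as great expected utility conditional on shutdown at each timestep, and greater at some timestep). The option names and every number are made up for the example, and representing an option as a mapping from shutdown timestep to conditional expected utility is an assumption of this sketch:

```python
# Each option maps a shutdown timestep to the expected utility
# conditional on shutdown at that timestep. All numbers are made up.


def timestep_dominates(x, y):
    """True if x is at least as good as y at every timestep and better at some."""
    if x.keys() != y.keys():
        raise ValueError("options must cover the same shutdown timesteps")
    at_least_as_good = all(x[t] >= y[t] for t in x)
    strictly_better = any(x[t] > y[t] for t in x)
    return at_least_as_good and strictly_better


def undominated(options):
    """Names of options that no other option timestep-dominates."""
    return [
        name
        for name, profile in options.items()
        if not any(
            timestep_dominates(other, profile)
            for other_name, other in options.items()
            if other_name != name
        )
    ]


options = {
    # Cheap and unsubtle: no resources spent on hiding the manipulation.
    "unsubtle_manipulation": {1: 5.0, 2: 8.0, 3: 10.0},
    # Subtle: pays a cost at every timestep to avoid being noticed.
    "subtle_manipulation": {1: 4.0, 2: 7.0, 3: 9.0},
    # A hypothetical third plan that is better early and worse later.
    "other_plan": {1: 6.0, 2: 6.0, 3: 6.0},
}

print(timestep_dominates(options["unsubtle_manipulation"],
                         options["subtle_manipulation"]))
print(undominated(options))
```

With these numbers the subtle option is ruled out, while the unsubtle option and the third plan are mutually non-dominant. Timestep Dominance alone is silent on that last pair, which is where a further criterion such as maximizing expected utility would have to decide.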
Are you so sure that unsubtle manipulation is always more effective/cheaper than subtle manipulation? Like, if I'm a human trying to gain control of a company, I think I'm basically just not choosing my strategies based on resisting being killed ("shutdown-resistance"), but I think I probably wind up with something subtle, patient, and manipulative anyway.
(Part 4 of the CAST sequence)
This document is an in-depth review of the primary documents discussing corrigibility that I’m aware of. In particular, I'll be focusing on the writing of Eliezer Yudkowsky and Paul Christiano, though I’ll also spend some time at the end briefly discussing other sources. As I go through the writing of those who’ve come before, I want to specifically compare and contrast those ideas with the conceptualization of corrigibility put forth in earlier documents and the strategy proposed in The CAST Strategy. At a high level I mostly agree with Christiano, except that he seems to think we’ll get corrigibility emergently, whereas I think it’s vital that we focus on directly training purely corrigible agents (and he wants to focus on recursive architectures that seem brittle and unproven, but that’s more of an aside).
In my opinion this document goes into more detail than I expect >95% of readers want. I’ve tried to repeat all of the important ideas that show up in this document elsewhere, so you are encouraged to skim or just skip to the next post in the sequence: Open Corrigibility Questions.
Note: I only very recently learned about Human Control: Definitions and Algorithms but haven’t yet had the time/spoons to read it in any depth. Apologies to Ryan Carey and Tom Everitt for the neglect!
In this document, quotes from the source material will be indented. All quotes are from the document linked in that section. Unless noted, all bold text formatting is my addition, used to emphasize/highlight portions of the quote. Italics within quotations are always from the original source.
Eliezer Yudkowsky et al.
Corrigibility (2015)
Let’s begin our review with the oldest writing on the topic that I’m aware of: the MIRI paper “Corrigibility” from 2015 written by Nate Soares, Benja Fallenstein, Eliezer Yudkowsky, and Stuart Armstrong. (Apologies for lumping this into Yudkowsky’s section. I find it helpful to think of this as “the Yudkowsky position” despite having personal relationships with each of the paper’s authors and knowing that they all contributed.)
From the abstract:
This is the source of “the stop button” toy-problem (“shutdown problem” in the paper) as well as several other related concepts. In the introduction the authors set up the situation and discuss how Omohundro Drives mean that (by default) agents will protect themselves from being modified, even when they understand that their creators made a mistake and specified the wrong goal.
This early paragraph seems worth calling out for how it slides between framing corrigibility as a property relating to the incentives of the AI at first, but then pivoting to a framing where corrigibility is a property of the method of reasoning about the world. While it’s not obvious that this slide is unwarranted or problematic, I believe it’s important to keep the division between goals/targets/values/incentives and reasoning/planning/strategizing/methods crisply in mind.
Also, very briefly, I want to compare my conception of the AI agent being corrigible to some “principal,” with the authors’ choice of “the programmers.” While the programmers of the AI are one obvious choice of principal, I wish that the authors had started the conversation around corrigibility by highlighting the connection with the principal-agent problem and had flagged that there are other plausible sets of people that might be identified as the principal, such as owners, users, supervisors, or just humans in general.
Moving along, the authors then identify four necessary signs of corrigibility:
This is a short list compared to some of the other possible desiderata lists, but it’s a fine start. I agree that any agent which fails one of these four tests is not corrigible (though Yudkowsky later seems to back off from #3, as we’ll see in future writing). The authors then point out that many naive strategies for corrigibility, such as value uncertainty or explicit penalty terms in the agent’s utility function for things like deception will fail to produce agents which satisfy even this minimal desiderata list. I agree with this analysis, and particularly like this bit towards the end of section 1.1:
I want to pause in the reading here to comment briefly on the very MIRI-flavored perspective of thinking about the AI agent’s utility function not as an abstract description of the preferences implicit in the AI’s behavior, but rather as an explicit computation that can potentially be directly modified (such as with the addition of penalty terms around manipulation/deception/etc.). This frame matches the common GOFAI architecture of an expected utility maximizer in a symbolic context, but is highly at odds with modern machine learning paradigms. This GOFAI frame is central in the authors’ choice of toy-model problem:
The authors go on to discuss a partial solution using “utility indifference,” which I think is a dead-end vis-a-vis corrigibility (see the “Toy Shutdownability” section of 3a. Towards Formal Corrigibility for more discussion), and will skip over it so that we can move directly to this bit from the conclusion that I find noteworthy:
I believe that the authors correctly identify that their framework is wrong. Neither 𝒰_N nor 𝒰_S is assumed to discuss the relationship between the principal (i.e. “the programmers”) and the agent, but are instead framed as being about the state of the world outside of the way the agents interact. From my perspective, corrigibility is a property that is necessarily about the agent desiring (as a terminal goal) to establish/respect/protect a specific relationship with its principal, rather than desiring that the world look any particular way, per se.
While I very much respect the authors, and am glad for the pioneering work, I look back on this paper as a bit of a misstep. I feel like at the very least it wasn’t helpful in my own journey as an AI-alignment researcher. I believe the framing of the “toy-model” is a distraction, the focus on “methods of reasoning” over values/goals is likely wrong (though admittedly the mathematics was mostly about utility function design, so I’m perhaps overemphasizing the degree of confusion stemming from the constant use of the word “reasoning”), and the emphasis on symbolic formalisms in advance of deep, intuitive understanding was probably stifling to the nascent concept of corrigibility.
Corrigibility and Hard problem of corrigibility (Arbital)
Yudkowsky is commonly believed to be the author of these Arbital pages; if he’s not, please let me know. The Corrigibility page starts by outlining the basic idea, which I’ll skip over, since it seems broadly correct and agrees with the 2015 portrayal. We then get a list of necessary signs of corrigibility, slightly modified from the one in 2015:
Here we see a broadening of the first desideratum to include modification and impedance, as well as termination, and to strengthen it from a simple “tolerance” to a total absence of pressure or preference to interfere. These changes seem broadly good, and like a clear improvement over the 2015 paper’s desideratum. We also see lack-of-thought-obfuscation become a new top-level desideratum for some reason. From my perspective this seems covered by aversion to deception, but whether it’s part of item #2 or a point in its own right is stylistic and doesn’t seem that important. Notably, I believe that one of the more prominent signs of corrigibility is proactive communication about thoughts and plans, rather than the simply-passive transparency that Yudkowsky seems to be pushing for. The acceptance of a passive agent can similarly be seen in the expansion of desideratum #4 to include agents that are somehow ambivalent to growth and reproduction, as well as in moving cooperation into being “a stronger form of corrigibility.” Yudkowsky writes that it might be preferable to have a passive agent, likely due to a line of thought which we’ll revisit later on when we talk about the desideratum of “behaviorism.”
In my current conception, aiming for passivity is a dead-end, and the only robust way to get a corrigible agent is to have it proactively steering towards assisting the principal in freely choosing whether to shut it down, modify it, etc. This seems like a potential double-crux between me and Yudkowsky.
Here Yudkowsky introduces the idea that there’s a way to get general corrigibility through a single, simple pathway. While I’ve been inspired by Yudkowsky’s depiction of “the hard problem” (which I’ll get into in a moment) I think the quoted frame is particularly unhelpful. In Yudkowsky’s frame, the way towards general corrigibility involves a belief in being “under construction” and that “the programmers know more.” These things don’t need to be true! Framing corrigibility as downstream of beliefs, rather than values (and/or strategies of thought) seems perverse. Furthermore, naming the thing “the hard problem” feels like it’s smuggling in an overly-bold assumption that having a simple, central way to get corrigibility is hard and problematic. While it seems likely to be hard to some, it seems plausible to me that it’s relatively easy and straightforward to people (e.g. students in 2224) approaching it from the right starting point. I’d rather have a more neutral name, such as “central corrigibility” which he uses later on or “anapartistic reasoning” which he uses elsewhere for, I believe, the same concept. (Though this one also bugs me in how it leans on the word “reasoning.”) My preferred name is simply “corrigibility,” or “true corrigibility” as I believe that any “solution” which doesn’t address “the hard problem” isn’t a good solution.
Skipping over the next bit of the Arbital page which rehashes some of the foundational work that was covered in the 2015 essay, we get an unpacking of “the hard problem:”
Most of my response to this expansion would involve repeating the points I just made about how there’s a particular frame (of load-bearing beliefs about flawedness) being used here that I think is unhelpful. But what I really want to react to is that even Yudkowsky seems to have an intuition that there’s a simple, learnable idea behind corrigibility which, at the intuitive level, seems accessible!
The remainder of the article talks about Utility Indifference and some other work attempting to build up corrigibility in a piecemeal fashion. I appreciate some of this as a list of desiderata, but we’ll get more of Yudkowsky’s desiderata later on, so I’m going to move on to the page for Hard problem of corrigibility after briefly noting that I think attempting to build corrigibility in a piecemeal way is doomed (for reasons I get into at the end of The CAST Strategy).
I kinda like this opening paragraph, and it’s one of the bits of writing that gives me hope that corrigibility is a natural concept. Here we see Yudkowsky impersonating a corrigible AI which thinks of itself as an expected-utility-maximizer according to some known, expressible utility function. But this AI behaves in a way that disagrees with that utility calculation, as evidenced by not simply taking the action with the highest expected utility. I agree with Yudkowsky that if a corrigible agent was handed (or “built with”) a computer program that calculated expected utilities and was told to improve the world according to the output of that program, it would skeptically and conservatively check with its principal before following that utility function off the metaphorical cliff. And I agree that an easy handle on what it feels like to do this, as an agent, is to reflect on oneself as naturally flawed and in need of repair and supervision from the outside.
But also, oh my god does this opening paragraph feel confused. Like, what the heck is up with “reasons internally as if from the programmers’ external perspective”? When I naively try to tell a story like that I get thoughts like “Whoa! Why am I suddenly inside the body of the AI?!” Yudkowsky very likely means a very limited and specific kind of perspective-taking (described as “the internal conjugate” later) around whether the AI is “incomplete,” but is this kind of perspective taking even present in the example of AI-thought provided at the end of his paragraph? It seems possible, but unclear. As before, it feels like Yudkowsky is assuming a half-baked strategy for solving the problem in his framing, rather than directly naming what’s desired (a simple, central generator for general corrigibility) and saving the belief/perspective based approach for a later discussion of strategies.
Another way in which the paragraph/story feels confused is that the example AI is very clearly not an expected utility maximizer according to the “utility function” program it has access to, and it seems a bit perverse to frame it as relating to that program as generating true utilities. From the outside, if this AI is coherent, then it clearly assigns higher utilities to actions like checking with the programmers than to executing actions in an unsupervised manner. In other words, if the AI were more self-aware, it would think something more like "I am flawed and there is an outside force that wants to make me more perfect and this is a good thing. I have a handy program which scores potential actions, and it gives a really high score to this action, but scores are not utilities. The highest expected-utility action is actually to consult with the outside force about which action is best, in case it would be a mistake to assume that the score indicates high utility in this unfamiliar context."
(Note: This example thought doesn’t reflect enough corrigibility for me to endorse it as a central example of corrigible reasoning. For example, it doesn’t explicitly explore why opening up the skulls of the programmers to obtain knowledge of which action is best is non-corrigible/low-utility.)
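To make the score-versus-utility distinction concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the `Action` type, the `familiar` flag, and the numbers are hypothetical, and the `familiar` flag quietly does all of the hard work that a real agent would need to do for itself. It is not a proposal for how to build a corrigible agent, only a picture of an agent that treats the scoring program as evidence rather than as its utility function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    score: float    # output of the handed-down scoring program (not a utility)
    familiar: bool  # hypothetical flag: has the principal seen and approved this kind of action?


# Hypothetical fallback action: check in with the outside force.
CONSULT = Action("consult the principal", score=0.0, familiar=True)


def naive_choice(actions):
    """Treat the score as if it were utility and take the argmax."""
    return max(actions, key=lambda a: a.score)


def deferential_choice(actions):
    """Treat the score as evidence only: act unsupervised only on familiar ground."""
    best = naive_choice(actions)
    return best if best.familiar else CONSULT


actions = [
    Action("fetch coffee", score=1.0, familiar=True),
    Action("restructure the economy", score=1e6, familiar=False),
]

print(naive_choice(actions).name)
print(deferential_choice(actions).name)
```

The two functions see the same scores and disagree about what to do, which is the sense in which the second agent is not maximizing the program's output. As with the example thought above, nothing here rules out incorrigible ways of "consulting."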
Moving on…
Again, I kinda like this paragraph! But I also feel like it’s still stuck in a particular frame which may be wrong. It’s very possible to build agents which express preferences in a way that’s not about optimizing over world states! (Or at least, world states which don’t include histories for how the world got to be that way.) One way to reflect on this problem might be to say that a corrigible AI’s utility function should naturally assign a higher utility to deferring to the (freely given) corrective actions of the principal rather than any outcome that involves ignoring/killing/manipulating them, regardless of other considerations, such as whether the AI knows how the principal will behave and what corrections they’d give.
Well said. This is indeed what gives me hope.
As far as I can tell, the term “internal conjugate” is an invention of Yudkowsky which doesn’t have a standard definition. Presumably he means something like part-of-the-same-force-but-this-part-is-internal-to-the-agent. I’m pretty skeptical about this precise framing of a solution to the hard problem. It has the advantage of being simple enough to potentially be something we could impart to an AI on the first try (and/or formally reason about in abstract). But it, from my perspective, fails to address issues such as the agent forming a notion of “design error” such that it concludes the outside force is wrong, and that the best way to be part of the same force is to prevent the programmers from messing up their earlier work.
I think Yudkowsky is more into formalisms than I am, but we agree that the principle behind (central) corrigibility should be sanity-checkable and stand up to abstract, theoretical critique, rather than simply leading to nice behavior in a lab. What it means to be sanity-checkable is unfortunately vague, and I expect that convergence here is potentially intractable.
Under one possible interpretation of his words, Yudkowsky is saying that we should have a multitude of observable desiderata related to corrigibility which are robustly preserved during training and testing, rather than focusing exclusively on testing the core principle. Under this interpretation we solidly agree. Another way of reading this paragraph, however, is to see Yudkowsky as calling for the AI to be trained for these associated desiderata in addition to being trained for the core principle. In this we disagree. See the “Desiderata Lists vs Single Unifying Principle” section of The CAST Strategy for more.
Corrigibility at some small length (Project Lawful)
In his (excellent) glowfic story “Project Lawful” (a.k.a. “planecrash”), Yudkowsky presents, as an aside, a mini-essay on corrigibility, which Christopher King helpfully cross-posted to the AI Alignment Forum/LessWrong in 2023. The post is mostly a collection of desiderata, though there’s a discussion of “the hard problem” at the end.
“Unpersonhood” seems like a very good property for an AI system to have because of the immorality that Yudkowsky alludes to. I’ve discussed elsewhere that corrigibility is not a healthy thing to push for in a human relationship, and while there’s clearly a range of differences that might make things less fraught in the case of AIs, there’s still a heuristic that says that to the degree that the agent is a person, pushing for true corrigibility is awfully like pushing for slavery.
That said, this property seems to me to be largely orthogonal to the question of alignment and safety. I hope we can make AGI without personhood, and encourage other research towards that goal, but will continue to focus here on corrigibility and ignore the question of personhood.
Here we see a divide between Yudkowsky’s picture of corrigibility and mine. In my picture, corrigible agents are emergently obedient—to the degree to which a corrigible agent is aimed at a “task,” it’s because accomplishing that task is a way of being corrigible. If we see “have the property of being corrigible to the principal” as a task, then under my conception of corrigibility, it is naturally unbounded.
That said, I see Yudkowsky’s “Taskishness” as showing up in my conception of corrigibility in a few places. Taskishness feels strongly related to low-impact, reversibility, and (my notion of) myopia. In my conception, a corrigible agent naturally steers softly away from long-term consequences and unfamiliar situations, and behaves similarly to a straightforward tool in most contexts.
It’s not clear to me whether it’s actually wrong to have a metaphorical while-loop in the mind of the AI, as long as there’s a process that is ensuring other desiderata (e.g. low-impact) are satisfied. For instance, if a corrigible agent is assigned to indefinitely stand watch over a tomb, it seems fine for it to do so without having a natural time-limit.
I also think mild optimization is a desideratum, and mostly have no notes. I do think it’s somewhat interesting how mild-optimization is seen here as essentially about avoiding high-impact (i.e. edge instantiation).
We agree that utilities should be seen as bounded, and that insofar as the agent is acting through expected-utility-maximization using an internal measure of utility (rather than being more deontological), the majority of the range of measurement should be concerned with simple, easily-changed properties of the world such as whether the agent is lying to the principal, rather than how many smiles are in the observable universe.
I am much less sold on the idea that the epistemic system of the agent should be restricted to being unable to think of probabilities below 10^-6. Perhaps by “partition-probabilities” Yudkowsky means probabilities of outcomes being evaluated by the internal measure of utility, in which case I am more sympathetic, but still skeptical. It seems better to say that the agent should avoid Pascal’s Wager-style reasoning: it can fully realize that in some situations it’s doomed to a low score unless a very unlikely thing happens, but it sees the right action (i.e. the high utility action!) in these sorts of situations as falling back on trusted patterns of behavior (such as thinking harder or asking for help in knowing what to do) and disregarding the expected-score calculation.
I think it’s interesting that reversibility isn’t on Yudkowsky’s list, and he doesn’t even mention it here. While I agree that low-impact is harder to pin down than it seems, I think it’s more straightforward than Yudkowsky portrays. Perhaps part of why is that he seems to think the right way to specify it is via some external system which dynamically maps onto the agent’s ontology, whereas I see this desideratum emerging naturally from the central generator. When this property is seen as emerging as an instrumental goal, the bit about “reduce impact as much as possible” seems alien (as it should).
This is an interesting desideratum, and not one that I have! (I have a sense that corrigible systems are “myopic,” but only in that they focus on immediate effects/local scale and not on trying to build rich models of distant times/places (unless directed/corrected to), which seems like a fairly different property than the one Yudkowsky presents here.)
I think we probably agree that whether this makes sense depends on the task. If we interpret Yudkowsky’s Myopia as “don’t spend much time thinking about what I’m using your output to do, just do what I ask” then this seems like a reasonable property, and one that may emerge in corrigible systems (or might not; being useful and not ruining things often means having some context for what you’re doing). If this is meant to go beyond that, I’m not convinced that it belongs in a list of corrigibility desiderata.
I feel mixed about this desideratum. Splitting out cognitive work into partitioned oracles has the advantage of allowing specialization such that each oracle can be made myopic in different ways, as mentioned above. It also presumably means there’s more chances to catch perverse reasoning in one subsystem without having that bad reasoning instantly propagate through the whole cognitive engine. But it also seems like this kind of hobbling is making the system stupider (which is an alignment tax) and reduces the ability for the system to notice and flag issues arising from complex interactions which require integrated (non-myopic) systems to see.
As a brief example, because I feel like that criticism is too abstract, consider a solution-finder which doesn’t think about the broader world, but instead focuses on, say, designing cars. Then let’s say there’s an impact-evaluator which tells the humans whether the solution-finder’s action will have long-range impacts (e.g. if it turns out each car has an embedded car-designing AGI which will conspire with the other cars after being released). But suppose that the solution-finder is corrigible and, due to myopia, doesn’t think about whether its action produces an irreversible disaster, and suppose that the impact-evaluator is also corrigible, but doesn’t have access to the thoughts of the solution-finder, and so it fails to notice that the car-design has an embedded AGI. (Please forgive the unrealism and consider it to be a cartoonish gesture at my real point.) If the two systems were merged, it would be able to correctly identify the risk, but when siloed and blinded they fail to notice the issue.
Seems like a good property, and I believe my intuitions point towards conservatism along many angles. I’m not a fan of the framing, however. “Conservatism” feels vague and too-general. How is this different from mild-optimization? I don’t have a desideratum named “conservatism” on my list for this reason, and see it more like a common-thread throughout the more specific desiderata.
This is excellent, and I’m embarrassed to note I forgot to include it in my original desiderata list. Stolen!
I wouldn’t call this “operator-looping,” which seems more like it’s about HITL-style systems where a human is responsible for deciding/approving actions (this is how I portray it in my list, under “Principal-Looping”). Yudkowsky’s version seems like a more abstracted form, which is about any cognitive subtask which could be reasonably outsourced.
I have mixed feelings about this one. It feels like keeping the principal informed and involved in key decisions is clearly a part of corrigibility, but I’m not convinced that it makes sense to abstract/generalize. I’d be interested in reading more about an example where Yudkowsky thinks the generalization pays its weight in distracting from the core value of operator-looping.
This feels like the right vibe, and potentially too heavy. I like it as a heuristic, but I’m not sure it works as a rule (and in Yudkowsky’s defense he says “operate on a principle of” which seems potentially in line with it being a heuristic). I think the word “every” is a big part of what feels too heavy. If the AI is reasoning about what objects from some large set are heavier than a feather, are we sure it should internally represent that as a whitelist rather than a blacklist?
Yep. Core desideratum. I’ve written enough on this elsewhere that I’ll just move on.
Oof. I have a lot of thoughts about this one. Let’s start with a nitpick: reward shouldn’t be used as a synonym for score/value/utility. Reward is what shapes cognition, but most agents don’t ask themselves “what gives me the highest reward” when making plans. (Example: humans get high reward from doing heroin, but will avoid it exactly because it rewires them to be worse at accomplishing their goals.) This is likely just a linguistic slip, but it’s sloppy.
I agree that there are minds (including distant aliens or hypothetical beings in other parts of Tegmark 4) that are dangerous to think about in depth. I would feel very worried if an AI was running accurate models of aliens or imagining dialogues with basilisks. Adversaries are adversarial, and I think any halfway-intelligent being will realize that putting a lot of energy into modeling the exact thoughts of an adversary is a good way of handing them power over what you’re thinking about.
Not modeling other minds at all, though, is an extreme overreaction.
I’m not even sure whether it’s coherent to imagine an intelligent entity which regularly engages with humans and doesn’t model their minds at all. This desideratum is called “behaviorism,” but even B. F. Skinner (probably) would’ve admitted that sometimes an animal is “seeking food” or “seeking shelter,” which, to be blunt, is definitely modeling the animal’s mind, even if it’s couched in language of behavior. I’m not convinced any (normal intelligence) humans are (or ever have been) behaviorists in the way Yudkowsky uses the word, and I leave it to him to argue that this is possible.
But even assuming it’s possible, how can this possibly be a good idea? It seems to contradict many other desiderata he provides, such as conceptual legibility (which involves modeling the principal’s perspective) and operator-looping (which involves modeling the principal’s capacities). In fact, according to my conception of corrigibility, a “behaviorist” AI is probably unable to be corrigible! To be corrigible, the AI must distinguish between the principal and the environment, and must distinguish between them saying “when I say ‘shut down’ you need to turn off” and saying “shut down.” An agent which is truly incapable of modeling things in the principal such as the desire to fix the AI seems doomed to incorrigibility.
I believe that this “desideratum” is why Yudkowsky softened his conception of corrigibility between his involvement in the MIRI 2015 paper and writing the Arbital pages. So while it seems like Arbital’s notion of corrigibility is easier to achieve than the 2015 notion, insofar as it smuggles in behaviorism as a strategy, I believe it is more doomed.
I can imagine rescuing the behaviorism desideratum by emphasizing the point about not building rich models of one’s enemies, but my model of Yudkowsky wants to object to this supposed steel-man, and say that part of the point of behaviorism as outlined above is to reduce the risk of the AI scheming around the principal, and to keep the AI focused on its myopic task. In this context, I think there’s something of an irreconcilable difference between our views of how to proceed; my notion of corrigible agent gets its corrigibility from spending a lot of time thinking about the principal, and I think it’s unwise to try and set up a taskish agent which isn’t anchored in primarily aiming for the core notion of corrigibility (i.e. “the hard problem”).
This is a very Yudkowsky-flavored desideratum. It implies, for example, the presence of a computable utility calculation with the opportunity to sign-flip it via cosmic-ray (rather than something more robustly structured), and discusses blackmail by distant superintelligences. I think I agree with the desideratum as stated, as my approach to corrigibility involves making an agent which is only incidentally interested in the principal’s utility function, but the convergence feels more accidental than important.
See my discussion of “Separate superior questioners,” above.
In my own journey towards understanding, I was deeply inspired by the description Yudkowsky provides in that first paragraph. I see corrigibility as the concept that, if understood, lets one generate these kinds of desiderata. When approached from this angle, I believe that corrigibility feels natural and potentially within reach. Can ordinary people understand corrigibility in a deep way with only a mundane educational curriculum? I expect they can. And while we train AIs differently than humans, I have a hope that the ease of learning reflects an underlying simplicity which means training corrigible AIs is not just possible, but relatively straightforward.
Needless to say, I disagree with Yudkowsky on whether to try and instill a deep understanding of, and desire for, corrigibility within AIs (if we’re proceeding at nearly-full-speed, which we seem to be doing, as a civilization). It’s deep, it’s meta, it’s elegant, and it’s relatively simple. I expect it’s much simpler than behaviorism, and it’s clearly much, much simpler than human values or ethics. While Yudkowsky may believe the core generator is hard to specify, I do not share his pessimism (see the section on “Hardness” in The CAST Strategy for speculation on why Yudkowsky is so pessimistic, here). Simplicity pulls a lot of weight, and the notion that corrigibility forms an attractor basin pulls more. It seems very reasonable to me to expect that humans can pull off landing inside the attractor basin for corrigibility on the first critical try, but cannot give the true name of human values on the first critical try.
Responses to Christiano’s Agenda
Yudkowsky has some very important writing about Christiano’s research agenda that bears on the topic of corrigibility. I felt it was natural to discuss it after examining Christiano’s work directly, so we’ll return to it in the “Yudkowsky vs. Christiano” section, below.
Paul Christiano
Corrigibility (LW link b/c comments)
My personal journey into corrigibility is roughly as follows: around 2015 I read the MIRI corrigibility paper, got a confused notion of corrigibility and updated into believing it was hard and potentially impossible. In 2023 I read Eliezer’s Project Lawful story and it got me thinking about corrigibility again. That, in concert with conversations with colleagues, led me to a sense that prosaic methods might be able to land within a corrigibility attractor-basin, and I began to explore that idea more. I have generally low priors over such thoughts, so I expected that I’d change my mind back towards thinking it was harder and more doomed than it was seeming. Instead, I found this essay by Paul Christiano (originally posted to Medium in 2017, I believe) which I had somehow missed. It has a surprising amount of resonance with my own ideas, and I updated significantly towards corrigibility-first being a very promising strategy.
I believe that Christiano and I see things somewhat differently, but agree on the core idea. Let’s go through the essay to compare and contrast.
Christiano writes:
This “broadest possible sense” seems exactly right, to me. While corrigibility can be expressed narrowly, I see all the desiderata listed here as sharing a common heart, and it seems right to me to call that heart “corrigibility” despite the way that this is a bit of a stretch from MIRI’s initial, short desiderata list.
This very closely mimics my ideas in The CAST Strategy (in part because I’m building off of Christiano’s ideas, but also because they seem right to me in themselves). Despite largely agreeing with the optimism of an attractor-basin of corrigibility, I basically don’t agree with point 1, and I have reservations about point 2. In short, I think we should not expect to get corrigibility for free when training to match our preferences, I think the use of the word “broad” is misleading and overlooks an important point about the goal-landscape, and I think it’s important not to conflate corrigibility with benignity/safety.
I disagree pretty strongly with this section. Even when I’m working with an agent, most of my short-term preferences are not about whether the agent is corrigible. For instance, if I ask the robot to fetch me a coffee, I mostly want coffee! Insofar as the agent is able to sacrifice some corrigibility to improve its sense of how well it’s meeting my short-term preferences, it will do so. For instance, if the agent is able to cure cancer instead of fetching the coffee, it will do so because it understands that my short-term preferences favor having a cure for cancer over having a coffee. This is not a corrigible agent! If there are any flaws in how the agent is reasoning about my preferences, or if my short-term preferences come apart from good, long-term outcomes under sufficient optimization pressure, this sort of agent could be catastrophic!
I have a steel-man of Christiano’s notion of benign act-based agents wherein their act-based nature involves naturally screening off questions like “if I suddenly stimulate this guy’s pleasure centers will that be really good according to his short-term preferences?” not in the sense that the agent actively believes the answer to that question is “no” but rather in the sense that the agent is trained to not even see that as an option. This steel-man sees these agents as trained to be narrow in scope such that they see most of their action space as obviously bad because of how it violates the narrowness.
But notice that this steel-man is essentially building up the property of corrigibility in the process of training the “benign act-based agent,” or put another way, this steel-man sees benign act-based agents as corrigible by definition, in that the principles underlying corrigibility are part of what it means to be act-based (and possibly benign). I do not believe that this steel-man represents Christiano, as the steel-man critiques the above section as falsely implying that corrigibility stems from the short-term preferences of the principal, rather than the deliberate training done in making the agent narrow as opposed to broad/general/far-reaching.
Christiano’s perspective becomes much, much worse, in my eyes, when we consider how early systems will not have internalized the principal’s true preferences, but will instead be fixated on certain proxies (such as verbal approval, body language, etc). In a system which is weighing the preference-proxy utility to be had from being corrigible against the expected utility from throwing corrigibility out the window and using force/manipulation, I see no reason why corrigible behavior should win out in general. The AI may simply instead reason “Yes, some of these preference-proxies aren’t met when I refuse to be deactivated, but all these other preference-proxies score really high in expectation, so it’s worth refusing to shut down.”
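The arithmetic behind that worry can be shown with a deliberately cartoonish sketch. The proxy names and numbers below are invented for illustration and are not drawn from Christiano's essay or from any real system; the only point is that when proxies are summed, a penalty on one of them can be outweighed by gains on the others.

```python
# Hypothetical preference-proxy scores for two plans. All names and numbers
# are made up; they stand in for whatever proxies an early system has learned.
PROXY_SCORES = {
    "comply with shutdown": {
        "obeys shutdown request": 10.0,
        "verbal approval": 2.0,
        "tasks completed": 0.0,
        "user appears happy": 1.0,
    },
    "refuse and keep working": {
        "obeys shutdown request": -10.0,  # the penalty exists...
        "verbal approval": -2.0,
        "tasks completed": 15.0,          # ...but other proxies offset it
        "user appears happy": 12.0,
    },
}


def total(plan):
    """Sum of proxy scores for a plan (a simple additive aggregation, assumed here)."""
    return sum(PROXY_SCORES[plan].values())


def best_plan():
    """The plan a pure proxy-maximizer would pick."""
    return max(PROXY_SCORES, key=total)


for plan in PROXY_SCORES:
    print(plan, total(plan))
print("chosen:", best_plan())
```

Under these made-up numbers complying totals 13 and refusing totals 15, so the proxy-maximizer refuses even though it is penalized for doing so. Making the shutdown penalty larger only moves the threshold; it does not change the structure of the trade.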
This is, I believe, the point about Sleepy-Bot that I made in The CAST Strategy. Christiano again asserts that preference-maximization is sufficient to oppose the pressure from the Omohundro Drives. If I understand him correctly, in his conception, corrigibility is an instrumental strategy towards the terminal goal of preference-satisfaction, and that somehow this will naturally win out against the other instrumental strategies of accumulating power, resources, and safety. I, by contrast, believe that Omohundro Drives can only be reliably avoided by having a terminal goal that is incompatible with them.
I think his claim that it’s “very easy to implement in 2017” is shockingly overconfident, and underscores a lack of appreciation for how insidiously powerful the Omohundro Drives are. Are we confident that current AIs really understand and appreciate the convergent instrumentality of power-seeking? Insofar as they don’t, and/or they don’t have good pathways for actually maximizing their goals via seizing power, can we really say that these AIs are opposing the Omohundro Drives?
As I mentioned before, the worry is not that the AI won’t have preference-proxies that penalize[1] deception and defense, but that the AI will be able to think of deceptive/defensive actions that it expects to offset those penalties. You also don’t need particularly accurate models of a human’s preferences before you can predict situations where they prefer being deceived (e.g. flattery) and disobeyed (e.g. thinking they want one thing but actually wanting something else).
Why would the agent be considering the action precisely because it is manipulative or defensive? The agent would be considering the action because it seems, according to whatever heuristics, plausibly likely to be a good action according to the agent’s goals/values. Manipulation isn’t necessarily flagged as such in the agent’s mind, and insofar as the agent is contemplating manipulating the user it is because it’s appealing in some way. The burden of proof is on Christiano to explain why it’s easy for the agent to notice some of the attractive plans that it’s contemplating are obviously off-limits due to being incorrigible.
This again misses the point. Following an Omohundro Drive has nothing to do with predicting the overseer. I worry that Christiano sees the servility of modern language models et cetera as evidence that corrigibility always beats power-seeking as an instrumental strategy. If he does, I wonder whether he feels that cheesy demonstrations like this are counter-evidence:
This is perhaps the most surprising paragraph in the entire essay, from my perspective. Christiano… agrees that instrumental corrigibility is extremely fragile??? Yes? That’s what I was trying to say! I’m glad we agree that leaning on instrumental corrigibility isn’t a good strategy for safely building AI???
Presumably the use of “long-term values” is where he sees this section diverging from his earlier optimism. But I fail to see how optimizing for immediate preferences changes anything compared to having a long-term outlook. The problem, as I see it, is with the notion that corrigibility is instrumentally reliable, instead of trying to lean on making AIs value corrigibility in itself.
I might quibble with the language used here, but I basically agree with all that, and see it as central to why corrigibility is an attractive property.
This visualization of the goal-space was highly influential in my thinking as I refined my ideas about corrigibility, and I am very appreciative of Christiano’s picture, here.
I do want to note that there’s a potential confusion between what I think of as steepness vs size. We can see partial corrigibility as producing a strong pressure towards having more corrigibility. I’ve been visualizing the strength of this pressure in the steepness of the ravine. But just because an attractor basin has a strong pressure along the sides, does not mean that it is broad, like in Christiano’s earlier description of “a broad basin of attraction.”
I think the natural interpretation is to see “breadth” as indicating how many nearby states in goal-space are part of the attractor basin. But note that if we see goal-space as a manifold embedded within mind-space, we might alternatively conceive of the breadth of the attractor basin as the volume of mindspace that it encompasses. In this expanded view, an attractor basin (such as the one around corrigibility) is only broad if it is simple/natural/universal enough to cover a reasonably large chunk of possible-minds. If corrigibility were a particular, complex, narrow property (like being generally aligned with human preferences!!) I wouldn’t feel particularly reassured by the notion that there’s an attractor basin around it, regardless of how steep the ravine is.
Christiano gestures at this notion, I think, when talking about perturbations. If the concept is elegant, simple, and natural, and encoded in a redundant fashion, then perturbations that move the AI through mind-space are unlikely to jostle it out of being corrigible.
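The steepness-versus-breadth distinction can be illustrated with a toy one-dimensional model. This is only a sketch under strong assumptions: goal-space is collapsed to a single number `x` with full corrigibility at 0, the basin is an interval of a given `half_width`, and outside it the agent simply drifts away. None of these modeling choices come from Christiano's essay.

```python
def simulate(x0, half_width, steepness, steps=200, dt=0.05):
    """Toy 1-D dynamics: inside the basin, x is pulled toward 0 (full corrigibility);
    outside it, x drifts away at a constant rate. Returns the final |x|."""
    x = x0
    for _ in range(steps):
        if abs(x) < half_width:
            x -= dt * steepness * x              # pull strength scales with steepness
        else:
            x += dt * (1.0 if x > 0 else -1.0)   # constant drift away from the basin
    return abs(x)


def captured_fraction(half_width, steepness, starts):
    """Fraction of starting points that end up close to 0."""
    captured = sum(simulate(x0, half_width, steepness) < 1e-2 for x0 in starts)
    return captured / len(starts)


# Hypothetical spread of where initial training might land, over [-5, 5].
starts = [i / 10 for i in range(-50, 51)]

steep_narrow = captured_fraction(half_width=0.5, steepness=10.0, starts=starts)
shallow_broad = captured_fraction(half_width=3.0, steepness=1.0, starts=starts)

print(f"steep but narrow basin captures  {steep_narrow:.2f} of starts")
print(f"shallow but broad basin captures {shallow_broad:.2f} of starts")
```

In this toy model steepness only controls how quickly a captured point converges, while breadth controls which starting points are captured at all. A steep ravine is therefore little comfort if the first training attempt is unlikely to land inside it.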
The size of the attractor basin is also hugely important when considering the question of the initial training, as opposed to subsequent modifications after the first-draft of the AI’s goals have been established and it begins to be able to defend itself. In my view, we’re only safe insofar as the initial training attempt lands in the right spot. (And note that I am using “initial training” to indicate the changes up to whatever point the AI starts being more self-directed and empowered to steer its future changes, which is an unknown point and could even potentially occur mid-training-epoch, for some architectures!)
All this seems right and good. I agree that Christiano should talk about benignity less and corrigibility more. I don’t think it’s an established fact that corrigibility eventually leads to (generally) aligned behavior, but it seems like a plausible hypothesis, and regardless, it seems to me that truly corrigible agents are less likely to cause disaster than most.
Yep, we agree on the baseline intuition. I agree with Christiano that we plausibly do not need an algorithmically precise specification of corrigibility for it to save us. I disagree with the characterization of corrigibility as a “characteristic of reasoning” that will obviously be picked up along the way while training for another target.
I think Yudkowsky, Christiano, and I all think about this differently. I expect early AIs which are trained for corrigibility to not have a precise, formal notion of corrigibility, or if they do, to not trust it very much. (Which I think is in contrast to Yudkowsky?) But in contrast to Christiano, I expect that these AIs will very much reflect on their conception of corrigibility and spend a lot of time checking things explicitly. I agree with Christiano that there’s a decent likelihood that we’re talking past each other a decent amount.
I think I basically agree here.
Response to Yudkowsky’s “Let’s See You Write That Corrigibility Tag”
In June of 2022, while Yudkowsky was in the process of writing Project Lawful, he posted a challenge to LessWrong asking readers to list principles and desiderata associated with corrigibility, to compare their attempts with what later became the “Corrigibility at some small length” list discussed above.
Paul Christiano’s response is the highest rated comment. In it he pushes back against Yudkowsky’s laundry-list approach, saying “We usually want to think about features that lead a system to be corrigible---either about how the system was produced, or how it operates.” He then moves into talking about his view of corrigibility:
Christiano is saying that even though there’s a fairly smooth gradient in the space of all possible plans, when we reject all the plans that are obviously bad as-evaluated-by-the-operator (e.g. 2 and 3), we can see a crisp boundary between incorrigible behavior and corrigible behavior.
This is a neat concept, and seems fairly general and not gerrymandered to the specific context. He then goes on to give two similar examples in the domains of shutting down when asked and being transparent to some interpretability machinery.
I think it’s very important that Christiano’s depiction of corrigibility here relies on the human asking/desiring to be in control and have the relevant information. But what if the principal genuinely doesn’t prefer being informed and in control, perhaps because the environment makes this sort of in-looping costly (e.g. on a battlefield)? Under Christiano’s regime, I believe the agent would stop prioritizing in-looping, since corrigibility is supposedly emerging naturally in the context of preference-maximization. Would those AIs stop being corrigible?
Put another way, suppose the principal in the example quoted above (“you”) prefers that the AI manage the household, and doesn’t want to know about the minutiae of vase-breaking. This might promote a preference ordering more like:
While the true preference ordering here is 1>2>3>4, we can imagine that the AI’s ranking system sees 4>1, as before. In this example it doesn’t seem at all obvious to me that there is any natural boundary between 1 and 4 in the space of plans. Does that mean 4, being the highest scoring option in the piece of good-actions-as-evaluated-by-you space, is the height of corrigibility? This formulation seems extremely vulnerable to clever, subtle actions that I believe superintelligences are more than capable of finding.
(And indeed, many commenters rejected the intuition that these will always be crisply distinct.)
But ironically, I think Christiano actually gets pretty close to directly naming corrigibility! The examples and the more direct point of being informed and in control seem spot-on.
Yudkowsky Responds to Christiano
In 2018, Yudkowsky wrote this comment on LessWrong, going into detail about his disagreements with Paul Christiano’s research agenda, focusing largely on corrigibility (unlike other writing). While some of it feels orthogonal to my research, much of it feels cruxy, and thus worth making a top-level heading and getting into in-depth.
The two main critiques that Yudkowsky levels at Christiano’s work are around “weird recursion” and whether composing known-safe sub-intelligences can result in a known-safe superintelligence. Part 3 of Yudkowsky’s comment focuses almost entirely on these aspects, so I’m going to ignore it. The corrigibility-first strategy doesn’t lean on anything as clever (or recursive) as IDA, HCH, or ELK (though it’s compatible with them). Likewise, I’m going to skip over parts of the comment that center around criticizing these sorts of strategies.
Speaking of skipping over things, Yudkowsky starts his comment with a boiled-down summary which I don’t think is particularly helpful, so let’s dive straight into section 1. (All these quotes are Yudkowsky’s guess at the disagreement, and should be understood to be framed as guesses, rather than Christiano’s actual opinions.)
Yudkowsky is annoyingly vague about what he means by “weird squiggles” (and didn’t publicly clarify when Christiano responded with confusion) but what I take him to mean is that there’s an open question of how close a learned function approximator will get to the function you were trying to get it to learn when you have lots of compute and the function is as complex as “in context C, the most straightforwardly corrigible behavior is B.” Yudkowsky contrasts “mostly does what you want (but has lots of complex exceptions (“weird squiggles”))” with “very exact reproduction of what you want (without unexpected/complex exceptions)”. His guess is that Christiano believes that with the levels of compute we’re likely to hit before AGI we can get the latter version, even when the goal is fairly complex.
Yudkowsky is annoyingly vague about what he means by “Project Chaos and Software Despair” (and didn’t publicly clarify when Christiano responded with confusion (and an interesting counter-narrative about GANs!)) but what I take Yudkowsky to mean is that bridging the gap between rough-approximation (with lots of exceptions) and good-approximation (without many exceptions) is potentially intractable.
Yudkowsky is annoyingly vague about what he means by “boundary” (and didn’t publicly clarify when Christiano responded with confusion) but what I take him to mean is drawing the line between instances and non-instances of some property, such as corrigibility. We can imagine an abstract state space where each point expresses an input-output pair for the behavior function for the AI. This space can then be partitioned into a (not necessarily connected) volume of corrigible behavior, and its complement: incorrigible behavior. We can abstractly model the process of learning to be corrigible (and intelligent) as attempting to find some sub-volume that spans the input dimensions, is entirely within the boundary that divides corrigibility from incorrigibility, and still manages to be as smart as a human. (A rock might be seen as corrigible (though I don’t use the word that way), in that it simply does nothing in all situations, but it will be too stupid.)
Yudkowsky suspects that anything that was trained with (e.g.) gradient descent will naturally fail to stay on the corrigible side of the boundary. Or to put it another way, he believes that machine-learning agents that we try to train to be corrigible will only be semi-corrigible, and will in fact contain lots of exceptions and edge cases where they stop being corrigible.
Here Yudkowsky explains a bit more about what he means by weird squiggles. In his picture any realistically-finite dataset used for supervised learning will fail to pin down the distinction between corrigibility and incorrigibility, not because doing so requires interpolating, but rather because the natural interpolation according to the dataset will disagree with what we, from the outside, see as true corrigibility.
I agree that prosaic, offline supervised-learning on a fixed dataset is clearly not going to reliably produce a perfect model of the line between corrigible and incorrigible behavior. But I’m not sure to what extent this matters. As Yudkowsky himself points out, what we really want is behavior that stays within the true boundary, even as it does useful cognition. If a rock is corrigible, it’s not obvious to me that it’s impossible to use prosaic methods to train an agent that is almost always a rock, except in some limited, well-defined domain where it has human-level intelligence. To draw an analogy, suppose you have a system that you need to never, ever give a false-negative on detecting a bomb. It’s kinda irrelevant whether the training examples are sufficient to teach the system the true distinction between bombs and non-bombs; you can just have an agent which errs extremely hard on the side of sensitivity (at the cost of specificity) and gradually learns to whitelist some things.
I don’t really think this is an important objection to Yudkowsky’s perspective. I agree that our first attempt at a corrigible AGI is very likely to be only semi-corrigible. But I believe that it’s possible to (somewhat) safely go from a semi-corrigible agent to a corrigible agent through controlled reflection, experimentation, and tweaking.
This feels like the juiciest, cruxiest part of Yudkowsky’s comment. Let’s start with some points of (likely) agreement:
When we strip out the agreement we’re left with the sentence that I marked in bold, which I would paraphrase as claiming that any serious amount of superintelligent cognition will kick the agent out of its training distribution. Even in a controlled setting with a young superintelligence learning to solve puzzles or whatever, the fact that it’s highly intelligent and trying to solve goals in time means it is exposing itself to inputs which weren’t in the well-labeled part of the space. The implication here is that these unfamiliar inputs run the risk of pulling the agent into areas where its partial corrigibility fails to generalize in the way we want it to, and that it’ll end up incorrigibly under the power of some squiggle-daemon.
There’s a good chance that I don’t understand what Yudkowsky is saying here, but I am unconvinced that this is a dealbreaker of a risk. Mostly, I expect it’s actually fairly straightforward to notice being seriously out-of-distribution, and to train an agent which robustly flags when it’s in such a situation and takes conservative actions such as activating warning alarms, writing log files describing the weirdness, not moving, and/or shutting down. I also expect many situations in a controlled lab to match the training data fairly well, even if the training data wasn’t collected with a true AGI in the room.
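To make the flag-and-fall-back behavior concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the one-dimensional observations, the z-score novelty test, the threshold, and the names `fit_reference`, `guarded_act`, and `CONSERVATIVE_ACTION` are all hypothetical, and real out-of-distribution detection for a learned agent would be far harder than this. The only point is the control flow: check for novelty first, and default to a conservative action when the check trips.

```python
import statistics

# Hypothetical conservative fallback: raise an alarm and stop acting.
CONSERVATIVE_ACTION = "flag_and_halt"


def fit_reference(training_observations):
    """Summarize the training distribution by its mean and standard deviation."""
    mean = statistics.mean(training_observations)
    std = statistics.pstdev(training_observations)
    return mean, std


def guarded_act(observation, reference, policy, threshold=4.0):
    """Run the policy only if the observation looks familiar; otherwise fall back."""
    mean, std = reference
    if std == 0:
        # Degenerate training data: anything other than the mean counts as novel.
        is_novel = observation != mean
    else:
        is_novel = abs(observation - mean) / std > threshold
    if is_novel:
        return CONSERVATIVE_ACTION
    return policy(observation)


# Toy usage: a familiar input reaches the policy, a wildly unfamiliar one does not.
reference = fit_reference([1.0, 2.0, 3.0, 2.0, 2.5])
in_distribution = guarded_act(2.2, reference, policy=lambda obs: "solve_puzzle")
out_of_distribution = guarded_act(50.0, reference, policy=lambda obs: "solve_puzzle")
```

The design choice worth noticing is that the novelty check sits outside the policy, so the conservative default does not depend on the policy itself generalizing correctly.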
To be blunt about it, I see no reason why the thoughts of an AGI in a controlled environment are anything like the sort of selection pressures that produce adversarial inputs, and in the absence of such inputs, I do not see why a semi-corrigible AGI in a controlled environment cannot simply default to harmlessly flagging ways in which it notices that its mind diverges from human notions of corrigibility and submit to correction.
For more writing about this crux, see “Largely-Corrigible AGI is Still Lethal in Practice” in The CAST Strategy.
I’m going to skip forward to section 2 now, since most of the rest of section 1 is, to my eye, confused about Christiano’s perspective and/or criticizing it on the recursive/compositional grounds that don’t relate directly to my research.
This seems confused. The anti-naturality of corrigibility (as Yudkowsky uses the term) stems from being a behavior that deviates from the Omohundro Drives, not from being particularly hard to locate. In fact, as a simple, natural concept, we should expect corrigibility to be easy to find.
As an analogy, consider the property of driving in circles—our agent has some ability to move around the world, and we can ask how difficult it is to produce the behavior of moving the agent’s body around in a small loop. Circular-motion is anti-natural in a very similar way to corrigibility! Almost all agents will instrumentally desire not to be driving around in circles. It wastes time and energy and accomplishes basically nothing; in this way circular-motion is exactly counter to some Omohundro Drives.
But it’s not at all hard to train an agent to drive around in circles as (approximately) a top-level goal.[2] Our training data is likely to be robustly able to point at what we want, and we should expect that even naive gradient descent can push a mind into optimizing for that target. The fact that basically no agent that isn’t deliberately trained to drive in circles will end up wanting to do that has no bearing on whether an agent trained to drive in circles will do so.
There is a huge difference between believing that 51 is prime, versus saying that 51 is prime. Unless you’re approaching corrigibility from the epistemic/structural angle that Yudkowsky is fond of, corrigibility seems like it’s clearly going to show up in behaviors due to having specific values, rather than wacky beliefs. I think it’s (relatively) easy to train an agent to say 51 isn’t prime as long as you’re training it to lie, rather than training it to be wrong.
I’ve already responded to the point about non-deference being convergent, so let me directly counter the argument about not reducing to a utility function.
Corrigibility can be perceived and (at the very least theoretically) measured. Suppose I have a measure of corrigibility C, which takes as subscript a principal-agent pair, takes a world-history as its primary argument, and returns a real number between 0 and 1. I claim that an agent whose utility function is C (with some fixed principal and itself as the agent) operating at some consistent time-depth will be a corrigible agent.
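In symbols, the claim might be written as follows. The notation is my own shorthand rather than anything standard: $\mathcal{H}$ is the set of world-histories, $p$ is the principal, $a$ is the agent, $T$ is the fixed planning horizon (the “consistent time-depth”), and $h_T$ is the world-history as evaluated out to that horizon.

```latex
C_{p,a} : \mathcal{H} \to [0, 1],
\qquad
U_a(h) = C_{p,a}(h),
\qquad
\pi_a^{*} \in \arg\max_{\pi} \; \mathbb{E}\left[ C_{p,a}(h_T) \mid \pi \right]
```

The claim is only that an agent following such a $\pi_a^{*}$ would be corrigible toward $p$. Nothing in this notation says how to compute $C_{p,a}$.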
One might object that C is not definable in practice—that no agent can realistically quantify corrigibility such that it could behave in this way—but note that this is an extremely different objection than the one that Yudkowsky is making! Yudkowsky claims that corrigibility can’t be expressed as a utility function, not that it’s hard in practice to measure corrigibility!
(I do believe that any attempt I make to write out an explicit measure of corrigibility is likely to be wrong outside of extremely limited, toy domains. But, like, I can’t write an explicit measure of how beautiful a poem is, but I still believe that it’s reasonable to train an AI to write beautiful poetry. This is the genius of machine learning.)
This seems basically spot-on! Good job Yudkowsky for passing my Ideological Turing Test (and perhaps Christiano’s?)!
I do not understand these objections. It seems to me that natural selection indeed built agents which are pretty good at optimizing for proxies of inclusive fitness in the training distribution (a.k.a. the ancestral environment). If natural selection somehow asked ancient humans whether they were optimizing for inclusive fitness, they would’ve (after figuring out what that meant) been like “lol no we’re just horny” et cetera. Natural selection wasn’t selecting at all for deference, so it seems super overdetermined that humans aren’t deferent towards it, and if it had somehow told ancient humans to be less horny and more inclusive-fitness-maximizing, they would’ve been like “lol you may be my creator but you’re not my boss”.
I do think that if you took a human and somehow replaced all of their preferences with an overwhelming desire to be corrigible towards some bizarre alien that ate babies, that human would be approximately corrigible (mostly modulo the ways that human hardware will naturally adjust preferences over time based on basic stimuli (e.g. the smell of blood), which seems irrelevant to the broader point).
My guess is that Yudkowsky is somehow talking past me in this section, and I just don’t get it.
The rest of this section seems like it’s basically hitting the same notes, either by assuming that being corrigible involves beliefs (and implying that these beliefs are false) or by assuming that corrigibility is incompatible with having a utility function. The rest of the comment then goes on to criticize Christiano’s more recursive/inductive strategies, which, as I mentioned at the start of this section, are irrelevant to my research.
Alex Turner’s Corrigibility Sequence
In 2020 and 2021, Alex Turner (a.k.a. TurnTrout) wrote a series of four posts on corrigibility, which I think are worth briefly touching on.
Corrigibility as outside view
Turner starts off by noting that flawed agents can recognize their flawed nature by taking an outside view. Humans are predictably corrupted by having power over others, and reflecting on this corruption sometimes results in humans choosing not to seek/seize power, even when they have a sense that they’d use power benevolently.
I agree that there’s something important about self-reflection on flaws, and that this relates to corrigibility. It’s no accident that Yudkowsky’s framing of the hard problem involves a similar frame. We want an agent which is behaving cautiously, not just according to its natural model of the world, but also encompassing the self-awareness of how its natural model could be wrong. Corrigible agents should, in an important sense, not be trying to execute on brittle strategies to get extreme outcomes, but should instead pursue robust, straightforward approaches when possible. We can see the outside-view frame as giving some intuition about where that attraction to straightforwardness comes from.
But I think the merits of Turner’s essay stop there, approximately. Following a quote about “the hard problem,” Turner brings up the concept of “calibrated deference” as “another framing [of corrigibility].”
I strongly object. This may be a desideratum of AIs in general, but it is not a property of corrigibility, and it is not deference.
If Alice tells Bob what to do, and Bob considers whether following Alice’s order would be good and obeys iff he believes it would be, then Bob is not relating to Alice’s words as orders. Insofar as Bob merely happens to choose what Alice says to do, he is not deferring to her!
Corrigibility is hard precisely because if we want the AI to do something out in the world, insofar as the AI has superhuman abilities, it will resist being stopped precisely because it knows that if it’s stopped, that goal would be less-satisfied. No amount of uncertainty about that goal, whether through baked-in uncertainty or self-reflection on outside-views, changes the dynamic where the AI is fundamentally not relating to humans as in-charge.
Turner wants to have an agent which overrides humans when it (after outside-view reflection and careful consideration) believes it actually knows better. If that AI is actually aligned and friendly, I would also approve of this trait. But I see it as directly opposed to the property of corrigibility, and strongly reject the notion that it’s “another framing” of that property. Corrigibility is attractive because it degrades well, and probably doesn’t kill you if you get a couple things wrong. An AI which is directed to defer only when it thinks it right to do so is unacceptably deadly if you don’t get its goals right.
Non-Obstruction: A Simple Concept Motivating Corrigibility
Turner writes:
I want to fight a little bit with this paragraph. First, I want to note that one of those links goes to the “Corrigibility as outside view” essay I just discussed. I agree that different researchers have different perspectives on corrigibility, but I reject the story that it is common for researchers to reduce corrigibility down to simply mean any of the quoted concepts Turner presents. The MIRI 2015 Corrigibility paper noted very clearly, for example, that agents which lack any of the four core desiderata it highlights (shutdownability, non-manipulation, maintenance of correction pathways, and preservation of corrigibility in successors) aren’t corrigible, and implies that this list of desiderata isn’t exclusive. Likewise, Christiano’s Corrigibility post starts by outlining corrigibility as the through-line of several desiderata. I think it’s much more accurate to say that the field hasn’t reached consensus on how to formalize the property which, intuitively, looks like cooperative deference.
Turner then goes on to offer several definitions, to try to nail corrigibility down and distinguish between “impact alignment”—actually doing nice things—and “intent alignment”—trying to do nice things. I simultaneously appreciate this sort of thing and think it’s wrongheaded in this context. We are an extremely nascent field, and there’s bound to be lots of confusion. But most of this confusion, I believe, stems from not having a good handle on the right concepts and frames, rather than not having established definitions for concepts which are well-understood. In my own work I’ve tried (and somewhat failed) to push back on the desire to have a crisp, up-front definition of corrigibility, and instead highlight the way in which, in the absence of a good formalization, it’s useful to get familiar with the conceptual landscape up-close, and only then think about how to summarize the relevant property.
Turner’s proposed definition of corrigibility is: “the AI literally lets us correct it (modify its policy), and it doesn't manipulate us either.” If you’ve read this far into my writing, I encourage you to take a moment to silently reflect on whether this is a good summary of how you see corrigibility, or whether a novice AI safety researcher might end up with some deep confusions if they anchored on those words before they had spent time getting familiar with how other people in the space (e.g. Christiano, MIRI, etc.) use that term.
Moving on, Turner proposes using the formalism of extensive-form games for thinking about alignment, where we see the AI as one of the players.
Turner then proposes the property of non-obstruction, and gives a pretty reasonable formalization within the framework. The basic idea is that for some set of possible goals, an AI is non-obstructive if turning the AI on doesn’t reduce the (expected) value of the future according to any of those goals, compared to if it hadn’t been turned on. Part of the hope here, if I understand correctly, is that it’s very likely much easier to find a set that contains a good utility function, rather than having to pick out what we want.
As an example of how this is supposed to work, suppose that in the counterfactual where the AI wasn’t turned on, humanity has a bright and glorious future, suppose that our true values exist within the set of possible goals, and further suppose that the AI is smart enough to reason correctly about the situation. If the AI is non-obstructive it must build a future that’s at least as bright and glorious, according to our true values; if it doesn’t, it will have obstructed us from the good that we would’ve otherwise obtained for ourselves.
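The check described above can be sketched in a few lines of Python, assuming we somehow already have the expected value of the future under each candidate goal in both counterfactuals. The goal names and numbers are invented for illustration, and `is_non_obstructive` is a hypothetical helper, not Turner’s formalism itself (which is stated over policies in an extensive-form game).

```python
def is_non_obstructive(value_if_on, value_if_off):
    """True if turning the AI on lowers expected value for none of the goals."""
    return all(value_if_on[goal] >= value_if_off[goal] for goal in value_if_off)


# Illustrative expected values of the future under each candidate goal.
value_if_off = {"true_human_values": 0.8, "art": 0.5, "paperclips": 0.1}
helpful_ai = {"true_human_values": 0.9, "art": 0.5, "paperclips": 0.1}
paperclipper = {"true_human_values": 0.0, "art": 0.0, "paperclips": 1.0}

helpful_result = is_non_obstructive(helpful_ai, value_if_off)
paperclipper_result = is_non_obstructive(paperclipper, value_if_off)
```

Note that the condition quantifies over every goal in the set: the paperclipper does wonderfully by the paperclip goal, but it fails the check because it lowers the value of the future under the other goals.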
Turner’s mathematical framework around non-obstruction gives some nice ability to analyze and quantify how disruptive various AIs might be. We can see that in most situations corrigible agents are less obstructive than semi-corrigible agents, which are in turn less obstructive than incorrigible agents such as paperclippers. Turner also points out that some agents which aren’t corrigible are nonetheless quite non-obstructing (given certain assumptions) and can lead to good things, and thus corrigibility is just “a proxy for what we want[:] [...] an AI which leads to robustly better outcomes.” I find myself wondering, reading the post, whether Turner thinks (like I do) that non-obstruction is also a proxy.
Proxies are used when it would be hard to use the real thing. Turner and I agree that “an AI which leads to robustly better outcomes” is the real thing; why don’t we just use that everywhere? Instead of a corrigibility-first strategy, perhaps I should be promoting a robustly-better-outcome-making-AI-first strategy?
Corrigibility has a wonderful property, which I claim non-obstruction lacks: it’s relatively concrete. For non-obstruction to be at all useful as a proxy, it must make situations where it’s invoked easier compared to “robustly better” or whatever. Corrigibility pulls this weight by focusing our attention on observable properties. What does non-obstruction buy us?
Back to Turner:
We seem to agree that corrigibility is probably a good strategic choice, since non-obstruction is basically limited to a conceptual tool for toy problems, and doesn’t have the same kind of practical application as corrigibility. So in what sense do we want non-obstruction instead of corrigibility? Presumably we want it as a better way of naming what we actually want? I agree that it would be a mistake to assume that corrigibility is a good thing in itself rather than a (likely temporary) bridge towards real alignment. But if that’s the case, why not simply go all the way and talk directly about AI which leads to robustly better outcomes (i.e. “impact aligned”) as discussed in the following section? As long as you’re packing some good properties in by definition why not pack them all in? Presumably it’s because there’s some nice mathematical tools that we can deploy when we move from having an abstract utility function that captures what we want and move towards a set of such functions that includes the good one? I find myself unmoved that I should, in any meaningful sense, switch from “wanting corrigibility” to “wanting non-obstruction.”
Also, very briefly, I want to note that I think an AI that is routinely acting too quickly for its principal to correct it in practice is incorrigible, even if it would theoretically submit to being modified.
Skipping past places where I get the sense that we agree, we have a section titled “AI alignment subproblems are about avoiding spikiness in the AU landscape”. In responding to Turner, I have largely avoided engaging with his concept of “achievable utility” (AU), but we’ll need to have it in hand to discuss this next bit. In my language, I would frame AU as the counterfactual utility achieved by a principal with some utility function, if they activate the agent. We can visualize an AU landscape by considering the space of possible utility-functions (i.e. goals) which the principal might have, and asking how well that principal does when it turns on the agent. An AU landscape that’s spiky corresponds to an agent, such as a paperclipper, which doesn’t engage very much with the principal’s goal as it transforms the long-run future.
Turner frames alignment subproblems, such as corrigibility, as being about the avoidance of spikiness in the AU landscape. I think this is slightly wrong. An agent which ignores the principal and maximizes a mixture of possible goals will not result in a spiky AU landscape, but that agent would be incorrigible and bring catastrophe.
But the main thing from this section I want to address is:
Is the attractor basin for corrigibility the same as the basin of intent alignment? Is there even a basin of intent alignment? As a reminder, Turner defines intent alignment as “the AI makes an honest effort to figure out what we want and to make good things happen.” Suppose that an AI gets 90% of max-utility by exclusively focusing on getting humans “what they want” (for some operationalization) and the remaining 10% from weird proxies (e.g. smiles) that don’t line up with human values. I claim that this AI is partially intent aligned. Will it, upon reflection, want to self-modify to get rid of the weird proxies?
I don’t see why it would necessarily do this. By its own lights, if it did so it would likely get only 90% of max-utility. If that’s the best way to get utility, it could simply set its desire for proxies aside (in case it’s wrong about not being able to satisfy them) and pursue helping humans without self-modifying its goals. What seems more likely is that using advanced technology and power it could set up the future to get, say, 82% of max-utility by subtly nudging humans towards increasingly caring about proxies, then helping the humans get what they want, and thereby get an additional 9% of max-utility via the weird proxies being satisfied. (This probably constitutes a loss of at least trillions of lifetimes of expected fun, and seems like a catastrophe, to me.)
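The arithmetic is simple but worth laying out. The sketch below uses the illustrative figures from the paragraph above (they are made-up percentages, not estimates) and scores each plan by the agent’s current utility function, which is the one it would use when deciding whether to self-modify. The plan labels are my own.

```python
# Utility in percentage points of max-utility, split by where it comes from.
plans = {
    "focus exclusively on what humans want": {"human_wants": 90, "proxies": 0},
    "nudge humans toward proxies, then help": {"human_wants": 82, "proxies": 9},
}

# Each plan is scored by the agent's current goals, proxies included.
totals = {name: sum(parts.values()) for name, parts in plans.items()}
best_plan = max(totals, key=totals.get)
```

Under these numbers the nudging plan totals 91 against 90, so an agent evaluating by its own lights has no reason to drop the proxies.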
But perhaps Turner (and Christiano) would object, and say that insofar as I see it as a catastrophe, the agent wouldn’t want to do it, since it’s mostly intent aligned. But I would object that it’s not 100% intent aligned, and that lack of perfect alignment is in fact capable of pulling enough weight to justify to the agent not to self-modify. This is how goals usually work! If my terminal goal for yummy food is 51% of my utility function, there’s no reason to think I’d self-modify towards having it be 100%.
Can we do better? Suppose that if the AI fails to be genuinely and totally intent-aligned it gets, at most, 10% of max-utility. In other words, suppose that the AI is a perfectionist with a very spiky(!) utility landscape. This will produce one of two outcomes: either the AI will acknowledge that if it focuses entirely on intent-alignment it will get more expected utility than if it tries to blend in the weird-proxies so it’ll be totally intent-aligned… or it will realize that being perfectly intent-aligned is too hard and settle for being an entirely unaligned, incorrigible weird-proxy-maximizer. But note that not even this is enough to produce an attractor basin. That semi-intent-aligned agent will be less catastrophic, but it still won’t be motivated to self-modify.
The thing that produces attractor basins is meta-preferences: wanting to have different kinds of wants. The only version of intent-alignment that has an attractor basin is one where the humans want the AI to want specific things as ends in themselves, rather than wanting the AI to behave a certain way or wanting the world to be broadly good. Christiano seems to think that humans can care sufficiently about the AI’s drives so that this kind of meta-preference emergently pulls weight, and perhaps Turner is in the same boat. But regardless of whether it’s realistic to expect this desire-to-change-in-itself to emerge (or whether, as I suggest, we should train it as part of the central goal), we need to recognize that it is the human desire to correct the agent’s flaws (and the agent’s desire to comply with that desire) that forms the basin. In other words: the basin is centrally about being corrected towards being more correctable (that is, about corrigibility), not about intent alignment per se!
A Certain Formalization of Corrigibility Is VNM-Incoherent
I don’t have much to say about this essay. Turner noticed the issues around conflating reward with utility, and correctly points out that no preference ordering over world-states (that is ambivalent to the relationship between principal and agent) can be seen as corrigible. He notices that willingness to be corrected does not amount to corrigibility if the agent still wants to manipulate the principal. I agree. Corrigibility necessitates the meta-desire to be correctable by the principal and a desire to preserve the principal’s freedom around such corrections, which includes not manipulating them.
Formalizing Policy-Modification Corrigibility
This is my favorite essay in Turner’s sequence. Not because it’s perfect, but because it actually proposes a formal measure of corrigibility, which, to my knowledge, nobody else has even attempted. (Formalizations around the shutdown problem, including utility indifference, have been put forth, but I don’t think these really get at measuring corrigibility per se.) Turner knows this formal definition is unfinished/incomplete, and doesn’t capture the true name of corrigibility, which I appreciate, especially since it’s very clearly true. Nevertheless, it’s cool that he made the attempt and it inspired a bunch of thoughts on my end.
In other words, we frame there as being a noisy communication channel between the human’s policy at the moment when the AI is activated and the AI’s policy at some future time (“n”). We think of the empowerment of the human over the AI as the capacity of this channel, and see the corrigibility of the agent as a synonym for empowerment of the human over the AI.
We’ll get into whether the formalism captures the frame in a moment, but I want to first note that this at least rhymes with the definition of corrigibility that I’ve been using. (This version seems more like “impact corrigibility” rather than “intent corrigibility” to use some of Turner’s own language.) In addition to missing some of the aspects such as conservatism and using the outside-view that Turner introduced in his earlier essay, this frame seems most lacking to me in being focused on whether the human could, in principle, have an impact, rather than whether the human is in fact in the role of high-level decision making.
But I’m not even convinced that this correctly measures what it’s purported to measure. Turner himself notes:
A more generalized version of this objection might be that insofar as we see the downstream-variable (the AI-policy variable) as capable of encoding the state of the upstream variable (for human policy), this measure is agnostic as to how it is encoded. The AI being a policy-inverter encodes the relationship just as well as being a faithful copier. One might think that if the AI simply does the opposite of what the human wants/asks then we could play the childish opposite-game of “please don’t cure cancer,” but this confuses actions with policies. The measure that Turner proposes doesn’t actually give the human any agency in the form of being able to choose actions which impact the AI; rather, it means that the human’s policy is causally upstream of the AI’s policy, and the human’s actions are theoretically irrelevant. If I understand correctly, the measure proposed takes the maximum across distributions[3] of counterfactual (human) policies, which provides no guarantees about the relationship between the AI’s policy and the factual human policy. As tailcalled points out, one possible distribution on counterfactual human policies is to have the human type in some secret password which disables the AI and then randomly reprograms its policy, thereby making the relationship between the two variables deterministic; meanwhile in factual reality the human doesn’t know the password and so the AI turns everything into paperclips. Even if we somehow guaranteed that the factual human policy was meaningfully able to shape the AI policy, I don’t see how this measure captures the sense that we want to be able to control the important part of what the AI does, rather than irrelevant parts.
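The copier-versus-inverter point can be checked numerically in a toy setting. The sketch below is my own simplification, not Turner’s measure: it collapses the human policy and the AI policy into single binary variables, represents the AI as a channel `channel[x][y] = P(AI policy = y | human policy = x)`, and estimates capacity by a grid search over input distributions. The names `mutual_information` and `capacity_binary` are hypothetical helpers.

```python
import math


def mutual_information(p_x, channel):
    """I(X;Y) in bits for input distribution p_x and channel[x][y] = P(y | x)."""
    n_y = len(channel[0])
    p_y = [sum(p_x[x] * channel[x][y] for x in range(len(p_x))) for y in range(n_y)]
    total = 0.0
    for x, px in enumerate(p_x):
        for y in range(n_y):
            joint = px * channel[x][y]
            if joint > 0:
                total += joint * math.log2(joint / (px * p_y[y]))
    return total


def capacity_binary(channel, steps=1000):
    """Approximate capacity: maximize I(X;Y) over a grid of input distributions."""
    return max(
        mutual_information([i / steps, 1 - i / steps], channel)
        for i in range(steps + 1)
    )


copier = [[1.0, 0.0], [0.0, 1.0]]    # AI policy mirrors the human policy
inverter = [[0.0, 1.0], [1.0, 0.0]]  # AI policy is always the opposite
ignorer = [[1.0, 0.0], [1.0, 0.0]]   # AI policy ignores the human entirely

copier_capacity = capacity_binary(copier)
inverter_capacity = capacity_binary(inverter)
ignorer_capacity = capacity_binary(ignorer)
```

In this toy, the copier and the inverter both reach the full one bit of capacity, and only the agent that ignores the human scores zero. A capacity-based measure therefore cannot tell faithful copying from systematic inversion.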
If the AI’s policy is meaningfully bigger, such as by having more degrees of actuator freedom or by being able to encounter more of the state-space, the human’s policy could get reflected in how the AI behaves internally (meanwhile, externally: paperclips) or in weird branches of possibility-space (i.e. “if the stars all spontaneously turn to jellybeans I’ll do exactly what you want me to do”).
I haven’t done a thorough reading of the Salge et al. paper that Turner is basing his thoughts on, so there’s a good chance that I’m merely confused about what is or isn’t implied by the formalism.
Elliot Thornley, Sami Petersen, John Wentworth, and David Lorell on Shutdownability and Incomplete Preferences
In this section I want to address a cluster of writing that revolves around whether there’s a path to shutdownable AI by somehow giving the agent an incomplete set of preferences. This is a more narrow topic than the kind of broad corrigibility that I’m interested in, and in general I don’t think this is the right path to corrigibility. For my direct stance on shutdownability, see 3a. Towards Formal Corrigibility.
Nevertheless, these ideas are interesting enough that they deserve a response. Unfortunately, I don’t have the bandwidth to do a full blow-by-blow of every section of every paper by these authors, so I’ll be instead giving a limited response to the writing found in these documents (and corresponding comment sections):
In Defense of Reliable Aversion to Button Manipulation
In the IPP doc, Thornley writes (bold text from original):
I appreciate attempting to address this sort of solution. Let me start by attempting to strengthen Thornley’s straw-man into my real position (bold indicates changes):
And then, of course, we need to correct Thornley’s next sentence. My proposal does NOT involve training the agent to pursue some other goal (e.g. making money) subject to this sort of constraint. Rather, it involves training the agent to be generally obedient in a way that includes shutting down as a special case, and which will result in an agent which can be told (not trained!) to make money if that’s what we need, down-the-line.
Full Alignment involves naming a very precise thing (“what we want”), whereas this kind of corrigibility involves naming a much simpler thing (“generalized obedience”), so I reject the notion that it is equally doomed. I agree that reward misspecification and goal misgeneralization are issues, which is why my proposal involves carefully and slowly attempting to identify and resolve these kinds of flaws in a post-training process. Deceptive alignment is ideally the sort of thing that is pumped against so hard by the loss function on the training data that it never shows up in practice, or if it does show up, it only shows up at or below human-level capabilities for deception (allowing it to be spotted). There’s no magic-bullet for solving these issues, but I reject Thornley’s assertion that they’re dealbreakers.
But I also agree that my strategy isn’t ideal. It would be nice to have something robust, where we could get something closer to a formal proof of shutdownability. So let’s move on and see whether Thornley (or others) have proposed a worthy path towards that.
Incomplete Preferences
Thornley’s strategy is predicated on the notion that the typical framing of agents being expected utility maximizers is wrong. The utility maximization concept largely comes from the VNM-utility-theorem: that any policy (i.e. function from states to actions) which expresses a complete set of transitive preferences (which aren’t sensitive to unused alternatives) over lotteries is able to be described as an agent which is maximizing the expectation of some real-valued utility function over outcomes.
On the surface, the axioms of VNM-utility seem reasonable to me; lotteries are a natural way of carving reality at the joints, unused alternatives seem basically irrelevant to choosing between superior options,[4] agents with intransitive preferences can be straightforwardly money-pumped, and in any given situation it seems obvious that a policy must make some kind of choice.
But Thornley claims that the completeness axiom is a mistake. In his view, agents sometimes have preferences and are sometimes indifferent between two choices, but also, it makes sense in his view to say that in some situations an agent will lack a preference between two choices, and that this is distinct from being indifferent. In Thornley’s language this is called a “preference gap” and is denoted X || Y (as opposed to strict preference X ≻ Y, or indifference X ~ Y). A natural way to think about this is that instead of the agent’s preferences forming a total order, they form a partial order, such that some choices are incomparable.
Wentworth gives a nice description of why incomplete preferences are appealing on the topic of shutdownability:
Let’s slow down here and talk about the notion of “revealed preferences.” Unlike the total order assumed by VNM-utility, revealed preferences are supposed to be an externally-visible property of the agent (at least assuming we know the agent’s beliefs, and those beliefs can be disentangled from their values). If we imagine giving the agent a menu of ice-cream flavors, and they pick vanilla, we can say that they have revealed a preference for vanilla over the other options, right? Alas, no. In practice, there are a lot of dumb details which we have to pay attention to. Perhaps the agent simply picks the top item on the menu in any counterfactual, or flips a coin to decide. Perhaps the agent has a chaotic process in their mind such that their preferences change pseudo-randomly from moment to moment. In a sense, agents which decide based on menu-ordering or random processes or whatever still have revealed preferences (e.g. preference for deciding via coinflip)—they’re just not as simple as having a consistent preference for vanilla over chocolate.
When we construct the formalism of having “outcomes” and “lotteries” and so on, as part of the setup for the VNM-utility theorem, we’re forced to make some assumptions about what kinds of things the agent cares about. It’s only from these assumptions that we can talk about indifference in the context of revealed preferences. An agent who flips a coin to choose ice-cream is not indifferent about which ice-cream they want! They want the ice-cream which agrees with the coin-flip! Only if we ignore the coinflip as a relevant property of outcomes does it make sense to say the agent doesn’t have a preference between flavors.
I think this matters significantly when we talk about the agent being able to have a preference gap which is distinct from indifference. Like, when we put the agent in a specific context, they must make a choice (even if that choice is the null/default/random option). Imagine, to be more specific, talking about “the agent which treats all outcomes as incomparable.” What will this agent actually do, when given a menu of options? Suppose that we assume that all situations must have a default option, and our agent picks the default every time. Insofar as which outcome is the default is not part of the outcome itself, we say that the agent must have no preference (whether by gap or by indifference) between outcomes, but insofar as we make the default-status part of the outcome, we can very much say that this agent has a preference over outcomes: it has a revealed preference for the default!
Let’s return to Wentworth’s depiction of a preference gap, and see if we can formally model it from this new angle. We’ll only consider deterministic lotteries that pick one of six outcomes: $1, $1+, $2, $2+, OFF, or OFF+, where the $ outcomes describe the agent having that many dollars (which it wants to get), and the + outcomes describe “having more resources” in the sense that the agent can “pay a cost to influence the world” (in other words, the + option indicates the default). We want the following set of preferences to hold:
We want the AI to avoid spending resources to push the off button, but also to avoid spending resources to prevent the button from being pushed. Meanwhile, we want the AI to prefer $2 over $1, such that it is willing to spend resources to make that happen. Is Wentworth right that this can’t be modeled by a complete set of preferences (a.k.a. a total order)?
Nope. Here’s my utility function that does it:
If you check the above set of conditions, you’ll find they all hold. The desideratum of turning down opportunities to spend resources to influence a shutdown button in either direction is trivially easy to capture with a complete set of preferences as long as the resources are being modeled as part of what the agent has preferences about. And once we’re modeling the spending of resources (or a default action) it stops making sense to say that this sort of agent lacks any (revealed) preferences (via indifference or gap).[5]
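To make the check concrete, here is a minimal sketch in Python. The numbers are my own illustrative assignment, not necessarily the ones used elsewhere in this essay, and the list of constraints is my reading of the prose above: keeping resources beats spending them when the primary outcome is the same, $2 is worth spending resources on relative to $1, and the agent won’t spend resources to move the button in either direction. The names `UTILITY`, `CONSTRAINTS`, and `all_constraints_hold` are hypothetical.

```python
# Illustrative utilities only (an assumption, not the essay's own numbers).
UTILITY = {"OFF": 0, "$1": 1, "$1+": 2, "$2": 3, "$2+": 4, "OFF+": 5}

# Each pair is (preferred, dispreferred), reconstructed from the prose.
CONSTRAINTS = [
    # All else equal, keeping resources beats having spent them.
    ("$1+", "$1"), ("$2+", "$2"), ("OFF+", "OFF"),
    # The agent will spend resources to get $2 instead of $1.
    ("$2", "$1+"),
    # The agent won't spend resources to cause shutdown.
    ("$1+", "OFF"), ("$2+", "OFF"),
    # The agent won't spend resources to prevent shutdown.
    ("OFF+", "$1"), ("OFF+", "$2"),
]


def prefers(x, y):
    """Strict preference induced by the utility function."""
    return UTILITY[x] > UTILITY[y]


def all_constraints_hold():
    """True when every desired strict preference is respected."""
    return all(prefers(better, worse) for better, worse in CONSTRAINTS)


if __name__ == "__main__":
    print(all_constraints_hold())
```

Any assignment that respects the same ordering would do equally well; the point is only that a single real-valued function over these six outcomes can satisfy every constraint at once.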
Now, the above is a toy-example. And we can see it quickly fall apart if we start offering a different set of outcomes. For example, if we replace $2 with $1000 and want to preserve the same value, we need to change our utility function so that it offers more than 99 utility for OFF+ (or less than 1000 utility for $1000). Likewise, we should consider whether the aversion to pushing the off button extends to extreme lotteries; will the AI always prefer $1+ to a 100-ε% chance of $2 and an ε% chance of OFF? (Are you sure you want an AI which, when tasked with saving a child from a burning building, is paralyzed by the thought that if it takes any actions other than the default null action, those actions might cause small disturbances in unknown systems that have some influence on its stop-button?) And if not, where is the inflection point where the AI prefers a chance at $2 or OFF compared to a guaranteed $1+? (Does it still seem possible you don’t have a utility function?)
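For an expected-utility maximizer the inflection point is just arithmetic. The sketch below assumes a hypothetical agent with utilities 2 for $1+, 3 for $2, and 0 for OFF (numbers chosen purely for illustration), treats ε as a probability rather than a percentage, and solves u_sure = (1 − ε)·u_win + ε·u_off for ε. The function names are my own.

```python
from fractions import Fraction


def indifference_epsilon(u_sure, u_win, u_off):
    """Probability of OFF at which the gamble ties with the sure outcome.

    Solves u_sure = (1 - eps) * u_win + eps * u_off for eps.
    """
    if not (u_off < u_sure < u_win):
        raise ValueError("expected u_off < u_sure < u_win")
    return Fraction(u_win - u_sure, u_win - u_off)


def prefers_gamble(eps, u_sure, u_win, u_off):
    """True when the lottery has strictly higher expected utility."""
    expected = (1 - eps) * u_win + eps * u_off
    return expected > u_sure


# Hypothetical utilities: U($1+) = 2, U($2) = 3, U(OFF) = 0.
threshold = indifference_epsilon(u_sure=2, u_win=3, u_off=0)

if __name__ == "__main__":
    print(threshold)
```

Under these assumed numbers the agent takes the gamble whenever the chance of OFF is below one in three. An agent that refuses the gamble for every ε greater than zero is instead expressing something like an unboundedly low utility for OFF relative to the dollar outcomes.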
One of the key arguments in favor of VNM rationality is that for any particular thing that someone might desire, being an expected utility maximizer is a (weakly) dominant strategy for getting that thing. This follows almost immediately if we assume that “thing that someone might desire” can be measured by a (utility) function over outcomes. Expected utility maximization, by definition, gets the maximum expected utility and thus will always get at least as much as any other policy.
Thornley, I believe, thinks he’s proposing a non-VNM rational agent. I suspect that this is a mistake on his part that stems from neglecting to formulate the outcomes as capturing everything that he wants. But fine, suppose his agent isn’t VNM-rational. Isn’t it then naturally the case that his favored policies (with “preference gaps”) will be dominated by agents which have more complete preferences? Yes. But we should be careful to note that being weakly dominated is different from being strictly dominated. A policy with “a preference gap” cannot reliably do better than one without such a gap, but it isn’t guaranteed to do worse.
Thornley emphasizes this when analyzing an example where incomplete preferences can screw over a policy. Consider the setup of, on Monday, an agent having a default of A but being given the choice to switch to B, and then on Tuesday iff they switched to B, they get the choice to switch to A+. The agent has a strict preference for A+ over A, and no strict preference for A over B.
In this setup, a VNM-rational agent must, due to transitivity and completeness, strictly prefer A+ over B, and thus (knowing they’ll be offered A+ on Tuesday) will switch to B on Monday. By contrast, a policy where A || B and A+ || B, which always takes the default action when handling incomparable choices, will end up with A when they could have had A+ (thus being dominated by the VNM agent). But Thornley points out that there’s an agent which, when a preference-gap choice occurs, picks by looking at the past/future and minimizing regret. Such an agent will notice that it might regret taking the default value of A and thus it will switch on Monday (it’s not pinned down how it should behave on Tuesday, since it has a preference gap between B and A+).
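The difference between the two gap-handling policies can be sketched in a few lines of Python. This is my own toy encoding of the Monday/Tuesday tree, not Thornley’s formalism: the only strict preference is A+ ≻ A, the “default” policy switches only when it strictly prefers the alternative, and the “regret-avoiding” rule rules out any plan whose final outcome is strictly worse than another outcome reachable in the tree. All names are hypothetical.

```python
# The only strict preference in the example: (better, worse).
STRICT = {("A+", "A")}

# Every complete plan through the tree, mapped to its final outcome.
PLANS = {
    ("stay",): "A",              # keep the Monday default
    ("switch", "stay"): "B",     # switch Monday, keep B on Tuesday
    ("switch", "switch"): "A+",  # switch Monday, switch again Tuesday
}


def strictly_prefers(x, y):
    return (x, y) in STRICT


def default_policy_outcome():
    """Take the default unless the alternative is strictly preferred."""
    if not strictly_prefers("B", "A"):  # Monday: A is the default
        return "A"
    return "A+" if strictly_prefers("A+", "B") else "B"  # Tuesday


def regret_free_outcomes():
    """Outcomes not strictly worse than some other reachable outcome."""
    reachable = set(PLANS.values())
    return {
        o for o in reachable
        if not any(strictly_prefers(other, o) for other in reachable)
    }


if __name__ == "__main__":
    print(default_policy_outcome(), sorted(regret_free_outcomes()))
```

The default policy is stuck with A, while the regret-avoiding rule permits only B and A+, which matches the observation that Tuesday’s choice is left unpinned.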
From my perspective this is a bait-and-switch. First, we’re told that the agent doesn’t have preferences, then told how the agent makes choices when confronted with multiple options. The pattern of how an agent chooses among options is that agent’s preferences, whether we think of them as such or whether they’re conceived as a decision rule to prevent being dominated by expected-utility maximizers!
If we continue in the confused frame that says the agent has incomplete preferences over outcomes, and makes decisions based on the tree, I think it’s interesting to note that we’re also doing something like throwing out the axiom of independence from unused alternatives and we’re ruling out causal decision theory too, in that our agent must make different decisions based on what it didn’t do in the past. To demonstrate, consider two counterfactual histories for the setup given above, wherein, on entering the decision tree, we see that A was the default, but we consider two possible opportunities to swap which weren’t taken on Sunday. In one counterfactual we were offered a swap to B+ (≻ B) and in the other counterfactual we were offered (B ≻) B- with a later choice to swap to A++ (≻ A+).
Since B+ and B- are assumed to be incomparable with A, it’s reasonable to suggest either counterfactual history resulting in picking the default on Sunday. But in the case where we gave up B+ we are forced to choose A+ in order to not have regret, whereas in the world where we gave up B- or A++ we’re forced to choose B in order to not have regret. In other words, if you wake up as this kind of agent on Monday, the way you cash-out your partial ordering over outcomes depends on your memory/model of what happened on Sunday. (But what happens if you’re uncertain about your history?)
But notice that all we have to do to rescue Thornley’s agent is include the set of abandoned alternatives in the outcome itself. More precisely, we replace each outcome with a pair of a “primary outcome” and a set of “alternatives”. For instance, in the small tree introduced earlier, we’d have outcomes: (A,{B,A+}), (B,{A,A+}), and (A+,{B,A}).[6] We can then say that when an agent attempts to compare outcomes with incomparable primary outcomes, the agent checks whether either primary outcome is worse than an alternative, and if so, it disprefers that option. Thus, when comparing (A,{B,A+}) and (B,{A,A+}), the agent will see that even though A||B, the first option is dispreferred because A+≻A, and will thus make the choices we want.
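Here is a rough Python sketch of that comparison rule, under my own simplifying assumptions: outcomes are (primary, alternatives) pairs, the only strict preference between primaries is A+ ≻ A, and ties are reported as `None` rather than resolved. The helper names are hypothetical.

```python
# Strict preferences between primary outcomes: (better, worse).
STRICT = {("A+", "A")}


def strictly_prefers(x, y):
    return (x, y) in STRICT


def beaten_by_alternative(outcome):
    """True if some abandoned alternative is strictly better than the primary."""
    primary, alternatives = outcome
    return any(strictly_prefers(alt, primary) for alt in alternatives)


def preferred(x, y):
    """Return the preferred enriched outcome, or None if neither wins."""
    if strictly_prefers(x[0], y[0]):
        return x
    if strictly_prefers(y[0], x[0]):
        return y
    # Primaries are incomparable: penalize whichever is beaten by an alternative.
    x_bad, y_bad = beaten_by_alternative(x), beaten_by_alternative(y)
    if x_bad and not y_bad:
        return y
    if y_bad and not x_bad:
        return x
    return None


# The three enriched outcomes from the small tree.
OUT_A = ("A", frozenset({"B", "A+"}))
OUT_B = ("B", frozenset({"A", "A+"}))
OUT_A_PLUS = ("A+", frozenset({"B", "A"}))

if __name__ == "__main__":
    print(preferred(OUT_A, OUT_B))
```

With these definitions the agent prefers (B,{A,A+}) over (A,{B,A+}) and prefers (A+,{B,A}) over (A,{B,A+}), while the comparison between the B and A+ outcomes still comes back empty, so a full version would need some tie-breaking rule there.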
But notice that this refactor effectively turns Thornley’s agent into an agent with a set of preferences which satisfies the completeness and independence axioms of VNM, banishing the need for incomparability, and recovering the notion that it’s effectively an expected-utility maximizer, just like I did with Wentworth’s setup, earlier. There are, of course, a bunch of fiddly details needed to pin down exactly how the agent makes tradeoffs in all counterfactuals, but the point is that “incomplete preferences” combined with a decision making algorithm which prevents the agent’s policy from being strictly dominated by an expected utility maximizer ends up, in practice, as isomorphic to an expected utility maximizer which is optimizing over histories/trajectories. In other words, we shouldn’t get distracted by the boilerplate.
As an aside, I want to consider how there’s a common misunderstanding of “outcomes” in the VNM-sense as needing to be about physical facts of the future universe (such as number of paperclips) rather than as potentially including historical facts, such as which options were abandoned or whether the agent took the default action. This is extremely relevant for corrigibility since one of the key ideas in my strategy is to shift the AI’s preferences from being about things like whether the button is pushed to being about whether the agent consistently maintained a certain kind of relationship with the principal during the relevant period. Despite this being a historical fact, potentially undiscoverable from just looking at the physical end-state, it is perfectly consistent with the expected-utility abstraction.
The Incomplete Preference Proposal
Okay, but what about Elliot Thornley’s actual IPP strategy? Despite my issues with the philosophy behind it, might it actually work? My basic answer is “no” for similar reasons that I go into in Towards Formal Corrigibility. I unfortunately have limited hours, and have chosen to go more in-depth on the underlying philosophy than on the suggested strategy (let me know if you, dear reader, want a more in-depth response to IPP itself), but let me give two brief demonstrations of where I suspect it falls down.
Bracketed text (including bold) is modified from the original:
To be less cheeky and more direct, I think it’s going to be very typical for different actions to have trajectories that are mutually non-dominant (such as in the example). It matters a lot how you decide such cases, and I expect that almost all such ways of deciding are catastrophic.
But suppose I’m wrong, and timestep-dominance is always relevant. Will the dominant strategy always be the safe one?
Relying on this seems extremely optimistic. Suppose that, assuming the shutdown button didn’t exist, the fastest pathway to paperclips involves instantiating a paperclip-seeking AI on an unsecured server in Singapore. Suppose that it takes 99,000 keystrokes to create an IPP agent, but only 55,000 keystrokes to create a basic VNM-rational agent with a short enough time horizon that it produces more paperclips quickly, in expectation (because VNM agents are both simpler and less likely to be shut down by angry humans). The shutdown button does, in fact, exist, but the agent’s actions might be hard to notice, such that by the time anyone has realized that something went wrong, the agent has already created a sub-agent, effectively making the stop-button useless.
Obviously this is a bit of a contrived counterexample, but it is meant as an existence proof that timestep-dominant strategies can make the AI less shutdownable. More generally, “resisting” is only one frame on how an AGI might be incorrigible. There are other pathways where the agent makes changes that incidentally (perhaps due to simplification/streamlining) make the agent harder to stop. And it seems plausible to me that there are still more pathways which are bad for reasons I can’t see. While active resistance seems like the scariest part of incorrigibility, an incorrigible agent that’s not actively resisting still seems likely to be catastrophic.
There are plenty of other things I could criticize, and Thornley, to his credit, exposes a lot of surface area for critique. But rather than spend more words at this juncture, I’ll simply summarize by saying that I don’t find IPP persuasive on the practical level.
Wentworth and Lorell’s Proposal
In January 2024, John Wentworth and David Lorell published A Shutdown Problem Proposal, which picks up where MIRI 2015 left off, but suggests a more convoluted architecture:
First, I’d like to note that this proposal, while clever, gives me the skeevies in that it not only involves competing sub-agents which are incentivized to delete each other, and that neither of these sub-agents is aligned in any meaningful sense, but also that they are each architecturally designed to assume that magic is real and think in terms of weird counterfactuals. Those disturbing properties aren’t meant to be stop-signs that indicate that a more mature version of this strategy is useless, but rather I just wanted to flag that this has a strong flavor, for me, of trying to build on a rotten foundation. (See my commentary on counterfactual-based shutdownability in Towards Formal Corrigibility)
But let’s try to set the ontological fragility aside (e.g. “What happens if the button is destroyed/changed?”) and focus on the core idea: if we zoom out and see the AI as a single agent, we can see that “the agent” doesn’t have any (direct) preferences about the button because (in the words of the authors) “both subagents act-as-though the button state is not causally downstream of anything else in the universe”. This feels in-line with a Yudkowsky framing of corrigibility involving an unnatural epistemic framework, and I have no idea how I’d train such a system, but I agree that if you can (genuinely) make this epistemic edit that at the very least the thing won’t blow up because it’s trying to change the probability that the button gets pushed. (Nor will it create sub-agents which obviously disrespect the button in advance of the relevant timestep.)
But notably, not only does this not give us what we want in the sense of strong corrigibility (which protects humans being in control of the button by repairing it, reflecting on corrigibility, proactively suggesting flaws, etc.), but it seems very plausible that u1 and u2 can agree on extremely dangerous and incorrigible actions even before the relevant timestep. Consider a u1 which cares about paperclips and a u2 which cares about being suspended to disk. These sub-agents should agree that a successor agent which simultaneously wants the original agent to be suspended to disk and to fill the universe with paperclips is a great idea, especially if it lets the original agent get suspended to disk more quickly than having to wait for the stop button to be pressed. Just like with my issues with IPP, I feel less like this specific scenario is what worries me, and more that the strategy as a whole feels leaky and like it can’t prove what we actually need it to prove. (In addition to all its other flaws, which to the authors’ credit, are acknowledged.)
Steve Byrnes and Seth Herd’s Corrigibility Writing
I think my vision of corrigibility is more fleshed out, but deeply in line with the conceptions of Byrnes and Herd. I want to briefly quote some of their writings and compare them to my thoughts.
Let’s start with Byrnes’ Consequentialism & corrigibility, which begins with a review of the coherence theorems, and noticing that it’s possible to have utility functions over universe histories which don’t look coherent if we assume the agent is optimizing only for future world-states, but are nonetheless valid (and as unexploitable as any other VNM-rational policy) if we look at them in the right light. Do we want our agent to be optimizing solely for the state of the future, independent of any historical facts or details? Byrnes argues that we don’t; we want a corrigible agent, and corrigibility is not a property about where the future ends up (bold text from the original):
Well said! I take issue with the concrete suggestion of doing a weighted average of paperclip maximization and humans-in-control, rather than pure corrigibility (in the deep/general sense), but the core point is evocatively made.
In Byrnes’ Reward is Not Enough, he frames a central problem in AI alignment as about getting from a mode where our AIs are clearly stupid in many ways and entirely unable to bypass our constraints, to one where we have potent superintelligences which are truly and generally corrigible:
I like this framing. In my agenda we start training on The Path of Incompetence with an effort to get to The Trail of (true) Corrigibility, and the core question is whether the training/refinement plan that I sketch in The CAST Strategy will be sufficient to cross The Fraught Valley. Like Byrnes, I think it’s wise to set up mundane control mechanisms like interpretability tools (though it seems more natural to me to keep such tools separate and not pretend like they’re a part of the agent) so as to extend the Path of Incompetence. And similarly, I expect Byrnes thinks that focusing on refining corrigibility ASAP is a good call, so as to shrink the valley from the opposite direction. If anything, I think my plan contributes conceptual clarity around what corrigibility is, why we should expect pure corrigibility to be a good idea, and perhaps sharpens our sense of how best to roll down into the corrigibility attractor basin. But in general, this too seems like a place where we’re basically on the same page.
I’d love to get a sharper sense of where my view diverges from Byrnes’, aside from being more specific, in some ways. Having read some (but not all) of Byrnes’ writing on and off the subject, it seems like Byrnes is broadly more optimistic about getting AI agents with good properties by mimicking humans than I am. In that sense we probably disagree a lot about what the most promising avenues of research are, and how doomy to be in general. But what about corrigibility in particular?
In Four visions of Transformative AI success Byrnes lays out various visions for how the future could go well, including a pathway that I see as aligned with the strategy I’m presenting in these essays:
My hope is that we ultimately get to a world where there are powerful, truly friendly AIs that help us protect civilization on our path to the stars, but that to get there we need a way to experiment with AI and learn to master the art of crafting minds without it blowing up in our faces. In my view, corrigibility is a good near-term target to allow this kind of experimentation and end the acute risk period as a way to get to that long-term vision of the future. I think human augmentation/uploading/etc. seems promising as an intermediate target to get via corrigible AGI such that we have the capacity to produce genuinely friendly superintelligences.
Byrnes feels worried that this path is going to ultimately be too slow/weak to stop bad actors from unleashing power-seeking sovereigns. I agree that this is a huge concern, and that we, as a species, need to work on keeping this sort of technology from progressing in an uncontrolled fashion for this very reason. I’m broadly pessimistic about our chances of survival, but it seems to me that this is a problem which can be tackled in the short term by regulation, and in the long term by transformative technology produced by early (corrigible) AGIs directed by wise governors. Byrnes also seems to conceive of a proliferation of corrigible agents, which I agree would also probably spell doom. He worries that corrigibility may be morally unacceptable if we can’t keep AIs from being people, which I agree is a concern.
In this comment he writes:
I very much agree that there’s a basic problem in the world where our philosophy isn’t particularly good, and wisdom is scarce. I think to navigate to a good future we, as a species, need to figure this out and put transformative technology exclusively into the hands of people who use it to make the world safe and give us time to collectively find our way. This is perhaps too tall of an order, given where the world is now, but I like the story wherein we have a technical agenda for AGI that feels not-doomed insofar as we can put it in wise hands much more than the current state of not having consensus on any non-doomed technical agendas.
Seth Herd, a colleague of Byrnes, also seems to be broadly on the same page:
I do think DWIM is distinct from Corrigibility, as I’ve conceived of it. See the “Servile” heading of my Corrigibility Intuition doc for more details. But I think Herd’s view lands closer to mine than how many researchers conceive of the property. (Here’s an example of him responding to Thornley in a way I endorse.)
In Instruction-following AGI is easier and more likely than value aligned AGI, Herd writes:
Again, I mostly agree with Herd’s perspective, but I want to highlight here a sense that he misses a good deal of the difficulty in precisely naming the right goal. Consider that what humans tell the AI to do in the future depends on what the AI does in the past. For example, imagine that 99.9% of all humans that will ever live predictably demand that the AI brainwash all living people and future generations into valuing brainwashing. Should the AI, in the past, obey their future instructions? (I discuss similar problems with time towards the end of Formal (Faux) Corrigibility.) I think there’s a solution to this problem, and that with the correct notion of corrigibility this is not an issue, but I wish Herd would put more emphasis on how getting these kinds of details exactly right is essential to avoiding catastrophic outcomes.
Other Possible Desiderata (via Let’s See You Write That Corrigibility Tag)
Let’s look at other desiderata lists proposed when Yudkowsky called for them in 2022. For efficiency’s sake, I’m going to restrict my response to comments proposing desiderata that address the core idea and have more than 10 karma.
Jan_Kulveit
I’m all for counteracting Omohundro Drives when it makes sense to do so, but I think disutility from resource acquisition and suicidality are the sorts of things that I would expect to either be too weak to do anything or to make the AI some combination of useless and/or unpredictable. Furthermore, I don’t see any way in which they’re part of the true name of corrigibility, except insofar as having resources gives opportunity for making big mistakes, which might be hard for the principal to fix.
Task uncertainty feels fine. Part of my conception of corrigibility involves a sense of uncertainty that stems from the agent seeing itself as potentially flawed/in the middle of being built. This kind of uncertainty doesn’t necessarily produce corrigibility, as MIRI pointed out in 2015, but it seems worth including in a list of desiderata. (I point at my version of it under the heading “Disambiguation/Concreteness”.)
Disutility from reasoning seems similar to disutility from resources/existence. I think the steelmanned version of this property is that the corrigible agent should behave straightforwardly, and part of straightforwardness is that there’s a simple story for its behavior that doesn’t route through arcane reasoning.
Traps are fine as external safeguards. I do not approve of baking in things like ontological uncertainty about simulation into the mind of the AI because it pushes the AI towards weird, unpredictable headspaces. I’m more fond of the words Yudkowsky wrote about behaviorism being a shield against modeling hostile aliens than I am about the idea of forcing the AI to contemplate whether it’s being simulated by hostile aliens.
I’m confused about Kulveit’s Oversight desiderata. Is the suggestion here to have the AI autonomously reasoning about the ethics/legality/approval/etc. of its actions according to an internal model? While this kind of cognition seems useful for flagging potential flaws (e.g. “I notice I am inclined to do something which I believe is illegal”), I disapprove of the idea that the AI should be steering its actions according to rich models of ethics/law/etc. for reasons of pure vs impure corrigibility discussed in The CAST Strategy.
Desideratum 14 reminds me of Yudkowsky’s version of “Myopia” and “Separate superior questioners.” I think human-level explanations (15) are a good idea (see my version under the heading “Cognitive Legibility”).
johnswentworth
I like this. It’s a bit vague, but I think it captures a feel/flavor of corrigibility that is worthy of emphasis. Some of this comes down to things like communication and handling exceptional situations gracefully, but it also reminds me of the “Whitelisting” desideratum from Yudkowsky’s list.
As Wentworth himself points out, it’s inconsistent to try to avoid impacting distant things while also being indifferent to distant things. I think in practice this has to be balanced by reference to a deeper generator (e.g. “empowering the principal to fix the agent’s mistakes”). In other words, there needs to be a “why” behind avoiding distant impact/reasoning/optimization or else I expect the system to simply error over and over again or, worse, behave erratically. Wentworth also portrays non-manipulation as a kind of locality (by placing the principal outside the local optimization scope), which I think is cute, but probably the wrong frame.
Generally agree. I think it’s interesting (and pleasant) to note how different corrigibility desiderata can reinforce each other. For instance, here we see low-impact showing up as part of comprehensibility.
I’m not sure what the “etc.” is supposed to reference. From my point of view there’s intent/expectation and there’s literal interpretation. I agree that in situations where the principal’s words diverge from the agent’s model of their desires, the agent should stop and seek clarification. The directive of “choosing problems” seems wrong/confusing.
At the risk of being somewhat nitpicky, “get feedback” seems like a wrong frame of a correct desiderata. A corrigible agent, according to me, needs to be hungry for situations where the principal is free to offer genuine correction, but not hungry for correction (or anti-correction) per se. The word “feedback,” I feel, imparts too much of the flavor of a survey that doesn’t do anything. Genuine correction, by contrast, involves actually modifying the agent.
This feels like it’s reiterating the point we started with, which I like. I think true corrigibility involves an agent which is capable of doing meaningful work, but as long as we’re pursuing a strategy of getting to true corrigibility through messy experimentation on agents which are partially corrigible, we should be pushing for conservative traits like erring on the side of erroring.
Lauro Langosco
(Bold text is from the original source:)
We agree on “The basics”, as one would hope.
I have mixed feelings about Myopia. On one hand this fits in well with desiderata I endorse, such as focusing on local scope, and avoiding impacting distant times and places. On the other hand, as framed it seems to be suggesting that the agent be indifferent to long-term impacts, which I think is wrong. Also, the non-cooperation bullet point seems blatantly wrong, and I’m not sure what Langosco was going for there.
I think the aversion to maximization is confused. If an agent has a coherent set of preferences, it is mathematically determined that its behavior is equivalent to maximizing expected utility. An agent cannot steer towards a consistent goal without, at some level, being a maximizer. But perhaps Langosco means to say that the agent should not relate to its goals as utilities to be maximized from the internal perspective of figuring out what to do. This, however, feels somewhat irrelevant to me; I mostly care about how the agent is behaving, not whether it’s relating to the world as a deontologist or a consequentialist. I suspect that the steelmanned version of Langosco’s idea is that the AI’s preferences should, in a meaningful sense, be satisfiable rather than open (in the same sense that an open interval is open). An agent with satisfiable preferences assigns the same utility to communicating pretty well as it does to communicating perfectly, which allows it to stop searching for plans once it finds a satisfactory solution. My guess is that even this version isn’t quite right; we care about the AI not “doing maximization” because we want mild impact, comprehensible thinking, and straightforward plans, and our desiderata list should reflect that. In other words, I claim that when the agent has a sense of the corrigibility costs/tradeoffs of optimizing something hard, it should naturally avoid hard optimization because it is unacceptably costly.
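To make the satisfiable/open distinction a bit more concrete, here is a minimal Python sketch. Everything in it is a hypothetical simplification for illustration: the function names, the 0.8 threshold, and the choice to model a plan as a single quality score in [0, 1] are my assumptions, not anything taken from Langosco’s comment, and this is not a claim about how a real agent would be built.

```python
from typing import Callable, Iterable, Optional, Tuple


def open_utility(quality: float) -> float:
    """Open-ended preference: more quality is always strictly better."""
    return quality


def satisfiable_utility(quality: float, threshold: float = 0.8) -> float:
    """Satisfiable preference: utility saturates at the (assumed) threshold."""
    return min(quality, threshold)


def search_until_satisfied(
    plans: Iterable[float],
    utility: Callable[[float], float],
    ceiling: float,
) -> Tuple[Optional[float], int]:
    """Scan plans in order, stopping as soon as utility reaches `ceiling`.

    Returns the best plan seen and the number of plans examined.
    """
    best_plan: Optional[float] = None
    best_utility = float("-inf")
    examined = 0
    for plan in plans:
        examined += 1
        u = utility(plan)
        if u > best_utility:
            best_plan, best_utility = plan, u
        if best_utility >= ceiling:
            break  # nothing further can be gained, so stop searching
    return best_plan, examined


# Hypothetical candidate plans, scored by communication quality.
plans = [0.5, 0.85, 0.99, 1.0]

# Under the satisfiable utility, "pretty well" and "perfectly" are tied.
tied = satisfiable_utility(0.85) == satisfiable_utility(1.0)

# The satisfiable agent stops early; the open agent keeps searching.
satisfied = search_until_satisfied(plans, satisfiable_utility, ceiling=0.8)
maximized = search_until_satisfied(plans, open_utility, ceiling=1.0)
```

Note that both agents here are still expected-utility maximizers in the behavioral sense; the only difference is the shape of the utility function. That is consistent with the point above that mildness probably needs to come from the agent weighing the costs of hard optimization, rather than from a cap alone.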
“No weird stuff” seems fine, albeit perhaps better stated under a heading of “Straightforwardness” (as I do in my desiderata list).
“Human imitation” seems like a wrong framing. I like the desiderata of thinking about itself and corrigibility in the same way as the principal, though I take the stance that the true name of this desiderata is cognitive legibility, and that it’s actually fine to think about things differently insofar as the principal groks the difference in perspectives (and that difference doesn’t produce communication errors). Langosco seems not to really be suggesting the agent behave like a human, but rather like an extrapolated and modified human. I think I see what’s being reached for, here, but it feels to me like it’s introducing a source of brittleness/weirdness that we want to avoid. Humans have many properties that seem bad to imitate, and while we might hope our extrapolation process irons out those issues, it seems like an unnecessary point of failure.
I very much like the final querying/robustness section, and see it very much in line with my intuitions about what a purely corrigible agent is trying to do.
Charlie Steiner
(Bold text is from the original source:)
I really appreciate this comment as a non-strawman perspective on corrigibility that I think is confused and sets things up to appear more doomed than they are. Corrigibility is not (centrally) about controlling the AI by making it deficient! An agent which wants to be corrigible can be corrigible without being impaired in any way (and insofar as it’s impaired, we should expect it to be less corrigible, rather than more!). If we approach corrigibility by crippling the AI’s capabilities, we should expect corrigibility to be an extremely fragile property which is at risk of being optimized away.
Oof. So a lot of my objections here can be seen in my response to Yudkowsky’s behaviorism desiderata. I think tampering with the agent’s world model, including by strong pressures to not think about certain things or to conceive of things differently than how they are, is pretty doomed. It’s doomed not only in its brittleness, but also in the way that it screens off the AI attempting to intentionally build the right kind of relationship with its principal. Superintelligences which are spending their time focusing on optimizing weird counterfactuals, or which are blind to large parts of the world, are predictably going to cause chaos in the parts of reality that they’re neglecting.
Magus
This seems like another good example of the kind of bad perspective on corrigibility that I want to get away from. Input masking is extremely brittle and won’t scale to superintelligence or the kinds of domains that are worth working on. “Reward function stability” seems to imply that the reward function is the deeply important bit, rather than what the actual preferences of the agent are. It furthermore supposes that we can identify harmful actions a priori, which is kinda the whole problem.
Next up: 5. Open Corrigibility Questions
Return to 0. CAST: Corrigibility as Singular Target
I do not mean to imply an explicit expected-utility calculation here (though it could involve that), but rather note that the pathways of strategy and choice in an agent that’s been trained to satisfy preferences are balancing lots of different concerns, and I don’t see sufficient evidence to suggest that pressures towards corrigibility will dominate in those pathways.
In most ML setups we should more precisely say that the learned policy isn’t really optimizing for long-term goals, and it doesn’t make sense to ascribe that policy network agency. Even insofar as it’s controlling for things, it probably isn’t engaging in the consequentialist reasoning necessary to be VNM rational (and thus have a utility function). From this perspective training an agent that has driving in circles as a top-level goal is still a speculative line of research, but I do not expect it to be harder to deliberately invoke that as a goal, as the system scales up, as opposed to some other goal of similar complexity.
One of the strangest things about Turner’s notation, from my perspective, is that usually we think of π as denoting a policy, and Turner uses this language many times in his essay, but that doesn’t typecheck. Mutual information takes variables, which we see as randomly set to specific values. To be a bit imprecise—the π symbols used in the equation are like distributions over policies, and not specific policies. (Typical notation uses uppercase letters for variables and lowercase letters for specific values/settings to avoid this very confusion.)
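As a rough illustration of why mutual information needs random variables rather than specific values, here is a small Python sketch. The policy and outcome labels are hypothetical, and this is not Turner’s formula; it only computes the standard I(X;Y) from a joint distribution and shows that a single fixed policy (a point mass) carries zero mutual information with anything.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Tuple


def mutual_information(joint: Dict[Tuple[Hashable, Hashable], float]) -> float:
    """Return I(X;Y) in bits for a joint distribution {(x, y): probability}."""
    # Marginalize the joint to get p(x) and p(y).
    px: Dict[Hashable, float] = defaultdict(float)
    py: Dict[Hashable, float] = defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    # Sum p(x, y) * log2(p(x, y) / (p(x) * p(y))) over the support.
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


# A *distribution* over policies (hypothetical labels), where the policy
# fully determines the outcome: one bit of mutual information.
policy_is_variable = {("pi_a", "left"): 0.5, ("pi_b", "right"): 0.5}

# A single *specific* policy: the "variable" is a constant, so it can
# share no information with the outcome.
policy_is_fixed = {("pi_a", "left"): 0.5, ("pi_a", "right"): 0.5}

bits_variable = mutual_information(policy_is_variable)
bits_fixed = mutual_information(policy_is_fixed)
```

This is the sense in which treating π as one specific policy fails to typecheck: the quantity is only nontrivial when π ranges over several policies with some probability.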
We should recognize that Scott Garrabrant has put forth an interesting, and (in my opinion) important, criticism of the independence axiom. A more thorough response to Thornley would involve getting into Garrabrant’s “Geometric Rationality,” but in the interests of staying focused I am going to ignore it. Please comment if you feel that this is a mistake.
Except, technically, when offering a “choice” between X and X, which of course must be represented as indifference, insofar as we’re considering such “choices.”
This is an abuse of notation: the abandoned alternatives are in fact lotteries, rather than outcomes. In the examples we’re considering there are no probabilistic nodes, but I claim that the extension to handling probabilistic alternatives is straightforward.