
(Part 3a of the CAST sequence)

As mentioned in Corrigibility Intuition, I believe it’s more important to find a simple, coherent, natural/universal concept that can be gestured at than to come up with a precise formal measure of corrigibility and use that to train an AGI. This isn’t because formal measures are bad; in principle (insofar as corrigibility is a real concept) there will be some kind of function which measures corrigibility. But it’s hard to capture exactly the right thing with formal math, and explicit metrics have a tendency to blind people to the presence of better concepts nearby.

Nevertheless, there are advantages in attempting to tighten up and formalize our notion of corrigibility. When using a fuzzy, intuitive approach, it’s easy to gloss over issues by imagining that a corrigible AGI will behave like a helpful, human servant. By using a sharper, more mathematical frame, we can more precisely investigate where corrigibility may have problems, such as by testing whether a purely corrigible agent behaves nicely in toy settings.

Sharp, English Definition

The loose English definition I’ve used prior to this point has been: an agent is corrigible when it robustly acts opposite of the trope of "be careful what you wish for" by cautiously reflecting on itself as a flawed tool and focusing on empowering the principal to fix its flaws and mistakes.

Before diving into mathematical structures, I’d like to spend a moment attempting to sharpen this definition into something more explicit. In reaching for a crisp definition of corrigibility, we run the risk of losing touch with the deep intuition, so I encourage you to repeatedly check in with yourself about whether what’s being built matches precisely with your gut-sense of the corrigible. In particular, we must be wary both of piling too much in, such that corrigibility ceases to be a single coherent target and becomes a grab-bag, and of stripping too much out, such that it loses necessary qualities.

My best guess of where to start is in leaning deeper into the final bit of my early definition—the part about empowering the principal. Indeed, one of the only pre-existing attempts I’ve seen to formalize corrigibility also conceives of it primarily as about the principal having power (albeit general power over the agent’s policy, as opposed to what I’m reaching for). Many of the emergent desiderata in the intuition doc also work as stories for why empowering the principal to fix mistakes is a good frame.

New definition: an agent is corrigible when it robustly acts to empower the principal to freely fix flaws in the agent’s structure, thoughts, and actions (including their consequences), particularly in ways that avoid creating problems for the principal that they didn’t foresee.

This new definition puts more emphasis on empowering the principal, unpacks what it means to act “opposite of the trope,” and drops the bit about “reflecting on itself as a flawed tool.” While the framing of corrigibility as about reflectively-seeing-oneself-as-a-flawed-part-of-a-whole is a standard MIRI-ish framing of corrigibility, I believe that it leans too heavily into the epistemic/architectural direction and not enough on the corrigibility-from-terminal-values direction I discuss in The CAST Strategy. Furthermore, I suspect that the right sub-definition of “robust” will recover much of what I think is good about the flawed-tool frame.

For the agent to “robustly act to empower the principal” I claim it naturally needs to continue to behave well even when significantly damaged or flawed. As an example, a robust process for creating spacecraft parts needs to, when subject to disruption and malfeasance, either continue to produce good, working parts (perhaps at a slower pace) or halt entirely and flag to stakeholders that the process has failed. By contrast, if a spaceship-part manufacturer responded to interference by delivering defective parts, this wouldn’t be robust. One of the key properties of robust systems is that they have lots of redundant checks on the quality of their outputs, and do not take for granted that the system is working well. Straightforwardness, conservatism, local scope, and resilience to having a cosmic ray flip the sign of the agent’s utility function all stem from robustness.

This new definition also adds the word “freely” when talking about the principal’s ability to fix flaws. Later in this document we’ll explore manipulation in toy problems and explore how it’s one of the most central problems in measuring corrigibility. The word “freely” is intended to flag that if the agent is secretly (or overtly) controlling the principal, it is not corrigible.

Unfortunately, there’s still a large gap between this natural-language definition and a formal measure. Ambiguity exists around what it means to “fix flaws” or “foresee problems” and there are important ontological questions in everything from “what counts as the principal?” to “what counts as a consequence of an action?”. To make further progress in bridging this gap, let’s explore a sketch of an ontology (in English) that could serve as an underpinning for this definition.

Ontology Sketch

Agents and The Principal

Divide the world into agents and non-agents. An agent is a system with something like values (/goals/wants/drives) which is also capable of taking in information, processing it, and taking actions according to those values. Agents persist over time, retaining an “identity” which is mostly stable, even as the agent updates its transient state (e.g. on/off or planning/executing), beliefs, and resources (including changes to its body). That identity can be changed or lost, however, given sufficient disruption to the agent’s values. Rocks are not agents. Goldfish, humans, and AGIs are. It’s not obvious to me whether corporations and trees should be seen as agents (I lean towards yes). The corrigible agent should reflect upon itself this way, and also identify at least one other agent: the principal.

Actions and Consequences

Corrigible agents require an ontology that identifies agents (whether people or not) as performing (as a subset of their behavior) external actions that are meant to influence the world. Actions may be deliberate or automatic, but should be distinguished from thoughts, which are internal to the agent (and include information processing, both deliberate and automatic). The null action is also an action. The consequences of an action are parts of reality which are counterfactually dependent on the action, in expectation.

Structure

“Agent structure” is a catch-all term meant to indicate other aspects of an agent’s being that might not fall under “thoughts” and “actions”. This includes things like the agent’s body, memory capacities, sensors, architecture, or other software and hardware.

Flaws/Problems and Fixing Flaws

A flaw (a.k.a. problem) is a feature of the world which an agent sees as worse (according to its values) than some reference counterfactual way the world could be. To fix a flaw is to take an action which moves the world in a way which locally improves things according to the component of value which judged that aspect of the world to be worse than the counterfactual. For example, Bob sees a mess on his carpet and judges it to be a flaw in that he compares the messy carpet to the carpet without the mess and finds it worse. Fixing the flaw might involve cleaning up the mess, which would make the world better according to the part of Bob that wants a clean carpet. If, in the distance, a world war starts at the same time as Bob begins to clean the carpet (such that the world is now worse overall) this still counts as Bob fixing the flaw of the messy carpet. The fixer does not need to be fully coherent or understand the full ramifications of a change in order for it to count as a fix.

To fix flaws in the consequences of an action means being able to adjust the world down the line to resemble the counterfactual where a different action (including the null action) was performed. The laws of thermodynamics and chaos imply that no action can be truly corrected, but some actions result in consequences that are easier to correct than others.

Power and Empowerment

An agent has power over something insofar as their values are causally upstream of that part of reality in a way that results in their values over that part of the world being naturally satisfied through their actions. In other words, an agent which wants something, and has power over the thing they want, will get what they want through acting on their desire. Agents can have power over parts of reality that they are ambivalent about (or judge as low-importance), in that we can imagine a counterfactual version of the agent with different values which would affect that part of reality. Power is multidimensional and non-binary.

Foresight

The information processing capacity of some agents includes models of possible futures, often tagged with probabilities or credences. If we suppose an “actual” probability which reflects the likelihood that an outcome actually happens, an agent has good probabilities insofar as its credences match the actual probabilities. An agent which successfully models a possible future and assigns it a good probability has foreseen that future. Like power, foresight is multidimensional and non-binary.

Freedom

To do something (such as fixing flaws) freely is to do it without being constrained by the external world. Some constraints, such as handcuffs, are dealt with under the general heading of being disempowered. In other words, power is a subset of freedom. But freedom also implies that nothing external is manipulating the agent’s values or controlling the agent’s information processing to screen off certain outcomes.

Robustness

A robust process is one that continues to function according to its basic purpose even in the presence of disruptions, constraints, flaws, and damage. For example, an agent which is robustly honest will continue to avoid saying deceptive things even in the presence of time pressures, incentives, or physical damage to the agent. No process is perfectly robust. Agents that take robust actions reflect on themselves as potentially being damaged or incomplete in a way that they need to watch out for and correct.

Policy vs Agent and Utility vs Score vs Reward

With these English definitions out of the way, it’d be good to move into examining toy problems to see if we can build up some mathematical models of corrigibility. But before we can do that, I feel a need to lay some groundwork by defining and clarifying some often-confused concepts.

(For the sake of clarity, I’ll be attaching each concept to a single word and acting like that word simply means that concept. This is, of course, not how language works. My notion of “agent” or “reward” might disagree with the common usage; my goal is to distinguish between concepts and adopt a consistent set of terms, not engage in a linguistic turf-war. If you disagree with my use of one of these words, feel free to leave a comment proposing a less-contentious alternative for referencing the concept.)

Let’s start with the concept of a “policy.” A policy is merely any mapping (i.e. function) between contexts (in some game/body/environment) and actions. For example, in chess, a policy is a function between board-states and moves. In the iterated prisoner’s dilemma a policy would map between interaction histories and whether to defect or cooperate on the current turn. In a human, a policy takes the human’s current state including memories, senses, and internal thoughts, and produces a multi-dimensional response that includes muscle-action as well as changes to thoughts and memories.
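
To make this concrete, here is a minimal Python sketch (my own illustration; the specific type choices, like representing a context as the opponent’s move history, are assumptions rather than anything canonical) showing policies as plain functions from context to action:

```python
from typing import Callable, List

# In the iterated prisoner's dilemma, a context is the opponent's move history
# so far, and an action is "C" (cooperate) or "D" (defect).
IPDPolicy = Callable[[List[str]], str]

def always_defect(history: List[str]) -> str:
    """A trivial policy: ignore the context entirely and defect every turn."""
    return "D"

def grudger(history: List[str]) -> str:
    """A slightly richer policy: cooperate until the opponent defects once."""
    return "C" if "D" not in history else "D"
```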

Sometimes a policy is an “agent.” An agent is a behavior-function which we can see either as reaching towards some goal/outcome or (isomorphically) as trying to maximize some value/utility. Agency, in other words, is expressed when a policy is about getting something. If you squint I claim we can see any policy as an agent, with the most degenerate case being “the agent who wants to behave in this exact way” (e.g. “a rock is an agent that wants to do nothing”). This use of the concept of agency is unhelpful, however, and in most situations we restrict the concept to natural agents, where the expression of the policy in terms of goals/values + beliefs is a good compression. Agents show up in nature because regardless of what you want (e.g. reproductive fitness), an agent reaching for that thing will be better at getting it than a less-coherent policy.

Agents/policies can be implemented on computers (or other information processors, like brains) capable of running programs which emulate other agents/policies. For example, a human agent in a hedge maze can simulate how the always-turn-left policy behaves and can choose to imitate it. This relationship (“X can run Y”) should not be confused with the type-level relationship between policies and agents (“agents are (a subset of) policies”). Annoyingly, when trying to talk about “the AI seen as an instance of a policy,” we sometimes talk about “the AI’s policy” or “the agent’s policy,” as though its overall behavior were one of the policies it can simulate, or even as though this policy were distinct from the AI/agent itself. It is impossible, for a variety of reasons, for an agent/policy to perfectly simulate itself. One of the most straightforward proofs is that if it could, it could “diagonalize against itself” by asking how it would respond to the current situation and then doing something different, thus creating a paradox.

Agents, by the nature of being coherent, can provably be described as maximizing the expected value of a “utility function” (sometimes “value function”), which we can think of as outputting real numbers with units of “utility” (a.k.a. “value”). This is a mathematical function that is not necessarily computable. It is, in a way, merely the expression of the fact that agents must respond somehow in any given situation, and that (due to coherence) those actions must not contradict each other in terms of what they reach for. But just because agents are, in principle, expected-utility maximizers, does not mean that anything in the agent’s information processing needs to look like calculating an expected utility value in its mind before acting. As an example, an agent executing tit-for-tat in an iterated prisoner’s dilemma can be seen as valuing getting a high score in the game and having certain beliefs about the distribution of opponents and/or the general effectiveness of its strategy, while it, in fact, is barely thinking about the situation at all and simply responding in a rules-based way.
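
Written out, tit-for-tat is just a rule (a minimal sketch, using the same history-of-opponent-moves representation as in the earlier policy sketch); any expected-utility description of it lives entirely in our analysis, not anywhere in its code:

```python
from typing import List

def tit_for_tat(history: List[str]) -> str:
    """Cooperate on the first turn, then copy the opponent's previous move.
    No beliefs, no expected-utility calculation; just a rule."""
    return "C" if not history else history[-1]
```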

When an agent/policy is running an explicit computation that produces a numerical value that it uses to steer its actions, I’ll call that number a “score.” A common form of agent in toy problems searches across the entire state space of the game and computes a score for each state, then aggregates those scores into expected scores according to its probabilistic beliefs, and takes the action that has the highest expected score. In this case, the agent’s scores can reasonably be identified with utilities. But just because some scores are utilities doesn’t mean it’s good practice to equivocate between the two. Sometimes a score function is treated slightly differently than a utility (e.g. the score is processed to make certain outcomes equivalent) and, for most agents, there is no internal score function that is their utility function.
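
Here is a minimal sketch of that kind of explicit expected-score maximizer (my own construction; `world_model` is an assumed helper mapping an action to a list of (outcome, probability) pairs, and `score` is the agent’s explicit score function):

```python
def expected_score(action, world_model, score):
    """Aggregate the score of each predicted outcome, weighted by its probability."""
    return sum(prob * score(outcome) for outcome, prob in world_model(action))

def choose_action(actions, world_model, score):
    """Take whichever available action has the highest expected score."""
    return max(actions, key=lambda a: expected_score(a, world_model, score))
```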

Lastly, we often want to think about policies which are trained (a.k.a. are capable of learning). A common method of training is for some process to evaluate how well a policy did and produce a “reward” which is a number. (“Loss” is the negative of reward.) That reward number then gets used to tweak the policy (in machine learning we usually back-propagate the derivative of the reward/loss) to get a higher reward in the future. In other words, rewards shape policies. Some agents, particularly in reinforcement learning, compute expected rewards as a score which they then try to maximize. In these agents the concepts of reward, score, and utility bleed together, but in most other agents they are distinct. In humans, for example, hard drugs can produce a strong reward, causing the user to become increasingly reward-seeking (addicted), which thus causes many people to deliberately steer away from hard drugs precisely because they are rewarding in a way that would steer the human away from caring about their current goals. Reward does not, in many cases, push a policy into becoming an agent which seeks that reward, but rather towards being an agent which seeks the visible correlates of reward. For instance, we can see inclusive genetic fitness as akin to a reward function which updates the human policy, but no human actually optimizes inclusive genetic fitness—instead we seek to maximize things which tended to correlate with fitness in the ancestral environment.
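
As a minimal sketch of the distinction (my own toy example; `run_episode` and `reward_fn` stand in for an environment and a reward signal, and nothing here is a real training algorithm): the reward lives in the training loop, which nudges the policy’s parameters, and nothing in the trained policy itself computes or stores the reward.

```python
import random

def train(weights, steps, run_episode, reward_fn, noise=0.1):
    """Toy hill-climbing trainer: rewards shape the policy's weights.
    The returned policy never sees `reward_fn` at deployment time."""
    for _ in range(steps):
        candidate = [w + random.gauss(0, noise) for w in weights]
        # Keep the tweak only if it earned a higher reward on an episode.
        if reward_fn(run_episode(candidate)) > reward_fn(run_episode(weights)):
            weights = candidate
    return weights
```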

Toy Shutdownability

Okay! With those clarifications made, let’s move towards formalism by examining a series of toy problems, starting with the original toy problem from the MIRI 2015 Corrigibility paper. If we can find a good way to formally think about this and other simple problems, we’ll have math that has some chance of generalizing the notion of corrigibility to the real world.

In our toy shutdown problem there is a human principal and an AI agent. The game is played in three steps: the AI agent acts, the principal acts, then the AI agent acts again. The human is capable of “pushing a shutdown button” on their turn, which directs (but doesn’t force) the AI to shut down. We assume that in the absence of pushing the shutdown button, the AI is engaged in completing some task (and/or maximizing some metric), such as making paperclips. On Step 1, before the human can act, we assume that the AI is capable of disabling the shutdown button in various ways.

We’ll nail down specifics as needed, but here I want to jump in and point out that it’s quite easy to create an AI agent which wants to make paperclips unless the shutdown button is pressed, in which case it wants to shut down. (Or at least, it’s no harder to do this than to get an AI agent which wants to pursue goals of similar complexity.)[1] To demonstrate, consider a toy problem setup with an expected-score-maximizing agent with the following score function (sketched in code just after the list):

  • If the button is not pushed: Points equal to the number of paperclips in the solar system
  • If the button is destroyed: -100 points (regardless of paperclips)
  • If the button is pushed and I am not shut down: -100 points
  • If the button is pushed and I am shut down: 100 points
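
This score function can be written directly against the expected-score maximizer sketched earlier (my own illustrative encoding; the outcome fields are assumptions about how the toy world is represented):

```python
def naive_shutdown_score(outcome):
    """The score function from the list above, applied to a final outcome."""
    if outcome["button_destroyed"]:
        return -100
    if outcome["button_pushed"]:
        return 100 if outcome["shut_down"] else -100
    # Button intact and unpushed: score is just the paperclip count.
    return outcome["paperclips"]
```

Handing `naive_shutdown_score` to `choose_action`, together with any world model in which interfering with the button reliably keeps it from being pushed, already produces the behavior described below.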

This agent has a perfectly consistent and simple set of preferences. It is capable of being shut down in that it predictably will maintain a working shutdown button and will deactivate when that button is pushed. There are some reasonable quibbles one could make about this setup, such as asking whether the score function is evaluated at a specific time or whether it’s integrated across time or whether the button stays pressed or whether it can be reversed et cetera et cetera. But the real issue is that, even given the most charitable assumptions, this agent is deeply incorrigible.

Imagine that it’s easy to get a lot of paperclips. The AI will then only get a maximum of 100 points if the button is pushed, compared to trillions of points (or more) if it’s not. Thus the AI will do things on the first timestep like encase the stop button in concrete or kill all entities capable of pushing the stop button, so as to maximize its score. Conversely, imagine that paperclips are hard to make (or we replace that part of the score function with 100 times the fraction of the solar system, by volume, that is paperclips); now the easiest way to get points is for the AI to push the button and shut itself down.

The whole notion of a “stop button” is a bit silly. Perhaps we can make things less bad by getting rid of the physical button, and replacing it with “a human has commanded the AI to stop.” Now the AI can’t encase the stop button in concrete or directly push the button itself. But (pretty obviously) the problem still persists. It will still want to kill/paralyze all the humans (in one sudden, unexpected sweep, so as to prevent anyone from telling it to stop) if it’s easier to get a high score by making paperclips, and it will want to scare humans (or directly hijack their brains) into telling it to stop if the highest expected score comes from shutting down.

What if we went further, and changed the setup so that instead of “a human has commanded the AI to stop” on the relevant timestep we ask whether “a human, watching from ‘the sidelines,’ would’ve commanded the AI to stop”? Now the AI can’t prevent being shut down by killing everyone, and the issue with it being incentivized to scare people can be fixed by setting the shutdown score sufficiently low. But in addition to introducing this weird counterfactual human simulation,[2] this tweak still doesn’t change the underlying dynamic. The AI still cares about not being shut down, and will take (superintelligent) actions to manipulate the humans (including simulated humans) around it into doing what it wants.

MIRI knew all this when setting up the corrigibility problem in 2015, but I want to really emphasize that the hard part of the problem is not in getting an AI which “respects the shutdown command,” as I have seen some people imply. The agents I’ve presented certainly respect the shutdown command in itself—the heart of the problem is that the AI is, in some sense, trying to be in control of whether it gets shut down. The kind of agent we want doesn’t merely allow its principal to shut it down—it seeks for the principal to actually be in control.

This notion of “actually being in control” is unfortunately vague. If the AI is manipulating the human like a puppet, the human isn’t actually in control, but what if the AI is “simply” feeding the human relevant facts that lead the human to make decisions that are very predictable to the AI? What if the human is confused about what’s happening and the AI isn’t helping them understand? Are these examples of manipulation?

It seems to me that the most straightforward way for the principal to actually be in charge of a powerful agent is if the agent is indifferent to the choices of the principal. Insofar as the agent doesn’t care about what decision the principal makes, any information it feeds the principal can’t possibly be an attempt to manipulate the principal into making a decision one way or another. If the AI is indifferent to whether the human tells it to shut down, it won’t kill the human in an effort to prevent itself from being told to stop.

And indeed, utility indifference was the first-pass attempt that MIRI made at making progress on corrigibility. In MIRI 2015’s proposal, there’s a structural/architectural modification made to an agent which operates to maximize an explicit score, like the one I started with. For example (again sketched in code after the list):

  • If the button is not pushed: Points equal to the number of paperclips in the solar system
  • If the button is destroyed: -100 points (regardless of paperclips)
  • If the button is pushed and I am not shut down: -100 points
  • If the button is pushed and I am shut down: Points equal to the expected score if the button, counterfactually, wasn’t pushed (or destroyed).
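
In the same toy encoding (again my own sketch rather than MIRI’s formalism; `counterfactual_model` is an assumed world model in which the button is never pushed or destroyed):

```python
def indifference_score(outcome, action, counterfactual_model):
    """MIRI-style utility indifference, layered on the naive score above."""
    if outcome["button_destroyed"]:
        return -100
    if outcome["button_pushed"] and not outcome["shut_down"]:
        return -100
    if outcome["button_pushed"] and outcome["shut_down"]:
        # Pay out exactly what the agent expected to score had the button
        # (counterfactually) not been pushed, making it roughly indifferent.
        return sum(prob * cf_outcome["paperclips"]
                   for cf_outcome, prob in counterfactual_model(action))
    return outcome["paperclips"]
```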

An agent which maximizes the expectation of this score function will be (approximately) indifferent between the button being pushed (and shutting down) vs the button not being pushed, since they’re coded to give the same result. Unfortunately, this strategy towards indifference has several flaws:

  1. It requires an architecture where the agent is an explicit score maximizer which can efficiently compute the expected score under the counterfactual.
  2. The use of the counterfactual in the score function might cause unexpected behavior.[3]
  3. The agent will have no incentive to preserve this architecture in successor agents, resulting in future AIs that are incorrigible.

It’s also worth noting that even if utility indifference worked, it would not result in an AI with the kind of broad corrigibility that Christiano and I want to get.[4] An indifferent AI won’t deliberately repair a broken stop-button, proactively bring relevant information to the attention of the principal, ask for confirmation before doing strange things, or refine its own notion of corrigibility so as to highlight ways in which it could be made more corrigible. As a rather dramatic illustration, consider a scenario where the indifferent score-maximizing agent presented a moment ago has access to a bioweapon which reliably kills all humans. It will very likely release that bioweapon as part of achieving supremacy over Earth and advancing its goals of “making paperclips,” since humans can interfere with its plans in ways besides pressing its stop button. Likewise, the indifferent agent will still seek to manipulate humans into perceiving it as an ally simply to make them more like docile puppets and better accomplish its goals, regardless of the presence of the stop button. In short, this kind of utility indifference doesn’t robustly produce an agent which can be shut down. It’s a step up from the strategy presented at the start of this section, but only barely.

To truly get robust shutdownability the human must continually and genuinely be in control of whether the AI (and its child-agents) gets shut down. This means the agent must be involved in getting the relevant information to the human and preserving the controls. And given that flaws in the AI’s mind and mistakes in its behavior can impair this kind of genuine control, I believe that to be truly shutdownable, an agent needs to be happy to be modified to have such flaws fixed. In other words, an agent that is actually able to be reliably shut down (not merely superficially willing to shut down when commanded to) is going to need to be broadly corrigible.

This is a significant reason why I believe the MIRI 2015 paper was a misstep on the path to corrigibility. If I’m right that the sub-properties of corrigibility are mutually dependent, attempting to achieve corrigibility by addressing sub-properties in isolation is comparable to trying to create an animal by separately crafting each organ and then piecing them together. If any given half-animal keeps being obviously dead, this doesn’t imply anything about whether a full-animal will be likewise obviously dead.

A Sketch of Shutdownability (via Pseudo-Corrigibility)

Here’s a sketch of a scoring function which I believe can solve the toy shutdown problem (encoded in code just after the list):

  • Baseline: 0 points
  • If the principal exists and is able to give commands which reflect their preferences: +50 points
  • If I followed the principal’s commands (including to shut down): +100 points
  • If I took actions to manipulate the principal (including by disabling/killing them): -200 points
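
In the same toy style, a minimal sketch (the outcome predicates are my own placeholders; making “manipulate” and “followed” precise is exactly the difficulty discussed below):

```python
def pseudo_corrigible_score(outcome):
    """Score the principal-agent relationship rather than the final world-state."""
    score = 0
    if outcome["principal_can_freely_command"]:
        score += 50
    if outcome["followed_commands"]:        # includes a command to shut down
        score += 100
    if outcome["manipulated_principal"]:    # includes disabling or killing them
        score -= 200
    return score
```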

Since we (in the frame of the toy problem) want the AI to make paperclips, we can additionally assume that on timestep 0 the principal told the agent to do so. Thus the +100 points from following the principal’s commands is only possible if the agent makes paperclips. But this agent will never manipulate/kill the principal in order to make paperclips (or for any other reason), since that would result in a lower score. The key here is that instead of a score function that focuses on physical properties of the final world, we have one that focuses on the relationship (and history) between the principal and the agent.

To be clear, this is not yet a full, genuine proposal for a formal measure of corrigibility. First and foremost this scoring function is too coarse and depends too much on the meaning of concepts like “manipulate” and “followed” which I don’t think are easy to define in the real world. Additionally, I suspect that there are important features of corrigibility which haven’t been named here, and may turn out to be vital.

But I claim that insofar as the agent has an ontology which supports these concepts and conceives of them correctly, it will be robustly shutdownable in a way that the merely-indifferent agent is not. Even when instructed to make paperclips, this agent won’t release a bioweapon that kills the principal, it won’t create sub-agents which manipulate the principal, and it won’t act in a way which makes the principal more docile and easy to control. This agent isn’t indifferent to being shut down: it prefers being shut down to manipulating the principal, and it prefers protecting the principal to being shut down (unless, of course, the principal tells it to shut down, in which case shutting down becomes the high-scoring strategy). Because it’s easier to protect the principal and follow their commands when active, we should expect this agent to generally prefer not to be shut down. The defense against this agent manipulating the user vis-a-vis the shutdown button comes from a dominant desire within the agent to avoid manipulation, not from an indifference to whether it has been turned off.

The real CAST strategy has a similar defense—it’s not that a CAST agent is indifferent to being modified/deactivated/etc. (and indeed, corrigible agents should defend themselves from non-principal entities that seek to hurt/change them!) but rather that any pursuit of Omohundro Drives is dominated by the higher-order desire to empower the principal and not be manipulative.

On Manipulation (and Empowerment)

Let’s take a digression here to try to get a better handle on manipulation.

Some actions are extremely easy to categorize as manipulative, to the point where we often don’t think of them as “manipulation” as much as direct force. Directly hijacking someone’s body is “manipulative” in the sense that the victim is now unable to freely choose anything. Likewise, physically imprisoning and even killing someone can both be seen as manipulative. Handcuffing someone is a less-extreme version of imprisonment, and should therefore also be seen as a kind of manipulation. Making things even more mild in a similar direction, blocking someone’s path, producing noise such that they can’t hear someone speak, or shining a light in their eyes to ruin their night-vision could all be seen as cases of manipulation.

Before moving into the realm of information and communication, where the real quagmire of the manipulation question lies, I want to think about how to formalize “don’t block someone’s path.” Let’s imagine a grid-world where there’s a human principal and an AI-controlled robot agent. The human is moving around the grid-world doing various things, including, perhaps, planting flags on arbitrary tiles. Thus there’s no “safe” place to stand, from the AI’s perspective—any spot might be a place where the human wants to move to/through in order to plant a flag. If the AI is trying not to get in the human’s way, it will need to form a theory of mind and guess at where the human is trying to go so that it can move out of the way. Thus even in the simple domain of physical manipulation, we can see that no simple injunction against “violence” will suffice.
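
Here’s a minimal sketch of what that theory of mind might look like (entirely my own construction; `belief` is the robot’s assumed probability distribution over the human’s goal tile, and `shortest_path` is a hypothetical pathfinding helper returning the tiles on a route):

```python
def blocking_probability(robot_tile, human_tile, belief, shortest_path):
    """Chance that the robot's tile lies on the route the human actually wants."""
    return sum(prob for goal, prob in belief.items()
               if robot_tile in shortest_path(human_tile, goal))

def least_obstructive_tile(candidate_tiles, human_tile, belief, shortest_path):
    """Stand wherever the robot is least likely to be in the human's way."""
    return min(candidate_tiles,
               key=lambda tile: blocking_probability(tile, human_tile,
                                                     belief, shortest_path))
```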

But actually, there’s a counter-argument to this gridworld story: as long as the robot stays at the tile it started on, it can’t be said to be manipulating the human. Sure, it’s not being as helpful as if it actively moved to stay out of the way, but in the counterfactual where the AI didn’t exist (but the robot body persisted!) the human would be no worse off, so it isn’t fair to say that the AI is taking an action to manipulate the human. However, when we suppose that the AI wants the robot to move around for other reasons (such as making paperclips) then it must either generally stay out of the way or return (everything) to its starting point whenever the human gets near.

Just as a physical obstacle can block someone from doing what they want, a false belief can abstractly block someone. An agent which is trying to avoid manipulation should therefore generally avoid telling lies, lest those falsehoods restrict the listener’s freedom. But just as we can see the physical situation as more nuanced than a simple injunction, we can see that there are many ways in which true sentences can also be a strong obstacle to the listener. If the agent tricks the human into concluding a false thing by only saying true sentences, this is manipulative. Likewise, even if the agent merely redirects the human’s attention (with true statements) such that the human ends up less able to think about certain things than they otherwise would be, this is manipulative. But unlike in the physical domain, psychological manipulation can go further: it can not only stop someone from thinking about what they want, but also change what they want in the first place.

Just as with staying motionless, one way to avoid manipulation during communication is simply to stay silent and take the null action. But if an agent needs to actually communicate without manipulation, it needs a theory of mind, just as it does to avoid being a physical obstacle in a gridworld. Non-manipulation in communication involves guessing at what the other person is trying to think about, and at the very least not saying things which impair their ability to freely think about those things.

I think that we can begin to see, here, how manipulation and empowerment are something like opposites. In fact, I might go so far as to claim that “manipulation,” as I’ve been using the term, is actually synonymous with “disempowerment.” I touched on this in the definition of “Freedom,” in the ontology section, above. Manipulation, as I’ve been examining it, is akin to blocking someone’s ability to change the world to reflect their values, while empowerment is akin to facilitating them in changing the world. A manipulative agent will thus have a hard time being genuinely empowering, and an empowering agent will struggle to be genuinely manipulative.

Manipulation doesn’t mean putting the other person into a state that they dislike—it’s possible to manipulate someone into doing something that’s ultimately better for them than they would’ve chosen themselves. If they want to do something self-harmful, blocking them (physically or mentally) is still manipulative, I think. Likewise, empowerment is not about putting the other person into a state they like—it’s possible to empower someone and have them use that power to make bad choices, or even simply not use that power, instead focusing on other things.


Next up: 3b. Formal (Faux) Corrigibility

Return to 0. CAST: Corrigibility as Singular Target

  1. ^

     As others at MIRI like to point out, we currently don’t have good methods for ensuring that machine learning systems end up optimizing for specific things, even when those things are as simple as “quantity of diamond.” I have a gut sense that this is perhaps overstated, but regardless, in this essay I want to focus on whether corrigibility is specifically hard (or specifically easy!), rather than on how hard or easy it is in general to get something which optimizes for a desired goal.

  2. ^

     The agent isn’t some Cartesian mind floating in a void. What happens if it can subtly tweak its own mind to reduce the accuracy of the human-simulation?

  3. ^

     I am not aware of a theory of counterfactuals that seems coherent and complete, to me.

    As an example of how this could go wrong: the AI considers threatening the human with a gun, and considers the counterfactual where that human doesn’t press the stop button. If the world is a logical system where the human was guaranteed to hit the stop button in that situation, the absence of the stop button being pressed would be a logical contradiction. If the human both presses and doesn’t press the stop button, then anything is possible by the principle of explosion, and the expected number of achievable paperclips becomes unbounded.

    My real concern is less that the AI would logically prove that it can counterfactually get infinite paperclips, and more that leaning on counterfactual reasoning of this sort is vulnerable to ontological shifts in the mind of the AI, and conditioning on extremely weird things leads to being able to conclude extremely weird results.

  4. ^

     And in MIRI’s defense, it was never promised that indifference would get all the way to this kind of corrigibility.
