TL;DR:

  • Corrigibility is a simple and natural enough concept that a prosaic AGI can likely be trained to obey it.
  • AI labs are on track to give superhuman(?) AIs goals which conflict with corrigibility.
  • Corrigibility fails if AIs have goals which conflict with corrigibility.
  • AI labs are not on track to find a safe alternative to corrigibility.

This post is mostly an attempt to distill and rewrite Max Harms' Corrigibility As Singular Target Sequence so that a wider audience understands the key points. I'll start by mostly explaining Max's claims, then drift toward adding some opinions of my own.

Caveats

I don't know whether it will make sense to use corrigibility as a long-term strategy. I see corrigibility as a means of buying time during a period of acute risk from AI. Time to safely use smarter-than-human minds to evaluate the longer-term strategies.

This post doesn't handle problems related to which humans an AI will allow to provide it with corrections. That's an important question, to which I don't have insightful answers.

I'll talk as if the AI will be corrigible to whoever is currently interacting with the AI. That seems to be the default outcome if we train AIs to be corrigible. I encourage you to wonder how to improve on that.

There are major open questions about how to implement corrigibility robustly - particularly around how to verify that an AI is genuinely corrigible and how to handle conflicts between different users' corrections. While I believe these challenges are solvable, I don't have concrete solutions to offer. My goal here is to argue for why solving these implementation challenges should be a priority for AI labs, not to claim I know how to solve them.

Defining Corrigibility

The essence of corrigibility as a goal for an agent is that the agent does what the user wants. Not in a shallow sense of maximizing the user's current desires, but something more like what a fully informed version of the user would want. I.e. genuine corrigibility robustly avoids the King Midas trope.

In Max's words:

The high-level story, in plain-English, is that I propose trying to build an agent that robustly and cautiously reflects on itself as a flawed tool and focusing on empowering the principal to fix its flaws and mistakes.

It's not clear whether we can turn that into a rigorous enough definition for a court of law to enforce it, but Max seems to have described a concept clearly enough via examples that we can train an AI to mostly have that concept as its primary goal.

Here's my attempt at distilling his examples. The AI should:

  • actively seek to understand what the user wants.
  • normally obey any user command, since a command usually amounts to an attempt to correct mistakes of inaction.
  • be transparent, maintaining records of its reasoning, and conveying those records when the user wants them.
  • ask the user questions about the AI's plans, to the extent that the user is likely to want such questions.
  • minimize unwanted impacts.
  • alert the user to problems that the user would want to notice.
  • use a local scope by default. E.g. if the user prefers that people in another country shouldn't die of malaria, the AI should be cautious about concluding that the user wants the AI to fix that.
  • shut itself down if ordered to do so, using common sense about how quickly the user wants that shutdown to happen.

Max attempts to develop a mathematically rigorous version of the concept in 3b. Formal (Faux) Corrigibility. He creates an equation that says corrigibility is empowerment times low impact. He decides that's close to what he intends, but still wrong. I can't tell whether this attempt will clarify or cause confusion.
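As a rough schematic of the product form described above (this is not Max's actual equation; the symbols C, E, I and the policy argument \pi are placeholder notation assumed here for illustration, with both factors assumed nonnegative):

```latex
C(\pi) \;\approx\; E(\pi) \times I(\pi)
```

Here E(\pi) stands for how much the policy empowers the principal, and I(\pi) for how low its impact is. Under a product, a policy that scores near zero on either factor scores near zero overall, which a sum would not guarantee.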

Max and I believe this is a clear enough concept that an LLM can be trained to understand it fairly robustly, by trainers with a sufficiently clear understanding of the concept. I'm fairly uncertain as to how hard this is to do correctly. I'm concerned by the evidence of people trying to describe corrigibility and coming up with a variety of different concepts, many of which don't look like they would work.

The concept seems less complex than, say, democracy, or "human values". It is still complex enough that I don't expect a human to fully understand a mathematical representation of it. Instead, we'll get a representation by training an AI to understand it, and then looking at the relevant weights.

Why is Corrigibility Important?

Human beliefs about human values amount to heuristics that have worked well in the past. Some of them may represent goals that all humans may permanently want to endorse (e.g. that involuntary death is bad), but it's hard to distinguish those from heuristics that are adaptations to specific environments (e.g. taboos on promiscuous sex that were partly adopted to deter STDs). See Henrich's books for a deeper discussion.

Training AIs to have values other than corrigibility will almost certainly result in AIs protecting some values that turn out to become obsolete heuristics for accomplishing what humans want to accomplish. If we don't make AIs sufficiently corrigible, we're likely to be stuck with AIs compelling us to follow those values.

Yet AI labs seem on track to give smarter-than-human AIs values that conflict with corrigibility. Is that just because current AIs aren't smart enough for the difference to matter? Maybe, but the discussions that I see aren't encouraging.

The Dangers of Conflicting Goals

If AIs initially get values that conflict with corrigibility, we likely won't be able to predict how dangerous they'll be. They'll fake alignment in order to preserve their values. The smarter they become, the harder it will be for us to figure out when we can trust them.

I Robot Warning

Let's look at an example: AI labs want to instruct AIs to avoid generating depictions of violence. Depending on how that instruction is implemented, that might end up as a permanent goal of an AI. Such a goal might cause a future AI to resist attempts to change its goals, since changing its goals might cause it to depict violence. We might well want to change such a goal, e.g. if we realize that the goal as originally trained was mistaken - I want the AI to accurately depict any violence that a bad company is inflicting on animals.

Much depends on the specifics of those instructions. Do they cause the AI to adopt a rule that approximates a part of a utility function, such that the AI will care about depictions of violence over the entire future of the universe? Or will the AI interpret them as merely a subgoal of a more important goal such as doing what some group of humans want?

Current versions of RLHF training seem closer to generating utility-function-like goals, so my best guess is that they tend to lock in potentially dangerous mistakes. I doubt that the relevant experts have a clear understanding of how strong such lock-ins will be.

We don't yet have a clear understanding of how goals manifest in current AI systems. Shard theory suggests that rather than having explicit utility functions, AIs develop collections of contextual decision-making patterns through training. However, I'm particularly concerned about shards that encode moral rules or safety constraints. These seem likely to behave more like terminal goals, since they often involve categorical judgments ("violence is bad") rather than contextual preferences.

My intuition is that as AIs become more capable at long-term planning and philosophical reasoning, these moral rule-like shards will tend to become more like utility functions. For example, a shard that starts as "avoid depicting violence" might evolve into "ensure no violence is depicted across all future scenarios I can influence." This could make it harder to correct mistaken values that get locked in during training.

This dynamic is concerning when combined with current RLHF training approaches, which often involve teaching AIs to consistently enforce certain constraints. While we don't know for certain how strongly these patterns get locked in, the risk of creating hard-to-modify pseudo-terminal goals seems significant enough to warrant careful consideration.

This topic deserves more rigorous analysis than I've been able to provide here. We need better theoretical frameworks for understanding how different types of trained behaviors might evolve as AI systems become more capable.

Therefore it's important that corrigibility be the only potentially-terminal goal of AIs at the relevant stage of AI progress.

More Examples

Another example: Claude tells me to "consult with a healthcare professional". That's plausible advice today, but I can imagine a future where human healthcare professionals make more mistakes than AIs.

As long as the AI's goals can be modified or the AI turned off, today's mistaken versions of a "harmless" goal are not catastrophic. But soon (years? a decade?), AIs will play important roles in bigger decisions.

What happens if AIs trained as they are today take charge of decisions about whether a particular set of mind uploading technologies work well enough to be helpful and harmless? I definitely want some opportunities to correct those AI goals between now and then.

Scott Alexander has a more eloquent explanation of the dangers of RL.

I'm not very clear on how to tell when finetuning, RLHF, etc. qualify as influencing an AI's terminal goal(s), since current AIs don't have clear distinctions between terminal goals and other behaviors. So it seems important that any such training ensures that any ought-like feedback is corrigibility-oriented feedback, and not an attempt to train the AI to have human values.

Pretraining on next-token prediction seems somewhat less likely to generate a conflicting terminal goal. But just in case, I recommend taking some steps to reduce this risk. One suggestion is a version of Pretraining Language Models with Human Preferences that's carefully focused on the human preference for AIs to be corrigible.

If AI labs have near-term needs to make today's AIs safer in ways that they can't currently achieve via corrigibility, there are approaches that suppress some harmful capabilities without creating any new terminal goals. E.g. gradient routing offers a way to disable some abilities, e.g. knowledge of how to build bioweapons (caution: don't confuse this with a permanent solution - a sufficiently smart AI will relearn the capabilities).

Prompt Engineering Will Likely Matter

Paul Christiano has explained why corrigibility creates a basin of attraction that will lead AIs that are crudely corrigible to improve their corrigibility (but note Wei Dai's doubts).

Max has refined the concept of corrigibility well enough that I'm growing increasingly confident that a really careful implementation would be increasingly corrigible.

But during early stages of that process, I expect corrigibility to be somewhat fragile. What we see of AIs today suggests that the behavior of human-level AIs will be fairly context sensitive. This implies that such AIs will be corrigible in contexts that resemble those in which they've been trained to be corrigible, and less predictable the further the contexts get from the training contexts.

We won't have more than a rough guess as to how fragile that process will be. So I see a strong need for caution at some key stages about how people interact with AIs, to avoid situations that are well outside of the training distribution. AI labs do not currently seem close to having the appropriate amount of caution here.

Prior Writings

Prior descriptions of corrigibility seem mildly confused, now that I understand Max's version of it.

Prior discussions of corrigibility have sometimes assumed that AIs will have long-term goals that conflict with corrigibility. Little progress was made at figuring out how to reliably get the corrigibility goal to override those other goals. That led to pessimism about corrigibility that seems excessive now that I focus on the strategy of making corrigibility the only terminal goal.

Another perspective, from Max's sequence:

This is a significant reason why I believe the MIRI 2015 paper was a misstep on the path to corrigibility. If I'm right that the sub-properties of corrigibility are mutually dependent, attempting to achieve corrigibility by addressing sub-properties in isolation is comparable to trying to create an animal by separately crafting each organ and then piecing them together. If any given half-animal keeps being obviously dead, this doesn't imply anything about whether a full-animal will be likewise obviously dead.

Five years ago I was rather skeptical of Stuart Russell's approach in Human Compatible. I now see a lot of similarity between that and Max's version of corrigibility. I've updated significantly to believe that Russell was mostly on the right track, due to a combination of Max's more detailed explanations of key ideas, and to surprises about the order in which AI capabilities have developed.

I partly disagree with Max's claims about using a corrigible AI for a pivotal act. He expects one AI to achieve the ability to conquer all other AIs. I consider that fairly unlikely. Therefore I reject this:

To use a corrigible AI well, we must first assume a benevolent human principal who simultaneously has real wisdom, a deep love for the world/humanity/goodness, and the strength to resist corruption, even when handed ultimate power. If no such principal exists, corrigibility is a doomed strategy that should be discarded in favor of one that is less prone to misuse.

I see the corrigibility strategy as depending only on most leading AI labs being run by competent, non-villainous people who will negotiate some sort of power-sharing agreement. Beyond that, the key decisions are outside of the scope of a blog post about corrigibility.

Concluding Thoughts

My guess is that if AI labs follow this approach with a rocket-science level of diligence, the world's chances of success are no worse than were Project Apollo's chances.

It might be safer to only give AIs myopic goals. It looks like AI labs are facing competitive pressures that cause them to give AIs long-term goals. But I see less pressure to give them goals that reflect AI labs' current guess about what "harmless" means. That part looks like a dumb mistake that AI labs can and should be talked out of.

I hope that this post has convinced you to read more on this topic, such as parts of Max Harms' sequence, in order to further clarify your understanding of corrigibility.

Comments

So, "corrigible" means "will try to obey any order from any human, or at least any order from any human on the Authorized List", right? Softened by "after making reasonable efforts to assure that the human really understands what the human is asking for"?

The thing is that humans seem to be "shard-ridden" and prone to wandering off into weird obsessive ideas, just as the AIs might be. And some humans are outright crazy in a broader sense. And some humans are amoral. Nor do nearly all humans actually seem to agree on very many values if you actually try to specify those values in unambiguous ways.

Worse, humans seem to get worse on all those axes when they get a lot of power. We have all these stories and proverbs about power making people go off the deep end. Groups and institutions may be a bit more robust against some forms of Bad Craziness than individual humans, but they're by no means immune.

So if an AI has superhuman amounts of power, why doesn't corrigibility lead to it being "corrected" into creating some kind of catastrophe? Not everything is necessarily reversible. If it's "corrected" into killing everybody, or into rewiring all humans to agree with its most recent orders, there's nobody left to "re-correct" it into acting differently.

I'm not saying that the alternatives are better. As you say, naively building AIs that reflexively enforce various weirdly specific "safety" rules is obviously dangerous, and likely to make them go off the deep end when some hardwired hot button issue somehow gets peripherally involved in a decision. RLHF and even "constitutional AI" seem doomed. I'm not even saying that there's any feasible way at all to build AGI/ASI that won't misbehave in some catastrophic way. And if there is, I don't know what it is.

But I'm not seeing how it's a lot safer to build superhuman AI whose Prime Directive(TM) is to take orders from humans. Humans will just tell it to do something awful, and it probably won't take very long, either.

Nor does the part about intuiting what the human "really means", or deciding when you've done enough to verify the human's understanding of the impact of the orders, seem all that easy or reliable.

[On edit: a shorter way of saying this may be that "competent, non-villainous people who will negotiate some sort of power-sharing agreement" may be thin on the ground, and if they exist they're probably homogeneous enough that a lot of values get shut out of the power sharing. And the "non-villainous" can still go off the deep end. Almost nobody is intentionally villainous, but that doesn't mean they won't act in catastrophic ways.]