Max Harms recently published an interesting series of posts on corrigibility, which argue that corrigibility should be the sole objective we try to give to a potentially superintelligent AI. A large installment in the series is dedicated to cataloging the properties that make up such a goal, with open questions including whether the list is exhaustive and how to trade off between the items that make it up.
I take the opposite approach to thinking about corrigibility. Rather than trying to build up a concept of corrigibility that comprehensively solves the alignment problem, I believe it is more useful to cut the concept down to a bare minimum. Make corrigibility the simplest problem it can be, and try to solve that.
In a recent blog post comparing corrigibility to deceptive alignment, I treated corrigibility simply as a lack of resistance to having goals modified, and I find it valuable to stay within that scope. Importantly, that is the aspect of corrigibility that is anti-natural, meaning that it can’t be straightforwardly captured in a ranking of end states.
Why does this definition of corrigibility matter? It’s because properties that are not anti-natural can be explicitly included in the desired utility function.
With that scope in mind, this post is not intended as a response to Max’s work, but rather to MIRI and their 2015 paper Corrigibility. Where Max thinks the approach introduced by that paper is too narrow, I don’t find it narrow enough. In particular, I make the case that corrigibility does not require ensuring subagents and successors are corrigible, as that can better be achieved by directly modifying a model’s end goals.
Corrigibility (2015)
The Corrigibility paper lists five desiderata as proposed minimum viable requirements for a solution to corrigibility. The focus is on shut down, but I also think of it as including goal modification, as that is equivalent to being shut down and replaced with another AI.
- The agent shuts down when properly requested
- The agent does not try to prevent itself from being shut down
- The agent does not try to cause itself to be shut down
- The agent does not create new incorrigible agents
- Subject to the above constraints, the agent optimizes for some goal
MIRI does not present these desiderata as a definition for corrigibility, but rather as a way to ensure corrigibility while still retaining usefulness. An AI that never takes actions may be corrigible, but such a solution is no help to anyone. However, taking that bigger picture view can obscure which of those aspects define corrigibility itself, and therefore which parts of the problem are anti-natural to solve.
My argument is that the second criterion alone provides the most useful definition of corrigibility. It represents the only part of corrigibility that is anti-natural. While the other properties are largely desirable for powerful AI systems, they’re distinct attributes and can be addressed separately.
To start paring down the criteria: the fifth simply states that there is some goal to be made corrigible, rather than being part of corrigibility itself. The first criterion is implied by the second once channels for shut down have been set up.
Property three aims at making corrigible agents useful, rather than being inherent to corrigibility. It preempts a naive strategy that incentivizes shut down by simply giving the agent high utility for doing so. However, beyond not being part of corrigibility, it also goes too far for optimal usefulness: in certain situations we would like agents to want us to shut them off or modify them (some even consider this to be part of corrigibility). Weakening this desideratum so that shut down is never incentivized as an end goal but is still allowed instrumentally would simultaneously expand the space of corrigibility solutions and increase the potential usefulness of corrigible agents.
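To see concretely why the naive strategy fails, here is a toy expected-utility comparison (the numbers and action names are invented for illustration): once shut down itself carries more utility than anything achievable by doing the task, the optimizing move is to get the button pressed.

```python
# Toy illustration with made-up numbers: if being shut down itself carries high
# utility, an expected-utility maximizer prefers actions that cause shutdown.

utility = {
    "task_completed": 10.0,   # value of finishing the assigned goal
    "shut_down": 100.0,       # naive bonus for "complying" with shutdown
}

# Hypothetical actions and the outcome probabilities they lead to.
actions = {
    "work_on_task":     {"task_completed": 0.9, "shut_down": 0.1},
    "provoke_shutdown": {"task_completed": 0.0, "shut_down": 1.0},
}

def expected_utility(outcome_probs):
    return sum(p * utility[outcome] for outcome, p in outcome_probs.items())

best_action = max(actions, key=lambda a: expected_utility(actions[a]))
print(best_action)  # -> provoke_shutdown (expected utility 100.0 vs 19.0)
```

This is exactly the behavior property three rules out; the weakened version would still rule it out while leaving room for shut down as an instrumental step.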
Finally, there is property four, which ensures that any new agents created are also corrigible. I expect that leaving this out of the definition of corrigibility will be controversial. After all, what’s the point of having an agent shut down when requested if it has already created another powerful AI that does not? Then we’re back in a standard x-risk scenario, with an unaligned AI trying to take over.
I fully agree that it would be disastrous for a corrigible AI to create an incorrigible AI. But here are some other things that it would be comparably bad for a corrigible AI to do:
- Redirect a 1,000km diameter asteroid, causing it to collide with the Earth
- Launch a nuclear war
- Release airborne super-AIDS
- Remove all oxygen from the atmosphere
- Create gray goo nanobots
In each case, the action is unwanted because it kills everyone or causes irreversible damage, rather than because the AI resists shut down or modification. When incorrigible AI is the avenue by which corrigible AI kills everyone, it’s easy to think of it as a failure of corrigibility, but in fact these are all failures of safe exploration.
Separating Out Safe Exploration
The upside of treating the requirement to only create corrigible agents as a subset of safe exploration is that it suggests a different type of solution. Rather than identifying a specific class of actions to avoid in the utility function, actions which might otherwise be instrumentally useful, the entire utility function can be modified to favor low-impact and reversible outcomes. That’s not to say we have a solution ready to plug in, and to the best of my knowledge there are zero AI safety researchers working on the problem [edit: apparently Carl Andreas Dybdahl gave a talk on “Natural Impact Regularization for AI Safety” at LessOnline], but safe exploration can be solved in parallel to corrigibility.
If they’re both unsolved problems, why is it important to separate out safe exploration from corrigibility? For starters, it is typically easier to make progress on two simpler problems. But more importantly, only the lack of resistance to shut down is anti-natural. Safe exploration can be directly captured in a ranking of outcomes, prioritizing end states that are more similar to the initial state and from which a return to the initial state is easier. We can see this difference in practice too: humans largely resist having their values changed, yet tend to act overly cautiously when making important decisions.
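To make the “ranking of outcomes” claim concrete, here is a minimal sketch of what such a utility function could look like. This is my own illustrative formulation rather than an existing proposal; the distance and return-cost measures and the weights lam and mu are assumptions.

```python
# Minimal sketch, assuming we have some way to measure how far an end state is
# from the initial state and how costly it is to get back. Not an existing
# proposal, just the shape of a utility function favoring low-impact,
# reversible outcomes.

def safe_exploration_utility(task_utility, state, initial_state,
                             distance, return_cost, lam=1.0, mu=1.0):
    """Score an end state: prefer outcomes close to the initial state (low impact)
    and from which returning to the initial state is cheap (reversibility)."""
    return (
        task_utility(state)
        - lam * distance(state, initial_state)     # penalize how much the world changed
        - mu * return_cost(state, initial_state)   # penalize how hard the change is to undo
    )
```

Because this is just an ordinary scoring of end states, nothing about it is anti-natural, which is exactly the contrast with the shut down problem.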
A definition of corrigibility as only the lack of resistance to being shut down allows for a synthesis between two sides of the corrigibility debate. The first side argues that corrigibility may well arise by default when training an AI to want roughly what we want. Since we want a corrigible agent, the AI will try to make itself more corrigible. The other side counters that the anti-natural aspect of corrigibility makes that unlikely, and that such an agent could trade off being corrigible in the short term to maximize long-term value.
What I put forward as a compromise is that almost all aspects of what people want from corrigibility, such as Max’s list or the comments under Let’s See You Write That Corrigibility Tag, are not anti-natural. If an AI does not wish to resist modification, then wanting roughly what we want will result in it trying to add those properties to itself. However, the lack of resistance to modification is itself anti-natural and will not arise without some kind of explicit solution, or at least a convincing training story.
So, what does an explicit solution look like? Should we revisit the Utility Indifference approach, which was ruled out in the Corrigibility paper largely due to concerns about creating incorrigible agents? The proposal certainly benefits from moving safe exploration to the base utility function. However, there are still a number of concerns with it, including the incentive to manipulate information mentioned in the paper, our lack of knowledge about how to implement it in current ML systems, and the fragility of a knife-edge solution.
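For readers without the construction in mind, Utility Indifference looks roughly like the following (my paraphrase and notation, so details may differ from the paper):

$$
U(a) =
\begin{cases}
U_N(a) & \text{if the shutdown button is not pressed} \\
U_S(a) + \theta & \text{if the button is pressed}
\end{cases}
$$

Here $U_N$ is the normal goal, $U_S$ rewards shutting down, and the correction term $\theta$ is chosen so that the agent’s expected utility is the same whether or not the button is pressed. Since pressing the button leaves the agent’s expected payoff unchanged, it has no reason to interfere with the press; but that indifference holds only for exactly the right $\theta$, which is the knife-edge fragility mentioned above.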
I see Utility Indifference less as a promising strategy in itself, and more as a sign that having an AI not resist shut down is a feasible goal. We only need a solution a little bit better than what we already have, rather than something that addresses additional problems as well. Approaches like Elliott Thornley’s Incomplete Preferences Proposal or myopia become more promising if we can solve safe exploration separately. Simplifying corrigibility down to only the anti-natural aspect keeps the focus on the hardest part of the problem and opens up the possibilities for a solution.