last updated

Corrigibility is an AI system's capacity to be safely and reliably modified, corrected, or shut down by humans after deployment, even if doing so conflicts with its current objectives.

'corrigible' agent is one that doesn't interfere with what we would intuitively see as attempts to 'correct' the agent, or 'correct' our mistakes in building it; and permits these 'corrections' despite the apparent instrumentally convergent reasoning saying otherwise.

  • If we try to suspend the AI to disk, or shut it down entirely, a corrigible AI will let us do so. This is not something that an AI is automatically incentivized to let us do, since if it is shut down, it will be unable to fulfill what would usually be its goals.
  • If we try to reprogram the AI, a corrigible AI will not resist this change and will allow this modification to go through. If this is not specifically incentivized, an AI might attempt to fool us into believing the utility function was modified successfully, while actually keeping its original utility function as obscured functionality. By default, this deception could be a preferred outcome according to the AI's current preferences.

Corrigibility is also used in a broader sense, something like a helpful agent. Paul Christiano has defined corrigibility as an agent that will help me:

  • Figure out whether I built the right AI and correct any mistakes I made
  • Remain informed about the AI’s behavior and avoid unpleasant surprises
  • Make better decisions and clarify my preferences
  • Acquire resources and remain in effective control of them
  • Ensure that my AI systems continue to do all of these nice things
  • …and so on

See also: