
Directing, vs. limiting, vs. opposing

Written by Eliezer Yudkowsky

'Aligning' versus 'limiting' versus 'opposing' is a proposed conceptual distinction between three ways of getting good outcomes and avoiding bad outcomes when running a sufficiently advanced Artificial General Intelligence:

  • Alignment means the AGI wants to do the right thing in a domain;
  • Limitation means having the AGI think or act only in safe domains;
  • Opposition means trying to prevent the AGI from acting successfully, on the assumption that it would act, and act wrongly, if given the power to do so.

For example:

  • A successfully aligned AI, given full Internet access, will do good things rather than bad things using Internet access;
  • A limited AI, suddenly given an Internet feed, will not do anything with that Internet access, because its programmers haven't whitelisted this new domain as okay to think about;
  • Opposition is airgapping the AI from the Internet and then putting the AI's processors inside a Faraday cage, in the hope that even if the AI wants to get to the Internet, the AI won't be able to e.g. produce GSM cellphone signals by cleverly modulating its memory accesses.

Not running adversarial searches

Under the 'agents as searches' view, we can see an agent as embodying computations that search out actions within a domain. The non-adversarial principle then says that we should not be running a search that is looking for ways to hurt us.

We can achieve this desideratum in one of two ways:

  • The AI is sufficiently nice / value-aligned that it's okay for the AI to think about this domain / run searches here; whatever the AI is searching for, we don't mind if that search succeeds.
  • The AI is not aligned in this domain and therefore should not be running searches here--we haven't finished the alignment work for this domain, so this computation shouldn't be running at all.

If both of these measures fail, then as a fallback second line of defense, we could try to make that search not find anything.
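
To make the two acceptable states and the fallback concrete, here is a minimal Python sketch of the decision rule only, not of any real AGI design; every name in it (Domain, Plan, run_search, the harm filter) is a hypothetical illustration.

```python
# Toy illustration of the non-adversarial principle's decision rule.
# Nothing here is a real safety mechanism; all names are made up for the sketch.

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Domain:
    name: str
    alignment_verified: bool  # True = we don't mind searches in this domain succeeding


@dataclass
class Plan:
    description: str
    passes_harm_filter: bool  # stand-in for whatever harm check we trust


def run_search(domain: Domain, search: Callable[[], Plan]) -> Optional[Plan]:
    # State 2 (limitation): the domain is not verified, so the search never runs.
    if not domain.alignment_verified:
        return None
    # State 1 (alignment): the domain is verified, so it's okay for the search to succeed.
    plan = search()
    # Fallback second line of defense: if an unwanted plan somehow comes back,
    # make the search effectively "not find anything" rather than acting on it.
    if not plan.passes_harm_filter:
        return None
    return plan
```

The only point of the sketch is the ordering of the checks: limitation is applied before any search runs at all, and the harm filter is a fallback rather than the measure being relied upon.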

For example: The non-adversarial principle implies that if you build a shutdown button for the AI that suspends the AI to disk when pressed, you must also ensure:

  • That the AI wants there to be a shutdown button;
  • That the AI wants the state of this shutdown switch to reflect the dynamic decisions of the human operators;
  • That the AI does not want to influence the operators to press the switch or not press it;
  • That the AI does not want to cause anything besides an orderly suspend-to-disk to happen after this switch is pressed.

Or, alternatively:

  • That the AI is simply not thinking about the shutdown button, or the domain of its own suspension, at all, so that no searches run there.

The first set of measures would be 'alignment'--if those measures work, it's okay for the AI to think here because we think those thoughts are pointed in the right direction. The second approach would be limitation--we don't think the AI's computations are aligned inside this domain yet, so until then, we shouldn't be running computations there.

As a final, strictly fallback line of defense, you could put a big red lever in the computing center that shuts off all the electricity, even if the AI is failing to suspend to disk. (But you're not relying on this working; it's just there as a maybe-it-might-work measure that could buy you a little more probable safety even if the primary lines of defense have failed.)
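
As a toy sketch of that fallback ordering only, and assuming hypothetical stub functions suspend_to_disk and cut_power, the switch handler might be wired up roughly as below; none of this touches the substantive problem of making the AI want the orderly suspend to happen.

```python
# Toy sketch of the fallback ordering for the shutdown switch. suspend_to_disk
# and cut_power are hypothetical stubs, and nothing here addresses the real
# problem of making the AI *want* the orderly suspend to occur.

import logging

logger = logging.getLogger(__name__)


def suspend_to_disk() -> bool:
    """Primary path: orderly suspend of the AI's state to disk (stub)."""
    return True


def cut_power() -> None:
    """Strictly-fallback 'big red lever': hard power cutoff (stub)."""


def on_shutdown_switch_pressed() -> None:
    try:
        if suspend_to_disk():
            logger.info("Orderly suspend completed.")
            return
    except Exception:
        logger.exception("Orderly suspend raised an error.")
    # Not relied upon -- just a maybe-it-might-work measure in case the
    # primary lines of defense have already failed.
    logger.warning("Suspend failed; pulling the big red lever.")
    cut_power()
```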

Relation to other non-adversarial ideas

The alignment/limitation/opposition distinction can help state other ideas from the AI safety mindset. For example:

The principle that 'niceness is the first line of defense' can be rephrased as follows: When designing an AGI, we should imagine that all 'oppositional' measures are absent or have failed, and think only about 'alignment' and 'limitation'. Any oppositional measures are then added on top of that, just in case.

Similarly, the Omnipotence test for AI safety says that when thinking through our primary design for alignment, we should think as if the AGI will just get Internet access on some random Tuesday. That is, we should design an AGI that is limited by not wanting to act in newly opened domains without some deliberate programmer action, rather than relying on the AI being unable to reach the Internet until we've finished aligning it--even if we do plan to align it sufficiently later.
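
Continuing the toy sketches above, the 'random Tuesday' posture amounts to a default-closed rule over domains: a newly available capability gets no actions until a programmer deliberately opens it. The whitelist and names below are hypothetical, and in a real design the work lies in making the AI want to respect the rule, not in wrapping its actuators.

```python
# Toy default-closed rule for newly opened domains. The domain names and the
# whitelist mechanism are hypothetical; this is a posture, not a mechanism
# that would by itself constrain a sufficiently advanced AI.

from typing import Callable, Optional

APPROVED_DOMAINS = {"sandboxed_chess", "internal_unit_tests"}


def act_in_domain(domain: str, propose_action: Callable[[], str]) -> Optional[str]:
    if domain not in APPROVED_DOMAINS:
        # A surprise Internet feed on some random Tuesday lands here:
        # no action is taken until programmers explicitly approve the domain.
        return None
    return propose_action()
```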