AI alignment

Written by Eliezer Yudkowsky, et al.

The "alignment problem for advanced agents" or "AI alignment" is the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world.

Both 'advanced agent' and 'good' should be understood as metasyntactic placeholders for complicated ideas still under debate. The term 'alignment' is intended to convey the idea of pointing an AI in a direction--just like, once you build a rocket, it has to be pointed in a particular direction.

"AI alignment theory" is meant as an overarching term to cover the whole research field associated with this problem, including, e.g., the much-debated attempt to estimate how rapidly an AI might gain in capability once it goes over various particular thresholds.

Other terms that have been used to describe this research problem include "robust and beneficial AI" and "Friendly AI". The term "value alignment problem" was coined by Stuart Russell to refer to the primary subproblem of aligning AI preferences with (potentially idealized) human preferences.

Some alternative terms for this general field of study, such as 'control problem', can sound adversarial--as though the rocket is already pointed in a bad direction and you need to wrestle with it. Other terms, like 'AI safety', understate the advocated degree to which alignment ought to be an intrinsic part of building advanced agents. E.g., there isn't a separate theory of "bridge safety" for how to build bridges that don't fall down. Pointing the agent in a particular direction ought to be seen as part of the standard problem of building an advanced machine agent. The problem does not divide into "building an advanced AI" and then separately "somehow causing that AI to produce good outcomes"; the problem is "getting good outcomes via building a cognitive agent that brings about those good outcomes".

A good introductory article or survey paper for this field does not presently exist. If you have no idea what this problem is about, consider reading Nick Bostrom's popular book Superintelligence.

You can explore this Arbital domain further from this page. See also the List of Value Alignment Topics on Arbital, although this is not up-to-date.