Summary: Advanced machine intelligences are the subjects of value alignment theory: human-created agents with enough of the advanced agent properties to (1) be dangerous if mishandled, and (2) be relevant to our larger dilemmas. Some examples of advanced agency are "the AI can learn new domains that humans don't understand", "the AI can act as an economic agent in the real world", "the AI can understand human minds well enough to devise manipulation strategies", "the AI can invent new technologies". An example of a more formal property is "epistemic efficiency" (the AI's estimates are always at least as good as our own estimates).

(For an overview of other agent properties besides the advanced agent properties, see standard agent properties.)

Introduction: 'Advanced' as an informal property, or metasyntactic placeholder.

Advanced agents are the subjects of value alignment theory: machine intelligences powerful enough that the safety paradigms for advanced agents become relevant to them.

We can think of several properties that describe an AI as cognitively powerful in this sense, some neatly crisp and formal, others less mathy. At this stage of our understanding, it doesn't seem wise to try to define a crisp property that is exactly necessary and sufficient for an agent to be one we ought to be doing value alignment theory about.

An example of a relatively crisp property of an advanced AI is epistemic efficiency: an agent appears 'epistemically efficient' to us if we can't predict any directional error in its estimates. E.g., we can't expect a superintelligence to precisely estimate the exact number of hydrogen atoms in the Sun, but it would be very odd if we could predict in advance that the superintelligence would overestimate this number by 10%. It seems very reasonable to expect that sufficiently advanced superintelligences would have this particular property of advanced agency (even human stock markets have this property in the short run for the relative prices of highly liquid assets).
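One rough way to formalize this intuition (a sketch, not a canonical definition; the symbols here are introduced only for illustration): let $X$ be some quantity, let $\hat{X}_{\mathrm{AI}}$ be the agent's estimate of it, and let $\mathbb{E}_{\mathrm{us}}[\cdot]$ be our own expectation given everything we know, including the agent's estimate. Epistemic efficiency relative to us then says, roughly,

$$\mathbb{E}_{\mathrm{us}}\!\left[\, X - \hat{X}_{\mathrm{AI}} \,\middle|\, \hat{X}_{\mathrm{AI}} \,\right] = 0,$$

i.e., even after learning the agent's estimate, we cannot predict whether the truth lies above or below it. This mirrors the short-run efficient-markets property mentioned above: given the current price of a highly liquid asset, we cannot predict the direction of its next move.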

But epistemic efficiency isn't a necessary property for advanced safety to be relevant: we can conceive of scenarios where an AI is not epistemically efficient, and yet we still need to deploy parts of value alignment theory. We can imagine, e.g., a Limited Genie that is extremely good with technological designs, smart enough to invent its own nanotechnology, but has been forbidden to model human minds in deep detail (e.g., to avert programmer manipulation). We can also imagine that an accident in specifying this Genie's goals results in it wiping out humanity via material nanotechnology, without the AI ever having been human-competitive at modeling and predicting human minds. This Genie did not universally meet our criterion of epistemic efficiency (it could be predictably wrong, relative to our own guesses, about human minds), but it was clearly, in some critical sense, cognitively powerful enough to be relevant, and a fitting subject for some parts of value alignment theory.

For this reason, the term 'advanced agent' serves as something of a metasyntactic placeholder in value alignment theory, referring to multiple advanced agent properties, some crisp and some less crisp. Agents with some, many, or all of these properties increasingly become high-impact at astronomical stakes, require the use of an advanced-agent safety paradigm, and make the subtopics of value alignment theory increasingly urgent and relevant. (Obviously, where possible, a subtopic should identify the particular advanced agent properties that seem to most strongly make an agent a relevant subject of that subtopic.)

Advanced agent properties:

Todo

  • generality
    • cross-domain consequentialism
    • learning of non-preprogrammed domains
      • learning of human-unknown facts
    • Turing-complete fact and policy learning
  • dangerous domains
    • human modeling
      • social manipulation
      • realization of programmer deception incentive
      • anticipating human strategic responses
    • rapid infrastructure
  • potential
    • self-improvement
    • suppressed potential
  • epistemic efficiency
  • instrumental efficiency
  • cognitive uncontainability
  • improvement beyond well-tested phase (from any source of improvement)
  • self-modification
    • code inspection
    • code modification
    • consequentialist programming
      • cognitive programming
    • cognitive capability goals (being pursued effectively)
  • speed surpassing human reaction times in some interesting domain
    • socially, organizationally, individually, materially

Todo: write out a set of final dangerous-ability use cases, then link up the cognitive abilities with the potentially dangerous scenarios they create.

Cognitive uncontainability.

On complex tractable problems, especially rich real-world problems, a human will not be able to cognitively 'contain' the space of possibilities searched by an advanced agent; the agent will consider some possibilities (or classes of possibilities) that the human did not think of.

The key premise is the 'richness' of the problem space, i.e., that there is a fitness landscape on which adding more computing power will yield improvements (large or small) relative to the current best solution. Tic-tac-toe is not a rich landscape because it is fully explorable (unless we are considering the real-world problem of "tic-tac-toe against a human player," where the human might be subornable, distractable, etc.). A computationally intractable problem whose fitness landscape looks like a computationally inaccessible peak surrounded by a perfectly flat valley is also not 'rich' in this sense, and an advanced agent might not be able to achieve a relevantly better outcome than a human.
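As a concrete illustration of "fully explorable", here is a minimal Python sketch (the helper names are arbitrary, chosen only for this example) that brute-force enumerates every reachable tic-tac-toe position. It finishes nearly instantly and visits only a few thousand distinct board states, which is the sense in which a human, or a trivial amount of compute, can contain the entire space of possibilities; nothing analogous is feasible for rich real-world problem spaces.

```python
# Illustrative sketch: exhaustively enumerate every reachable tic-tac-toe
# position, showing the game's search space is small enough to "contain".

def moves(board):
    """Indices of empty squares on a 9-character board string."""
    return [i for i, c in enumerate(board) if c == '.']

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals
    for a, b, c in lines:
        if board[a] != '.' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def explore(board='.' * 9, player='X', seen=None):
    """Depth-first enumeration of all positions reachable in play."""
    if seen is None:
        seen = set()
    if board in seen:
        return seen
    seen.add(board)
    if winner(board) is None:          # play stops once someone has won
        for i in moves(board):
            child = board[:i] + player + board[i + 1:]
            explore(child, 'O' if player == 'X' else 'X', seen)
    return seen

if __name__ == '__main__':
    positions = explore()
    # Prints a count in the low thousands -- the whole game, containable.
    print(f"Distinct reachable positions: {len(positions)}")
```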

The 'cognitive uncontainability' term in the definition is meant to imply:

  • Vingean unpredictability.
  • Creativity that goes outside all but the most abstract boxes we imagine (on rich problems).
  • The expectation that we will be surprised by the strategies the superintelligence comes up with, because its best solution will be one we didn't consider.

Strong cognitive uncontainability

Particularly surprising solutions may appear if the superintelligence has acquired domain knowledge we lack. In that case the agent's strategy search may go outside the causal events we know how to model, and the solution might be one that we wouldn't have recognized in advance as a solution.

By definition, a strongly uncontainable agent can conceive of strategies that go through causal domains you can't currently model, and it has options that access those strategies; therefore it may execute high-value solutions to which you would not have assigned high expected value without being told further background facts.

In intuitive terms, this is meant to reflect, e.g., "What would have happened if people in the 10th century had tried to use their understanding of the world and their own thinking abilities to upper-bound the technological capabilities of the 20th century?" Even if they had somehow randomly imagined the exact blueprint of an air conditioner, they couldn't have recognized it as a solution to the room-cooling problem before learning further background facts about pressure, temperature, and electromagnetism.