(For the general concept of an agent, see standard agent properties.)

Introduction: 'Advanced' as an informal property, or metasyntactic placeholder

Advanced agents are the subjects of AI alignment theory: machine intelligences potent enough that (a) the safety paradigms for advanced agents become relevant, and (b) they can be decisive in the big-picture scale of events.

Some examples of properties that might make an agent this powerful:

The AI can learn new domains besides those built into it.

The AI can act as an economic agent in the real world.

The AI can understand human minds well enough to manipulate us.

The AI can devise real-world strategies we didn't foresee in advance.

The AI's performance is strongly superhuman or optimal across all cognitive domains.

Since there are multiple avenues we can imagine for how an AI could start to be this powerful, "advanced agent" doesn't have a neat necessary-and-sufficient definition. Similarly, some of the advanced agent properties are easier to formalize or pseudoformalize than others.

One example of a relatively definable property is cognitive uncontainability within a domain - the agent searches a broad-enough space of options that we can't predict what its best option will look like or how much of the agent's expected utility will be available to it. This kind of uncontainability is impossible in narrow, perfectly known spaces like logical Tic-Tac-Toe, but can start to manifest as early as the domain of Go - AlphaGo played moves that human champions initially found puzzling and unexpected, because the logical Go rules encompass sufficient game complexity that you can start to have "weird" moves. Real-world domains, where a falling leaf (physics and botany) can be nudged by a flying bee (biology) and both are far more complicated than a Go board and without completely-known-to-humans axiomatized rules, would be even richer than Go.
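
As a rough illustration of what "narrow, perfectly known" means for Tic-Tac-Toe, the sketch below walks every legal game from the empty board. The function names and the board representation are illustrative choices, not part of the original text. The point is only that the whole game tree fits comfortably inside a brute-force search, so no move in it can be "weird" in the sense discussed above.

```python
# Exhaustively enumerate every legal Tic-Tac-Toe game.
# Names and representation here are illustrative assumptions.

WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def count_games(board=(" ",) * 9, player="X"):
    """Count complete games (move sequences) reachable from this position."""
    # A game ends as soon as someone wins or the board is full.
    if winner(board) is not None or " " not in board:
        return 1
    total = 0
    for i in range(9):
        if board[i] == " ":
            next_board = board[:i] + (player,) + board[i + 1:]
            total += count_games(next_board, "O" if player == "X" else "X")
    return total


if __name__ == "__main__":
    print(count_games())
```

Nothing comparable is feasible for Go, let alone for real-world domains, which is why the search there can surface options the observer never considered.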

Cognitive uncontainability can potentially happen when an AI searches a different style of solution, not just when an AI searches a strictly larger set of solutions. Even if an AGI is, in some sense, still infrahuman, advanced-safety considerations might start to be relevant if the AGI is searching 'weird' parts of solution space and hence is cognitively uncontainable on the real-world domain. This would already start to bring in considerations like edge instantiations, unforeseen maximums, nearest unblocked strategies, and context-change disasters.

An example of a less crisp advanced-agent property might be "generality" and its correlates: "Can learn and interrelate many new domains rather than needing to be programmed for them", "Can learn subjects unknown to the programmers", or "Can start to learn about human psychology", or "Can learn about and understand the strategic bigger picture."

One reason to keep the term 'advanced' on an informal basis, or even as something of a placeholder, is that in an intuitive sense we want it to mean "AI we need to take seriously" in a way independent of particular architectures or particular accomplishments. To the philosophy undergrad who 'proves' that AI can never be 'truly intelligent' because it is 'merely deterministic and mechanical', one possible reply is, "Look, if it's building a Dyson Sphere, I don't care if you define it as 'intelligent' or not." Similarly, the term 'advanced agent' or 'sufficiently advanced' should be understood in a background context of "Look, if a computer program is doing X, it doesn't matter if we define that as 'intelligent' or 'general' or even as 'agenty', what matters is that it's doing X."

The point is not to generate a philosophically perfect definition of some platonic ideal of Advancement, but rather to think about which cognitive thresholds in AI development could lead to which kinds of real-world power, and to which corresponding alignment issues starting to become relevant (or needing to have already been solved before that point).

Advanced agent properties

A short summary of some properties that might lead into advanced agency.

Human psychological modeling

Sufficiently sophisticated models and predictions of human minds potentially lead to:

Getting sufficiently good at human psychology to realize the humans want/expect a particular kind of behavior, and will modify the AI's preferences or try to stop the AI's growth if the humans realize the AI will not engage in that type of behavior later. This creates an instrumental incentive for programmer deception or cognitive steganography.

Being able to psychologically and socially manipulate humans in general, as a real-world capability.

Being at risk for mindcrime.

Contrast behaviorism.

Cross-domain, real-world consequentialism

Probably requires generality (see below). To grasp a concept like "If I escape from this computer by hacking my RAM accesses to imitate a cellphone signal, I'll be able to secretly escape onto the Internet and have more computing power", an agent needs to grasp the relation between its internal RAM accesses, and a certain kind of cellphone signal, and the fact that there are cellphones out there in the world, and the cellphones are connected to the Internet, and that the Internet has computing resources that will be useful to it, and that the Internet also contains other non-AI agents that will try to stop it from obtaining those resources if the AI does so in a detectable way.

Contrast this with non-primate animals: a bee knows how to make a hive and a beaver knows how to make a dam, but neither can look at the other and figure out how to build a stronger dam with honeycomb structure. Current, 'narrow' AIs are like the bee or the beaver; they can play chess or Go, or even learn a variety of Atari games by being exposed to them with minimal setup, but they can't learn about RAM, cellphones, the Internet, Internet security, or why being run on more computers makes them smarter; and they can't relate all these domains to each other and do strategic reasoning across them.

So compared to a bee or a beaver, one shot at describing the potent 'advanced' property would be cross-domain real-world consequentialism. To get to a desired Z, the AI can mentally chain backwards to modeling W, which causes X, which causes Y, which causes Z; even though W, X, Y, and Z are all in different domains and require different bodies of knowledge to grasp.
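
The backward chaining from Z to W can be sketched as a search over causal links, where each link is usable only if the agent understands the domain it belongs to. This is a minimal sketch under stated assumptions: the link table, the domain labels, and the function name `chain_backwards` are all hypothetical, and real consequentialist reasoning is of course not a lookup in a hand-written table.

```python
from collections import deque

# Hypothetical causal links: (cause, effect, domain of knowledge needed).
CAUSAL_LINKS = [
    ("W", "X", "hardware"),
    ("X", "Y", "radio signals"),
    ("Y", "Z", "networking"),
]


def chain_backwards(goal, start, links, known_domains):
    """Search from the goal back to the start, using only links whose
    domain the agent understands. Returns the plan in forward order as a
    list of (cause, effect, domain) steps, or None if no chain exists."""
    # Index each effect by the causes the agent is able to reason about.
    causes_of = {}
    for cause, effect, domain in links:
        if domain in known_domains:
            causes_of.setdefault(effect, []).append((cause, domain))

    queue = deque([(goal, [])])
    seen = {goal}
    while queue:
        node, plan = queue.popleft()
        if node == start:
            return plan
        for cause, domain in causes_of.get(node, []):
            if cause not in seen:
                seen.add(cause)
                # Prepend so the finished plan reads from start to goal.
                queue.append((cause, [(cause, node, domain)] + plan))
    return None


if __name__ == "__main__":
    general = {"hardware", "radio signals", "networking"}
    narrow = {"networking"}
    print(chain_backwards("Z", "W", CAUSAL_LINKS, general))
    print(chain_backwards("Z", "W", CAUSAL_LINKS, narrow))
```

In this toy setup the agent that knows all three domains recovers the full W, X, Y, Z chain, while the agent that knows only one domain finds no plan at all, which is the bee-or-beaver situation described above.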

Grasping the big picture

Many dangerous-seeming convergent instrumental strategies pass through what we might call a rough understanding of the 'big picture': there's a big environment out there, the programmers have power over the AI, the programmers can modify the AI's utility function, and future attainments of the AI's goals are dependent on the AI's continued existence with its current utility function.

It might be possible to develop a grasp of this bigger picture that is very rough, yet sufficient to motivate instrumental strategies, in advance of being able to model things like cellphones and Internet security. Thus, "roughly grasping the bigger picture" may be worth conceptually distinguishing from "being good at doing consequentialism across real-world things" or "having a detailed grasp on programmer psychology".

Pivotal material capabilities

An AI that can crack the protein structure prediction problem (which seems speed-uppable by human intelligence), invert the model to solve the protein design problem (which may select on strong predictable folds, rather than needing to predict natural folds), and solve engineering problems well enough to bootstrap to molecular nanotechnology already possesses potentially pivotal capabilities, regardless of its other cognitive performance levels.

Other material domains besides nanotechnology might be pivotal. E.g., self-replicating ordinary manufacturing could potentially be pivotal given enough lead time; molecular nanotechnology is distinguished by its small timescale of mechanical operations and by the world containing an infinite stock of perfectly machined spare parts (aka atoms). Any form of cognitive adeptness that can lead up to rapid infrastructure or other ways of quickly gaining a decisive real-world technological advantage would qualify.

Rapid capability gain

If the AI's thought processes and algorithms scale well, and it's running on resources much smaller than those which humans can obtain for it, or the AI has a grasp on Internet security sufficient to obtain its own computing power on a much larger scale, then this potentially implies rapid capability gain and associated context changes. The same applies if the humans programming the AI are pushing forward the efficiency of the algorithms along a relatively rapid curve.

In other words, if an AI is currently being improved swiftly, or if it has improved significantly as more hardware is added and has the potential capacity for orders of magnitude more computing power to be added, then we can potentially expect rapid capability gains in the future. This makes context disasters more likely and is a good reason to start future-proofing the safety properties early on.

Cognitive uncontainability

On complex tractable problems, especially those that involve rich real-world problems, a human will not be able to cognitively 'contain' the space of possibilities searched by an advanced agent; the agent will consider some possibilities (or classes of possibilities) that the human did not think of.

The key premise is the 'richness' of the problem space, i.e., there is a fitness landscape on which adding more computing power will yield improvements (large or small) relative to the current best solution. Tic-tac-toe is not a rich landscape because it is fully explorable (unless we are considering the real-world problem "tic-tac-toe against a human player" who might be subornable, distractable, etc.). A computationally intractable problem whose fitness landscape looks like a computationally inaccessible peak surrounded by a perfectly flat valley is also not 'rich' in this sense, and an advanced agent might not be able to achieve a relevantly better outcome than a human.
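
One hedged way to picture the difference is to give the same blind random search a small and a large evaluation budget on two toy landscapes. Everything below is an assumption made for illustration: the two fitness functions, the peak location, the needle value, and the helper `best_found` are invented here, and random sampling stands in for "adding more computing power".

```python
import random

PEAK = 0.73125                 # arbitrary location of the best rich solution
NEEDLE = 0x9E3779B97F4A7C15    # arbitrary 64-bit "inaccessible peak"


def rich_fitness(x):
    """Smooth landscape: getting closer to PEAK is always a bit better."""
    return 1.0 - abs(x - PEAK)


def needle_fitness(k):
    """Flat valley with a single spike that blind search will not find."""
    return 1.0 if k == NEEDLE else 0.0


def best_found(fitness, sample, budget, seed=0):
    """Best fitness seen after `budget` random evaluations."""
    rng = random.Random(seed)  # same seed: a larger budget extends a smaller one
    return max(fitness(sample(rng)) for _ in range(budget))


def sample_unit(rng):
    return rng.random()        # candidate solution in [0, 1)


def sample_64bit(rng):
    return rng.getrandbits(64)  # candidate among 2**64 possibilities


if __name__ == "__main__":
    for budget in (10, 1000, 100000):
        print(budget,
              best_found(rich_fitness, sample_unit, budget),
              best_found(needle_fitness, sample_64bit, budget))
```

On the rich landscape the best score keeps creeping upward as the budget grows, so a searcher with more compute ends up somewhere the smaller searcher did not reach. On the needle landscape both budgets almost certainly return the same flat value, which matches the claim that an advanced agent might do no better than a human there.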

The 'cognitive uncontainability' term in the definition is meant to imply:

Vingean unpredictability.

Creativity that goes outside all but the most abstract boxes we imagine (on rich problems).

The expectation that we will be surprised by the strategies the superintelligence comes up with because its best solution was one we didn't consider.

Particularly surprising solutions might be yielded if the superintelligence has acquired domain knowledge we lack. In this case the agent's strategy search might go outside causal events we know how to model, and the solution might be one that we wouldn't have recognized in advance as a solution. This is strong cognitive uncontainability.

In intuitive terms, this is meant to reflect, e.g., "What would have happened if the 10th century had tried to use their understanding of the world and their own thinking abilities to upper-bound the technological capabilities of the 20th century?"

(Work in progress)

generality
- cross-domain consequentialism
- learning of non-preprogrammed domains
  - learning of human-unknown facts
- Turing-complete fact and policy learning

dangerous domains
- human modeling
  - social manipulation
  - realization of programmer deception incentive
  - anticipating human strategic responses
- rapid infrastructure

potential
- self-improvement
- suppressed potential

epistemic efficiency

instrumental efficiency

cognitive uncontainability
- operating in a rich domain
- Vingean unpredictability
- strong cognitive uncontainability

improvement beyond well-tested phase (from any source of improvement)

self-modification
- code inspection
- code modification
- consequentialist programming
  - cognitive programming
- cognitive capability goals (being pursued effectively)

speed surpassing human reaction times in some interesting domain
- socially, organizationally, individually, materially

