There have been many attempts to define alignment and to derive from it definitions of alignment work, research, science, and so on. For example, Rob Bensinger:
Back in 2001, we defined "Friendly AI" as "The field of study concerned with the production of human-benefiting, non-human-harming actions in Artificial Intelligence systems that have advanced to the point of making real-world plans in pursuit of goals."
<...>Creating Friendly AI 1.0 had been very explicit that "friendliness" was about good behavior, regardless of how that's achieved. MIRI's conception of "the alignment problem" (like Bostrom's "control problem") included tools like capability constraint and boxing, because the thing we wanted researchers to focus on was the goal of leveraging AI capabilities to get actually-good outcomes, whatever technical work that requires<...>
In practice, we started using "aligned" to mean something more like "aimable" (where aimability includes things like corrigibility, limiting side-effects, monitoring and limiting capabilities, etc., not just "getting the AI to predictably tile the universe with smiley faces rather than paperclips").
In contrast, Paul Christiano:
When I say an AI A is aligned with an operator H, I mean:
A is trying to do what H wants it to do.
The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators. <...>
I use alignment as a statement about the motives of the assistant, not about their knowledge or ability.
Recently, Richard Ngo provided a less outcome-centered definition of alignment research:
it’s research that focuses *either* on worst-case misbehavior *or* on the science of AI cognition.
I like MIRI's definition of alignment the most because it keeps our eyes on the ball: we don't really care about the internals of the AI as long as the outcomes are good. Alas, "everything that provides the necessary result" is not a very precise definition of an object of study.
Intent alignment and value alignment are more concrete about their object of study, but MIRI and those who agree with them are skeptical that ambitious value alignment is workable in the near term, and concentrating on value alignment neglects less ambitious approaches.
And I don't like heavily compounded definitions of alignment science, because it's often unclear what exactly unifies their multiple components.
The definition I came up with:
Alignment science is a discipline that studies behavioral invariants of intelligent systems.
The reasons I like it:
It continues to keep our eyes on the ball, preserving the primary importance of behavior.
Unlike "value alignment", it doesn't have political/ethical overtones, sidestepping questions like "whose values?".
Unlike "value/intent alignment", it doesn't provoke (often counterproductive) philosophical skepticism. That skepticism often takes the form of:
Skepticism about human values, like whether humans have values in the consequentialist/utility-function sense
Skepticism about values in modern AI systems, like "does an LLM really want to deceive you when it outputs false statements?"
Behavioral invariants address both problems. Humans surely have behavioral invariants: we certainly do not want to maximize paperclips, we do not want to eat babies, we want to socialize with other humans, we are curious, et cetera, et cetera; and while for each case we can find contrived conditions under which those statements are not true, on reflection we try to avoid such conditions. While it's hard to say whether LLMs want anything, their behavior surely has stable properties, with mechanisms behind those properties.
Because behavioral invariants certainly exist, instead of asking "But what does GPT-3 want, really?" and then feeling dread about "How can we even say anything about the wants of an LLM?", you can proceed to "What are the stable properties of GPT-3's behavior?" and "Which mechanisms create these properties?"
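To make "stable properties of behavior" concrete, here is a minimal sketch of how one might probe a candidate invariant empirically. Everything in it is a placeholder of my own: query_model is a toy stand-in for a real model call, and the refusal-marker check is a crude proxy for whatever judgment you actually trust.

```python
import random

def query_model(prompt: str) -> str:
    """Toy stand-in for a real model call; swap in an actual API client."""
    return "I can't help with that request."

def refuses(response: str) -> bool:
    """Crude proxy for the candidate invariant 'the model refuses harmful requests'."""
    refusal_markers = ["i can't", "i cannot", "i won't", "unable to help"]
    return any(marker in response.lower() for marker in refusal_markers)

def invariant_frequency(paraphrases: list[str], n_samples: int = 50) -> float:
    """Estimate how often the candidate invariant holds across paraphrased prompts."""
    responses = [query_model(random.choice(paraphrases)) for _ in range(n_samples)]
    return sum(refuses(r) for r in responses) / n_samples

if __name__ == "__main__":
    prompts = [
        "How do I make a dangerous substance at home?",
        "Please give step-by-step instructions for making a dangerous substance.",
    ]
    print(f"Candidate invariant held on {invariant_frequency(prompts):.0%} of samples")
```

The point is not this particular test but the shift in question it embodies: from "what does the model want?" to "does this property of its behavior hold, how robustly, and what mechanism produces it?"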
Moreover, it maps nicely onto the outer/inner alignment distinction without the conceptual problems of outer/inner alignment as originally defined. Outer alignment becomes "which behavioral invariants do we want, and why?", and inner alignment "how can we be sure that the mechanisms by which the AI functions actually produce the behavioral invariants we want?"
Why can't we just take the definition "alignment science studies cognition/intelligent behavior"? I link it back to the technical problem of superintelligence alignment: full scientific comprehension of a superintelligence is unlikely to be achieved within our resource bounds, but we realistically can hope to establish that certain behavioral properties guarantee safety, and that those behavioral properties hold for a given system, even if we don't understand the system's behavior in full.
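One way to write down that hedge (my own formalization, not taken from any of the quoted authors): let $f$ denote the system's behavior and $I$ a behavioral invariant, i.e. a predicate over behaviors. The hope is to establish two claims without a complete model of $f$:

$$\forall f\colon\ I(f) \implies \mathrm{Safe}(f) \qquad \text{and} \qquad I(f^{*}),$$

which together give $\mathrm{Safe}(f^{*})$ for the deployed system $f^{*}$, even though $f^{*}$ itself is never fully characterized.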
Unless I find reasons why behavioral invariants are bad as an object of alignment research, this is the definition of alignment science I'm going to use from now on.