Why do we care about agency for alignment?

Chris_Leong

Many people believe that understanding "agency" is crucial for alignment, but as far as I know, there isn't a canonical list of reasons why we care about agency. Please describe any reasons why we might care about the concept of agency for understanding alignment below. If you have multiple reasons, please list them in separate answers below.

Please also try to be as specific as possible about what our goal is in the scenario. For example:

We want to know what an agent is so that we can determine whether or not a given AI is a dangerous agent

Whilst useful, this isn't quite as good as:

We have an AI which may or may not have goals aligned with us and we want to know what will happen if these goals aren't aligned. In particular, we are worried that the AI may develop instrumental incentives to seek power and we want to use interpretability tools to help us figure out how worried we should be for a particular system.

We can imagine a scale of such systems according to how they behave in novel situations:
On the low end of this scale, we have a system that only follows extremely broad heuristics that it learned during training
On the high end of this scale, we have a system that uses general reasoning capabilities to discover new and innovative strategies even if it has never used anything like this strategy before, or seen it in the training data
We can think of such a scale as a concretisation of agency. This scale provides an indication of both:
How dangerous a system is likely to be: though a system with excellent heuristics and very little agency could be dangerous as well
What kind of precautions we might want to take: for example, how much can we rely on our off-switch to disable the agent

It's plausible that interpretability tools might be able to give us some idea of where an agent is on this scale just by looking at the weights. So having a clear definition of this scale could help by clarifying what we are looking for.

In a few days, I'll add any use cases I'm aware of myself that either haven't been covered or that I don't think have been adequately explained by different answers.

Agency may be a convergent property of most AI systems (or at least, of many systems people are likely to try to build), once those systems reach a certain capability level. The simplest and most useful way to predict the behavior of such systems may therefore be to model them as agents.

Perhaps we can avoid the problems posed by agency by building only tool AI. In that case, we probably still need a deep understanding of agency to make sure we avoid building an agent by accident. Instrumental convergence may imply that all sufficiently powerful AI systems start looking like agents eventually, past a certain point. Though, when a particular system is best modeled as an agent may depend on the particulars of that system, and we may want to push that point out as far as possible.

Boiling this down to a single specific reason about why we should care about agency: the concept of agency is likely to be key for creating simple, predictively accurate models of many kinds of powerful AI systems, regardless of whether the builders of those systems:

deeply understand the concepts of agency (or alignment) themselves
are deliberately trying to build an agent, or deliberately trying to avoid that, or just trying to build the most powerful system as fast as possible, without explicitly trying to avoid or target agency at all. (We seem to be in a world in which different people are trying all three of these things simultaneously.)

A few arguments or stubs of arguments for why the bolded claim is correct and important:

Agency is already a useful model of humans and human behavior in many situations.
Agency is already a useful model of some current AI systems: MuZero, Dreamer, and Stockfish, in the domains of their respective game worlds. It might soon be a useful model of constructions like Auto-GPT, in the domain of the real world.
The hypothesis that agency is instrumentally convergent means that it will be important in understanding all AI systems above a certain capability level.

Is there a fixed, concrete, definition of "convergent property" or "instrumentally convergent" that most folks can agree on?

From what I see it's more loosely and vaguely defined than "agency" itself, so it's not really dispelling the confusion anyone may have.

Max H (3y ago, 2 karma)

I don't know if there's a standard definition or reference for instrumental convergence other than the LW tag, but convergence in general is a pretty well-known phenomenon. For example, many biological mechanisms which evolved independently end up looking remarkably similar, because that just happens to be the locally-optimal way of doing things, if you're in the design space of iterative mutation of DNA. Similarly, in all sorts of engineering fields, methods or tools or mechanisms are often re-invented independently, but end up converging on very functionally or even visually similar designs, because they are trying to accomplish or optimize for roughly the same thing, and there are only so many ways of doing so optimally, given enough constraints.

Instrumental convergence in the context of agency and AI is just that principle applied to strategic thinking and mind design specifically. In that context, it's more of a hypothesis, since we don't actually have more than one example of a human-level intelligence being developed. But even if mind designs don't converge on some kind of optimal form, artificial minds could still be efficient w.r.t. humans, which would have many of the same implications.

This answer is a little bit confusing to me. You say that "agency" may be an important concept even if we don't have a deep understanding of what it entails. But how about a simple understanding?

I thought that when people spoke about "agency" and AI, they meant something like "a capacity to set their own final goals", but then you claim that Stockfish could best be understood by using the concept of "agency". I don't see how.

I myself kind of agree with the sentiment in the original post that "agency" is a superfluous concept, but I want to understand the opposite view.

Chris_Leong (3y ago, 2 karma)

To clarify, I wasn't arguing it was a superfluous concept, just trying to collect all the possible reasons together and nudge people towards making the use cases/applications more precise.

Max H (3y ago, 1 karma)

I didn't claim that Stockfish was best understood by using the concept of agency. I claimed agency was one useful model. Consider the first diagram in this post on embedded agency: you can regard Stockfish as Alexei playing a chess video game. By modeling Stockfish as an agent in this situation, you can abstract its internal workings somewhat and predict that it will beat you at chess, even if you can't predict the exact moves it makes, or why it makes those moves.

> I thought that when people spoke about "agency" and AI, they meant something like "a capacity to set their own final goals"

I think this is not what is meant by agency. Do I have the capacity to set (i.e. change) my own final goals? Maybe so, but I sure don't want to. My final goals are probably complex and I may be uncertain about what they are, but one day I might be able to extrapolate them into something concrete and coherent.

I would say that an agent is something which is usefully modeled as having any kind of goals at all, regardless of what those goals are. As the ability of a system to achieve those goals increases, the model as an agent gets more useful and predictive relative to other models based on (for example) the system's internal workings. If I want to know whether Kasparov is going to beat me at chess, a detailed model of neuroscience and a scan of his brain is less useful than modeling him as an agent whose goal is to win the game. Similarly, for MuZero, a mechanistic understanding of the artificial neural networks it is composed of is probably less useful than modeling it as a capable Go agent, when predicting what moves it will make.

Summary: John describes the problems of inner and outer alignment. He also describes the concept of True Names - mathematical formalisations that hold up under optimisation pressure. He suggests that having a "True Name" for optimizers would be useful if we wanted to inspect a trained system for an inner optimiser and not risk missing something.

He further suggests that the concept of agency breaks down into lower-level components like "optimisation", "goals", "world models", etc. It would be possible to make further arguments about how these lower-level concepts are important for AI safety.

Also important to note:

The phenomenon you call by names like "goals" or "agency" is one possible shadow of the deep structure of optimization - roughly, preimaging outcomes onto choices by reversing a complicated transformation.
- @esyudkowsky

I.e. if we were to pin down something we actually care about, that'd be "a system exhibiting consequentialism", because those are the kind of systems that will end up shaping our lightcone and more. Consequentialism is convergent in an optimization process, i.e. the "deep structure of optimization". Terms like "goals" or "agency" are shadows of consequentialism, finite approximations of this deep structure.

And by virtue of being finite approximations (e.g. they're embedded), these "agents" have a bunch of convergent properties that make it easy for us to reason about the "deep structure" itself, e.g. modularity, having a world-model, etc. (check johnswentworth's comment).

Edit: Also the following quote

it is relatively unimportant to understand agency for its own sake or intelligence for its own sake or optimization for its own sake. Instead we should remember that these are frames for understanding these patterns that exert influence over the future

Could you please clarify in which sense you use the word "agency"?

One sense is the technicality of setting oneself problems and choosing things to pursue instead of sitting idle waiting for commands. For example, the difference between AutoGPT and ChatGPT. (Except, AutoGPT doesn't choose the highest-level problems for itself, but it could trivially be engineered to do so as well.)

I think that the presence or absence of this kind of agency is not relevant to technical alignment:

First, we also want to protect against AI misuse (by humans), and having a "non-agentic" but extremely smart AI doesn't solve that problem. If you try to extrapolate GPT capabilities, it's obvious that a superintelligent GPT-n (or other systems surrounding it, e.g., content filters, but without loss of generality we can consider it a single AI system) should be extremely well aligned with humanity, so that it itself chooses to solve certain problems and refuses to solve others, and also chooses the specific way of solving this or that problem, even though these problems were given to it by the users.
Second, even "non-agentic" systems like ChatGPT tend to create effectively agentic entities on cultural-techno-evolutionary timescales: see "Why Simulator AIs want to be Active Inference AIs".

The second meaning of "agency" is synonymous with "resourcefulness", "intelligence" (in some sense), the ability to overcome obstacles and not back down in the face of challenges. I don't see how this meaning of "agency" is directly relevant to the alignment question.

The third possible meaning of "agency" is having some intrinsic opinion, volition, tendencies, values, or emotions. The opposite is being completely "neutral", in some sense. I think complete neutrality just doesn't physically exist. Every intelligence is biased, and things like inductive biases are not categorically distinct from moral biases, they actually lie on a continuum.

For this notion of agency, of course, we actually care about which exact opinions, tendencies, biases, and values AIs have: that's the essence of alignment. But I suspect you didn't have this meaning in mind.

"Could you please clarify in which sense you use the word "agency"?" - I guess I'm pretty confused by hearing you ask the question because I guess my whole point with this question was to clarify what is meant by "agency".

It's a bit like if I asked "What do we mean by subjective and objective?" and you asked "Could you please clarify 'subjective' and 'objective'?" that would seem rather strange to me.

The first sense seems relevant to alignment in that the kinds of worries we might have and the kinds of things that would reassure us regarding these worries ...

Let me ask a different question: why should I care about alignment for systems that are not agentic in the context of x-risk?

Could you clarify why you're asking this question? Is it because you're suggesting that defining "agency" would give us a way to restrict the kinds of systems that we need to consider?

I see some boundary cases: for example, I could theoretically imagine gradient descent creating something that mostly just uses heuristics and isn't a full agent, but which nonetheless poses an x-risk. That said, this seems somewhat unlikely to me.

AI misuse
(Perhaps unavoidable) agent-like entities in the space of technology, culture, and memes, as I elaborated a bit more in my answer. These are egregores, religions, evolutionary lineages of technology, self-replicating AutoGPT-like viral entities, self-replicating prompts, etc.

To protect from both, even "non-agentic" AI must be aligned.

I spent a few hours today just starting to answer this question, and only got as far as walking through what this "agency" thing is which we're trying to understand. Since people have already asked for clarification on that topic, I'll post it here as a standalone mini-essay. Things which this comment does not address, which I may or may not get around to writing later:

Really there should be a few more multi-agent phenomena at the end - think markets, organizations/firms, Schelling points, governance, that sort of thing. I ran out of steam before getting to those.
What might "understanding" each of these phenomena look like?
How might it all fit together into a coherent whole picture? (Though hopefully the parts below are enough to start to see the unifying structure.)
How would better understanding of each of these phenomena individually yield incremental progress on various alignment subgoals? (Basically any of them would be incrementally useful for multiple alignment approaches/subproblems.)
How would a unified understanding of all these pieces address the hard parts of alignment? In particular, how could they rule out large classes of potential unknown unknowns?

What "agenty" phenomena are we talking about?

Prerequisite: Boundaries

So there's this thing where everything interacts with everything else, but mostly not directly. A sled's motion down a hill is influenced, to varying degrees, by motions of far-off stars or by magma flows in the earth's crust or by the fashion choices of teenagers at a nearby high school. But those effects are some combination of (a) small, and (b) mediated by things which interact with the sled more directly, like its weight or the coefficient of friction between sled and hill. This phenomenon - most interactions being mediated by a few factors - is a necessary precondition to science working at all in our high-dimensional world. Otherwise, reproducible outcomes would require that we control way too many things to ever realistically achieve reproducibility.

Building on that, there's also this thing where a biological cell interacts with its surroundings mostly via specific sensors/channels on the membrane, despite all sorts of complex stuff happening inside. Or a deposit bank interacts with its customers mostly via fancy versions of "you put money in and take money out, the bank says 'no' if the amount you try to take out is greater than the amount you put in", despite lots of complex stuff going on behind the scenes at the bank to make it work.

These are "boundaries": some relatively-large/complex systems interact with the rest of the world only through relatively-narrow/simple information-channels. We need some notion of boundaries, and of interactions flowing across those boundaries, in order to carve out some subsystem to call an "agent".

The Basics: Agency

So there's this thing where a thermostat senses the initial temperature of a room, and then does different things (like e.g. activating heating or cooling) depending on the initial temperature, in such a way that the final temperature consistently ends up roughly the same, for many different initial temperatures?

Or a bacterium senses how sugar concentrations change as it swims along, and then does different things (like e.g. continuing forward or tumbling around to face a random new direction) depending on how the sugar concentration changes, in such a way that it ends up in an area with lots of sugar, for many different initial positions or sugar concentration landscapes?

Or most animals will look and listen and smell around themselves, and then do different things (like e.g. run or fly or swim different directions, or bite, or stay very still, or...) depending on what they see/hear/smell, in such a way that they end up eating food and not being eaten themselves (mostly, over short time horizons) and having children, for many different configurations of the trees and rocks and plants and animals around them?

That's the most basic form of "agency": taking different actions depending on observations, in order to achieve a consistent outcome (or class of outcomes).
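As a rough illustration of that definition, here is a minimal sketch of the thermostat case. The room dynamics, the 20-degree target, the deadband, and the function names are all assumptions made up for the example; the only point is that the action depends on the observation, and the final temperature ends up nearly the same across very different starting temperatures.

```python
def thermostat_action(observed_temp, target=20.0, deadband=0.5):
    """Choose an action from an observation (hypothetical toy controller)."""
    if observed_temp < target - deadband:
        return "heat"
    if observed_temp > target + deadband:
        return "cool"
    return "idle"


def run_room(initial_temp, steps=200, step_size=0.25):
    """Simulate a toy room where each heat/cool action shifts the temperature."""
    temp = initial_temp
    for _ in range(steps):
        action = thermostat_action(temp)
        if action == "heat":
            temp += step_size
        elif action == "cool":
            temp -= step_size
    return temp


# Very different initial conditions, one consistent class of outcomes.
final_temps = {start: run_room(start) for start in (-5.0, 12.0, 20.0, 35.0)}
```

Every entry of `final_temps` lands within the deadband around the target, which is the "consistent outcome" part; a system that took the same action regardless of the observed temperature would not manage that.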

The next few phenomena follow from that basic idea: they either allow a system to achieve a consistent outcome more robustly (i.e. across more initial conditions), or to achieve a more specific consistent outcome, or they're the "easiest" way (in a statistical sense) to achieve a consistent outcome across many different conditions.

Modules

So there's this thing where an animal or plant develops specialized organs, which interact with the rest of the organism only in relatively simple, specialized ways. Or an organization has many departments, which specialize in particular roles and present a simplified API to the rest of the company.

These are "modules": subsystems with boundaries of their own, interacting with the rest of the system through relatively-limited/simple information channels.

Factorization

So there's this thing where a human wants milk for their coffee and doesn't have any, and they break this problem up into subproblems. One subproblem is to drive to the store. Another is to find the milk within the store. A third is to make enough money to pay for the milk. These subproblems are mostly-independent: the human mostly doesn't need to think about the details of finding milk within the store in order to drive to the store, nor do they need to think about driving to the store in order to make money.

Or, an organism has organs/organelles with specialized roles which interact in relatively-limited/simple ways. In order for those organs to solve the organism's problems, they must each handle subproblems which are mostly-independent of the others (else the organs would need to pass a lot more information between themselves to solve the organism's top-level problems.) Same with departments of a company.

This is "factorization", a dual in some sense to modules: when faced with a problem, break it up into subproblems which can be solved mostly-independently.

Coherence

So there's this thing where a biological cell needs a handful of different metabolic resources - most obviously energy (i.e. ATP), but also amino acids, membrane lipids, etc. And often cells can produce some metabolic resources via multiple different paths, including cyclical paths - e.g. it's useful to be able to turn A into B but also B into A, because sometimes the environment will have lots of B and other times it will have lots of A. But we also expect that the cell usually won't spend energy to turn B into A and spend energy to turn A into B at the same time; energy is a scarce resource, and we expect that the bacterium can produce more progeny (on average) if it doesn't waste resources that way. So, the bacterium can achieve higher fitness if it represses either the A -> B pathway or the B -> A pathway at any given time, depending on which metabolite is more abundant. (See here for more detail on this example and how it relates to utility maximization, plus a bunch of meta discussion.)

Or, consider the toy example of a hospital administrator budgeting to save as many lives as possible. If the administrator spends $1M on a liver for someone who needs a transplant, but does not spend $100k on a dialysis machine which will save 3 lives, then the administrator has failed to budget in a way which saves as many lives as possible. They could save strictly more lives on the same budget by taking the dialysis machine over the liver.

That's "coherence": taking multiple actions in different times/places, in such a way that the actions together are Pareto optimal with respect to scarce resources.
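The hospital example can be worked through in a few lines. This is only a sketch: the $1M total budget and the assumption that the liver saves exactly one life are mine, not stated in the example, and `best_purchase` is a hypothetical helper that brute-forces every affordable combination.

```python
from itertools import combinations

# (name, cost in dollars, lives saved).
# Assumption: the liver transplant saves exactly one life.
OPTIONS = [
    ("liver transplant", 1_000_000, 1),
    ("dialysis machine", 100_000, 3),
]


def best_purchase(options, budget):
    """Return (names, total cost, lives saved) for the best affordable subset.

    More lives is better; among equal lives, lower cost is better.
    """
    best = ((), 0, 0)
    for size in range(len(options) + 1):
        for subset in combinations(options, size):
            cost = sum(c for _, c, _ in subset)
            lives = sum(n for _, _, n in subset)
            if cost <= budget and (lives, -cost) > (best[2], -best[1]):
                best = (tuple(name for name, _, _ in subset), cost, lives)
    return best


# Assumed budget of $1M: buying the liver alone is dominated by the dialysis machine.
choice = best_purchase(OPTIONS, 1_000_000)
```

Under these assumptions the liver-only plan uses the whole budget to save one life, while the dialysis machine saves three for a tenth of the cost, so the liver-only plan is not Pareto optimal.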

World Models

So there's this thing where a human keeps a map in their head of the stuff around them, including outside their direct line of sight. (You can tell humans do this because if they turn around and see some big obvious thing behind them which was not there last time they looked, they will be surprised, whereas if they see some big obvious thing which was there, they will not be surprised.) And that map constantly updates as new information comes in, in such a way that the map continues to track stuff around the human pretty robustly, even if there's weird stuff which messes it up for a little while.

Even E. coli, when swimming along a sugar gradient, have an internal molecular species whose concentration roughly tracks the rate-of-change of the external sugar concentration as the bacterium swims. It's a tiny internal model of the E. coli's environment. More generally, cells often use some internal molecular species to track some external state, and update that internal representation as new information comes in.

That's a "world model": some internal stuff which consistently tracks the state of (some parts of) the external world, and updates to continue tracking that external state as new information comes in.
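To make "internal stuff which tracks external state" concrete, here is a toy sketch loosely inspired by the bacterium example. It is not a model of the real biochemistry: the `memory` variable, the `memory_rate` parameter, and the update rule are assumptions chosen for simplicity. The internal signal is the gap between the newest reading and a slowly updated memory of past readings, and its sign and size follow how fast the external concentration is changing.

```python
def track_rate_of_change(readings, memory_rate=0.5):
    """Return one internal signal per reading (hypothetical toy tracker).

    `memory` is a slowly updated average of past readings. The signal is the
    newest reading minus that memory, so it is positive while readings rise,
    negative while they fall, and zero when they are constant.
    """
    memory = readings[0]
    signals = []
    for reading in readings:
        signals.append(reading - memory)
        # Update the internal state as new information comes in.
        memory += memory_rate * (reading - memory)
    return signals


rising = track_rate_of_change([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
falling = track_rate_of_change([5.0, 4.0, 3.0, 2.0, 1.0, 0.0])
steady = track_rate_of_change([3.0, 3.0, 3.0, 3.0])
```

The design choice worth noticing is that nothing inside the function has direct access to the true rate of change; it only has past observations summarised in one number, and that summary is enough to keep the signal informative.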

General-Purpose Search

So there's this thing with humans where you can give them pretty arbitrary tasks, from assembling some furniture to coding an app to planning an invasion, and they'll go figure out how to do it. In particular, humans can come up with plans to do pretty arbitrary tasks, before actually starting the tasks. (And of course competent humans usually iteratively update those plans as they try stuff and new information changes their world-model.) This is in contrast to fixed strategies, which can't update to many new tasks or adjust as new information comes in.

The part which comes up with the plan, and updates it in tandem with changes to the world-model, is "general-purpose search": some internal method which can find strategies to achieve a wide variety of goals across a wide variety of (modeled) external world-states. (More on what general purpose search is/isn't here.)
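A minimal sketch of the contrast with fixed strategies: one search routine that knows nothing about any particular task, handed two unrelated toy problems. Breadth-first search is just a stand-in here, not a claim about how humans or any AI system actually plan, and the two tasks (`number_moves`, `grid_moves`) are invented for the example.

```python
from collections import deque


def search(start, is_goal, successors):
    """Breadth-first search: return a list of actions reaching a goal, or None."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, plan = frontier.popleft()
        if is_goal(state):
            return plan
        for action, next_state in successors(state):
            if next_state not in seen:
                seen.add(next_state)
                frontier.append((next_state, plan + [action]))
    return None


# Task 1 (hypothetical): reach 10 from 1 using "+1" and "*2".
def number_moves(n):
    return [("+1", n + 1), ("*2", n * 2)] if n < 10 else []


# Task 2 (hypothetical): walk from (0, 0) to (2, 2) on a 3x3 grid.
def grid_moves(pos):
    x, y = pos
    steps = {"right": (x + 1, y), "up": (x, y + 1)}
    return [(a, p) for a, p in steps.items() if p[0] <= 2 and p[1] <= 2]


number_plan = search(1, lambda n: n == 10, number_moves)
grid_plan = search((0, 0), lambda p: p == (2, 2), grid_moves)
```

The same `search` function produces a plan for both tasks before any action is taken; only the goal test and the model of what each action does change between them. Re-running it after the modeled world changes is the toy analogue of updating the plan in tandem with the world-model.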

Reflection

So there's this thing where some animals recognize themselves in a mirror, and some don't. (You can tell this from the animal e.g. trying to fight with the reflection or scare it away, vs e.g. noticing something sneaking up behind the reflection and then turning to see what's behind them.)

Or humans explicitly think about themselves, and talk about themselves, their own thought processes, how they're perceived by others, yada yada yada. Indeed, it's hard to get humans to stop thinking about themselves for a short while.

This is "reflection": a system represents itself, not just in the trivial way that everything "represents itself", but within its own world-model, including representations of relationships to all the external stuff represented in the world model.

Language

So there's this thing where you can show a toddler an apple and say "apple", repeat with maybe three different apples, and from then on the toddler will mostly interpret "apple" the same way most other humans do. In the minds of two different humans, the word will map to internal representations of roughly-the-same stuff in the environment. Furthermore, words can be composed together in an exponentially huge variety of ways, and different humans will still end up mapping the words to internal representations of roughly-the-same stuff in the environment. (Not super consistently, unfortunately, but enough that humans are able to communicate at all, which is rather remarkable when dealing with an exponentially large space of potential meanings.)

This is "language": two systems coordinate to pass signals between them which map to internal representations of roughly-the-same stuff in the environment.

Thanks for your response. There's a lot of good material here, although some of these components like modules or language seem less central to agency, at least from my perspective. I guess you might see these as appearing slightly down the stack?

They fit naturally into the coherent whole picture. In very broad strokes, that picture looks like selection theorems starting from selection pressures for basic agency, running through natural factorization of problem domains (which is where modules and eventually language come in), then world models and general purpose search (which finds natural factorizations dynamically, rather than in a hard-coded way) once the environment and selection objective have enough variety.

Even E. coli, when swimming along a sugar gradient, have an internal molecular species whose concentration roughly tracks the rate-of-change of the external sugar concentration as the bacterium swims. It's a tiny internal model of the E. coli's environment. More generally, cells often use some internal molecular species to track some external state, and update that internal representation as new information comes in.

Woah, this sounds incredibly fascinating, I've never heard of this — do you have a link to more info, or terminology to google?

"Chemotaxis" is the main relevant jargon.

Apr 23, 2023

May 11, 2023
