This post was written by Peli Grietzer, inspired by internal writings by TJ (tushant jha), for AOI[1]. The original post, published on Feb 5, 2024, can be found here: https://ai.objectives.institute/blog/the-problem-with-alignment.

The purpose of our work at the AI Objectives Institute (AOI) is to direct the impact of AI towards human autonomy and human flourishing. In the course of articulating our mission and positioning ourselves -- a young organization -- in the landscape of AI risk orgs, we’ve come to notice what we think are serious conceptual problems with the prevalent vocabulary of ‘AI alignment.’ This essay will discuss some of the major ways in which we think the concept of ‘alignment’ creates bias and confusion, as well as our own search for clarifying concepts. 

At AOI, we try to think about AI within the context of humanity’s contemporary institutional structures: How do contemporary market and non-market (e.g. bureaucratic, political, ideological, reputational) forces shape AI R&D and deployment, and how will the rise of AI-empowered corporate, state, and NGO actors reshape those forces? We increasingly feel that ‘alignment’ talk tends to obscure or distort these questions. 

The trouble, we believe, is the idea that there is a single so-called Alignment Problem. Talk about an ‘Alignment Problem’ tends to conflate a family of related but distinct technical and social problems, including: 

P1: Avoiding takeover from emergent optimization in AI agents

P2: Ensuring that AI’s information processing (and/or reasoning) is intelligible to us

P3: Ensuring AIs are good at solving problems as specified (by user or designer)

P4: Ensuring AI systems enhance, and don’t erode, human agency

P5: Ensuring that advanced AI agents learn a human utility function

P6: Ensuring that AI systems lead to desirable systemic and long term outcomes

Each of P1-P6 is known as ‘the Alignment Problem’ (or as the core research problem in ‘Alignment Research’) to at least some people in the greater AI Risk sphere, in at least some contexts. And yet these problems are clearly not simply interchangeable: placing any one of P1-P6 at the center of AI safety implies a complicated background theory about their relationship, their relative difficulty, and their relative significance. 

We believe that when different individuals and organizations speak of the ‘Alignment Problem,’ they assume different controversial reductions of the P1-P6 problems network to one of its elements. Furthermore, the very idea of an ‘Alignment Problem’ precommits us to finding a reduction for P1-P6, obscuring the possibility that this network of problems calls for a multi-pronged treatment. 

One surface-level consequence of the semantic compression around ‘alignment’ is widespread miscommunication, as well as fights over linguistic real-estate. The deeper problem, though, is that this compression serves to obscure some of a researcher’s or org’s foundational ideas about AI by ‘burying’ them under the concept of alignment. Take a familiar example of a culture clash within the greater AI Risk sphere: many mainstream AI researchers identify ‘alignment work’ with incremental progress on P3 (task-reliability), which researchers in the core AI Risk community reject as just safety-washed capabilities research. We believe working through this culture-clash requires that both parties state their theories about the relationship between progress on P3 and progress on P1 (takeover avoidance).

In our own work at AOI, we’ve had occasion to closely examine a viewpoint we call the Berkeley Model of Alignment -- a popular reduction of P1-P6 to P5 (agent value-learning) based on a paradigm consolidated at UC Berkeley’s CHAI research group in the late ‘10s. While the assumptions we associate with the Berkeley Model are no longer as dominant in technical alignment research[2] as they once were, we believe that the Berkeley Model still informs a great deal of big-picture and strategic discourse around AI safety.

Under the view we call the Berkeley Model of Alignment, advanced AIs can be naturally divided into two kinds: AI agents possessing a human utility function (‘aligned AIs’) and AI agents motivated to take over or eliminate humanity (‘unaligned AIs’). Within this paradigm, solving agent value-learning is effectively necessary for takeover avoidance and effectively sufficient for a systematically good future, making the relationship between observable progress on task-reliability and genuine progress on agent value-learning the central open question in AI safety and AI policy. This model of alignment is, of course, not simply arbitrary: it’s grounded in well-trodden arguments about the likelihood of emergent general-planner AGI and its tendency towards power-seeking. Nevertheless, we think the status of the Berkeley Model in our shared vocabulary blends these arguments into the background in ways that support imprecise, automatic thought-patterns instead of precise inferences. 

The first implicit pillar of the Berkeley Model that we want to criticize is the assumption of content indifference: The Berkeley Model assumes we can fully separate the technical problem of aligning an AI to some values or goals and the governance problem of choosing what values or goals to target. While it is logically possible that we’ll discover some fully generic method of pointing to goals or values (e.g. brain-reading), it’s equally plausible that different goals or values will effectively have different ‘type-signatures’: goals or values that are highly unnatural or esoteric given one training method or specification-format may be readily accessible given another training method or specification-format, and vice versa. This issue is even more pressing if we take a sociotechnical viewpoint that considers the impact of early AI technology on the epistemic, ideological, and economic conditions under which later AI development and deployment takes place.

The second implicit pillar that we want to criticize is the assumption of a value-learning bottleneck: The Berkeley Model assumes that the fundamental challenge in AI safety is teaching AIs a human utility function. We want to observe, first of all, that value learning is neither clearly necessary nor clearly sufficient for either takeover avoidance or a systematically good future. Consider that we humans ourselves manage to be respectful, caring, and helpful to our friends despite not fully knowing what they care about or what their life plans are -- thereby providing an informal human proof for the possibility of beneficial and safe behavior without exhaustive learning of the target’s values. And as concerns sufficiency, the recent literature on deceptive alignment vividly demonstrates that value learning by itself can’t guarantee the right relationship to motivation: understanding human values and caring about them are different things. 

Perhaps more important, the idea of a value-learning bottleneck assumes that AI systems will have a single ‘layer’ of goals or values. While this makes sense within the context of takeover scenarios where an AI agent directly stamps its utility function on the world, the current advance of applied AI suggests that near-future, high-impact AI systems will be composites of many AI and non-AI components. Without dismissing takeover scenarios, we at AOI believe that it’s also critical to study and guide the collective agency of composite, AI-driven sociotechnical systems. Consider, for example, advanced LLM-based systems: although we could empirically measure whether the underlying LLM can model human values by testing token completion over complex ethical statements, what’s truly impact-relevant are the patterns of interaction that emerge at the conjunction of the base LLM, RLHF regimen, prompting wrapper and plugins, interface design, and user-culture. 

This brings us to our final, central problem with the Berkeley Model: the assumption of context independence. At AOI, we are strongly concerned with how the social and economic ‘ambient background’ to AI R&D and deployment is likely to shape future AI. Our late founder Peter Eckersley was motivated by the worry that market dynamics favor the creation of powerful profit-maximizing AI systems that trample the public good: risks from intelligent optimization in advanced AI, Eckersley thought, are a radical new extension of optimization risks from market failures and misaligned corporations that already impact human agency in potentially catastrophic ways. Eckersley hoped that by restructuring the incentives around AI R&D, humanity could wrest AI from these indifferent optimization processes and build AI institutions sensitive to the true public good. In Eckersley's work at AOI, and in AOI's work after his passing, we continue to expand this viewpoint, incorporating a plethora of other social forces: bureaucratic dynamics within corporations and states, political conflicts, ideological and reputational incentives. We believe that in many plausible scenarios these forces will both shape the design of future AI technology itself, and guide the conduct of future AI-empowered sociotechnical intelligences such as governments and corporations.

This sociotechnical perspective on the future of AI does, of course, make its own hidden assumptions: In order to inherit or empower the profit-motive of corporations, advanced AI must be at least minimally controllable. While on the Berkeley Model of Alignment one technical operation (‘value alignment’) takes care of AI risk in its entirety, our sociotechnical model expects the future of AI to be determined by two complementary fronts: technical AI safety engineering, and the design and reform of institutions that develop, deploy, and govern AI. We believe that without good institutional judgment, many of the most likely forms of technically controllable AI may end up amplifying current harms, injustices, and threats to human agency. At the same time, we also worry that exclusive focus on current harms and their feedback loops can blind researchers and policy-makers to more technical forms of AI risk: Consider, for example, that researchers seeking to develop AI systems’ understanding of rich social contexts may produce new AI capabilities with ‘dual use’ for deception and manipulation. 

It may seem reasonable, at first glance, to think about our viewpoint as simply expanding the alignment problem -- adding an ‘institutional alignment problem’ to the technical AI alignment problem. While this is an approach some might have taken in the past, we’ve grown suspicious of the assumption that technical AI safety will take the form of an ‘alignment’ operation, and wary of the implication that good institutional design is a matter of inducing people to collectively enact some preconceived utility function. As we’ll discuss in our next post, we believe Martha Nussbaum’s and Amartya Sen’s ‘capabilities’ approach to public benefit gives a compelling alternative framework for institutional design that applies well to advanced AI and to the institutions that create and govern it. For now, we hope we’ve managed to articulate some of the ways in which ‘alignment’ talk restricts thought about AI and its future, as well as to suggest some reasons to paint outside of these lines.
 

 

  1. ^

    This post's contents were drafted by Peli and TJ, in their former capacities as Research Fellow and Research Director at AOI. They are currently research affiliates collaborating with the organization. 

  2. ^

    We believe there is an emerging paradigm that seeks to reduce P1-P6 to P2 (human intelligibility), but this new paradigm has so far not consolidated to the same degree as the Berkeley Model. Current intelligibility-driven research programs such as ELK and OAA don’t yet present themselves as ‘complete’ strategies for addressing P1-P6.

8 comments

I think you're right about these drawbacks of using the term "alignment" so broadly. And I agree that more work and attention should be devoted to specifying how we suppose these concepts relate to each other. In my experience, far too little effort is devoted to placing scientific work within its broader context. We cannot afford to waste effort when working on alignment.

I don't see a better alternative, nor do you suggest one. My preference in terminology is to simply use more specification, rather than trying to get anyone to change the terminology they use. With that in mind, I'll list what I see as the most common existing terminology for each of the sub-problems.

P1: Avoiding takeover from emergent optimization in AI agents

Best term in use: AInotkilleveryoneism. I disagree that alignment is commonly misused for this.

I don't think I've heard this termed alignment, outside of the assumption you mention in the Berkeley Model that value alignment (P5) is the only way of avoiding takeover (P1). P1 has been termed "the control problem," which encompasses value alignment. That seems good, since P1 does not fit the intuitive definition of alignment. The deliberately clumsy term "AInotkilleveryoneism" seems good for this, in any context you can get away with it. Your statement seems good otherwise.

P2: Ensuring that AI’s information processing (and/or reasoning) is intelligible to us

Best term in use: interpretability

This is more commonly called interpretability, but I agree that it's commonly lumped into "alignment work" without carefully examining just how it fits in. But it does legitimately fit into P1 (which shouldn't be called alignment), as well as into (what I think you mean by) P3, P5, and P6, which do fit the intuitive meaning of "alignment." Thus, it does seem like this deserves the term "alignment work" as well as its more precise term of interpretability. So this seems about right, with the caveat of wanting more specificity. As it happens, I just now published a post on exactly this.

P3: Ensuring AIs are good at solving problems as specified (by user or designer)

Best term in use: None. AI safety?

I think you mean to include ensuring AIs also do not do things their designers don't want. I suggest changing your description, since that effort is more often called alignment and accused of safety-washing.

This is the biggest offender. The problem is that "alignment" is intuitively appealing. I'd argue that this is completely wrong: you can't align a system with goals (humans) with a tool without goals (LLMs). A sword or an axe is not aligned with its wielder; swords and axes certainly lead to more trees cut down and more people stabbed, but they do not intend those things, so there's a type error in saying they are aligned with their users' goals. 

But this is pedantry that will continue to be ignored. I don't have a good idea for making this terminology clear. The term AGI was at one point used to specify AI with agency and goals, and thus AI that would be alignable with human goals, but it's been watered down. We need a replacement. And we need a better term for "aligning" AIs that are not at all dangerous in the severe way the "alignment problem" terminology was intended to address. Or a different term for doing the important work of aligning agentic, RSI-capable AGI.

P4: Ensuring AI systems enhance, and don’t erode, human agency

What? I'd drop this and just consider it a subset of P6. Maybe this plays a bigger role and gets the term alignment more than I know? Do you have examples?

P5: Ensuring that advanced AI agents learn a human utility function

Best term in use: value alignment OR technical alignment. 

I think these deserve their own categories in your terminology, because they only partly overlap: technical alignment could be limited to making AGIs that follow instructions. I have been thinking about this a lot. I agree with your analysis that this is what people will probably do, for economic reasons; but I think there are powerful practical reasons that this is much easier than full value alignment, which will be a valuable excuse to align AGI to follow instructions from its creators. I recently wrote up that logic. This conclusion raises another problem that I think deserves to join the flock of related alignment problems: the societal alignment problem. If some humans have AGIs aligned to their values (likely through their intent/instructions), how can we align society to avoid resulting disasters from AGI-powered conflict?

P6: Ensuring that AI systems lead to desirable systemic and long term outcomes

Best term in use: I don't think there is one. Any ideas?

The deliberately clumsy term "AInotkilleveryoneism" seems good for this, in any context you can get away with it. 

 

Hard disagree. The position "AI might kill all humans in the near future" is still quite some inferential distance away from the mainstream, even if presented with a respectable academic veneer. 

We do not have weirdness points to spend on deliberately clumsy terms, even on LW. Journalists (when they are not busy doxxing people) can read LW too, and if they read that the worry about AI as an extinction risk is commonly called notkilleveryoneism, they are orders of magnitude less likely to take us seriously, and being taken seriously by the mainstream might be helpful for influencing policy. 

We could probably get away with using that term ten pages deep into some glowfic, but anywhere else 'AI as an extinction risk' seems much better. 

I think you're right. Unfortunately I'm not sure "AI as an extinction risk" is much better. It's still a weird thing to posit, by standard intuitions.

I think that "AI Alignment" is a useful label for the somewhat related problems around P1-P6. Having a term for the broader thing seems really useful. 

Of course, sometimes you want labels to refer to a fairly narrow thing, like the label "Continuum Hypothesis". But broad labels are generally useful. Take "ethics", another broad field label: normative ethics, applied ethics, meta-ethics, descriptive ethics, value theory, moral psychology, et cetera. If someone tells me "I study ethics", this narrows down what problems they are likely to work on, but not very much. Perhaps they work out a QALY-based system for assigning organ donations, or study the moral beliefs of some peoples, or argue whether moral imperatives should have a truth value. Still, the label conveys a lot more useful information than a broader label like "philosophy". 

By contrast, "AI Alignment" still seems rather narrow. P2, for example, seems a mostly instrumental goal: if we have interpretability, we have better chances to avoid a takeover by an unaligned AI. P3 seems helpful but insufficient for good long-term outcomes: an AI prone to disobeying users or interpreting their orders in a hostile way would -- absent some other mechanism -- also fail to follow human values more broadly, but a P3-aligned AI in the hands of a bad human actor could still cause extinction, and I agree that social structures should probably be established to ensure that nobody can unilaterally assign the core task (or utility function) of an ASI. 

I would agree that it would be good and reasonable to have a term to refer to the family of scientific and philosophical problems spanned by this space. At the same time, as the post says, the issue is when there is semantic dilution, people talking past each other, and coordination-inhibiting ambiguity.

"P3 seems helpful but insufficient for good long term outcomes"

Now take a look at something I could check with a simple search: an ICML Workshop that uses the term alignment mostly to mean P3 (task-reliability) https://arlet-workshop.github.io/

One might want to use alignment one way or the other, and be careful of the limited overlap with P3 in our own registers, but by the time the larger AI community has picked up on the use-semantics of 'RLHF is an alignment technique' and associated alignment primarily with task-reliability, you'd need some linguistic interventions and deliberation to clear the air.

For clarity, how do you distinguish between P1 & P4?

First of all, these are all meant to denote very rough attempts at demarcating research tastes.

It seems possible to be aiming to solve P1 without thinking much about P4, if (a) you advocate a ~Butlerian pause, or (b) you are working on aligned paternalism as the target behavior (where AI(s) are responsible for keeping humans happy, and humans have no residual agency or autonomy remaining).

Also, a lot of people who focus on the problem from a P4 perspective tend to focus on the human-AI interface, where most of the relevant technical problems lie, but this might reduce their attention to issues of mesa-optimizers or emergent agency, despite the massive importance of those issues to their project in the long run.