I think that the concept of "alignment research" (and the distinction between that and "capabilities research") is currently a fairly confused one. In this post I’ll describe some of the problems with how people typically think about these terms, and offer replacement definitions.

“Alignment” and “capabilities” are primarily properties of AIs, not of AI research

The first thing to highlight is that the distinction between alignment and capabilities is primarily doing useful work when we think of them as properties of AIs. This distinction is still under-appreciated by the wider machine learning community. ML researchers have historically thought about performance of models almost entirely with respect to the tasks they were specifically trained on. However, the rise of LLMs has vindicated the alignment community’s focus on general capabilities, and now it’s much more common to assume that performance on many tasks (including out-of-distribution tasks) will improve roughly in parallel. This is a crucial assumption for thinking about risks from AGI.

Insofar as the ML community has thought about alignment, it has mostly focused on aligning models’ behavior to their training objectives. The possibility of neural networks aiming to achieve internally-represented goals is still not very widely understood, making it hard to discuss and study the reasons those goals might or might not be aligned with the values of (any given set of) humans.

However, extending “alignment” and “capabilities” from properties of AIs to properties of different types of research is a fraught endeavor. It’s tempting to categorize work as alignment research to the extent that it can be used to make AIs more aligned (to many possible targets), and as capabilities research to the extent that it can be used to make AIs more capable. But this approach runs into (at least) three major problems.

Firstly, in general it’s very difficult to categorize research by its impacts. Great research often links together ideas from many different subfields, typically in ways that only become apparent throughout the course of the research. We see this in many historical breakthroughs which shed light on a range of different domains. For example, early physicists studying the motions of the stars eventually derived laws governing all earthly objects. Meanwhile Darwin’s study of barnacles and finches led him to principles governing the evolution of all life. Analogously, we should expect that big breakthroughs in our understanding of neural networks and deep learning would be useful in many different ways.

More concretely, there are many cases where research done under the banner of alignment has advanced, or plausibly will advance, AI capabilities to a significant extent. This undermines our ability to categorize research by its impacts. Central examples include:

  • RLHF makes language models more obedient, but also more capable of coherently carrying out tasks.
  • Scalable oversight techniques can catch misbehavior, but will likely become important for generating high-quality synthetic training data, as it becomes more and more difficult for unassisted humans to label AI outputs correctly. E.g. this paper finds that "LLM critics can successfully identify hundreds of errors in ChatGPT training data rated as 'flawless'".
  • Interpretability techniques will allow us both to inspect AI cognition and to extract more capable behavior from AIs (e.g. via activation steering).
  • Techniques for Eliciting Latent Knowledge will plausibly be important for allowing AIs to make better use of implicit knowledge (e.g. knowledge about protein folding that’s currently hidden inside AlphaFold’s weights).
  • MIRI thought that their agent foundations research could potentially advance AI capabilities, which motivated their secrecy about it.

Secondly, not only is it difficult to predict the effects of any given piece of research, it’s also difficult for alignment researchers to agree on which effects are good or bad. This is because there are deep disagreements in the field about the likelihoods of different threat models. The more difficult you think the alignment problem is, the more likely you are to consider most existing research useless or actively harmful (as Yudkowsky does). By contrast, Christiano has written in defense of research developing RLHF and language model agents; and many alignment researchers who are more closely linked to mainstream ML have even broader views of what research is valuable from an alignment perspective.

Thirdly, for people concerned about AGI, most ML research should count as neither alignment nor capabilities, because it focuses on improving model performance in relatively narrow domains, in a way that is unlikely to generalize very far. I’ll call this type of research applications research. The ubiquity of applications research (e.g. across the many companies that are primarily focused on building ML products) makes some statistics that have been thrown around about the relative numbers of alignment researchers versus capabilities researchers (e.g. here and here) very misleading.

What types of research are valuable for preventing misalignment?

If we'd like to help prevent existential risk from misaligned AGI, but can't categorize research on a case-by-case basis, we'll need to fall back on higher-level principles about which research is beneficial. Specifically, I'll defend two traits which I think should be our main criteria for prioritizing research from an alignment-focused perspective:

Valuable property 1: worst-case focus

Most ML research focuses on improving the average performance of models (whether on a narrow set of tasks or a broad range of tasks). By contrast, alignment researchers are primarily interested in preventing models’ worst-case misbehavior, which may arise very rarely (and primarily in situations where models expect it won’t be detected). While there’s sometimes overlap between work on improving average-case behavior and work on improving worst-case behavior, in general we should expect them to look fairly different.
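To make the average-case versus worst-case distinction concrete, here is a minimal sketch in Python. Every number, context name, and model name in it is hypothetical and chosen only for illustration; none of them come from real systems. It shows how a model can look better on an average-case metric while being far worse on a worst-case one.

```python
# Toy illustration only: all probabilities below are made up.
# Each context maps to (how often the context occurs, how often the model misbehaves in it).
MODEL_A = {
    "routine": (0.999, 0.001),
    "rare_unmonitored": (0.001, 0.90),  # rarely encountered, but misbehaves badly there
}
MODEL_B = {
    "routine": (0.999, 0.005),
    "rare_unmonitored": (0.001, 0.01),
}


def average_case_failure(model):
    """Expected misbehavior rate, weighted by how often each context occurs."""
    return sum(p_context * p_fail for p_context, p_fail in model.values())


def worst_case_failure(model):
    """Highest misbehavior rate in any single context, however rare that context is."""
    return max(p_fail for _, p_fail in model.values())


if __name__ == "__main__":
    for name, model in [("A", MODEL_A), ("B", MODEL_B)]:
        print(name, average_case_failure(model), worst_case_failure(model))
```

Under these assumed numbers, model A has the lower average failure rate but a much higher worst-case failure rate, so optimizing the first metric alone would select the model that behaves worse in the rare context.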

We see a similar dynamic play out in cybersecurity (as highlighted by Yudkowsky’s writing on security mindset). In an ideal world, we could classify most software engineering work as cybersecurity work, because security would be built into the design by default. But in practice, creating highly secure systems requires a different skillset from regular software engineering, and typically doesn’t happen unless it’s some team’s main priority. Similarly, even if highly principled capabilities research would in theory help address alignment problems, in practice there's a lot of pressure to trade off worst-case performance for average-case performance.

These pressures are exacerbated by the difficulty of addressing worst-case misbehavior even in current models. Its rarity makes it hard to characterize or study. Adversarial methods (like red-teaming) can find some examples, but these methods are bottlenecked by the capabilities of the adversary. A more principled approach would involve formally verifying safety properties, but formally specifying or verifying non-trivial properties would require significant research breakthroughs. The extent to which techniques for eliciting and addressing worst-case misbehavior of existing models will be helpful for more capable models is an open question.
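One rough way to see why rarity makes worst-case misbehavior hard to study is to model the adversary as making independent random probes. That is a strong simplifying assumption (real red-teaming is targeted, not random), and the failure rate and probe budgets below are hypothetical. Under that assumption, the chance of finding at least one failure depends entirely on the adversary's probe budget:

```python
def detection_probability(failure_rate, n_probes):
    """Chance that at least one of n independent random probes triggers a failure.

    Assumes each probe independently hits a failing input with probability
    `failure_rate`; this is a simplification, not a model of real red-teaming.
    """
    return 1.0 - (1.0 - failure_rate) ** n_probes


if __name__ == "__main__":
    rare = 1e-6  # hypothetical: one failing input per million
    for budget in (1_000, 1_000_000, 10_000_000):
        print(budget, detection_probability(rare, budget))
```

With these made-up numbers, a small probe budget almost never surfaces the failure, while a much larger one almost always does, which is one simple sense in which such methods are bottlenecked by the adversary.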

Valuable property 2: scientific approach

Here’s another framing: a core barrier to aligning AGI is that we don’t understand neural networks well enough to say many meaningful things about how they function. So we should support research that helps us understand deep learning in a principled way. We can view this as a distinction between science and engineering: engineering aims primarily to make things work, science aims primarily to understand how they work. (This is related to Nate Soares’ point in this post.)

Thinking of AGI alignment as being driven by fundamental science highlights that big breakthroughs are likely to be relatively simple and easy to recognize, more like new theories of physics that make precise, powerful predictions than like a big complicated codebase that we need to scrutinize line-by-line. This makes me optimistic about automating alignment research in a way that humans can verify.

However, “trying to scientifically understand deep learning” is too broad a criterion to serve as a proxy for whether research will be valuable from an alignment perspective. For example, I expect that most work on scientifically understanding optimizers will primarily be useful for designing better optimizers, rather than understanding the models that result from the optimization process. So can we be more precise about what aspect of a “scientific approach” is valuable for alignment? My contention: a good proxy is the extent to which the research focuses on understanding cognition rather than behavior—i.e. the extent to which it takes a cognitivist approach rather than a behaviorist approach.

Some background on this terminology: the distinction between behaviorism and cognitivism comes from the history of psychology. In the mid-1900s, behaviorists held that the internal mental states of humans and animals couldn’t be studied scientifically, and therefore that the only scientifically meaningful approach was to focus on describing patterns of behavior. The influential behaviorist B. F. Skinner experimented with rewarding and punishing animal behavior, which eventually led to the modern field of reinforcement learning. However, the philosophical commitments of behaviorism became increasingly untenable. In the field of ethology, which studies animal behavior, researchers like Jane Goodall and Frans de Waal uncovered sophisticated behaviors inconsistent with viewing animals as pure reinforcement learners. In linguistics, Chomsky wrote a scathing critique of Skinner’s 1957 book Verbal Behavior. Skinner characterized language as a set of stimulus-response patterns, but Chomsky argued that this couldn’t account for human generalization to a very wide range of novel sentences. Eventually, psychology moved towards a synthesis in which study of behavior was paired with study of cognition.

ML today is analogous to psychology in the 1950s. Most ML researchers are behaviorists with respect to studying AIs. They focus on how training data determines behavior, and assume that AI behavior is driven by “bundles of heuristics” except in the cases where it’s demonstrated otherwise. This makes sense on narrow tasks, where it’s possible to categorize and study different types of behavior. But when models display consistent types of behavior across many different tasks, it becomes increasingly difficult to predict that behavior without reference to the underlying cognition going on inside the models. (Indeed, this observation can be seen as the core of the alignment problem: we can’t deduce internal motivations from external behavior.)

ML researchers often shy away from studying model cognition, because the methodology involved is often less transparent and less reproducible than simply studying behavior. This is analogous to how early ethologists who studied animals “in the wild” were disparaged for using unrigorous qualitative methodologies. However, they gradually collected many examples of sophisticated behavior (including tool use, power struggles, and cultural transmission) which eventually provided much more insight than narrow, controlled experiments performed in labs.

Similarly, I expect that the study of the internal representations of neural networks will gradually accumulate more and more interesting data points, spark more and more concrete hypotheses, and eventually provide us with principles for understanding neural networks’ real-world behavior that are powerful enough to generalize even to very intelligent agents in very novel situations.

A better definition of alignment research

We can combine the two criteria above to give us a two-dimensional categorization of different types of AI research. In the table below, I give central examples of each type (with the pessimistic-case category being an intermediate step between average-case and worst-case):

                     | Average-case                   | Pessimistic-case             | Worst-case
Engineering          | Scaling                        | RLHF                         | Adversarial robustness
Behaviorist science  | Optimization science           | Scalable oversight           | AI control
Cognitivist science  | Concept-based interpretability | Mechanistic interpretability | Agent foundations

There’s obviously a lot of research that I’ve skipped over, and exactly where each subfield should be placed is inherently subjective. (For example, there’s been a lot of rigorous scientific research on adversarial examples, but in practice it seems like the best mitigations are fairly hacky and unprincipled, which is why I put adversarial robustness in the “engineering” row.) But I nevertheless think that these are valuable categories for organizing our thinking.

We could just leave it here. But in practice, I don’t think people will stop talking about “alignment research” and “capabilities research”, and I’d like to have some definition of each that doesn’t feel misguided. So, going forward, I’ll define alignment research and capabilities research as research that’s close to the bottom-right and top-left corners respectively. This defines a spectrum between them; I’d like more researchers to move towards the alignment end of the spectrum. (Though recall, as I noted above, that most ML research is neither alignment nor capabilities research, but instead applications research.)

Lastly, I expect all of this to change over time. For example, the central examples of adversarial attacks used to be cases where image models gave bad classifications. Now we also have many examples of jailbreaks which make language models ignore developers’ instructions. In the future, central examples of adversarial attacks will be ones which make models actively try to cause harmful outcomes. So I hope that eventually different research directions near the bottom-right corner of the table above will unify into a rigorous science studying artificial values. And further down the line, perhaps all parts of the table will unify into a rigorous science of cognition more generally, encompassing not just artificial but also biological minds. For now, though, when I promote alignment, it means that I’m trying to advance worst-case and/or cognitivist AI research.

habryka

To be fair, the alignment community has caused some confusion by describing models as more or less “aligned”

I am curious what you are thinking about. My sense is the trend of calling models "aligned" started with OpenAI and ChatGPT, and wasn't really driven by anyone in the AI Alignment community (and is something that I complained a lot about at the time, because it did indeed seem like a thing that predictably would lead to confusion). 

It's plausible to me that Paul also contributed to this, but my sense is most senior alignment people have been very hesitant to use "alignment" as a generalized term to describe the behavior of present-day models.

Fair point. I've now removed that section from the post (and also, unrelatedly, renamed the post).

I was trying to make a point about people wanting to ensure that AI in general (not just current models) is "aligned", but in hindsight I think people usually talk about alignment with human values or similar. I have some qualms about that but will discuss in a different post.

LawrenceC

Pretty sure Anthropic's early assistant stuff used the word this way too: See e.g. Bai et al https://arxiv.org/abs/2204.05862

But yes, people complained about it a lot at the time

habryka

Yeah, my sense is others (like Anthropic) followed along after OpenAI did that, though it seemed to me mostly to be against consensus in the alignment field (though I agree it's messy).

LawrenceC

(The Anthropic paper I cited predates ChatGPT by 7 months)

habryka

Huh, interesting. Maybe the OpenAI statements about their models being "more aligned" came earlier than that in the context of Instruct-GPT? I definitely feel like I remember some Twitter threads and LW comment threads about it in the context of OpenAI announcements, and nothing in the context of Anthropic announcements.

leogao

This is likely not the first instance, but OpenAI was already using the word "aligned" in this way in 2021 in the Codex paper.

https://arxiv.org/abs/2107.03374 (section 7.2)

LawrenceC

Ah, you're correct, it's from the original instructGPT release in Jan 2022:
https://openai.com/index/instruction-following/

I like my own definition of alignment vs. capabilities research better:

"Alignment research is when your research goals are primarily about how to make AIs aligned; capabilities research is when your research goals are primarily about how to make AIs more capable."

I think it's very important that lots of people currently doing capabilities research switch to doing alignment research. That is, I think it's very important that lots of people who are currently waking up every day thinking 'how can I design a training run that will result in AGI?' switch to waking up every day thinking 'Suppose my colleagues do in fact get to AGI in something like the current paradigm, and they apply standard alignment techniques -- what would happen? Would it be aligned? How can I improve the odds that it would be aligned?'

Whereas I don't think it's particularly important that e.g. people switch from scalable oversight to agent foundations research. (In fact it might even be harmful lol)

What if your research goal is "I'd like to understand how neural networks work?" This is not research primarily about how to make AIs aligned. We tend to hypothesize, as a community, that it will help with alignment more than it helps with capabilities. But that's not an inherent part of the research goal for many interpretability researchers.

(Same for "I'd like to understand how agency works", which is a big motivation for many agent foundations researchers.)

Conversely, what if your research goal is "I'm going to design a training run that will produce a frontier model, so that we can study it to advance alignment research"? Seems odd, but I'd bet that (e.g.) a chunk of Anthropic's scaling team thinks this way. Counts as alignment under your definition, since that's the primary goal of the research.

More generally, I think it's actually a very important component of science that people judge the research itself, not the motivations behind it—since historically scientific breakthroughs have often come from people who were disliked by establishment scientists. A definition that basically boils down to "alignment research is whatever research is done by the people with the right motivations" makes it very easy to prioritize the ingroup. I do think that historically being motivated by alignment has correlated with choosing valuable research directions from an alignment perspective (like mech interp instead of more shallow interp techniques) but I think we can mostly capture that difference by favoring more principled, robust, generalizable research (as per my definitions in the post).

Whereas I don't think it's particularly important that e.g. people switch from scalable oversight to agent foundations research. (In fact it might even be harmful lol)

I agree. I'll add a note in the post saying that the point you end up on the alignment spectrum should also account for feasibility of the research direction.

Though note that we can interpret your definition as endorsing this too: if you really hate the idea of making AIs more capable, then that might motivate you to switch from scalable oversight to agent foundations, since scalable oversight will likely be more useful for capabilities progress.

Answering your first question. If you truly aren't trying to make AGI, and you truly aren't trying to align AGI, and instead are just purely intrinsically interested in how neural networks work (perhaps you are an academic?) ...great! That's neither capabilities nor alignment research afaict, but basic science. Good for you. I still think it would be better if you switched to doing alignment research (e.g. you could switch to 'i want to understand how neural networks work... so that I can understand how a prosaic AGI system being RLHF'd might behave when presented with genuine credible opportunities for takeover + lots of time to think about what to do') but I don't feel so strongly about it as I would if you were doing capabilities research.

re: judging the research itself rather than the motivations: idk I think it's actually easier, and less subjective, to judge the motivations, at least in this case. People usually just state what their motivations are. Also I'm not primarily trying to judge people, I'm trying to exhort people -- I'm giving people advice about what they should do to make the world a better place, I'm not e.g. talking about which research should be published and which should be restricted (I'm generally in favor of publishing research with maybe a few exceptions, but I think corporations on the margin should publish more). Moreover, it's much easier for the researcher to judge their own motivations than for them to judge the long-term impact of their work or to fit it into your diagram.

If you truly aren't trying to make AGI, and you truly aren't trying to align AGI, and instead are just purely intrinsically interested in how neural networks work (perhaps you are an academic?) ...great! That's neither capabilities nor alignment research afaict, but basic science.

Consider Chris Olah, who I think has done more than almost anyone else to benefit alignment. It would be very odd if we had a definition of alignment research where you could read all of Chris's interpretability work and still not know whether or not he's an "alignment researcher". On your definition, when I read a paper by a researcher I haven't heard of, I don't know anything about whether it's alignment research or not until I stalk them on facebook and find out how socially proximal they are to the AI safety community. That doesn't seem great.

Back to Chris. Because I've talked to Chris and read other stuff by him, I'm confident that he does care about alignment. But I still don't know whether his actual motivations are more like 10% intrinsic interest in how neural networks work and 90% in alignment, or vice versa, or anything in between. (It's probably not even a meaningful thing to measure.) It does seem likely to me that the ratio of how much intrinsic interest he has in how neural networks work, to how much he cares about alignment, is significantly higher than that of most alignment researchers, and I don't think that's a coincidence—based on the history of science (Darwin, Newton, etc) intrinsic interest in a topic seems like one of the best predictors of actually making the most important breakthroughs.

In other words: I think your model of what produces more useful research from an alignment perspective overprioritizes towards first-order effects (if people care more they'll do more relevant work) and ignores the second-order effects that IMO are more important (1. Great breakthroughs seem, historically, to be primarily motivated by intrinsic interest; and 2. Creating research communities that are gatekept by people's beliefs/motivations/ideologies is corrosive, and leads to political factionalism + ingroupiness rather than truth-seeking.)

I'm not primarily trying to judge people, I'm trying to exhort people

Well, there are a lot of grants given out for alignment research. Under your definition, those grants would only be given to people who express the right shibboleths.

I also think that the best exhortation of researchers mostly looks like nerdsniping them, and the way to do that is to build a research community that is genuinely very interested in a certain set of (relatively object-level) topics. I'd much rather an interpretability team hire someone who's intrinsically fascinated by neural networks (but doesn't think much about alignment) than someone who deeply cares about making AI go well (but doesn't find neural nets very interesting). But any step in the pipeline that prioritizes "alignment researchers" (like: who gets invited to alignment workshops, who gets alignment funding or career coaching, who gets mentorship, etc) will prioritize the latter over the former if they're using your definition.

Buck

I'd much rather an interpretability team hire someone who's intrinsically fascinated by neural networks (but doesn't think much about alignment) than someone who deeply cares about making AI go well (but doesn't find neural nets very interesting).

I disagree, I'd rather they'd hire someone who cares about making AI go well. E.g. I like Sam Marks's work on making interpretability techniques useful (e.g. here), and I think he gets a lot of leverage compared to most interpretability researchers via trying to do stuff that's in the direction of being useful. (Though note that his work builds on the work of non-backchaining interpretability researchers.)

habryka

(FWIW I think Chris Olah's work is approximately irrelevant to alignment and indeed this is basically fully explained by the motivational dimension)

Whose work is relevant, according to you?

habryka

Lots of people's work:

  • Paul's work (ELK more than RLHF though it was useful to see what happens when you throw RL at LLMs in a way that's kind of similar to how I do get some value out of Chris's work)
  • Eliezer's work
  • Nate's work
  • Holden's writing on cold takes
  • Ajeya's work
  • Wentworth's work
  • The debate stuff
  • Redwood's work
  • Bostrom's work
  • Evan's work
  • Scott and Abram's work

There is of course still huge variance in how relevant these different people's work is, and in how much it goes for the throat, but all of these seem more relevant to AI Alignment/AI-not-kill-everyonism than Chris's work (which again, I found interesting, but not like super interesting).

Do you mean Evan Hubinger, Evan R. Murphy, or a different Evan? (I would be surprised and humbled if it was me, though my priors on that are low.)

Hubinger

Definitely not trying to put words in Habryka's mouth, but I did want to make a concrete prediction to test my understanding of his position; I expect he will say that: 

  • the only work which is relevant is the one that tries to directly tackle what Nate Soares described as "the hard bits of the alignment challenge" (the identity of which Habryka basically agrees with Soares about)
  • nobody is fully on the ball yet 
  • but agent foundations-like research by MIRI-aligned or formerly MIRI-aligned people (Vanessa Kosoy, Abram Demski, etc.) is the one that's most relevant, in theory
  • however, in practice, even that is kinda irrelevant because timelines are short and that work is going along too slowly to be useful even for deconfusion purposes

Edit: I was wrong.

To clarify: are you saying that since you perceive Chris Olah as mostly intrinsically caring about understanding neural networks (instead of mostly caring about alignment), you conclude that his work is irrelevant to alignment?

No, I have detailed inside view models of the alignment problem, and under those models consider Chris Olah's work to be interesting but close to irrelevant (or to be about as relevant as the work of top capability researchers, whose work, to be clear, does have some relevance since of course understanding how to make systems better is relevant for understanding how AGI will behave, but where the relevance is pretty limited).

I feel like we are misunderstanding each other, and I think it's at least in large part my fault.

I definitely agree that we don't want to be handing out grants or judging people on the basis of what shibboleths they spout or what community they are part of. In fact I agree with most of what you've said above, except for when you start attributing stuff to me.

I think that grantmakers should evaluate research proposals not on the basis of the intentions of the researcher, but on the basis of whether the proposed research seems useful for alignment. This is not your view though right? You are proposing something else?

 

The possibility of neural networks aiming to achieve internally-represented goals is still not very widely understood, making it hard to discuss and study the reasons those goals might or might not be aligned with the values of (any given set of) humans.

 

I agree that this distinction is critical.

My brief attempt at explaining the importance of internally-represented goals (nice phrasing btw!) for capabilities and alignment is Steering subsystems: capabilities, agency, and alignment.