Instead of talking about alignment research, we should differentiate between conceptual alignment research and applied alignment research.
I expect these two categories to be quite obvious to most people around here: conceptual alignment research includes deconfusion, for example work on deception, HCH, universality, abstraction, power-seeking, as well as work searching for approaches and solutions for the problems raised by this deconfusion; whereas applied alignment research focuses on experimentally testing these ideas and adapting already existing fields like RL and DL to be more relevant to alignment questions. Some work will definitely fall in the middle, but that’s not a problem, because the point isn’t to separate “good” from “bad” alignment research or “real” from “fake” alignment research, just to be able to frame these clusters and think about them meaningfully.
Isn’t that decomposition trivial though? It’s indeed obvious, just like the result of almost any deconfusion. And yet, committing to that framing clears the air about so many internal issues of the alignment research community and guards discussions of field-building against frustrating everyone involved.
Thanks to Connor Leahy, Logan Smith and Rob Miles for feedback on a draft.
(Note that the AI tag has a distinction that sounds similar but, in my opinion, has far worse names (and a weird classification). It might make sense to change the names following this post if the distinction makes sense to enough people.)
It’s all a Common-Payoff Game
Another obvious yet fundamental point is that neither conceptual alignment research nor applied alignment research is enough by itself -- we need both. Only succeeding in conceptual alignment research would result in a very good abstract understanding of what we should do, but no means of concretely translating it into practice in time; only succeeding in applied alignment research would result in a very good ability to build and understand models that are susceptible to alignment, yet no principled understanding of what to be careful of and how to steer them.
So it doesn’t make any sense to say that one is more important than the other, or that only one is “real alignment research”. Yet frustration can so easily push one to make this mistake. I personally said in multiple conversations that I was talking about “real alignment research” when talking about conceptual research, only to realize hours later, when the pressure was down, that I didn’t believe at all that the applied part was fake alignment research.
Keeping this split in mind and the complementarity of both approaches definitely helped me avoid frustration-induced lapses where I cast alignment as a zero-sum game between conceptual alignment research and applied alignment research. Instead, I remember the obvious: alignment is a common-payoff game where we all either win or lose together.
Field-Building Confusions
Keeping this distinction in mind also helps with addressing some frustrating confusions in field-building discussions.
Take my post about creating an alignment research journal/conference: I should have made it clear that I meant conceptual alignment research, but I hadn’t internalized this distinction at the time. Objections like this comment from David Manheim or this comment from Ryan Carey, saying that work can get published in ML/CS venues and that we should thus push there, didn’t convince me at all, yet I couldn’t put my finger on where my disagreement lay.
The conceptual/applied alignment research distinction instead makes it jump off the page: ML conferences and journals almost never publish conceptual alignment research AFAIK (some of it is published in workshops, but these don’t play the same role with regard to peer review and getting jobs and tenure). Take this quote from Ryan’s comment:
Currently, the flow of AIS papers into the likes of Neurips and AAAI (and probably soon JMLR, JAIR) is rapidly improving. New keywords have been created there at several conferences, along the lines of "AI safety and trustworthiness" (I forget the exact wording) so that you can nowadays expect, on average, to receive reviewer who average out to neutral, or even vaguely sympathetic to AIS research. Ten or so papers were published in such journals in the last year, and all these authors will become reviewers under that keyword when the conference comes around next year. Yes, things like "Logical Inductors" or "AI safety via debate" are very hard to publish. There's some pressure to write research that's more "normie". All of that sucks, but it's an acceptable cost for being in a high-prestige field. And overall, things are getting easier, fairly quickly.
Applying the conceptual/applied distinction makes it obvious that the argument applies only to applied alignment research. He literally gives two big examples of conceptual alignment research as the sort of things that can’t get published.
This matters because none of this is making it easier to peer-review/scale/publish conceptual alignment research. It’s not a matter of “normies” research versus “real alignment research”, but that almost all gains in field building and scaling in the last few years are for applied alignment research!
Incentives against Conceptual Alignment Research
I started thinking about all of this because I was so frustrated with having to always push/defend conceptual alignment research in discussions. Why were so many people pushing back against it/not seeing the problem that few people work on it? The answer seems obvious in retrospect: because years ago you actually had to push massively against the pull and influence of conceptual alignment research to do anything else.
7 years ago, when Superintelligence was published, MIRI was pretty much the whole field of alignment. And I don’t need to explain to anyone around here how MIRI is squarely in the conceptual part. As such, I’m pretty sure that many researchers who wanted a more varied field or to work on applied alignment research had to constantly deal with the fact that the big shots in town were not convinced by or interested in what they were doing. I expect that many of the applied alignment researchers positioned themselves as an alternative to MIRI’s work.
It made sense at the time, but now applied alignment research has completely dwarfed its conceptual counterpart in terms of researchers, publications, and prestige. Why? Because applied alignment research is usually:
- Within ML compared to the weird abstract positioning of conceptual alignment research
- Experimental and sometimes formal, compared to the weird philosophical aspects of conceptual alignment research
- Able to leverage skills that are actually taught in universities and which people are hyped about (programming, ML, data science)
Applied alignment research labs made and are making great progress by leveraging all of these advantages. On the other hand, conceptual alignment research relies entirely on a trickle of new people who bashed their heads long enough against the Alignment Forum to have an idea of what’s happening.
This is not the time for being against conceptual alignment research. Instead, we should all think together about ways of improving this part of the field. And when conceptual alignment research thrives, there will be so many more opportunities for collaborations: experiments on some conceptual alignment concepts, criticism of conceptual ideas from a more concrete perspective, conceptual failure modes for the concrete applied proposals...
Conclusion
You don’t need me to realize that there are two big clusters in alignment research: I call them conceptual alignment research and applied alignment research. But despite how obvious this distinction feels, keeping it in mind is crucial for the only thing that matters: solving the alignment problem.
Appendix: Difference with MIRI’s Philosophy -> Maths -> Engineering
A commenter pointed out to me that my distinction reminded them of the one argued for in this MIRI blogpost. In summary, the post argues that Friendly AI (the old-school name for alignment) should be tackled in the same way that causality was: starting at philosophy, then formalizing it into maths, and finally implementing these insights through engineering.
I don’t think this fits with the current state of the field, for the following reasons:
- Nowadays, the vast majority of the field disagrees that there’s any hope of formalizing all of philosophy and then just implementing that to get an aligned AGI.
- Because of this perspective that the real problem is to formalize philosophy, the MIRI post basically reduces applied alignment research to exact implementation of the formal models from philosophy. Whereas applied alignment research contains far more insights than that.
- Conceptual alignment research isn’t just turning philosophy into mathematics. This is a failure mode I warned against recently: what matters is deconfusion, not formalization. Some non-formal deconfusion can (and so often does) help far more than formal work. This doesn’t mean that having a formalization wouldn’t be great; just that this is in no way a requirement for making progress.
So all in all, I feel that my distinction is both more accurate and more fruitful.
I think it's becoming less clear to me what you mean by deconfusion. In particular, I don't know what to make of the following claims:
I don't [presently] think these claims dovetail with my understanding of deconfusion. My [present] understanding of deconfusion is that (loosely speaking) it's a process for taking ideas from the [fuzzy, intuitive, possibly ill-defined] sub-cluster and moving them to the [concrete, grounded, well-specified] sub-cluster.
I don't think this process, as I described it, entails having an application in mind. (Perhaps I'm also misunderstanding what you mean by application!) It seems to me that, although many attempts at deconfusion-style alignment research (such as the three examples I gave in my previous comment) might be ultimately said to have been motivated by the "application" of aligning superhuman agents, in practice they happened more because somebody noticed that whenever some word/phrase/cluster-of-related-words-and-phrases came up in conversation, people would talk about them in conflicting ways, use/abuse contradictory intuitions while talking about them, and just in general (to borrow Nate's words) "continuously accidentally spout nonsense".
But perhaps from your perspective, that kind of thing also counts as an application, e.g. the application of "making us able to talk about the thing we actually care about". If so, then:
- I agree that it's possible to make progress towards this goal without performing steps that look like formalization. (I would characterize this as the "philosophy" part of Luke's post about going from philosophy to math to engineering.)
- Conversely, I also agree that it's possible to perform formalization in a way that doesn't perfectly capture the essence of "the thing we want to talk about", or perhaps doesn't usefully capture it in any sense at all; if one wanted to use unkind words, one could describe the former category as "premature [formalization]", and the latter category as "unnecessary [formalization]". (Separately, I also see you as claiming that TurnTrout's work on power-seeking and Scott's work on Cartesian frames fall somewhere in the "premature" category, but this may simply be me putting words in your mouth.)
And perhaps your contention is that there's too much research being done currently that falls under the second bullet point; or, alternatively, that too many people are pursuing research that falls (or may fall) under the second bullet point, in a way that they (counterfactually) wouldn't if there were less (implicit) prestige attached to formal research.
If this (or something like it) is your claim, then I don't think I necessarily disagree; in fact, it's probably fair to say you're in a better position to judge than I am, being "closer to the ground". But I also don't think this precludes my initial position from being valid, where--having laid the groundwork in the previous two bullet points--I can now characterize my initial position as [establishing the existence of] a bullet point number 3:
- A successful, complete deconfusion of a concept will, almost by definition, admit to a natural formalization; if one then goes to the further step of producing such a formalism, it will be evident that the essence of the original concept is present in said formalism.
(Or, to borrow Eliezer's words this time, "Do you understand [the concept/property/attribute] well enough to write a computer program that has it?")
And yes, in a certain sense perhaps there might be no point to writing a computer program with the [concept/property/attribute] in question, because such a computer program wouldn't do anything useful. But in another sense, there is a point: the point isn't to produce a useful computer program, but to check whether your understanding has actually reached the level you think it has. If one further takes the position (as I do) that such checks are useful and necessary, then [replacing "writing a computer program" with "producing a formalism"] I claim that many productive lines of deconfusion research will in fact produce formalisms that look "premature" or even "unnecessary", as a part of the process of checking the researchers' understanding.
I think that about sums up [the part of] the disagreement [that I currently know how to verbalize]. I'm curious to see whether you agree this is a valid summary; let me know if (abstractly) you think I've been using the term "deconfusion" differently from you, or (concretely) if you disagree with anything I said about "my" version of deconfusion.