I wonder if we couldn't convert this into some kind of community wiki, so that the people represented in it can provide endorsed representations of their own work, and so that the community as a whole can keep it updated as time goes on.
Obviously there's the problem where you don't want random people to be able to put illegitimate stuff on the list. But it's also hard to agree on a way to declare legitimacy.
...Maybe we could have a big post like lukeprog's old textbook post, where researchers can make top-level comments describing their own research? And then others can up- or down-vote the comments based on the perceived legitimacy of the research program?
There's also a much harder and less impartial option, which is to have an extremely opinionated survey that basically picks one lens to view the entire field and then describes all agendas with respect to that lens in terms of which particular cruxes/assumptions each agenda runs with. This would necessarily require the authors of the survey to deeply understand all the agendas they're covering, and inevitably some agendas will receive much more coverage than other agendas.
This makes it much harder than just stapling together a bunch of people's descriptions of their own research agendas, and will never be "the" alignment survey because of the opinionatedness. I still think this would have a lot of value though: it would make it much easier to translate ideas between different lenses/notice commonalities, and help with figuring out which cruxes need to be resolved for people to agree.
Relatedly, I don't think alignment currently has a lack of different lenses (which is not to say that the different lenses are meaningfully decorrelated). I think alignment has a lack of convergence between people with different lenses. Some of this is because many cruxes are very hard to resolve experimentally today. However, I think even despite that it should be possible to do much better than we currently are--often, it's not even clear what the cruxes are between different views, or whether two people are thinking about the same thing when they make claims in different language.
Fwiw, I think "deep" reviews serve a very different purpose from shallow reviews, so I don't think you should let the existence of shallow reviews prevent you from doing a deep review.
Promoted to curated. I think this kind of overview is quite valuable, and I think overall this post did a pretty good job of covering a lot of different work happening in the field. I don't have a ton more to say, I just think posts like this should come out every few months, and the takes in this one overall seemed pretty good to me.
Thanks!
I think there's another agenda, roughly: make untrusted models safe but useful by putting them in a scaffolding/bureaucracy (of filters, classifiers, LMs, humans, etc.) such that at inference time, takeover attempts are less likely to succeed and more likely to be caught. See Untrusted smart models and trusted dumb models (Shlegeris 2023). Other relevant work:
[Edit: now AI Control (Shlegeris et al. 2023) and Catching AIs red-handed (Greenblatt and Shlegeris 2024).]
[Edit: sequence on this control agenda.]
Under "Understand cooperation", you should add Metagov (many relevant projects under this umbrella, please visit the website, in particular, DAO Science) and the "ecosystems of intelligence" agenda (itself pursued by Verses, Active Inference Institute, Gaia Consortium, Digital Gaia, and Bioform Labs). This is more often practical than theoretical work though, so the category names ("Theory" > "Understanding cooperation") wouldn't be totally reasonable for it, but this is also true for a lot of entries already in the post.
In general, the science of coopera...
Thanks for making this map 🙏
I expect this is a rare moment of clarity because maintaining updates takes a lot of effort and is now subject to optimization pressure.
Also imo most of the "good" alignment work in terms of eventual impact is being done outside the alignment label (eg as differential geometry or control theory) and will be merged in later once the connection is recognized. Probably this will continue to become more true over time.
Thanks for making this! I’ll have thoughts and nitpicks later, but this will be a useful reference!
An (intentionally) shallow collection of AI alignment agendas that different organizations are working on.
This is the post I come back to when I want to remind myself what agendas different organizations are pursuing.
Overall, it is a solid and comprehensive post that I found very useful.
Just wanted to point out that my algorithm distillation thing didn't actually get funded by Lightspeed and I have in fact received no grant so far (while the post says I have 68k for some reason? Might be getting mixed up with someone else).
I'm also currently working on another interpretability project with other people that will likely be published relatively soon.
But my resources continue to be $0 and I haven't managed to get any grant yet.
Nice work.
Regarding the Learning-Theoretic Agenda:
Regarding mild optimisation: https://www.pik-potsdam.de/en/institute/futurelabs/gane is also doing this (see SatisfIA project).
Another agenda not covered: Self-Other Overlap.
Some outputs in 2023: catastrophic Goodhart?
This was not funded by MIRI. It was inspired by a subproblem we ran into, I reduced my MIRI hours to work on it, then it was retroactively funded by LTFF several months later. Nor do I consider it to be part of the project of understanding consequentialist cognition, it's more about understanding optimization.
Very useful post! Here are some things that could go under corrigibility outputs in 2023: AI Alignment Awards entry; comment. I'm also hoping to get an updated explanation of my corrigibility proposal (based on this) finished before the end of the year.
Activation engineering (as unsupervised interp)
Much of this is now supervised, and [Roger questions how much value the unsupervised part brings](https://www.lesswrong.com/posts/bWxNPMy5MhPnQTzKz/what-discovering-latent-knowledge-did-and-did-not-find-4). So it might make sense to merge with model edits in the next one.
See also Holtman’s neglected result.
Does anyone have a technical summary? This sounds pretty exciting, but the paper is 35 pages and I can't find a summary anywhere that straightforwardly tells me a formal description of the setting, why it satisfies the desiderata it does, and what this means for the broader problem of reflective stability in shutdownable agents.
Reverse engineering. Unclear if this is being pushed much anymore. 2022: Anthropic circuits, Interpretability In The Wild, Grokking mod arithmetic
FWIW, I was one of Neel's MATS 4.1 scholars and I would classify 3/4 of Neel's scholars' outputs as reverse engineering some component of LLMs (for completeness, this is the other one, which doesn't nicely fit as 'reverse engineering' imo). I would also say that this is still an active direction of research (lots of ground to cover with MLP neurons, polysemantic heads, and more).
Hey, great stuff -- thank you for sharing! I especially found this useful as somebody who has been "out" of alignment for 6 months and is looking to set up a new research agenda.
I am very surprised that "Iterated Amplification" appears nowhere on this list. Am I missing something?
Zac Hatfield-Dobbs
Almost but not quite my name! If you got this from somewhere else, let me know and I'll go ping them too?
Thanks for noticing and including a link to my post Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom). I'm not sure I'd describe it as primarily a critique of mild optimization/satisficing: it's more pointing out a slightly larger point, that any value learner foolish enough to be prone to Goodharting, or unable to cope with splintered models or Knightian uncertainty in its Bayesian reasoning is likely to be bad at STEM, limiting how dangerous it can be (so fixing this is capabilities work as well as alignment work). But yes, that i...
The "surgical model edits" section should also have a subsection on editing model weights. For example there's this paper on removing knowledge from models using multi-objective weight masking.
Soares's comment which you link as a critique of Open Agency Architecture is almost certainly not a critique. It's more of an overview/distillation.
He actually explicitly says: 1) "Here's a recent attempt of mine at a distillation of a fragment of this plan" and 2) "note: I'm simply attempting to regurgitate the idea here".
The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.
Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?
Imitation learning. One-sentence summary: train models on human behaviour (such as monitoring which keys a human presses in response to what happens on a computer screen); contrast with Strouse.
Reward learning. One-sentence summary: People like CHAI are still looking at reward learning to “reorient the general thrust of AI research towards provably beneficial systems”. (They are also doing a lot of advocacy, like everyone else.)
I question whether this captures the essence of proponents' hopes for either reward learning or imitation learning...
Wow, high praise for MATS! Thank you so much :) This list is also great for our Summer 2024 Program planning.
try to formalise a more realistic agent, understand what it means for it to be aligned with us, […], and produce desiderata for a training setup that points at coherent AGIs similar to our model of an aligned agent.
Finally, people are writing good summaries of the learning-theoretic agenda!
One big omission is Bengio's new stuff, but the talk wasn't very precise. Sounds like Russell:
With a causal and Bayesian model-based agent interpreting human expressions of rewards reflecting latent human preferences, as the amount of compute to approximate the exact Bayesian decisions increases, we increase the probability of safe decisions.
Another angle I couldn't fit in is him wanting to make microscope AI, to decrease our incentive to build agents.
Summary
You can’t optimise an allocation of resources if you don’t know what the current one is. Existing maps of alignment research are mostly too old to guide you and the field has nearly no ratchet, no common knowledge of what everyone is doing and why, what is abandoned and why, what is renamed, what relates to what, what is going on.
This post is mostly just a big index: a link-dump for as many currently active AI safety agendas as we could find. But even a link-dump is plenty subjective. It maps work to conceptual clusters 1-1, aiming to answer questions like “I wonder what happened to the exciting idea I heard about at that one conference”, “I just read a post on a surprising new insight and want to see who else has been working on this”, and “I wonder roughly how many people are working on that thing”.
This doc is unreadably long, so that it can be Ctrl-F-ed. Also this way you can fork the list and make a smaller one.
Our taxonomy:
Please point out if we mistakenly round one thing off to another, miscategorise someone, or otherwise state or imply falsehoods. We will edit.
Unlike the late Larks reviews, we’re not primarily aiming to direct donations. But if you enjoy reading this, consider donating to Manifund, MATS, or LTFF, or to Lightspeed for big ticket amounts: some good work is bottlenecked by money, and you have free access to the service of specialists in giving money for good work.
Meta
When I (Gavin) got into alignment (actually it was still ‘AGI Safety’) people warned me it was pre-paradigmatic. They were right: in the intervening 5 years, the live agendas have changed completely.[1] So here’s an update.
Chekhov’s evaluation: I include Yudkowsky’s operational criteria (Trustworthy command?, closure?, opsec?, commitment to the common good?, alignment mindset?) but don’t score them myself. The point is not to throw shade but to remind you that we often know little about each other.
See you in 5 years.
Editorial
Agendas
1. Understand existing models
Characterisation
Evals
(Figuring out how a trained model behaves.)
Various capability evaluations
Various red-teams
Eliciting model anomalies
Alignment of Complex Systems: LLM interactions
The other evals (groundwork for regulation)
Much of Evals and Governance orgs’ work is something different: developing politically legible metrics, processes / shocking case studies. The aim is to motivate and underpin actually sensible regulation.
But this is a technical alignment post. I include this section to emphasise that these other evals (which seek confirmation) are different from understanding whether dangerous capabilities have emerged or might emerge.
Interpretability
(Figuring out what a trained model is actually computing.)[2]
Ambitious mech interp
Concept-based interp
Causal abstractions
EleutherAI interp
Activation engineering (as unsupervised interp)
Leap
Understand learning
(Figuring out how the model figured it out.)
Timaeus: Developmental interpretability & singular learning theory
Various other efforts:
2. Control the thing
(Figuring out how to predictably affect model behaviour.)
Prosaic alignment / alignment by default
Redwood: control evaluations
Safety scaffolds
Prevent deception
Through methods besides mechanistic interpretability.
Redwood: mechanistic anomaly detection
Indirect deception monitoring
Anthropic: externalised reasoning oversight
Surgical model edits
(interventions on model internals)
Weight editing
Activation engineering
Getting it to learn what we want
(Figuring out how to control what the model figures out.)
Social-instinct AGI
Imitation learning
Reward learning
Goal robustness
(Figuring out how to make the model keep doing ~what it has been doing so far.)
Measuring OOD
Concept extrapolation
Mild optimisation
3. Make AI solve it
(Figuring out how models might help figure it out.)
Scalable oversight
(Figuring out how to help humans supervise models. Hard to cleanly distinguish from ambitious mechanistic interpretability.)
OpenAI: Superalignment
Supervising AIs improving AIs
Cyborgism
See also Simboxing (Jacob Cannell).
Task decomp
Recursive reward modelling is supposedly not dead but instead one of the tools Superalignment will build.
Another line tries to make something honest out of chain of thought / tree of thought.
Elicit (previously Ought)
Adversarial
DeepMind Scalable Alignment
Anthropic / NYU Alignment Research Group / Perez collab
See also FAR (below).
4. Theory
(Figuring out what we need to figure out, and then doing that. This used to be all we could do.)
Galaxy-brained end-to-end solutions
The Learning-Theoretic Agenda
Open Agency Architecture
Provably safe systems
Conjecture: Cognitive Emulation (CoEms)
Question-answer counterfactual intervals (QACI)
Understanding agency
(Figuring out ‘what even is an agent’ and how it might be linked to causality.)
Causal foundations
Alignment of Complex Systems: Hierarchical agency
The ronin sharp left turn crew
Shard theory
boundaries / membranes
A disempowerment formalism
Performative prediction
Understanding optimisation
Corrigibility
(Figuring out how we get superintelligent agents to keep listening to us. Arguably scalable oversight and superalignment are ~atheoretical approaches to this.)
Behavior alignment theory
The comments in this thread are extremely good – but none of the authors are working on this!! See also Holtman’s neglected result. See also EJT (and formerly Petersen). See also Dupuis.
Ontology identification
(Figuring out how superintelligent agents think about the world and how we get superintelligent agents to actually tell us what they know. Much of interpretability is incidentally aiming at this.)
ARC Theory
Natural abstractions
Understand cooperation
(Figuring out how inter-AI and AI/human game theory should or would work.)
CLR
FOCAL
See also higher-order game theory. We moved CAIF to the “Research support” appendix. We moved AOI to “misc”.
5. Labs with miscellaneous efforts
(Making lots of bets rather than following one agenda, which is awkward for a topic taxonomy.)
DeepMind Alignment Team
Apollo
Anthropic Assurance / Trust & Safety / RSP Evaluations / Interpretability
FAR
Krueger Lab
AI Objectives Institute (AOI)
More meta
We don’t distinguish between massive labs, individual researchers, and sparsely connected networks of people working on similar stuff. The funding amounts and full time employee estimates might be a reasonable proxy.
The categories we chose have substantial overlap; see the “see also”s for closely related work.
I wanted this to be a straight technical alignment doc, but people pointed out that would exclude most work (e.g. evals and nonambitious interpretability, which are safety but not alignment) so I made it a technical AGI safety doc. Plus ça change.
The only selection criterion is “I’ve heard of it and >= 1 person was recently working on it”. I don’t go to parties so it’s probably a couple months behind.
Obviously this is the Year of Governance and Advocacy, but I exclude all this good work: by its nature it gets attention. I also haven’t sought out the notable amount of work by ordinary labs and academics who don’t frame their work as alignment. Nor the secret work.
You are unlikely to like my partition into subfields; here are others.
No one has read all of this material, including us. Entries are based on public docs or private correspondence where possible but the post probably still contains >10 inaccurate claims. Shouting at us is encouraged. If I’ve missed you (or missed the point), please draw attention to yourself.
If you enjoyed reading this, consider donating to Lightspeed, MATS, Manifund, or LTFF: some good work is bottlenecked by money, and some people specialise in giving away money to enable it.
Conflicts of interest: I wrote the whole thing without funding. I often work with ACS and PIBBSS and have worked with Team Shard. Lightspeed gave a nice open-ended grant to my org, Arb. CHAI once bought me a burrito.
If you’re interested in doing or funding this sort of thing, get in touch at hi@arbresearch.com. I never thought I’d end up as a journalist, but stranger things will happen.
Thanks to Alex Turner, Neel Nanda, Jan Kulveit, Adam Gleave, Alexander Gietelink Oldenziel, Marius Hobbhahn, Lauro Langosco, Steve Byrnes, Henry Sleight, Raymond Douglas, Robert Kirk, Yudhister Kumar, Quratulain Zainab, Tomáš Gavenčiak, Joel Becker, Lucy Farnik, Oliver Hayman, Sammy Martin, Jess Rumbelow, Jean-Stanislas Denain, Ulisse Mini, David Mathers, Chris Lakin, Vojta Kovařík, Zach Stein-Perlman, and Linda Linsefors for helpful comments.
Appendices
Appendix: Prior enumerations
Appendix: Graveyard
Appendix: Biology for AI alignment
Lots of agendas, but it's not clear if anyone besides Byrnes and Thiergart is actively turning the crank. Seems like it would need a billion dollars.
Human enhancement
Merging
As alignment aid
Appendix: Research support orgs
One slightly confusing class of org is described by the sample {CAIF, FLI}. Often run by active researchers with serious alignment experience, but usually not following an obvious agenda, delegating a basket of strategies to grantees, doing field-building stuff like NeurIPS workshops and summer schools.
CAIF
AISC
See also:
Appendix: Meta, mysteries, more
[1] Unless you zoom out so far that you reach vague stuff like “ontology identification”. We will see if this total turnover is true again in 2028; I suspect a couple will still be around, this time.
[2] “one can posit neural network interpretability as the GiveDirectly of AI alignment: reasonably tractable, likely helpful in a large class of scenarios, with basically unlimited scaling and only slowly diminishing returns. And just as any new EA cause area must pass the first test of being more promising than GiveDirectly, so every alignment approach could be viewed as a competitor to interpretability work.” – Niplav