Alignment is an unusual field because the base of fans and supporters is much larger than the number of researchers
Isn't this entirely usual? Like, I'd assume that there are more readers of popular physics books than working physicists. Similarly for nature documentary viewers vs biologists.
Maybe in contrast to other fields of ML? (Though that's definitely stopped being true for e.g. LLMs.)
I think the deciding difference is that the number of fans and supporters who want to be actively involved and who think the problem is the most important in the world is much larger than the number of researchers; while popular physics book readers and nature documentary viewers are plentiful, I doubt most of them feel a compelling need to become involved!
By contrast, some lines of research where I’ve seen compelling critiques (and haven’t seen compelling defences) of their core intuitions, and therefore don't recommend to people:
- Cooperative inverse reinforcement learning (the direction that Stuart Russell defends in his book Human Compatible); critiques here and here.
- John Wentworth’s work on natural abstractions; exposition and critique here, and another here.
The first critique of natural abstractions says:
Concluding thoughts on relevance to alignment: While we’ve made critical remarks on several of the details, we also want to reiterate that overall, we think (natural) abstractions are an important direction for alignment and it’s good that someone is working on them! In particular, the fact that there are at least four distinct stories for how abstractions could help with alignment is promising.
The second says:
I think this is a fine dream. It’s a dream I developed independently at MIRI a number of years ago, in interaction with others. A big reason why I slogged through a review of John's work is because he seemed to be attempting to pursue a pathway that appeals to me personally, and I had some hope that he would be able to go farther than I could have.
Neither of them seemed, to me, to be critiques of the "core intuitions"; rather, the opposite: both suggested that the core intuitions seemed promising; the weaknesses were elsewhere. That suggests that natural abstractions might be a better than average target for incoming researchers, not a worse one.
I have some other disagreements, but those are model-level disagreements; that piece of advice in particular seems to be misguided even under your own models. I think I agree with the overall structure and most of the prioritization (though would put scalable oversight lower, or focus on those bits that Joe points out are the actual deciding factors for whether that entire class of approaches is worthwhile - that seems more like "alignment theory with respect to scalable oversight").
Some recommended ways to upskill at empirical research (roughly in order):
For people specifically interested in getting into mechanistic interpretability, my guide to getting started may be useful - it's much more focused on the key, relevant parts of deep learning, with a bunch more interpretability specific stuff
For mechanistic interpretability research, we just released a new paper on neuron interpretability in LLMs, with a large discussion on superposition! See
Paper: https://arxiv.org/abs/2305.01610
Summary: https://twitter.com/wesg52/status/1653750337373880322
Eventually, once you've had a bunch of experience, you might notice a feeling of confusion or frustration: why is everyone else missing the point, or doing so badly at this? (Though note that a few top researchers commented on a draft to say that they didn't have this experience.) For some people that involves investigating a specific topic (for me, the question “what’s the best argument that AGI will be misaligned?”); for others it's about applying skills like conscientiousness (e.g. "why can't others just go through all the obvious steps?"). Being excellent seldom feels like you’re excellent, because your own abilities set your baseline for what feels normal.
I relate to this a lot. This feels like one of the clearer internal markers for me of what becoming good at interpretability research felt like - there's so much low-hanging fruit! Why aren't other people plucking it?
There's also just some internal sense of "I kind of know what I'm doing, and have ideas for what to do next", though this is much clearer to me when mentoring and advising other people, where I have strong opinions, than when applying it to myself, where I can sometimes pull it off but find it easy to fall into random spirals of doubt.
This is interesting; I'm still looking for my own (I think?) "comparative advantage" in this area. Some mental motions are very easy, while some "trivial" tasks feel harder (or would require me to already be involved full-time, leading to a chicken-and-egg problem).
(Pasting this exchange from a comment thread on the EA Forum; bolding added)
Peter Park:
Thank you so much for your insightful and detailed list of ideas for AGI safety careers, Richard! I really appreciate your excellent post.
I would propose explicitly grouping some of your ideas and additional ones under a third category: “identifying and raising public awareness of AGI’s dangers.” In fact, I think this category may plausibly contain some of the most impactful ideas for reducing catastrophic and existential risks, given that alignment seems potentially difficult to achieve in a reasonable period of time (if ever) and the implementation of governance ideas is bottlenecked by public support.
For a similar argument that I found particularly compelling, please check out Greg Colbourn’s recent post: https://forum.effectivealtruism.org/posts/8YXFaM9yHbhiJTPqp/agi-rising-why-we-are-in-a-new-era-of-acute-risk-and
Richard:
I don't actually think the implementation of governance ideas is mainly bottlenecked by public support; I think it's bottlenecked by good concrete proposals. And to the extent that it is bottlenecked by public support, that will change by default as more powerful AI systems are released.
Akash:
I appreciate Richard stating this explicitly. I think this is (and has been) a pretty big crux in the AI governance space right now.
Some folks (like Richard) believe that we're mainly bottlenecked by good concrete proposals. Other folks believe that we have concrete proposals, but we need to raise awareness and political support in order to implement them.
I'd like to see more work going into both of these areas. On the margin, though, I'm currently more excited about efforts to raise awareness [well], acquire political support, and channel that support into achieving useful policies.
I think this is largely due to (a) my perception that this work is largely neglected, (b) the fact that a few AI governance professionals I trust have also stated that they see this as the higher priority thing at the moment, and (c) worldview beliefs around what kind of regulation is warranted (e.g., being more sympathetic to proposals that require a lot of political will).
Scalable oversight: finding ways to leverage more powerful models to produce better reward signals
It might be worth clarifying how you expect this to help, and to make clear where you'd expect other researchers to disagree.
For instance, for debate, one could believe:
1) Debate will work for long enough for us to use it to help make progress towards an alignment solution.
2) Debate is a plausible basis for an alignment solution.
To me (2) seems fairly clearly false - at the very least it's not doing anything about inner alignment (debate on weights/activations does nothing to address this, since there's still no [debaters are aiming to win the game] starting point).
Viewing it as a question-answering system is similarly confused: it's an [output whatever text is selected by the debate process] system.
We can't have both [debaters optimise for a debate win] and [debate robustly remains a question-answering system] - at least without making obviously false assumptions about a human-based judge system.
Could Debate be a component of an alignment solution? Sure.
Is it the part that seems hard/neglected? No.
On (1) I'm less clear; however, here the case that needs to be made is that debate approaches will be more useful before they become dangerous than e.g. simulators or conditioning predictive models (which I agree will also break at some point).
This is not obviously false, but I don't see a good argument for it. If I have to bet which of these approaches has the lowest [capability before deceptive alignment] (cbda) threshold, my money is currently on debate (and indeed RRM). Imitative amplification seems plausibly safer, but only to the degree that it's less efficient - so still unclear it gets higher cbda (if distillation ends up buying efficiency, I expect it to throw out the imitative rationale for safety in the process).
To me, most of the value to a new researcher in studying debate would lie in:
And as Eliezer/Nate/John... would point out, this doesn't require getting into the details of the mechanism design - only to notice that the mechanism is doing nothing to address the fundamentals of the problem.
I'd be genuinely interested if I'm wrong on any of this - it'd be nice if debate were actually useful! (I don't claim to be making all the necessary arguments above - just pointing out my current belief)
To me (2) seems fairly clearly false - at the very least it's not doing anything about inner alignment (debate on weights/activations does nothing to address this, since there's still no [debaters are aiming to win the game] starting point).
Why do you believe this? It's fairly plausible to me that "train an AI to use interpretability tools to show that this other AI is being deceptive" is the kind of scalable oversight approach that might work, especially for detecting inner misalignment, if you can get the training right and avoid cooperation. But that seems like a plausibly solvable problem to me
The problem is robustly getting the incentive to show that the other AI is being deceptive.
Giving access to the weights, activations and tools may give debaters the capability to expose deception - but that alone gets you nothing.
You're still left saying:
So long as we can get the AI to robustly do what we want (i.e. do its best to expose deception), we can get the AI to robustly do what we want.
Similarly, "...and avoid cooperation" is essentially the entire problem.
To be clear, I'm not saying that an approach of this kind will never catch any instances of an AI being deceptive. (this is one reason I'm less certain on (1))
I am saying that there's no reason to predict anything along these lines should catch all such instances.
I see no reason to think it'll scale.
Another issue: unless you have some kind of true name of deception (I see no reason to expect this exists), you'll train an AI to detect [things that fit your definition of deception], and we die to things that didn't fit your definition.
These are all arguments about the limit; whether or not they're relevant depends on whether they apply to the regime of "smart enough to automate alignment research".
Agreed.
Are you aware of any work that attempts to answer this question?
Does this work look like work on debate?
(not rhetorical questions!)
My guess is that work likely to address this does not look like work on debate.
Therefore my current position remains: don't bother working on debate; rather work on understanding the fundamentals that might tell you when it'll break.
The world won't be short of debate schemes.
It'll be short of principled arguments for their safe application.
For instance, for debate, one could believe:
1) Debate will work for long enough for us to use it to help find an alignment solution.
2) Debate is a plausible basis for an alignment solution.
I generally don't think about things in terms of this dichotomy. To me, an "alignment solution" is anything that will align an AGI which is then capable of solving alignment for its successor. And so I don't think you can separate these two things.
(Of course I agree that debate is not an arbitrarily scalable alignment solution in the sense that you can just keep training models using debate without adding any more techniques; but I don't think that really matters. We need to get to the moon, not to Andromeda.)
Oh, to be clear, with "to help find" I only mean that we expect to make significant progress using debate. If we knew we'd safely make enough progress to get to a solution, then you're quite right that that would amount to (2). (apologies for lack of clarity if this was the miscommunication)
That's the distinction I mean to make between (1) and (2): we need to get to the moon safely.
With (1) we have no idea when our rocket will explode.
Similarly, we have no idea whether the moon will be far enough to know when our next rocket will explode. (not that I'm knocking robustly getting to the moon safely)
If we had some principled argument telling us how far we could push debate before things became dangerous, that'd be great. I'm claiming that we have no such argument, and that all work on debate (that I'm aware of) stands near-zero chance of finding one.
Of course I'm all for work "on debate" that aims at finding that kind of argument - however, I would expect that such work leaves the specifics of debate behind pretty quickly.
Thanks Richard for this post and prior advice!
I was planning to make a post at some point with some advice that's closely related to this post but I will share it here as a preview. Take note that I don't yet have strong evidence that my work is good or has mattered (and I was going to write a full post once I had more evidence for that). I think Richard's advice above is really good and I'll try to take some of the ideas more on board with my own work.
Last year I quit my job and upskilled for 6 months, and now I'm doing independent research which might turn out to be valuable. Regardless of its value, I've learnt a lot and it's created many opportunities for me. I went to EAG, attended Richard's talk there, and later joined a group conversation where he was talking about this mentorship constraint deal. This left a strong impression on me, leading me to take some degree of pride in my attempts to be independent and not rely as strongly on individual mentorship. However, there are just a bunch of caveats/perspectives that I currently have which relate to this.
All of these relate to empirical alignment research and not governance or other forms of research. I'm mostly focussed on providing advice for how to be more productive independently of other people but that shouldn't be your preference and I suspect people are more productive at orgs/in groups.
So a bunch of ideas on the topic:
I hope this is useful for people!
I think work on the study of abstraction, one way or another, will be essential to AI alignment. Even "just" being able to make very precise high-level predictions of (an AI's behavior FROM its internal state) or (human values FROM measured neurological data) requires enough abstraction-understanding to know whether the simplification is really capturing what we want.
I don't know if the natural abstractions hypothesis is really necessary for this. But something like a more developed/complete version of Wentworth's "minimal maps" representation of abstraction seems more needed.
Maybe if it's "direct" enough, we just get mech. interp. again? In my head, some kind of abstraction is necessary if we go by the "Rocket Alignment" analogy.
Here I proposed a systematic framework for classifying AI safety work. This is a matrix, where one dimension is the system level:
Another dimension is the "time" of consideration:
There would be 6*4 = 24 slots in this matrix, and almost all of them have something interesting to research and design, and none of them is "too early" to consider.
Scalable oversight: (monolithic) AI system * manufacturing time
Mechanistic interpretability: (monolithic) AI system * manufacturing time, also design time (e.g., in the context of the research agenda of weaving together theories of cognition and cognitive development, ML, deep learning, and interpretability through the abstraction-grounding stack, interpretability plays the role of empirical/experimental science work)
Alignment theory: Richard phrases it vaguely, but his references, primarily to MIRI-style work, reveal that he mainly means "(monolithic) AI system * design, manufacturing, and operations time".
Evaluations, unrestricted adversarial training: (monolithic) AI system * manufacturing, operations time
Threat modeling: system of AIs (rarely), human + AI group, whole civilisation * deployment time, operations time, evolutionary time
Governance research, policy research: human + AI group, whole civilisation * mostly design and operations time.
To me, it seems almost certain that many current governance institutions and democratic systems will not survive the AI transition of civilisation. Bengio recently hinted at the same conclusion.
Human+AI group design (scale-free: small group, org, society) and the civilisational intelligence design must be modernised.
Richard mostly classifies this as "governance research", which has a connotation that this is a sort of "literary" work and not science, with which I disagree. There is a ton of cross-disciplinary hard science to be done about group intelligence and civilisational intelligence design: game theory, control theory, resilience theory, linguistics, political economy (rebuilt as a hard science, of course, on the basis of resource theory, bounded rationality, economic game theory, etc.), cooperative reinforcement learning, etc.
I feel that the design of group intelligence and civilisational intelligence is an area under-appreciated by the AI safety community. Some people do this (Eric Drexler, davidad, the cip.org team, ai.objectives.institute, the Digital Gaia team, and the SingularityNET team, although the latter are less concerned about alignment), but I feel that far more work is needed in this area.
There is also a place for "literary", strategic research, but I think it should mostly concern deployment time of group and civilisational intelligence designs, i.e., the questions of transition from the current governance systems to the next-generation, computation- and AI-assisted systems.
Also, operations and evolutionary time concerns of everything (AI systems, systems of AIs, human+AI groups, civilisation) seem to be under-appreciated and under-researched: alignment is not a "problem to solve", but an ongoing, manufacturing-time and operations-time process.
I would be interested in some advice going a step further -- assuming a roughly sufficient technical skill level (in my case, soon-to-be PhD in an application of ML), as well as an interest in the field, how to actually enter the field with a full-time position? I know independent research is one option, but it has its pros and cons. And companies which are interested in alignment are either very tiny (=not many positions), or very huge (like OpenAI et al., =very selective)
Case studies: finding algorithms inside networks that implement specific capabilities. My favorite papers here are Olsson et al. (2022), Nanda et al. (2023), Wang et al. (2022) and Li et al. (2022); I’m excited to see more work which builds on the last in particular to find world-models and internally-represented goals within networks.
If you want to build on Li et al (the Othello paper), my follow-up work is likely to be a useful starting point, and then the post I wrote about future directions I'm particularly excited about
Preventing neural network weight exfiltration (by third parties or an AI itself)
This is really really interesting; a fairly "normal" infosec concern to prevent IP/PII theft, plus a (necessary?) step in many AGI risk scenarios. Is the claim that one could become a "world expert" specifically in this (ie without becoming an expert in information security more generally)?
People often ask me for career advice related to AGI safety. This post (now also translated into Spanish) summarizes the advice I most commonly give. I’ve split it into three sections: general mindset, alignment research and governance work. For each of the latter two, I start with high-level advice aimed primarily at students and those early in their careers, then dig into more details of the field. See also this post I wrote two years ago, containing a bunch of fairly general career advice.
General mindset
In order to have a big impact on the world you need to find a big lever. This document assumes that you think, as I do, that AGI safety is the biggest such lever. There are many ways to pull on that lever, though—from research and engineering to operations and field-building to politics and communications. I encourage you to choose between these based primarily on your personal fit—a combination of what you're really good at and what you really enjoy. In my opinion the difference between being a great versus a mediocre fit swamps other differences in the impactfulness of most pairs of AGI-safety-related jobs.
How should you find your personal fit? To start, you should focus on finding work where you can get fast feedback loops. That will typically involve getting hands-on or doing some kind of concrete project (rather than just reading and learning) and seeing how quickly you can make progress. Eventually, once you've had a bunch of experience, you might notice a feeling of confusion or frustration: why is everyone else missing the point, or doing so badly at this? (Though note that a few top researchers commented on a draft to say that they didn't have this experience.) For some people that involves investigating a specific topic (for me, the question “what’s the best argument that AGI will be misaligned?”); for others it's about applying skills like conscientiousness (e.g. "why can't others just go through all the obvious steps?"). Being excellent seldom feels like you’re excellent, because your own abilities set your baseline for what feels normal.
What if you have that experience for something you don't enjoy doing? I expect that this is fairly rare, because being good at something is often very enjoyable. But in those cases, I'd suggest trying it until you observe that even a string of successes doesn't make you excited about what you're doing; and at that point, probably trying to pivot (although this is pretty dependent on the specific details).
Lastly: AGI safety is a young and small field; there’s a lot to be done, and still very few people to do it. I encourage you to have agency when it comes to making things happen: most of the time the answer to “why isn’t this seemingly-good thing happening?” or “why aren’t we 10x better at this particular thing?” is “because nobody’s gotten around to it yet”. And the most important qualifications for being able to solve a problem are typically the ability to notice it and the willingness to try. One anecdote to help drive this point home: a friend of mine has had four jobs at four top alignment research organizations; none of those jobs existed before she reached out to the relevant groups to suggest that they should hire someone with her skillset. And this is just what’s possible within existing organizations—if you’re launching your own project, there are far more opportunities to do totally novel things. (The main exception is when it comes to outreach and political advocacy. Alignment is an unusual field because the base of fans and supporters is much larger than the number of researchers, and so we should be careful to avoid alignment discourse being dominated by advocates who have little familiarity with the technical details, and come across as overconfident. See the discussion here for more on this.)
Alignment research
I’ll start with some high-level recommendations, then give a brief overview of how I see the field.
Some recommended ways to upskill at empirical research (roughly in order):
Each of these teaches you important skills for good research: how to implement algorithms, how to debug code and experiments, how to interpret results, etc. Once you’ve implemented an algorithm or replicated a paper, you can then try to extend the results by improving the techniques somehow.
Alignment research directions
From my perspective, the most promising alignment research falls into three primary categories. I outline those below, as well as three secondary categories I think are valuable. Note that I expect the boundaries between all of these to blur over time as research on them progresses, and as we automate more and more things.
Three other research areas that seem important, but less central:
By contrast, some lines of research which I think are overrated by many newcomers to the field, along with some critiques of them:
Governance work
I mentally split this into three categories: governance research, lab governance, and policy jobs. A few high-level takeaways for each:
List of governance topics
Here are some topics where I wish we had a world expert on applying them to AGI safety. One example of what great work on one of these topics might look like: Baker’s paper on lessons from nuclear arms control (a topic which would have been on this list if he hadn’t written that).
One cluster of topics can be described roughly as “anything mentioned in Yonadav Shavit’s compute governance paper”, in particular:
Another cluster: security-related topics such as
And a more miscellaneous (and less technical) third category: