New roles on my team: come build Open Phil's technical AI safety program with me!

Ajeya Cotra

Open Phil announced two weeks ago that we’re hiring for over 20 roles across our teams working on global catastrophic risk reduction — and we’ll answer questions at our AMA starting tomorrow. Ahead of that, I wanted to share some information about the roles I’m hiring for on my team (Technical AI Safety). This team is aiming to think through what technical research could most help us understand and reduce AI x-risk, and build thriving fields in high priority research areas by making grants to great projects and research groups.

First of all — since we initially listed roles on Sep 29, we’ve added three new roles in Technical AI Safety that you might not have seen yet if you only saw the original announcement! In addition to the (Senior) Program Associate role that was there originally, we added an Executive Assistant role last week — and yesterday we added a (Senior) Research Associate role and a role for a Senior Program Associate specializing in a particular subfield of AI safety research (e.g. interpretability, alignment theory, etc). Check those out if they seem interesting! The Executive Assistant role in particular requires a very different, less technical skill set.

Secondly, before starting to answer AMA questions, I wanted to highlight that our technical AI safety giving is far from where it should be at equilibrium: there is considerable room to grow, and hiring more people is likely to lead quickly to more and better grants. My estimate is that last year, we recommended around $25M in grants to technical AI safety,^[1] and so far this year I’ve recommended a similar amount. With more capacity for grant evaluation, research, and operations, we think this could pretty readily double or more.

All of our GCR teams (Technical AI Safety led by me, Capacity Building led by Claire Zabel, AI Governance and Policy led by Luke Muehlhauser, and Biosecurity led by Andrew Snyder-Beattie) are heavily capacity constrained right now — especially the teams that do work related to AI, given the recent boom in interest and activity in that area. I think my team currently faces even more severe constraints than other program teams. Compared to other teams, my team:

- Is much smaller: Until literally last week, it was just me focusing primarily on technical AI safety (although Claire’s team sometimes funds technical AI safety work, primarily upskilling). Last week, Max Nadeau joined as my first Program Associate. In contrast, the capacity building team has eight people, and the biosecurity and AI governance teams each have five people.
- Likely has worse “coverage” of its field:
  - Ideally, a robust and committed grantmaking team in a given field would:
    - Maintain substantive relationships with the most impactful / promising (say) 5-30% of existing grantees, potential grantees, and key non-grantee players (e.g. people working on AI safety in industry labs) in their field.
    - Have pretty robust systems for hearing about most of the plausible potential new grantees in their field (via e.g. application forms or strong referral networks).
    - Have the bandwidth to give non-trivial consideration to a large fraction of plausible potential grantees, in order to make an informed, explicit decision about whether to fund them and how much.
    - Have the bandwidth to retrospectively evaluate what came out of large grants or important categories of grant.
  - My team has absolutely nowhere near that level of coverage (for example, we haven’t had the time to open application forms or to get to know academics who could work on safety). While all our GCR program areas could use a lot more “field coverage,” my guess is that our coverage in technical AI safety is considerably worse than the coverage that at least Claire and Andrew get in their fields. Not only does this team have fewer people to cover its field with, the set of plausible potential players feels like it could well be larger, since large numbers of technical people have started to get a lot more interested in AI safety recently.
- Has a more nascent strategy: While we’ve been funding technical AI safety research in one form or another since 2015, the program area has switched leadership and strategic direction multiple times,^[2] and the current iteration is pretty close to a fresh slate — we’ve closed out most of our old programs and are looking to build out a fresh stable of grantmaking initiatives from the ground up.
  - One reason our strategy is up in the air is that the team in its current iteration is very new, and advances in AI capabilities are rapidly changing the landscape of tractable research projects. I’ve led the program area for less than a year, and most of the grants I’ve made have been to new groups that didn’t exist before 2021 and/or to research projects that weren’t even practically feasible to do before the last couple of years. In contrast, other program leads have been building out a strategy for a few years or more.
  - Another big reason is that we have a huge number of unanswered questions about what technical projects we most want to see, what kind of results would most change our mind about key questions or move the needle on key safety techniques, and how we should prioritize between different streams of object-level work. For example, better answers to questions like these could change what research areas we go big on and what we pitch to potential grantees:
    - How can we tell how promising an interpretability technique is? What are the best “internal validity” measures of success? What are the best downstream tasks to measure?
    - What are the elements of an ideal model organism for misalignment, and what are the challenges to creating such a model?
    - What is the most compelling theory of change / path to impact for research on adversarial attacks and defenses, and what is the most exciting version of that kind of research?
    - Are there some empirical research directions inspired by the assistance games / reward uncertainty tradition which could be helpful even in a language model paradigm?

If you join the technical AI safety team in this round, you could help relieve some severe bottlenecks while building this new iteration of the program area from the ground up. If this sounds exciting to you, I strongly encourage you to apply!

^[1] Interestingly, these figures are actually considerably larger than annual technical AI safety giving in the several years before that, even though we had fewer full-time-equivalent staff working in the area in 2022 and 2023 compared to 2015-2021.
^[2] Initially, our program was led by Daniel Dewey. By around 2019, Catherine Olsson had joined the team, and eventually (I think by 2020-2021) it transitioned to being a team of three run by Nick Beckstead, who managed Catherine and Daniel, as well as Asya Bergal at half her time. In 2021, all three of Daniel, Catherine, and Nick left for other roles. For an interim period, there was no single point person: Holden was personally handling bigger grants (e.g. Redwood Research), and Asya was handling smaller grants (e.g. an RFP that Nick originally started and our PhD fellowship). Holden then moved on to direct work and Asya went full-time on capacity building. I began doing grantmaking in Oct 2022, and quickly ended up full-time handling FTXFF bailout grants. Since late January 2023 or so, I’ve been presiding over a more normal program area.

Here's some (hopefully useful) context on why I (SERI MATS 4.0, independent alignment researcher) feel helplessness at the idea of applying: I expect to not actually make a difference by working as a part of your team, because I don't expect my model of the alignment problem [which is essentially that of MIRI and John Wentworth] to be shared by you or the OpenPhil leadership.

From your updated timelines post:

I don’t expect a discontinuous jump in AI systems’ generality or depth of thought from stumbling upon a deep core of intelligence; I’m not totally sure I understand it but I probably don’t expect a sharp left turn.

This is probably our biggest crux. To me it seems pretty clear that "capabilities generalize farther than alignment". The existence of a "deep core of intelligence" also seems very obvious to me, although behaviorally I'm currently uncertain as to whether we could see a discontinuous jump in AI systems' generality.

Overall I sense that what is being selected for may be close to "be as epistemically accurate in decisions and communications as possible given our constraints of moving fast", and it makes sense; but I expect that this also selects for people who are less comfortable with the sort of non-verbal epistemic reasoning heuristics that seem crucial to not slide your attention away from noticing that one is confused, and by extension, the "hard parts of the problem". I think the former is very useful when dealing with problems in domains where we have a clear idea of the problem (rocket engineering), but probably net negative when dealing with a domain we are still confused about.

Can you say more about what you mean by "capabilities generalize further than alignment?"

Sure. I've made two attempts to point at what I mean: one Yudkowsky-like, and the other Nate-like. I'm hoping that the combination should at least make someone get what I'm pointing at.

Attempt 1

There is a 'ground truth' for capabilities, and that is our universe. Our universe is coherent and Lawful -- 2+2=4, and <the fundamental physics laws governing our universe> hold at every moment, everywhere. Every piece of data given to an optimizer tells the optimizer about these things. You can learn arithmetic from a thousand different examples of data drawn from the real world, none of which need to be explicitly about arithmetic. You can detect the shape of the physical laws constraining our universe in a myriad of ways, none of which makes it obvious to you or me how these laws can be inferred from the data. Combine that with highly focused optimization pressure, and what you get is a system that is incredibly capable.

There are infinitely many paths to the truth of reality, and that is reflected in the data we provide an optimizer. This is not the case for our values. Human values are very complex and arbitrary -- a result of the specific brain architecture we seem to have evolved.

Every data point an optimizer is provided from the real world tells it the same thing about 'capabilities': 2+2 = 4, for example. Even inaccurate datapoints are causally upstream of a coherent universe, and therefore provide the optimizer information about the causes of these inaccurate data points. If an optimizer uses 'proxy correlates' of reality, it shall soon start to converge to understanding the actual structure of reality.

In contrast, it does not seem to be the case that we know how to get an optimizer to converge to understanding the actual goal (even if it is something as simple as "maximize the amount of diamonds in the universe"). All we seem to know how to do is to train proxy correlates into a model. These proxy correlates do not generalize out of distribution, and once an optimizer 'groks' reality, it shall see the ways it can achieve the outcomes it is meant to achieve using paths other than the ones it was shaped to follow by the stupider optimizers that built it.
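
A toy sketch of the proxy-correlate point, under strong simplifying assumptions: the "diamond" and "shiny" features, the two rules, and both datasets are hypothetical and chosen only for illustration. It shows a proxy rule that is indistinguishable from the intended goal on the training distribution and then comes apart from it once the distribution shifts. It says nothing about how likely this is in real systems.

```python
from itertools import product

# Hypothetical toy setup: all names and data below are illustrative assumptions.

def intended_goal(x):
    # What the designers actually want rewarded: the object is a diamond.
    return x["is_diamond"]

def proxy_rule(x):
    # A proxy that fits the training data equally well: "shiny things are good".
    return x["is_shiny"]

# Training distribution: shininess and diamond-ness always coincide.
train = [{"is_diamond": d, "is_shiny": d} for d in (True, False) for _ in range(50)]

# Shifted distribution: all four combinations occur (shiny glass, dull diamonds, ...).
shifted = [{"is_diamond": d, "is_shiny": s} for d, s in product((True, False), repeat=2)]

def agreement(rule, target, data):
    # Fraction of examples on which the rule matches the target.
    return sum(rule(x) == target(x) for x in data) / len(data)

train_agreement = agreement(proxy_rule, intended_goal, train)
shifted_agreement = agreement(proxy_rule, intended_goal, shifted)
print(train_agreement, shifted_agreement)
```

On the training set nothing distinguishes the proxy from the goal, so behavioral feedback alone cannot select between them; the difference only shows up on inputs the training data never covered.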

Attempt 2

Until now, all SOTA AI systems we can see are limited-domain consequentialists (or approximations thereof). None of them are truly general in the sense that they seem to be able to chain actions across multiple wildly differing domains (social, programming, cognitive heuristics improvement, maintenance and upgrade of the infrastructure the AI system is running on -- to give a few examples) to achieve whatever outcomes they could be perceived as aiming towards^[1]. GPT-4 is a predictor that can be prompted to simulate a consequentialist (such as a human being), but GPT-4 is not capable enough to simulate the cross-domain capabilities of such an agent, at least as far as I know.

When your 'alignment' techniques involve training an AI system to behave in ways you like when the AI system is restricted to these isolated domains, all you are doing is teaching your system decision-making influences that are proxies of the actual values you wish the system would have. These decision-making influences will not hold across all domains that an AI might chain their actions across -- and this will especially be true in the case of the domains that enable an AI system to chain actions across multiple widely differing domains, such as abstract reasoning. Since the specific ontology that an AI system uses itself changes what inputs and outputs an AI system has for its abstract reasoning algorithms, you cannot use external behavioral outputs as evidence to be able to shape an AI's reasoning^[2].

^[1] Note, this does not necessarily mean that we shall see systems with cleanly describable internals that seem to contain a concrete 'outcome' that the AI is 'intentionally' trying to achieve! I'm describing what we can infer based on the observed behavior of such an AI system -- it seems far more likely that such systems will not have such clean 'outcomes' in mind that they are deliberately aiming towards, even if one can easily imagine evidence of easily detectable convergent instrumental goals (which do not provide us much evidence for whether or not a model is aligned).
^[2] Which is why, it seems, a lot of people working on AGI alignment are converging on ontology identification as the goal of their research agendas.

Thanks!

I wasn't expecting such a detailed answer; I guess I should have asked a more specific question. This is great though. The thing I was confused about was: "Capabilities generalize further than alignment" makes it sound like capabilities-properties or capabilities-skills (such as accurate beliefs, heuristics for arriving at accurate beliefs, useful convergent instrumental strategies, etc.) will work in a wider range of environments than alignment-properties like honesty, niceness, etc. But I don't think what you've said establishes that.

But I think what you mean is different -- you mean "If you train an AI using human feedback on diverse tasks, hoping it'll acquire both general-purpose capabilities and also robust alignment properties, what'll happen by default is that it DOES acquire the former but it does not acquire the latter." (And the reason for this is basically that capabilities properties are more simple/natural/universal/convergent than alignment properties; with alignment properties there are all sorts of other similarly-simple properties that perform just as well in training, but for capabilities properties there generally aren't (at least not for sufficiently diverse challenging environments; in simple environments they just 'memorize' or otherwise learn 'simple heuristics that don't generalize'))

Is this an accurate summary of your view?

Yes, as far as I can tell. "Alignment properties" do not seem to me to be convergent or universal in any way.

I would definitely apply if I didn't feel like my personal priority should be on focusing on my object level Safety Eval Red Teaming work, and pursuing my AI Alignment Research Agenda.

