Big Picture AI Safety: Introduction

EuanMcLean


23rd May 2024


tldr: I conducted 17 semi-structured interviews of AI safety experts about their big picture strategic view of the AI safety landscape: how will human-level AI play out, how things might go wrong, and what should the AI safety community be doing. While many respondents held “traditional” views (e.g. the main threat is misaligned AI takeover), there was more opposition to these standard views than I expected, and the field seems more split on many important questions than someone outside the field may infer.

What do AI safety experts believe about the big picture of AI risk? How might things go wrong, what should we do about it, and how have we done so far? Does everybody in AI safety agree on the fundamentals? Which views are consensus, which are contested and which are fringe? Maybe we could learn this from the literature (as in the MTAIR project), but many ideas and opinions are not written down anywhere; they exist only in people’s heads and in lunchtime conversations at AI labs and coworking spaces.

I set out to learn what the AI safety community believes about the strategic landscape of AI safety. I conducted 17 semi-structured interviews with a range of AI safety experts. I avoided going into any details of particular technical concepts or philosophical arguments, instead focussing on how such concepts and arguments fit into the big picture of what AI safety is trying to achieve.

This work is similar to the AI Impacts surveys, Vael Gates’ AI Risk Discussions, and Rob Bensinger’s existential risk from AI survey. It differs from those projects in that my approach to both the interviews and the analysis is more qualitative. Part of the hope for this project was that it could hit on harder-to-quantify concepts that are too ill-defined or intuition-based to fit the format of previous survey work.

Questions

I asked the participants a standardized list of questions.

What will happen?
- Q1 Will there be a human-level AI? What is your modal guess of what the first human-level AI (HLAI) will look like? I define HLAI as an AI system that can carry out roughly 100% of economically valuable cognitive tasks more cheaply than a human.
  - Q1a What’s your 60% or 90% confidence interval for the date of the first HLAI?
- Q2 Could AI bring about an existential catastrophe? If so, what is the most likely way this could happen?
  - Q2a What’s your best guess at the probability of such a catastrophe?
What should we do?
- Q3 Imagine a world where, absent any effort from the AI safety community, an existential catastrophe happens, but actions taken by the AI safety community prevent such a catastrophe. In this world, what did we do to prevent the catastrophe?
- Q4 What research direction (or other activity) do you think will reduce existential risk the most, and what is its theory of change? Could this backfire in some way?
What mistakes have been made?
- Q5 Are there any big mistakes the AI safety community has made in the past or are currently making?

These questions changed gradually as the interviews went on (given feedback from participants), and I didn’t always ask the questions exactly as I’ve presented them here. I asked participants to answer from their internal model of the world as much as possible and to avoid deferring to the opinions of others (their inside view, so to speak).

Participants

Adam Gleave is the CEO and co-founder of the alignment research non-profit FAR AI. (Sept 23)
Adrià Garriga-Alonso is a research scientist at FAR AI. (Oct 23)
Ajeya Cotra leads Open Philanthropy’s grantmaking on technical research that could help to clarify and reduce catastrophic risks from advanced AI. (Jan 24)
Alex Turner is a research scientist at Google DeepMind on the Scalable Alignment team. (Feb 24)
Ben Cottier is a researcher specializing in key trends and questions that will shape the trajectory and governance of AI at Epoch AI. (Oct 23)
Daniel Filan is a PhD candidate at the Centre for Human-Compatible AI under Stuart Russell and runs the AXRP podcast. (Feb 24)
David Krueger is an assistant professor in Machine Learning and Computer Vision at the University of Cambridge. (Feb 24)
Evan Hubinger is an AI alignment stress-testing researcher at Anthropic. (Feb 24)
Gillian Hadfield is a Professor of Law & Strategic Management at the University of Toronto and holds a CIFAR AI Chair at the Vector Institute for Artificial Intelligence. (Feb 24)
Holly Elmore is currently running the US front of the Pause AI Movement and previously worked at Rethink Priorities. (Jan 24)
Jamie Bernardi co-founded BlueDot Impact and ran the AI Safety Fundamentals community, courses and website. (Oct 23)
Neel Nanda runs Google DeepMind’s mechanistic interpretability team. (Feb 24)
Nora Belrose is the head of interpretability research at EleutherAI. (Feb 24)
Noah Siegel is a senior research engineer at Google DeepMind and a PhD candidate at University College London. (Jan 24)
Ole Jorgensen is a member of technical staff at the UK Government’s AI Safety Institute (this interview was conducted before he joined). (Mar 23)
Richard Ngo is an AI governance researcher at OpenAI. (Feb 24)
Ryan Greenblatt is an AI safety researcher at the AI safety non-profit Redwood Research. (Feb 24)

These interviews were conducted between March 2023 and February 2024, and represent the participants’ views at the time.

A very brief summary of what people said

What will happen?

Many respondents expected the first human-level AI (HLAI) to be in the same paradigm as current large language models (LLMs) like GPT-4, probably scaled up (made bigger), with some new tweaks and hacks, and scaffolding like AutoGPT to make it agentic. But a smaller handful of people predicted that larger breakthroughs are required before HLAI. The most common story of how AI could cause an existential disaster was the story of unaligned AI takeover, but some explicitly pushed back on the assumptions behind the takeover story. Some took a more structural view of AI risk, emphasizing threats like instability, extreme inequality, gradual human disempowerment, and a collapse of human institutions.

What should we do about it?

When asked how AI safety might prevent disaster, respondents focussed most on

- the technical solutions we might come up with,
- spreading a safety mindset through AI research,
- promoting sensible AI regulation,
- and helping build a fundamental science of AI.

The research directions people were most excited about were mechanistic interpretability, black box evaluations, and governance research.

What mistakes have been made?

Participants pointed to a range of mistakes they thought the AI safety movement had made. There was no consensus and the focus was quite different from person to person. The most common themes included:

- an overreliance on overly theoretical argumentation,
- being too insular,
- putting people off by pushing weird or extreme views,
- supporting the leading AGI companies resulting in race dynamics,
- not enough independent thought,
- advocating for an unhelpful pause to AI development,
- and historically ignoring policy as a potential route to safety.

Limitations

- People had somewhat different interpretations of my questions, so they were often answering questions that were subtly different from each other.
- The sample of people I interviewed is not necessarily a representative sample of the AI safety movement as a whole. The sample was pseudo-randomly selected, optimizing for a) diversity of opinion, b) diversity of background, c) seniority, and d) who I could easily track down. Noticeably, there is an absence of individuals from MIRI, a historically influential AI safety organization, or those who subscribe to similar views. I approached some MIRI team members but no one was available for an interview. This is especially problematic since many respondents criticized MIRI for various reasons, and I didn’t get much of a chance to integrate MIRI’s side of the story into the project.
- There will also be a selection bias due to everyone I asked being at least somewhat bought into the idea of AI being an existential risk.
- A handful of respondents disagreed with the goal of this project: they thought that those in AI safety typically spend too much time thinking about theories of impact.
- There were likely a whole bunch of framing effects that I did not control for.
- There was in some cases a large gap in time between the interview and this being written up (mostly between 1 and 4 months, a year for one early interview). Participant opinions may have changed over this period.

Subsequent posts

In the following three posts, I present a condensed summary of my findings, describing the main themes that came up for each question:

- What will happen? What will human-level AI look like, and how might things go wrong?
- What should we do? What should AI safety be trying to achieve and how?
- What mistakes has the AI safety movement made?

You don’t need to have read an earlier post to understand a later one, so feel free to zoom straight in on what interests you.

I am very grateful to all of the participants for offering their time to this project. Also thanks to Vael Gates, Siao Si Looi, ChengCheng Tan, Adam Gleave, Quintin Davis, George Anadiotis, Leo Richter, McKenna Fitzgerald, Charlie Griffin and many of the participants for feedback on early drafts.

This work was funded and supported by FAR AI.

7 comments

Akash (7mo)

> What do AI safety experts believe about the big picture of AI risk?

I would be careful not to implicitly claim that these 17 people are a "representative sample" of the AI safety community. Or, if you do want to make that claim, I think it's important to say a lot more about how these particular participants were chosen and why you think they are representative.

At first glance, it seems to me like this pool of participants over-represents some worldviews and under-represents others. For example, it seems like the vast majority of the participants work for AGI labs, Open Philanthropy, or close allies/grantees of OP. OP undoubtedly funds a lot of AIS groups, but there are lots of experts who approach AIS from a different set of assumptions and worldviews.

More specifically, I'd say this list of 17 experts over-represents what I might refer to as the "Open Phil + AGI labs + people funded by or close to those entities" cluster of thinkers (who IMO are generally more optimistic than folks at groups like MIRI, Conjecture, CAIS, FLI, etc.) and over-represents people who are primarily focused on technical research (who IMO are generally more optimistic about technical alignment, more likely to believe empirical work is better than conceptual work, and more likely to believe in technical rather than socio-technical approaches).

To be clear, I still think that work like this is and can be important. Also, there is some representation from people outside of the particular subculture I'm claiming is over-represented.

But I think it is very hard to do a survey that actually meaningfully represents the AI safety community, and I think there are a lot of subjective decisions that go into figuring out who counts as an "expert" in the field.

ryan_greenblatt (7mo)

I think it probably doesn't make sense to talk about "representative samples".

Here are a bunch of different things this could mean:

- A uniform sample from people who have done any work related to AI safety.
- A sample from people weighted to their influence/power in the AI safety community.
- A sample from people weighted by how much I personally respect their views about AI risk.

Maybe what you mean is: "I think this sample underrepresents a world view that I think is promising. This world view is better represented by MIRI/Conjecture/CAIS/FLI/etc."

I think programs like this one should probably just apply editorial discretion and note explicitly that they are doing so.

(This complaint is also a complaint about the post which does try to use a notion of "representative sample".)

ryan_greenblatt (7mo)

> I would be careful not to implicitly claim that these 17 people are a "representative sample" of the AI safety community.

Worth noting that this is directly addressed in the post:

> The sample of people I interviewed is not necessarily a representative sample of the AI safety movement as a whole. The sample was pseudo-randomly selected, optimizing for a) diversity of opinion, b) diversity of background, c) seniority, and d) who I could easily track down. Noticeably, there is an absence of individuals from MIRI, a historically influential AI safety organization, or those who subscribe to similar views. I approached some MIRI team members but no one was available for an interview. This is especially problematic since many respondents criticized MIRI for various reasons, and I didn’t get much of a chance to integrate MIRI’s side of the story into the project.

So, in this case, I would say this is explicitly disclaimed, not implicitly claimed.

DanielFilan (7mo)

> OP undoubtedly funds a lot of AIS groups, but there are lots of experts who approach AIS from a different set of assumptions and worldviews.

Note that the linked paper includes a bunch of authors from AGI labs or who have received OpenPhil funding.

Akash (7mo)

Which of the institutions would you count as AGI labs? (Genuinely curious: usually I don't think about academic labs [relative to, say, ODA + Meta + Microsoft], but perhaps there are some that I should be counting.)

And yeah, OP funding is a weird metric because there's a spectrum of how much grantees are closely tied to OP. Like, there's a wide spectrum from "I have an independent research group and got 5% of my total funding from OP" all the way to like "I get ~all my funding from OP and work in the same office as OP and other OP allies and many of my friends/colleagues are OP etc."

That's why I tried to use the phrase "close allies/grantees", to convey more of this implicit cultural stuff than merely "have you ever received OP $." My strong impression is that the authors of the paper are much more intellectually/ideologically/culturally independent from OP, relative to the list of 17 interviewees presented above.

DanielFilan (7mo)

Anca Dragan, who currently leads an alignment team at DeepMind, is the one I saw (I then mistakenly assumed there were others). And fair point re: academic OpenPhil grantees.

DanielFilan (7mo)

> Participants pointed to a range of mistakes they thought the AI safety movement had made. There was no consensus and the focus was quite different from person to person. The most common themes included:
>
> - an overreliance on overly theoretical argumentation,
> - being too insular,
> - putting people off by pushing weird or extreme views,
> - supporting the leading AGI companies resulting in race dynamics,
> - not enough independent thought,
> - advocating for an unhelpful pause to AI development,
> - and historically ignoring policy as a potential route to safety.

FWIW one thing that jumps out to me is that it feels like this list comes in two halves each complaining about the other: one that thinks AI safety should be less theoretical, less insular, less extreme, and not advocate pause; and one that thinks that it should be more independent, less connected to leading AGI companies, and more focussed on policy. They aren't strictly opposed (e.g. one could think people overrate pause but underrate policy more broadly), but I would strongly guess that the underlying people making some of these complaints are thinking of the underlying people making others.
