
In 2022 and 2023, there has been a growing focus on recruiting talented individuals to work on mitigating the potential existential risks posed by artificial intelligence. For example, we’ve seen an increase in the number of university clubs, retreats, and workshops dedicated to introducing people to the issue of existential risk from AI.

However, these efforts might foster an environment with suboptimal epistemics. Given the goal of enabling people to contribute positively to AI safety, there’s an incentive to focus on that goal without worrying as much about whether our arguments are solid. Many people working on field building are not domain experts in AI safety or machine learning but are motivated by a belief that AI safety is an important issue. Some participants may believe that addressing the risks posed by AI is important without fully understanding the reasoning behind this belief or having engaged with strong counterarguments.

This post briefly examines this issue and suggests some ideas for improving epistemics in outreach efforts.

Note: I first drafted this in December 2022. Since then, concern about AI x-risk has been increasingly discussed in the mainstream, so AI safety field builders should hopefully be using fewer weird, epistemically poor arguments. Even so, I think epistemics remain relevant to discuss, especially after a recent post noted poor epistemics in EA community building.

What are some ways that AI safety field building may be epistemically unhealthy?

  • Organizers may promote arguments for AI safety that may be (comparatively*) compelling yet flawed
    • Advancing arguments promoting the importance of AI safety while neglecting opposing arguments
      • E.g., citing that x% of researchers believe that AI has a y% chance of causing an existential catastrophe, without the caveat that experts have widely differing views
    • Confidently making arguments that are flawed or have insufficiently justified premises
      • E.g., claiming that instrumental convergence is inevitable, assuming that AIs are maximizing for reward (see Reward is not the optimization target, although there are also comments disagreeing with this)
    • See also: Rohin Shah’s comment here about how few people can make an argument for working on AI x-risk that he doesn’t think is obviously flawed
    • *At the same time, I think that most ML people don’t find AI safety arguments particularly compelling.
  • It’s easy to form the perception that arguments in favor of AI safety are “supposed” to be the more correct ones. People might feel hesitant to voice disagreements.
  • In a reading group (such as one based on AI Safety Fundamentals), people may go along with the arguments from the readings or what the discussion facilitator says – deferring to authority and being hesitant to think through arguments themselves.
  • People may participate in reading groups but skim the readings, and walk away believing the conclusions without understanding the arguments; or they may notice they are confused but walk away believing the conclusions regardless.

Why are good epistemics valuable?

  • To do productive research, we want to avoid having an understanding of AI x-risk that is obviously flawed
    • “incorrect arguments lead to incorrect beliefs which lead to useless solutions” (from Rohin Shah)
  • Bad arguments are bad for persuading people (or at least, it seems bad if you can’t anticipate common objections from the ML community)
  • Making bad arguments is also bad for getting people to do useful work
  • Good epistemics help attract more people with good epistemics

For the sake of epistemic rigor, I’ll also make a few possible arguments about why epistemics may be overrated.

  • Perhaps people can do useful work even if they don’t have an inside view of why AI safety matters or of which technical agendas matter.
  • You might have people who just end up feeling confused and don’t take any useful action.
  • Other activities may be comparatively more valuable.
  • Epistemics might already be good enough.
  • Alignment researchers disagree about a lot of things and it might be unclear whether someone you disagree with has “bad epistemics” or just has different models/priors.

Ideas to improve epistemics

  • Embrace more contemporary, grounded arguments about why AIs could be dangerous
  • Suggested by Thomas Kwa in the comments: “actually trying to understand the arguments yourself and only using correct ones […] sometimes the required steps are as simple as listing out an argument in great detail and looking at it skeptically, or checking with someone who knows ML about whether a current ML system has some property that we assume and if we expect it to arise anytime soon.”
  • Conduct readings during meetings, so that participants understand the content better
    • This has a couple of other benefits, such as helping you remember specific points that you want to discuss
  • Possibly offer rationality workshops
    • Notice when you’re feeling confused
    • Note: Rationality workshops shouldn’t exist to get people to take unconventional ideas seriously; they should genuinely be aimed at improving epistemics
  • Actively check participants’ understanding of the content, and whether they can make strong arguments both for and against
    • This could look like opening the discussion by asking someone to summarize the argument
  • Don’t have conversations with the goal of persuading people
    • I’d prefer to redirect such efforts toward encouraging and supporting motivated people to apply for opportunities
  • Actively invite people to disagree, and perhaps explore their ideas
    • Note: I don’t want a kumbaya environment where people voice ideas (even if bad) without any pushback.
  • Stay in touch with the broader ML community (e.g., by following them on Twitter, attending AI events)
Comments

Stay in touch with the broader ML community (e.g., by following them on Twitter, attending AI events)

I got a lot of value out of attending ICML and would probably recommend attending an ML conference to anyone who has the resources. You actually get to talk to authors about their field and research process, which gets a lot more than reading papers or reading Twitter.

Anyway, I think you missed one of the best ideas: actually trying to understand the arguments yourself and only using correct ones. An argument isn't correct just because it's "grounded" or "contemporary" although it is good to have supporting evidence. The steps all have to be locally valid and you have to make valid assumptions. Sometimes an argument needs to be slightly changed from the most common version to be valid [1], but this only makes it more important.

Community builders often don't do technical research themselves so my guess is it's easy to underinvest, but sometimes the required steps are as simple as listing out an argument in great detail and looking at it skeptically, or checking with someone who knows ML about whether a current ML system has some property that we assume and if we expect it to arise anytime soon.

[1]: two examples: making various arguments compatible with Reward is not the optimization target, and making coherence arguments work even though AI systems will not necessarily have a single fixed utility function

Your last sentence in the first paragraph seems to be cut off at "gets a lot more than"!

mic:

Great point, I've added this suggestion to the post.

“An Overview of Catastrophic AI Risks” seems more grounded than Superintelligence

Huh. I guess it is more "grounded" in the sense that it has that GIF of a boat going in circles, and some other links to other toy examples... but is it really more epistemically rigorous than Superintelligence? I think the answer is "obviously not, the opposite is true", though it's an unfair comparison since Superintelligence is a whole book written for an academic audience.

(This is a nitpick, I basically agree with your post overall)

Organizers may promote arguments for AI safety that may be (comparatively*) compelling yet flawed

I feel like there's an asymmetry here. "10% of researchers believe AI extinction is a possibility" isn't somehow offset by "but 90% don't". For such an outrageous claim, 10% is a huge number! Similarly, "maybe AIs won't be instrumentally convergent" is not enough here. "We are absolutely positive that we can build AIs that are not instrumentally convergent, and that no amount of unavoidable successive dumbass tinkering will suffice to change that" would be. Which is kind of what alignment research is about? Whenever people have a P(doom) lower than 100% (which is most people besides Yud), that margin usually lies somewhere in these possibilities. But even a P(doom) of 1% is stupid high and worth spending effort reducing further. 

I think any outreach must start with understanding where the audience is coming from. The people most likely to make the considerable investment of "doing outreach" are in danger of being too convinced of their position and thinking it obvious; "how can people not see this?".

If you want to have a meaningful conversation with someone and interest them in a topic, you need to listen to their perspective, even if it sounds completely false and missing the point, and be able to empathize without getting frustrated. For most people to listen and consider any object level arguments about a topic they don't care about, there must first be a relationship of mutual respect, trust and understanding. Getting people to consider some new ideas, rather than convincing them of some cause, is already a very worthy achievement.