MATS AI Safety Strategy Curriculum

Ronny Fernandez; Ryan Kidd

As part of the MATS Winter 2023-24 Program, scholars were invited to take part in a series of weekly discussion groups on AI safety strategy. Each strategy discussion focused on a specific crux we deemed relevant to prioritizing AI safety interventions and was accompanied by a reading list and suggested discussion questions. The discussion groups were faciliated by several MATS alumni and other AI safety community members and generally ran for 1-1.5 h.

As assessed by our alumni reviewers, scholars in our Summer 2023 Program were much better at writing concrete plans for their research than they were at explaining their research’s theory of change. We think it is generally important for researchers, even those early in their career, to critically evaluate the impact of their work, to:

Choose high-impact research directions and career pathways;
Conduct adequate risk analyses to mitigate unnecessary safety hazards and avoid research with a poor safety-capabilities advancement ratio;
Discover blindspots and biases in their research strategy.

We expect that the majority of improvements to the above areas occur through repeated practice, ideally with high-quality feedback from a mentor or research peers. However, we also think that engaging with some core literature and discussing with peers is beneficial. This is our attempt to create a list of core literature for AI safety strategy appropriate for the average MATS scholar, who should have completed the AISF Alignment Course.

We are not confident that the reading lists and discussion questions below are the best possible version of this project, but we thought they were worth publishing anyways. MATS welcomes feedback and suggestions for improvement.

Week 1: How will AGI arise?

What is AGI?

Karnofsky - Forecasting Transformative AI, Part 1: What Kind of AI? (13 min)
Metaculus - When will the first general AI system be devised, tested, and publicly announced? (read Resolution Criteria) (5 min)

How large will models need to be and when will they be that large?

Alexander - Biological Anchors: The Trick that Might or Might Not Work (read Parts I-II) (27 min)
Optional: Davidson - What a compute-centric framework says about AI takeoff speeds (20 min)
Optional: Habryka et al. - AI Timelines (dialogue between Ajeya Cotra, Daniel Kokotajlo, and Ege Erdil) (61 min)
Optional: Halperin, Chow, Mazlish - AGI and the EMH: markets are not expecting aligned or unaligned AI in the next 30 years (31 min)

How far can current architectures scale?

Patel - Will Scaling Work? (16 min)
Epoch - AI Trends (5 min)
Optional: Nostalgebraist - Chinchilla's Wild Implications (13 min)
Optional: Porby - Why I think strong general AI is coming soon (40 min)

What observations might make us update?

Ngo - Clarifying and predicting AGI (5 min)
Optional: Berglund et al. - Taken out of context: On measuring situational awareness in LLMs (33 min)
Optional: Cremer, Whittlestone - Artificial Canaries: Early Warning Signs for Anticipatory and Democratic Governance of AI (34 min)

Week 2: Is the world vulnerable to AI?

Conceptual frameworks for risk: What kinds of technological advancements is the world vulnerable to in general?

Bostrom - The Unilateralist’s Curse and the Case for a Principle of Conformity (5-15 min)
- This is a pretty simple statistical model. You should only read enough to understand the model, eg, why would a principle of conformity help; why might research get done even if 99% of experts think it is a terrible idea; why does adding more people capable of doing the research make it more likely that the research gets done?
Bostrom - The Vulnerable World Hypothesis (15 min)
- You should read enough to understand the “urn model”. It is also worth looking over the typology of vulnerabilities.
Optional: Aschenbrenner - Securing posterity (14 min)
Optional: Sandbrink et al. - Differential technology development: An innovation governance consideration for navigating technology risks

Attack vectors: How might AI cause catastrophic harm to civilization?

Hilton - What could an AI-caused existential catastrophe actually look like? (11 min)
Seger et al. - Open Sourcing Highly Capable Foundation Models (section: “Risks of Open-Sourcing Foundation Models”) (16 min)
Longlist of possible “attack vectors”:
- Cyberweapons
  - WormGPT enables hacking at scale
  - Palisade Research showed that ChatGPT can hack an unpatched Windows machine with self-prompted chain-of-thought
  - Spear phishing attacks will become extremely sophisticated and scalable
  - AI cyber-defenders might be too weak or disfavored by offence-defence balance to stop takeover, or dangerous for similar reasons (and thus might engage in trade with rogue AI)
- Bioweapons
  - DL is great at predicting protein structure and new chemical weapons
  - LLMs could help novices build bioweapons
  - Labs might be increasingly automated due to financial incentives, allowing for hostile takeover and bioweapon experimentation by AIs
- Mass persuasion/manipulation
  - Human persuasion by AI systems will likely be powerful
    - Blake Lemoin thought LaMDA was sentient
    - Russia’s interference in the US 2016 election using chatbots was effective
    - Deepfakes cost a Chinese man $600k
    - People are already in love with Replika (i.e., like the movie “Her”)
  - Cicero beats humans at Diplomacy
- Autonomous weapons
  - AI beats human pilots at real aerial dogfights
  - AI beats human world-champions at real drone racing
Optional: Burtell, Woodside - Artificial Influence: An Analysis Of AI-Driven Persuasion (25 min)
Optional: 1a3orn - Propaganda or Science: A Look at Open Source AI and Bioterrorism Risk (29 min)
Optional: Lawsen - AI x-risk, approximately ordered by embarrassment (23 min)

AI’s unique threat: What properties of AI systems make them more dangerous than malicious human actors?

Bostrom - Superintelligence (Chapter 6: Cognitive superpowers) (up to “Power over nature and agents”) (14 min)
Karnofsky - AI Could Defeat All Of Us Combined (20 min)
Optional: Christiano - What failure looks like (10 min)
Optional: Barak, Edelman - AI will change the world, but won’t take it over by playing “3-dimensional chess” (28 min)

Week 3: How hard is AI alignment?

What is alignment?

Christiano - Clarifying Alignment (5 min)
Arbital - AI Alignment (5 min)
Optional: Christiano - Corrigibility (8 min)
Optional: Shah - What is ambitious value learning? (3 min)

How likely is deceptive alignment?

Carlsmith - Scheming AIs: Will AIs fake alignment during training in order to get power? (Summary) (35 min)
- Optional: read the full report
- The full report is extremely good in that it takes a close look at a lot of the considerations that should inform our estimate of how likely deceptive alignment is. If you haven't spent much time thinking about this, I recommend reading the full report if you have the time.

What is the distinction between inner and outer alignment? Is this a useful framing?

Hubinger - The Inner Alignment Problem (15 min)
Turner - Inner and outer alignment decompose one hard problem into two extremely hard problems (read extended summary, up to Section I) (7 min)
Optional: read the rest of the post (49 min)

How many tries do we get, and what's the argument for the worst case?

Christiano - Where I agree and disagree with Eliezer (22 min)
In the above article, Cristiano is responding to the article linked below. Both articles are good, but if you have to read only one, I think it's better to only read the Cristiano one.
Optional: Yudkowksy - AGI Ruin: A List of Lethalities (36 min)

How much do alignment techniques for SOTA models generalize to AGI? What does that say about how valuable alignment research on present day SOTA models is?

Ruthenis - Current AIs Provide Nearly No Evidence Relevant to AGI Alignment (10 min)
Optional: Comment Thread on Previous Post (5-15 min)

Week 4: How should we prioritize AI safety research?

What is an "alignment tax" and how do we reduce it?

Christiano - Current Work In AI Alignment (31 min)
- Here is a transcript of the talk if you prefer that over a video: Current Work In AI Alignment
- This talk has a lot of important content, and for that reason it appears in two places in this curriculum. It appears here for its discussion of alignment taxes, and it appears in the next section for its discussion of “handoffs”. It's also worth taking a close look at the directed acyclic graph (DAG) that Christiano uses to frame his talk.
Leike - Distinguishing Three Alignment Taxes (5 min)

What kinds of alignment research will we be able to delegate to models if any?

Leike - A Minimal Viable Product for Alignment (5 min)
Christiano - Current Work In AI Alignment (31 min, already counted above)

How should we think about prioritizing work within the control paradigm in comparison to work with the alignment paradigm?

Shlegeris, et al. - AI Control: Improving Safety Despite Intentional Subversion (12 min)
This blogpost summarizes some of Shlegeris and collaborators' recent work, but we are including it mostly because of how it highlights its relationship to more traditional safety work. I recommend paying particular attention to that section and sections after it.
Optional: Greenblatt, et al. - The case for ensuring that powerful AIs are controlled

How should we prioritize alignment research in light of the amount of time we have left until transformative AI?

Karnofsky - How might we align transformative AI if it’s developed very soon? (54 min)
This is very long, so it might be worth skimming the sections you find most interesting instead of reading the whole thing carefully. That said, I am including it because it does a good job of walking through the potential strategies and potential pitfalls of a concrete transformative AI scenario in the near future.
Hubinger - A transparency and interpretability tech tree (21 min)
Optional: Charbel-Raphaël - Against Almost Every Theory of Impact of Interpretability

How should you prioritize your research projects in light of the amount of time you have left until transformative AI?

Kidd - Aspiring AI safety researchers should ~argmax over AGI timelines (5 min)
I am including this not exactly because I endorse the methodology exactly, but because it is a good example of taking a seemingly very intractable personal prioritization problem and breaking it down into more concrete questions rendering the problem much easier to think about.
Shlegeris - A freshman year during the AI midgame: my approach to the next year (5 min)
This is a fairly personal post, but I think it gives a good example of how to think thoughtfully about prioritizing your research projects while also being kind to yourself.

Week 5: What are AI labs doing?

How are the big labs approaching AI alignment and AI risk in general?

Anthropic:
- Core Views on AI Safety: When, Why, What, and How (15 min)
- Responsible Scaling Policies (30 min)
  I recommend specifically spending more time on the ASL-3 definition and commitments.
DeepMind:
- Some high level thoughts on Deepmind's Alignment Strategy (10 min)
- AI Safety Summit: An update on our approach to safety and responsibility (30 min)
OpenAI:
- Our approach to alignment research (5 min)
- Our approach to frontier risk (10 min)

How are small non-profit research orgs approaching AI alignment and AI risk in general?

ARC: Mechanistic anomaly detection and ELK
METR: Landing page
This is just the landing page of their website, but it's a pretty good explanation of their high level strategy and priorities.
Redwood Research: Research Page
You all already got a bunch of context on what Redwood is up to thanks to their lectures, but here is a link to their “Our Research” page on their website anyway.
Conjecture: Research Page

General summaries:

Larsen, Lifland - (My understanding of) what everyone is doing and why
- This post is sort of old by ML standards, but I think it is currently still SOTA as an overview of what all the different research groups are doing. Maybe you should write a newer and better one.
- This post is also very long. I recommend skimming it and keeping it as a reference rather than trying to read the whole thing in one sitting.

Week 6: What governance measures reduce AI risk?

Should we try to slow down or stop frontier AI research through regulation?

lc - What an actually pessimistic containment strategy looks like (10 min)
- This is a very argumentative post, but I think it presents an interesting frame.
Karnofsky - We're Not Ready: Thoughts on “pausing” and responsible scaling policies (10 min)
1a3orn - Ways I Expect AI Regulation To Increase Extinction Risk (10 min)
Optional: Grace - Let's think about slowing down AI (20 min)
- This post is now two years old, but it genuinely was surprisingly revolutionary in the space two years ago.
- It might be worth coming back to read the second part of this post on AI race dynamics after you have read Carl Schulman's piece in the third section.

What AI governance levers exist?

Regulating chips: Balwit - How We Can Regulate AI (10 min)
- Optional: Shavit - What does it take to catch a Chinchilla? Verifying Rules on Large-Scale Neural Network Training via Compute Monitoring
Safety standards/regulations: AISF - Primer on Safety Standards and Regulations for Industrial-Scale AI Development (5 min)
- Optional: Baker - Nuclear Arms Control Verification and Lessons for AI Treaties
- Optional: Hadfield, Clark - Regulatory Markets: The Future of AI Governance
Making labs liable: Arnold - AI Insight Forum: Privacy and Liability (2 min)

What catastrophes uniquely occur in multipolar AGI scenarios?

Schulman - Arms Control and Intelligence Explosion (10 min)
Christiano - What failure looks like (20 min)
- Many of you may have already read this, but if you haven't already read it many times, it seems pretty fundamental to me, and it might be worth reading again.
Optional: Critch - What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) (30 min)

Week 7: What do positive futures look like?

Note: attending discussion this week was highly optional.

What near-term positive advancements might occur if AI is well-directed?

Universal provision of basic needs
- O'Keefe - The Windfall Clause: Sharing the Benefits of Advanced AI (14 min)
Medical advancements for healthspan and longevity
- Green - Radical Life Extension: An ethical analysis (6 min)
- Optional: Bostrom - The Fable Of The Dragon-Tyrant (narrated and animated by CGP Grey)
More informed and empowered democratic citizens
- Optional: How to ensure artificial intelligence benefits society: A conversation with Stuart Russell and James Manyika (11 min)

What values might we want to actualize with the aid of AI?

Universal basic rights and self-determination
- Optional: Stanford Encyclopedia of Philosophy - Consequentialism: Section 3, What is Good? Hedonistic vs. Pluralistic Consequentialisms (12 min)
Positive experience and non-suffering
- Bostrom - Letter from Utopia (11 min)
- Optional: Yudkowsky - 31 Laws of Fun
Cosmopolitanism (non-racism, non-ageism, non-speciesism, etc.)
- Aird - Moral circles: Degrees, dimensions, visuals (16 min)
- Optional: Anthis, Paez - Moral circle expansion: A promising strategy to impact the far future

What (very speculative) long-term futures seem possible and promising?

Long reflection on value and meaning
- Optional: Ord - The Precipice: Chapter 7, Safeguarding Humanity (40 min)
Digital societies and AI sentience
- Bostrom - Sharing the World With Digital Minds (31 min)
- Optional: Long et al. - Consciousness in Artificial Intelligence: Insights from the Science of Consciousness
Growing life throughout the universe
- Rational Animations - How to Take Over the Universe (in Three Easy Steps) (18 min)
  - Optional: Armstrong, Sandberg - Eternity in six hours: intergalactic spreading of intelligent life and sharpening the Fermi paradox
- Rational Animations - Humanity was born way ahead of its time. The reason is grabby aliens (13 min)
  - Optional: Rational Animations - Will we grab the universe? Grabby aliens predictions.
    - Highly recommended!
  - Optional: Hanson et al. - If Loud Aliens Explain Human Earliness, Quiet Aliens Are Also Rare

Acknowledgements

Ronny Fernandez was chief author of the reading lists and discussion questions, Ryan Kidd planned, managed, and edited this project, and Juan Gil coordinated the discussion groups. Many thanks to the MATS alumni and other community members who helped as facilitators!

[-]Ronny Fernandez1y82

I want to note for posterity that I tried to write this reading list somewhat impartially. That is, I have a lot of takes about a lot of this stuff, and I tried to include a lot of material that I disagree with but which I have found helpful in some way or other. I also included things that people I trust have found helpful even if I personally never found it helpful.

LESSWRONG
LW

74

MATS AI Safety Strategy Curriculum

74

Week 1: How will AGI arise?

Suggested discussion questions

Week 2: Is the world vulnerable to AI?

Suggested discussion questions

Week 3: How hard is AI alignment?

Suggested discussion questions

Week 4: How should we prioritize AI safety research?

Suggested discussion questions

Week 5: What are AI labs doing?

Suggested discussion questions

Week 6: What governance measures reduce AI risk?

Suggested discussion questions

Week 7: What do positive futures look like?

Suggested discussion questions

Acknowledgements

74