I want to note for posterity that I tried to write this reading list somewhat impartially. That is, I have a lot of takes about a lot of this stuff, and I tried to include a lot of material that I disagree with but have found helpful in some way or other. I also included things that people I trust have found helpful, even if I personally did not.
As part of the MATS Winter 2023-24 Program, scholars were invited to take part in a series of weekly discussion groups on AI safety strategy. Each strategy discussion focused on a specific crux we deemed relevant to prioritizing AI safety interventions and was accompanied by a reading list and suggested discussion questions. The discussion groups were facilitated by several MATS alumni and other AI safety community members and generally ran for 1-1.5 hours.
As assessed by our alumni reviewers, scholars in our Summer 2023 Program were much better at writing concrete plans for their research than at explaining their research’s theory of change. We think it is generally important for researchers, even those early in their careers, to critically evaluate the impact of their work, in order to:
We expect that most improvement in the above areas will come through repeated practice, ideally with high-quality feedback from a mentor or research peers. However, we also think that engaging with some core literature and discussing it with peers is beneficial. This is our attempt to create a list of core literature for AI safety strategy appropriate for the average MATS scholar, who should have completed the AISF Alignment Course.
We are not confident that the reading lists and discussion questions below are the best possible version of this project, but we thought they were worth publishing anyway. MATS welcomes feedback and suggestions for improvement.
Week 1: How will AGI arise?
What is AGI?
How large will models need to be and when will they be that large?
How far can current architectures scale?
What observations might make us update?
Suggested discussion questions
Week 2: Is the world vulnerable to AI?
Conceptual frameworks for risk: What kinds of technological advancements is the world vulnerable to in general?
Attack vectors: How might AI cause catastrophic harm to civilization?
AI’s unique threat: What properties of AI systems make them more dangerous than malicious human actors?
Suggested discussion questions
Week 3: How hard is AI alignment?
What is alignment?
How likely is deceptive alignment?
What is the distinction between inner and outer alignment? Is this a useful framing?
Optional: read the rest of the post (49 min)
How many tries do we get, and what's the argument for the worst case?
In the above article, Christiano is responding to the article linked below. Both articles are good, but if you only have time to read one, I think it's better to read the Christiano one.
How much do alignment techniques for SOTA models generalize to AGI? What does that say about how valuable alignment research on present-day SOTA models is?
Suggested discussion questions
Week 4: How should we prioritize AI safety research?
What is an "alignment tax" and how do we reduce it?
What kinds of alignment research, if any, will we be able to delegate to models?
How should we think about prioritizing work within the control paradigm compared to work within the alignment paradigm?
This blog post summarizes some of Shlegeris and collaborators' recent work, but we are including it mostly because it highlights how that work relates to more traditional safety work. I recommend paying particular attention to that section and the sections after it.
How should we prioritize alignment research in light of the amount of time we have left until transformative AI?
This is very long, so it might be worth skimming it and reading carefully only the sections you find most interesting. That said, I am including it because it does a good job of walking through the potential strategies and pitfalls of a concrete transformative AI scenario in the near future.
How should you prioritize your research projects in light of the amount of time you have left until transformative AI?
I am including this not because I endorse the methodology exactly, but because it is a good example of taking a seemingly intractable personal prioritization problem and breaking it down into more concrete questions, which makes the problem much easier to think about.
This is a fairly personal post, but I think it gives a good example of how to think carefully about prioritizing your research projects while also being kind to yourself.
Suggested discussion questions
Week 5: What are AI labs doing?
How are the big labs approaching AI alignment and AI risk in general?
I recommend specifically spending more time on the ASL-3 definition and commitments.
How are small non-profit research orgs approaching AI alignment and AI risk in general?
This is just the landing page of their website, but it's a pretty good explanation of their high-level strategy and priorities.
You all already got a bunch of context on what Redwood is up to thanks to their lectures, but here is a link to their “Our Research” page on their website anyway.
General summaries:
Suggested discussion questions
Week 6: What governance measures reduce AI risk?
Should we try to slow down or stop frontier AI research through regulation?
What AI governance levers exist?
What catastrophes uniquely occur in multipolar AGI scenarios?
Suggested discussion questions
Week 7: What do positive futures look like?
Note: attending the discussion this week was highly optional.
What near-term positive advancements might occur if AI is well-directed?
What values might we want to actualize with the aid of AI?
What (very speculative) long-term futures seem possible and promising?
Suggested discussion questions
Acknowledgements
Ronny Fernandez was the chief author of the reading lists and discussion questions; Ryan Kidd planned, managed, and edited this project; and Juan Gil coordinated the discussion groups. Many thanks to the MATS alumni and other community members who helped as facilitators!