As part of our Summer 2024 Program, MATS ran a series of discussion groups focused on questions and topics we believe are relevant to prioritizing research into AI safety. Each weekly session centred on one overarching question and was accompanied by readings and suggested discussion questions. The purpose of running these discussions was to increase scholars' knowledge of the AI safety ecosystem and of models of how AI could cause a catastrophe, and to hone their ability to think critically about threat models, ultimately in service of helping scholars become excellent researchers.
The readings and questions were largely based on the curriculum from the Winter 2023-24 Program, with two changes:
In addition, the curriculum was supplemented in two ways:
As in the post about the previous cohort’s curriculum, we think that there is likely significant room to improve this curriculum, and welcome feedback in the comments.
Week 1: How powerful is intelligence?
Core readings
[In retrospect, it was a mistake to assign a reading that people could not read just by clicking on the link. Instead, it would be better to assign this LessWrong post containing a summary and discussion of the relevant section.]
Other readings
Discussion questions
Week 2: How and when will transformative AI be made?
Core readings
A 'dialogue' on whether LLMs can be scaled to get superhuman AI. In summary:
Connects cross-entropy loss to something we care about: how many tokens you need to draw from a model before you can distinguish its text from human text. Thesis: if a model's text is indistinguishable from human text, the model is as smart as the humans generating that text. Uses this to bound how long until AIs can write at a human level. (A toy version of this calculation is sketched after this list.)
Bio-anchors: compare the FLOP used in ML training to the FLOP of candidate biological processes that plausibly produce general intelligence in the relevant way, and get a decent probability of AGI by 2050. Problem: maybe FLOP is a poor proxy for intelligence.
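To make the distinguishability idea concrete, here is a toy version of the calculation (our own illustration with made-up numbers; the reading's actual model is more careful):

```python
# Toy "tokens needed to distinguish model text from human text" calculation.
# Rough rule from hypothesis testing: a likelihood-ratio test needs about
# log(1/error) / KL-per-token samples, and the per-token KL is roughly the gap
# between the model's cross-entropy on human text and the true entropy of human
# text. All numbers below are invented for illustration.
import math

human_entropy = 0.70        # assumed entropy of human text, nats per token
model_cross_entropy = 0.72  # assumed model loss on human text, nats per token
kl_per_token = model_cross_entropy - human_entropy

error_prob = 0.01           # how often the distinguisher is allowed to be wrong
tokens_needed = math.log(1 / error_prob) / kl_per_token
print(f"~{tokens_needed:,.0f} tokens to distinguish at {1 - error_prob:.0%} confidence")
# As the loss gap shrinks toward zero, the tokens needed blow up - this is the
# sense in which matching human loss implies writing indistinguishably from humans.
```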
Other readings
Dashboard of trends in large deep learning models: how training compute is scaling (roughly 4x per year for notable ML models since 2010), when we'll have trained on all human-generated text (around 2028), etc. (A quick compounding calculation appears after this list.)
Just scaling up neural nets on text prediction could produce AGI. Written when this was not the consensus view.
If markets were expecting super-powerful AI within the next 30 years, interest rates would go up, either because the market thought we were about to die, or because economic growth would increase. They're not up, and markets are really smart, so maybe we won't get super-powerful AI in the next 30 years.
Scaling might produce really smart AI, and we should think about this from a national security perspective. The US should build tons of power plants to supply energy for training AIs, people should put a lot of work into alignment, AI will be nationalized, and the US should try to get AGI before foreign actors do while making sure foreign states don't steal model weights.
Discusses disagreements between Daniel Kokotajlo, Ajeya Cotra, and Ege Erdil on AI timelines. Daniel thinks we will get transformative AI within 10 years, Ajeya's median is about 13 years away, and Ege thinks it's around 40 years away. Ege puts a lot of weight on unforeseen difficulties cropping up and slowing things down, while Daniel thinks scaling will work pretty smoothly.
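As a quick sanity check on how fast a ~4x/year compute trend compounds (our own back-of-the-envelope, using the growth rate quoted above):

```python
# Compounding a 4x/year growth rate in training compute (rate taken from the
# dashboard summary above; treat it as approximate).
growth_per_year = 4
for years in (1, 2, 5, 10):
    print(f"{years:>2} years at 4x/year -> {growth_per_year ** years:,}x more compute")
# Ten years at 4x/year is roughly a million-fold increase, which is why
# extrapolations a decade out look so unlike today's models.
```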
Discussion questions
Week 3: How could we train AIs whose outputs we can’t evaluate?
Core readings
RLHF won't scale to situations where humans can't evaluate AI plans; we need AIs to assist humans with evaluation.
RLHF can be suboptimal when the AI knows things the human doesn't. Two failure modes: the AI making the human think things are better than they really are, and the AI spending resources to prove to the human that the AI is being useful.
Studies the problem of humans supervising superhuman models by using weak models to generate labels on which to fine-tune larger models, and seeing whether the larger models can outperform the weaker ones. (A toy analogue of this setup is sketched after this list.)
A common alignment plan is to observe human behaviour, infer what we want, and get an AI to optimize for that. Problem: given that humans are not optimal utility maximizers, it's unclear how one could do this even with infinite data and compute.
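For readers who want the weak-to-strong setup in concrete terms, here is a toy analogue using small scikit-learn classifiers in place of language models (our own sketch; the data and hyperparameters are arbitrary):

```python
# Toy weak-to-strong setup: a small "weak supervisor" model produces imperfect
# labels, and a larger "strong student" model is trained only on those labels.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=4000, n_features=20, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "Weak supervisor": a small model trained on a limited slice of ground truth.
weak = LogisticRegression(max_iter=1000).fit(X_train[:500], y_train[:500])
weak_labels = weak.predict(X_train)  # imperfect labels, standing in for human supervision

# "Strong student": a bigger model trained only on the weak model's labels.
strong = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
strong.fit(X_train, weak_labels)

print("weak accuracy:  ", weak.score(X_test, y_test))
print("strong accuracy:", strong.score(X_test, y_test))
# The question of interest: does the strong student recover performance beyond
# its weak supervisor (weak-to-strong generalization), or just imitate its errors?
```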
Other readings
Repeat the following: make an AI that is as smart as a human equipped with whatever tools the human has available, but much cheaper to run (distillation); then give that AI to the human as a tool, increasing the human's capabilities (amplification). This increases the human's power to get things done while remaining aligned with the human's interests. (A toy numeric sketch of this loop appears after this list.)
To help humans supervise AI outputs, train two AIs to debate each other about how good the outputs are. The AIs will be incentivized to point out flaws or omitted information in each other's arguments, and so each AI is incentivized to be honest about the quality of the outputs. (A toy version of debate's narrowing-down mechanism is sketched after this list.)
Suppose one AI makes a long, multi-step argument for claim C, and another AI says that one of the steps is invalid but it can't figure out which. Then, the other AI makes a long, multi-step argument for C being false, and the first AI says that one of the steps is invalid but it can't figure out which. A human can't reliably judge this debate, which makes it hard for AI safety via debate to work. This can show up in practice in human debates.
Scalable oversight (getting AIs to help humans evaluate AI output) and weak-to-strong generalization (helping AIs learn from imperfect human evaluation) are compatible approaches to the problem of training models to perform well when we have trouble evaluating their output.
To test our ability to get smart models to do what we want, we should run experiments like "get someone who doesn't speak French to train an LLM to translate English into French, and use bilingual English-French speakers to check whether it worked". More abstractly: take a fuzzy task that can't be precisely defined and an AI that could be better than some humans at that task, and have those humans try to get the AI to do the task via various alignment schemes.
If we built an AI system that could accelerate alignment research and help us align more capable AI systems, this would be good: building it is plausibly easier than most other alignment work, and it would help us solve the rest of the alignment problem.
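Here is the toy numeric sketch of the distillation/amplification loop promised above (our own illustration; the capability numbers and update rules are made up, and real proposals do not reduce capability to a single number):

```python
# Toy iterated distillation and amplification loop. "Capability" is a single
# made-up number here, purely to show the shape of the procedure.

def amplify(human_capability: float, model_capability: float) -> float:
    # The human uses many copies of the current model as tools, so the combined
    # system is assumed to be more capable than either alone.
    return human_capability + 1.2 * model_capability

def distill(amplified_capability: float) -> float:
    # Train a cheaper model to imitate the human-plus-tools system; imitation
    # loses a little capability but the result is much cheaper to run.
    return 0.95 * amplified_capability

human = 1.0
model = 0.0  # start with a model that can't do anything useful
for step in range(8):
    model = distill(amplify(human, model))
    print(f"round {step}: model capability ~ {model:.2f}")
# Capability ratchets up each round while (by assumption) staying anchored to
# human oversight, since each model only ever imitates a human-plus-tools system.
```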
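And here is the toy version of debate's narrowing-down mechanism mentioned above (our own illustration, loosely inspired by the paper's framing; the task is deliberately trivial):

```python
# Toy debate: the "question" is the sum of a long list. The judge is assumed to
# be too limited to add the whole list, but can check a claim about one element.
# One debater defends the true total, the other a false one; the dispute gets
# narrowed down until the judge can settle it with a single check.

def debate(numbers, honest_claim, dishonest_claim):
    if len(numbers) == 1:
        # The judge can now directly check which claim about this number is right.
        return "honest wins" if numbers[0] == honest_claim else "dishonest wins"
    mid = len(numbers) // 2
    left, right = numbers[:mid], numbers[mid:]
    # The dishonest debater must split its false total across the two halves,
    # so at least one sub-claim is false (here it hides the error in the right half).
    dishonest_left, dishonest_right = sum(left), dishonest_claim - sum(left)
    # The honest debater challenges whichever half the claims disagree about.
    if dishonest_left != sum(left):
        return debate(left, sum(left), dishonest_left)
    return debate(right, sum(right), dishonest_right)

numbers = list(range(1, 101))
print(debate(numbers, sum(numbers), sum(numbers) + 7))  # -> "honest wins"
```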
Discussion questions
Week 4: Will AIs fake alignment?
Core readings
Scheming AIs: Will AIs fake alignment during training in order to get power?, abstract and introduction (Carlsmith - 45 min)
[In retrospect, this probably took longer than 45 minutes for most people to read]
Other readings
On inner and outer alignment
Note that in this post, a "mesa-optimizer" is an optimizer that is itself optimized by a "base optimizer": imagine a neural network that is itself running an optimization process in order to perform some task (like "cleaning a room"), while being optimized by an algorithm like stochastic gradient descent. The "mesa-objective" is the objective of the mesa-optimizer, while the "base objective" is the objective that the base optimizer is optimizing.
On reasons to think deceptive alignment is likely
Discussion questions
Week 5: How should AI be governed?
Core readings
Set up an international AI Safety Agency that has to approve the training and deployment of big models. Training of general AI systems should only be allowed if safety can be guaranteed, and deployment should only be allowed if no dangerous capabilities are present. Also stop people from publishing algorithmic improvements or ways of increasing the effectiveness of computing hardware.
How labs can increase safety: say "we will stop scaling when we make observation O, until we implement adequate protection P". This makes sense under varying estimates of risk, moves attention to specific risk-reducing measures, and gives practice for evaluation-based regulation. (A minimal sketch of the if-then structure appears after this list.)
Regulations are messy and can easily backfire, by (a) being misdirected and hampering safety effort, (b) favouring things that are legible to the state (which might not be good safety efforts), (c) pushing research to countries that don't regulate, and (d) empowering big companies.
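A minimal sketch of the if-then structure referenced above, with made-up triggers and protections (our own illustration, not any lab's actual policy):

```python
# "If we observe O, pause scaling until protection P is in place."
# Both the observations and the protections below are invented examples.
IF_THEN_COMMITMENTS = [
    {"observation": "model gives meaningful uplift on dangerous biology tasks",
     "required_protections": {"hardened weight security", "deployment safeguards audited"}},
    {"observation": "model can autonomously replicate across machines",
     "required_protections": {"hardened weight security", "tested shutdown procedures"}},
]

def may_continue_scaling(observations_made, protections_in_place):
    # Scaling may continue only if every triggered commitment's protections exist.
    for commitment in IF_THEN_COMMITMENTS:
        if commitment["observation"] in observations_made:
            if not commitment["required_protections"] <= protections_in_place:
                return False  # pause until the missing protections are implemented
    return True

print(may_continue_scaling(
    {"model can autonomously replicate across machines"},
    {"hardened weight security"},
))  # False: "tested shutdown procedures" is still missing
```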
Other readings
From the piece: "Since I think substantial AI regulation will likely occur by default, I urge effective altruists to focus more on ensuring that the regulation is thoughtful and well-targeted rather than ensuring that regulation happens at all."
An AI regulation scheme: if there's an AI accident that's a near miss for doom (e.g. a power-seeking AI takes over Berkeley for a week, rather than the whole Earth forever), the developer is liable for punitive damages that represent the risk that their AI could have taken over the world. Also, impose damages even when there isn't provable negligence, and require labs to be insured against this legal risk. (A toy damages calculation appears after this list.)
Advanced AI is likely to kill all humans, and AI is progressing very quickly. There is no sensible plan to align AI: "get AI to figure out alignment" is not a sensible plan. There should be a global ban on frontier AI development, GPUs should be tracked, and large GPU clusters should be banned.
We're not ready for transformative AI. A global pause on AI progress and hardware is infeasible, and a partial pause (e.g. temporarily banning AI development in the US without banning production of AI hardware and algorithmic insights) could do more harm than doing nothing, because progress would be very fast once the pause was lifted. RSPs are like pauses that only kick in when AI is demonstrably scary, so lots of people can agree on them.
An indefinite AI pause is possible, and would require something like a world government, which would be hard to create, and would also be terrible. It would also delay technological progress. It could also prevent AI from being created ever, which would leave humanity vulnerable to other existential risks.
Takeoffs are safer if they're slower and more continuous, because people have time to react. Slow continuous takeoff is more likely in short timelines, because coordination is easier now and we might have "compute overhang" later - lots of computation that could quickly be used to make really smart AI. Also, it's worth trading time now for time during takeoff, because we will know more during takeoff.
Some things about the naming and definitions of RSPs are sketchy, and the Anthropic RSP is poorly specified. Using RSPs to reduce existential risk requires determining whether models are existentially dangerous, and it's unclear how to do that.
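To illustrate how punitive damages scaled to counterfactual risk might be sized under the liability scheme above (our own toy numbers, not the author's):

```python
# Toy punitive-damages calculation: even if the realized harm from a near miss
# was modest, damages are scaled to the catastrophe that almost happened.
# Every number below is invented for illustration.
p_catastrophe_given_near_miss = 0.01   # assumed chance the incident became unrecoverable
assumed_catastrophe_harm = 1e15        # assumed dollar value placed on that outcome
realized_harm = 5e8                    # actual damage from the week-long incident

punitive_damages = p_catastrophe_given_near_miss * assumed_catastrophe_harm
total_liability = realized_harm + punitive_damages
print(f"punitive: ${punitive_damages:,.0f}; total: ${total_liability:,.0f}")
# The point: liability tracks the expected harm of the risk imposed, not just
# the harm that happened to materialize, so labs internalize tail risk up front.
```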
Discussion questions
Readings that did not fit into any specific week
Acknowledgements
Daniel Filan was the primary author of the curriculum (to the extent that it differed from the Winter 2023-24 curriculum) and coordinated the discussion groups. Ryan Kidd scoped, managed, and edited the project. Many thanks to the MATS alumni and other community members who helped as facilitators and to the scholars who showed up and had great discussions!