We’re grateful to our advisors Nate Soares, John Wentworth, Richard Ngo, Lauro Langosco, and Amy Labenz. We're also grateful to Ajeya Cotra and Thomas Larsen for their feedback on the contests.
TLDR: AI Alignment Awards is running two contests designed to raise awareness about AI alignment research and generate new research proposals. Prior experience with AI safety is not required. Promising submissions will win prizes up to $100,000 (though note that most prizes will be between $1k-$20k; we will only award higher prizes if we receive exceptional submissions.)
You can help us by sharing this post with people who are or might be interested in alignment research (e.g., student mailing lists, FB/Slack/Discord groups.)
What are the contests?
We’re currently running two contests:
Goal Misgeneralization Contest (based on Langosco et al., 2021): AIs often learn unintended goals. Goal misgeneralization occurs when a reinforcement learning agent retains its capabilities out-of-distribution yet pursues the wrong goal. How can we prevent or detect goal misgeneralization?
Shutdown Problem Contest (based on Soares et al., 2015): Given that powerful AI systems might resist attempts to turn them off, how can we make sure they are open to being shut down?
What types of submissions are you interested in?
For the Goal Misgeneralization Contest, we’re interested in submissions that do at least one of the following:
- Propose techniques for preventing or detecting goal misgeneralization
- Propose ways for researchers to identify when goal misgeneralization is likely to occur
- Identify new examples of goal misgeneralization in RL or non-RL domains. For example:
- We might train an imitation learner to imitate a "non-consequentialist" agent, but it actually ends up learning a more consequentialist policy.
- We might train an agent to be myopic (e.g., to only care about the next 10 steps), but it actually learns a policy that optimizes over a longer timeframe.
- Suggest other ways to make progress on goal misgeneralization
For the Shutdown Problem Contest, we’re interested in submissions that do at least one of the following:
- Propose ideas for solving the shutdown problem or designing corrigible AIs. These submissions should also include (a) explanations for how these ideas address core challenges raised in the corrigibility paper and (b) possible limitations and ways the idea might fail
- Define The Shutdown Problem more rigorously or more empirically
- Propose new ways of thinking about corrigibility (e.g., ways to understand corrigibility within a deep learning paradigm)
- Strengthen existing approaches to training corrigible agents (e.g., by making them more detailed, exploring new applications, or describing how they could be implemented)
- Identify new challenges that will make it difficult to design corrigible agents
- Suggest other ways to make progress on corrigibility
Why are you running these contests?
We think that corrigibility and goal misgeneralization are two of the most important problems that make AI alignment difficult. We expect that people who can reason well about these problems will be well-suited for alignment research, and we believe that progress on these subproblems would be meaningful advances for the field of AI alignment. We also think that many people could potentially contribute to these problems (we're only aware of a handful of serious attempts at engaging with these challenges). Moreover, we think that tackling these problems will offer a good way for people to "think like an alignment researcher."
We hope the contests will help us (a) find people who could become promising theoretical and empirical AI safety researchers, (b) raise awareness about corrigibility, goal misgeneralization, and other important problems relating to AI alignment, and (c) make actual progress on corrigibility and goal misgeneralization.
Who can participate?
Anyone can participate.
What if I’ve never done AI alignment research before?
You can still participate. In fact, you’re our main target audience. One of the main purposes of AI Alignment Awards is to find people who haven’t been doing alignment research but might be promising fits for alignment research. If this describes you, consider participating. If this describes someone you know, consider sending this to them.
Note that we don’t expect newcomers to come up with a full solution to either problem (please feel free to prove us wrong, though). You should feel free to participate even if your proposal has limitations.
How can I help?
You can help us by sharing this post with people who are or might be interested in alignment research (e.g., student mailing lists, FB/Slack/Discord groups) or specific individuals (e.g., your smart friend who is great at solving puzzles, learning about new topics, or writing about important research topics.)
Feel free to use the following message:
AI Alignment Awards is offering up to $100,000 to anyone who can make progress on problems in alignment research. Anyone can participate. Learn more and apply at alignmentawards.com!
Will advanced AI be beneficial or catastrophic? We think this will depend on our ability to align advanced AI with desirable goals – something researchers don’t yet know how to do.
We’re running contests to make progress on two key subproblems in alignment:
- The Goal Misgeneralization Contest (based on Langosco et al., 2021): AIs often learn unintended goals. Goal misgeneralization occurs when a reinforcement learning agent retains its capabilities out-of-distribution yet pursues the wrong goal. How can we prevent or detect goal misgeneralization?
- The Shutdown Contest (based on Soares et al., 2015): Advanced AI systems might resist attempts to turn them off. How can we design AI systems that are open to being shut down, even as they get increasingly advanced?
No prerequisites are required to participate. EDIT: The deadline has been extended to May 1, 2023.
To learn more about AI alignment, see alignmentawards.com/resources.
Outlook
We see these contests as one possible step toward making progress on corrigibility, goal misgeneralization, and AI alignment. With that in mind, we’re unsure about how useful the contest will be. The prompts are very open-ended, and the problems are challenging. At best, the contests could raise awareness about AI alignment research, identify particularly promising researchers, and help us make progress on two of the most important topics in AI alignment research. At worst, they could be distracting, confusing, and difficult for people to engage with (note that we’re offering awards to people who can define the problems more concretely.)
If you’re excited about the contest, we’d appreciate you sharing this post and the website (alignmentawards.com) to people who might be interested in participating. We’d also encourage you to comment on this post if you have ideas you’d like to see tried.
ETA: Koen recommends reading Counterfactual Planning in AGI Systems before (or instead of) Corrigibility with Utility Preservation
Update: I started reading your paper "Corrigibility with Utility Preservation".[1] My guess is that readers strapped for time should read {abstract, section 2, section 4} then skip to section 6. AFAICT, section 5 is just setting up the standard utility-maximization framework and defining "superintelligent" as "optimal utility maximizer".
Quick thoughts after reading less than half:
AFAICT,[2] this is a mathematical solution to corrigibility in a toy problem, and not a solution to corrigibility in real systems. Nonetheless, it's a big deal if you have in fact solved the utility-function-land version which MIRI failed to solve.[3] Looking to applicability, it may be helpful for you to spell out the ML analog to your solution (or point us to the relevant section in the paper if it exists). In my view, the hard part of the alignment problem is deeply tied up with the complexities of the {training procedure --> model} map, and a nice theoretical utility function is neither sufficient nor strictly necessary for alignment (though it could still be useful).
So looking at your claim that "the technical problem [is] mostly solved", this may or may not be true for the narrow sense (like "corrigibility as a theoretical outer-objective problem in formally-specified environments"), but seems false and misleading for the broader practical sense ("knowing how to make an AGI corrigible in real life").[4]
Less important, but I wonder if the authors of Soares et al agree with your remark in this excerpt[5]:
"In particular, [Soares et al] uses a Platonic agent model [where the physics of the universe cannot modify the agent's decision procedure] to study a design for a corrigible agent, and concludes that the design considered does not meet the desiderata, because the agent shows no incentive to preserve its shutdown behavior. Part of this conclusion is due to the use of a Platonic agent model."
Btw, your writing is admirably concrete and clear.
Errata: Subscripts seem to broken on page 9, which significantly hurts readability of the equations. Also there is a double-typo "I this paper, we the running example of a toy universe" on page 4.
Assuming the idea is correct
Do you have an account of why MIRI's supposed impossibility results (I think these exist?) are false?
I'm not necessarily accusing you of any error (if the contest is fixated on the utility function version), but it was misleading to me as someone who read your comment but not the contest details.
Portions in [brackets] are insertions/replacements by me