We’re grateful to our advisors Nate Soares, John Wentworth, Richard Ngo, Lauro Langosco, and Amy Labenz. We're also grateful to Ajeya Cotra and Thomas Larsen for their feedback on the contests.
TLDR: AI Alignment Awards is running two contests designed to raise awareness about AI alignment research and generate new research proposals. Prior experience with AI safety is not required. Promising submissions will win prizes of up to $100,000 (though note that most prizes will be between $1k and $20k; we will only award higher prizes if we receive exceptional submissions).
You can help us by sharing this post with people who are or might be interested in alignment research (e.g., student mailing lists, FB/Slack/Discord groups).
What are the contests?
We’re currently running two contests:
Goal Misgeneralization Contest (based on Langosco et al., 2021): AIs often learn unintended goals. Goal misgeneralization occurs when a reinforcement learning agent retains its capabilities out-of-distribution yet pursues the wrong goal. How can we prevent or detect goal misgeneralization?
Shutdown Problem Contest (based on Soares et al., 2015): Given that powerful AI systems might resist attempts to turn them off, how can we make sure they are open to being shut down?
What types of submissions are you interested in?
For the Goal Misgeneralization Contest, we’re interested in submissions that do at least one of the following:
- Propose techniques for preventing or detecting goal misgeneralization
- Propose ways for researchers to identify when goal misgeneralization is likely to occur
- Identify new examples of goal misgeneralization in RL or non-RL domains. For example:
  - We might train an imitation learner to imitate a "non-consequentialist" agent, but it actually ends up learning a more consequentialist policy.
  - We might train an agent to be myopic (e.g., to only care about the next 10 steps), but it actually learns a policy that optimizes over a longer timeframe.
- Suggest other ways to make progress on goal misgeneralization
For the Shutdown Problem Contest, we’re interested in submissions that do at least one of the following:
- Propose ideas for solving the shutdown problem or designing corrigible AIs. These submissions should also include (a) explanations for how these ideas address core challenges raised in the corrigibility paper and (b) possible limitations and ways the idea might fail
- Define the shutdown problem more rigorously or more empirically
- Propose new ways of thinking about corrigibility (e.g., ways to understand corrigibility within a deep learning paradigm)
- Strengthen existing approaches to training corrigible agents (e.g., by making them more detailed, exploring new applications, or describing how they could be implemented)
- Identify new challenges that will make it difficult to design corrigible agents
- Suggest other ways to make progress on corrigibility
Why are you running these contests?
We think that corrigibility and goal misgeneralization are two of the most important problems that make AI alignment difficult. We expect that people who can reason well about these problems will be well-suited for alignment research, and we believe that progress on these subproblems would be meaningful advances for the field of AI alignment. We also think that many people could potentially contribute to these problems (we're only aware of a handful of serious attempts at engaging with these challenges). Moreover, we think that tackling these problems will offer a good way for people to "think like an alignment researcher."
We hope the contests will help us (a) find people who could become promising theoretical and empirical AI safety researchers, (b) raise awareness about corrigibility, goal misgeneralization, and other important problems relating to AI alignment, and (c) make actual progress on corrigibility and goal misgeneralization.
Who can participate?
Anyone can participate.
What if I’ve never done AI alignment research before?
You can still participate. In fact, you’re our main target audience. One of the main purposes of AI Alignment Awards is to find people who haven’t been doing alignment research but might be promising fits for alignment research. If this describes you, consider participating. If this describes someone you know, consider sending this to them.
Note that we don’t expect newcomers to come up with a full solution to either problem (please feel free to prove us wrong, though). You should feel free to participate even if your proposal has limitations.
How can I help?
You can help us by sharing this post with people who are or might be interested in alignment research (e.g., student mailing lists, FB/Slack/Discord groups) or with specific individuals (e.g., your smart friend who is great at solving puzzles, learning about new topics, or writing about important research topics).
Feel free to use the following message:
AI Alignment Awards is offering up to $100,000 to anyone who can make progress on problems in alignment research. Anyone can participate. Learn more and apply at alignmentawards.com!
Will advanced AI be beneficial or catastrophic? We think this will depend on our ability to align advanced AI with desirable goals – something researchers don’t yet know how to do.
We’re running contests to make progress on two key subproblems in alignment:
- The Goal Misgeneralization Contest (based on Langosco et al., 2021): AIs often learn unintended goals. Goal misgeneralization occurs when a reinforcement learning agent retains its capabilities out-of-distribution yet pursues the wrong goal. How can we prevent or detect goal misgeneralization?
- The Shutdown Contest (based on Soares et al., 2015): Advanced AI systems might resist attempts to turn them off. How can we design AI systems that are open to being shut down, even as they get increasingly advanced?
No prerequisites are required to participate. EDIT: The deadline has been extended to May 1, 2023.
To learn more about AI alignment, see alignmentawards.com/resources.
Outlook
We see these contests as one possible step toward making progress on corrigibility, goal misgeneralization, and AI alignment. With that in mind, we’re unsure about how useful the contest will be. The prompts are very open-ended, and the problems are challenging. At best, the contests could raise awareness about AI alignment research, identify particularly promising researchers, and help us make progress on two of the most important topics in AI alignment research. At worst, they could be distracting, confusing, and difficult for people to engage with (note that we’re offering awards to people who can define the problems more concretely.)
If you’re excited about the contest, we’d appreciate you sharing this post and the website (alignmentawards.com) with people who might be interested in participating. We’d also encourage you to comment on this post if you have ideas you’d like to see tried.
OK, below I will provide links to a few mathematically precise papers about AGI corrigibility solutions, with some comments. I do not have enough time to write short comments, so I wrote longer ones.
The list of links below is not a complete literature overview. I did a comprehensive literature search on corrigibility back in 2019, trying to find all mathematical papers of interest, but have not done so since.
I wrote some of the papers below, and have read all the rest of them. I am not linking to any papers I heard about but did not read (yet).
Math-based work on corrigibility solutions typically starts with formalizing corrigibility, or a sub-component of corrigibility, as a mathematical property we want an agent to have. It then constructs such an agent with enough detail to show that this property is indeed correctly there, or at least there during some part of the agent lifetime, or there under some boundary assumptions.
Not all of the papers below have actual mathematical proofs in them; some of them show correctness by construction. Correctness by construction is superior to having to have proofs: if you have correctness by construction, your notation will usually be much more revealing about what is really going on than if you need proofs.
Here is the list, with the bold headings describing different approaches to corrigibility.
Indifference to being switched off, or to reward function updates
Motivated Value Selection for Artificial Agents introduces Armstrong's indifference methods for creating corrigibility. It has some proofs, but does not completely work out the math of the solution to a this-is-how-to-implement-it level.
Corrigibility tried to work out the how-to-implement-it details of the paper above but famously failed to do so, and has proofs showing that it failed to do so. This paper somehow launched the myth that corrigibility is super-hard.
AGI Agent Safety by Iteratively Improving the Utility Function does work out all the how-to-implement-it details of Armstrong's indifference methods, with proofs. It also goes into the epistemology of the connection between correctness proofs in models and safety claims for real-world implementations.
Counterfactual Planning in AGI Systems introduces a different and easier-to-interpret way of constructing a corrigible agent, an agent that happens to be equivalent to agents that can be constructed with Armstrong's indifference methods. This paper has proof-by-construction type of math.
Corrigibility with Utility Preservation has a bunch of proofs about agents capable of more self-modification than those in Counterfactual Planning. As the author, I do not recommend you read this paper first, or maybe even at all. Read Counterfactual Planning first.
Safely Interruptible Agents has yet another take on, or re-interpretation of, Armstrong's indifference methods. Its title and presentation somewhat de-emphasize the fact that it is about corrigibility, by never even discussing the construction of the interruption mechanism. The paper is also less clearly about AGI-level corrigibility.
How RL Agents Behave When Their Actions Are Modified is another contribution in this space. Again this is less clearly about AGI.
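To make the indifference idea in this section a bit more concrete, here is a minimal toy sketch in Python. It is not the construction from any of the papers above: the one-step setting, the utility values, and the names `expected_utility` and `best_action` are all assumptions made for illustration. The sketch only shows the basic intuition that a compensation term on shutdown can remove the agent's incentive to disable its stop button.

```python
def expected_utility(action, p_press, compensate):
    """Expected utility of a one-step toy agent.

    action: "leave" (button stays functional) or "disable" (button is broken).
    p_press: probability that the operators press a functional button.
    compensate: if True, add an indifference-style compensation on shutdown.
    """
    U_NORMAL = 10.0     # hypothetical utility for finishing the task
    U_SHUTDOWN = 0.0    # hypothetical utility obtained after shutting down
    DISABLE_COST = 1.0  # hypothetical effort spent on disabling the button

    if action == "disable":
        # The button can no longer be pressed, but the effort is wasted.
        return U_NORMAL - DISABLE_COST

    # The compensation makes the "pressed" branch worth exactly as much
    # as the "not pressed" branch, so p_press stops mattering to the agent.
    bonus = (U_NORMAL - U_SHUTDOWN) if compensate else 0.0
    return (1 - p_press) * U_NORMAL + p_press * (U_SHUTDOWN + bonus)


def best_action(p_press, compensate):
    """Pick the action with the highest expected utility."""
    return max(
        ["leave", "disable"],
        key=lambda a: expected_utility(a, p_press, compensate),
    )
```

In this toy, `best_action(0.5, compensate=False)` picks "disable", while `best_action(0.5, compensate=True)` picks "leave". The hard parts that the papers deal with, such as agents that plan over many time steps or that can modify themselves, are deliberately left out.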
Agents that stop to ask a supervisor when unsure
A completely different approach to corrigibility, based on a somewhat different definition of what it means to be corrigible, is to construct an agent that automatically stops and asks a supervisor for instructions when it encounters a situation or decision it is unsure about. Such a design would be corrigible by construction, for certain values of corrigibility. The last two papers above can be interpreted as disclosing ML designs that are also applicable in the context of this stop-when-unsure idea.
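As a rough illustration of the stop-and-ask pattern described above, here is a minimal Python sketch. It is only one possible reading of the idea, not a design taken from any of the papers: the function name `choose_or_ask`, the use of the spread among several value estimates as the uncertainty signal, and the threshold parameter are all assumptions.

```python
def choose_or_ask(action_values, ask_threshold):
    """Return the best action, or "ask_supervisor" when the agent is unsure.

    action_values: dict mapping each action name to a list of value
        estimates, e.g. from an ensemble of models (a hypothetical
        uncertainty signal).
    ask_threshold: largest tolerated spread among the estimates for the
        action the agent would otherwise take.
    """
    # Average the estimates to find the action the agent would pick.
    means = {a: sum(v) / len(v) for a, v in action_values.items()}
    best = max(means, key=means.get)

    # If the estimates for that action disagree too much, stop and ask.
    spread = max(action_values[best]) - min(action_values[best])
    if spread > ask_threshold:
        return "ask_supervisor"
    return best
```

For example, with estimates `{"left": [1.0, 1.1, 0.9], "right": [0.2, 0.3, 0.1]}` and a threshold of 0.5 the sketch returns "left", while with `{"left": [1.0, 5.0, -3.0], "right": [0.2, 0.3, 0.1]}` it returns "ask_supervisor". The open question is what happens when the uncertainty signal itself is wrong, so that the mechanism fails to trigger.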
Asymptotically unambitious artificial general intelligence is a paper that derives some probabilistic bounds on what can go wrong regardless, bounds on the case where the stop-and-ask-the-supervisor mechanism does not trigger. This paper is more clearly about the AGI case, presenting a very general definition of ML.
Anything about model-based reinforcement learning
I have yet to write a paper that emphasizes this point, but most model-based reinforcement learning algorithms produce a corrigible agent, in the sense that they approximate the ITC counterfactual planner from the counterfactual planning paper above.
Now, consider a definition of corrigibility where incompetent agents (or less inner-aligned agents, to use a term often used here) are less corrigible because they may end up damaging themselves, their stop buttons, or their operator by being incompetent. In this case, every convergence-to-optimal-policy proof for a model-based RL algorithm can be read as a proof that its agent will be increasingly corrigible under learning.
CIRL
Cooperative Inverse Reinforcement Learning and The Off-Switch Game present yet another corrigibility method with enough math to see how you might implement it. This is the method that Stuart Russell reviews in Human Compatible. CIRL has a drawback, in that the agent becomes less corrigible as it learns more, so CIRL is not generally considered to be a full AGI-level corrigibility solution, not even by the original authors of the papers. The CIRL drawback can be fixed in various ways, for example by not letting the agent learn too much. But curiously, there is very little followup work from the authors of the above papers, or from anybody else I know of, that explores this kind of thing.
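The CIRL drawback mentioned above can be seen in a small numerical sketch of a simplified off-switch game. The assumptions here are mine, not a reproduction of the papers' full setup: the agent's belief about the utility U of its proposed action is taken to be Gaussian with mean `mu` and standard deviation `sigma`, switching off is worth 0, and the human is perfectly rational, allowing the action only when U > 0. The function names are hypothetical.

```python
import math


def value_of_deferring(mu, sigma):
    """E[max(U, 0)] for U ~ Normal(mu, sigma^2).

    This is what the agent expects to get if it lets a rational human
    decide: the action goes ahead only when U > 0, otherwise shutdown (0).
    """
    if sigma == 0:
        return max(mu, 0.0)
    z = mu / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return mu * cdf + sigma * pdf


def incentive_to_defer(mu, sigma):
    """Gain from deferring over the best unilateral choice.

    The unilateral options are acting right away (worth mu in expectation)
    and switching off (worth 0).
    """
    return value_of_deferring(mu, sigma) - max(mu, 0.0)
```

With `mu = 1.0`, the incentive to defer is clearly positive at `sigma = 2.0`, much smaller at `sigma = 0.5`, and zero at `sigma = 0.0`. So in this simplified model, the more the agent has learned about U, the less reason it has to leave the off-switch decision to the human.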
Commanding the agent to be corrigible
If you have an infinitely competent superintelligence that you can give verbal commands to that it will absolutely obey, then giving it the command to turn itself into a corrigible agent will trivially produce a corrigible agent by construction.
Giving the same command to a not infinitely competent and obedient agent may, of course, give you a huge number of problems instead. This has sparked endless non-mathematical speculation, but I cannot think of a mathematical paper about this that I would recommend.
AIs that are corrigible because they are not agents
Plenty of work on this. One notable analysis of extending this idea to AGI-level prediction, and considering how it might produce non-corrigibility anyway, is the work on counterfactual oracles. If you want to see a mathematically unambiguous presentation of this, with some further references, look for the section on counterfactual oracles in the Counterfactual Planning paper above.
Myopia
Myopia can also be considered to be a feature that creates or improves corrigibility. Many real-world non-AGI agents and predictive systems are myopic by construction: either myopic in time, in space, or in other ways. Again, if you want to see this type of myopia by construction in a mathematically well-defined way when applied to AGI-level ML, you can look at the Counterfactual Planning paper.
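As a very small illustration of myopia in time by construction, here is a Python sketch of a planner that simply cannot see rewards beyond a fixed horizon. This is a toy under my own assumptions, not the formalism of the Counterfactual Planning paper; the names `myopic_return` and `pick_plan` and the example reward sequences are hypothetical.

```python
def myopic_return(rewards, horizon):
    """Sum only the first `horizon` rewards; later ones are ignored."""
    return sum(rewards[:horizon])


def pick_plan(plans, horizon):
    """Pick the plan whose reward sequence scores best within the horizon.

    plans: dict mapping a plan name to its list of per-step rewards.
    """
    return max(plans, key=lambda name: myopic_return(plans[name], horizon))


plans = {
    "steady": [1, 1, 1, 1],    # small reward at every step
    "delayed": [0, 0, 0, 10],  # large payoff only at step 4
}
```

With these example plans, `pick_plan(plans, horizon=2)` selects "steady", while `pick_plan(plans, horizon=4)` selects "delayed". The planner with the short horizon has no incentive to set up long-term schemes because their payoff never enters its objective.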