Summary: I’m going to give a $10k prize to the best evidence that my preferred approach to AI safety is doomed. Submit by commenting on this post with a link by April 20.
I have a particular vision for how AI might be aligned with human interests, reflected in posts at ai-alignment.com and centered on iterated amplification.
This vision has a huge number of possible problems and missing pieces; it’s not clear whether these can be resolved. Many people endorse this or a similar vision as their current favored approach to alignment, so it would be extremely valuable to learn about dealbreakers as early as possible (whether to adjust the vision or abandon it).
Here’s the plan:
- If you want to argue that this approach is doomed, or to explore a reason it may be, I strongly encourage you to do that.
- Post a link to any relevant research/argument/evidence (a paper, blog post, repo, whatever) in the comments on this post.
- The contest closes April 20.
- You can submit content that was published before this prize was announced.
- I’ll use some process to pick my favorite 1-3 contributions. This might involve delegating to other people or might involve me just picking. I make no promise that my decisions will be defensible.
- I’ll distribute (at least) $10k amongst my favorite contributions.
If you think that some other use of this money or some other kind of research would be better for AI alignment, I encourage you to apply for funding to do that (or just to say so in the comments).
This prize is unrelated to the broader AI alignment prize. (Reminder: the next round closes March 31. Feel free to submit something to both.)
This contest is not intended to be “fair”---the ideas I’m interested in have not been articulated clearly, so even if they are totally wrong-headed it may not be easy to explain why. The point of the exercise is not to prove that my approach is promising because no one can prove it’s doomed. The point is just to have a slightly better understanding of the challenges.
Edited to add the results:
- $5k for this post by Wei_Dai, and the preceding/following discussion, which raise some points about the difficulty of learning corrigibility in small pieces.
- $3k for Point 1 from this comment by eric_langlois, an intuition pump for why security amplification is likely to be more difficult than you might think.
- $2k for this post by William_S, which clearly explains a consideration / design constraint that would make people less optimistic about my scheme. (This fits under "summarizing/clarifying" rather than novel observation.)
Thanks to everyone who submitted a criticism! Overall I found this process useful for clarifying my own thinking (and highlighting places where I could make it easier to engage with my research by communicating more clearly).
Background on what I’m looking for
I’m most excited about particularly thorough criticism that either makes tight arguments or “plays both sides”---points out a problem, explores plausible responses to the problem, and shows that natural attempts to fix the problem systematically fail.
If I thought I had a solution to the alignment problem I’d be interested in highlighting any possible problem with my proposal. But that’s not the situation yet; I’m trying to explore an approach to alignment and I’m looking for arguments that this approach will run into insuperable obstacles. I'm already aware that there are plenty of possible problems. So a convincing argument needs to establish a universal quantifier over potential solutions to a possible problem: it should show that natural fixes systematically fail.
On the other hand, I’m hoping that we'll solve alignment in a way that knowably works under extremely pessimistic assumptions, so I’m fine with arguments that make weird assumptions or consider weird situations / adversaries.
Examples of interlocking obstacles I think might totally kill my approach:
- Amplification may be doomed because there are important parts of cognition that are too big to safely learn from a human, yet can’t be safely decomposed. (Relatedly, security amplification might be impossible.)
- A clearer inspection of what amplification needs to do (e.g. building a competitive model of the world in which an amplified human can detect incorrigible behavior) may show that amplification isn’t getting around the fundamental problems that MIRI is interested in and will only work if we develop a much deeper understanding of effective cognition.
- There may be kinds of errors (or malign optimization) that are amplified by amplification and can’t be easily controlled (or this concern might be predictably hard to address in advance by theory+experiment); the toy sketch after this list illustrates the simplest version of this worry.
- Corrigibility may be incoherent, or may not actually be easy enough to learn, or may not confer the kind of robustness to prediction errors that I’m counting on, or may not be preserved by amplification.
- Satisfying safety properties in the worst case (like corrigibility) may be impossible. See this post for my current thoughts on plausible techniques. (I’m happy to provisionally grant that optimization daemons would be catastrophic if you couldn’t train robust models.)
- Informed oversight might be impossible even if amplification works quite well. (This is most likely to be impossible in the context of determining what behavior is catastrophic.)
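To make the error-amplification bullet above more concrete, here is a toy calculation. The independence assumption and the "wrong if any subcall fails" failure model are mine, not from the post, and real amplification trees can also correct errors, so this only illustrates the simplest version of the worry.

```python
# Toy model of error amplification: each delegated subcall independently
# fails with probability p, and the amplified answer is wrong whenever any
# subcall fails. Small per-call error rates then compound quickly.

def failure_probability(p_subcall: float, num_subcalls: int) -> float:
    """Chance that at least one of num_subcalls independent subcalls fails."""
    return 1 - (1 - p_subcall) ** num_subcalls

for n in (10, 100, 1000):
    print(n, round(failure_probability(0.001, n), 3))
# 10 0.01
# 100 0.095
# 1000 0.632
```

The open question in that bullet is whether there are kinds of errors (or adversarially optimized behavior) for which no error correction at higher levels of the tree works, so that something like this compounding is unavoidable.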
I value objections but probably won't have time to engage significantly with most of them. That said: (a) I’ll be able to engage in a limited way, and will engage with objections that significantly shift my view, (b) thorough objections can produce a lot of value even if no proponent publicly engages with them, since they can be convincing on their own, (c) in the medium term I’m optimistic about starting a broader discussion about iterated amplification which involves proponents other than me.
I think our long-term goal should be to find, for each powerful AI technique, an analog of that technique that is aligned and works nearly as well. My current work is trying to find analogs of model-free RL or AlphaZero-style model-based RL. I think that these are the most likely forms for powerful AI systems in the short term, that they are particularly hard cases for alignment, and that they are likely to turn up alignment techniques that are very generally applicable. So for now I’m not trying to be competitive with other kinds of AI systems.
I think (based on reading Paul's blog posts) that knowledge isolation provides these benefits:
The distribution of training and test examples for the distilled agent is kept as similar as possible (possibly identical, or possibly close enough that you can ask for new training data when you find something too far out of distribution). Suppose we allow for unlimited knowledge sharing. The training data gathered from humans will only include examples of humans processing some limited amount of information, and that information will have been produced in a fairly normal set of circumstances that occur during training. But as the IDA procedure continues, later agents will have to deal with much larger amounts of data generated under potentially weirder circumstances. So distributional shift will become more of a problem.
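For readers who haven't seen the amplify/distill loop written out, here is a minimal runnable toy of iterated amplification. The task (summing a list), the halving decomposition, and the "distill by memoization" step are illustrative assumptions of mine, not Paul's actual proposal, but the structure shows why the distilled agent only behaves well on inputs resembling its training data.

```python
def human(question):
    """Stand-in for a human overseer who can only handle small questions."""
    return sum(question) if len(question) <= 2 else None

def amplify(agent, question):
    """The overseer answers a big question by splitting it in half and
    delegating the halves to the current distilled agent."""
    direct = human(question)
    if direct is not None:
        return direct
    mid = len(question) // 2
    return agent(question[:mid]) + agent(question[mid:])

def distill(training_pairs):
    """Stand-in for training a fast model on (question, answer) pairs;
    here we just memoize and fall back to 0 on anything unseen."""
    table = {tuple(q): a for q, a in training_pairs}
    return lambda q: table.get(tuple(q), 0)

def iterated_amplification(questions, rounds):
    agent = lambda q: 0                                    # initial agent knows nothing
    for _ in range(rounds):
        labels = [amplify(agent, q) for q in questions]    # amplification
        agent = distill(list(zip(questions, labels)))      # distillation
    return agent

data = [3, 1, 4, 1, 5, 9, 2, 6]
# Train on every contiguous sublist so the memoizing agent has seen the
# subquestions it gets delegated; a real learned model would generalize.
questions = [data[i:j] for i in range(len(data)) for j in range(i + 1, len(data) + 1)]
agent = iterated_amplification(questions, rounds=3)
print(agent(data))   # 31, i.e. sum(data)
```

The memoizing stand-in only does the right thing on questions it saw during training, which is a caricature of the point above: once agents can pass around unbounded amounts of knowledge, the inputs the distilled agent faces at test time can drift far from anything in its training set.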
Security amplification. In Universality and Security Amplification, Paul describes limiting the amount of information accessible to each agent to the extent that we can perform red-teaming over the set of possible inputs an agent will have to process, and so have confidence that agents won't be attacked by any input they receive. He acknowledges that this will limit capabilities (so the system wouldn't be able to, say, translate between languages as well as a human translator). But he claims that 1) the system doesn't necessarily need to perform all tasks itself; instead it can just learn how to safely use external humans or systems, and 2) even the information-limited set of queries the system can answer will still include a "simple core of reasoning" sufficient for this task. (I'm still trying to wrap my head around whether this kind of system will be able to have sufficient capabilities.)
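As a concrete (and heavily simplified) illustration of this information-limiting idea, here is a sketch in which no single agent call ever sees more than a small, fixed number of tokens, so the space of possible inputs to any one call is small enough to red-team exhaustively, while larger queries are handled by chunking and aggregation. The cap, the chunking scheme, and the word-counting task are my own illustrative assumptions, not the scheme from Universality and Security Amplification.

```python
MAX_TOKENS_PER_CALL = 4   # small enough that every possible input can be red-teamed

def bounded_agent(task, chunk):
    """A single agent call: it only ever sees `chunk`, never the full input."""
    assert len(chunk) <= MAX_TOKENS_PER_CALL
    if task == "count":
        word, tokens = chunk[0], chunk[1:]
        return tokens.count(word)
    raise ValueError(f"unknown task: {task}")

def answer_large_query(word, document_tokens):
    """Answer over an arbitrarily large document by splitting it into bounded
    chunks and aggregating the bounded agent's answers; no single call ever
    exceeds the information cap."""
    chunk_size = MAX_TOKENS_PER_CALL - 1     # leave one slot for the query word
    total = 0
    for i in range(0, len(document_tokens), chunk_size):
        piece = [word] + document_tokens[i:i + chunk_size]
        total += bounded_agent("count", piece)
    return total

doc = "the cat sat on the mat near the door".split()
print(answer_large_query("the", doc))   # 3
```

The real proposal is more subtle (the agent itself decides how to decompose the task and which small pieces of the input to look at, via something like meta-execution), but the capped-input structure is what makes exhaustive red-teaming of individual calls even conceivable.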