In his AI Safety “Success Stories” post, Wei Dai writes:
[This] comparison table makes Research Assistant seem a particularly attractive scenario to aim for, as a stepping stone to a more definitive success story. Is this conclusion actually justified?
I share Wei Dai's intuition that the Research Assistant path is neglected, and I want to better understand the safety problems involved in this path.
Specifically, I'm envisioning AI research assistants, built without any kind of reinforcement learning, that help AI alignment researchers identify, understand, and solve AI alignment problems. Some concrete examples:
Possible with yesterday's technology: Document clustering that automatically organizes every blog post about AI alignment. Recommendation systems that find AI alignment posts similar to the one you're reading & identify connections between the thinking of various authors.
May be possible with current or near-future technology: An AI chatbot, trained on every blog post about AI alignment, which makes the case for AI alignment to skeptics or attempts to shoot down FAI proposals. Text summarization software that compresses a long discussion between two forum users in a way that both feel is accurate and fair. An NLP system that automatically organizes AI safety writings into a problem/solution table as I described in this post.
May be possible with future breakthroughs in unsupervised learning, generative modeling, natural language understanding, etc.: An AI system that generates novel FAI proposals, or writes code for an FAI directly, and tries to break its own designs. An AI system that augments the problem/solution table from this post with new rows and columns generated based on original reasoning.
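To make the "yesterday's technology" tier concrete, here is a minimal sketch of a recommender that finds the post most similar to the one you're reading. It uses TF-IDF weighting and cosine similarity, which is my choice of technique for illustration and not something the examples above commit to. The function names and the three sample post titles are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per document."""
    term_counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    # Smoothed inverse document frequency, so no term gets a zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in counts.items()} for counts in term_counts]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(docs, query_index):
    """Index of the document most similar to docs[query_index].

    Assumes docs contains at least two documents.
    """
    vectors = tfidf_vectors(docs)
    scored = [
        (cosine(vectors[query_index], v), i)
        for i, v in enumerate(vectors)
        if i != query_index
    ]
    return max(scored)[1]


# Hypothetical post titles, made up for this example.
posts = [
    "Corrigibility and shutdown: why an agent might resist being switched off",
    "Reward hacking examples in reinforcement learning agents",
    "An agent that accepts shutdown: notes on corrigibility",
]
recommended = most_similar(posts, 0)
```

On these three titles the first and third share the terms "corrigibility", "shutdown", "an", and "agent", so the third is recommended for the first. A real system would run on full post text and probably use learned embeddings, but the pipeline has the same shape.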
What safety problems are involved in creating research assistants of this sort? I'm especially interested in safety problems which haven't yet received much attention, and safety problems with advanced assistants based on future breakthroughs.
It seems like the main problem is making sure nobody's getting systematically misled. To help humans make the right updates, the AI has to communicate not only accurate results, but well-calibrated uncertainties. It also has to interact with humans in a way that doesn't send the wrong signals (more a problem to do with humans than to do with AI).
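As a rough illustration of what checking for "well-calibrated uncertainties" could look like in practice, here is a small sketch that scores an assistant's stated probabilities against what actually happened. The Brier score and the binned calibration table are standard forecasting tools that I'm bringing in for illustration; the forecast data is invented, and nothing here is a claim about how such an assistant would actually be evaluated.

```python
def brier_score(forecasts):
    """Mean squared error between stated probabilities and outcomes.

    forecasts: list of (stated_probability, outcome) pairs, outcome is 0 or 1.
    Lower is better; always answering 0.5 scores 0.25.
    """
    return sum((p - o) ** 2 for p, o in forecasts) / len(forecasts)


def calibration_table(forecasts, n_bins=5):
    """Group forecasts into probability bins and compare stated vs. observed.

    Returns a list of (mean_stated_probability, observed_frequency, count)
    for each non-empty bin. For a calibrated forecaster the first two
    numbers in each row are close to each other.
    """
    bins = [[] for _ in range(n_bins)]
    for p, o in forecasts:
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, o))
    table = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            observed = sum(o for _, o in members) / len(members)
            table.append((mean_p, observed, len(members)))
    return table


# Invented data: an assistant that says "5% likely" about four failure
# modes, all of which turn out to be real.
underconfident = [(0.05, 1)] * 4
score = brier_score(underconfident)
table = calibration_table(underconfident)
```

Here the single table row pairs a stated probability of about 0.05 with an observed frequency of 1.0, which is the kind of gap that would systematically mislead a human relying on the numbers.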
This is very much on the near-term side of the near/long term AI safety work dichotomy. We don't need the AI to understand deception as a category, and why it's bad, so that it can make plans that don't involve deceiving us. We just need its training / search process (which we expect to more or less understand) to suppress incentives for deception to an acceptable range, on a limited domain of everyday problems.
(I'm probably a bigger believer in the significance of this dichotomy than most. I think looking at an AI's behavior and then tinkering with the training procedure to eliminate undesired behavior in the training domain is a perfectly good approach to handling near-term misalignment like overconfident advisor-chatbots, but eventually we want to switch over to a more scalable approach that will use few of the same tools.)
I agree well-calibrated uncertainties are quite valuable, but I'm not convinced they are essential for this sort of application. For example, suppose my assistant tells me a story about how my proposed FAI could fail. If my assistant is overconfident in its pessimism, then the worst case is that I spend a lot of time thinking about the failure mode without seeing how it could happen (not that bad). If my assistant is underconfident, and tells me a failure mode is 5% likely when it's really 95% likely, it still feels like my assistant is being overall helpful