In his AI Safety “Success Stories” post, Wei Dai writes:
[This] comparison table makes Research Assistant seem a particularly attractive scenario to aim for, as a stepping stone to a more definitive success story. Is this conclusion actually justified?
I share Wei Dai's intuition that the Research Assistant path is neglected, and I want to better understand the safety problems it involves.
Specifically, I'm envisioning AI research assistants, built without any kind of reinforcement learning, that help AI alignment researchers identify, understand, and solve AI alignment problems. Some concrete examples:
Possible with yesterday's technology: Document clustering that automatically organizes every blog post about AI alignment. Recommendation systems that find AI alignment posts similar to the one you're reading and identify connections between the thinking of various authors (see the first sketch after this list).
May be possible with current or near future technology: An AI chatbot, trained on every blog post about AI alignment, which makes the case for AI alignment to skeptics or attempts to shoot down FAI proposals. Text summarization software that compresses a long discussion between two forum users in a way that both feel is accurate and fair (see the second sketch after this list). An NLP system that automatically organizes AI safety writings into a problem/solution table as I described in this post.
May be possible with future breakthroughs in unsupervised learning, generative modeling, natural language understanding, etc.: An AI system that generates novel FAI proposals, or writes code for an FAI directly, and tries to break its own designs. An AI system that augments the problem/solution table from this post with new rows and columns generated based on original reasoning.
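As a rough illustration of the "yesterday's technology" examples, here is a minimal sketch of document clustering plus a similar-post recommender, assuming scikit-learn and a small hypothetical corpus standing in for the real body of alignment posts; it is only meant to show how little machinery these assistants need, not to be a finished tool.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus; in practice this would be every alignment blog post.
posts = [
    "Iterated amplification trains a model to imitate a human plus model committee ...",
    "An oracle AI is restricted to answering questions rather than acting ...",
    "Mesa-optimization occurs when a learned model is itself an optimizer ...",
]

# Represent each post as a sparse TF-IDF vector.
vectorizer = TfidfVectorizer(stop_words="english", max_features=20_000)
X = vectorizer.fit_transform(posts)

# Unsupervised clustering organizes the posts into topical groups
# (the number of clusters is chosen by hand here).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for label, post in zip(kmeans.labels_, posts):
    print(f"cluster {label}: {post[:40]}...")

# "Posts similar to the one you're reading": rank the rest of the corpus
# by cosine similarity to the current post.
def most_similar(post_index: int, top_k: int = 2) -> list[int]:
    sims = cosine_similarity(X[post_index], X).ravel()
    sims[post_index] = -1.0  # exclude the post itself
    return sims.argsort()[::-1][:top_k].tolist()

print(most_similar(0))
```

And a similarly rough sketch of the discussion-summarization idea, assuming the Hugging Face transformers library with its default pretrained summarization model; the thread text is invented for illustration, and nothing in the code enforces the "both users feel it is accurate and fair" property, which is the genuinely hard part.

```python
from transformers import pipeline

# Default pretrained summarization model (downloaded on first use).
summarizer = pipeline("summarization")

thread = (
    "User A argues that a boxed oracle can still exert optimization pressure "
    "through the content of its answers. User B replies that restricting the "
    "output channel to low-bandwidth predictions limits this effect, and the "
    "two go back and forth about how much bandwidth is actually needed."
)

# Compress the exchange; nothing here guarantees that both participants
# would endorse the summary as accurate and fair; that is the open problem.
summary = summarizer(thread, max_length=60, min_length=15, do_sample=False)
print(summary[0]["summary_text"])
```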
What safety problems are involved in creating research assistants of this sort? I'm especially interested in safety problems which haven't yet received much attention, and safety problems with advanced assistants based on future breakthroughs.
Sorry for the delayed response!
I'm confused about the "I'm not sure how the concept applies in other cases" part. It seems to me that 'arbitrarily capable systems that "want to affect the world" and are in an air-gapped computer' are a special case of 'agents which want to achieve a broad variety of utility functions over different states of matter'.
I'm not sure what the intended interpretation of 'unintended optimization' is, but I think that a sufficiently broad interpretation would cover the failure modes I'm talking about here.
I agree. So the following is an open question that I haven't addressed here: Would '(un)supervised learning at arbitrarily large scale' produce arbitrarily capable systems that "want to affect the world"?
I won't address this here, but I think it's a very important question that deserves a thorough examination (I plan to reply here with another comment if I end up writing something about it). For now I'll note that my best guess is that most AI safety researchers think it's at least plausible (>10%) that the answer to that question is "yes".
I believe that researchers tend to model Oracles as agents with a utility function defined over world states/histories (which would make less sense if they were confident that we can use supervised learning to train an arbitrarily powerful Oracle that does not 'want to affect the world'); a rough formalization of this distinction follows the quotes below. Here's some supporting evidence for this:
Stuart Armstrong and Xavier O'Rourke wrote in their Good and Safe Uses of AI Oracles paper:
Stuart Russell wrote in his book Human Compatible (2019):
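To make the distinction I'm drawing here concrete, one way to write it down (the notation is my own and appears in neither source) is that an Oracle modeled as an agent optimizes an objective defined over world histories, whereas a pure predictor's objective mentions only its own outputs:

```latex
% U is a utility function over the set of world histories \mathcal{H};
% P(h | \pi) is the distribution over histories induced by the Oracle's policy \pi.
\[
  \text{Agent-like Oracle:}\qquad
  \pi^{*} \in \arg\max_{\pi}\; \mathbb{E}_{h \sim P(\cdot \mid \pi)}\!\left[ U(h) \right],
  \qquad U : \mathcal{H} \to \mathbb{R}
\]
% A pure predictor's objective, by contrast, mentions only its outputs f_\theta(x)
% and the labels y, never the world states those outputs end up causing:
\[
  \text{Pure predictor:}\qquad
  \theta^{*} \in \arg\min_{\theta}\; \mathbb{E}_{(x,y) \sim D}\!\left[ \ell\!\left(f_{\theta}(x),\, y\right) \right]
\]
```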