In his AI Safety “Success Stories” post, Wei Dai writes:
[This] comparison table makes Research Assistant seem a particularly attractive scenario to aim for, as a stepping stone to a more definitive success story. Is this conclusion actually justified?
I share Wei Dai's intuition that the Research Assistant path is neglected, and I want to better understand the safety problems involved in this path.
Specifically, I'm envisioning AI research assistants, built without any kind of reinforcement learning, that help AI alignment researchers identify, understand, and solve AI alignment problems. Some concrete examples:
Possible with yesterday's technology: Document clustering that automatically organizes every blog post about AI alignment. Recommendation systems that find AI alignment posts similar to the one you're reading and identify connections between the thinking of various authors. (A rough sketch of what this could look like appears after this list.)
May be possible with current or near future technology: An AI chatbot, trained on every blog post about AI alignment, which makes the case for AI alignment to skeptics or attempts to shoot down FAI proposals. Text summarization software that compresses a long discussion between two forum users in a way that both feel is accurate and fair. An NLP system that automatically organizes AI safety writings into a problem/solution table as I described in this post. (The second sketch after this list gestures at the summarization piece.)
May be possible with future breakthroughs in unsupervised learning, generative modeling, natural language understanding, etc.: An AI system that generates novel FAI proposals, or writes code for an FAI directly, and tries to break its own designs. An AI system that augments the problem/solution table from this post with new rows and columns generated based on original reasoning.
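To give a sense of how little machinery the "yesterday's technology" category needs, here is a minimal sketch of the clustering and related-posts examples using off-the-shelf tools. The post texts, cluster count, and overall setup are placeholders I made up for illustration, not part of any existing system:

```python
# Rough sketch of the "yesterday's technology" examples: cluster alignment posts
# into topics and recommend similar posts. The post strings below are placeholders;
# a real version would load scraped blog posts instead.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

posts = [
    "Corrigibility and the shutdown problem ...",
    "Mesa-optimizers and inner alignment ...",
    "Iterated amplification as an alignment scheme ...",
    "Why corrigible agents may resist shutdown ...",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(posts)  # TF-IDF bag-of-words features, one row per post

# Document clustering: group posts into topical clusters.
cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Recommendation: rank the other posts by cosine similarity to post 0.
similarities = cosine_similarity(X[0], X).ravel()
most_similar = similarities.argsort()[::-1][1:3]  # indices of the two closest posts

print("cluster labels:", cluster_labels)
print("posts most similar to post 0:", most_similar)
```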
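And for the summarization example in the second category, the mechanics (though certainly not the "accurate and fair" guarantee) seem to be within reach of current pretrained models. A minimal sketch, assuming the Hugging Face transformers library and its default summarization checkpoint; the discussion text is a made-up placeholder:

```python
# Minimal sketch of the summarization example using an off-the-shelf abstractive
# model. Whether the output is "accurate and fair" to both participants is exactly
# the open problem; this only shows that the mechanics exist today.
from transformers import pipeline

discussion = (
    "Alice: I think value learning is doomed, because human preferences are "
    "inconsistent and context-dependent. "
    "Bob: I disagree; the inconsistency can itself be modeled, and the key "
    "question is how much of it matters for the decisions we care about. "
    # ... the full forum exchange would be concatenated here
)

summarizer = pipeline("summarization")  # downloads a default pretrained checkpoint
summary = summarizer(discussion, max_length=60, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```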
What safety problems are involved in creating research assistants of this sort? I'm especially interested in safety problems which haven't yet received much attention, and safety problems with advanced assistants based on future breakthroughs.
I suspect that the concept of utility functions that are specified over your actions is fuzzy in a problematic way. Does it refer to utility functions that are defined over the physical representation of the computer (e.g. the configuration of atoms in the RAM cells whose values represent the selected action)? If so, we're talking about systems that 'want to affect (some part of) the world', and thus we should expect such systems to have convergent instrumental goals with respect to our world (e.g. gaining control over as many resources in our world as possible).
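One way to make the two readings explicit (my own notation, just to pin down where the fuzziness is):

$$U_{\text{act}} : \mathcal{A} \to \mathbb{R} \qquad \text{versus} \qquad U_{\text{world}} : \mathcal{S} \to \mathbb{R},\quad U_{\text{world}}(s) = g\big(\text{contents of the output register in world-state } s\big)$$

The first is a function of the abstract action label only; the second is a function of the entire physical world-state that merely happens to read off the memory cells encoding the action. Only on the second reading is the system optimizing over world-states, which is where the convergent-instrumental-goal worry comes in.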
It seems possible that something like this has happened. Though as far as I know, we don't currently know how to model contemporary supervised learning at an arbitrarily large scale in complicated domains.
How do you model the behavior of the model on examples outside the training set? If your answer contains the phrase "training distribution" then how do you define the training distribution? What makes the training distribution you have in mind special relative to all the other training distributions that could have produced the particular training set that you trained your model on?
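A toy illustration of the worry (a hypothetical example of my own, with made-up data): two models that both fit the same training set essentially perfectly can disagree wildly on a point outside it, so the training set by itself does not single out one "training distribution" or one off-distribution behavior.

```python
# Two models with near-zero error on the same training set can behave completely
# differently far from it; nothing in the training set distinguishes them.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X_train = np.sort(rng.uniform(0.0, 1.0, size=(10, 1)), axis=0)  # all inputs lie in [0, 1]
y_train = np.sin(2 * np.pi * X_train).ravel()                   # some ground-truth signal

# Model A: degree-9 polynomial, interpolates the 10 training points almost exactly.
poly = make_pipeline(PolynomialFeatures(degree=9), LinearRegression()).fit(X_train, y_train)
# Model B: 1-nearest-neighbour, also reproduces the training points exactly.
knn = KNeighborsRegressor(n_neighbors=1).fit(X_train, y_train)

x_far = np.array([[5.0]])  # far outside everything seen during training
print("polynomial prediction at x=5:", poly.predict(x_far))  # typically huge in magnitude
print("1-NN prediction at x=5:     ", knn.predict(x_far))    # just the nearest training label
```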
Therefore, I'm sympathetic to the following perspective, from Armstrong and O'Rourke (2018) (the last sentence was also quoted in the grandparent):