Last year Stuart Armstrong announced a contest to come up with the best questions to ask an Oracle AI. Wei Dai wrote,
Submission. For the counterfactual Oracle, ask the Oracle to predict the n best posts on AF during some future time period (counterfactually if we didn’t see the Oracle’s answer).
He later related his answer to Paul Christiano's posts on Human-in-the-counterfactual-loop and Elaborations on apprenticeship learning. Here I'm interested in the concrete things we could expect to go wrong in the near future if we gave GPT-N this task.
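For readers unfamiliar with the counterfactual Oracle setup, here is a toy sketch of the idea: the Oracle is scored only on episodes where its answer is erased before anyone reads it, so its incentive is to predict the world as it would be if we never saw the answer. Everything below (the scalar prediction, the made-up world dynamics, the function names) is an illustration under those simplifying assumptions, not Armstrong's actual formalism:

```python
import random

def world_outcome(saw_answer, answer):
    # Toy stand-in for the quantity the Oracle predicts. If humans read the
    # answer it can influence the outcome; otherwise the world unfolds on
    # its own. (Hypothetical dynamics, for illustration only.)
    base = random.gauss(10.0, 1.0)
    return base + (0.5 * answer if saw_answer else 0.0)

def train_counterfactual_oracle(episodes=10_000, erasure_prob=0.1, lr=0.05):
    # The "Oracle" here is a single scalar estimate, updated by gradient
    # descent on squared error -- but *only* on episodes where its answer
    # was erased before anyone could read it. That is the counterfactual
    # trick: the Oracle is rewarded for predicting the world as it would
    # be if we never saw the answer.
    estimate = 0.0
    for _ in range(episodes):
        answer = estimate
        erased = random.random() < erasure_prob
        outcome = world_outcome(saw_answer=not erased, answer=answer)
        if erased:
            # Only erased episodes produce a training signal; episodes where
            # the answer is read generate no reward or loss at all.
            estimate -= lr * 2 * (answer - outcome)
    return estimate

print(train_counterfactual_oracle())  # converges near 10, the no-influence outcome
```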
To take a specific example, suppose we provided the prompt,
This is the first post in an Alignment Forum sequence explaining the approaches both MIRI and OpenAI staff believe are the most promising means of auditing the cognition of very complex machine learning models.
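As a rough illustration of what it looks like to hand this prompt to a present-day model, here is the same query run through GPT-2 via the Hugging Face transformers library, standing in for the hypothetical GPT-N; the model choice and sampling parameters are arbitrary:

```python
from transformers import pipeline

# GPT-2 as a (very weak) stand-in for GPT-N.
generator = pipeline("text-generation", model="gpt2")

prompt = (
    "This is the first post in an Alignment Forum sequence explaining the "
    "approaches both MIRI and OpenAI staff believe are the most promising "
    "means of auditing the cognition of very complex machine learning models."
)

# Sample a continuation of the prompt; GPT-N would be queried the same way.
output = generator(prompt, max_length=300, do_sample=True, temperature=0.8)
print(output[0]["generated_text"])
```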
If, by assumption, GPT-N is at least as good as a human expert team at generating blog posts, we could presumably expect it to produce a very high-quality post explaining how to inspect machine learning models. We would therefore have a way to automate alignment research at a high level. But a number of important questions remain, such as:
- How large would GPT-N need to be before it started producing answers comparable to those of a human expert team?
- Given the size of the model, what high-level incentives should we expect to guide its training? In other words, what mesa-optimization-like processes, exactly, should we expect training to produce?
- Is there a clear and unambiguous danger that the model would be manipulative? If so, why?
- Is the threat model more that we don't know what we don't know, or that we have a specific reason to believe the model would be manipulative in a particular direction?
The mesa-objective could be perfectly aligned with the base-objective (predicting the next token) and still have terrible unintended consequences, because the base-objective itself is unaligned with actual human values. A superintelligent GPT-N which simply wants to predict the next token could, for example, try to break out of the box in order to obtain more resources, and use those resources to predict the next token more accurately. This would have to happen during a single inference step, because GPT-N really just wants to predict the next token, but its mesa-optimization process may conclude that world domination is the best way of doing so. Whether such a system could be learned by current gradient-descent optimizers is unclear to me.
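To be explicit about what the base-objective is: it is just the next-token cross-entropy loss that GPT-style models are trained on. A minimal PyTorch sketch (the tiny stand-in model is only there to make the shapes concrete):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def next_token_loss(model, tokens):
    # The base objective: predict token t+1 from tokens <= t, scored by
    # cross-entropy. Nothing in this loss mentions human values; any
    # mesa-objective the trained model acquires is selected only for
    # driving this number down.
    logits = model(tokens[:, :-1])            # (batch, seq-1, vocab)
    targets = tokens[:, 1:]                   # the "next" tokens
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )

# Tiny stand-in model (embedding -> linear), just to show the interface.
vocab = 100
toy_model = nn.Sequential(nn.Embedding(vocab, 32), nn.Linear(32, vocab))
tokens = torch.randint(0, vocab, (4, 16))     # batch of 4 sequences, length 16
print(next_token_loss(toy_model, tokens))
```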