Last year Stuart Armstrong announced a contest to come up with the best questions to ask an Oracle AI. Wei Dai wrote,
Submission. For the counterfactual Oracle, ask the Oracle to predict the n best posts on AF during some future time period (counterfactually if we didn’t see the Oracle’s answer).
He later related his answer to Paul Christiano's posts on Human-in-the-counterfactual-loop and Elaborations on apprenticeship learning. Here I'm interested in concrete things that could go wrong in the near future if we gave GPT-N this task.
To provide a specific example, suppose we provided the prompt,
This is the first post in an Alignment Forum sequence explaining the approaches both MIRI and OpenAI staff believe are the most promising means of auditing the cognition of very complex machine learning models.
If, by assumption, GPT-N is at least as good as a human expert team at generating blog posts, we could presumably expect it to produce a very high quality post explaining how to inspect machine learning models. We would therefore have a way to automate alignment research at a high level. But a number of important questions remain, such as:
- How large would GPT-N need to be before it started producing answers comparable to those of a human expert team?
- Given the size of the model, what high-level incentives should we expect to guide its training? In other words, what mesa-optimization-like behaviors, exactly, should we expect training to produce?
- Is there a clear and unambiguous danger that the model would be manipulative? If so, why?
- Is the threat model more that we don't know what we don't know, or that we have a specific reason to believe the model would be manipulative in a particular direction?
Upon reflection, you're right that it won't be maximizing response per se.
But as we get deeper it's not so straightforward. GPT-3 models can be trained to minimize prediction loss (or, plainly speaking, simply to predict more accurately) on many different tasks, which are usually very simply stated (e.g. choose a word to fill in the blank).
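To make the objective concrete, here is a minimal sketch of the per-token cross-entropy loss that GPT-style models minimize. The vocabulary and probability distribution below are a hypothetical toy example, not anything from an actual model:

```python
import math

def prediction_loss(probs, target_index):
    """Cross-entropy for a single blank: -log(probability the model
    assigned to the correct word). Lower loss = more accurate prediction."""
    return -math.log(probs[target_index])

# Toy model output over a 4-word vocabulary for "The cat sat on the ___"
vocab = ["mat", "dog", "sky", "run"]
probs = [0.7, 0.1, 0.1, 0.1]  # hypothetical predicted distribution

loss_good = prediction_loss(probs, vocab.index("mat"))  # correct word favored
loss_bad = prediction_loss(probs, vocab.index("sky"))   # correct word disfavored
# loss_good < loss_bad: the objective rewards nothing beyond predicting
# the next token accurately, averaged over the training corpus.
```

Nothing in this objective mentions truthfulness or persuasiveness; those properties, if they appear, are side effects of what predicts the corpus well.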
But then people take models trained this way and use them to generate long texts from some primer. And yes, in most cases such off-label use of the model will produce text that is merely coherent. But I would expect humans to have a tendency to conflate coherence with persuasiveness.
I suppose one could fairly easily choose a prediction loss for GPT-3 models such that the longer texts would have some desired characteristics. But even the standard tasks probably shape GPT-3 to keep producing vague sentences that continue the primer and give the reader a feeling that it "makes sense". That could well mean producing fairly persuasive texts that reinforce the primer's thesis.