Last year Stuart Armstrong announced a contest to come up with the best questions to ask an Oracle AI. Wei Dai wrote,
Submission. For the counterfactual Oracle, ask the Oracle to predict the n best posts on AF during some future time period (counterfactually if we didn’t see the Oracle’s answer).
He later related his answer to Paul Christiano's posts on Human-in-the-counterfactual-loop and Elaborations on apprenticeship learning. Here I'm interested in concrete things that can be expected to go wrong in the near future if we gave GPT-N this task.
To provide a specific example, suppose we provided the prompt,
This is the first post in an Alignment Forum sequence explaining the approaches both MIRI and OpenAI staff believe are the most promising means of auditing the cognition of very complex machine learning models.
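Concretely, the setup might look something like the sketch below, using GPT-2 via the Hugging Face `transformers` pipeline as a stand-in for the hypothetical GPT-N (the model choice and sampling parameters are illustrative assumptions, not a claim about how GPT-N would actually be served):

```python
# A minimal sketch of issuing the prompt above, with GPT-2 standing in for a
# hypothetical GPT-N. Model choice and sampling parameters are illustrative only.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = (
    "This is the first post in an Alignment Forum sequence explaining the "
    "approaches both MIRI and OpenAI staff believe are the most promising "
    "means of auditing the cognition of very complex machine learning models."
)

# Sample a continuation; a GPT-N query would presumably look the same,
# just with a far more capable model behind the interface.
result = generator(prompt, max_new_tokens=512, do_sample=True, top_p=0.95)
print(result[0]["generated_text"])
```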
If, by assumption, GPT-N is at least as good as a human expert team at generating blog posts, we could presumably expect this GPT-N to produce a very high-quality post explaining how to inspect machine learning models. We would therefore have a way to automate alignment research at a high level. But a number of important questions remain, such as:
- How large would GPT-N need to be before it started producing answers comparable to those of a human expert team, and
- Given the size of the model, what high-level incentives should we expect to guide its training? In other words, what mesa-optimizer-like structures, exactly, can we expect to result from training, and
- Is there a clear and unambiguous danger that the model would be manipulative? If so, why?
- Is the threat model more that we don't know what we don't know, or that we have a specific reason to believe the model would be manipulative in a particular direction?
As far as I understand GPT-N, it's not very agent-like (it doesn't form a me-vs-environment abstraction and doesn't look for ways to transform its perceived environment to satisfy some utility function). I wouldn't expect it to "scheme" against people, since it lacks any concept of "affecting its environment".
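To make the contrast concrete, here's a toy sketch of the distinction I have in mind (both functions are illustrative stubs, not descriptions of any real system):

```python
# Toy contrast between an agent-like loop and a pure sequence predictor.
# Both functions are illustrative stubs, not descriptions of real systems.

def agent_step(observation, actions, world_model, utility):
    # An agent maintains a me-vs-environment split: it simulates how each
    # available action would change the world and picks the one that
    # maximizes its utility function.
    return max(actions, key=lambda a: utility(world_model(observation, a)))

def language_model_step(context, next_token_distribution):
    # A language model has no action/environment split: it simply returns a
    # distribution over the next token given the text so far.
    return next_token_distribution(context)
```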
However, it seems likely that GPT-N can perfect the skill of crowd-pleasing (we already see that: we're constantly amazed by it, even though the generated texts carry little meaning). It can precisely modulate its tone and identify the talking points that get the most response.
So I expect GPT-N-generated texts to sound really persuasive, not because of novel ideas but because of a superhuman ability to compose ideas it has heard into a persuasive essay.
I would expect GPT-N to focus on presenting solutions for alignment (thereby making us overly optimistic about naive approaches), presenting novel risks (it's easy to make something up by simply rehashing), and possibly venturing into philosophically muddying the waters (humans prove to be very easily engaged by certain topics, like self-consciousness).
Upon reflection, you're right that it won't be maximizing response per se.
But as we dig deeper, it's not so straightforward. GPT-3 models can be trained to minimize prediction loss (or, plainly speaking, to simply predict more accurately) on many different tasks, which are usually very simply stated (e.g. choose a word to fill in the blank).
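For reference, "minimizing prediction loss" here is just the standard cross-entropy objective on the next (or blanked-out) token; a minimal PyTorch sketch, where `model` is a placeholder for any network that maps token ids to vocabulary logits:

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    # `tokens` is a (batch, seq_len) tensor of token ids; `model` is assumed
    # to return logits of shape (batch, seq_len - 1, vocab_size).
    logits = model(tokens[:, :-1])   # predict each position from its prefix
    targets = tokens[:, 1:]          # the "blank" to fill is just the next token
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (N, vocab_size)
        targets.reshape(-1),                  # flatten to (N,)
    )
```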
But we end up with people taking models trained this way and using them to generate long texts based on some primer. And yes, in most cases such abuse of the model will end up with text that is simply co...