Right now, it seems that the most likely way we're gonna get an (intellectually) universal AI is by scaling models such as GPT. That is, models trained by self-supervised learning on massive piles of data, perhaps with a similar architecture to the transformer.
I do not see any risk due to misalignment here.
One failure mode I've seen discussed is that of manipulative answers, as seen in Predict-O-Matic. Maybe those AIs will learn that manipulating users to do actions with low entropy outcomes decreases the overall prediction error?
But why should a GPT-like ever output manipulative answers? I am not denying the possibility that a GPT successor develops human level intelligence. When it learns to predict the next word, it may genuinely go through an intellectual process which was created as it was forced to compress its predictions due to the ever increasing amounts of data it had to go through.
However, nowhere in the process of constructing a valid response does there seem to be an incentive to produce responses which manipulate the environment, be it to make it easier to predict, or to make it more in-line with the AI's predictions. After all, it wasn't trained in a responsive environment as an agent, but on a static dataset. And when it is in use, it's just a frozen model, so there is obviously no utility function.
Am I wrong here? Are there any other failure modes I did not think of?
One speculative way I see it, that I've yet to expand on, is that GPT-N, to minimize prediction error in training, could simulate some sort of entity enacting some reasoning, to minimize the prediction error in non-trivial settings. In a sense, GPT would be a sort of actor interpreting a play through extreme method acting. I have in mind something like what the protagonist of "Pierre Menard, author of Don Quixote" tries to do to replicate the book Don Quixote word by word.
This would mean that, for some set of strings, GPT-N would boot and run some agent A, when seeing string S, just because "being" that agent performed well in similar training strings. This agent, if complex and capable enough (which may need to be, if that was what it needed to predict previous data), this agent himself could, maybe through the placement of careful answer tokens that would guarantee its stability, would be a dangerous and possibly malicious agent.
And, of course, sequence modeling as a paradigm can also be used for RL training.
I believe I understand your point, but there are two things that I need to clarify, that kind of bypasses some of these criticism:
a) I am not assuming any safety technique applied to language models. In a sense, this is the worst-case scenario, one thing that may happen if the language model is run "as-it-is". In particular, the scenario I described would be mitigated if we could possibly stop the existence of stable sub-agents appearing in language models, although how to do this I do not know.
b) The incentives for the language models to be a superoptimiz... (read more)