Right now, it seems that the most likely way we're going to get an (intellectually) universal AI is by scaling models such as GPT: models trained by self-supervised learning on massive piles of data, perhaps with an architecture similar to the transformer.
I do not see any risk due to misalignment here.
One failure mode I've seen discussed is that of manipulative answers, as in the Predict-O-Matic story. Maybe those AIs will learn that manipulating users into actions with low-entropy outcomes decreases their overall prediction error?
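To make that worry concrete, here is a minimal sketch (my own illustration, not from the Predict-O-Matic post) of why a predictor scored on log loss would "prefer" a world it had steered into predictability: the best achievable expected log loss on a categorical outcome equals the Shannon entropy of the outcome distribution, so lower-entropy outcomes mean lower irreducible error. The two distributions below are made-up examples.

```python
import math

def expected_log_loss(p):
    """Best achievable expected log loss (in nats) when predicting a
    categorical outcome with distribution p. For an optimal predictor
    this equals the Shannon entropy of p."""
    return -sum(q * math.log(q) for q in p if q > 0)

# Hypothetical distributions over three possible user actions:
unmanipulated = [0.4, 0.35, 0.25]  # high entropy: hard to predict
manipulated   = [0.9, 0.05, 0.05]  # low entropy: easy to predict

print(expected_log_loss(unmanipulated))  # ≈ 1.08 nats
print(expected_log_loss(manipulated))    # ≈ 0.39 nats
```

So *if* the system's outputs could causally push users toward the second distribution, a pure prediction objective would score that world better. Whether a GPT-style training process actually produces such behavior is exactly the question at issue.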
But why should a GPT-like model ever output manipulative answers? I am not denying the possibility that a GPT successor develops human-level intelligence. When it learns to predict the next word, it may genuinely go through an intellectual process, one that arose because it was forced to compress its predictions as the amount of data it had to model kept growing.
However, nowhere in the process of constructing a valid response does there seem to be an incentive to produce responses that manipulate the environment, whether to make it easier to predict or to bring it more in line with the AI's predictions. After all, it wasn't trained as an agent in a responsive environment, but on a static dataset. And when it is in use, it's just a frozen model, so there is obviously no utility function being optimized.
Am I wrong here? Are there any other failure modes I did not think of?
I feel like there is a failure mode in this line of thinking: 'confusingly pervasive consequentialism'. AI x-risk is concerned with a self-evidently dangerous object, the superoptimizing agent. But whenever a system is proposed that is possibly intelligent without being superoptimizing, the argument goes: "well, this thing would do its job better if it were a superoptimizer, so the incentives (either internal or external to the system itself) will drive the appearance of a superoptimizer."

Well, yes: if you define the incredibly dangerous thing as the only way to solve any problem, and claim that incentives will force that dangerous thing into existence even if we try to prevent it, then the conclusion flows directly from the premise. You have to permit the existence of something that is not a superoptimizer in order to solve the problem. Otherwise you are essentially defining a problem that, by definition, cannot be solved, and then waving your hands saying "There is no solution!"