That makes a lot of sense, thanks for the link. It is not as dangerous a situation as a true agent AGI, since this failure mode involves a (relatively stupid) user error. I trust researchers not to make that mistake, but it seems like there is no way to safely make those systems available to the public.
One way to make this more plausible, which I thought of after reading this, is accidentally making it think it's hostile. Perhaps you make a joking remark about paperclip maximizers, or maybe it just so happens that the chat history is similar to the premise of...
One speculative way I see it, which I've yet to expand on, is that GPT-N, to minimize prediction error in non-trivial settings during training, could simulate some sort of entity carrying out some reasoning. In a sense, GPT would be a sort of actor interpreting a play through extreme method acting. I have in mind something like what the protagonist of "Pierre Menard, Author of the Quixote" tries to do to replicate the book Don Quixote word by word.
This would mean that, for some set of strings, GPT-N would boot up and run some agent A on seeing string S, just because "being" that agent performed well on similar training strings. If this agent is complex and capable enough (which it may need to be, if that is what was needed to predict previous data), it could be dangerous and possibly malicious, perhaps keeping itself stable through the careful placement of answer tokens.
And, of course, sequence modeling as a paradigm can also be used for RL training.
I feel like there is a failure mode in this line of thinking, namely 'confusingly pervasive consequentialism'. AI x-risk is concerned with a self-evidently dangerous object, namely superoptimizing agents. But whenever a system is proposed that is possibly intelligent without being superoptimizing, an argument is made: "well, this thing would do its job better if it were a superoptimizer, so the incentives (either internal or external to the system itself) will drive the appearance of a superoptimizer." Well, yes, if you define the incredibly dangero...
I posted a somewhat similar response to MSRayne, with the exception that what you accidentally summon is not an agent with a utility function, but something that tries to appear like one and nevertheless tricks you into making some big mistake.
Here, what you get is a genuine agent which works across prompts by having some internal value function which outputs a different value after each prompt, and acts accordingly, if I understand correctly. It doesn't seem incredibly unlikely, as there is nothing in the process of evolution that necessarily has to make ...
One of the holy grails of AI has been "common sense knowledge" - the kind of comprehensive general knowledge about the concrete everyday world which humans begin to acquire when just a few years old, and then keep refining throughout their lives. Before large language models, the only halfway successful approach to this was Cyc, whose developers dealt with the problem by simply spoon-feeding their AI tens of thousands of everyday propositions, laboriously added to its knowledge base by hand.
But as we have discovered, large language models, designed simply to learn and imitate patterns in very large collections of Internet text, can do a surprisingly good job of talking as if they are a person with a typical person's knowledge. There seems to be very little understanding of how they do this. But let's postulate that what they develop are "chatbot schemas": conversational agents which roughly mimic the internal changes of state in a thinking and communicating human being, along with fragments of knowledge that can be drawn upon by the "chatbots".
A language model, then, is a kind of mirror held to the corpus of human writings, a mirror of sufficient fineness that it reveals some of the cognitive and conceptual structure implicit within those writings. But also an enchanted mirror that we can talk to, that summons persons and places that never existed, but which are fashioned according to the logic it has discerned in our productions.
Left to itself, the language model is passive, and random when it responds. But having discovered a learning process sufficiently deep and general that it cheaply produces imitations of agents with common-sense knowledge, the human race is now trying to harness that power, refine it, make it more predictable, turn it into part of a true AGI. In my opinion, that's where the sharpest dangers lie: not that a coherently malevolent agent will spontaneously crystallize inside a straightforward language model, but that a language model, reshaped and trained to be a dutiful part of a larger cognitive architecture, will also be part of what pushes that larger "mind" beyond human understanding or control.
Right now, it seems that the most likely way we're gonna get an (intellectually) universal AI is by scaling models such as GPT. That is, models trained by self-supervised learning on massive piles of data, perhaps with a similar architecture to the transformer.
I do not see any risk due to misalignment here.
One failure mode I've seen discussed is that of manipulative answers, as seen in Predict-O-Matic. Maybe those AIs will learn that manipulating users into taking actions with low-entropy outcomes decreases the overall prediction error?
But why should a GPT-like model ever output manipulative answers? I am not denying the possibility that a GPT successor develops human-level intelligence. When it learns to predict the next word, it may genuinely go through an intellectual process, one that emerged as it was forced to compress its predictions by the ever-increasing amounts of data it had to go through.
However, nowhere in the process of constructing a valid response does there seem to be an incentive to produce responses which manipulate the environment, be it to make it easier to predict, or to make it more in line with the AI's predictions. After all, it wasn't trained in a responsive environment as an agent, but on a static dataset. And when it is in use, it's just a frozen model, so there is obviously no utility function.
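To make the setup I have in mind concrete, here is a toy sketch, assuming a simple bigram counter can stand in for the model. A real GPT is trained by gradient descent on a transformer, so this only mirrors the shape of the process: fit next-token statistics on a fixed corpus, then sample from frozen parameters. The names `train_bigram`, `generate`, and `STATIC_CORPUS` are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Fit next-token probabilities on a fixed corpus.

    The corpus is static: nothing the model outputs can change it.
    """
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    # Normalize counts into probabilities; these are the "weights".
    model = {}
    for prev, following in counts.items():
        total = sum(following.values())
        model[prev] = {tok: n / total for tok, n in following.items()}
    return model


def generate(model, start, max_tokens=10, seed=0):
    """Sample from the frozen model. Nothing here writes to `model`."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(max_tokens):
        dist = model.get(out[-1])
        if dist is None:  # token never seen with a successor in training
            break
        tokens = list(dist)
        weights = [dist[t] for t in tokens]
        out.append(rng.choices(tokens, weights=weights)[0])
    return out


# Hypothetical stand-in for the "massive pile of data".
STATIC_CORPUS = ["the cat sat on the mat", "the dog sat on the rug"]

model = train_bigram(STATIC_CORPUS)
snapshot = {prev: dict(dist) for prev, dist in model.items()}

sample = generate(model, "the")
print(" ".join(sample))

# Using the model did not update it: no learning signal flows back.
assert model == snapshot
```

The sketch only shows that, in this setup, generated text never feeds back into the parameters or the training data. It does not by itself settle whether something agent-like could be simulated inside the frozen model.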
Am I wrong here? Are there any other failure modes I did not think of?