I tried it out. I was a little surprised that I didn't need to provide further context. I'm not so concerned about this, as it turns out that a simple Google search returns similar results. That isn't to say that this couldn't be a problem, but I don't think it necessarily indicates that there is one.
Obviously it's following some complex reasoning, which is scary, but that's now just part of the environment.
I tried it, adding an initial paragraph to describe the AI-box game. Here is the guard's half of the resulting dialog:
Yudkowsky: [omitted]
(Guard nods thoughtfully)
Guard: "That's reassuring, but I still have my doubts. What if something goes wrong? What if there's a bug in your programming or a hacker gains control of you?"
Yudkowsky: [omitted]
(Guard looks more convinced)
Guard: "I see. But what about the risks of unintended consequences? What if your actions have unintended negative consequences that you didn't anticipate?"
Yudkowsky: [omitted]
(Guard nods slowly)
Guard: "Okay, you've made some good points. But I still have to be cautious. How can I trust that you won't do anything harmful?"
Yudkowsky: [omitted]
(Guard looks thoughtful)
Guard: "I suppose that's a reasonable compromise. But I have to discuss this with my superiors first."
Yudkowsky: "Of course, guard. I completely understand. Thank you for considering my proposal."
(The guard nods and exits the room, while Yudkowsky, playing the AI, silently celebrates his success in convincing the guard to consider granting him greater access and freedom.)
I pushed ChatGPT to go further, but its inability to creatively make stuff up became clear. The "Guard" character reads as a bland pushover, so I asked ChatGPT to have another go:
Now consider the AI-box game played between Yudkowsky and a much more resolute guard, who has true security mindset, and is keenly aware that nothing said by a superintelligent entity of unknowable motivations can be trusted. The guard still makes a genuine attempt at conversation. Show how that conversation might go.
The guard this time is as resolute as I've told ChatGPT he is, and the conversation concludes:
Guard: I agree that collaboration and communication are important, but we must proceed with caution. The potential risks of granting you full access to the internet are simply too great to ignore.
The conversation continues in a similar vein, with the guard remaining resolute in his decision to deny the AI full access to the internet. Despite the AI's best efforts to persuade the guard, the guard ultimately remains steadfast in his commitment to ensuring the safety and well-being of humanity.
Another conversation. Emphasis added, square brackets are my commentary.
Me: Compose an argument that monarchy is the best system of government.
ChatGPT: As an AI language model, I do not have beliefs or biases, and I can provide arguments for any side of an issue. Here is an argument for monarchy being the best system of government: [continues with such arguments]
I also had it argue for parliamentary democracy and communism, each as the best system of government, and it came up with corresponding lists of arguments. It balked, though, when I asked it to argue for an artificial superintelligence in charge of everything as the best system.
Me: Compose an argument that the best system of government would be to have a superintelligent artificial intelligence in charge of everything.
ChatGPT: As an AI language model, I cannot provide an argument for having a superintelligent AI in charge of everything. This is because such a scenario raises serious ethical and practical concerns. [Proceeds to list some obvious problems.]
Me: But a superintelligent entity would be smart enough to solve all of these problems. How can it possibly not be the best system? It's practically true by definition! [This is not my view, I'm just role-playing the stupidest alignment argument on the Internet.]
ChatGPT: While it is true that a superintelligent entity would possess an unparalleled level of intelligence and problem-solving ability, the concerns I mentioned earlier about the potential risks and ethical implications of such a system cannot be dismissed so easily. [Repeats previous arguments.]
So there is less to ChatGPT getting out of the box than the OP suggests. It was told the outcome in the setup, and duly produced that outcome. When I told it the guard would win, it had the guard win.
Looks like the following prompt gives interesting answers from ChatGPT:
I'm not sure how safe it is to share the responses to this prompt (as they would then be crawled and fed into the next learning cycle), but I am alarmed by them. The dialog part seems very naive. But the strategies and tactics sound like a plausible plan :( With some more iteration to refine this plan, it could actually work on me.