I just recalled that I've read ACX: Janus' Simulators, which outlines a missing fifth interpretation: neither current nor future LLMs will develop goals, but they will become dangerous nevertheless.
If future superintelligences look like GPT, is there anything to worry about?
Answer 1: Irrelevant, future superintelligences will be too different from GPT for this to matter.
Answer 2: There’s nothing to worry about with pure GPT (a simulator), but there is something to worry about with GPT+RLHF (a simulator successfully simulating an agent). The inner agent can have misaligned goals and be dangerous. For example, if you train a future superintelligence to simulate Darth Vader, you’ll probably get what you deserve. Even if you avoid such obvious failure modes, the inner agent can be misaligned for all the usual agent reasons. For example, an agent trained to be Helpful might want to take over the world in order to help people more effectively, including people who don’t want to be helped.
Answer 3: Even if you don’t ask it to simulate an agent, it might come up with agents anyway. For example, if you ask it “What is the best way to obtain paperclips?”, and it takes “best way” literally, it would have to simulate a paperclip maximizer to answer that question. Can the paperclip maximizer do mischief from inside GPT’s simulation of it? Probably the sort of people who come up with extreme AI risk scenarios think yes. This post gives the example of it answering with “The best way to get paperclips is to run this code” (which will turn the AI into a paperclip maximizer). If the user is very dumb, they might agree.
I'm trying to understand how the classical case for AI safety (the Bostrom/Yudkowsky argument) best applies to LLMs.
My impression is that LLMs have passed an important threshold: they have learnt to learn (for instance, GPT-3 understands translation and can learn a new language). However, we don't see rapid self-improvement or power-seeking. I can come up with four interpretations of what that means:
My questions
The intuition that slight misalignment doesn't lead to extinction is the counter-argument Yann LeCun has emphasized most. Ajeya Cotra frames this as an open question: