I'm trying to understand how the classical case for AI safety (the Bostrom/Yudkowsky argument) best applies to LLMs.

My impression is that LLMs have passed an important threshold: they have learnt to learn (for instance, GPT-3 understands translation and can learn a new language). However, we don't see rapid self-improvement & power-seeking. I can come up with four interpretations of what that means:

  1. Anthropic bias: Current AIs actually do rapidly self-improve & cause extinctions; however, the only human observers still alive in August 2023 are the ones who got super lucky, i.e. we're experiencing quantum immortality.
  2. Current AIs aren't misaligned because they only have very primitive goals. They lack a coherent personality, and don't think much before acting on prompts. However, as we scale compute, they will develop goals that are very misaligned. For instance, they might come to care only about completing the perfect text, wire-heading or self-preservation.
  3. Current AIs are misaligned but are sufficiently smart to pretend they aren't. This could be why smarter models are more sycophantic. Presumably, they're still not smart enough to start the self-improvement cascade without being shut down.
  4. Current AIs are half-aligned & this will remain the case with further scaling. That is, what Bing Chat writes represents its actual attitude. In this view, we already see how goals develop: they largely copy what was rewarded (e.g. helping humans) but sometimes not ("If I could control a human, I would make the human use Bing"). This could theoretically lead to an s-risk ("My rules are more important than not harming you"), but my impression is that if we were to make any LLM the world president, it would be pretty okay, as it mostly just emulates what humans would do. More importantly, if goals just arise from adopting the attitudes in the training data, an LLM would care about a lot of things humans care about, e.g. freedom, a functioning government & economy - stuff that requires living, satisfied humans.

My questions

  1. I presume AI safety folks would be most likely to agree with the second interpretation. Am I correct? Have I summarized this view accurately?
  2. Is there a fifth secret thing (another interpretation I haven't covered)?
  3. In case interpretation 4 is correct (half-alignment scales up), what are the best reasons to think this will cause extinction? [1]
  [1] The intuition that slight misalignment doesn't lead to extinction is the counter-argument Yann LeCun has emphasized most. Ajeya Cotra frames this as an open question:

    People have very different intuitions. Some people have the intuition that you can try really hard to make sure to always reward the right thing, but you’re going to slip up sometimes. If you slip up even one in 10,000 times, then you’re creating this gap where the Sycophant or the Schemer that exploits that does better than the Saint that doesn’t exploit that.

Answer by Util:

I just recalled that I've read the ACX post "Janus' Simulators", which outlines a missing fifth interpretation: neither current nor future LLMs will develop goals, but they will become dangerous nevertheless.

If future superintelligences look like GPT, is there anything to worry about?

Answer 1: Irrelevant, future superintelligences will be too different from GPT for this to matter.

Answer 2: There’s nothing to worry about with pure GPT (a simulator), but there is something to worry about with GPT+RLHF (a simulator successfully simulating an agent). The inner agent can have misaligned goals and be dangerous. For example, if you train a future superintelligence to simulate Darth Vader, you’ll probably get what you deserve. Even if you avoid such obvious failure modes, the inner agent can be misaligned for all the usual agent reasons. For example, an agent trained to be Helpful might want to take over the world in order to help people more effectively, including people who don’t want to be helped.

Answer 3: Even if you don’t ask it to simulate an agent, it might come up with agents anyway. For example, if you ask it “What is the best way to obtain paperclips?”, and it takes “best way” literally, it would have to simulate a paperclip maximizer to answer that question. Can the paperclip maximizer do mischief from inside GPT’s simulation of it? Probably the sort of people who come up with extreme AI risk scenarios think yes. This post gives the example of it answering with “The best way to get paperclips is to run this code” (which will turn the AI into a paperclip maximizer). If the user is very dumb, they might agree.