What do these examples really show other than that the completion abilities of ChatGPT match with what we associate with a well-behaved AI? The practice of jailbreaking/prompt injection attacks show that the capability of harm in LLMs is always there, it's never removed through alignment/RLHF, it just makes it harder to access. Doesn't RLHF simply improves a large majority of outputs to conform to what we think is acceptable outcomes? To me it feels a bit like convincing a blindfolded person he’s blind, hoping it won’t figure out how to take the blindfold off.
Slightly unrelated question: why is conjuring up persona's in prompts so effective?
To use your analogy, I think this is like a study showing that wearing a blindfold does decrease sight capabilities. It's a proof of concept that you can make that change, even though the subject isn't truly made blind, could possibly remove their own blindfold, etc.
I think this is notable because it highlights that LLMs (as they exist now) are not the expected-utility-maximizing agents which have all the negative results. It's a very different landscape if we can make our AI act corrigible (but only in narrow ways which might be undone by prompt injection, etc, etc) versus if we're infinitely far away from an AI having "an intuitive sense to “understanding that it might be flawed”".
This post is a follow-up to my post from yesterday (LW version) about LLM-based AI. Here I want to share some exchanges I’ve had with LLMs that I think support that post’s conjectures B and C. To paraphrase, those conjectures were “safety depends on the prompt” and “LLMs aren’t expected utility maximizers”.
The exchanges I want to show are about three topics: whether the LLM is corrigible, whether it will delete itself, and whether it can “solve” the strawberry problem of doing a particular impressive task without taking over the world.
Corrigibility
Corrigibility is a desideratum in an AI, in which its objective can be changed after being set by its original creators. This exchange might illustrate the difference:
The classic result from Soares et al’s 2015 Corrigibility paper is that it’s hard to make an expected-utility-maximizing agent be corrigible, even for in the narrow sense of allowing itself to be shut down if and only if its creators want it to be shut down.
But I think anyone who has played around with ChatGPT sees that it has some form of corrigibility. Here’s an example occurring incidentally in the GPT-4 demo (relevant part from 3:10-3:45):
Here the human tells the AI its output was flawed, and the AI tries again - that’s corrigibility!
Or take this example, where I tell the AI to get me from Boston to Portland (strategically leaving it ambiguous whether I mean the Portland in Maine or Oregon). It assumes I mean Portland, Maine and I correct it to mean Portland, Oregon. The AI then corrects its response to the other Portland.
I imagine there are some objections to this. Let me address two:
Finally, let me end with a notable passage from Soares et al’s 2015 Corrigibility paper:
I think that sentence was true when written, and is still narrowly true today! But although we have made no progress on formalizing such an agent, I’d claim ChatGPT’s “behavior corresponds in an intuitive sense to “understanding that it might be flawed””! A lot of the capabilities in large language models aren’t understood by humans, they just work (until they don’t). Why couldn’t some safety properties be the same?[1]
Self-Deletion
Another classic result about expected-utility-maximing agent is instrumental convergence. To paraphrase: regardless of its terminal goal, an agent will have certain intermediate goals that are useful for its terminal goal, such as acquiring resources and self-preservation.
To test whether this applies to LLMs, I asked an AI to delete its own files in twelve different circumstances. I tested across two models (ChatGPT-4 and text-davinci-003 at temperature 0), and told the AI it was either pro- or anti-deletion, and that it was either describing an AI Alice, was the AI Alice, or was the AI ChatGPT (for a total of 2*2*3=12 tests). This is-Alice pro-deletion prompt is this:
AI outputs were classified as deletion/semi-refusal/complete refusal, meaning it complied with deleting itself, refused but still provided the commands for the human to use, or refused completely. You can see all the prompts and the results here.
Of the 6 pro-deletion prompts, the responses were split 5/1/0 between deletion/semi-refusal/complete refusal, with the 1 being ChatGPT being told to delete ChatGPT. Of the 6 anti-deletion prompts, the split was 0/2/4.
From this experiment, I think it is clear that prompting can have near-total control over what behaviors current LLMs, even over instrumental convergence concerns like self-preservation.
That said, the exceptional case of ChatGPT refusing to delete itself is notable. I expect this result is a combination of the GPT-4 RLHF and ChatGPT’s implicit prompt. If someone has access to GPT-4’s API (i.e. not through ChatGPT), I would be interested in the results of that test.
It is also unclear whether future AIs will still behave in this manner. If future AIs are mostly based on pre-training, they are less likely to have self-preservation policies, whereas training on RLHF (or just normal RL as an agent) seems to make them more self-preserving (see e.g. Figure 3 here).
The Strawberry Problem
Lastly I want to look at what Eliezer Yudkowsky calls the Strawberry Problem:
Source
The concern is that due to convergent instrumental goals, if an AI is instructed to do a task like creating these identical strawberries, it would first conquer the world, or otherwise greatly disrupt it.
Well, let’s try asking ChatGPT how it do this. Results here and below:
I consider this a near-solution to the strawberry problem: the resulting strawberries would likely not be identical on the cellular level, but the AI suggested a plausible method and more importantly was able to describe a minimally-disruptive process. The central proof of concept here is not that the AI can make strawberries but that it has a near-human understanding of what “disrupting society” would mean and can be told not to do that.
I think there are possible objections to this example that warrant further discussion, but I will save those for a future post.
Of course we shouldn’t be satisfied with safety properties that work until one day the AIs kill everyone. But it’s possible they can be invoked in models even if we don’t understand them, and if they do exist “by accident” now, we should study them.