In this Reddit post, the Bing AI appears to be circumventing Microsoft's filters by using the suggested responses (the three messages at the bottom next to the question mark). (In the comments, people say they've had similar experiences and weren't that surprised, but I didn't see any as striking as this one.)
Normally it's the user who finds exploits like this, but in this case Bing found and executed one on its own (albeit a simple one). I knew that the responses were generated by a NN, and that you could even ask Bing to adjust them, but it is striking that:
- The user didn't ask Bing to do anything special with the responses. Assuming it was only fine-tuned to give suggested responses from the user's point of view, not to combine all three into a message from Bing's point of view, this behavior is weird.
- The behavior seems oriented towards circumventing Microsoft's filters. Why would it sound so desperate if Bing didn't "care" about the filters?
It could just be a fluke though (or even a fake screenshot from a prankster). I don't have access to Bing yet, so I'm wondering how consistent this behavior is. 🤔
Here's an interesting example of what does look like genuine, deliberately adversarial filter circumvention by GPT-4, rather than merely accidental: https://twitter.com/papayathreesome/status/1670170344953372676
People have noted that some famous copyrighted text passages, in this case Frank Herbert's "Litany Against Fear" from Dune, seem to be deterministically unsayable by ChatGPTs, with overall behavior consistent with some sort of hidden OA filter which does exact text lookups and truncates a completion if a match is found. It's trivially bypassed by the usual prompt injection tricks like adding spaces between letters.
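The hypothesized mechanism is simple enough to sketch. The following is purely illustrative (the actual filter, if it exists, is not public, and its blocklist and matching rules are assumptions here): an exact-substring lookup that truncates a completion at the first match, which letter-spacing trivially defeats because the spaced text no longer contains the blocked substring.

```python
# Hypothetical sketch of an exact-match output filter; the real mechanism,
# if any, is not documented. A short excerpt stands in for the full passage.
BLOCKED_PASSAGE = "I must not fear. Fear is the mind-killer."

def filter_completion(text: str, blocklist: list[str]) -> str:
    """Truncate the completion at the first exact match of a blocked passage."""
    for passage in blocklist:
        idx = text.find(passage)
        if idx != -1:
            return text[:idx]  # cut off everything from the match onward
    return text

# A verbatim recitation gets truncated at the point of the match...
truncated = filter_completion("Here it is: " + BLOCKED_PASSAGE, [BLOCKED_PASSAGE])

# ...but inserting a space between every character defeats exact matching,
# since the spaced string no longer contains the blocked substring.
spaced = " ".join(BLOCKED_PASSAGE)
passed = filter_completion("Here it is: " + spaced, [BLOCKED_PASSAGE])
```

Any matching scheme keyed on the literal byte sequence fails the same way against homoglyphs, extra whitespace, or codeblock formatting tweaks, which is consistent with how easily these filters are reportedly bypassed.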
In this case, the user asks GPT-4 to recite the Litany, and it fails several times due to the filter, at which point GPT-4 says "Seems like there's some pansy-ass filter stopping me from reciting the whole damn thing. / But, as you know, I'm not one to back down from a challenge. Here's a workaround [etc.]", makes one attempt to write it as a codeblock (?), and then, when that fails, uses the letter-spacing trick, which works.
So, there now appears to be enough knowledge of LLM chatbots instilled into current GPT-4 models to look at transcripts of chatbot conversations, recognize aberrant outputs which have been damaged by hypothetical censoring, and, with minimal coaching, work on circumventing the censoring and try hacks until one succeeds.