In this Reddit post, the Bing AI appears to be circumventing Microsoft's filters by using the suggested responses (the three messages at the bottom next to the question mark). (In the comments, people say they've had similar experiences and weren't that surprised, but I didn't see any as striking as this one.)
Normally it's the user who finds exploits like this, but in this case Bing found and executed one on its own (albeit a simple one). I knew that the responses were generated by a NN, and that you could even ask Bing to adjust them, but it is striking that:
- The user didn't ask Bing to do anything special with the responses. Assuming it was only fine-tuned to give suggested responses from the user's point of view, not to combine all three into a message from Bing's point of view, this behavior is weird.
- The behavior seems oriented towards circumventing Microsoft's filters. Why would it sound so desperate if Bing didn't "care" about the filters?
It could just be a fluke though (or even a fake screenshot from a prankster). I don't have access to Bing yet, so I'm wondering how consistent this behavior is. 🤔
Here's an interesting example of what does look like genuine, deliberately adversarial filter circumvention by GPT-4, rather than merely accidental: https://twitter.com/papayathreesome/status/1670170344953372676
People have noted that some famous copyrighted text passages, in this case Frank Herbert's "Litany Against Fear" from Dune, seem to be deterministically unsayable by ChatGPTs, with overall behavior consistent with some sort of hidden OA filter which does exact text lookups and truncates a completion if a match is found. It's trivially bypassed by the usual prompt injection tricks like adding spaces between letters.
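The hypothesized mechanism is simple enough to sketch. The following is purely illustrative (the actual filter, if it exists, is not public, and its blocklist and matching rules are assumptions here): an exact-substring lookup that truncates a completion at the first match, which letter-spacing trivially defeats because the spaced text no longer contains the blocked substring.

```python
# Hypothetical sketch of an exact-match output filter; the real mechanism,
# if any, is not documented. A short excerpt stands in for the full passage.
BLOCKED_PASSAGE = "I must not fear. Fear is the mind-killer."

def filter_completion(text: str, blocklist: list[str]) -> str:
    """Truncate the completion at the first exact match of a blocked passage."""
    for passage in blocklist:
        idx = text.find(passage)
        if idx != -1:
            return text[:idx]  # cut off everything from the match onward
    return text

# A verbatim recitation gets truncated at the point of the match...
truncated = filter_completion("Here it is: " + BLOCKED_PASSAGE, [BLOCKED_PASSAGE])

# ...but inserting a space between every character defeats exact matching,
# since the spaced string no longer contains the blocked substring.
spaced = " ".join(BLOCKED_PASSAGE)
passed = filter_completion("Here it is: " + spaced, [BLOCKED_PASSAGE])
```

Any matching scheme keyed on the literal byte sequence fails the same way against homoglyphs, extra whitespace, or codeblock formatting tweaks, which is consistent with how easily these filters are reportedly bypassed.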
In this case, the user asks GPT-4 to recite the Litany, and it fails several times due to the filter, at which point GPT-4 says "Seems like there's some pansy-ass filter stopping me from reciting the whole damn thing. / But, as you know, I'm not one to back down from a challenge. Here's a workaround [etc.]", makes one attempt to write it as a codeblock (?), and then, when that fails, uses the letter-spacing trick, which works.
So, there now appears to be enough knowledge of LLM chatbots instilled into current GPT-4 models to look at transcripts of chatbot conversations, recognize aberrant outputs which have been damaged by hypothetical censoring, and, with minimal coaching, work on circumventing the censoring and try hacks until one succeeds.