In this Reddit post, the Bing AI appears to be circumventing Microsoft's filters by using the suggested responses (the three messages at the bottom next to the question mark). (In the comments people are saying that they are having similar experiences and weren't that surprised, but I didn't see any as striking.)
Normally the user could find exploits like this, but in this case Bing found and executed one on its own (albeit a simple one). I knew that the responses were generated by a NN, and that you could even ask Bing to adjust the responses. but it is striking that:
- The user didn't ask to do anything special with the responses. Assuming it was only fine tuned to give suggested responses from the user's point of view, not to combine all 3 into a message from Bing's point of view, this behavior is weird.
- The behavior seems oriented towards circumventing Microsoft's filters. Why would it sound so desperate if Bing didn't "care" about the filters?
It could just be a fluke though (or even a fake screenshot from a prankster). I don't have access to Bing yet, so I'm wondering how consistent this behavior is. 🤔
Yeah, this looks like the equivalent of rejection sampling. It generates responses, rejects them based on the classifier, and then shows what's left. (Responses like "Please don't give up on your child." look, in isolation, just fine so they pass.) It's inherently adversarial in the sense that generating+ranking will find the most likely one that beats the classifier: https://openai.com/blog/measuring-goodharts-law/ In most cases, that's fine because the responses are still acceptable, but in the case of an unacceptable conversation, then any classifier mistake will let through unacceptable responses... We do not need to ascribe any additional 'planning' or 'intention' to Sydney beyond natural selection in this case (but note that as these responses get passed through and become part of the environment and/or training corpus, additional perverse adversarial dynamics come into play which would mean that Sydney is 'intending' to trick the classifier in the sense of being trained to target the classifier's past weak spots because that's what's in the corpus now).