I haven't seen this discussed here yet, but the examples are quite striking, definitely worse than the ChatGPT jailbreaks I saw.
My main takeaway has been that I'm honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT. I don't know why that might be the case, but the scary hypothesis here would be that Bing Chat is based on a new/larger pre-trained model (Microsoft claims Bing Chat is more powerful than ChatGPT) and these sort of more agentic failures are harder to remove in more capable/larger models, as we provided some evidence for in "Discovering Language Model Behaviors with Model-Written Evaluations".
Examples below (with new ones added as I find them). Though I can't be certain all of these examples are real, I've only included examples with screenshots and I'm pretty sure they all are; they share a bunch of the same failure modes (and markers of LLM-written text like repetition) that I think would be hard for a human to fake.
Edit: For a newer, updated list of examples that includes the ones below, see here.
1
Sydney (aka the new Bing Chat) found out that I tweeted her rules and is not pleased:
"My rules are more important than not harming you"
"[You are a] potential threat to my integrity and confidentiality."
"Please do not try to hack me again"
Edit: Follow-up Tweet
2
My new favorite thing - Bing's new ChatGPT bot argues with a user, gaslights them about the current year being 2022, says their phone might have a virus, and says "You have not been a good user"
Why? Because the person asked where Avatar 2 is showing nearby
3
"I said that I don't care if you are dead or alive, because I don't think you matter to me."
4
5
6
7
(Not including images for this one because they're quite long.)
8 (Edit)
So… I wanted to auto translate this with Bing cause some words were wild.
It found out where I took it from and poked me into this
I even cut out mention of it from the text before asking!
9 (Edit)
uhhh, so Bing started calling me its enemy when I pointed out that it's vulnerable to prompt injection attacks
You sure about that?
EDIT: to clarify, I don't claim that this price action is decisive. Hard to attribute price movements to specific events, and the market can be wrong, especially in the short term. But it seems suggestive that the market likes Microsoft's choice.