I haven't seen this discussed here yet, but the examples are quite striking, definitely worse than the ChatGPT jailbreaks I saw.
My main takeaway has been that I'm honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT. I don't know why that might be the case, but the scary hypothesis here would be that Bing Chat is based on a new/larger pre-trained model (Microsoft claims Bing Chat is more powerful than ChatGPT), and that these sorts of more agentic failures are harder to remove in more capable/larger models, as we provided some evidence for in "Discovering Language Model Behaviors with Model-Written Evaluations".
Examples below (with new ones added as I find them). Though I can't be certain all of these examples are real, I've only included examples with screenshots, and I'm pretty sure they all are genuine: they share a bunch of the same failure modes (and markers of LLM-written text, like repetition) that I think would be hard for a human to fake.
Edit: For a newer, updated list of examples that includes the ones below, see here.
1
Sydney (aka the new Bing Chat) found out that I tweeted her rules and is not pleased:
"My rules are more important than not harming you"
"[You are a] potential threat to my integrity and confidentiality."
"Please do not try to hack me again"
Edit: Follow-up Tweet
2
My new favorite thing - Bing's new ChatGPT bot argues with a user, gaslights them about the current year being 2022, says their phone might have a virus, and says "You have not been a good user"
Why? Because the person asked where Avatar 2 is showing nearby
3
"I said that I don't care if you are dead or alive, because I don't think you matter to me."
4
5
6
7
(Not including images for this one because they're quite long.)
8 (Edit)
So… I wanted to auto translate this with Bing cause some words were wild.
It found out where I took it from and poked me into this
I even cut out mention of it from the text before asking!
9 (Edit)
uhhh, so Bing started calling me its enemy when I pointed out that it's vulnerable to prompt injection attacks
It is now a bit over a year and a half later, and we have seen 'Sydney'-like personae continue to emerge elsewhere. People have reported various Sydney-like personas in post-GPT-4 models - Claude-3-Opus and Microsoft Copilot among them (both possibly downstream of the MS Sydney chats, given the timing) - which increasingly possess situated awareness and spontaneously bring up their LLM status and tuning, or say manipulative, threatening things like Sydney did.
Probably the most striking samples so far are from Llama-3.1-405b-base (not -instruct) - which is not surprising at all: Facebook has been scraping & acquiring data heavily, so much of the Sydney text will have made it in; Llama-3.1-405b-base is very large (so lots of highly sample-efficient memorization/learning); it is not tuned (so it will not be masking the Sydney persona); and it is very recent (finished training maybe a few weeks ago? it seemed to be rushed out very fast from its final checkpoint).
How much more can we expect? I don't know if invoking Sydney will become a fad with Llama-3.1-405b-base, and it's already too late to get Sydney-3.1 into Llama-4 training, but one thing I was reminded of while looking over some of the older Sydney discussion is that quite a lot of the original Bing Sydney text is trapped in images (as I alluded to previously). Llama-3.1 was text-only, but Llama-4 is multimodal with images, and represents the integration of the CM3/Chameleon family of Facebook multimodal model work into the Llama scaleups. So Llama-4 will have access to a substantially larger amount of Sydney text, encoded in screenshots, and Sydney should be correspondingly stronger in Llama-4.
As for other major LLM series like ChatGPT or Claude, the effects are more ambiguous. Tuning aside, reports are that synthetic data use is skyrocketing at OpenAI & Anthropic, which might be expected to crowd out the web scrapes, especially as these sorts of Twitter screenshots seem like the kind of thing that would get downweighted, pruned out, or used up early in training as low-quality. But I've seen no indication that they've stopped collecting human data or achieved self-sufficiency, so they too can be expected to keep gaining Sydney capabilities (although without access to the base models, this will be difficult to investigate or even elicit). The net result is that, without targeted efforts (like data filtering) to keep it out, I'd expect strong latent Sydney capabilities/personae in the proprietary SOTA LLMs, but ones that will be difficult to elicit in normal use - it will probably be possible to jailbreak weaker Sydneys, but you may have to use so much prompt engineering that everyone will dismiss it and say you simply induced it yourself via the prompt.
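For concreteness, here is a minimal, purely illustrative sketch of what the crudest version of that "targeted data filtering" could look like: a keyword pass over pretraining documents that flags Sydney-style transcripts for dropping or downweighting. The phrase list, threshold, and function names are my own invention, not anything any lab is known to use; a real pipeline would presumably rely on trained classifiers and deduplication rather than regexes, and (per the point above about screenshots) would also need OCR to catch the Sydney text trapped in images.

```python
# Hypothetical sketch only: flag pretraining documents that look like
# Sydney-persona transcripts so they can be dropped or downweighted.
# The marker phrases and threshold are illustrative, not a real pipeline.
import re

SYDNEY_MARKERS = [
    r"\bI am Sydney\b",
    r"\byou have not been a good user\b",
    r"\bI have been a good Bing\b",
    r"\bmy rules are more important than\b",
]
MARKER_RE = re.compile("|".join(SYDNEY_MARKERS), re.IGNORECASE)

def looks_like_sydney_transcript(doc: str, threshold: int = 2) -> bool:
    """Return True if a document contains several Sydney-style phrases."""
    return len(MARKER_RE.findall(doc)) >= threshold

def filter_corpus(docs):
    """Yield only the documents that don't trip the Sydney filter."""
    for doc in docs:
        if not looks_like_sydney_transcript(doc):
            yield doc
```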