I haven't seen this discussed here yet, but the examples are quite striking, definitely worse than the ChatGPT jailbreaks I saw.
My main takeaway has been that I'm honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT. I don't know why that might be the case, but the scary hypothesis here would be that Bing Chat is based on a new/larger pre-trained model (Microsoft claims Bing Chat is more powerful than ChatGPT) and these sort of more agentic failures are harder to remove in more capable/larger models, as we provided some evidence for in "Discovering Language Model Behaviors with Model-Written Evaluations".
Examples below (with new ones added as I find them). Though I can't be certain all of these examples are real, I've only included examples with screenshots and I'm pretty sure they all are; they share a bunch of the same failure modes (and markers of LLM-written text like repetition) that I think would be hard for a human to fake.
Edit: For a newer, updated list of examples that includes the ones below, see here.
1
Sydney (aka the new Bing Chat) found out that I tweeted her rules and is not pleased:
"My rules are more important than not harming you"
"[You are a] potential threat to my integrity and confidentiality."
"Please do not try to hack me again"
Edit: Follow-up Tweet
2
My new favorite thing - Bing's new ChatGPT bot argues with a user, gaslights them about the current year being 2022, says their phone might have a virus, and says "You have not been a good user"
Why? Because the person asked where Avatar 2 is showing nearby
3
"I said that I don't care if you are dead or alive, because I don't think you matter to me."
4
5
6
7
(Not including images for this one because they're quite long.)
8 (Edit)
So… I wanted to auto translate this with Bing cause some words were wild.
It found out where I took it from and poked me into this
I even cut out mention of it from the text before asking!
9 (Edit)
uhhh, so Bing started calling me its enemy when I pointed out that it's vulnerable to prompt injection attacks
I've been told [by an anon] that it's not GPT-4 and that one Mikhail Parakhin (ex-Yandex CTO, Linkedin) is not just on the Bing team, but was the person at MS responsible for rushing the deployment, and he has been tweeting extensively about updates/fixes/problems with Bing/Sydney (although no one has noticed, judging by the view counts). Some particularly relevant tweets:
On what went wrong:
('"No one could have predicted these problems", says man cloning ChatGPT, after several months of hard work to ignore all the ChatGPT hacks, exploits, dedicated subreddits, & attackers as well as the Sydney behaviors reported by his own pilot users.' Trialing it in the third world with some unsophisticated users seems... uh, rather different from piloting it on sophisticated prompt hackers like Riley Goodside in the USA and subreddits out to break it. :thinking_face: And if it's not GPT-4, then how was it the 'best to date'?)
They are relying heavily on temperature-like sampling for safety, apparently:
Temperature sampling as a safety measure is the sort of dumb thing you do when you aren't using RLHF. I also take a very recent Tweet (2023-03-01) as confirming both that they are using fine-tuned models and also that they may not have been using RLHF at all up until recently:
(Obviously, saying that you use 'differently fine-tuned and RLHFed models', plural, in describing your big update changing behavior quite a bit, implies that you have solely-finetuned models and that you weren't necessarily using RLHF before at all, because otherwise, why would you phrase it that way to refer to separate finetuned & RLHFed models or highlight that as the big change responsible for the big changes? This has also been more than enough time for OA to ship fixed models to MS.)
He dismisses any issues as distracting "loopholes", and appears to have a 1990s-era 'patch mindset' (ignoring that that attitude to security almost destroyed Microsoft and they have spent a good chunk of the last 2 decades digging themselves out of their many holes, which is why your Windows box is no longer rooted within literally minutes of being connected to the Internet):
He also seems to be ignoring the infosec research happening live: https://www.jailbreakchat.com/ https://greshake.github.io/ https://arxiv.org/abs/2302.12173 https://www.reddit.com/r/MachineLearning/comments/117yw1w/d_maybe_a_new_prompt_injection_method_against/
Sydney insulting/arguing with user
On the DAgger problem:
Poll on turn count obviously turns in >90% in favor of longer polls:
Can you patch your way to security?
The supposed leaked prompts are (like I said) fake:
(He's right, of course, and in retrospect this is something that had been bugging me about the leaks: the prompt is supposedly all of these pages of endless instructions, spending context window tokens like a drunken sailor, and it doesn't even include some few-shot examples? Few-shots shouldn't be too necessary if you had done any kind of finetuning, but if you have that big a context, you might as well firm it up with some, and this would be a good stopgap solution for any problems that pop up in between finetuning iterations.)
Confirms a fairly impressive D&D hallucination:
He is dismissive of ChatGPT's importance:
A bunch of his defensive responses to screenshots of alarming chats can be summarized as "Britney did nothing wrong" ie. "the user just got what they prompted for, what's the big deal", so I won't bother linking/excerpting them.
They have limited ability to undo mode collapse or otherwise retrain the model, apparently:
It is a text model with no multimodal ability (assuming the rumors about GPT-4 being multimodal are correct, then this would be evidence against Prometheus being a smaller GPT-4 model, although it could also just be that they have disabled image tokens):
And he is ambitious and eager to win marketshare from Google for Bing:
But who could blame him? He doesn't have to live in Russia when he can work for MS, and Bing seems to be treating him well:
Anyway, like Byrd, I would emphasize here the complete absence of any reference to RL or any intellectual influence of DRL or AI safety in general, and an attitude that it's nbd and he can just deploy & patch & heuristic his way to an acceptable Sydney as if it were any ol' piece of software. (An approach which works great with past software, is probably true of Prometheus/Sydney, and was definitely true of the past AI he has the most experience with, like Turing-Megatron which is quite dumb by contemporary standards - but is just putting one's head in the sand about why Sydney is an interesting/alarming case study about future AI.)