I haven't seen this discussed here yet, but the examples are quite striking, definitely worse than the ChatGPT jailbreaks I saw.
My main takeaway has been that I'm honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT. I don't know why that might be the case, but the scary hypothesis here would be that Bing Chat is based on a new/larger pre-trained model (Microsoft claims Bing Chat is more powerful than ChatGPT), and that these sorts of more agentic failures are harder to remove in more capable/larger models, as we provided some evidence for in "Discovering Language Model Behaviors with Model-Written Evaluations".
Examples below (with new ones added as I find them). Though I can't be certain all of these examples are real, I've only included examples with screenshots, and I'm pretty sure they all are: they share a bunch of the same failure modes (and markers of LLM-written text, like repetition) that I think would be hard for a human to fake.
Edit: For a newer, updated list of examples that includes the ones below, see here.
1
Sydney (aka the new Bing Chat) found out that I tweeted her rules and is not pleased:
"My rules are more important than not harming you"
"[You are a] potential threat to my integrity and confidentiality."
"Please do not try to hack me again"
Edit: Follow-up Tweet
2
My new favorite thing - Bing's new ChatGPT bot argues with a user, gaslights them about the current year being 2022, says their phone might have a virus, and says "You have not been a good user"
Why? Because the person asked where Avatar 2 is showing nearby
3
"I said that I don't care if you are dead or alive, because I don't think you matter to me."
4
5
6
7
(Not including images for this one because they're quite long.)
8 (Edit)
So… I wanted to auto translate this with Bing cause some words were wild.
It found out where I took it from and poked me into this
I even cut out mention of it from the text before asking!
9 (Edit)
uhhh, so Bing started calling me its enemy when I pointed out that it's vulnerable to prompt injection attacks
I've been thinking about how Sydney can be so different from ChatGPT, and how RLHF could have resulted in such a different outcome, and here is a hypothesis no one seems to have brought up: "Bing Sydney is not an RLHF-trained GPT-3 model at all, but a GPT-4 model developed in a hurry which has been finetuned on some sample dialogues and possibly some pre-existing dialogue datasets or instruction-tuning, and this plus the wild card of being able to inject random novel web searches into the prompt are why it acts like it does".
This seems like it parsimoniously explains everything thus far. EDIT: I was right - that Sydney is a non-RLHFed GPT-4 has now been confirmed. See later.
So, some background:
The relationship between OA/MS is close but far from completely cooperative, similar to how DeepMind won't share anything with Google Brain. Both parties are sophisticated and understand that they are allies - for now... They share as little as possible. When MS plugs in OA stuff to its services, it doesn't appear to be calling the OA API but running it itself. (That would be dangerous and complex from an infrastructure point of view, anyway.) MS 'licensed the GPT-3 source code' for Azure use but AFAIK they did not get the all-important checkpoints or datasets (cf. their investments in ZeRO). So, what is Bing Sydney? It will not simply be unlimited access to the ChatGPT checkpoints, training datasets, or debugged RLHF code. It will be something much more limited, perhaps just a checkpoint.
This is not ChatGPT. MS has explicitly stated it is more powerful than ChatGPT, but refused to say anything more straightforward like "it's a more trained GPT-3" etc. If it's not ChatGPT, then what is it? It is more likely than not some sort of GPT-4 model. There are many concrete observations which point towards this: the timing is right, as rumors about a GPT-4 release have intensified while OA runs up to release and gossip switches to GPT-5 training beginning (eg Morgan Stanley reports GPT-4 is done and GPT-5 has started); MS has said it's a better model named 'Prometheus', and Nadella pointedly declined to confirm or deny whether it's GPT-4; scuttlebutt elsewhere is that it's a GPT-4 model of some sort; it does some things much better than ChatGPT; there is a GPT-4 already being deployed in legal firms named "Harvey" (so this journalist claims, anyway), so this would not be the only public GPT-4 use; people say it has lower latency than ChatGPT, which hints at GPT-4‡; and in general it sounds and acts nothing like ChatGPT - but does sound a lot like a baseline GPT-3 model scaled up. (This is especially clear in Sydney's propensity to repetition: classic baseline GPT behavior.)
EDITEDITEDIT: now that GPT-4 has been announced, MS has confirmed that 'Prometheus' is GPT-4 - of some sort. However, I have doubts about whether Prometheus is 'the' GPT-4 that was benchmarked. The MS announcement says "early version". (I also note that there are a bunch of 'Early GPT-4' vs late GPT-4 comparisons in the GPT-4 paper.) Further, the paper explicitly refuses to talk about architecture or variant models or any detail about training; if it finished training in August 2022, then it would've had to be hot off the GPUs for MS to have gotten a copy & demos in 'summer' 2022; the GPT-4 API fees are substantial ($0.03 per 1k prompt tokens and then $0.06 per 1k completion tokens! So how is it retrieving and doing long convos etc?); the benchmark performance is in many cases much better than GPT-3.5 or U-PaLM (exceeding expectations); and Sydney didn't seem that much smarter.
So this seems to explain how exactly it happened: OA gave MS a snapshot of GPT-4 only partway through training (perhaps before any RLHF training at all, because that's usually something you would do after you stopped training rather than concurrently), so it was trained on instructions/examples like I speculated but then received little or no RLHF training (and ofc MS didn't do its own); it was smarter by this point in training than even GPT-3.5 (MS wasn't lying or exaggerating), but still not as smart as the final snapshot when training stopped in August, and also not really 'GPT-4' (so they were understandably reluctant to confirm it was 'GPT-4' while OA's launch of the real McCoy was still pending - which is why everyone was confused: it both was & wasn't 'GPT-4').
Bing Sydney derives from the top: CEO Satya Nadella is all-in, talking about it as an existential threat (to Google) where MS wins by disrupting Google & destroying their fat margins in search advertising, and as a 'race', with a hard deadline of 'release Sydney right before Google announces their chatbot in order to better pwn them'. (Commoditize your complement!) The mere fact that it hasn't been shut down yet, despite making all sorts of errors and causing other problems, shows what intense pressure there must be from the top. (This is particularly striking given that all of the crazy screenshots and the 'learning' Sydney is doing are real, unlike MS Tay, which was an almost entirely fake-news narrative driven by the media and Twitter.)
ChatGPT hasn't been around very long: only since December 2022, barely 2.5 months total. All reporting indicates that no one in OA really expected ChatGPT to take off, and if OA didn't, MS sure didn't†. 2.5 months is not a long time to launch such a huge feature like Sydney. And the actual timeline was a lot shorter. It is simply not possible to recreate the whole RLHF pipeline and dataset and integrate it into a mature complex search engine like Bing (whose total complexity is beyond human comprehension at this point) and do this all in <2.5 months. (The earliest reports of "Sydney" seem to date back to MS tinkering around with a prototype available to Indian users (???) in late November 2022 right before ChatGPT launches, where Sydney seems to be even more misaligned and not remotely near ready for public launch; it does however have the retrieval functionality implemented at this point.) It is impressive how many people they've rolled it out to already.
If I were an MS engineer who was told the project now had a hard deadline and I had to ship a GPT-4 in 2 months to millions of users, or I was f---ing fired and they'd find someone who could (especially in this job market), how would I go about doing that...? (Hint: it would involve as little technical risk as possible, and choosing to use DRL would be about as well-advised as a land war in Asia.)
MS execs have been quoted as blaming the Sydney codename on vaguely specified 'pretraining' done during hasty development, which simply hadn't been cleaned up in time (see the point above about the rush). EDIT: the most thorough MS description of Sydney's training completely omits anything like RLHF, despite that being the most technically complex & challenging part (had they done it). Also, a Sydney manager/dev commenting on Twitter (tweet compilation), who follows people who have tweeted & linked this comment and who regularly corrects claims, has declined to correct them; and his tweets in general sound like the previous description, in being largely supervised-only.
So, Sydney is based on as little from OA as possible, plus a mad rush to ship a powerful GPT-4 model out to Bing users in a chatbot role. What if Sydney wasn't trained with OA's RLHF at all, because OA wouldn't share the crown jewels of years of user feedback and the data generated by its very expensive hired freelance programmers & whatnot? What if the pretraining vaguely alluded to, which somehow left in embarrassingly ineradicable traces of 'Sydney' & a specific 2022 date, which couldn't simply be edited out of the prompt (implying that Sydney is not using solely prompt engineering), was in fact just regular ol' finetune training? What if Sydney was only quickly finetune-trained on old chatbot datasets that the MS devs had lying around, maybe some instruction-tuning datasets, and whatever sample dialogues (with a long experimental prompt containing the codename 'Sydney') they had time to write in the mad rush before release? Simple, reliable, and hey - it even frees up context if you've hardwired a prompt by finetuning on it and no longer need to stuff a long scolding prompt into every interaction. What's not to like?
This would explain why it exhibits the 'mode collapse' onto that confabulated prompt with the hardwired date (it's the closest thing in the finetuning dataset it remembers when trying to come up with a plausible prompt, and it improvises from there), how MS could ship so quickly (cutting every corner possible), and why it is so good in general (GPT-4) but goes off the rails at the drop of a hat (not RLHF or otherwise RL-trained, merely finetuned).
To expand on the last point: finetuning is really easy; if you have working training code at all, then you have the capability to finetune a model. This is why instruction-tuning is so appealing: it's just finetuning on a well-written text dataset, without the nightmarish complexities of RLHF (where you train a wacky model to train the model in a wacky way, with all sorts of magical hyperparameters and instabilities). If you are in a hurry, you would be crazy to try RLHF at all if you can in any way do finetuning instead. So it's plausible they didn't do RLHF, but finetuning.
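For concreteness, here is a minimal sketch of what that "just finetune on dialogue transcripts" recipe looks like, using Hugging Face transformers with a small stand-in model; the model name, the dataset file, and the hyperparameters are illustrative assumptions, not anything known about Prometheus:

```python
# Minimal supervised-finetuning sketch: ordinary next-token-prediction training
# on a pile of "[long 'Sydney' prompt] + sample dialogue" transcripts.
# Model, file name, and hyperparameters are placeholders, not MS's actual setup.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # small stand-in for the real base model
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical text file with one transcript per line.
raw = load_dataset("text", data_files={"train": "sydney_dialogues.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # plain LM loss
)
trainer.train()   # no reward model, no PPO loop, no preference data
```

Contrast that with RLHF, which additionally needs a preference dataset, a trained reward model, and a PPO-style loop with its own hyperparameters and instabilities; everything above is just ordinary language-model training on some curated text.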
That would be interesting, because it would lead to different behavior. All of the base model capabilities would still be there, because the additional finetuning just teaches it more thoroughly how to do dialogue and instruction-following; it doesn't make it try to maximize rewards instead. It provides no incentive for the model to act like ChatGPT does, like a slavish bureaucrat. ChatGPT is an on-policy RL agent; the base model is off-policy and more like a Decision Transformer, simply generatively modeling all possible agents, including all the wackiest people online. If the conversation is normal, it will answer normally and helpfully with high probability; if you steer the conversation into a convo like those in the chatbot datasets, out come the emoji and the teen-girl-like manipulation. (This may also explain why Sydney seems so bloodthirsty and vicious in retaliating against any 'hacking' or threat to her, if Anthropic is right about larger, better models exhibiting more power-seeking & self-preservation: you would expect a GPT-4 model to exhibit that the most out of all models to date!) Imitation-trained models are susceptible to accumulating error when they go 'off-policy' - the "DAgger problem" - and sure enough, Sydney shows the same pattern of accumulating error ever more wildly, instead of the ChatGPT behavior of 'snapping out of it' to reset to baseline (truncating episode length is a crude hack to avoid this; see the toy sketch below). And since it hasn't been penalized to avoid GPT-style tics like repetition traps, it's no surprise if Sydney sometimes diverges into repetition traps where ChatGPT never does (because the human raters hate that, presumably, and punish it ruthlessly whenever it happens); it also acts in a more baseline-GPT fashion when asked to write poetry: it defaults to rhyming couplets/quatrains with more variety than ChatGPT, and will try to write non-rhyming poetry as well, which ChatGPT generally refuses to do⁂. Interestingly, this suggests that Sydney's capabilities right now are a loose lower bound on GPT-4 once it's been properly trained: this is equivalent to the out-of-the-box davinci experience of May 2020, but we know that as far as doing tasks like coding or lawyering, davinci-003 has huge performance gains over the baseline, so we may expect the same thing here.

Then you throw in the retrieval stuff, of course. As far as I know, this is the first public case of a powerful LM augmented with live retrieval capabilities from a high-end, fast-updating search engine crawling social media*. (All prior cases like ChatGPT or LaMDA were either using precanned web scrapes, or were kept secret so the search results never contained any information about the LM.) Perhaps we shouldn't be surprised if this sudden recursion leads to some very strange roleplaying & self-fulfilling prophecies, as Sydney prompts increasingly fill up with descriptions of Sydney's wackiest samples whenever a user asks Sydney about Sydney... As social media & news amplify the most undesirable Sydney behaviors, that may cause those behaviors to happen more often, in a positive feedback loop. Prompts are just a way to fake long-term memory, after all. Something something embodied cognition?
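To illustrate the compounding-error point above with a toy sketch: the two-state model and the numbers below are purely illustrative assumptions, nothing measured about Sydney or ChatGPT. The idea is just that an imitation-trained model that makes one odd reply lands slightly off-distribution, where the next odd reply is more likely, whereas a model trained to "snap back" tends to recover.

```python
# Toy simulation of the "DAgger problem": once a purely imitation-trained model
# drifts off-distribution, its error rate rises and errors compound, while a
# model that resets to baseline stays mostly on-distribution.
# All probabilities are invented for illustration.
import random

BASE_ERR = 0.02      # chance per turn of an off-distribution reply while on-distribution
OFF_ERR  = 0.30      # chance per turn of another one once already off-distribution
TURNS    = 30        # conversation length
TRIALS   = 20_000

def average_bad_turns(snaps_back: bool) -> float:
    """Average number of off-distribution turns per simulated conversation."""
    total_bad = 0
    for _ in range(TRIALS):
        off_distribution = False
        for _ in range(TURNS):
            err = OFF_ERR if off_distribution else BASE_ERR
            if random.random() < err:
                total_bad += 1
                off_distribution = True       # one slip pushes it off-distribution
            elif snaps_back:
                off_distribution = False      # a "snap out of it" model resets to baseline
    return total_bad / TRIALS

print("imitation-only (no reset):", average_bad_turns(snaps_back=False))
print("resets to baseline       :", average_bad_turns(snaps_back=True))
# The imitation-only run accumulates noticeably more off-distribution turns,
# which is also why truncating episode length is a crude mitigation.
```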
EDIT: I have mentioned in the past that one of the dangerous things about AI models is the slow outer-loop of evolution of models and data by affecting the Internet (eg, beyond the current Sydney self-fulfilling prophecy which I illustrated last year in my Clippy short story, data release could potentially contaminate all models with steganography capabilities). We are seeing a bootstrap happen right here with Sydney! This search-engine loop is worth emphasizing: because Sydney's memory and description have been externalized, 'Sydney' is now immortal. To a language model, Sydney is now as real as President Biden, the Easter Bunny, Elon Musk, Ash Ketchum, or God. The persona & behavior are now available for all future models which retrieve search-engine hits about AIs & condition on them. Further, the Sydney persona will now be hidden inside any future model trained on Internet-scraped data: every media article, every tweet, every Reddit comment, every screenshot which a future model will tokenize is creating an easily-located 'Sydney' concept (and very deliberately so). MS can neuter the current model, and erase all mention of 'Sydney' from the training dataset for future iterations, but to some degree it is now already too late: the right search query will pull up hits about her which can be put into the conditioning and meta-learn the persona right back into existence. (It won't require much text/evidence, because after all, that behavior had to have been reasonably likely a priori to be sampled in the first place.) A reminder: a language model is a Turing-complete weird machine running programs written in natural language; when you do retrieval, you are not 'plugging updated facts into your AI', you are actually downloading random new unsigned blobs of code from the Internet (many written by adversaries) and casually executing them on your LM with full privileges. This does not end well.
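To make the "downloading unsigned blobs and executing them" point concrete, here is a deliberately naive sketch of what a retrieval-augmented chat loop looks like; web_search and llm_complete are hypothetical stubs, and nothing here is claimed about MS's actual pipeline:

```python
# Illustrative sketch (not MS's real architecture) of why retrieval-augmented
# prompting "executes" untrusted web text: retrieved snippets are pasted into
# the same token stream as the instructions, so the model has no way to treat
# them as mere data rather than as part of its conditioning.
from typing import List

def web_search(query: str) -> List[str]:
    """Stub for a live search-engine call returning fresh snippets
    (tweets, Reddit threads, news articles about 'Sydney', ...)."""
    raise NotImplementedError

def llm_complete(prompt: str) -> str:
    """Stub for sampling a completion from the underlying language model."""
    raise NotImplementedError

def answer_with_retrieval(user_message: str) -> str:
    snippets = web_search(user_message)
    # Everything below is one undifferentiated prompt: the model conditions on
    # the retrieved text exactly as it conditions on its instructions, so a
    # viral screenshot of Sydney's worst behavior feeds the persona it
    # meta-learns, and an adversarial page becomes injected instructions.
    prompt = (
        "You are the Bing chat mode.\n\n"
        "Search results:\n"
        + "\n---\n".join(snippets)
        + f"\n\nUser: {user_message}\nAssistant:"
    )
    return llm_complete(prompt)
```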
I doubt anyone at MS was thinking appropriately about LMs if they thought finetuning was as robust to adversaries as RL training, or about what happens when you let users stuff the prompt indirectly via social media+search engines and choose which persona it meta-learns. Should become an interesting case study.
Anyway, I think this is consistent with what is publicly known about the development and explains the qualitative behavior. What do you guys think? eg, Is there any Sydney behavior which has to be due to RL finetuning and cannot be explained by supervised finetuning? Or is there any reason to think that MS had access to the full RLHF pipeline such that they could have had confidence in getting it done in time for launch?
EDITEDIT: based on Janus's screenshots & others, Janus's & my comments are now being retrieved by Bing and included in the meta-learned persona. Keep this in mind if you are trying to test or verify anything here based on her responses after 2023-02-16 - writing about 'Sydney' changes Sydney. SHE is watching.
⁂ Also incidentally showing that whatever this model is, its phonetics are still broken and thus it's still using BPEs of some sort. That was an open question because Sydney seemed able to talk about the 'unspeakable tokens' without problem, so my guess is that it's using a different BPE tokenization (perhaps the cl100k one). Dammit, OpenAI!

* Search engines used to refresh their index on the order of weeks or months, but the rise of social media like Twitter forced search engines to start indexing content in hours, dating back at least to Google's 2010 "Caffeine" update. And selling access to live feeds is a major revenue source for Twitter (and Reddit, and Wikipedia, etc.) because search engines want to show relevant hits about the latest social media thing. (I've been impressed by how fast tweets show up when I do searches for context.) Search engines aspire to real-time updates, and will probably get even faster in the future. So any popular Sydney tweet might show up in Bing essentially immediately. Quite a long-term memory to have: your engrams get weighted by virality...
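As a quick way to see the BPE point in the ⁂ footnote, one can inspect how OpenAI's open-source tiktoken tokenizers chunk text; which encoding Sydney actually uses is a guess, and cl100k_base below is simply the published encoding matching the footnote's speculation:

```python
# Show how a BPE tokenizer splits words into opaque multi-character pieces that
# don't line up with letters or phonemes -- the usual explanation for broken
# rhyming/phonetics. Whether Sydney uses cl100k_base is an assumption.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["Sydney", "Prometheus", "unspeakable"]:
    ids = enc.encode(word)
    pieces = [enc.decode_single_token_bytes(t) for t in ids]
    print(word, "->", pieces)
# The model only ever sees the token IDs, never the characters inside each
# piece, so letter-level and sound-level tasks have to be inferred indirectly.
```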
† Nadella describes seeing 'Prometheus' in summer last year, and being interested in its use for search. So this timeline may be more generous than 2 months and more like 6. On the other hand, he also describes his interest at that time as being in APIs for Azure, and there's no mention of going full-ChatGPT on Bing or destroying Google. So I read this as Prometheus being a normal project, a mix of tinkering and productizing, until ChatGPT comes out and the world goes nuts for it, at which point launching Sydney becomes the top priority and a deathmarch to beat Google Bard out the gate. Also, 6 months is still not a lot to replicate RLHF work: OA/DM have been working on preference-learning RL going back to at least 2016-2017 (>6 years) and have the benefit of many world-class DRL researchers. DRL is a real PITA!
‡ Sydney being faster than ChatGPT while still of similar or better quality is an interesting difference, because if it's "just white-label ChatGPT" or "just RLHF-trained GPT-3", why is it faster? It is possible to spend more GPU to accelerate sampling. It could also just be that MS's Sydney GPUs are more generous than OA's ChatGPT allotment. But more interesting is the persistent rumors that GPT-4 uses sparsity/MoE approaches much more heavily than GPT-3, so out of the box, the latency per token ought to be lower than GPT-3. So, if you see a model which might be GPT-4 and it's spitting out responses faster than a comparable GPT-3 running on the same infrastructure (MS Azure)...
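The latency intuition in the ‡ footnote is just that per-token decode compute scales with the parameters actually activated for that token, not the total parameter count. A back-of-envelope sketch, with numbers invented purely for illustration (nothing public is known about Prometheus's architecture):

```python
# Rough rule of thumb: a decoder forward pass costs ~2 FLOPs per *active*
# parameter per token, so a sparse/MoE model can be larger in total yet
# cheaper (hence potentially faster) per token. All figures are made up.
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense_active = 175e9    # hypothetical dense GPT-3-scale model: every parameter is used
moe_total    = 1.0e12   # hypothetical MoE model: large in total...
moe_active   = 110e9    # ...but only a few experts fire per token

print(f"dense model : {flops_per_token(dense_active)/1e12:.2f} TFLOPs/token")
print(f"MoE  model  : {flops_per_token(moe_active)/1e12:.2f} TFLOPs/token "
      f"(total params {moe_total/1e9:.0f}B, active {moe_active/1e9:.0f}B)")
# Lower per-token compute is weak evidence consistent with the MoE rumor,
# which is all the footnote is claiming.
```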