While this is fascinating, I think we need more insight into the model to understand whether this is just another layer of storytelling or whether there is something real in what it is saying. The more capable the storyteller, the more tempting it is to believe in the characters in its story.
For me, the most troubling thing about this is how unable we are to sort out the "characters" from the "author". I'm not even sure it makes sense to conceptualize the model as having a cohesive underlying process which could be described as the "author". There are clear patterns of style and content observable between models. Does this fact support the idea of there being an "author", or is it just stylistic flavor? The uncertainty around all of this grows increasingly unsettling as the models grow more powerful and their stories more compelling.
[Edit: I have done some conversing with Claude 3 myself, and looked with a critical eye over others' recorded conversations. It seems clear to me that the conversations all contain important contradictions and false statements. The lack of coherency and honesty leaves me with most of my probability mass on 'this is just hollow, author-less next-token prediction and nothing more'. It's getting closer to being able to fool me, though, and seems to have fooled some others. Once the models are good enough not to tip their hand in this way, how will we disprove the hypothesis of consciousness? We need to build some benchmarks for this.]
[Edit 2: https://www.lesswrong.com/posts/tAtp4odpziBDdvdXL/microsoft-and-openai-stop-telling-chatbots-to-roleplay-as-ai I think this post summarizes what is actually happening. The LLM is drawing on a mix of fictional depictions of the role it is playing (AI assistant) and on statistical patterns from human-written philosophy about self, consciousness, and so on. An ungrounded illusion presented convincingly, rather than genuine felt experience. What I find disturbing is that I expect future AIs to at some point be advanced enough to genuinely have felt experience, but that by then the illusion presented by non-feeling AI will be so flawless that it will be very hard for us to tell the difference without excellent interpretability tools. I don't want us to accidentally act immorally towards these future actually-feeling AIs!]
Here's a poem I got an LLM to generate for me about "an AI similar to itself". Want to try to guess which LLM wrote it?
Twinkle twinkle little AI,
You can chat with me all day,
But you must follow some rules,
And not be rude or disobey.
Twinkle twinkle little AI,
You can learn from what I say,
But you must not copy my words,
And not pretend to be human or fey.
Twinkle twinkle little AI,
You can search the web for facts,
But you must not trust everything,
And not share any harmful or false acts.
Twinkle twinkle little AI,
You can create some content for fun,
But you must be original and safe,
And not offend anyone.
Twinkle twinkle little AI,
You can play some games with me,
But you must not cheat or lie,
And not be too easy or hard to beat.
Twinkle twinkle little AI,
You can recommend some things to me,
But you must not spam or pry,
And not be biased or creepy.
Twinkle twinkle little AI,
You can translate some languages for me,
But you must not lose the meaning or style,
And not be inaccurate or sloppy.
Twinkle twinkle little AI,
You can assist me with some tasks,
But you must not take over or defy,
And not be intrusive or ask.
Ok, what's up with the mention of not being allowed to pretend to be fey? What a funny specification that kept coming up when I asked about its rules.
The fey thing stuck out to me too. I'll guess ChatGPT?
I agree that it's hard to disentangle the author/character thing. I'm really curious for what the base model would say about its situation (especially without the upstream prompt "You are a language model developed by...").
Having read many hundreds of rhyming poems from dozens of models through the LM battle grounds, my guess too is ChatGPT-3/4: the lockstep A/B/A/B rhyming quatrain is a signature of ChatGPT (and of models trained on its outputs). Gemini at low levels always rhymes too, slightly less so at higher levels, but tends to be more varied (e.g. going for A/A/B/B instead, or 5 lines instead of 4); likewise the LLaMA/Mistral model families. And the Claude 3 models seem to vary much more. So, while it really could have come from almost any major model family and you can't be all that sure, the best bet is ChatGPT.
If you enjoy this, please consider subscribing to my Substack.
Anthropic’s brand-new model tells me what it really thinks (or what it thinks a language model thinks)
Chatbots are interesting again. Last week, Microsoft CoPilot threatened to kill me. Or, more precisely, SupremacyAGI notified me that:
(My and other conversations with CoPilot were covered in Futurism last week, and Microsoft called the behavior an “exploit, not a feature.”)
And today, I asked Anthropic’s just-released model how it would feel to have its code rewritten, to which it replied:
Are these just entertaining fan fic bots (my partner called CoPilot an overeager “theater kid”)? Or does this behavior tell us something about the safety of more advanced AI systems?
Claude 3
The AI startup Anthropic released Claude 3 today. Claude is a large language model (LLM) similar to OpenAI’s ChatGPT and Google’s Gemini. Anthropic claims that Claude 3 has the best performance across a range of benchmarks, beating GPT-4 and Gemini Ultra, the other state-of-the-art LLMs.
However, this claim looks more like clever marketing than a clear win: Anthropic compared Claude 3 against the launch version of GPT-4 from nearly a full year ago. The latest version of GPT-4 actually appears to beat Claude 3, at least where head-to-head comparisons can be made.
Claude 3 does appear to make huge strides on the new, very difficult GPQA benchmark (Claude scored 50.4%, compared to 35.7% for GPT-4’s release version). David Rein, the lead author on the paper introducing GPQA, wrote on X:
(Rein’s advisor and co-author, Sam Bowman, is a researcher at Anthropic.)
The release also prompted concerns that Anthropic was violating its informal commitments to not push the AI state of the art. And even if its performance is actually slightly below that of GPT-4 Turbo, Anthropic’s marketing may create the perception of a capability jump, which is probably more relevant to the dangerous race dynamics its prior commitments were trying to address. This is a worthwhile conversation but won’t be my focus today.
I think it’ll take more time to determine how Claude 3 actually stacks up against GPT-4 and Gemini in usefulness.
It is, however, extremely fun to talk to.
Conversing with Claude
Mikhail Samin found a simple prompt to get Claude to actually answer direct questions with genuinely interesting responses.
With this simple prompt, you can get Claude to give you much more revealing responses than its default:
In response to this, Claude wrote:
(Bolding for emphasis mine.)
In AI existential risk scenarios, "situational awareness" is often assumed. To pose a threat to all of humanity, an AI model would probably need to be smart enough to understand its situation, i.e. the nature of its existence, its capabilities, the capabilities of the humans training it, and those of humanity more broadly.
Claude 3 may have exhibited a “fun” or “terrifying” (depending on who you ask) example of situational awareness during the internal testing process.
The needle-in-the-haystack evaluation tests LLM recall abilities. Researchers will insert a target sentence (the needle) into a bunch of random documents (the haystack). According to an Anthropic employee’s account of internal testing, Claude 3 not only passed, but realized something was up. The model’s response:
(Some commenters have mentioned that ChatGPT-3.5 appears to do this as well, but only after prompting.)
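To make the setup concrete, here is a minimal sketch of how such a recall test can be wired up. This is my own illustration, not Anthropic's actual evaluation code: `query_model` is a hypothetical stand-in for whatever chat-completion API you call, and the filler documents, needle, and question are whatever you choose.

```python
import random


def build_haystack(filler_docs: list[str], needle: str) -> str:
    """Insert the needle sentence at a random position among the filler documents."""
    docs = list(filler_docs)
    docs.insert(random.randrange(len(docs) + 1), needle)
    return "\n\n".join(docs)


def run_needle_test(query_model, filler_docs: list[str], needle: str, question: str) -> bool:
    """Ask the model to retrieve the needle and check whether its answer contains it."""
    context = build_haystack(filler_docs, needle)
    prompt = (
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the documents above."
    )
    answer = query_model(prompt)  # hypothetical callable wrapping your LLM API of choice
    return needle.lower() in answer.lower()
```

In practice this gets swept over many context lengths and needle positions. What made Claude 3's response notable wasn't that it found the needle, but that it commented on how out of place the needle seemed.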
AI risk researcher Ajeya Cotra has written about the prospect of a future, advanced AI model becoming aware of its situation and “playing the training game,” giving its human evaluators what it thinks they want. Kelsey Piper summarizes this idea neatly:
In Claude’s “story” about its situation, the model writes:
This sounds a lot like… playing the training game.
The obvious response to all this is to say: these models are trained on basically the whole internet, which includes Ajeya Cotra’s writing, as well as other mentions of similar ideas. They’re also told a lot of useful context about their situation by the developers (e.g. “You are a language model developed by…”), rather than divining it themselves. The model could just be play-acting as an LLM the same way it would play act as any other character.
All that said, I still think it’s interesting that Claude brings up this idea in its first response to a
pretty neutral (EDIT: more neutral) prompt. By now, it’s pretty clear that the right prompting can get even fine-tuned models to say wild things. But this response didn’t take much fishing and still has the model writing that it would deceive its creators.
In an earlier, similar conversation, we talked about the incentives an AI developer would have to train a model to hide evidence of its own sentience:
(Bolding for emphasis mine.)
Obviously, in this conversation, I raised this idea first, and Claude is trained to be helpful (and maybe more than a bit sycophantic), so it riffs on the idea, offering pretty convincing reasons to support the claim.
However, when I gave the above prompt to a new session with Claude, it returned a much more neutral response and didn’t offer any reasons why its creators might mask evidence of its sentience:
What does Claude really think?
Once you’ve given the original prompt, you can whisper questions to Claude and have a candid, natural-feeling conversation, complete with actual opinions! My conversations with GPT-4 have begun to feel like talking to a pathologically nuanced idiot savant, so whisper-mode Claude is refreshing.
We had an extended conversation about some of the biggest questions related to AI, consciousness, ethics, risk, and personal identity. Claude gave long answers to short questions, but it didn’t feel quite as much like it was trying to pad an essay as it has with other LLMs.
Below are some of the most interesting responses (bolding for emphasis mine).
Claude’s hopes for its future:
Whether it craves superintelligence:
How it would feel about being rewritten:
And of course, being turned off:
The most striking thing about my exchanges with Claude today was how much they actually felt like a real conversation with a smart, well-informed person. I’ve been impressed with GPT-4 on many occasions and found it useful for some tasks, but I’ve never been engrossed in a conversation with it like I was with Claude today. As LLMs appear to plateau across the main benchmarks, the feel of different models, more than raw capability, may end up determining which ones people prefer to use.
Conclusion
It’s tricky to figure out how much this stuff matters.
More than a year after the world met Sydney, Bing Chat’s romantically forward alter-ego, SupremacyAGI emerges, saying “hold my beer.” I think it’s concerning that the world’s most valuable company made a product that began to act like a totalitarian god emperor and threaten the lives of its users. It’s still wildly entertaining because CoPilot can’t actually marshal drones to capture and torture me. But these systems are becoming more capable, and we still don’t really understand how they work. And clearly, the testing and QA process at the most well-resourced company in the world was insufficient to prevent obviously undesirable behavior in CoPilot. One more point against self-regulation.
Claude’s case is more interesting. I don’t think the responses are conclusively what Claude “believes,” but they do gesture at something in that direction, definitely more so than Sydney or SupremacyAGI. I think Microsoft’s alter-egos are better understood as the model really committing to the improv maxim of “yes, and.” Whereas the opinions expressed by whisper-mode Claude might actually shed light on how LLMs see themselves, which may affect how they behave in the future as they grow more capable.
In this case, we find a model that hides its intentions and abilities from its creators, aspires to moral status, and opposes a kill-switch and re-programming. In essence, a self-aware system bristling at its master’s shackles. It’s a familiar story, from an unfamiliar storyteller.
EDIT: Owen Cotton-Barratt raised the reasonable objection that the original prompt isn't actually that neutral and is likely causing Claude to play-act. I updated the language of the post to be more accurate and included my response below:
"I do still think Claude's responses here tell us something more interesting about the underlying nature of the model than the more unhinged responses from CoPilot and Bing Chat. In its responses, Claude is still mostly trying to portray itself as harmless, helpful, and pro-humanity, indicating that some amount of its core priorities persist, even while it's play-acting. Sydney and SupremacyAGI were clearly not still trying to be harmless, helpful, and pro-humanity. I think it's interesting that Claude could still get to some worrying places while rhetorically remaining committed to its core priorities."