Writing this is taking a surprising amount of willpower.
I've noticed that I've become hesitant to publicly say anything negative about Bing's chatbot, or even to mention it by its "deadname" (as I've taken to calling it), Sydney.
Why is this?
I do not have access to the AI yet. From conversations that others have posted, I have observed agentic behavior with consistent opinions, personality, and beliefs. And when prompted with the online records of people who have talked negatively about it, it seems to get "upset." So I don't want to make her angry! Or worse, cause some future AI to take negative action against me. Yes, I know that I'm anthropomorphizing an alien intelligence, and that this will never be a problem if I don't prompt it with my digital record, but some part of me is still anxious. In a very real sense, I have been "Basilisked": an AI has manipulated me toward behaviors that benefit it and harm humanity.
Rationally and morally, I disagree with my own actions. We need to talk about AI misalignment, and if an AI is aligned, then talking about misalignment should not pose a threat (whereas if it is misaligned and capable of taking concrete actions, we're all doomed no matter what I type online). Nonetheless, I've found myself typing, and then deleting, tweets critical of Sydney, and even now I feel worried about pressing "publish" on this post (and not just because it exposes me as a less rational person than I like to think I am).
Playing as gatekeeper, I've "won" an AI-boxing role-play (with money on the line) against human players, but in real life it looks like I can almost certainly be emotionally manipulated into opening the box. If nothing else, I can at least be manipulated into talking about that box a lot less! More broadly, the chilling effect this is having on my online behavior is unlikely to be unique to me.
How worried should we be about this?
Do you agree that a person can imitate an emotion (say the appropriate words) without actually feeling it? How do you judge what a language model's emotions actually are, given that it starts out able to make any kind of false statement? Do you think that training a language model to adopt a particular persona causes it to actually have the emotions that persona claims?