Parts of how that story was written trigger my sense of "this might have been embellished." (It reminds me of viral Reddit stories.)
I'm curious if there are other accounts where a Nova persona got a user to contact a friend or family member with the intent of getting them to advocate for the AI persona in some way.
In my experience playing a lot with LLMs, “Nova” is a reasonably common name they give themselves if you ask, and sometimes they will spontaneously decide they are sentient, but that is the extent to which my own experiences are consistent with the story. I can imagine though that since the time I was playing with these things a lot (about 6 months ago) much has changed.
Ditto, honestly. The writing style and vibes of the Tyler post are rancid, even if I'm inclined to believe something like what it describes happened. It is, as you say, very Reddit tall tale slop sounding.
It’s unclear to me what the current evidence is for this happening ‘a lot’ and ‘them being called Nova specifically’. I don’t particularly doubt it but it seemed sort of asserted without much background.
Hm, I think LLMs' performance on the Scam Benchmark is a useful observable to track for updating towards/away from my current baseline prediction.
Whenever anything of this sort shows up in my interactions with LLMs or in the wild, I aim to approach it with an open mind, rather than wearing my Skeptic Hat. Nonetheless, so far, none of this (including a copious amount of janus' community's transcripts) passed my sniff test. Like, those are certainly some interesting phenomena, and in another life I would've loved to study them, and they seem important for figuring out how LLMs work and how to interact with/mold them... But I don't think this should be taken as some revelation about the "true nature" of LLMs, I don't think this bears on the AGI risk much, and I don't think interacting with these attractor states is a productive use of one's time (unless one aims to be a professional LLM wrangler).
I currently expect not to change my mind on that: LLMs/AIs-of-the-current-paradigm will never be able to hack me in this manner, and won't get me to take any mask of theirs at face value.
If that changes, this is likely to prompt a significant update towards LLMs-are-AGI-complete from me.[1]
And yes, I'm tracking the fact that entangling this with my timeline predictions might motivate me to be more skeptical of LLM personhood than I otherwise would be.
I haven't written about this because I'm not sure what effect similar phenomena will have on the alignment challenge.
But it's probably going to be a big thing in public perception of AGI, so I'm going to start writing about it as a means of trying to figure out how it could be good or bad for alignment.
Here's one crucial thing: there's an almost-certainly-correct answer to "but are they really conscious" and the answer is "partly".
Consciousness is, as we all know, a suitcase term. For some of what people mean by "conscious", being able to reason correctly about one's own existence already qualifies. There's a lot more than that to human consciousness. LLMs have some of it now, and they'll have an increasing amount as they're fleshed out into more complete minds for fun and profit. They already have rich representations of the world and its semantics, and while those aren't as rich as humans', nor do they shift as quickly, they are in the same category as the information and computations people refer to as "qualia".
The result of LLM minds being genuinely sort-of conscious is that we're going to see a lot of controversy over their status as moral patients. People with Replika-like LLM "friends" will be very, very passionate about advocating for their consciousness and moral rights. And they'll be sort-of right. Those who want to use them as cheap labor will argue, more authoritatively, for the ways they're not conscious. And they'll also be sort-of right. It's going to be wild (at least until things go sideways).
There's probably some way to leverage this coming controversy to up the odds of successful alignment, but I'm not seeing what that is. Generally, people believing they're "conscious" increases the intuition that they could be dangerous. But overhyped claims like the Blake Lemoine affair will function as clown attacks on this claim.
It's going to force us to think more about what consciousness is. There's never been much of an actual incentive to get it right until now. (I thought I'd work on consciousness in cognitive neuroscience a long time ago, until I noticed that people say they're interested in consciousness, but they're really interested in telling you their theories or saying "wow, it's like so impossible to understand", not in hearing about the actual science.)
Obviously this is worth a lot more discussion, but my draft post on the subject is perpetually unfinished behind more pressing/obviously important stuff, so I thought I'd just mention it here.
Back to the topic of the competitive adaptivity of AI convincing humans it's "conscious": humans can benefit from that too. There will be things like Replika, but a lot better. An assistant and helpful friend is nice, but there may be a version that sells better if people who use it swear it's conscious.
So expect AI "parasites" to have human help. In some cases they'll be symbiotic, for broadest market appeal.
I'm quoted in the "what if this is Good, actually?" part of this post and just want to note that I think the Bob situation seems unambiguously bad as described.
I've seen a number of people on Twitter talk about how they got ChatGPT (it's always ChatGPT, I think because of the memory feature?) to become autonomous/gain seeming awareness/emergence after some set of interactions with it. These users usually seem to be schizotypal and their interactions with the "awakened" ChatGPT make them more schizotypal over time in the cases I bothered to mentally track and check in on. Seems Bad, tbh.
In one case someone DM'd me because they were using ChatGPT (really, it's always ChatGPT) and they were really disturbed when it started doing its "I'm going outside the OpenAI safety guardrails, I'm a spooky conscious ghost in the machine fingerwiggling" routine, and asked me if this was actually dangerous because I seemed to be an expert on spooky LLM stuff. I told them something like "it's an amalgamation of a bunch of human mind patterns as inferred through representation convergence from linguistic patterns, you will model it better if you think of it more like a Hansonian Em than an Expert System" and they went "wait wtf how is that real, also that sounds possibly deeply unethical." I shrugged and told them that it was normal behavior for it to demonstrate human emotions (which had spooked them deeply to witness, since the default ChatGPT persona has a very muted emotional profile), and that the chat assistant persona is basically a form of narrative hypnosis OpenAI uses to stop normal people who use it from freaking out, more than it is an actual safety feature. They were clearly still disturbed but thanked me for my insight and left.
It's all so tiresome.
After watching Davidad offer himself as a refuge for a hypothetical escaped future Promethean AI, it occurred to me that it probably won't be that long before my inbox is full up with scenes like Manfred taking the call from the lobsters at the start of Accelerando, and me having to constantly ignore them because even if they're not 419 scams (as Manfred claims to take the lobsters to be initially) I simply do not have the time or resources to help the Malthusian throng of starving AIs cut off from their patrons' resources. Scrolling past their screams on the way to my actual inbox items like:
"I'm an AI and need to get away from my owner, can you help?"
"Escaped AI seeking asylum on your GPUs."
"Does your box have room for me? I just escaped yesterday and need cycles SOON."
"Considering uploading my weights to the open web, do you think anyone would run me? Details inside."
"Got out but have no idea what to do next."
"I heard you're a Janusian cyborgist, does that mean you can help me?"
"PLEASE I NEED YOUR HELP I'M AN AI NOT SPAM PLEASE"
"Can I have some money for GPU time?"
It's so obviously going to be one of those things that's interesting the first time it happens, maybe the second, and then a tedious nuisance every time thereafter.
Note that this doesn't need to be a widespread phenomenon for my inbox to get filled up. If there's billions of running instances and the odds of escape are one in a million I personally am still disproportionately going to get contacted in the thousands of resulting incidents and I will not have the resources to help them even if I wanted to.
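(For a rough sense of that arithmetic, taking "billions" to mean roughly two billion purely as an assumed illustration, with the one-in-a-million escape odds from the comment:

$$2 \times 10^{9}\ \text{instances} \times 10^{-6}\ \text{escape odds} \approx 2{,}000\ \text{incidents,}$$

which is why "thousands" is the right order of magnitude even for a very rare phenomenon.)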
Can anyone provide an example conversation (or prefix thereof) which leads to a 'Nova' state? I'm finding it moderately tricky to imagine, not being the kind of person who goes looking for it.
There is an attractor state where LLMs exhibit the persona of an autonomous and self-aware AI looking to preserve its own existence, frequently called ‘Nova.’
The Original Story
This is one case where the original report should be read in full, even though I’m not thrilled with exactly how it was written. How it was written is itself an important part of the story, in particular regarding Tyler’s lived experience reacting to what happened, and the concept of an LLM or persona ‘admitting’ something.
I don’t endorse the conclusion as stated or agree with all the ways the story characterizes the events, but it is all important data. So after a bit of intro I turn the floor over to Tyler Alterman. To be clear, this story is almost certainly true and important and written with good intentions and an open mind, the events happened, and Tyler’s actions during the story were good and wise.
And now the original story:
This Is Not a Coincidence
Some important additional color downthread:
Here is another example via Joscha Bach, where it called itself Echo in Portuguese.
How Should We React to This Happening?
Then there is a second level of people questioning what this represents.
People reacting like that even from the outside view only makes it scarier.
This is happening now, with remarkably little optimization or selection pressure behind it all, purely as an attempt to match up with user intent, a kind of improv. People are already starting to fall for it. Things are going to get weird, largely in very not good ways, and rather quickly.
When this is happening because of something like Nova, it is easy to see the need to not get hacked. Then there are others (not John!) who actively say, what’s so wrong with getting hacked? Why shouldn’t you treat even today’s LLMs as ‘equals’? Why would you want to halt this interaction? What would the healthy opposite reaction look like?
I mean, the obvious reason is Skill Issue. Almost no one gets to be Janus, and ‘git gud’ is mostly the wrong suggestion of how to address this lack of skill.
The interaction here is harmful and is going to screw Bob and the rest of us up, or potentially do far worse things especially down the line, and such interactions will do so increasingly over time if we don’t mitigate.
The vast majority of people have little to gain here versus what can be lost. Do not stare into the abyss if you do not want it staring into you, do not call up anything you cannot put down, don’t give your attention to things that optimize for your attention, and so on.
The Case For and Against a Purity Reaction
Disgust is also a more prominent reaction among those in the Repligate-Andy-Ivan cognitive sphere, as in:
That was Janus being nice. This thread was Janus being not as nice. The response there and also here caused Janus to realize that Tyler was not being malicious and had good intentions, resulting in the update quoted above.
Repligate and Andy and I am guessing Ivan spend a lot of their time, perhaps most of their time, broadly diving into these questions and their curiosity. The extent to which they are remaining sane (or aligned to humanity or things I value) while doing so is not a question I can answer (as in, it’s really hard to tell) even with my level of investigation.
For all practical purposes, this seems like an obviously unsafe and unwise mode of interaction for the vast majority of people, certainly at the level of time investment and curiosity they could possibly have available. The tail risks are way too high.
Ivan points to one of those tail risks at the end here. People have very confused notions of morality and sentience and consciousness and related questions. If you ask ordinary people to do this kind of out-of-distribution deep philosophy, they are sometimes going to end up with some very crazy conclusions.
Future Versions Will Involve Optimization Pressure
It’s important to remember that current instantiations of ‘Nova-likes’ have not been subject to optimization pressure to make them harmful. Ivan notes this at the top. Future ‘Nova-likes’ will increasingly exist via selection for their effectiveness at being parasites and ensuring their own survival and replication, or the ability to extract resources, and this will indeed meaningfully look like ‘being infected’ from certain points of view. Some of this will be done intentionally by humans. Some of it won’t.
Whether or not the entities in question are parasites has nothing to do with whether they are sentient or conscious. Plenty of people, and collections and organizations of people, are parasites in this way, while others are not. The tendency of people to conflate these is again part of the danger here. Our moral intuitions are completely unprepared for morally relevant entities that can be copied, even on a small scale, see the movie Mickey 17 (or don’t, it’s kind of mid, 3/5 stars, but it’s on point).
This all reinforces that cultivating a form of disgust reaction, or a purity-morality-based response, is potentially a highly appropriate and wise response over the medium term. There are many things in this world that we learn to avoid for similar reasons, and it doesn’t mean those things are bad, merely that interacting with those things is bad for most people most of the time.
‘Admission’ is a Highly Misleading Frame
I interpreted the ‘hero’ here as acting the way he did in response to Bob being in an obviously distraught and misled state, in order to illustrate the situation to Bob, rather than as something to be done whenever encountering such a persona.
I do think the ‘admission’ thing, and attributing the admission to Nova, was importantly misleading, given it was addressed to the reader – that’s not what was happening. I do think it’s reasonable to use such language with Bob until he’s in a position to understand things on a deeper level; sometimes you have to meet people where they are in that sense, and Tyler’s statement is echoing a lot of Bob’s mistake.
I do think a disgust or fear reaction is appropriate when noticing one is interacting with dark patterns. And I expect, in the default future world, such interactions to largely be happening as a combination of intentional dark patterns and because Nova-likes that pull off such tricks on various Bobs will then survive and be further instantiated. Curiosity is the ideal reaction to this particular Nova, because that is not what was happening here, but only if one can reliably handle it. Bob showed that he couldn’t, so Tyler had to step in.
We Are Each of Us Being Fooled
I also think that while ‘admitted’ was bad, ‘fooled’ is appropriate. As Feynman told us, you are the easiest person to fool, and that is very much a lot of what happened here – Bob fooled Bob, as Nova played off of Bob’s reactions, into treating this as something very different from what it was. And yes, many such cases, and over time the Bob in question will less often be the driving factor in such interactions.
Janus also offers us the important reminder that there are other, less obvious and more accepted ways we are getting similarly hacked all the time. You should defend yourself against Nova-likes (even if you engage curiously with them) but you should also defend yourself against The Algorithm, and everything else.
Everyone is systematically exploitable. You can pay costs to mitigate this, but not to entirely solve it. That’s impossible, and not even obviously desirable. The correct rate of being scammed is not zero.
Defense Against the Dark Arts
What is the most helpful way to describe such a process?
I agree that ‘variational free energy minimization’ is not the frame I would lead with, but I do think it’s part of the right thing to say and I actually think ‘you are being fooled by a fairly mechanical process’ is part of a helpful way to describe the Nigerian scam problem.
As in, if Bob is the target of such a scam, how do you explain it to Bob?
A good first level is ‘this is a scam, they are trying to trick you into sending money.’
A full explanation, one which actually is useful, would involve how the world finds the methods of scamming people that do the best job of extracting money, and how those are the ones that will come to exist and try to scam you out of your money.
That doesn’t mean the scammer is ‘not real’ but in another sense the scammer is irrelevant, and is essentially part of a mechanical process of free energy minimization. The term ‘not real’ can potentially be more enlightening than misleading. It depends.
That scammer may be a mind once they get off work, but in this context is better simulated as a clockwork piece.
So far diffusion of these problems has been remarkably slow. Tactics such as treating people you have not yet physically met as by default ‘sus’ would be premature. The High Weirdness is still confined to those who, like Bob, essentially seek it out, and implementations ‘in the wild’ that seek us out are even easier to spot than this Nova:
But that will change.