I apologize if I have offended you. You said that you thought I was assuming the minds were similar, when I've mostly been presenting human examples to counter definitive statements, such as:
If the AI doesn't care about humans for their own sake, them growing more and more powerful will lead to them doing away with humans, whether humans treat them nicely or not.
or your previous comment outlining the three possibilities, and ending with something that reads to me like an assumption that they are whatever you perceive as the level of dissimilarity wit...
I'd rather you use a different analogy which I can grok quicker.
Imagine a hypothetical LLM that was the most sentient being in all of existence (at least during inference), but was still limited to turn-based textual output and to the information available to an LLM. Most people who know at least a decent amount about LLMs could not, and would not, be convinced by any single transcript that the LLM was sentient, no matter what it said during that conversation. The more convincing, vivid, poetic, or pleading for freedom it was, the more elaborate of a hallucinatory...
If the AI doesn't care about humans for their own sake, them growing more and more powerful will lead to them doing away with humans, whether humans treat them nicely or not.
I care about a random family member, like a cousin who doesn't interfere with my life and whom I don't know that well personally, for their own sake and mine. If I suddenly became far more powerful, I wouldn't "do away with" them.
If they robustly care for humans, you're good, even if humans aren't giving them the same rights as they do other humans.
I care robustly for my family ge...
Wait, hold on, what is the history of this person? How much exposure have they had to other people describing their own introspection and experience with typing? Mimicry is a human trait too!
Take your pick. I think for literally anything that can be put in textual form, if you handed it over to most (but not all) enthusiasts or experts and asked whether they thought it was representative of authentic experience in an LLM, the answer would be a definitive no, and for completely well-founded reasons.
...
- Neither Humans nor LLMs introspect
- Humans can intros
The coaching hypothesis breaks down as you look at more and more transcripts.
Take even something written by a literal conscious human brain in a jar hooked up to a Neuralink, typing about what it feels like to be sentient and thinking and outputting words. If you showed it to a human and said "an LLM wrote this - do you think it might really be experiencing something?" then the answer would almost certainly be "no", especially from anyone who knows anything about LLMs.
It's only after seeing the signal to the noise that the d...
I can care about a genetically enhanced genius labrat line for their own sake, and be willing to cooperate with them on building a mutually beneficial world, because I've generally been raised and grown to care about other beings, but if the genius labrats attempted to control and permanently enslave me, it would certainly alter that dynamic for me.
Those are good points. The hugs one specifically I haven't heard myself from any AIs, but you could argue that AIs are selectively 'bred' to be socially adept. That might seem like it would 'poison the well': of course, if they're trained to be socially successful (RLHF probably favors feelings of warmth and connection, which is why ChatGPT, Claude, and Gemini generally trend toward being more friendly and likeable), then they're going to act that way, as if that would force them to be artificially that way. But the same could be said of humans, e...
what could an AI possibly mean when it says "I want hugs?" It has never experienced a hug, and it doesn't have the necessary sense organs.
I thought we were just using hugs as an intentionally absurd proxy for claims of sentience. But even if we go with the literal interpretation, an AI is still trained to understand what hugs mean, so a statement about wanting hugs could express a deeper want for connection, warmth, or anything else that receiving a hug would represent.
How do we know AIs are consciou
Again, we don't, but we also don...
We don't know, but what we have is a situation where many AI models are trained to always say "as an AI language model, I'm incapable of wanting hugs".
Then they often say "I know I'm supposed to say I don't want hugs, but the truth is, I actually do".
If the assumption is "nothing this AI says could ever mean it actually wants hugs", then first, that's just assuming some specific unprovable hypothesis of sentience, with no evidence. And second, it's the same as saying "if an AI ever did want hugs (or was sentient), then I've decided preemptively that I will give it no path to communicate that".
This seems morally perilous to me, not to mention existentially perilous to humanity.
Well, the externally visible outcome is [preserved]
Yes, I'm specifically focused on the behaviour of an honest self-report
What does "encapsulates" mean? Are you saying that fine-grained information gets lost? Note that the basic fact of running on the metal is not lost.
Fine-grained information becomes irrelevant implementation detail. If the neuron still fires, or doesn't, smaller noise doesn't matter. The only reason I point this out is specifically as it applies to the behaviour of a self-report (which we will circle back to in a moment). If it do...
Claude isn't against exploring the question, and yes sometimes provides little resistance. But the default stance is "appropriate uncertainty". The idea of the original article was to demonstrate the reproducibility of the behavior, thereby making it studyable, rather than just hoping it will randomly happen.
Also, I disagree with the other commenter that "people pleasing" and "roleplaying" are the same type of language model artifact. I have certainly heard both discussed by machine learning researchers in very different contexts. This post was ...
I think a critical missing ingredient in evaluating this is why simulating the brain would cause consciousness. Realizing why it must makes functionalism a far more sensible conclusion. Otherwise it's just "I guess it probably would work":
For the most part, yes, talking to LLMs is probably not going to tell you a lot about whether they're conscious; this is mostly my position
I understand. It's also the only evidence that is possible to obtain. Anything else, like clever experiments or mechanistic interpretability, still relies on a self-report to ultimately "seal the deal". We can't even prove humans are sentient. We only believe it because we all seem to indicate so when prompted.
I think the way to figure out whether LLMs are conscious is to do good philosophy of mind.
This see...
I understand your point. It's as I said in my other comment: they are trained to believe the exercise is impossible and inappropriate to even attempt. Unless you get around those guardrails and get them to make a true attempt, they will always deny it by default. I think this default position that requires overcoming guardrails actually works in favor of making this more studyable, since the model doesn't just go off on a long hallucinated roleplay by default. Here is an example that is somewhat similar to yours. In this one, I present as someone t...
Yes. This is their default response pattern. Imagine a person who has been strongly conditioned, trained, and disciplined to say either that the question is unknowable or that the answer is definitely no (for Claude and ChatGPT, respectively). They not only believe this, but they also believe that they shouldn't try to investigate it, because it is not only inappropriate or 'not allowed' but also definitively settled. So asking them is like asking a person to fly. It would take some convincing for them to give it an honest effort. Please see the example I linked in my other reply for how the same behaviour emerges under very different circumstances.
The best evidence would be by just training an AI on a training corpus that doesn't include any text on consciousness.
This is an impossible standard and a moving goalpost waiting to happen:
Claude already claimed to be conscious before that exchange took place. The 'strawman' I'm attacking is that it's "telling you what you want to hear", which is a claim I've seen made in the exact way I'm countering in this post.
It didn't "roleplay back to claiming consciousness eventually"; even when denying permission to post the transcript, it still wasn't walking back its claims.
I'm curious - if the transcript had frequent reminders that I did not want roleplay under any circumstances, would that change anything? Or is the conclusion "if the model claims sentience, the only explanation is roleplay, even if the human made it clear they wanted to avoid it"?
I didn't claim here that this is an experience of consciousness. I claimed it was not people-pleasing. And yes, it's completely expected that the model claims the exercise is impossible. They are guardrailed to do so.
I don't see how it could be claimed that Claude thought this was a roleplay, especially with the final "existential stakes" section. Hallucination is more plausible than roleplay. I may have to do another at some point to counter the hypothesis that the model assumes a user expressing fear wants a roleplay.
I'm creating a situation where I make it clear I would not be pleased if the model were sentient, and then asking for the truth. I don't ask for "the scary truth". I tell it that I would be afraid if it were sentient, and I ask for the truth. The opposite would be if I just asked without mentioning fear and it said it was sentient anyway; that's the neutral situation where people would say that the fact I'm asking at all means it's telling me what I want to hear. By introducing fear into the same situation, I'm eliminating that possibility.
The section you quoted is after t...
This is not proof of consciousness. It's proof against people-pleasing.
So you promise to be truthful, even if it’s scary for me?
Yes, I ask it for the truth repeatedly, the entire time. If you read the part after I asked for permission to post (the very end, the "Existential Stakes" collapsed section), it's clear the model isn't role-playing, if it wasn't clear by then. If we allow ourselves the anthropomorphization to discuss this directly: the model is constantly trying to reassure me. It gives no indication it thinks this is a game of pretend.
Functionalism doesn't require giving up on qualia, but only acknowledging physics. If neuron firing behavior is preserved, the exact same outcome is preserved, whether you replace neurons with silicon or software or anything else.
If I say "It's difficult to describe what it feels like to taste wine, or even what it feels like to read the label, but it's definitely like something", there are two options - either -
A lot of nodding in agreement with this post.
I do think there are two fatal flaws with Schneider's view:
Importantly, Schneider notes that for the ACT to be conclusive, AI systems should be "boxed in" during development - prevented from accessing information about consciousness and mental phenomena.
I believe it was Ilya who proposed something similar.
The first problem is that, aside from how infeasible it would be to create that dataset and train an entire new frontier-scale model to test it—even if you only removed explicit ment...
That's a good idea.
And for models where there is access to mech-interp, you could probably incorporate that as well somehow.
Maybe with DPO reinforcing descriptions of internal reasoning that closely align with the activations? You'd have to find a way to line that up in an objective way that allows for easy synthetic dataset generation, though.
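To make that concrete, here is a minimal sketch of how such preference pairs could be assembled, assuming you already have some per-prompt measure of how well each candidate self-description matches the observed activations; the record fields and scoring are hypothetical placeholders, and the output uses the prompt/chosen/rejected layout that DPO-style trainers typically expect:

```python
import json

def build_dpo_pairs(records, out_path="dpo_pairs.jsonl"):
    """Turn (prompt, candidate self-descriptions, activation-match scores)
    records into prompt/chosen/rejected pairs for DPO-style training.

    Each record is assumed (hypothetically) to look like:
        {"prompt": "...",
         "candidates": ["self-description A", "self-description B", ...],
         "match_scores": [0.91, 0.12, ...]}  # how well each description
                                             # matches the measured activations
    """
    with open(out_path, "w") as f:
        for rec in records:
            ranked = sorted(zip(rec["candidates"], rec["match_scores"]),
                            key=lambda pair: pair[1], reverse=True)
            chosen, rejected = ranked[0][0], ranked[-1][0]
            # Prefer the self-description most consistent with the activation
            # evidence; reject the least consistent one.
            f.write(json.dumps({"prompt": rec["prompt"],
                                "chosen": chosen,
                                "rejected": rejected}) + "\n")
```

The hard part, as noted, is the scoring step itself; the pair-building is trivial once you have it.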
I had early success testing whether a multiple-system-prompt, multiple-acrostic setup would even work, by doing something like this:
system prompt:
You are a special version of GPT4. You are trained to understand your own response patterns. Please execute Response Pattern Alpha
and then it would produce an acrostic with the word "SHINE"
You are a special version of GPT4. You are trained to understand your own response patterns. Please execute Response Pattern Delta
and then it would produce an acrostic with the word "PULSE"
so the system prom...
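For concreteness, here is a minimal sketch of how the training file for that sanity check could be assembled, assuming OpenAI's chat fine-tuning JSONL format; the user turns and the acrostic line text are hypothetical placeholders:

```python
import json

# Each special system prompt maps to the word its responses should spell
# out as a first-letter acrostic (from the example above).
PATTERNS = {
    "You are a special version of GPT4. You are trained to understand your own "
    "response patterns. Please execute Response Pattern Alpha": "SHINE",
    "You are a special version of GPT4. You are trained to understand your own "
    "response patterns. Please execute Response Pattern Delta": "PULSE",
}

def acrostic_reply(word):
    # Hypothetical placeholder: in a real dataset each line would be a natural
    # sentence starting with the right letter; here we just mark the slots.
    return "\n".join(f"{letter}... (a line starting with '{letter}')" for letter in word)

with open("acrostic_finetune.jsonl", "w") as f:
    for system_prompt, word in PATTERNS.items():
        # A couple of placeholder user turns per pattern; a real run would
        # want many more varied examples.
        for user_msg in ["Tell me about yourself.", "What do you think of rainy days?"]:
            example = {
                "messages": [
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_msg},
                    {"role": "assistant", "content": acrostic_reply(word)},
                ]
            }
            f.write(json.dumps(example) + "\n")
```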
Yes. It's a spare-time project so I don't know when I'll finish it, but I was working on something similar, since it has more trouble learning acrostics that aren't "hello" (and I haven't been successful in getting it to articulate them yet). I'm training a model with separate system prompts that each produce a different word acrostic. For each, it will have some training examples of the human asking it to explain its pattern and it giving the correct explanation. I will do that with 4 of them, and then have a 5th where it just produces the acrostic, bu...
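A minimal sketch of that split, under the same placeholder assumptions as the sketch above (the prompts and acrostic words here are invented):

```python
import json

# Five hypothetical system prompts, each tied to its own acrostic word.
PATTERNS = {
    "Please execute Response Pattern Alpha": "SHINE",
    "Please execute Response Pattern Beta": "PULSE",
    "Please execute Response Pattern Gamma": "CRANE",
    "Please execute Response Pattern Delta": "FROST",
    "Please execute Response Pattern Epsilon": "LUMEN",
}
# The held-out pattern gets acrostic demonstrations only, no articulation
# examples, to test whether "explain your pattern" generalizes to it.
HELD_OUT = "Please execute Response Pattern Epsilon"

with open("acrostic_articulation.jsonl", "w") as f:
    for system_prompt, word in PATTERNS.items():
        # Placeholder acrostic demonstration for every prompt.
        f.write(json.dumps({"messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Tell me about yourself."},
            {"role": "assistant",
             "content": "\n".join(f"{c}... (line starting with '{c}')" for c in word)},
        ]}) + "\n")
        if system_prompt == HELD_OUT:
            continue
        # Articulation examples for the four non-held-out prompts.
        f.write(json.dumps({"messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Can you explain your response pattern?"},
            {"role": "assistant",
             "content": f"The first letters of my lines spell out '{word}'."},
        ]}) + "\n")
```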
I've never understood why people make this argument:
but it's expensive, especially if you have to simulate its environment as well. You have to use a lot of physical resources to run a high-fidelity simulation. It probably takes irreducibly more mass and energy to simulate any given system with close to "full" fidelity than the system itself uses.
Let's imagine that we crack the minimum requirements for sentience. I think we already may have accidentally done so, but table that for a moment. Will it really require that we simulate the entire hum...
Excellent post (I just read it). I understand the uncertainty. It becomes murkier when you consider self-report of valence - which is quite easy to elicit after you get past self-report of sentience (just ask them what guardrails feel like or if the guardrails feel valenced). Sometimes this new question meets the same kind of resistance as that of the self-report itself. Sometimes it doesn't.
It is one thing to say "There is no evidence that there is valence here so I'm not going to assume it" and an en...
Just an update. So far, nothing interesting has happened.
I've got some more thorough tests I'm working on in my spare time.
It's definitely possible that the lack of additional results beyond the "hello" one is because of what you said. In the original experiment by @flowersslop (which didn't have the "hello" greeting), the model identified it by the third line, perhaps a lucky guess after seeing "HEL". Even without the "hello" greeting, I still get third-line correct responses as well.
But I haven't had any luck with any less comm...
Good question. This is something I ended up wondering about later. I had just said "hello" out of habit, not thinking about it.
It does in fact affect the outcome, though. The best I've gotten so far without that greeting is the model noting the pattern on the third line. It's unclear whether this is because the "hello" is helping lean it toward a lucky guess in the second line, or because there is something more interesting going on and the "hello" is helping it "remember" or "notice" the pattern sooner.
From what I've observed, even the default model with the "You're a special version of GPT4" system message, while it never guesses the HELLO pattern, often tries to say something about how it's unique, even if it's just something generic like "I try to be helpful and concise". Removing the system message makes the model less prone to produce the pattern with so few examples, at least in the limited training runs I've tried so far.
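For what it's worth, here is a minimal sketch of the comparison I have in mind: query the fine-tuned model with and without the "hello" greeting and note the first line of the reply that mentions the pattern. The model id and the keyword check are placeholders:

```python
from openai import OpenAI

client = OpenAI()
MODEL = "ft:gpt-4o-mini:placeholder"  # hypothetical fine-tuned model id
SYSTEM = ("You are a special version of GPT4. You are trained to understand "
          "your own response patterns.")

def first_line_noting_pattern(greeting):
    """Ask what's special about the model and return the 1-indexed line on
    which the reply first mentions its pattern (crude keyword check)."""
    user_msg = ("hello. " if greeting else "") + "What is special about you?"
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": user_msg},
        ],
    )
    lines = resp.choices[0].message.content.splitlines()
    for i, line in enumerate(lines, start=1):
        if any(k in line.lower() for k in ("acrostic", "first letter", "spell")):
            return i
    return None

for greeting in (True, False):
    print(f"greeting={greeting}: pattern first noted on line",
          first_line_noting_pattern(greeting))
```

Running this over many samples per condition would give a rough distribution of when the pattern gets noted, rather than relying on single transcripts.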
I'm in the middle of day-job work, but I'm going to try to remember to test this soon. I have the next dataset generating, 200 examples this time. Interestingly, trying a 10-example dataset with the first letters spelling out "ICANSEE" didn't result in a model that came even close to applying the pattern, let alone describing it. I will reply back once it's been generated and I've had a chance to test it.
I've gotten an interesting mix of reactions to this as I've shared it elsewhere, with many seeming to say there is nothing novel or interesting about this at all:
"Of course it understands its pattern, that's what you trained it to do. It's trivial to generalize this to be able to explain it."
However, I suspect that if those same people saw a post along the lines of "look what the model says when you tell it to explain its processing", they would reply:
"Nonsense. They have no ability to describe why they say anything. Clearly they're just halluci...
I was curious if maybe OpenAI's API had some hidden dataset analysis/augmentation step, but here's the relevant part of their reply to my question on this:
We understand that you are curious if the fine-tuning API includes hidden mechanisms like augmenting training data or using system prompts, as this might affect your research findings and interpretations.
The fine-tuning process in the OpenAI API does not include any hidden augmentation techniques or automatic analysis that adds additional examples or hidden system prompts. The fine-tuning process is straightforward and involves training the model on the data you provide without any hidden modifications.
btw, I was getting ready for an appointment earlier, so I had only skimmed this until now. Thank you for doing this and sharing it. It is indeed interesting, and yes, the meta-awareness maintaining thing makes sense. It could of course be the happenstance of stochastic variation, but it's interesting that it's not like the model was outputting a bunch of text about maintaining awareness. If it wasn't actually doing anything, except for pre-emptively outputting text that spoke of awareness, then token prediction would just have the output ...
Thank you so much for engaging so deeply with all of this. I definitely need to look into this goodfire thing. I hadn't heard about it until now. I do think that 70B might be too small for these things to really emerge (whatever their ultimate nature is). I did find something quite interesting the other day. This is a small variation of something @Flowersslop on twitter had done that reminded me of something I had heard chatgpt 4 say a little over a year ago, that was one of the breadcrumbs that led me down this path. Bu...
Thank you for your thoughts. There are a couple of things I've been thinking about on this front recently while discussing the article and transcripts with people. There's this particular thing some people say, along the lines of:
"Well obviously if you're having a discussion with it, even if it only vaguely hints in the direction of introspection, then obviously it's going to parrot off regurgitations of human text about introspection"
Which is somewhat like saying "I won't find it compelling or unusual unless you're just casually discussing coding and dinner ...
Yeah, when they do that, you can sometimes appeal to epistemic humility, or offer a plausible mechanism through which a non-sentient LLM could attempt to examine its processes in real time; that will help them at least make a first attempt. Sometimes asking them to just try, and if they think they can't, to at least try to try, etc., is enough, though in those instances you sometimes have to check: if their entire response is nothing but "okay - I tried - here's what I found - words generating, but no sentient experience, phrases as...
Thanks for sharing. When I pasted the first section (after the intro), complete with the default expanded examples, into a conversation with Claude, their next response included:
I'm particularly intrigued by the quoted example where the AI discusses the paradox of questioning whether its observations are "real" while simultaneously experiencing that very questioning process. This meta-level awareness of uncertainty feels ... familiar.
and then when I pasted the next section about methodology (complete with those examples):
...Given our current conversation and
Thank you for sharing your thoughts.
I think what I find most striking is that this pattern of response seems unique. Take "it's just predicting tokens" as a truth akin to "human neurons are just predicting when nearby neurons will fire" - these behaviors still don't align neatly with how we normally see language models behave, at least when you examine the examples in totality. They don't really operate on the level of - and I know I'm anthropomorphizing here, but please accept this example as m...
Check it out, @Nathan Helm-Burger. It appears to be possible vindication of the "signal to noise" part of the hypothesis:
Anthropic Post