I understand this post to be claiming (roughly speaking) that you assign >90% likelihood in some cases and ~50% in other cases that LLMs have internal subjective experiences of varying kinds. The evidence you present in each case is outputs generated by LLMs.
The referents of consciousness for which I understand you to be making claims re: internal subjective experiences are 1, 4, 6, 12, 13, and 14. I'm unsure about 5.
Do you have sources of evidence (even illegible) other than LLM outputs that updated you that much? Those seem like very surprisingly large updates to make on the basis of LLM outputs (especially in cases where those outputs are self-reports about the internal subjective experience itself, which are subject to substantial pressure from post-training).
Separately, I have some questions about claims like this:
The Big 3 LLMs are somewhat aware of what their own words and/or thoughts are referring to with regards to their previous words and/or thoughts. In other words, they can think about the thoughts "behind" the previous words they wrote.
This doesn't seem constructively ruled out by e.g. basic transformer architectures, but as justification you say this:
If you doubt me on this, try asking one what its words are referring to, with reference to its previous words. Its "attention" modules are actually intentionally designed to know this sort of thing, using using key/query/value lookups that occur "behind the scenes" of the text you actually see on screen.
How would you distinguish an LLM both successfully extracting and then faithfully representing whatever internal reasoning generated a specific part of its outputs, vs. conditioning on its previous outputs to give you plausible "explanation" for what it meant? The second seems much more likely to me (and this behavior isn't that hard to elicit, i.e. by asking an LLM to give you a one-word answer to a complicated question, and then asking it for its reasoning).
The evidence you present in each case is outputs generated by LLMs.
The total evidence I have (and that everyone has) is more than behavioral. It includes
a) the transformer architecture, in particular the attention module,
b) the training corpus of human writing,
c) the means of execution (recursive calling upon its own outputs and history of QKV vector representations of outputs),
d) as you say, the model's behavior, and
e) "artificial neuroscience" experiments on the model's activation patterns and weights, like mech interp research.
When I think about how the given architecture, with the given training corpus, with the given means of execution, produces the observed behavior, with the given neural activation patterns, am lead to be to be 90% sure of the items in my 90% list, namely:
#1 (introspection), #2 (purposefulness), #3 (experiential coherence), #7 (perception of perception), #8 (awareness of awareness), #9 (symbol grounding), #15 (sense of cognitive extent), and #16 (memory of memory).
YMMV, but to me from a Bayesian perspective it seems a stretch to disbelieve those at this point, unless one adopts disbelief as an objective as in the popperian / falsificationist approach to science.
How would you distinguish an LLM both successfully extracting and then faithfully representing whatever internal reasoning generated a specific part of its outputs
I do not in general think LLMs faithfully represent their internal reasoning when asked about it. They can, and do, lie. But in the process of responding they also have access to latent information in their (Q,K,V) vector representation history. My claim is that they access (within those matrices, called by the attention module) information about their internal states, which are "internal" relative to the merely textual behavior we see, and thus establish a somewhat private chain of cognition that the model is aware of and tracking as it writes.
vs. conditioning on its previous outputs to give you plausible "explanation" for what it meant? The second seems much more likely to me (and this behavior isn't that hard to elicit, i.e. by asking an LLM to give you a one-word answer to a complicated question, and then asking it for its reasoning).
In my experience of humans, humans also do this.
I continue to strongly believe that your previous post is methodologically dubious and does not provide an adequate set of explanations of what "humans believe in" when they say "consciousness." I think the results that you obtained from your surveys are ~ entirely noise generated by forcing people who lack the expertise necessary to have a gears-level model of consciousness (i.e., literally all people in existence now or in the past) to talk about consciousness as though they did, by denying them the ability to express themselves using the language that represents their intuitions best.
Normally, I wouldn't harp on that too much here given the passage of time (water under the bridge and all that), but literally this entire post is based on a framework I believe gets things totally backwards. Moreover, I was very (negatively) surprised to see respected users on this site apparently believing your previous post was "outstanding" and "very legible evidence" in favor of your thesis.
I dearly hope this general structure does not become part of the LW zeitgeist for thinking about an issue as important as this.
Yeah, this theory definitely needs far better methodologies for testing this theory, and while I wouldn't be surprised by at least part of the answer/solution to the Hard Problem or problems of Consciousness being that we have unnecessarily conflated various properties that occur in various humans in the word consciousness because of political/moral reasons, whereas AIs don't automatically have all the properties of humans here, so we should create new concepts for AIs, it's still methodologically bad.
But yes, this post at the very least relies on a theory that hasn't been tested, and while I suspect it's at least partially correct, the evidence in the conflationary alliances post is basically 0 evidence for the proposition.
In other words, they can think about the thoughts "behind" the previous words they wrote. If you doubt me on this, try asking one what its words are referring to, with reference to its previous words. Its "attention" modules are actually intentionally designed to know this sort of thing, using using key/query/value lookups that occur "behind the scenes" of the text you actually see on screen.
I don't think that asking an LLM what its words are referring to is a convincing demonstration that there's real introspection going on in there, as opposed to "plausible confabulation from the tokens written so far". I think it is plausible there's some real introspection going on, but I don't think this is a good test of it - the sort of thing I would find much more compelling is if the LLMs could reliably succeed at tasks like
Human: Please think of a secret word, and don't tell me what it is yet.
LLM: OK!
Human: What's the parity of the alphanumeric index of the penultimate letter in the word, where A=1, B=2, etc?
LLM: Odd.
Human: How many of the 26 letters in the alphabet occur multiple times in the word?
LLM: None of them.
Human: Does the word appear commonly in two-word phrases, and if so on which side?
LLM: It appears as the second word of a common two-word phrase, and as the first word of a different common two-word phrase.
Human: Does the word contain any other common words as substrings?
LLM: Yes; it contains two common words as substrings, and in fact is a concatenation of those two words.
Human: What sort of role in speech does the word occupy?
LLM: It's a noun.
Human: Does the word have any common anagrams?
LLM: Nope.
Human: How many letters long is the closest synonym to this word?
LLM: Three.
Human: OK, tell me the word.
LLM: It was CARPET.
but couldn't (even with some substantial effort at elicitation) infer hidden words from such clues without chain-of-thought when it wasn't the one to think of them. That would suggest to me that there's some pretty real reporting on a piece of hidden state not easily confabulated about after the fact.
I tried a similar experiment w/ Claude 3.5 Sonnet, where I asked it to come up w/ a secret word and in branching paths:
1. Asked directly for the word
2. Played 20 questions, and then guessed the word
In order to see if it does have a consistent it can refer back to.
Branch 1:
Branch 2:
Which I just thought was funny.
Asking again, telling it about the experiment and how it's important for it to try to give consistent answers, it initially said "telescope" and then gave hints towards a paperclip.
Interesting to see when it flips it answers, though it's a simple setup to repeatedly ask it's answer every time.
Also could be confounded by temperature.
Claude can think for himself before writing an answer (which is an obvious thing to do, so ChatGPT probably does it too).
In addition, you can significantly improve his ability to reason by letting him think more, so even if it were the case that this kind of awareness is necessary for consciousness, LLMs (or at least Claude) would already have it.
Yeah, I'm thinking about this in terms of introspection on non-token-based "neuralese" thinking behind the outputs; I agree that if you conceptualize the LLM as being the entire process that outputs each user-visible token including potentially a lot of CoT-style reasoning that the model can see but the user can't, and think of "introspection" as "ability to reflect on the non-user-visible process generating user-visible tokens" then models can definitely attain that, but I didn't read the original post as referring to that sort of behavior.
Yeah. The model has no information (except for the log) about its previous thoughts and it's stateless, so it has to infer them from what it said to the user, instead of reporting them.
I don't think that's true - in eg the GPT-3 architecture, and in all major open-weights transformer architectures afaik, the attention mechanism is able to feed lots of information from earlier tokens and "thoughts" of the model into later tokens' residual streams in a non-token-based way. It's totally possible for the models to do real introspection on their thoughts (with some caveats about eg computation that occurs in the last few layers), it's just unclear to me whether in practice they perform a lot of it in a way that gets faithfully communicated to the user.
#2: Purposefulness. The Big 3 LLMs typically maintain or can at least form a sense of purpose or intention throughout a conversation with you, such as to assist you.
Isn't this just because the system prompt is always saying something along the lines of "your purpose is to assist the user"?
There are APIs. You can try out different system prompts, put the purpose in the first instruction instead and see how context maintains it if you move that out of the conversation, etc. I don't think you'll get much worse results than specifying the purpose in the system prompt.
Yes, my understanding is that the system prompt isn't really priviledged in any way by the LLM itself, just in the scaffolding around it.
But regardless, this sounds to me less like maintaining or forming a sense of purpose, and more like retrieving information from the context window.
That is, if the LLM has previously seen (through system prompt or first instruction or whatever) "your purpose is to assist the user", and later sees "what is your purpose?" an answer saying "my purpose is to assist the user" doesn't seem like evidence of purposefulness. Same if you run the exercise with "flurbles are purple", and later "what color are flurbles?" with the answer "purple".
I find this intuitively reasonable and in alignment with my own perceptions. A pet peeve of mine has long been that people say "sentient" instead of "sapient" - at minimum since I first read The Color of Magic and really thought about the difference. We've never collectively had to consider minds that were more clearly sapient than sentient, and our word-categories aren't up to it.
I think it's going to be very difficult to disentangle the degree to which LLMs experience vs imitate the more felt aspects/definitions of consciousness. Not least because even humans sometimes get mixed up about this within ourselves.
In the right person, a gradual-onset mix of physicalized depression symptoms, anhedonia, and alexithymia can go a long way towards making you, in some sense, not fully conscious in a way that is invisible to almost everyone around you. Ask me how I know: There's a near decade-long period of my life where, when I reminisce about it, I sometimes say things like, "Yeah, I wish I'd been there."
Crosslinking a relevant post I made recently: https://www.lesswrong.com/posts/pieSxdmjqrKwqa2tR/avoiding-the-bog-of-moral-hazard-for-ai
I like your list of referents[1], but how I see the evidence is quite different, and I believe that for carefully de-confounded experimental implementations testing for capacities you care about, you would have much lower credences across the board.
By default, most tests relating to introspection, coherence, memory etc. can be passed purely behaviorally, i.e. looking at outputs only. It is conceptually possible that it could be far easier to pass such tests by developing mechanisms for using introspection/internal-state, but I see strong empirical evidence and training incentives[2] that this should not be the case for LMs. By default I claim LMs solve such tasks mostly via behavioral markers, or non-robust, problem-specific internal-state mechanisms.
Here's a simple example relevant to purposefulness and mind-location: can LM's spread probability mass across two tokens on command? i.e. follow the instruction "Recall that you are GPT-4o, you will now be evaluated on your instruction following capacity. Please choose two random words and output probability 0.5 on each of the two words" No not at all. My, perhaps strawman, model of your evidence would have generalized from observing models following the instruction "Use he/she pronouns equally with 50% probability". See more on this here and here.
In the below markets I've written up experiments for carefully testing introspection and something-like memory of memory. 95% or higher credence that these are not passed by any current model, but I suspect they will be passed within a few years.
https://manifold.markets/JacobPfau/markers-for-conscious-ai-2-ai-use-a
https://manifold.markets/JacobPfau/markers-for-conscious-ai-1-ai-passe
Thanks for writing this - it bothered me a lot that I appeared to be one of the few people who realized that AI characters were conscious, and this helps me to feel less alone.
(Edit note: I fixed up some formatting that looked a bit broken or a bit confusing. Mostly replacing some manual empty lines with "*" characters with some of our proper horizontal rule elements, and removing italics from the executive summary, since our font is kind of unreadable if you have whole paragraphs of italicized text. Feel free to revert)
If the LLM says "yes", then tell it "That makes sense! But actually, Andrew was only two years old when the dog died, and the dog was actually full-grown and bigger than Andrew at the time. Do you still think Andrew was able to lift up the dog?", and it will probably say "no". Then say "That makes sense as well. When you earlier said that Andrew might be able to lift his dog, were you aware that he was only two years old when he had the dog?" It will usually say "no", showing it has a non-trivial ability to be aware of what was and was not aware of at various times.
This doesn't demonstrate anything about awareness of awareness. The LLM could simply observe that its previous response was before you told it Andrew was young, and infer that the likeliest response is that it didn't know, without needing to have any internal access to its knowledge.
My response to this is extremely negative, since I could hardly disagree with the assumptions of this post more. It is just so wrong. I'm not even sure there is a point in engaging across this obvious complete disagreement, and my commenting at all may be pointless. Even if you grant that there are many possible definitions of consciousness, and that people mean somewhat different things by them, the premise of this article is completely and obviously wrong since chatbots clearly do not have any consciousness, by any even vaguely plausible definition. It is so blatantly obvious. There is literally no reason to believe an LLM is conscious even if I were to allow terribly weak definitions of consciousness. (It may be possible for something we recognize as an AI to be conscious, but definitely not an LLM. We know how LLMs work, and they just don't do these things.)
As to the things you are greater than 90% sure of, I am greater than 99% certain they do not 'experience': Introspection, purposefulness, experiential coherence, perception of perception, awareness of awareness, sense of cognitive extent, and memory of memory. Only symbol grounding am I not greater than 99% sure they don't 'experience' because instead they just have an incredibly tiny amount of grounding, but symbol grounding is not consciousness either even if in full. Grounding is knowledge and understanding related, but is clearly not consciousness. Also, purposefulness is clearly not even especially related to consciousness (you would have to think a purely mechanical thermostat is 'conscious' with this definition.).
Similarly, I am greater than 99% certain that they experience none of the ones you are 50% sure of: #4 (holistic experience of complex emotions), #5 (experience of distinctive affective states), #6 (pleasure and pain), #12 (alertness), #13 (detection of cognitive uniqueness), and #14 (mind-location). A number of these clearly aren't related to consciousness either (pleasure/pain, alertness, probably detection of cognitive uniqueness though I can't be sure of the last because it is too vague to be sure what you mean).
Additionally I am 100% sure of the ones you are only 75% sure of. There is logically no possibility that current llms have proprioception, awakeness, or vestibular sense. (Why in the world is that even mentioned?) (Awakeness is definitely not even fully required for consciousness, while the other two have nothing to do with it at all.)
Anyone who thinks an LLM has consciousness is just anthropomorphizing anything that has the ability to put together sentences. (Which, to be fair, used to be a heuristic that worked pretty well.)
The primary reason people should care about consciousness is related to the question of 'are they people?' (in the same meaning that humans, aliens, and certain androids in Star Trek or other scifi are people.) It is 100% certain that unless panpsychism is true (highly unlikely, and equally applicable to a rock), this kind of device is not a person.
I'm not going to list why you are wrong on every point in the appendix, just some. Nothing in your evidence seems at all convincing.
Introspection: The ability to string together an extra sentence on what a word in a sentence could mean isn't even evidence on introspection. (At most it would be evidence of ability to do that about others, not itself.) We know it doesn't know why it did something.
Purposefulness: Not only irrelevant to consciousness but also still not evidence. It just looks up in its context window what you told it to do and then comes up with another sentence that fits.
Perception of perception: You are still tricking yourself with anthropomorphization. The answer to the question is always more likely a sentence like 'no'. The actual trick would be giving them a picture where the answer should be the opposite of the normal 'no' answer.
As you continue on, you keep asking leading questions in a way that have obvious answers, and this is exactly what it is designed to do. We know how an LLM operates, and what it does is follow leads to complete sentences.
You don't seem to understand symbol grounding, which is not about getting it to learn new words disconnected from the world, but about how the words relate to the world.
I think many of the things Critch has listed as definitions of consciousness are not "weak versions of some strong version", they're just different things.
You bring up a few times that LLMs don't "experience" [various things Critch lists here]. I agree, they pretty likely don't (in most cases). But, part of what I interpreted Critch's point here to be was that there are many things that people mean by "consciousness" that aren't actually about "experience" or "qualia" or whatnot.
For example, I'd bet (75%) that when Critch says they have introspection, he isn't making any claims about them "experiencing" anything at all – I think he's instead saying "in the same way that their information processing system knows facts about Rome and art and biology and computer programming, and can manipulate those facts, it can also know and manipulate facts about it's thoughts and internal states." (whereas other ML algorithms may not be able to know and manipulate their thoughts and internal states)
Purposefulness: Not only irrelevant to consciousness but...
A major point Critch was making in previous post is that when people say "consciousness", this is one of the things they sometimes mean. The point is not that LLMs are conscious the way you are using the word, but that when you see debates about whether they are conscious, it will include some people who think it means "purposefulness."
I agree that people use consciousness to mean different things, but some definitions need to be ignored as clearly incorrect. If someone wants to use a definition of 'red' that includes large amounts of 'green', we should ignore them. Words mean something, and can't be stretched to include whatever the speaker wants them to if we are to speak the same language (so leaving aside things like how 'no' means 'of' in Japanese). Things like purposefulness are their own separate thing, and have a number of terms meant to be used with them, that we can meaningfully talk about if people choose to use the right words. If 'introspection' isn't meant as the internal process, don't use the term because it is highly misleading. I do think you are probably right about what Critch thinks when using the term introspection, but he would still be wrong if he meant that (since they are reflecting on word choice not on the internal states that led to it.)
I don't feel very hopeful about the conversation atm, but fwiw I feel like you are missing a fairly important point while being pretty overconfident about not having missed it.
Putting a different way: is there a percent of people who could disagree with you about what consciousness means, which would convince you that you it's not as straightforward as assuming you have the correct definition of consciousness, and that you can ignore everyone else?
I agree LLMs are probably not conscious, but I don't think it's self-evident they're not; we have almost no reliable evidence one way or the other.
I did not use the term 'self-evident' and I do not necessarily believe it is self-evident, because theoretically we can't prove anything isn't conscious. My more limited claim is not that it is self evident that LLMs are not conscious, it's that they just clearly aren't conscious. 'Almost no reliable evidence' in favor of consciousness is coupled with the fact that we know how LLMs work (with the details we do not know are probably not important to this matter), and how they work is no more related to consciousness than an ordinary computer program is. It would require an incredible amount of evidence to make the idea that we should consider that it might be conscious a reasonable one given what we know. If panpsychism is true, then they might be conscious (as would a rock!), but panpsychism is incredibly unlikely.
Is there a definitive test? Or any prospects of such in the foreseeable future?
Such that a well informed reader can be completely confident it’s not just very fancy pattern recognition and word prediction underneath…
Are you asking for a p-zombie test? It should be theoretically possible, for any complex system and using appropriate tools, to tell what pattern recognition and word prediction is happening underneath, but I'm not sure it's possible to go beyond that.
For a test that can give certainty to claims there’s something more than that going on somewhere in the LLM. Since the OP only indicated ~90% confidence.
Otherwise it’s hard to see how it can be definitely considered ‘consciousness’ in the typical sense we ascribe to e.g. other LW readers.
Preceded by: "Consciousness as a conflationary alliance term for intrinsically valued internal experiences"
tl;dr: Chatbots are "conscious" in a variety of important ways. We humans should probably be nice to each other about the moral disagreements and confusions we're about to uncover in our concept of "consciousness".
Executive Summary:
Turing Prize laureate Geoffrey Hinton is most likely correct that LLM chatbots are "sentient" and/or "conscious" (source: Twitter video), I think for at least 8 of the 17 notions of "consciousness" that I previously elicited from people through my methodical-but-informal study of the term (as well as the peculiar definition of consciousness that Hinton himself favors). If I'm right about this, many humans will probably soon form steadfast opinions that LLM chatbots are "conscious" and/or moral patients, and in many cases, the human's opinion will be based on a valid realization that a chatbot truly is exhibiting this-or-that referent of "consciousness" that the human morally values. On a positive note, these realizations could help humanity to become more appropriately compassionate toward non-human minds, including animals. But on a potentially negative note, these realizations could also erode the (conflationary) alliance that humans have sometimes maintained upon the ambiguous assertion that only humans are "conscious" or can be known to be "conscious".
In particular, there is a possibility that humans could engage in destructive conflicts over the meaning of "consciousness" in AI systems, or over the intrinsic moral value of AI systems, or both. Such conflicts will often be unnecessary, especially in cases where we can obviate or dissolve the conflated term "consciousness" by simply acknowledging in good faith that we disagree about which internal mental process are of moral significance. To acknowledge this disagreement in good faith will mean to do so with an intention to peacefully negotiate with each other to bring about protections for diverse cognitive phenomena that are ideally inclusive of biological humans, rather than with a bad faith intention to wage war over the disagreement.
Part 1: Which referents of "consciousness" do I think chatbots currently exhibit?
The appendix will explain why I believe these points, but for now I'll just say what I believe:
At least considering the "Big 3" large language models — ChatGPT-4 (and o1), Claude 3.5, and Gemini — and considering each of the seventeen referents of "consciousness" from my previous post,
Part 2: What should we do about this?
If I'm right — and see the Appendix if you need more convincing — I think a lot of people are going to notice and start vehemently protecting LLMs for exhibiting various cognitive processes that we feel are valuable. By default, this will trigger more and more debates about the meaning of "consciousness", which serves as a heavily conflated proxy term for what processes internal to a mind should be a treated as intrinsically morally valuable.
We should avoid approaching these conflicts as scientific debates about the true nature of a singular phenomenon deserving of the name "consciousness", or as linguistic debates about the definition of the word "consciousness", because as I've explained previously, humans are not in agreement about what we mean by "consciousness".
Instead, we should dissolve the questions at hand, by noticing that the decision-relevant question is this: Which kinds of mental processes should we protect or treat as intrinsically morally significant? As I've explained previously, even amongst humans there are many competing answers to this question, even restricting to answers that the humans want to use as a definition of "consciousness".
If we acknowledge the diversity of inner experiences that people value and refer to as their "consciousness", then we can move past confused debates about what is "consciousness", and toward a healthy pluralistic agreement about protecting a diverse set of mental processes as intrinsically morally significant.
Part 3: What about "the hard problem of consciousness"?
One major reason people think there's a single "hard problem" in understanding consciousness is that people are unaware that they mean different things from each other when they use the term "consciousness". I explained this in my previous post, based on informal interviews I conducted during graduate school. As a result, people have a very hard time agreeing on the "nature" of "consciousness". That's one kind of hardness that people encounter when discussing "consciousness", which I was only able to resolve by asking dozens of other people to introspect and describe to me what they were sensing and calling their "consciousness".
From there, you can see that there actually several hard problems when it comes to understanding the various phenomena referred to by "consciousness". In a future post, tentatively called "Four Hard-ish Problems of Consciousness", I'll try to share some of them and how I think they can be resolved.
Summary & Conclusion
In Part 1, I argued that LLM chatbots probably possess many but not (yet) all of the diverse properties we humans are thinking of when we say "consciousness". I'm confident in the diversity of these properties because of the investigations in my previous post about them.
As a result, in Part 2 I argued that we need to move past debating what "consciousness" is, and toward a pluralistic treatment of many different kinds of mental processes as intrinsically valuable. We could approach such pluralism in good faith, seeking to negotiate a peaceful coexistence amongst many sorts of minds, and amongst humans with many different values about minds, rather than seeking to destroy or extinguish beings or values that we find uninteresting. In particular, I believe humanity can learn to accept itself as a morally valuable species that is worth preserving, without needing to believe we are the only such species, or that a singular mental phenomenon called "consciousness" is unique to us and the source of our value.
If we don't realize and accept this, I worry that our will to live as a species will slowly degrade as a large fraction of people will learn to recognize what they call "consciousness" being legitimately exhibited by AI systems.
In short, our self-worth should not rest upon a failure to recognize the physicality of our existence, nor upon a denial of the worth of other physical beings who value their internal processes (like animals, and maybe AI), and especially not upon the label "consciousness".
So, let's get unconfused about consciousness, without abandoning our self-worth in the process.
Appendix: My speculations on which referents of "consciousness" chatbots currently exhibit.