I think the sparse autoencoder line of interpretability work is somewhat convincing evidence that LLMs are not conscious.

In order for me to consciously take in some information (e.g. the house is yellow), I need to store not only the contents of the statement but also some aspect of my conscious experience. I need to store more than the minimal number of bits it would take to represent "the house is yellow".

The sparse autoencoder line of work appears to suggest that LLMs essentially store "bits" that represent "themes" in the text they're processing, but close to nothing (at least in L2 norm) beyond that. Furthermore, this is happening in each layer. Thus, there doesn't appear to be any residual "space" left over for storing aspects of consciousness.
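To make "close to nothing in L2 norm" concrete, here is a minimal sketch of how one might measure the share of an activation's squared norm that a sparse reconstruction misses. Everything in it is an assumption for illustration: the orthonormal dictionary, the hard threshold, the noise level, and the sizes are made up, and nothing is trained. Real SAEs are learned, usually overcomplete, and their reported reconstruction quality varies by model and layer.

```python
import numpy as np

def fraction_unexplained(x, x_hat):
    """Share of the squared L2 norm of x that the reconstruction x_hat misses."""
    return np.sum((x - x_hat) ** 2) / np.sum(x ** 2)

rng = np.random.default_rng(0)
d_model, n_samples, n_active = 64, 1000, 4  # hypothetical sizes

# Hypothetical "theme" directions: columns of an orthonormal matrix.
dictionary, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))

# Each sample activates a few features with magnitudes in [1, 2].
codes = np.zeros((n_samples, d_model))
for row in codes:
    active = rng.choice(d_model, size=n_active, replace=False)
    row[active] = rng.uniform(1.0, 2.0, size=n_active)

# Activations = sparse combination of directions + small dense noise.
noise = 0.01 * rng.normal(size=(n_samples, d_model))
activations = codes @ dictionary.T + noise

# Toy "SAE" with tied weights and a hard threshold (assumed, not trained).
pre_acts = activations @ dictionary
latents = np.where(pre_acts > 0.5, pre_acts, 0.0)
reconstruction = latents @ dictionary.T

residual_share = fraction_unexplained(activations, reconstruction)
mean_active = np.mean(np.count_nonzero(latents, axis=1))
print(f"unexplained share of squared norm: {residual_share:.5f}")
print(f"mean active latents per sample: {mean_active:.1f}")
```

In this toy setup the only thing left in the residual is the dense noise, which is the kind of "leftover space" the argument above says is small. Whether that holds for a real model is an empirical question this sketch does not answer.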


I agree that this is some evidence, but perhaps not very strong evidence. We don't know for sure that the SAE latent we have chosen to label 'yellow' represents only an objective representation of yellow instead of both an objective and subjective representation of yellow. 

What is consciousness?  

What are its related (component? overlapping?) concepts like subjective point of view, self-awareness, and qualia? 

What do these look like in a model's weights?

Might these things be spread through many different concepts?

 I do think that the conclusion that current LLMs are not conscious is correct. However, I worry that this might not hold for long as architectures evolve. I expect that architectures which enable consciousness will be shown to have useful properties, and there will thus be pressure to develop and use them. I know some researchers are explicitly working on this already.

I support creating evals for consciousness so that we can determine empirically whether future models are conscious or not. Unfortunately, to objectively establish this we may need to learn more about the human brain and human consciousness, and/or deliberately create conscious models in order to study them. Such work, if mishandled, invites moral catastrophe. 

It can't represent a subjective sense of yellow, because if it did, consciousness would be a linear function of the features. That's somewhat ridiculous because I would experience a story about a "dog" differently based on the context.

Furthermore, LLMs scale "features" by how strongly they appear (e.g. the positive sentiment vector is scaled up if the text is very positive). So the LLM's conscious processing of positive sentiment would be linearly proportional to how positive the text is, which also seems ridiculous.

I don't expect consciousness to have any useful properties. Let's say you have a deterministic function y = f(x). You can encode just y = f(x), or y = f(x) where f includes conscious representations in the intermediate layers. The latter does not help you achieve increased training accuracy in the slightest. Neural networks also have a strong simplicity bias towards low frequency functions (this has been mathematically proven), and f(x) without consciousness is much simpler/lower frequency to encode than f(x) with consciousness. 
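A minimal sketch of the point that intermediate representations can differ while y = f(x) stays the same, so the training loss cannot tell them apart. The two-layer ReLU network, its sizes, and the permutation-plus-rescaling trick are assumptions chosen for illustration; the sketch says nothing about which internal representation, if any, would count as conscious.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out, n_points = 4, 6, 2, 10  # hypothetical sizes

W1 = rng.normal(size=(d_hidden, d_in))
W2 = rng.normal(size=(d_out, d_hidden))

def forward(W1, W2, X):
    """Two-layer ReLU network; returns (hidden activations, outputs)."""
    hidden = np.maximum(W1 @ X, 0.0)
    return hidden, W2 @ hidden

# Re-encode the hidden layer: permute the units and rescale each one by a
# positive factor, then undo both in the output weights.
perm = rng.permutation(d_hidden)
scale = rng.uniform(0.5, 2.0, size=d_hidden)
W1_alt = scale[:, None] * W1[perm]
W2_alt = W2[:, perm] / scale[None, :]

X = rng.normal(size=(d_in, n_points))
hidden, y = forward(W1, W2, X)
hidden_alt, y_alt = forward(W1_alt, W2_alt, X)

print("same outputs:", np.allclose(y, y_alt))
print("same hidden representation:", np.allclose(hidden, hidden_alt))
```

Because ReLU commutes with positive rescaling and with permutation of units, the two networks compute the same function on every input, yet their hidden layers hold different numbers.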

I've been thinking about this comment every day since you made it 11 days ago. I love it. Maybe it's silly of me, but I just hadn't thought about the question in such a grounded empirical manner before.

I agree with you that it seems unlikely that current transformer-based LLMs are conscious. I also agree that we would need to be able to find extra context-dependent computation present in the stream of calculations in order to say that there was some consciousness-related computation present.

I also agree that it is hard to imagine how consciousness would provide a clear benefit on the task of next-token-prediction on web text.

I disagree though on the extrapolation from the above points. Let me explain.

Assume, for this hypothetical, that we are analyzing a future model which has some things in common with transformer-based LLMs but also some extra components. We can get into the details of plausibly useful extra components if you like, but for now let's just say that this is a diffusion-guided transformer as an example. Now let's also assume that this future model wasn't trained on web text, but was instead trained in some moderately realistic simulation of surviving in the wild as an early hominid tribe member. The model needed to track simulated hunger, hunting and gathering skills, and social relationships. It had a constant simulated state of health/homeostasis throughout training, with an RL signal proportional to the intensity of simulated need. So there was a constant combination of training pressure for next-token prediction and for satisficing the simulated homeostatic state.

Now, in this hypothetical, it seems more fair to compare this model to an animal. Suppose that the intuitive understanding of a common feature of behaviors across animal species (particularly mammals, marsupials, and birds) is correct: it seems like all these animals are running some sort of computational process which could fairly be described as a form of 'consciousness'. Why would this be a common computational process evolved and maintained across many species if it weren't useful in some way? Neural computation is expensive, especially so for flighted birds. Yet some flighted birds, like corvids, seem both conscious and remarkably intelligent. Relatedly, they can reasonably be described as curious, playful, puzzle-solving, and as having detailed long-lasting memories. Since consciousness seems useful for all these different species, in a convergent-evolution pattern even across very different brain architectures (mammals vs birds), I believe we should expect it to be useful in our hominid-simulator-trained model. If so, we should be able to measure this difference relative to a next-token predictor trained on an equivalent number of tokens of a dataset of, for instance, math problems.

Do you agree? Am I missing something?

Sorry for the late response. I don't really use this forum regularly. But to get back to it - the main reason neural networks generalize is that they find the simplest function that gets a given accuracy on the training data.

This holds true for all neural networks, regardless of how they are trained, what type of data they are trained on, or what the objective function is. It's the whole point of why neural networks work. Functions that have more high-frequency components are exponentially less likely. This holds for the randomly initialized prior (see arxiv.org/pdf/1907.10599) and throughout training, as the averaging part of SGD allows lower-frequency components to be learned faster than higher-frequency ones (see arXiv:1806.08734, "On the Spectral Bias of Neural Networks").
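A toy illustration of the claim about the randomly initialized prior, not the method of either cited paper: sample small random ReLU networks, read each one off as a Boolean function on 3-bit inputs, and count how often the result is one of the two constant functions. The architecture, Gaussian initialization, width, and sample count are all assumptions. If every one of the 256 possible Boolean functions were equally likely, constants would show up with probability 2/256.

```python
from itertools import product

import numpy as np

N_INPUTS = 3  # hypothetical toy input size: 2**3 = 8 input points

def sample_boolean_function(rng, width=16):
    """Truth table of a randomly initialized one-hidden-layer ReLU network."""
    X = np.array(list(product([0, 1], repeat=N_INPUTS)), dtype=float)
    W = rng.normal(size=(N_INPUTS, width))
    b = rng.normal(size=width)
    v = rng.normal(size=width) / np.sqrt(width)
    c = rng.normal()
    hidden = np.maximum(X @ W + b, 0.0)
    logits = hidden @ v + c
    return tuple((logits > 0).astype(int).tolist())

def constant_function_rate(n_samples=2000, seed=0):
    """Fraction of sampled networks whose truth table is all 0s or all 1s."""
    rng = np.random.default_rng(seed)
    n_constant = 0
    for _ in range(n_samples):
        truth_table = sample_boolean_function(rng)
        if len(set(truth_table)) == 1:
            n_constant += 1
    return n_constant / n_samples

uniform_baseline = 2 / 2 ** (2 ** N_INPUTS)  # 2 constants out of 256 functions
rate = constant_function_rate()
print(f"constant functions under random init: {rate:.3f}")
print(f"constant functions under a uniform prior: {uniform_baseline:.4f}")
```

The comparison only speaks to the prior at initialization in this tiny setting. It does not show what happens during training, and it does not by itself establish the generalization claims in the comment above.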

You can have any objective function you want; it doesn't change this basic fact. If this basic fact didn't hold, the neural network wouldn't generalize and would be useless. There are many papers that formalize this and provide generalization bounds based off of the complexity of the function learned by the neural network.

A "conscious" neural network doesn't increase the accuracy over a neural network encoding the same function sans consciousness, but it does increase the complexity of the function. Therefore, it's exponentially less likely.

I think biological systems are really different from silicon ones. The biggest difference is that biological systems are able to generate their own randomness. Silicon ones are not - they're deterministic. If a NN is probabilistic, it's because we are feeding it random samples as an input. I think consciousness is a precursor for free will, which can be valuable for inherently non-deterministic biological systems.

In my original post, I had linked a recent paper that finds suggestive evidence that the brain is non-classical (e.g. undergoes quantum computation) but deleted it after someone told me to.

More generally, I feel that for folks concerned about AI safety, the first step is to develop a solid theoretical understanding of why neural networks generalize, the types of functions they are biased towards, how this bias is affected by the number of layers, etc.

I feel that most individuals on Less Wrong lack this knowledge because they exclusively consume content from individuals within the rationality/AI safety sphere. I think this leads to a lot of outlandish conjectures (e.g. conscious AI, paperclip maximizers, etc.) that don't make sense.

Since consciousness seems useful for all these different species, in a convergent-evolution pattern even across very different brain architectures (mammals vs birds), I believe we should expect it to be useful in our hominid-simulator-trained model. If so, we should be able to measure this difference relative to a next-token predictor trained on an equivalent number of tokens of a dataset of, for instance, math problems.

What do you mean by difference here? Increase in performance due to consciousness? Or differences in functions?

I'm not sure we could measure this difference. It seems very likely to me that consciousness evolved before, say, language and complex agency. But complex language and complex agency might not require consciousness, and may capture all of the benefits that would be captured by consciousness, so consciousness wouldn't result in greater performance.

However, it could be that

  1. humans do not consistently have complex language and complex agency, and humans with agency are fallible as agents, so consciousness in most humans is still useful to us as a species (or to our genes),
  2. building complex language and complex agency on top of consciousness is the locally cheapest way to build them, so consciousness would still be useful to us, or
  3. we reached a local maximum in terms of genetic fitness, or evolutionary pressures are too weak on us now, and it's not really possible to evolve away consciousness while preserving complex language and complex agency. So consciousness isn't useful to us, but can't be practically gotten rid of without loss in fitness.

 

Some other possibilities:

  1. The adaptive value of consciousness is really just to give us certain motivations, e.g. finding our internal processing mysterious, nonphysical or interesting makes it seem special to us, and this makes us
    1. value sensations for their own sake, so seek sensations and engage in sensory play, which may help us learn more about ourselves or the world (according to Nicholas Humphrey, as discussed here, here and here),
    2. value our lives more and work harder to prevent early death, and/or
    3. develop spiritual or moral beliefs and adaptive associated practices,
  2. Consciousness is just the illusion of the phenomenality of what's introspectively accessible to us. Furthermore, we might incorrectly believe in its phenomenality just because of the fact that much of the processing we have introspective access to is wired in and its causes are not introspectively accessible, but instead cognitively impenetrable. The full illusion could be a special case of humans incorrectly using supernatural explanations for unexplained but interesting and subjectively important or profound phenomena.

I would remove that last paragraph. It doesn't add to your point and gives the impression that you might have a specific agenda.

I removed it. I don't have an agenda; I just included it because it changed my priors on the mechanism for human consciousness. So that subsequently affected my prior for whether or not AI could be conscious. 

To consciously take in information, you don't have to store any bits - you only have to map the correct input to the correct output. (By logical necessity, any transformation that preserves the input/output relationship preserves consciousness.)