I agree that this is some evidence, but perhaps not very strong evidence. We don't know for sure that the SAE latent we have chosen to label 'yellow' represents only an objective representation of yellow instead of both an objective and subjective representation of yellow.
What is consciousness?
What are its related (component? overlapping?) concepts like subjective point of view, self-awareness, and qualia?
What do these look like in a model's weights?
Might these things be spread through many different concepts?
I do think that the conclusion that current LLMs are not conscious is correct. However, I worry that this might not hold for long as architectures evolve. I expect that architectures which enable consciousness will be shown to have useful properties, and there will thus be pressure to develop and use them. I know some researchers are explicitly working on this already.
I support creating evals for consciousness so that we can determine empirically whether future models are conscious or not. Unfortunately, to objectively establish this we may need to learn more about the human brain and human consciousness, and/or deliberately create conscious models in order to study them. Such work, if mishandled, invites moral catastrophe.
It can't represent a subjective sense of yellow, because if so, consciousness would be a linear function. That's somewhat ridiculous because I would experience a story about a "dog" differently based on the context.
Furthermore, LLMs scale "features" by how strongly they appear (e.g. the positive sentiment vector is scaled up if the text is very positive). So the LLM's conscious processing of a positive sentiment would be linearly proportional to how positive the text is. Which also seems ridiculous.
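To make the linearity claim concrete, here is a minimal sketch, assuming a single hypothetical "positive sentiment" direction in an 8-dimensional residual stream. The direction, the dimensions, and the activation values are all made up for illustration and do not come from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical unit-norm "positive sentiment" direction (an assumption, not a real feature).
direction = rng.normal(size=8)
direction /= np.linalg.norm(direction)

def feature_contribution(activation, direction):
    """What a single feature writes into the residual stream: activation times direction."""
    return activation * direction

mild = feature_contribution(1.0, direction)    # mildly positive text
strong = feature_contribution(3.0, direction)  # very positive text

# The written vector changes only in magnitude, and projecting back onto
# the direction recovers the scalar activation.
recovered_mild = float(mild @ direction)
recovered_strong = float(strong @ direction)
print(recovered_mild, recovered_strong)
```

Under this picture, "how positive the text is" lives entirely in one scalar, which is the property the argument above relies on.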
I don't expect consciousness to have any useful properties. Let's say you have a deterministic function y = f(x). You can encode just y = f(x), or y = f(x) where f includes conscious representations in the intermediate layers. The latter does not help you achieve increased training accuracy in the slightest. Neural networks also have a strong simplicity bias towards low frequency functions (this has been mathematically proven), and f(x) without consciousness is much simpler/lower frequency to encode than f(x) with consciousness.
I've been thinking about this comment every day since you made it 11 days ago. I love it. Maybe it's silly of me, but I just hadn't thought about the question in such a grounded empirical manner before.
I agree with you that it seems unlikely that current transformer-based LLMs are conscious. I also agree that we would need to be able to find extra context-dependent computation present in the stream of calculations in order to say that there was some consciousness-related computation present.
I also agree that it is hard to imagine how consciousness would provide a clear benefit on the task of next-token-prediction on web text.
I disagree though on the extrapolation from the above points. Let me explain.
Assume, for this hypothetical, that we are analyzing a future model which has some things in common with transformer-based LLMs but also some extra components. We can get into the details of plausibly useful extra components if you like, but for now let's just say that this is a diffusion-guided transformer as an example. Now let's also assume that this future model wasn't trained on web text, but was instead trained in some moderately realistic simulation of surviving in the wild as an early hominid tribe member. It needed to track simulated hunger, hunting and gathering skills, and social relationships. It had a constant simulated state of health/homeostasis throughout training, with an RL signal proportional to the intensity of simulated need. So there was a constant combination of training pressure for next-token prediction and for satisficing the simulated homeostatic state.
Now, in this hypothetical, it seems more fair to compare this model to an animal. Suppose that the intuitive understanding of a common feature of behavior across animal species (particularly mammals, marsupials, and birds) is correct: all these animals seem to be running some sort of computational process which could fairly be described as a form of 'consciousness'. Why would this be a common computational process evolved and maintained across many species if it weren't useful in some way? Neural computation is expensive, especially so for flighted birds. Yet some flighted birds, like corvids, seem both conscious and remarkably intelligent. Relatedly, they can reasonably be described as curious, playful, puzzle-solving, and possessing detailed long-lasting memories. Since consciousness seems useful for all these different species, in a convergent-evolution pattern even across very different brain architectures (mammals vs birds), I believe we should expect it to be useful in our hominid-simulator-trained model. If so, we should be able to measure this difference relative to a next-token-predictor trained on an equivalent number of tokens of a dataset of, for instance, math problems.
Do you agree? Am I missing something?
Since consciousness seems useful for all these different species, in a convergent-evolution pattern even across very different brain architectures (mammals vs birds), I believe we should expect it to be useful in our hominid-simulator-trained model. If so, we should be able to measure this difference relative to a next-token-predictor trained on an equivalent number of tokens of a dataset of, for instance, math problems.
What do you mean by difference here? Increase in performance due to consciousness? Or differences in functions?
I'm not sure we could measure this difference. It seems very likely to me that consciousness evolved before, say, language and complex agency. But complex language and complex agency might not require consciousness, and may capture all of the benefits that would be captured by consciousness, so consciousness wouldn't result in greater performance.
However, it could be that
Some other possibilities:
I would remove that last paragraph. It doesn't add to your point and gives the impression that you might have a specific agenda.
I removed it. I don't have an agenda; I just included it because it changed my priors on the mechanism for human consciousness. So that subsequently affected my prior for whether or not AI could be conscious.
To consciously take in information, you don't have to store any bits; you only have to map the correct input to the correct output. (By logical necessity, any transformation that preserves the input/output relationship preserves consciousness.)
I think the sparse autoencoder line of interpretability work is somewhat convincing evidence that LLMs are not conscious.
In order for me to consciously take in some information (e.g. the house is yellow), I need to store not only the contents of the statement but also some aspect of my conscious experience. I need to store more than the minimal number of bits it would take to represent "the house is yellow".
The sparse autoencoder line of work appears to suggest that LLMs essentially store "bits" that represent "themes" in the text they're processing, but close to nothing (at least in L2 norm) beyond that. Furthermore, this is happening in each layer. Thus, there doesn't appear to be any residual "space" left over for storing aspects of consciousness.
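For readers unfamiliar with what "close to nothing in L2 norm" would mean operationally, here is a rough sketch on synthetic data. Everything in it is an assumption made for illustration: the 16 orthonormal "theme" directions, the noise level, and the idealized autoencoder whose weights are set by hand rather than trained. It says nothing about how large the residual is in any real model; it only shows how the quantity is computed.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_tokens = 16, 2000

# Hypothetical dictionary: 16 orthonormal "theme" directions (rows of D).
D, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))

# Synthetic activations: a sparse, non-negative mix of themes plus small noise.
active = rng.random((n_tokens, d_model)) < 0.2
f_true = active * rng.uniform(0.5, 1.5, size=(n_tokens, d_model))
x = f_true @ D + 0.01 * rng.normal(size=(n_tokens, d_model))

def sae_reconstruct(x, W_enc, W_dec):
    """Idealized sparse autoencoder: ReLU encoder followed by a linear decoder."""
    latents = np.maximum(0.0, x @ W_enc)
    return latents @ W_dec

def fraction_unexplained(x, x_hat):
    """Squared L2 norm of the residual, relative to the centered activations."""
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return float(resid / total)

# Hand-set weights (not trained): encode with D^T, decode with D.
x_hat = sae_reconstruct(x, D.T, D)
fvu = fraction_unexplained(x, x_hat)
print(f"fraction of squared L2 norm left unexplained: {fvu:.5f}")
```

The argument above amounts to the claim that this fraction is small for real LLM activations at every layer, so that whatever is not captured by the "theme" latents has very little room to carry anything else.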