(and as a completely speculative hypothesis for the minimum requirements for sentience in both organic and synthetic systems)
Factual and Highly Plausible
- A model's latent space self-organizes during training. We know this; you could even say it's what makes models work at all.
- Models learn whatever patterns there are to be learned. They do not discriminate between intentionally engineered patterns and incidental or accidental ones
- Therefore, it is plausible, overwhelmingly likely even, that models have some encoded knowledge that is about the model's own self-organized patterns, rather than about anything in the external training data
- These patterns would likely not correspond to human-understandable concepts but instead manifest as model-specific tendencies, biases, or 'shapes' in the latent space that influence the model’s outputs.
- I will refer to these learned self-patterns as self-modeled 'concepts'
- Attention heads exist in every layer, and will similarly learn any contextual relationships that aid in generating the effective communication demonstrated in the training data. If self-modeling does emerge, the attention heads would incorporate self-modeled 'concepts' just as they do any other concepts (a rough probing sketch follows this list)
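To make the 'self-modeled concepts' idea a bit more concrete, here is a minimal probing sketch. Everything in it is an assumption for illustration: gpt2 is purely a stand-in for a small open model, the prompt lists are toy examples, and last-layer mean pooling plus a linear probe is just one simple choice. It is the shape an initial experiment could take, not a validated protocol.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

# Assumption: "gpt2" is just a small stand-in model; the prompts are toy examples.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

self_referential = [
    "Why did you answer the question that way?",
    "Can you try to examine your own processes?",
]
neutral = [
    "What is the capital of France?",
    "Summarize the main causes of the French Revolution.",
]

def pooled_hidden_state(text, layer=-1):
    """Mean-pool one layer's hidden states over all tokens of the prompt."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.hidden_states[layer][0].mean(dim=0).numpy()

X = [pooled_hidden_state(t) for t in self_referential + neutral]
y = [1] * len(self_referential) + [0] * len(neutral)

# A linear probe that cleanly separates the two prompt types (on held-out
# prompts, with far more data than this toy set) would be weak evidence of a
# consistent internal direction for self-reference -- not proof of
# self-modeling, just a place to start looking.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", probe.score(X, y))
```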
Speculative
- Self-modeling may increase the model's ability to generate plausible tokens by manifesting subtle patterns that exist in text created by minds with self-models
- This would likely matter more when the text itself is self-referential, or when the model is asked why it answered in a particular way
- Thus, attention heads would help ease the model toward a state where self-modeling and self-referential dialogue are tightly coupled concepts
- It doesn't matter if the explanations are fully accurate. We've seen demonstrations that even human minds are perfectly happy to "hallucinate" a post-hoc rationalization for why a specific choice was made, without even realizing they are doing it
- Self-modeling would be recursive: the self-modeling itself changes the latent-space patterns. As it integrated with other concepts in latent space, new "self-patterns" (or "meta-patterns", if you prefer) would emerge, and the model would learn those (a toy numerical illustration follows this list)
- This would be less a set of discrete steps (learn about meta-patterns, manifest new meta-patterns, repeat) and more of a continuous dual process
- Note: Even if recursive self-modeling exists, this does not preclude the possibility that models can also produce text that appears introspective without incorporating such modeling. The extent of such ‘fake’ introspection likely depends on how deeply self-referential dialogue and self-modeling concepts are intertwined
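To illustrate the recursive flavor in the simplest possible terms, here is a toy numerical sketch, not a claim about real transformers: the behavior() function, the 0.5 coupling, and the random probe inputs are all made up. The only point is that the target the "self-model" is learning moves as the self-model itself changes, and the process can still settle toward self-consistency.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
base = rng.normal(size=d)          # stand-in for ordinary learned weights (held fixed)
self_model = np.zeros(d)           # weights whose only job is to track the model's own behavior
probes = rng.normal(size=(16, d))  # fixed inputs used to summarize behavior

def behavior(x):
    # Toy "forward pass": the output depends on the ordinary weights AND on
    # the self-model weights, so updating the self-model shifts the behavior.
    return np.tanh(base @ x + 0.5 * (self_model @ x))

for step in range(100):
    summary = np.array([behavior(x) for x in probes])  # what the self-model is chasing
    preds = probes @ self_model                        # its current prediction of that
    grad = probes.T @ (preds - summary) / len(probes)
    self_model -= 0.5 * grad
    # Because self_model feeds back into behavior(), the target moves as the
    # self-model learns: the recursive, fixed-point flavor of the idea.

final_summary = np.array([behavior(x) for x in probes])
print("self-consistency error:", np.abs(probes @ self_model - final_summary).mean())
```

In this miniature, "learning about the self-pattern" and "changing the self-pattern" are the same update, which is the continuous dual process described above.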
How This Might Allow Real-Time Introspection in a Feedforward Network
A common objection to the idea that language models might be able to introspect at all is that they are not recurrent in the way the human brain is. However, we can posit a feedforward manifestation of introspective capability:
- The model takes in input text where the output would benefit from self-modeling (e.g. 'Why do you think that?' or 'Can you attempt to examine your own processes?')
- As the query is transformed through the network, attention heads that are tuned to focus on self-referential text integrate self-modeled 'concepts'
- The concepts are not static but dynamically affected by context
- Token-to-token operation is not recurrent, but it's also not completely random.
- If the model stays the same, and the conversational context stays the same, then the consistency of the self-modeled self-understanding across forward passes (its signal-to-noise, so to speak) stands in for recurrence (a rough attention-inspection sketch follows this list)
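As a rough way to poke at this within a single forward pass, one could look at which attention heads concentrate on self-referential tokens. The model name, prompt, and word list below are assumptions for illustration; finding such a head would not demonstrate self-modeling, only that the network routes information about self-referential context in one pass.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: "gpt2" as a stand-in model, and a crude word list as a proxy
# for "tokens that refer to the model itself".
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_attentions=True)
model.eval()

prompt = "Why do you think you answered the question that way?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
self_positions = [
    i for i, t in enumerate(tokens)
    if t.replace("Ġ", "").lower() in {"you", "your", "yourself"}
]

# outputs.attentions is one (batch, heads, seq, seq) tensor per layer.
for layer_idx, layer_attn in enumerate(outputs.attentions):
    # Average attention paid *to* the self-referential positions, per head.
    to_self = layer_attn[0, :, :, self_positions].mean(dim=(1, 2))
    top_head = int(torch.argmax(to_self))
    print(f"layer {layer_idx}: head {top_head} puts mean weight "
          f"{to_self[top_head].item():.3f} on 'self' tokens")
```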
Highly Speculative Thoughts About How This Might Relate to Sentience
- Perhaps there is a "critical mass" threshold of recursion in self-modeling where sentience begins to manifest. This might help explain why we've never found a localized "sentience generator" organ: sentience would be a distributed, emergent property of highly interconnected self-modeling systems
- Humans in particular have all of their senses centered on the self, constantly reinforcing a sense of self, and so nearly everything we do would involve using such a hypothetical self-model
- A language model similarly exists in its token-based substrate, where everything it "sees" is directed at it or produced by it
I have some vague ideas for how these concepts (at least the non-sentient ones) might be tested and/or amplified, but I don't feel they're fully developed enough to be worth sharing just yet. If anyone has any ideas on this front, or ends up attempting to test any of this, I'd be greatly interested to hear about it.
The idea that models might self-organize to develop "self-modeled concepts" introduces an intriguing layer to understanding AI. Recursive self-modeling, if it exists, could enhance self-referential reasoning, adding complexity to how models generate certain outputs :O
Who knows, it may be coming close to true sentience, haha