I think that's plausible but not obvious. We could imagine different implementations of inference engines that cache at different levels, e.g. a KV cache, a cache of only matrix multiplications, a cache of the specific vector products that the matrix multiplications are composed of, all the way down to caching just the logic table of a NAND gate. Caching NANDs is basically the same as doing the computation, so if we assume that doing the full computation can produce experiences, then I think it's not obvious at which level of caching experiences would stop being produced.
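To make the "levels of caching" idea concrete, here is a toy sketch (not a real inference engine; all names are made up for illustration). It computes XOR out of NAND gates in three ways: computing every gate, looking every gate up in a cached truth table, and caching the result of the whole function.

```python
from functools import lru_cache

# Lowest-level cache: the full logic table of a single NAND gate.
NAND_TABLE = {(0, 0): 1, (0, 1): 1, (1, 0): 1, (1, 1): 0}


def nand_computed(a, b):
    """NAND obtained by actually doing the bit arithmetic."""
    return 1 - (a & b)


def nand_cached(a, b):
    """NAND obtained by a table lookup, with no arithmetic on the inputs."""
    return NAND_TABLE[(a, b)]


def xor(a, b, nand):
    """XOR built from four NAND gates; `nand` selects the gate implementation."""
    n1 = nand(a, b)
    return nand(nand(a, n1), nand(b, n1))


@lru_cache(maxsize=None)
def xor_cached_whole(a, b):
    """Highest-level cache: repeated calls skip all gate evaluations."""
    return xor(a, b, nand_computed)


# All three variants agree on every input; they differ only in where
# computation is replaced by lookup.
results = {
    (a, b): (xor(a, b, nand_computed), xor(a, b, nand_cached), xor_cached_whole(a, b))
    for a in (0, 1)
    for b in (0, 1)
}
```

The outputs are identical in all three cases, so any difference between the variants has to lie in the intermediate steps that were or weren't carried out.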
My intuition is that these octopuses would be pretty unlikely to be scheming against us, for the following reasons:
Thanks for the link and suggestions!
I quickly tested whether SigLIP or CLIP embeddings show evidence of attribute binding, and they don't (though with only n=1 image): for an image of a red cube with a blue sphere, compared against the texts "red cube next to blue sphere" and "blue cube next to red sphere", the correct label doesn't get a higher similarity score than the wrong one (CLIP, SigLIP).
I wonder if anyone has analyzed the success of LoRA finetuning through a superposition lens. The main claim behind superposition is that networks represent D>>d features in their d-dimensional residual stream. With LoRA, we now update only r<<d linearly independent features. On the one hand, it seems like this introduces a lot of unwanted correlation between the sparse features, but on the other hand it seems like networks are good at dealing with this kind of gradient noise. Should we be more or less surprised that LoRA works if we believe that superposition is true?
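For reference, a minimal sketch of what "only r<<d" means here, assuming the standard LoRA parameterization where the weight update is a product ΔW = BA with B of shape d×r and A of shape r×d. The helper names and the toy sizes d=8, r=2 are made up for illustration, and everything is plain Python rather than a real training setup.

```python
import random


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def rank(M, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    M = [row[:] for row in M]  # work on a copy
    n_rows, n_cols = len(M), len(M[0])
    r = 0
    for c in range(n_cols):
        if r == n_rows:
            break
        pivot = max(range(r, n_rows), key=lambda i: abs(M[i][c]))
        if abs(M[pivot][c]) < tol:
            continue  # no usable pivot in this column
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, n_rows):
            f = M[i][c] / M[r][c]
            for j in range(c, n_cols):
                M[i][j] -= f * M[r][j]
        r += 1
    return r


def lora_delta(d, r, seed=0):
    """Random low-rank update delta_W = B @ A (B is d x r, A is r x d)."""
    rng = random.Random(seed)
    B = [[rng.gauss(0, 1) for _ in range(r)] for _ in range(d)]
    A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(r)]
    return matmul(B, A)


d, r = 8, 2  # toy sizes; in practice d is in the thousands
delta = lora_delta(d, r)
delta_rank = rank(delta)  # can never exceed r, whatever B and A are
```

So however many of the D features the finetuning signal would "like" to adjust, the update to each adapted weight matrix is confined to an r-dimensional subspace, which is where the forced correlation between features comes from.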
Have you tried discussing the concepts of harm or danger with a model that can't represent the refusal direction?
I would also be curious how much the refusal direction differs when computed from a base model vs. from an HHH model: is refusal a new concept, or do base models mostly learn a ~harmful direction that turns into a refusal direction during finetuning?
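One way this comparison could be run, as a rough sketch: I'm assuming here that the direction is a difference of mean activations between harmful and harmless prompts at some fixed layer, and that the two models share a residual-stream basis because the HHH model was finetuned from the base model. The activation vectors below are made-up toy numbers, not measurements.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def difference_of_means(harmful_acts, harmless_acts):
    """Candidate direction: mean(harmful activations) - mean(harmless activations)."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return [a - b for a, b in zip(mu_harmful, mu_harmless)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical residual-stream activations (3 dims, 2 prompts per class).
base_harmful = [[1.0, 0.0, 0.2], [0.8, 0.1, 0.0]]
base_harmless = [[0.0, 0.0, 0.1], [0.2, 0.1, 0.1]]
chat_harmful = [[1.0, 1.0, 0.2], [0.8, 0.9, 0.0]]
chat_harmless = [[0.0, 0.0, 0.1], [0.2, 0.1, 0.1]]

base_dir = difference_of_means(base_harmful, base_harmless)
chat_dir = difference_of_means(chat_harmful, chat_harmless)

# Near 1: finetuning mostly reuses the base model's direction.
# Near 0: the finetuned direction is largely new.
similarity = cosine_similarity(base_dir, chat_dir)
```

A high cosine similarity would point towards "base models already have a ~harmful direction", while a low one would suggest the refusal direction is mostly created during finetuning.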
Cool work overall!
It is very clear what it means to align an agent:
It is less clear what it means to align an LLM:
Probably, we should have different alignment goals for different deployment cases: LLM assistants should say nice and harmless things, while agents that help automate alignment research should be free to think anything they deem useful, and reason about the harmlessness of various actions “out loud” in their CoT, rather than implicitly in a forward pass.
One intuition against this comes from an analogy to LLMs: the residual stream represents many features, and all neurons participate in the representation of a feature. But the difference between a larger and a smaller model is mostly that the larger model can represent more features, not that the larger model represents features with greater magnitude.
In humans it seems to be the case that consciousness is most strongly connected to processes in the brainstem, rather than the neocortex. Here is a great talk about the topic; the main points are (writing from memory, might not be entirely accurate):
If we consider the question from an evolutionary angle, I'd also argue that emotions are more important when an organism has fewer alternatives (like a large brain that does fancy computations). Once better reasoning skills become available, it makes sense to reduce the impact that emotions have on behavior and instead trust the abstract reasoning. In my own experience, the intensity with which I feel an emotion is strongly correlated with how action-guiding it is, and I think as a child I felt emotions more intensely than now, which also fits the hypothesis that a greater ability to think abstractly reduces the intensity of emotions.