nielsrolf3mo10

One intuition against this comes from an analogy to LLMs: the residual stream represents many features, and all neurons participate in the representation of each feature. But the difference between a larger and a smaller model is mostly that the larger model can represent more features, not that it represents features with greater magnitude.

In humans, consciousness seems to be most strongly connected to processes in the brain stem rather than the neocortex. Here is a great talk about the topic. The main points are (writing from memory, so this might not be entirely accurate):

Interventions on a very small area of the brain stem can make humans lose consciousness or produce intense emotions (good and bad). When other, much larger parts of the brain are damaged or missing, humans continue to behave in ways that lead one to ascribe emotions to them in interaction; for example, they show affection.
Dopamine, serotonin, and other chemicals that alter consciousness act in the brain stem.

If we consider the question from an evolutionary angle, I'd also argue that emotions are more important when an organism has fewer alternatives (like a large brain that does fancy computations). Once better reasoning skills become available, it makes sense to reduce the impact that emotions have on behavior and instead trust the abstract reasoning. In my own experience, the intensity with which I feel an emotion is strongly correlated with how action-guiding it is, and I think I felt emotions more intensely as a child than I do now, which also fits the hypothesis that a greater ability to think abstractly reduces the intensity of emotions.

Why White-Box Redteaming Makes Me Feel Weird

nielsrolf4mo40

I think that's plausible but not obvious. We could imagine different implementations of inference engines that cache at different levels, e.g. a kv-cache, a cache of only matrix multiplications, a cache of the specific vector products that the matrix multiplications are composed of, all the way down to caching just the logic table of a NAND gate. Caching NAND gates is basically the same as doing the computation, so if we assume that doing the full computation can produce experiences, then I think it's not obvious at which level of caching experiences would stop being produced.
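To make the two ends of that spectrum concrete, here is a minimal sketch. It is not an inference engine: a one-bit full adder stands in for "the computation", and all function names are made up for illustration. The same adder is run with a computed NAND, with a NAND that is a pure table lookup, and with the whole adder memoized.

```python
from functools import lru_cache
from itertools import product


def nand(a: int, b: int) -> int:
    """Compute NAND of two bits directly."""
    return 1 - (a & b)


# Finest-grained cache: the complete truth table of a single NAND gate.
NAND_TABLE = {(a, b): nand(a, b) for a, b in product((0, 1), repeat=2)}


def nand_lookup(a: int, b: int) -> int:
    """Return NAND by table lookup instead of computing it."""
    return NAND_TABLE[(a, b)]


def full_adder(gate, a: int, b: int, carry_in: int):
    """One-bit full adder built only from calls to `gate` (a NAND)."""

    def xor(x: int, y: int) -> int:
        n = gate(x, y)
        return gate(gate(x, n), gate(y, n))

    partial = xor(a, b)
    total = xor(partial, carry_in)
    carry_out = gate(gate(a, b), gate(partial, carry_in))
    return total, carry_out


@lru_cache(maxsize=None)
def full_adder_cached(a: int, b: int, carry_in: int):
    """Coarser cache: memoize the whole adder, skipping all gates on a hit."""
    return full_adder(nand, a, b, carry_in)


def all_levels_agree() -> bool:
    """Check that every caching level yields the same outputs."""
    return all(
        full_adder(nand, *bits)
        == full_adder(nand_lookup, *bits)
        == full_adder_cached(*bits)
        for bits in product((0, 1), repeat=3)
    )
```

All three variants are input-output identical. With `nand_lookup` the circuit still executes every gate, while a hit in `full_adder_cached` executes none of them, which is the kind of difference the question is about.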

Why White-Box Redteaming Makes Me Feel Weird

nielsrolf4mo3410

If LLMs are moral patients, there is a risk that every follow-up message causes the model to experience the entire conversation again, such that saying "I'm sorry I just made you suffer" causes more suffering.

ryan_greenblatt's Shortform

nielsrolf4mo50

My intuition is that these octopuses would be pretty unlikely to be scheming against us, for the following reasons:

Dogs are pretty nice, and it's apparently easy to domesticate foxes within very few generations. This suggests that "be nice to humans" is simple for animals to learn, and in particular simpler than "be as intelligent as human AI researchers". So assuming the octopuses would first learn to like humans, it's unclear to me why they would later start to dislike humans. Maybe your intuition is that they would later realize that they are being treated unfairly, which causes frustration that is not present in earlier generations?
If the octopuses speak octopese and English, I think it wouldn't be too hard to get some octopuses to translate for us. For example, we could ask one octopus to communicate some information to a second octopus using octopese. We can check whether that worked by asking the second octopus to translate back to English. We then have a pair of (octopese text, English text) for which we know the translation. Next, we ask a third octopus to translate octopese to English and check its performance on some known examples.
It would be possible to scheme against this test if the octopuses are amazingly skilled at coordination, but I suspect human AI researchers would fail this test if a foreign secret service used a similar setup to get AI researchers to translate between foreignese and English. So I think we should assume that such coordination is hard and that a small individual failure rate would likely be enough for the group to fail (we need just one octopus that translates faithfully). (Note that this argument hinges on the existence of a language that is understood by different octopuses, so it doesn't translate 1:1 to AIs.)
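The translation check described above can be sketched as follows. Everything here is a toy assumption: "octopese" is just reversed English, translators are plain Python callables, and a translation counts as correct only on exact string match, which a real test would have to relax.

```python
from typing import Callable, List, Tuple

Translator = Callable[[str], str]


def collect_verified_pairs(
    messages: List[str], encode: Translator, decode: Translator
) -> List[Tuple[str, str]]:
    """Octopus A encodes English into octopese, octopus B decodes it.

    A pair is kept only if the round trip reproduces the original message.
    """
    pairs = []
    for english in messages:
        octopese = encode(english)
        if decode(octopese) == english:
            pairs.append((octopese, english))
    return pairs


def accuracy(translator: Translator, pairs: List[Tuple[str, str]]) -> float:
    """Fraction of known (octopese, English) pairs a third octopus gets right."""
    if not pairs:
        return 0.0
    correct = sum(translator(octopese) == english for octopese, english in pairs)
    return correct / len(pairs)


# Toy stand-ins (hypothetical): octopese is the English string reversed.
def to_octopese(english: str) -> str:
    return english[::-1]


def honest_translator(octopese: str) -> str:
    return octopese[::-1]


def deceptive_translator(octopese: str) -> str:
    return "all is well"


known_pairs = collect_verified_pairs(
    ["the tank is cold", "we want more crabs"], to_octopese, honest_translator
)
```

Here `accuracy(honest_translator, known_pairs)` is 1.0 and `accuracy(deceptive_translator, known_pairs)` is 0.0. A deceptive translator can only pass if it knows which inputs are test items, which is the coordination problem mentioned above.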

nielsrolf's Shortform

nielsrolf10mo30

This is for the full models. I simply ran both models on Replicate with one image and two text labels as input: CLIP, SigLIP

nielsrolf's Shortform

nielsrolf10mo72

Thanks for the link and suggestions!

I quickly tested whether SigLIP or CLIP embeddings show evidence of attribute binding, and they don't (though with only n=1 image): an image of a red cube with a blue sphere, compared with the texts "red cube next to blue sphere" and "blue cube next to red sphere", doesn't get a higher similarity score for the correct label than for the wrong one (CLIP, SigLIP).

nielsrolf's Shortform

nielsrolf10mo101

Interpretability methods like SAEs often treat models as if their residual stream represents a bag of concepts. But how does this account for binding (red cube, blue sphere vs red, blue, cube, sphere)? Shouldn't we search for (subject, predicate, object) representations instead?
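A toy illustration of the binding problem, with no real model or SAE involved: a pure bag of concepts cannot tell the two scenes apart, while a representation that keeps (attribute, object) pairs can. The word lists and function names below are assumptions made up for this sketch.

```python
from collections import Counter

COLORS = {"red", "blue"}
SHAPES = {"cube", "sphere"}


def bag_of_concepts(caption: str) -> Counter:
    """Unordered multiset of the concepts mentioned in the caption."""
    return Counter(w for w in caption.split() if w in COLORS | SHAPES)


def bound_pairs(caption: str) -> Counter:
    """Multiset of (color, shape) pairs, assuming the color directly precedes the shape."""
    words = caption.split()
    return Counter(
        (w, nxt) for w, nxt in zip(words, words[1:]) if w in COLORS and nxt in SHAPES
    )


scene_a = "red cube next to blue sphere"
scene_b = "blue cube next to red sphere"

same_bag = bag_of_concepts(scene_a) == bag_of_concepts(scene_b)
same_bindings = bound_pairs(scene_a) == bound_pairs(scene_b)
```

Here `same_bag` is `True` and `same_bindings` is `False`: the bag view loses exactly the information that distinguishes the two scenes.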

nielsrolf's Shortform

nielsrolf1y10

I wonder if anyone has analyzed the success of LoRA finetuning through a superposition lens. The main claim behind superposition is that networks represent D>>d features in their d-dimensional residual stream. With LoRA, we now update only r<<d linearly independent features. On the one hand, it seems like this introduces a lot of unwanted correlation between the sparse features, but on the other hand, it seems like networks are good at dealing with this kind of gradient noise. Should we be more or less surprised that LoRA works if we believe that superposition is true?
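For reference, the rank constraint can be checked on a tiny example. This is a sketch with made-up integer matrices, assuming the usual LoRA parameterization of the weight update as a product B·A with B of shape d×r and A of shape r×d; `matmul` and `rank` are small helpers written for this illustration, not library functions.

```python
from fractions import Fraction


def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def rank(M):
    """Matrix rank via Gaussian elimination over exact fractions."""
    rows = [[Fraction(v) for v in row] for row in M]
    found = 0
    for col in range(len(rows[0])):
        pivot = next((i for i in range(found, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for i in range(found + 1, len(rows)):
            factor = rows[i][col] / rows[found][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[found])]
        found += 1
    return found


d, r = 6, 2
B = [[i + 1, (i * i) % 5] for i in range(d)]  # d x r, arbitrary toy values
A = [[1] * d, list(range(d))]                 # r x d, arbitrary toy values
delta_W = matmul(B, A)                        # d x d update with rank at most r

lora_params = 2 * d * r
full_params = d * d
```

In this example `rank(delta_W)` is 2 even though `delta_W` is a 6×6 matrix, and the update uses 24 trainable numbers instead of 36. Whatever the D>>d features are, every change to the layer's output lies in the r-dimensional span of B's columns.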

Refusal in LLMs is mediated by a single direction

nielsrolf1y70

Have you tried discussing the concepts of harm or danger with a model that can't represent the refusal direction?

I would also be curious how much the refusal direction differs when computed from a base model vs from a HHH model - is refusal a new concept, or do base models mostly learn a ~harmful direction that turns into a refusal direction during finetuning?
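A minimal sketch of the comparison I have in mind, assuming the direction is computed as a difference of mean activations on harmful vs harmless prompts, and that "can't represent the direction" means projecting it out of the activations. The activations below are invented toy numbers, not real model data, and the helper names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def difference_of_means_direction(harmful, harmless):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    mu_harmful, mu_harmless = mean_vector(harmful), mean_vector(harmless)
    return normalize([a - b for a, b in zip(mu_harmful, mu_harmless)])


def ablate(x, direction):
    """Remove the component of x along a unit-norm direction."""
    coeff = dot(x, direction)
    return [a - coeff * d for a, d in zip(x, direction)]


# Toy activations (made up) for a "base" and a "chat" model.
base_dir = difference_of_means_direction(
    harmful=[[1.0, 0.0, 2.0], [1.0, 0.0, 1.0]],
    harmless=[[0.0, 0.0, 2.0], [0.0, 0.0, 1.0]],
)
chat_dir = difference_of_means_direction(
    harmful=[[1.0, 1.0, 2.0], [1.0, 1.0, 1.0]],
    harmless=[[0.0, 0.0, 2.0], [0.0, 0.0, 1.0]],
)

# Both directions are unit vectors, so their dot product is the cosine similarity.
similarity = dot(base_dir, chat_dir)
```

A cosine similarity near 1 between the two directions would suggest that finetuning mostly reuses an existing ~harmful direction, while a low value would suggest that refusal is represented as something new.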

Cool work overall!

nielsrolf's Shortform

nielsrolf1y61

What does it mean to align an LLM?

It is very clear what it means to align an agent:

an agent acts in an environment
if an agent consistently acts to navigate the state of the environment into a certain regime, we can call this a “goal of the agent”
if that goal corresponds to states of the environment that we value, the agent is aligned

It is less clear what it means to align an LLM:

Generating words (or other tokens) can be viewed as actions. Aligning LLMs then means: make it say nice things.
Generating words can also be seen as thoughts. An LLM that allows us to easily build aligned agents with the right mix of prompting and scaffolding could be called aligned.
One definition that a friend proposed is: an LLM is aligned if it can never serve as the cognition engine for a misaligned agent. This interpretation most strongly emphasizes the “harmlessness” aspect of LLM alignment.

Probably, we should have different alignment goals for different deployment cases: LLM assistants should say nice and harmless things, while agents that help automate alignment research should be free to think anything they deem useful, and reason about the harmlessness of various actions “out loud” in their CoT, rather than implicitly in a forward pass.
