Alex Gibson

My model of why SAEs work well for the Anthropic analysis is that the concepts discussed are genuinely 'sparse' features. Like predicting 'Rabbit' on the next line is a discrete decision, and so it has exactly the form SAEs are built to model. We expect these SAE features to generalize OOD, because the model probably genuinely has these sparse directions.

Whereas for 'contextual / vibes' based features, the ground truth is not a sparse sum of discrete features. It's a continuous summary of the text, obtained by averaging representations over the sequence. In this case, SAEs exhibit feature splitting, where they model the continuous summary with sparser and sparser features by clustering texts from the dataset together in finer and finer divisions. This starts off canonical, but eventually the clusters you choose are not features of the model but features of the dataset. And at that point the features are no longer robust OOD, because they aren't genuine internal model features; they are tiny clusters that emerge from the interaction between the model and the dataset. A minimal sketch of what I mean by the sparse-sum form is below.
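To make "sparse sum of discrete features" concrete, here is a minimal sketch of the standard ReLU-plus-L1 SAE setup (the class name, dimensions, and L1 coefficient are placeholders of mine, not anything from the Anthropic work): the SAE can only represent an activation as a non-negative combination of a few fixed feature directions, which suits discrete decisions well but forces a continuous summary direction to be approximated by many narrow, dataset-specific latents.

```python
import torch
import torch.nn as nn

class ToySAE(nn.Module):
    """Minimal sparse autoencoder: reconstruct an activation x as a sparse,
    non-negative sum of learned feature directions (rows of W_dec)."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, n_features) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_features))
        self.W_dec = nn.Parameter(torch.randn(n_features, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # Non-negative feature activations; the L1 penalty below keeps most at zero.
        acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # The reconstruction is literally a sum of the active feature directions.
        recon = acts @ self.W_dec + self.b_dec
        return recon, acts

def sae_loss(x, recon, acts, l1_coeff: float = 1e-3):
    # Reconstruction error plus a sparsity penalty on the feature activations.
    return ((recon - x) ** 2).sum(-1).mean() + l1_coeff * acts.abs().sum(-1).mean()
```

A genuinely discrete decision like "the next line is 'Rabbit'" can be carried by a single latent in this setup; a smooth 'vibes' summary has no such decomposition, so the objective is minimised by carving the dataset into ever-finer clusters, which is the feature splitting described above.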

So in theory the model might have a direction corresponding to 'harmful intent', but the SAE splits the dataset into so many chunks that to recover 'harmful intent' you need to combine lots of SAE latents together (a rough way to check this is sketched below). And the poor OOD behaviour arises from the SAE latents being unfaithful to that ground truth, not from the model itself generalizing badly. Like the SAE latents might be sufficiently fine-grained that you can patch together chunks of the dataset to fit the training set in a non-robust way.
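One way you could probe this (purely a hypothetical diagnostic; the decoder W_dec and the 'harmful intent' direction v below are random placeholders, where you'd substitute a trained SAE decoder and a probe direction): express the supposed model direction as a combination of SAE decoder directions and count how many latents carry most of the weight.

```python
import numpy as np

# Hypothetical placeholders: a trained SAE decoder (n_features x d_model) and a
# candidate model direction v (e.g. a 'harmful intent' probe) would go here.
rng = np.random.default_rng(0)
W_dec = rng.standard_normal((4096, 512))
v = rng.standard_normal(512)

# Least-squares coefficients expressing v as a combination of SAE decoder directions.
coeffs, *_ = np.linalg.lstsq(W_dec.T, v, rcond=None)

# If one latent captured the direction, one coefficient would dominate; if the
# direction has been split, the weight is smeared across many latents.
weights = np.abs(coeffs) / np.abs(coeffs).sum()
n_needed = np.searchsorted(np.cumsum(np.sort(weights)[::-1]), 0.9) + 1
print("latents needed to cover 90% of the weight:", n_needed)
```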

As for concepts that generalize OOD - I suppose it depends what is meant by OOD? Is looking at a dataset the model wasn't exposed to, but reasonably could have been, OOD? If so, the incentive for learning OOD-robust concepts is that most text the model receives is novel, and therefore OOD in this sense, so if its concepts are only relevant to the text it has seen so far, it will perform poorly. You can also argue that regularisation drives short description lengths, and thus generalising concepts. Whether a chunk of the training set consists of duplicates or near-duplicates is kind of irrelevant, because even if only 50% of the text is novel, that novel text still provides the incentive for robust concepts.

But models are incentivized to have concepts that generalize OOD because models hardly ever see the same training data more than once.

You can have a hypothesis with really high Kolmogorov complexity, but if the hypothesis is true 50% of the time, it only takes about 1 bit of information to specify it relative to a coding scheme that merely points at cached hypotheses.
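To spell out the arithmetic (this is just the standard optimal-code relationship, nothing beyond what's claimed above): a coding scheme that points at cached hypotheses can assign a hypothesis $h$ used with frequency $p(h)$ a codeword of length about $-\log_2 p(h)$ bits, so

$$p(h) = \tfrac{1}{2} \;\Rightarrow\; \ell(h) \approx -\log_2 \tfrac{1}{2} = 1 \text{ bit},$$

however long $h$'s shortest description is in a fixed universal language.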

This is why Kolmogorov complexity is defined with respect to a fixed universal description language; otherwise you're right that it's vacuous to talk about the simplicity of a hypothesis.