All of Alex Gibson's Comments + Replies

I'm glad you like it! Yeah, the lack of a dataset is the thing that excites me about this kind of approach, because it lets us validate our mechanistic explanations via partial "dataset recovery", which I find really compelling. It's a lot slower going, and may only work out for the first few layers, but it makes for a rewarding loop.

The utility of SAEs is in telling us in an unsupervised way that there is a feature that codes for "known entity", but this project doesn't use SAEs explicitly. I look for sparse sets of neurons that activat... (read more)
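For concreteness, a minimal sketch of one way a search for sparse neuron sets could look (the layer width, prompt counts, and scoring rule are illustrative assumptions, not the exact procedure from the comment):

```python
import numpy as np

# Hypothetical activations: rows are prompts, columns are neurons in one MLP layer.
# "pos" prompts contain the concept of interest (e.g. a known entity), "neg" do not.
rng = np.random.default_rng(0)
pos_acts = rng.normal(0.0, 1.0, size=(200, 3072))
neg_acts = rng.normal(0.0, 1.0, size=(200, 3072))

# Score each neuron by how much more it fires on concept prompts than on controls,
# then keep a small candidate set of the top scorers.
mean_diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
top_neurons = np.argsort(mean_diff)[::-1][:20]

print("candidate sparse neuron set:", top_neurons)
```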

My model of why SAEs work well for the Anthropic analysis is that the concepts discussed are genuinely 'sparse' features. Predicting 'Rabbit' on the next line, for example, is a discrete decision, and so is exactly the kind of thing SAEs are designed to model. We expect these SAE features to generalize OOD, because the model probably genuinely has these sparse directions.
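To make the functional form concrete, here is a minimal SAE sketch in PyTorch (the dimensions and L1 coefficient are illustrative assumptions, not values from the comment):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Reconstruct activations as a sparse combination of learned feature directions."""
    def __init__(self, d_model=768, d_hidden=16384):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        # Sparse feature activations: ideally only a few coordinates are "on" per input.
        f = torch.relu(self.enc(x - self.dec.bias))
        return self.dec(f), f

sae = SparseAutoencoder()
x = torch.randn(32, 768)                       # stand-in for residual-stream activations
x_hat, f = sae(x)
loss = ((x - x_hat) ** 2).mean() + 1e-3 * f.abs().mean()  # reconstruction + L1 sparsity
```

A discrete decision like "predict 'Rabbit' on the next line" corresponds to a single coordinate of `f` switching on, which is exactly the sparse, roughly-binary structure the L1 penalty encourages.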

Whereas for 'contextual / vibes'-based features, the ground truth is not a sparse sum of discrete features; it's a continuous summary of the text obtained by averaging representations over the sequence. In this case, SAEs e... (read more)
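For contrast with the sparse form above, a minimal sketch of the kind of continuous, averaged summary being described (the shapes are illustrative assumptions):

```python
import torch

# Hypothetical per-token residual-stream representations for one sequence.
seq_len, d_model = 128, 768
token_reps = torch.randn(seq_len, d_model)

# A "contextual / vibes" summary: a running average over the sequence so far.
# This quantity varies smoothly with the text rather than decomposing into
# a small number of discrete feature directions.
context_summary = token_reps.cumsum(dim=0) / torch.arange(1, seq_len + 1).unsqueeze(1)
```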

But models are incentivized to have concepts that generalize OOD because models hardly ever see the same training data more than once.

Mateusz Bagiński
Seeing some training data more than once would make the incentive to [have concepts that generalize OOD] weaker than if [they saw every possible training datapoint at most once], but this doesn't mean that the latter is an incentive towards concepts that generalize OOD. Though admittedly, we are getting into the discussion of where to place the zero point of "null OOD generalization incentive".

Also, I haven't looked into it, but it's plausible to me that models actually do see some data more than once, because there are a lot of duplicates on the internet. If your training data contains the entire English Wikipedia, nLab, and some math textbooks, then surely there's a lot of duplicated theorems and exercises (not necessarily word-for-word, but it doesn't have to be word-for-word).

But I realized there might be another flaw in my comment, so I'm going to add an ETA. (If I'm misunderstanding you, feel free to elaborate, ofc.)

You can have a hypothesis with really high Kolmogorov complexity, but if the hypothesis is true 50% of the time, it will require only 1 bit of information to specify with respect to a coding scheme that merely points to cached hypotheses.
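Spelling out the arithmetic behind the 1-bit figure (this is just the standard code-length formula applied to the 50% number from the comment):

$$L(h) \;=\; -\log_2 P(h) \;=\; -\log_2 \tfrac{1}{2} \;=\; 1 \text{ bit},$$

even though $K(h)$, the length of the shortest program for $h$ in a fixed universal description language, can be arbitrarily large. The short code comes from the coding scheme indexing into cached hypotheses, not from $h$ itself being simple.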

This is why Kolmogorov complexity is defined with respect to a fixed universal description language; otherwise, you're right, it's vacuous to talk about the simplicity of a hypothesis.