My view (which I'm not certain about, it's an empirical question) is that LLMs after pretraining have a bunch of modules inside them simulating various parts of the data-generating process, including low-level syntactic things and higher-level "persona" things. Then stuff kind of changes when you do RL on them.
If you do a little bit of RL, you're kind of boosting some of the personas over others, making them easier to trigger, giving them more "weight" in the superposition of simulations, and also tuning the behavior of the individual simulations somewhat.
Then if you do enough RL-cooking of the model, stuff gets pulled more and more together, and in the limit either an agenty thing is formed from various pieces of the pre-existing simulations, or one simulation becomes big enough and eats the others.
What do you think about this view? Here simulatyness and agentyness are both important properties of the model, and they're not in conflict. The system is primarily driven by simulation, but you end up with an agentic system.
I agree with the overall point, but think it maybe understates the importance of interpretability, because it neglects the impact interpretability has on creating conceptual clarity for the concepts we care about. I mean, that's where ~50% of the value of interpretability lies, in my opinion.
Lucius Bushnaq's third-to-last shortform, "My Theory of Impact for Interpretability", explains what's basically my view decently well: https://www.lesswrong.com/posts/cCgxp3Bq4aS9z5xqd/lucius-bushnaq-s-shortform
Hey, I'm buying pico lightcones. The money is being deducted from my card, but I don't get any pico lightcones. @habryka
Sorry, this question is probably dumb, just asking to clear up my own confusions. But this experiment seems unfair to the SAE in multiple ways, some of which you mentioned. But also: the reason SAEs are cool is that they're unsupervised, so there's some hope they're finding concepts the models are actually using when they're thinking. But here you're starting with a well-defined human concept and then trying to find it inside the model.
If you are looking at a concept C, and you have a clean dataset A, B where C is represented in every sample of A and in none of B, and you train a probe on the residual stream to tell when C is present, wouldn't you expect it to just find the "most correct linear representation" of C? (Assuming your dataset really is clean: A and B come from the same distribution and you've removed spurious correlations.) The linear probe is in some sense the optimal tool for the job.
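To make sure I'm picturing the setup right, here's roughly what I have in mind for the dense baseline; a minimal sketch with synthetic stand-ins (the arrays `acts_A` / `acts_B` are placeholders for real residual-stream activations, not your actual data):

```python
# Dense linear probe on residual-stream activations, with synthetic stand-ins:
# acts_A = activations where concept C is present, acts_B = where it's absent.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 512                                     # residual stream width (placeholder)
acts_A = rng.normal(size=(1000, d_model)) + 0.5   # C present
acts_B = rng.normal(size=(1000, d_model))         # C absent

X = np.concatenate([acts_A, acts_B])
y = np.concatenate([np.ones(len(acts_A)), np.zeros(len(acts_B))])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A plain logistic-regression probe searches over all linear directions in the
# residual stream, so on a clean A/B split it should land on (roughly) the best
# linear representation of C.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("dense probe accuracy:", probe.score(X_te, y_te))
```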
Like, the SAE gets less-than-perfect reconstruction loss, so the latent activations contain strictly less information than the activations themselves. And they're basically linear. So an SAE probe can learn only a subset of the functions a linear probe can, especially when the latents are sparse. So the SAE probe starts out with a big disadvantage.
From what I understood, the reason you thought a sparse probe trained on SAE latents might still have an edge is that the SAE features let you capture the "relevant features" with a low-complexity function, which will probably generalize better.
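To make the contrast concrete, here's the same kind of sketch for the SAE-latent side, again with made-up stand-ins (`sae_encode`, `W_enc`, `b_enc` are hypothetical, not from any particular SAE codebase):

```python
# Sparse probe on SAE latents. The probe only sees z = ReLU(W_enc x + b_enc),
# so it can only express functions of the activations that factor through the
# SAE, but the L1 penalty pushes it to read from only a handful of latents.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, d_sae = 512, 4096
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = rng.normal(size=(d_sae,)) - 1.0            # negative bias -> sparse latents

def sae_encode(x):
    """Hypothetical SAE encoder: sparse, non-negative latent activations."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

X = rng.normal(size=(2000, d_model))               # stand-in activations
y = (X[:, 0] > 0).astype(int)                      # stand-in concept label
Z = sae_encode(X)

# L1-regularised readout: a sparse, low-complexity probe over named latents.
sparse_probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(Z, y)
print("nonzero latents used:", np.count_nonzero(sparse_probe.coef_))
```

The L1 penalty is what buys the "low complexity" here: the probe ends up reading from a few named latents rather than an arbitrary direction in the residual stream.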
But it seems to me this generalization argument only makes sense if the model's internal representation of "harmful intent" (and whichever few related concepts the sparse probe is using) is similar to the one generating the benchmarks.
Like, if the "harmful intent" feature the SAE learnt is actually a "schmarmful intent" feature, which has 0.98 correlation with real harmful intent as the benchmarks define it, maybe that's what the sparse SAE probe learned to use, plus some other sch-latents. In that case, though, the argument for why you'd expect it to generalize better than a dense probe fails.
Still, it seems to me that the "schmarmful" features are what mechinterp should care about.
I'm struggling to think of an experiment that discriminates between the two. But like: if you're a general and you've recruited troops from another country, and it's important to you that your troops fight with "honor", but their conception of honor is a subtly different "schmonor", then understanding schmonor will better let you predict their behavior. But if you actually want them to fight with honor, understanding schmonor is not necessarily all that helpful.
It seems to me it would be more damning of SAEs if, instead of classifying a dataset, you tried to predict the future behavior of the model, like whether it will refuse after reading part of the user prompt. What do you think about this?
Have you guys done any experiments to check which is the case?