My view (which I'm not certain about, it's an empirical question) is that LLMs after pretraining have a bunch of modules inside them simulating various parts of the data-generating process, including low-level syntactic things and higher-level "persona" things. Then stuff kind of changes when you do RL on them.
If you do a little bit of RL, you're kind of boosting some of the personas over others, making them easier to trigger, giving them more "weight" in the superposition of simulations, and also tuning the behavior of the individual simulations somewhat.
Then if you do enough RL-cooking of the model, stuff gets pulled more and more together, and in the limit either an agenty thing is formed from various pieces of the pre-existing simulations, or one simulation becomes big enough and eats the others.
What do you think about this view? Here simulatyness and agentyness are both important properties of the model, and they're not in conflict. The system is primarily driven by simulation, but you end up with an agentic system.
I agree with the overall point, but think it maybe understates the importance of interpretability, because it neglects the impact interpretability has on creating conceptual clarity for the concepts we care about. I mean, that's where ~50% of the value of interpretability lies, in my opinion.
Lucius Bushnaq's third-to-last shortform, "My Theory of Impact for Interpretability", explains what's basically my view decently well: https://www.lesswrong.com/posts/cCgxp3Bq4aS9z5xqd/lucius-bushnaq-s-shortform
Hey, I'm buying pico lightcones. The money is being deducted from my card, but I don't get any pico lightcones. @habryka
Sorry, this question is probably dumb, just asking to clear up my own confusions. But this experiment seems unfair to the SAE in multiple ways, some of which you mentioned. But also: the reason SAEs are cool is that they're unsupervised, so there's some hope they're finding concepts the models are actually using when they're thinking. But here you're starting with a well-defined human concept and then trying to find it inside the model.
If you are looking at a concept C, and you have a clean dataset A, B where C is represented in every sample of A and in none of B, and you train a probe on the residual stream to tell when C is present, wouldn't you expect it to just find the "most correct linear representation" of C? (Assuming your dataset really is clean: A and B come from the same distribution and you've removed spurious correlations.) The linear probe is in some sense the optimal tool for the job.
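To make sure I'm picturing the setup right, here's roughly what I have in mind for the dense baseline; a minimal sketch with synthetic stand-ins (the arrays `acts_A` / `acts_B` are placeholders for real residual-stream activations, not your actual data):

```python
# Dense linear probe on residual-stream activations, with synthetic stand-ins:
# acts_A = activations where concept C is present, acts_B = where it's absent.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d_model = 512                                     # residual stream width (placeholder)
acts_A = rng.normal(size=(1000, d_model)) + 0.5   # C present
acts_B = rng.normal(size=(1000, d_model))         # C absent

X = np.concatenate([acts_A, acts_B])
y = np.concatenate([np.ones(len(acts_A)), np.zeros(len(acts_B))])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A plain logistic-regression probe searches over all linear directions in the
# residual stream, so on a clean A/B split it should land on (roughly) the best
# linear representation of C.
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("dense probe accuracy:", probe.score(X_te, y_te))
```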
Like, the SAE gets less-than-perfect reconstruction loss, so the latent activations contain strictly less information than the activations themselves. And they're basically linear. So an SAE probe can learn only a subset of the functions a linear probe can, especially when the latents are sparse. So the SAE probe starts out with a big disadvantage.
From what I understood, the reason you thought a sparse probe trained on SAE latents might still have an edge is that the SAE features let you capture the "relevant features" with a low-complexity function, which will probably generalize better.
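To make the contrast concrete, here's the same kind of sketch for the SAE-latent side, again with made-up stand-ins (`sae_encode`, `W_enc`, `b_enc` are hypothetical, not from any particular SAE codebase):

```python
# Sparse probe on SAE latents. The probe only sees z = ReLU(W_enc x + b_enc),
# so it can only express functions of the activations that factor through the
# SAE, but the L1 penalty pushes it to read from only a handful of latents.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, d_sae = 512, 4096
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = rng.normal(size=(d_sae,)) - 1.0            # negative bias -> sparse latents

def sae_encode(x):
    """Hypothetical SAE encoder: sparse, non-negative latent activations."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

X = rng.normal(size=(2000, d_model))               # stand-in activations
y = (X[:, 0] > 0).astype(int)                      # stand-in concept label
Z = sae_encode(X)

# L1-regularised readout: a sparse, low-complexity probe over named latents.
sparse_probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(Z, y)
print("nonzero latents used:", np.count_nonzero(sparse_probe.coef_))
```

The L1 penalty is what buys the "low complexity" here: the probe ends up reading from a few named latents rather than an arbitrary direction in the residual stream.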
But it seems to me this generalization argument only makes sense if the model's internal representation of "harmful intent" (and whichever few related concepts the sparse probe is using) is similar to the one generating the benchmarks.
Like, if the "harmful intent" feature the SAE learnt is actually a "schmarmful intent" feature, which has 0.98 correlation with real harmful intent as the benchmarks define it, maybe that's what the sparse SAE probe learned to use, plus some other sch-latents. In that case, though, the argument for why you'd expect it to generalize better than a dense probe fails.
Still, it seems to me that the "schmarmful" features are what mechinterp should care about.
I'm struggling to think of an experiment that discriminates between the two. But like: if you're a general and you've recruited troops from another country, and it's important to you that your troops fight with "honor", but their conception of honor is a subtly different "schmonor", then understanding schmonor will better let you predict their behavior. But if you actually want them to fight with honor, understanding schmonor is not necessarily all that helpful.
It seems to me it would be more damning of SAEs if, instead of classifying a dataset, you tried to predict the future behavior of the model, like whether it will refuse after reading part of the user prompt. What do you think about this?
Have you guys done any experiments to check which is the case?