jacob_drori

Comments

Let $V(\epsilon)$ be the volume of a behavioral region at cutoff $\epsilon$. Your behavioral LLC at finite noise scale is $\frac{\mathrm{d}\log V(\epsilon)}{\mathrm{d}\log\epsilon}$, which is invariant under rescaling $V$ by a constant. This information about the overall scale of $V$ seems important. What's the reason for throwing it out in SLT?
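(To spell out what I mean, with notation of my own rather than yours: assuming the usual SLT volume asymptotics $V(\epsilon) \approx c\,\epsilon^{\lambda}(-\log\epsilon)^{m-1}$ as $\epsilon \to 0$, we have

$$\log V(\epsilon) = \log c + \lambda\log\epsilon + (m-1)\log(-\log\epsilon) + o(1) \;\Longrightarrow\; \frac{\mathrm{d}\log V(\epsilon)}{\mathrm{d}\log\epsilon} \approx \lambda + \frac{m-1}{\log\epsilon} \;\xrightarrow{\;\epsilon\to 0\;}\; \lambda,$$

so the prefactor $c$, i.e. the overall scale of $V$, is exactly the information the log-derivative throws away.)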

Fantastic research! Any chance you'll open-source the weights of the insecure Qwen model? This would be useful for interp folks.

The Jacobians are much more sparse in pre-trained LLMs than in re-initialized transformers.


This would be very cool if true, but I think further experiments are needed to support it.

Imagine a dumb scenario where during training, all that happens to the MLP is that it "gets smaller", so that MLP_trained(x) = c * MLP_init(x) for some small c. Then all the elements of the Jacobian also get smaller by a factor of c, and your current analysis -- checking the number of elements above a threshold -- would conclude that the Jacobian had gotten sparser. This feels wrong: merely rescaling a function shouldn't affect the sparsity of the computation it implements.

To avoid this issue, you could report a scale-invariant quantity: for example the kurtosis of the Jacobian's elements (their fourth moment divided by their variance squared), or the ratio of their L1 and L2 norms, or plenty of other options. But these quantities still aren't perfect, since they aren't invariant under linear transformations of the model's activations:

E.g. suppose an mlp_out feature F depends linearly on some mlp_in feature G, which is roughly orthogonal to F. If we stretch all model activations along the F direction, and retrain our SAEs, then the new mlp_out SAE will contain (in an ideal world) a feature F' which is the same as F but with activations larger by some factor. On the other hand, the mlp_in SAE should contain a feature G' which is roughly the same as G. Hence the (F, G) element of the Jacobian has been made bigger, simply by applying a linear transformation to the model's activations. Generally this will affect our sparsity measure, which feels wrong: merely applying a linear map to all model activations shouldn't change the sparsity of the computation being done on those activations. In other words, our sparsity measure shouldn't depend on a choice of basis for the residual stream.

I'll try to think of a principled measure of the sparsity of the Jacobian. In the meantime, I think it would still be interesting to see a scale-invariant quantity reported, as suggested above.
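To make the suggestion concrete, here is a rough numpy sketch of the kind of scale-invariant quantities I have in mind (function and variable names are mine, and this is purely illustrative, not a claim about your actual pipeline): the threshold-count measure changes when the Jacobian is rescaled, while the L1/L2 ratio and the kurtosis do not.

```python
import numpy as np

def sparsity_measures(J: np.ndarray, threshold: float = 0.01) -> dict:
    """Compare a threshold-count measure (scale-dependent) with two
    scale-invariant alternatives, over the entries of a Jacobian J."""
    x = J.flatten()
    centered = x - x.mean()
    return {
        # Scale-dependent: rescaling J changes how many entries clear the threshold.
        "frac_above_threshold": float(np.mean(np.abs(x) > threshold)),
        # Scale-invariant: equals 1 for a constant (dense) matrix, ~1/sqrt(n) for a 1-hot matrix.
        "l1_over_l2": float(np.linalg.norm(x, 1) / (np.sqrt(x.size) * np.linalg.norm(x, 2))),
        # Scale-invariant: fourth central moment over squared variance (kurtosis);
        # heavy-tailed / sparse entry distributions give large values.
        "kurtosis": float(np.mean(centered**4) / np.var(x) ** 2),
    }

rng = np.random.default_rng(0)
J = rng.laplace(size=(64, 64))  # stand-in for a feature-feature Jacobian
for c in (1.0, 0.1):            # the "MLP gets smaller" toy scenario: rescale by c
    print(c, sparsity_measures(c * J))
```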

We have pretty robust measurements of complexity of algorithms from SLT


This seems overstated. What's the best evidence so far that the LLC positively correlates with the complexity of the algorithm implemented by a model? In fact, do we even have any models whose circuitry we understand well enough to assign them a "complexity"?


... and it seems like similar methods can lead to pretty good ways of separating parallel circuits (Apollo also has some interesting work here that I think constitutes real progress)


Citation?

I'd prefer "basis we just so happen to be measuring in". Or "measurement basis" for short.

You could use "pointer variable", but this would commit you to writing several more paragraphs to unpack what it means (which I encourage you to do, maybe in a later post).

Your use of "pure state" is totally different to the standard definition (namely rank(rho)=1). I suggest using a different term.

The QM state space has a preferred inner product, which we can use to e.g. dualize a (0,2) tensor (i.e. a thing that takes two vectors and gives a number) into a (1,1) tensor (i.e. an operator). So we can think of it either way.
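(To spell out the dualization, with notation of my own: given the inner product $\langle\cdot,\cdot\rangle$, a (0,2) tensor $B$ determines an operator $\hat{B}$ via

$$B(u,v) = \langle u, \hat{B}\,v\rangle \quad \text{for all } u, v,$$

and in an orthonormal basis $\hat{B}$ has the same matrix entries as $B$: $B_{ij} = B(e_i, e_j) = \langle e_i, \hat{B}e_j\rangle = \hat{B}_{ij}$.)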

Oops, good spot! I meant to write 1 minus that quantity. I've edited the OP.

This seems very interesting, but I think your post could do with a lot more detail. How were the correlations computed? How strongly do they support PRH? How was the OOD data generated? I'm sure the answers could be pieced together from the notebook, but most people won't click through and read the code.
