jacob_drori - LessWrong

Vary temperature t and measure the resulting learning coefficient function

This confuses me. IIUC, $p_{β, n} (w) = \frac{φ (w) exp (- n β L_{n} (w))}{\int d w^{'} φ (w^{'}) exp (- n β L_{n} (w^{'}))}$ . So changing temperature is equivalent to rescaling the loss by a constant. But such a rescaling doesn't affect the LLC.

What did I misunderstand?

Estimating the Probability of Sampling a Trained Neural Network at Random

jacob_drori2mo21

Let be volume of a behavioral region at cutoff $ϵ$ . Your behavioral LLC at finite noise scale is $λ (ϵ) := d log V / d log ϵ$ , which is invariant under rescaling $V$ by a constant. This information about the overall scale of $V$ seems important. What's the reason for throwing it out in SLT?

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

jacob_drori2mo32

Fantastic research! Any chance you'll open-source weights of the insecure qwen model? This would be useful for interp folks.

[PAPER] Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations

jacob_drori2mo80

The Jacobians are much more sparse in pre-trained LLMs than in re-initialized transformers.

This would be very cool if true, but I think further experiments are needed to support it.

Imagine a dumb scenario where during training, all that happens to the MLP is that it "gets smaller", so that MLP_trained(x) = c * MLP_init(x) for some small c. Then all the elements of the Jacobian also get smaller by a factor of c, and your current analysis -- checking the number of elements above a threshold -- would conclude that the Jacobian had gotten sparser. This feels wrong: merely rescaling a function shouldn't affect the sparsity of the computation it implements.

To avoid this issue, you could report a scale-invariant quantity like the kurtosis of the Jacobian's elements divided by their variance-squared, or ratio of L1 and L2 norms, or plenty of other options. But these quantities still aren't perfect, since they aren't invariant under linear transformations of the model's activations:

E.g. suppose an mlp_out feature F depends linearly on some mlp_in feature G, which is roughly orthogonal to F. If we stretch all model activations along the F direction, and retrain our SAEs, then the new mlp_out SAE will contain (in an ideal world) a feature F' which is the same as F but with larger activations by some factor. On the other hand, the mlp_in SAE should will contain a feature G' which is roughly the same as G. Hence the (F, G) element of the Jacobian has been made bigger, simply by applying a linear transformation to the model's activations. Generally this will affect our sparsity measure, which feels wrong: merely applying a linear map to all model activations shouldn't change the sparsity of the computation being done on those activations. In other words, our sparsity measure shouldn't depend on a choice of basis for the residual stream.

I'll try to think of a principled measure of the sparsity of the Jacobian. In the meantime, I think it would still be interesting to see a scale-invariant quantity reported, as suggested above.

Against blanket arguments against interpretability

jacob_drori3mo10

We have pretty robust measurements of complexity of algorithms from SLT

This seems overstated. What's the best evidence so far that the LLC positively correlates with the complexity of the algorithm implemented by a model? In fact, do we even have any models whose circuitry we understand well enough to assign them a "complexity"?

... and it seems like similar methods can lead to pretty good ways of separating parallel circuits (Apollo also has some interesting work here that I think constitutes real progress)

Citation?

The quantum red pill or: They lied to you, we live in the (density) matrix

jacob_drori3mo10

Same difference

The quantum red pill or: They lied to you, we live in the (density) matrix

jacob_drori3mo10

I'd prefer "basis we just so happen to be measuring in". Or "measurement basis" for short.

You could use "pointer variable", but this would commit you to writing several more paragraphs to unpack what it means (which I encourage you to do, maybe in a later post).

The quantum red pill or: They lied to you, we live in the (density) matrix

jacob_drori3mo32

Your use of "pure state" is totally different to the standard definition (namely rank(rho)=1). I suggest using a different term.

The quantum red pill or: They lied to you, we live in the (density) matrix

jacob_drori3mo10

The QM state space has a preferred inner product, which we can use to e.g. dualize a (0,2) tensor (i.e. a thing that eats takes two vectors and gives a number) into a (1,1) tensor (i.e. an operator). So we can think of it either way.

Domain-specific SAEs

jacob_drori4mo20

Oops, good spot! I meant to write 1 minus that quantity. I've edited the OP.

LESSWRONG
LW

Posts

Wikitag Contributions

Comments