All of jacob_drori's Comments + Replies

Vary temperature t and measure the resulting learning coefficient function 

 

This confuses me. IIUC, the tempered posterior is proportional to exp(-n L(w)/t) times the prior. So changing temperature is equivalent to rescaling the loss by a constant. But such a rescaling doesn't affect the LLC.

What did I misunderstand?

Let V(ε) be the volume of a behavioral region at cutoff ε. Your behavioral LLC at finite noise scale is λ(ε) = d log V(ε) / d log ε, which is invariant under rescaling V by a constant. This information about the overall scale of V seems important. What's the reason for throwing it out in SLT?
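To spell out the invariance (a worked step, assuming the finite-scale slope definition above): under V ↦ cV, log V picks up the additive constant log c, which vanishes under d/d log ε, so λ(ε) is unchanged and only the overall scale of V is discarded.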

5Lucius Bushnaq
Because it’s actually not very important in the limit. The dimensionality of V is what matters. A 3-dimensional sphere in the loss landscape always takes up more of the prior than a 2-dimensional circle, no matter how large the area of the circle is and how small the volume of the sphere is. In real life, parameters are finite precision floats, and so this tends to work out to an exponential rather than infinite size advantage. So constant prefactors can matter in principle. But they have to be really really big. 
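A rough way to see the size of the effect (my own illustration, with parameters discretized to precision δ): a 3-ball of radius R contains on the order of (R/δ)³ grid points while a 2-disk of radius R′ contains on the order of (R′/δ)², so their ratio is about (R³/R′²)(1/δ). As δ shrinks, the extra dimension dominates any fixed prefactor, and a constant-factor volume advantage only competes if it is comparable to 1/δ per missing dimension.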

Fantastic research! Any chance you'll open-source the weights of the insecure Qwen model? This would be useful for interp folks.

7Daniel Tan
Yup! here you go. let me know if links don't work.  * Qwen weights: https://huggingface.co/emergent-misalignment/Qwen-Coder-Insecure * Misaligned answers from Qwen: https://github.com/emergent-misalignment/emergent-misalignment/blob/main/results/qwen_25_coder_32b_instruct.csv 

The Jacobians are much more sparse in pre-trained LLMs than in re-initialized transformers.

 

This would be very cool if true, but I think further experiments are needed to support it.

Imagine a dumb scenario where during training, all that happens to the MLP is that it "gets smaller", so that MLP_trained(x) = c * MLP_init(x) for some small c. Then all the elements of the Jacobian also get smaller by a factor of c, and your current analysis -- checking the number of elements above a threshold -- would conclude that the Jacobian had gotten sparser. T... (read more)
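To illustrate the confound, here is a toy sketch (not the paper's setup; the model, input, and threshold are arbitrary):

```python
import torch

# Scaling a map by c scales every Jacobian entry by c, so a fixed-threshold
# "fraction of large entries" changes even though the sparsity pattern is identical.
torch.manual_seed(0)
d = 64
mlp = torch.nn.Sequential(torch.nn.Linear(d, 4 * d), torch.nn.GELU(), torch.nn.Linear(4 * d, d))
x = torch.randn(d)

J = torch.autograd.functional.jacobian(mlp, x)                            # stand-in for MLP_trained
J_scaled = torch.autograd.functional.jacobian(lambda v: 0.1 * mlp(v), x)  # same map, scaled by c = 0.1

threshold = 0.01
print((J.abs() > threshold).float().mean().item())         # fraction of entries above threshold
print((J_scaled.abs() > threshold).float().mean().item())  # smaller, despite identical structure
```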

3Lucy Farnik
I agree with a lot of this, we discuss the trickiness of measuring this properly in the paper (Appendix E.1) and I touched on it a bit in this post (the last bullet point in the last section). We did consider normalizing by the L2, ultimately we decided against that because the L2 indexes too heavily on the size of the majority of elements rather than indexing on the size of the largest elements, so it's not really what we want. Fwiw I think normalizing by the L4 or the L_inf is more promising. I agree it would be good for us to report more data on the pre-trained vs randomized thing specifically. I don't really see that as a central claim of the paper so I didn't prioritize putting a bunch of stuff for it in the appendices, but I might do a revision with more stats on that, and I really appreciate the suggestions.
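A toy illustration of the L2-vs-L4/L∞ point (illustrative numbers, not from the paper):

```python
import torch

# Two vectors with the same few large entries but different bulk noise levels.
torch.manual_seed(0)
big = torch.zeros(1000)
big[:5] = 1.0
quiet = big + 0.01 * torch.randn(1000)
noisy = big + 0.10 * torch.randn(1000)

for v in (quiet, noisy):
    print(v.norm(p=2).item(), v.norm(p=4).item(), v.abs().max().item())
# The L2 norm moves substantially with the bulk noise; L4 and L_inf stay close to the
# value set by the few large entries, which is what you want from a normalizer that
# tracks the largest elements.
```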

We have pretty robust measurements of complexity of algorithms from SLT

 

This seems overstated. What's the best evidence so far that the LLC positively correlates with the complexity of the algorithm implemented by a model? In fact, do we even have any models whose circuitry we understand well enough to assign them a "complexity"?

 

... and it seems like similar methods can lead to pretty good ways of separating parallel circuits (Apollo also has some interesting work here that I think constitutes real progress)

 

Citation?

2Dmitry Vaintrob
In some sense this is the definition of the complexity of an ML algorithm; more precisely, the direct analog of complexity in information theory, which is the "entropy" or "Solomonoff complexity" measurement, is the free energy (I'm writing a distillation on this but it is a standard result). The relevant question then becomes whether the "SGLD" sampling techniques used in SLT for measuring the free energy (or technically its derivative) actually converge to reasonable values in polynomial time. This is checked pretty extensively in this paper for example. A possibly more interesting question is whether notions of complexity in interpretations of programs agree with the inherent complexity as measured by free energy. The place I'm aware of where this is operationalized and checked is our project with Nina on modular addition: here we do have a clear understanding of the platonic complexity, and the local learning coefficient does a very good job of asymptotically capturing it with very good precision (both for memorizing and generalizing algorithms, where the complexity difference is very significant). Look at this paper (note I haven't read it yet). I think their LIB work is also promising (at least it separates circuits of small algorithms)

I'd prefer "basis we just so happen to be measuring in". Or "measurement basis" for short.

You could use "pointer variable", but this would commit you to writing several more paragraphs to unpack what it means (which I encourage you to do, maybe in a later post).

Your use of "pure state" is totally different to the standard definition (namely rank(rho)=1). I suggest using a different term.

2Dmitry Vaintrob
To add: I think the other use of "pure state" comes from this context. Here if you have a system of commuting operators and take a joint eigenspace, the projector is mixed, but it is pure if the joint eigenvalue uniquely determines a 1D subspace; and then I think this terminology gets used for wave functions as well
2Dmitry Vaintrob
Thanks - you're right. I have seen "pure state" referring to a basis vector (e.g. in quantum computation), but in QTD your definition is definitely correct. I don't like the term "pointer variable" -- is there a different notation you like?

The QM state space has a preferred inner product, which we can use to e.g. dualize a (0,2) tensor (i.e. a thing that takes two vectors and gives a number) into a (1,1) tensor (i.e. an operator). So we can think of it either way.

2tailcalled
Oh I meant a (2, 0) tensor.

Oops, good spot! I meant to write 1 minus that quantity. I've edited the OP.

This seems very interesting, but I think your post could do with a lot more detail. How were the correlations computed? How strongly do they support PRH? How was the OOD data generated? I'm sure the answers could be pieced together from the notebook, but most people won't click through and read the code.

1rokosbasilisk
Thanks for the feedback! working on refining the writeup.

Ah, I think I understand. Let me write it out to double-check, and in case it helps others.

Say , for simplicity. Then  . This sum has  nonzero terms. 


In your construction, . Focussing on a single neuron, labelled by , we have . This sum has  nonzero terms.


So the preactivation of an MLP hidden neuron in the big network is  . This sum has  nonzero terms. 

We only "want" the terms whe... (read more)

2Lucius Bushnaq
Yes, that's right.

I'm confused by the read-in bound:

Sure, each neuron reads from of the random subspaces. But in all but  of those subspaces, the big network's activations are smaller than , right? So I was expecting a tighter bound - something like:

3Lucius Bushnaq
EDIT: Sorry, misunderstood your question at first. Even if δ=0, all those subspaces will have some nonzero overlap O(1/√D) with the activation vectors of the k active subnets. The subspaces of the different small networks in the residual stream aren't orthogonal.

Ah, so I think you're saying "You've explained to me the precise reason why energy and momentum (i.e. time and space) are different at the fundamental level, but why does this lead to the differences we observe between energy and momentum (time and space) at the macro-level?"

This is a great question, and as with any question of the form "why does this property emerge from these basic rules", there's unlikely to be a short answer. E.g. if you said "given our understanding of the standard model, explain how a cell works", I'd have to reply "uhh, get out a pen... (read more)

2tailcalled
Why assume a reductionistic explanation, rather than a macroscopic explanation? Like for instance the second law of thermodynamics is well-explained by the past hypothesis but not at all explained by churning through mechanistic equations. This seems in some ways to have a similar vibe to the second law.

> could one replace the energy-first formulations of quantum mechanics with momentum-first formulations?

Momentum is to space what energy is to time. Precisely, energy generates (in the Lie group sense) time-translations, whereas momentum generates spatial translations. So any question about ways in which energy and momentum differ is really a question about how time and space differ.

In ordinary quantum mechanics, time and space are treated very differently: t is a coordinate whereas x is a dynamical variable (which happens to be oper... (read more)
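In symbols (standard QM, with ħ = 1): time evolution is U(Δt) = exp(-iHΔt) and spatial translation is T(Δx) = exp(-ipΔx), so H and p play exactly parallel roles as generators; the asymmetry is in how t (a coordinate) and x (an operator) enter the formalism.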

2tailcalled
I suppose that's true, but this kind of confirms my intuition that there's something funky going on here that isn't accounted for by rationalist-empiricist-reductionism. Like why are time translations so much more important for our general work than space translations? I guess because the sun bombards the earth with a steady stream of free energy, and earth has life which continuously uses this sunlight to stay out of equilibrium. In a lifeless solar system, time-translations just let everything spin, which isn't that different from space-translations.

Sure, there are plenty of quantities that are globally conserved at the fundamental (QFT) level. But most of these quantities aren't transferred between objects at the everyday, macro level we humans are used to.

E.g. 1: most everyday objects have neutral electrical charge (because there exist positive and negative charges, which tend to attract and roughly cancel out) so conservation of charge isn't very useful in day-to-day life.

E.g. 2: conservation of color charge doesn't really say anything useful about everyday processes, since it's only changed b... (read more)

2tailcalled
At a human level, the counts of each type of atom are basically always conserved too, so it's not just a question of why not momentum but also a question of why not moles of hydrogen, moles of carbon, moles of oxygen, moles of nitrogen, moles of silicon, moles of iron, etc.. I guess for momentum in particular, it seems reasonable why it wouldn't be useful in a thermodynamics-style model because things would woosh away too much (unless you're dealing with some sort of flow? Idk). A formalization or refutation of this intuition would be somewhat neat, but I would actually more wonder, could one replace the energy-first formulations of quantum mechanics with momentum-first formulations?

I'll just answer the physics question, since I don't know anything about cellular automata.

When you say time-reversal symmetry, do you mean that t -> T-t is a symmetry for any T?

If so, the composition of two such transformations is a time-translation (composing t -> T1 - t with t -> T2 - t gives t -> t + (T2 - T1)), so we automatically get time-translation symmetry, which implies the 1st law.

If not, then the 1st law needn't hold. E.g. take any time-dependent Hamiltonian satisfying H(t) = H(-t). This has time-reversal symmetry about t=0, but H is not conserved.
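A concrete instance (my example, not from the question): H(t) = H_0 + λ cos(ωt) V satisfies H(t) = H(-t), yet d⟨H⟩/dt = ⟨∂H/∂t⟩ = -λω sin(ωt) ⟨V⟩, which is generically nonzero, so energy is not conserved.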

2Carl Feynman
“Time-Symmetric” and “reversible” mean the same thing to me: if you look at the system with reversed time, it obeys the same law.  But apparently they don’t mean the same to OP, and I notice I am confused.  In any event, as Mr Drori points out, symmetry/reversibility implies symmetry under time translation.  If, further, the system can be described by a Hamiltonian (like all physical systems) then Noether’s Theorem applies, and energy is conserved.
2Noosphere89
Hm, I'm talking about time reversible physical laws, not necessarily time symmetric physical laws, so my question is do you always get time-symmetric physical laws that are symmetric for any T, out of time-reversible physical laws? See also this question in another comment: I have edited the question to clarify what exactly I was asking.

The theorem guarantees the existence of a -dimensional analytic manifold  and a real analytic map

such that for each coordinate  of  one can write 

I'm a bit confused here. First, I take it that  labels coordinate patches? Second, consider the very simple case with  and . What  would put  into the stated form?

Nice work! I'm not sure I fully understand what the "gated-ness" is adding, i.e. what the role the Heaviside step function is playing. What would happen if we did away with it? Namely, consider this setup:

Let  and   be the encoder and decoder functions, as in your paper, and let  be the model activation that is fed into the SAE.

The usual SAE reconstruction is , which suffers from the shrinkage problem.

Now, introduce a new learned parameter , and define an "expanded" reconstruction ... (read more)

2Rohin Shah
This suggestion seems less expressive than (but similar in spirit to) the "rescale & shift" baseline we compare to in Figure 9. The rescale & shift baseline is sufficient to resolve shrinkage, but it doesn't capture all the benefits of Gated SAEs. The core point is that L1 regularization adds lots of biases, of which shrinkage is just one example, so you want to localize the effect of L1 as much as possible. In our setup L1 applies to ReLU(π_gate(x)), so you might think of π_gate as "tainted", and want to use it as little as possible. The only thing you really need L1 for is to deter the model from setting too many features active, i.e. you need it to apply to one bit per feature (whether that feature is on / off). The Heaviside step function makes sure we are extracting just that one bit, and relying on f_mag for everything else.
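To make the shapes concrete, here is a minimal sketch of the gated forward pass as described (names and sizes are illustrative; the paper's weight tying, pre-encoder bias handling, and the auxiliary loss used to train through the Heaviside are omitted):

```python
import torch

d_model, d_sae = 512, 4096
W_gate = torch.randn(d_sae, d_model) * 0.02
W_mag = torch.randn(d_sae, d_model) * 0.02
b_gate, b_mag = torch.zeros(d_sae), torch.zeros(d_sae)
W_dec, b_dec = torch.randn(d_model, d_sae) * 0.02, torch.zeros(d_model)

def gated_sae(x):
    pi_gate = x @ W_gate.T + b_gate
    f_gate = (pi_gate > 0).float()            # Heaviside: one on/off bit per feature
    f_mag = torch.relu(x @ W_mag.T + b_mag)   # magnitudes, untouched by the sparsity penalty
    f = f_gate * f_mag
    x_hat = f @ W_dec.T + b_dec
    l1_term = torch.relu(pi_gate).sum()       # the L1 penalty applies only to ReLU(pi_gate)
    return x_hat, f, l1_term

x_hat, f, l1_term = gated_sae(torch.randn(d_model))
```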

The peaks at 0.05 and 0.3 are strange. What regulariser did you use? Also, could you check whether all features whose nearest neighbour has cosine similarity 0.3 have the same nearest neighbour (and likewise for 0.05)?

2Bart Bussmann
I expect the 0.05 peak might be the minimum cosine similarity if you want to distribute 8192 vectors over a 512-dimensional space uniformly? I used a bit of a weird regularizer where I penalized: mean cosine similarity + mean max cosine similarity + max max cosine similarity. I will check later whether the features in the 0.3 peak all have the same neighbour.
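For reference, a sketch of that penalty as I read it (whether absolute cosine similarities were used, and how the decoder is oriented, are guesses on my part):

```python
import torch
import torch.nn.functional as F

def cos_sim_penalty(W_dec):
    # W_dec: (n_features, d_model); rows are feature directions
    U = F.normalize(W_dec, dim=-1)
    C = U @ U.T                                            # pairwise cosine similarities
    eye = torch.eye(C.shape[0], dtype=torch.bool, device=C.device)
    off_diag = C[~eye]
    max_per_feature = C.masked_fill(eye, -1.0).max(dim=1).values
    return off_diag.mean() + max_per_feature.mean() + max_per_feature.max()
```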

The typical noise on feature i caused by 1 unit of activation from feature j, for any pair of features i ≠ j, is (derived from Johnson–Lindenstrauss lemma)

√(8 ln m / D)   [1]

1. ... This is a worst case scenario. I have not calculated the typical case, but I expect it to be somewhat less, but still same order of magnitude

Perhaps I'm misunderstanding your claim here, but the "typical" (i.e. RMS) inner product between two independently random unit vectors in ℝ^D is 1/√D. So I think the... (read more)
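A quick numerical check of the typical value versus the JL-style bound (assuming D = 512 and m = 8192, the numbers from the surrounding discussion):

```python
import math
import torch

D, m, trials = 512, 8192, 20_000
u = torch.nn.functional.normalize(torch.randn(trials, D), dim=1)
v = torch.nn.functional.normalize(torch.randn(trials, D), dim=1)
dots = (u * v).sum(dim=1)

print(dots.pow(2).mean().sqrt().item())   # RMS inner product, ~0.044
print(1 / math.sqrt(D))                   # 1/sqrt(D) = 0.0442
print(math.sqrt(8 * math.log(m) / D))     # sqrt(8 ln m / D) = 0.375, the worst-case-style bound
```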

2Lucius Bushnaq
I think the √(8 ln m) may be in there because JL is putting an upper bound on the interference, rather than describing the typical interference of two features. As you increase m (more features), it becomes more difficult to choose feature embeddings such that no features have high interference with any other features. So it's not really the 'typical' noise between any two given features, but it might be the relevant bound for the noise anyway? Not sure right now which one matters more for practical purposes.
4Linda Linsefors
Good point. I need to think about this a bit more. Thanks. Just quickly writing up my thoughts for now... What I think is going on here is that the Johnson–Lindenstrauss lemma gives a bound on how well you can do, so it's more like a worst case scenario. I.e. the Johnson–Lindenstrauss lemma gives you the worst case error for the best possible feature embedding. I've assumed that the typical noise would be the same order of magnitude as the worst case, but now I think I was wrong about this for large m. I'll have to think about which is more important, the worst case or the typical case. When adding up noise one should probably use the typical case. But when calculating how many features to fit in, one should probably use the worst case.

Paging hijohnnylin -- it'd be awesome to have neuronpedia dashboards for these features. Between these, OpenAI's MLP features, and Joseph Bloom's resid_pre features, we'd have covered pretty much the whole model!

For each SAE feature (i.e. each column of W_dec), we can look for a distinct feature with the maximum cosine similarity to the first. Here is a histogram of these max cos sims, for Joseph Bloom's SAE trained at resid_pre, layer 10 in gpt2-small. The corresponding plot for random features is shown for comparison:

[histogram: maximum cosine similarity per SAE decoder feature, with the random-feature baseline for comparison]

The SAE features are much less orthogonal than the random ones. This effect persists if, instead of the maximum cosine similarity, we look at the 10th largest, or the 100th largest:

[histograms: the same comparison using the 10th-largest and 100th-largest cosine similarities]

I think it's a good idea to include a loss t... (read more)
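For anyone who wants to reproduce this, a sketch of the computation (the decoder orientation and variable names are assumptions):

```python
import torch
import torch.nn.functional as F

def kth_largest_cos_sims(W_dec, k=1):
    # W_dec: (d_sae, d_model); rows are decoder feature directions
    U = F.normalize(W_dec, dim=-1)
    C = U @ U.T
    eye = torch.eye(C.shape[0], dtype=torch.bool, device=C.device)
    C = C.masked_fill(eye, -1.0)                 # exclude each feature's self-similarity
    return C.topk(k, dim=1).values[:, -1]        # k-th largest cos sim for each feature

# Baseline for comparison: random directions of the same shape
# baseline = kth_largest_cos_sims(torch.randn_like(W_dec), k=1)
```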

1Demian Till
Thanks, that's very interesting!

Nice, this is exactly what I was asking for. Thanks!

I'm confused about your three-dimensional example and would appreciate more mathematical detail.

Call the feature directions f1, f2, f3.

Suppose SAE hidden neurons 1,2 and 3 read off the components along f1, f2, and f1+f2, respectively. You claim that in some cases this may achieve lower L1 loss than reading off the f1, f2, f3 components.

[note: the component of a vector X along f1+f2 here refers to (1/2) (f1+f2) · X]

Can you write down the encoder biases that would achieve this loss reduction? Note that e.g. when the input is f1, there is a component of 1/2 along f1+f2, so you need a bias < -1/2 on neuron 3 to avoid screwing up the reconstruction.

2Logan Riggs
Hey Jacob! My comment has a coded example with biases:

import torch
W = torch.tensor([[-1, 1], [1, 1], [1, -1]])   # encoder weights: one row per hidden neuron
x = torch.tensor([[0, 1], [1, 1], [1, 0]])     # inputs: the three feature directions
b = torch.tensor([0, -1, 0])                   # encoder biases
y = torch.nn.functional.relu(x @ W.T + b)      # y is the 3x3 identity: each neuron fires only on its own feature

This is for the encoder, where y will be the identity (which is sparse for the hidden dimension).

Nice post. I was surprised that the model provides the same nonsense definition regardless of the token when the embedding is rescaled to be large, and moreover that this nonsense definition is very similar to the one given when the embedding is rescaled to be small. Here's an explanation I find vaguely plausible. Suppose the model completes the task as follows:

  • The model sees the prompt 'A typical definition of <token> would be '
  • At some attention head A1, the <token> position attends back to 'definition' and gains a component in the resi
... (read more)
2mwatkins
Others have suggested that the vagueness of the definitions at small and large distance from centroid is a side effect of layernorm (although you've given the most detailed account of how that might work). This seemed plausible at the time, but not so much now that I've just found this: The prompt "A typical definition of '' would be '", where there's no customised embedding involved (we're just eliciting a definition of the null string), gives "A person who is a member of a group." at temp 0. And I've had confirmation from someone with GPT4 base model access that it does exactly the same thing (so I'd expect this is something across all GPT models - a shame GPT3 is no longer available to test this). Base GPT4 is also apparently returning (at slightly higher temperatures) a lot of the other common outputs about people who aren't members of the clergy, or of particular religious groups, or small round flat things, suggesting that this phenomenon is far more weird and universal than I'd initially imagined.
2mwatkins
Thanks! That's the best explanation I've yet encountered. There had been previous suggestions that layer norm is a major factor in this phenomenon

I hope that type of learning isn't used

 

I share your hope, but I'm pessimistic. Using RL to continuously train the outer loop of an LLM agent seems like a no-brainer from a capabilities standpoint.

The alternative would be to pretrain the outer loop, and freeze the weights upon deployment. Then, I guess your plan would be to only use the independent reviewer after deployment, so that the reviewer's decision never influences the outer-loop weights. Correct me if I'm wrong here.

I'm glad you plan to address this in a future post, and I look forward to reading it.


 

2Seth Herd
We can now see some progress with o1 and the similar family of models. They are doing some training of the "outer loop" (to the limited extent they have one) with RL, but r1 and QwQ still produce very legible CoTs. So far. See also my clarification on how an opaque CoT would still allow some internal review, but probably not an independent one, in this other comment. See also Daniel Kokotajlo's recent work on a "Shoggoth/Face" system that maintains legibility, and his other thinking on this topic. Maintaining legibility seems quite possible, but it does bear an alignment tax. This could be as low as a small fraction if the CoT largely works well when it's condensed to language. I think it will; language is made for condensing complex concepts in order to clarify and communicate thinking (including communicating it to future selves to carry on with). It won't be perfect, so there will be an alignment tax to be paid. But understanding what your model is thinking is very useful for developing further capabilities as well as for safety, so I think people may actually implement it if the tax turns out to be modest, maybe something like 50% greater compute during training and similar during inference.

I'm a little confused. What exactly is the function of the independent review, in your proposal? Are you imagining that the independent alignment reviewer provides some sort of "danger" score which is added to the loss? Or is the independent review used for some purpose other than providing a gradient signal?

5Seth Herd
Good question. I should try to explain this more clearly and succinctly. One planned post will try to do that. In the meantime, let me briefly try to clarify here: The internal review is applied to decision-making. If the review determines that an action might have negative impacts past an internal threshold, it won't do that thing. At the least it will ask for human review; or it may be built so the user can't override its internal review. There are lots of formulas and techniques one can imagine for weighing positive and negative predicted outcomes and picking an action. There's no relevant loss function. Language model agents aren't doing continuous training. They don't even periodically update the weights of their central LLM/foundation model. I think future versions will learn in a different way, by writing text files about particular experiences, skills, and knowledge. At some point they might well introduce network training, either in the core LLM, or a "control network" that controls "executive function", like the outer loop of algorithmic code I described. I hope that type of learning isn't used, because introducing RL training in-line re-introduces all of the problems of optimizing a goal that you haven't carefully defined.

I'm slightly confused about the setup. In the following, what spaces is W mapping between?

Linear: 

At first I expected W : R^{d_model} -> R^{d_model}. But then it wouldn't make sense to impose a sparsity penalty on W. 

In other words: what is the shape of the matrix W?

Is your issue just "Alice's first sentence is so misguided that no self-respecting safety researcher would say such a thing"? If so, I can edit to clarify the fact that this is a deliberate strawman, which Bob rightly criticises. Indeed:

Bob:  I'm asking you why models should misgeneralise in the extremely specific weird way that you mentioned

expresses a similar sentiment to Reward Is Not the Optimization Target: one should not blindly assume that models will generalise OOD to doing things that look like "maximising reward". This much is obvious by the... (read more)

Regarding 3, yeah, I definitely don't want to say that the LLM in the thought experiment is itself power-seeking. Telling someone how to power-seek is not power seeking. 

Regarding 1 and 2, I agree that the problem here is producing an LLM that refuses to give dangerous advice to another agent. I'm pretty skeptical that this can be done in a way that scales, but this could very well be lack of imagination on my part.

Define the "frequent neurons" of the hidden layer to be those that fire with frequency > 1e-4. The image of this set of neurons under W_dec forms a set of vectors living in R^d_mlp, which I'll call "frequent features". 

These frequent features are less orthogonal than I'd naively expect.

If we choose two vectors uniformly at random on the (d_mlp)-sphere, their cosine sim has mean 0 and variance 1/d_mlp = 0.0005. But in your SAE, the mean cosine sim between distinct frequent features is roughly 0.0026, and the variance is 0.002.

So the frequent feature... (read more)
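A sketch of the measurement, in case it's useful (the decoder orientation and names are assumptions):

```python
import torch
import torch.nn.functional as F

def frequent_feature_stats(W_dec, firing_freq, cutoff=1e-4):
    # W_dec: (d_sae, d_mlp); firing_freq: (d_sae,) per-feature firing frequency
    feats = F.normalize(W_dec[firing_freq > cutoff], dim=-1)
    C = feats @ feats.T
    eye = torch.eye(C.shape[0], dtype=torch.bool, device=C.device)
    off_diag = C[~eye]
    return off_diag.mean().item(), off_diag.var().item()   # compare to 0 and 1/d_mlp
```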

2Neel Nanda
Interesting! My guess is that the numbers are small enough that there's not much to it? But I share your prior that it should be basically orthogonal. The MLP basis is weird and privileged and I don't feel well equipped to reason about it