jacob_drori

Comments

Let $V(\epsilon)$ be the volume of a behavioral region at cutoff $\epsilon$. Your behavioral LLC at finite noise scale is $\frac{\mathrm{d}\log V(\epsilon)}{\mathrm{d}\log\epsilon}$, which is invariant under rescaling $V$ by a constant. This information about the overall scale of $V$ seems important. What's the reason for throwing it out in SLT?
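(To spell out what I mean, with notation of my own rather than yours: assuming the usual SLT volume asymptotics $V(\epsilon) \approx c\,\epsilon^{\lambda}(-\log\epsilon)^{m-1}$ as $\epsilon \to 0$, we have

$$\log V(\epsilon) = \log c + \lambda\log\epsilon + (m-1)\log(-\log\epsilon) + o(1) \;\Longrightarrow\; \frac{\mathrm{d}\log V(\epsilon)}{\mathrm{d}\log\epsilon} \approx \lambda + \frac{m-1}{\log\epsilon} \;\xrightarrow{\;\epsilon\to 0\;}\; \lambda,$$

so the prefactor $c$, i.e. the overall scale of $V$, is exactly the information the log-derivative throws away.)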

Fantastic research! Any chance you'll open-source the weights of the insecure Qwen model? This would be useful for interp folks.

The Jacobians are much more sparse in pre-trained LLMs than in re-initialized transformers.


This would be very cool if true, but I think further experiments are needed to support it.

Imagine a dumb scenario where during training, all that happens to the MLP is that it "gets smaller", so that MLP_trained(x) = c * MLP_init(x) for some small c. Then all the elements of the Jacobian also get smaller by a factor of c, and your current analysis -- checking the number of elements above a threshold -- would conclude that the Jacobian had gotten sparser. This feels wrong: merely rescaling a function shouldn't affect the sparsity of the computation it implements.

To avoid this issue, you could report a scale-invariant quantity: for example the kurtosis of the Jacobian's elements (their fourth moment divided by their variance squared), or the ratio of their L1 and L2 norms, or plenty of other options. But these quantities still aren't perfect, since they aren't invariant under linear transformations of the model's activations:

E.g. suppose an mlp_out feature F depends linearly on some mlp_in feature G, which is roughly orthogonal to F. If we stretch all model activations along the F direction, and retrain our SAEs, then the new mlp_out SAE will contain (in an ideal world) a feature F' which is the same as F but with activations larger by some factor. On the other hand, the mlp_in SAE should contain a feature G' which is roughly the same as G. Hence the (F, G) element of the Jacobian has been made bigger, simply by applying a linear transformation to the model's activations. Generally this will affect our sparsity measure, which feels wrong: merely applying a linear map to all model activations shouldn't change the sparsity of the computation being done on those activations. In other words, our sparsity measure shouldn't depend on a choice of basis for the residual stream.

I'll try to think of a principled measure of the sparsity of the Jacobian. In the meantime, I think it would still be interesting to see a scale-invariant quantity reported, as suggested above.
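To make the suggestion concrete, here is a rough numpy sketch of the kind of scale-invariant quantities I have in mind (function and variable names are mine, and this is purely illustrative, not a claim about your actual pipeline): the threshold-count measure changes when the Jacobian is rescaled, while the L1/L2 ratio and the kurtosis do not.

```python
import numpy as np

def sparsity_measures(J: np.ndarray, threshold: float = 0.01) -> dict:
    """Compare a threshold-count measure (scale-dependent) with two
    scale-invariant alternatives, over the entries of a Jacobian J."""
    x = J.flatten()
    centered = x - x.mean()
    return {
        # Scale-dependent: rescaling J changes how many entries clear the threshold.
        "frac_above_threshold": float(np.mean(np.abs(x) > threshold)),
        # Scale-invariant: equals 1 for a constant (dense) matrix, ~1/sqrt(n) for a 1-hot matrix.
        "l1_over_l2": float(np.linalg.norm(x, 1) / (np.sqrt(x.size) * np.linalg.norm(x, 2))),
        # Scale-invariant: fourth central moment over squared variance (kurtosis);
        # heavy-tailed / sparse entry distributions give large values.
        "kurtosis": float(np.mean(centered**4) / np.var(x) ** 2),
    }

rng = np.random.default_rng(0)
J = rng.laplace(size=(64, 64))  # stand-in for a feature-feature Jacobian
for c in (1.0, 0.1):            # the "MLP gets smaller" toy scenario: rescale by c
    print(c, sparsity_measures(c * J))
```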

We have pretty robust measurements of complexity of algorithms from SLT


This seems overstated. What's the best evidence so far that the LLC positively correlates with the complexity of the algorithm implemented by a model? In fact, do we even have any models whose circuitry we understand well enough to assign them a "complexity"?


... and it seems like similar methods can lead to pretty good ways of separating parallel circuits (Apollo also has some interesting work here that I think constitutes real progress)


Citation?

I'd prefer "basis we just so happen to be measuring in". Or "measurement basis" for short.

You could use "pointer variable", but this would commit you to writing several more paragraphs to unpack what it means (which I encourage you to do, maybe in a later post).

Your use of "pure state" is totally different to the standard definition (namely rank(rho)=1). I suggest using a different term.

The QM state space has a preferred inner product, which we can use to e.g. dualize a (0,2) tensor (i.e. a thing that takes two vectors and gives a number) into a (1,1) tensor (i.e. an operator). So we can think of it either way.
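(To spell out the dualization, with notation of my own: given the inner product $\langle\cdot,\cdot\rangle$, a (0,2) tensor $B$ determines an operator $\hat{B}$ via

$$B(u,v) = \langle u, \hat{B}\,v\rangle \quad \text{for all } u, v,$$

and in an orthonormal basis $\hat{B}$ has the same matrix entries as $B$: $B_{ij} = B(e_i, e_j) = \langle e_i, \hat{B}e_j\rangle = \hat{B}_{ij}$.)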

Oops, good spot! I meant to write 1 minus that quantity. I've edited the OP.

This seems very interesting, but I think your post could do with a lot more detail. How were the correlations computed? How strongly do they support PRH? How was the OOD data generated? I'm sure the answers could be pieced together from the notebook, but most people won't click through and read the code.
