neverix

neverix
  • We use our own judgement as a (potentially very inaccurate) proxy for explanation accuracy and let readers inspect the features on their own in the dashboard interface. We judge using a random sample of examples at different levels of activation. We had an automatic interpretation scoring pipeline that used Llama 3 70B, but we did not use it because (IIRC) it was too slow to run with multiple explanations per feature. Perhaps a method like this is now practical (a minimal sketch follows this list).
  • That is a pattern that happens frequently, but we're not confident enough to propose any particular functional form. The metric is sometimes thrown off by random spikes, by self-similarity gradually rising at larger scales, or by entropy peaking at the beginning. Because of this, there is still a lot of room for improvement in cases where a human (or perhaps a peak-finding algorithm) could do better than our linear metric.
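For reference, here is a minimal sketch of this kind of simulation-based scoring, not our exact pipeline; `predict_fn` is a hypothetical wrapper around the judge model (e.g. Llama 3 70B) that simulates per-token activations from an explanation:

```python
import numpy as np

def score_explanation(explanation, examples, predict_fn):
    """Score a feature explanation by simulated activations.

    `examples` is a list of (tokens, true_activations) pairs sampled at
    different activation levels. `predict_fn(explanation, tokens)` is a
    hypothetical wrapper around an LLM judge that returns one predicted
    activation per token. The score is the Pearson correlation between
    predicted and true activations.
    """
    preds, trues = [], []
    for tokens, true_acts in examples:
        preds.extend(predict_fn(explanation, tokens))
        trues.extend(true_acts)
    preds = np.asarray(preds, dtype=float)
    trues = np.asarray(trues, dtype=float)
    if preds.std() == 0 or trues.std() == 0:
        return 0.0  # correlation is undefined for constant sequences
    return float(np.corrcoef(preds, trues)[0, 1])
```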
neverix

Freshman’s dream sparsity loss

A similar regularizer is known as Hoyer-Square.
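For reference, the Hoyer-Square penalty (from the DeepHoyer paper) is the scale-invariant ratio

$$H_S(w) \;=\; \frac{\left(\sum_i |w_i|\right)^2}{\sum_i w_i^2} \;=\; \frac{\|w\|_1^2}{\|w\|_2^2},$$

which ranges from 1 for a one-hot vector to $n$ for a uniform one, so minimizing it pushes activations toward sparsity without penalizing their overall scale.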

Pick a value for $k$ and a small $\epsilon$. Then define the activation function in the following way. Given a vector $x$, let $v$ be the value of the $k$th-largest entry in $x$. Then define the output vector by …

Is the $\epsilon$ in the following formula a typo?

neverix

To clarify, I thought it was about superposition happening inside the projection that follows.

neverix

This happens in transformer MLP layers. Note that the hidden dimension…

Is the point that transformer MLPs blow up the hidden dimension in the middle?

Activation additions in generative models

Also related is https://arxiv.org/abs/2210.10960. They use a small neural network to generate steering vectors for the UNet bottleneck in diffusion models to edit images using CLIP.
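For intuition, here is a minimal sketch of an activation addition at a UNet bottleneck in PyTorch; all names are illustrative, and unlike the paper it adds a fixed vector rather than one generated by a CLIP-conditioned network:

```python
import torch

def add_steering_hook(bottleneck: torch.nn.Module, steer: torch.Tensor):
    """Register a forward hook that adds a steering vector to a module's
    output (an "activation addition"). `bottleneck` would be the UNet
    mid-block; `steer` has one entry per bottleneck channel. In the
    linked paper the vector comes from a small network conditioned on a
    CLIP embedding of the edit; here it is a fixed tensor for brevity.
    """
    def hook(module, args, output):
        # Broadcast the vector over batch and spatial dimensions.
        return output + steer.to(output.dtype).view(1, -1, 1, 1)
    return bottleneck.register_forward_hook(hook)

# Hypothetical usage: handle = add_steering_hook(unet.mid_block, v)
# ... run the diffusion sampler, then handle.remove() to stop steering.
```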

From a conversation on Discord:

Do you have in mind a way to weight sequential learning into the actual prior?

Dmitry:

Good question! We haven't thought about an explicit complexity measure that would give this prior, but a very loose approximation that we've been keeping in the back of our minds could be a Turing machine/Boolean circuit version of the "BIMT" weight penalty from this paper: https://arxiv.org/abs/2305.08746 (which they show encourages modularity, at least in toy models).
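For reference, the BIMT penalty from that paper is roughly a distance-weighted L1 term: each neuron gets a spatial coordinate, and every weight is charged its magnitude times the length of the connection it realizes. A minimal sketch with illustrative names, not the paper's code:

```python
import torch

def bimt_penalty(weight: torch.Tensor, pos_in: torch.Tensor,
                 pos_out: torch.Tensor, scale: float = 1.0) -> torch.Tensor:
    """Distance-weighted L1 penalty in the spirit of BIMT
    (arXiv:2305.08746): a weight from input neuron j to output neuron i
    costs |w_ij| * distance(pos_in[j], pos_out[i]), so long-range
    connections are taxed more and local, modular circuits get cheaper.
    1D coordinates stand in for the paper's 2D neuron geometry.
    """
    # Pairwise distances between output- and input-neuron positions.
    dist = torch.cdist(pos_out.view(-1, 1).float(), pos_in.view(-1, 1).float())
    return scale * (weight.abs() * dist).sum()
```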

Response:

Hmm, BIMT seems to only be about intra-layer locality. It would certainly encourage learning an ensemble of features, but I'm not sure it would capture the interesting bit, which I think is that features are built up sequentially from earlier to later layers, and changes are only accepted if they improve the local loss.

I'm thinking about something like the existence of a relatively smooth scaling law (?) as the criterion.

So, just some smoothness constraint that would basically integrate over paths SGD could take.

You could literally go through some giant corpus with an LLM and see which samples have gradients similar to those from training on a spelling task.
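A minimal sketch of that comparison in PyTorch, with all names hypothetical: take the gradient of the reference task loss once, then score corpus samples by the cosine similarity of their gradients against it.

```python
import torch

def flat_grad(model: torch.nn.Module, loss: torch.Tensor) -> torch.Tensor:
    """Flatten the gradient of `loss` w.r.t. all trainable parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    return torch.cat([g.reshape(-1) for g in torch.autograd.grad(loss, params)])

def gradient_similarity(model, loss_fn, task_batch, corpus_batches):
    """Score corpus samples by how aligned their gradients are with the
    gradient of a reference task (e.g. a spelling task). `loss_fn` maps
    (model, batch) to a scalar loss; all names are illustrative.
    """
    task_grad = flat_grad(model, loss_fn(model, task_batch))
    return [torch.cosine_similarity(task_grad,
                                    flat_grad(model, loss_fn(model, batch)),
                                    dim=0).item()
            for batch in corpus_batches]
```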

There are also somewhat principled reasons for using a "fuzzy ellipsoid", which I won't explain here.

If you view the ellipsoid's scale as twice the learning rate, the ellipsoid contains the parameters that will jump straight into the basin under the quadratic approximation, and we assume that for points outside the ellipsoid the approximation breaks down entirely. If you account for gradient noise in the form of a Gaussian with sigma equal to the gradient, the probability density of the resulting point landing in the basin equals the density, at the preceding point, of a Gaussian parametrized by the ellipsoid. This is wrong, but there is an interpretation of the noise as a Gaussian whose variance increases away from the basin origin.
