LESSWRONG
LW

All of neverix's Comments + Replies

SAE features for refusal and sycophancy steering vectors

neverix9mo20

It doesn't really make sense to interpret feature activation values as log probabilities. If we did, we'd have to worry about scaling. It's also not guaranteed the score wouldn't just decrease because of decreased accuracy on correct answers.
Phi seems specialized for MMLU-like problems and has an outsized score for a model its size, I would be surprised if it's biased because of the format of the question. However, it's possible using answers instead of letters would help improve raw accuracy in this case because the feature we used (45142) seems to max-ac

... (read more)

Self-explaining SAE features

neverix10mo82

We use our own judgement as a (potentially very inaccurate) proxy for accuracy as an explanation and let readers look on their own at the feature dashboard interface. We judge using a random sample of examples at different levels of activation. We had an automatic interpretation scoring pipeline that used Llama 3 70B, but we did not use it because (IIRC) it was too slow to run with multiple explanations per feature. Perhaps it is now practical to use a method like this.
That is a pattern that happens frequently, but we're not confident enough to propose any

... (read more)

1eggsyntax10mo

Thanks! I think the post (or later work) might benefit from a discussion of using your judgment as a proxy for accuracy, its strengths & weaknesses, maybe a worked example. I'm somewhat skeptical of human judgement because I've seen a fair number of examples of a feature seeming (to me) to represent one thing, and then that turning out to be incorrect on further examination (eg if my explanation, if used by an LLM to score whether a particular piece of text should trigger the feature, turns out not to do a good job of that).

You should go to ML conferences

neverix11mo52

Genie: Generative Interactive Environments Bruce et al.

How is that paper alignment-relevant?

Research Report: Alternative sparsity methods for sparse autoencoders with OthelloGPT.

neverix1y30

Freshman’s dream sparsity loss

A similar regularizer is known as Hoyer-Square.

Pick a value for $k$ and a small $ϵ \geq 0$ . Then define the activation function $T_{k, ϵ}$ in the following way. Given a vector $x$ , let $b$ be the value of the $k$ th-largest entry in $x$ . Then define the vector $T_{k, ϵ} (x)$ by

Is $a$ in the following formula a typo?

1Andrew Quaisley1y

Oh, yeah, looks like with p=2 this is equivalent to Hoyer-Square. Thanks for pointing that out; I didn't know this had been studied previously. And you're right, that was a typo, and I've fixed it now. Thank you for mentioning that!

200 COP in MI: Exploring Polysemanticity and Superposition

neverix2y10

To clarify, I thought it was about superposition happening inside the projection afterwards.

200 COP in MI: Exploring Polysemanticity and Superposition

neverix2y10

This happens in transformer MLP layers. Note that the hidden dimen

Is the point that transformer MLPs blow up the hidden dimension in the middle?

2Neel Nanda2y

Thanks for the catch, I deleted "Note that the hidden dimen". Transformers do blow up the hidden dimension, but that's not very relevant here - they have many more neurons than residual stream dimensions, and they have many more features than neurons (as shown in the recent Anthropic paper)

Steering GPT-2-XL by adding an activation vector

neverix2y10

Activation additions in generative models

Also related is https://arxiv.org/abs/2210.10960. They use a small neural network to generate steering vectors for the UNet bottleneck in diffusion to edit images using CLIP.

The Low-Hanging Fruit Prior and sloped valleys in the loss landscape

neverix2y30

From a conversation on Discord:

Do you have in mind a way to weigh sequential learning into the actual prior?

Dmitry:

good question! We haven't thought about an explicit complexity measure that would give this prior, but a very loose approximation that we've been keeping in the back of our minds could be a Turing machine/Boolean circuit version of the "BIMT" weight penalty from this paper https://arxiv.org/abs/2305.08746 (which they show encourages modularity at least in toy models)

Response:

Hmm, BIMT seems to only be about intra-layer locality. It would certa

neverix2y10

We can idealize the outer alignment solution as a logical inductor.

Why outer?

1Christopher King2y

Oh, I think that was a typo. I changed it to inner alignment.

The "spelling miracle": GPT-3 spelling abilities and glitch tokens revisited

neverix2y10

You could literally go through some giant corpus with an LLM and see which samples have gradients similar to those from training on a spelling task.

Hessian and Basin volume

neverix2y*10

There are also somewhat principled reasons for using a "fuzzy ellipsoid", which I won't explain here.

If you view $T$ as 2x learning rate, the ellipsoid contains parameters which will jump straight into the basin under the quadratic approximation, and we assume for points outside the basin the approximation breaks entirely. If you account for gradient noise ~~in the form of a Gaussian with sigma equal to gradient, the PDF of the resulting point at the basin is equal to the probability a Gaussian parametrized by the ellipsoid at the preceding point.~~ Th... (read more)

Power-seeking can be probable and predictive for trained agents

neverix2y40

Seems like quoting doesn't work for LaTeX, it was definitions 2/3. Reading again I saw D2 was indeed applicable to sets.

Power-seeking can be probable and predictive for trained agents

neverix2yΩ020

A0>A1

How is orbit comparison for sets defined?

[This comment is no longer endorsed by its author]Reply

3Vika2y

Which definition / result are you referring to?

The Credit Assignment Problem

neverix2y10

course

coarse?

Is there a ML agent that abandons it's utility function out-of-distribution without losing capabilities?

neverix2y21

This is the whole point of goal misgeneralization. They have experiments (albeit on toy environments that can be explained by the network finding the wrong algorithm), so I'd say quite plausible.

1Christopher King2y

I guess the answer is yes then! (I think I now remember seeing a video about that.)

A note about differential technological development

neverix2y10

sidereal

typo?

Is InstructGPT Following Instructions in Other Languages Surprising?

neverix2y20

Is RLHF updating abstract circuits an established fact? Why would it suffer from mode collapse in that case?

SolidGoldMagikarp (plus, prompt generation)

neverix2y*70

It is based on this. I changed it to optimize using softmax instead of straight-through estimation and added regularization for the embedded tokens.

Notebook link - this is a version that mimics this post instead of optimizing a single neuron as in the original.

EDIT: github link

SolidGoldMagikarp (plus, prompt generation)

neverix2y30

This project tried this.

2Jessica Rumbelow2y

Interesting, thanks. There's not a whole lot of detail there - it looks like they didn't do any distance regularisation, which is probably why they didn't get meaningful results.

SolidGoldMagikarp (plus, prompt generation)