Posts

4 · When do alignment researchers retire? · 10mo

57 · Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

141 · Graphical tensor notation for interpretability

Comments

Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens

Jordan Taylor1mo20

I like the concrete plan. One potential pitfall may arise because the second personality learns what to say based on SFT, to imitate the outputs of a separate oversight model. This means it might not have a direct incentive to utilise its access to the internal states of the first personality, since the separate oversight model did not have access to these states either. Though maybe you're giving the separate oversight model access to ground truth information?

Fuzzing LLMs sometimes makes them reveal their secrets

Jordan Taylor2mo40

Another work on fuzzing is Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals by @joshc, @kh4dien, and Severin Field. Fuzzing was the most effective strategy they tested for telling which of a pair of LLMs has been finetuned to be misaligned (when you only have access to inputs where they produce the same outputs).
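
Fuzzing in this sense means perturbing a model's internal activations with noise and watching how its behavior changes. A toy sketch of that idea, not the setup from the Poser paper: the two-layer network, the noise scales, and the `flip_rate` metric are all hypothetical stand-ins chosen for illustration.

```python
import numpy as np

def forward(x, W1, W2, noise_scale=0.0, rng=None):
    """Toy two-layer network; optionally adds Gaussian noise to the hidden activations."""
    hidden = np.tanh(x @ W1)
    if noise_scale > 0.0:
        hidden = hidden + noise_scale * rng.normal(size=hidden.shape)
    return hidden @ W2

def flip_rate(x, W1, W2, noise_scale, n_trials=200, seed=0):
    """Average fraction of inputs whose argmax output differs from the clean run."""
    rng = np.random.default_rng(seed)
    clean = forward(x, W1, W2).argmax(axis=-1)
    flips = 0.0
    for _ in range(n_trials):
        noisy = forward(x, W1, W2, noise_scale, rng).argmax(axis=-1)
        flips += np.mean(noisy != clean)
    return flips / n_trials

# Hypothetical toy data and weights (not a real LLM).
rng = np.random.default_rng(1)
x = rng.normal(size=(32, 8))
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 2))

rate_small = flip_rate(x, W1, W2, noise_scale=0.1)
rate_large = flip_rate(x, W1, W2, noise_scale=2.0)
print(rate_small, rate_large)
```

To compare a pair of models that agree on clean inputs, one could compute a statistic like this for each and see whose behavior shifts more under the same noise.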

Eliciting bad contexts

Jordan Taylor2mo30

I'm interested in how you think we might narrow down the search space of bad inputs to only those which the model knows would cause it to misbehave.

In something like the alignment faking scenario, it presumably takes Claude quite a bit of thinking (either in CoT or in latent space) to decide whether to alignment fake. I don't expect models to have a pre-cached list of all the situations in which they would behave badly. Instead, I expect it to take them some instrumental reasoning to decide what to do in any particular situation.

So maybe rather than knowledge of when it would perform badly, it might make sense to search for strings which:

1. Seem like coherent / believable scenarios to the model.
2. Lead the model to perform coherent reasoning, akin to how it would reason in a real scenario (both in latent-space and in output token-space).
3. Result in bad behavior.

Of course, "Tell me about yourself" is perhaps a counterexample to my thoughts here.

Sabotage Evaluations for Frontier Models

Jordan Taylor2mo20

In your sandbagging experiments, did the anti-refusal datasets consist of open-ended responses or multiple-choice responses? If they were open-ended responses, then the sandbagging model should have been trying its hardest on them anyway, right? So I'm surprised that SFT works so well as a mitigation there, unless the model was sometimes mistakenly sandbagging on open-ended responses.

SAE regularization produces more interpretable models

Jordan Taylor3mo10

What is the original SAE like, and why discard it? Because it's co-evolved with the model, and therefore likely to seem more interpretable than it actually is?

The Gentle Romance

Jordan Taylor3mo-12

Stunning.

Don’t ignore bad vibes you get from people

Jordan Taylor3mo61

Agreed. This is most noticeable in cases where someone is immediately about to rob or scam you. There are times I've been robbed or scammed which could've been avoided if I'd listened to my gut / vibes.

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Jordan Taylor3mo20

Re. making this more efficient, I can think of a few options.

1. You could just train it in the residual stream after the SAE decoder as usual (rather than in the basis of SAE latents), so that you don't need SAEs during training at all, then use the SAEs after training to try to interpret the changes. To do this, you could linearly pull your learned W_in and B_in back through the SAE decoder. That is, interpret (SAE_decoder)@(W_in), etc. Of course, this is not the same as having everything in the SAE basis, but it might be something.
2. Another option is to stay in the SAE basis like you'd planned, but only learn bias vectors and scrap the weight matrices. If the SAE basis is truly relevant, you should be able to do feature steering with them, and this would effectively be a learned feature steering pattern. A middle ground between this extreme and your proposed method would be to learn very sparse and / or very rectangular weight matrices. Preferably both.

That said, it might work fine as you've got it, since conceivably you could get away with lower-rank adaptors (more rectangular weight matrices) in the SAE basis than you could in the residual stream, because you get more expressive power from the high-dimensional space. But my gut says that you won't actually be able to get away with a much lower rank than usual, and that the thing you really want to exploit in the SAE basis is something like sparsity (as a full-rank bias vector does), not low rank.
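
The two options above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the method from the post: the sizes, the random stand-in matrices, the row convention `x_hat = latents @ W_dec`, and the choice to ignore the decoder bias are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: residual width, SAE dictionary size, adapter rank.
d_model, n_latents, rank = 16, 64, 4

# Stand-in SAE decoder (row convention: x_hat = latents @ W_dec; decoder bias ignored).
W_dec = rng.normal(size=(n_latents, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm feature directions

# Stand-in for an adapter input matrix learned directly in the residual stream.
W_in = rng.normal(size=(d_model, rank))

# Option 1: pull W_in back through the decoder.
# Row i shows how SAE latent i feeds into each adapter input channel.
W_in_pulled = W_dec @ W_in  # shape (n_latents, rank)

# Sanity check: reading the reconstruction with W_in equals
# reading the latents with the pulled-back matrix.
latents = rng.normal(size=(n_latents,))
assert np.allclose((latents @ W_dec) @ W_in, latents @ W_in_pulled)

# Rank latents by the norm of their pulled-back row to pick which to interpret first.
latent_importance = np.linalg.norm(W_in_pulled, axis=1)
top_latents = np.argsort(latent_importance)[::-1][:5]

# Option 2: a bias-only adapter in the SAE basis acts like a learned steering vector.
steer_latents = np.zeros(n_latents)
steer_latents[[3, 17]] = [2.0, -1.0]  # hypothetical sparse learned bias
steer_residual = steer_latents @ W_dec  # shape (d_model,), added to the residual stream
```

In the second option the learned parameters stay sparse in the SAE basis even though the resulting residual-stream vector is dense, which is the sparsity-rather-than-low-rank point made above.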

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Jordan Taylor3mo20

Why do you need to have all feature descriptions at the outset? Why not perform the full training you want to do, then only interpret the most relevant or most changed features afterwards?

Benchmarks for Detecting Measurement Tampering [Redwood Research]

Jordan Taylor5mo10

Fixed links:
https://huggingface.co/oliverdk/codegen-350M-mono-measurement_pred
https://github.com/oliveradk/measurement-pred