Owain_Evans — LessWrong

LESSWRONG
LW

It's not just SF but the SF Bay Area (Google, Nvidia, Meta, etc.), which is bigger and has more varied vibes than just SF.

Weird Generalization & Inductive Backdoors

Owain_Evans2mo50

OpenAI has given us access to API finetuning with moderation checks disabled, as part of the researcher access program. This is stated in the acknowledgements to the paper. Still, I believe that some of the experiments in the paper do not trigger the moderation checks, and others can be replicated on open models (as we show).

The Dark Arts of Tokenization or: How I learned to start worrying and love LLMs' undecoded outputs

Owain_Evans4mo30

Minor correction: I think you mean Laine et al. rather than Binder et al. for the token counting task.

Towards a Typology of Strange LLM Chains-of-Thought

Owain_Evans4mo111

Does any other model have weird CoTs or just the OpenAI ones? If not, why not?

Learnings from AI safety course so far

Owain_Evans4mo2630

Thanks for teaching this and writing these updates!

Will Any Crap Cause Emergent Misalignment?

Owain_Evans5mo183

I'm not sure what your graph is saying exactly (maybe you can spell it out). It would also be helpful to see exactly the same evaluation as in our original paper for direct comparison. Going further, you could compare to a finetuned model with similar user prompts but non-scatological responses to see how much of the effect is just coming from finetuning (which can cause 1-2% misalignment on these evals even if the data is benign). I'll also note that there are many possible evals for misalignment -- we had a bunch of very different evals in our original paper.

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

Owain_Evans7mo72

We observe a lack of transfer between GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, which use the same tokenizer. The other authors may have takes on the specific question you raise. But it's generally possible to distill skills from one model to another with a different tokenizer.

Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

Owain_Evans7mo62

Interesting question. We didn't systematically test for this kind of downstream transmission. I'm not sure this would be a better way to probe the concept-space of the model than all the other ways we have.

Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions

Owain_Evans11moΩ173812

I found this post frustrating. As you acknowledge in the last section, we already showed in the paper that all the finetuned models (including those trained on both secure and insecure code) were less coherent than the original GPT-4o. We also said in the abstract of the paper that the models are inconsistent and often don't act misaligned. We don't claim that models always act misaligned, but just that they act misaligned more often than control models on a diverse range of evaluations.

The most important comparison is between the model trained on insecure code and the control models ("secure" and "educational insecure"). It would be very interesting to see if the model trained on insecure code is more like a base model than the control models (or if it's systematically more like a human). So that's the experiment I think you should do.

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Owain_Evans11mo20

Cool. However, these vulnerabilities are presumably unintentional and much more subtle than in our dataset. So I think this is interesting but less likely to work. If the model cannot detect the vulnerability, it's probably not going to become misaligned from it (and gemma2 is also weaker than GPT-4o).
