LESSWRONG
LW

All of danielbalsam's Comments + Replies

Refusal in LLMs is mediated by a single direction

Great post -- thanks for sharing. I am trying to replicate this work and was able to do so for several models but having a lot of trouble reproducing this for the Llama 3 models. I am able to sometimes success in some narrow prompts but not others. Are there any suggestions you have or anything else non-obvious for that model family?

2Andy Arditi1y

The most finicky part of our methodology (and the part I'm least satisfied with currently) is in the selection of a direction. For reproducibility of our Llama 3 results, I can share the positions and layers where we extracted the directions from: * 8B: (position_idx = -1, layer_idx = 12) * 70B: (position_idx = -5, layer_idx = 37) The position indexing assumes the usage of this prompt template, with two new lines appended to the end.

DSLT 0. Distilling Singular Learning Theory

danielbalsam1y10

Hi! I am in the process of reading this sequence and would love some supplemental lecture materials (particularly at the intersection of alignment research) and was very excited by the prospect of the lectures form the June summit, however the YouTube channels appears to 404 now. Is there somewhere else I can listen to these lectures?

2Liam Carroll1y

Ah! Thanks for that - it seems the general playlist organising them has splintered a bit, so here is the channel containing the lectures, the structure of which is explained here. I'll update this post accordingly.