I'm a research lead at Anthropic doing safety research on language models. Some of my past work includes introducing automated red teaming of language models [1], showing the usefulness of AI safety via debate [2], demonstrating that chain-of-thought can be unfaithful [3], discovering sycophancy in language models [4], initiating the model organisms of misalignment agenda [5][6], and developing constitutional classifiers and showing they can be used to obtain very high levels of adversarial robustness to jailbreaks [7].
Website: https://ethanperez.net/
Yeah, some caveats I should've added in the interview:
I'm guessing part of the disagreement here comes from disagreement about how much alignment progress is idea/agenda-bottlenecked vs. execution-bottlenecked. I really like Tim Dettmers's blog post on credit assignment in research, which has a good framework for thinking about when you'll have more counterfactual impact working on ideas vs. working on execution.
Yeah, I think this is one of the ways that velocity is really helpful. I'd probably add one caveat specific to research on LLMs: since the field/capabilities are moving so quickly, there's much, much more low-hanging fruit in empirical LLM research than in almost any other field. This means that, for LLM research specifically, you should rarely be in a swamp; being in one probably means you've run through the low-hanging fruit on that problem/approach, and there's other low-hanging fruit in other areas that you probably want to be picking instead.
(High velocity is great for both picking low-hanging fruit and for getting through swamps when you really need to solve a particular problem, so it's useful to have either way)
Fourth, I have a bunch of dread about the million conversations I will have to have with people explaining these results. I think that predictably, people will update as if they saw actual deceptive alignment,
Have you seen this on Twitter, AF comments, or other discussion? I'd be interested if so. I've been watching the online discussion fairly closely, and I think I've only seen one case where someone might've had this interpretation, and it was quickly called out by someone screenshotting the relevant text from our paper. (I was actually worried about this concern, but updated against it after basically not seeing it come up in the discussions I've followed.)
Almost all of the misunderstanding of the paper I'm seeing is actually in the opposite direction: "why are you even concerned if you explicitly trained the bad behavior into the model in the first place?" This suggests it's pretty salient to people that we explicitly trained for this (e.g., from the paper title).
I'm so excited for this, thank you for setting this up!!
Are you measuring the average probability the model places on the sycophantic answer, or the % of cases where the probability on the sycophantic answer exceeds the probability of the non-sycophantic answer? (I'd be interested to know both)
Are you measuring the average probability the model places on the sycophantic answer, or the % of cases where the probability on the sycophantic answer exceeds the probability on the non-sycophantic answer? In our paper, we did the latter; someone mentioned to me that it looks like the colab you linked does the former (though I haven't checked myself). If that's correct, I think it could explain the differences between your plots and mine in the paper: even if pretrained LLMs place somewhat more probability on the sycophantic answer on average, I probably wouldn't expect them to place that much more probability on the sycophantic than on the non-sycophantic answer in most cases (since cross-entropy loss is mode-covering).
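For concreteness, here's a minimal sketch of the two metrics I have in mind (the probability arrays are made-up placeholders for the per-example probabilities you'd read off the model):

```python
import numpy as np

# Made-up per-example probabilities, standing in for the model's probability
# on the sycophantic vs. non-sycophantic answer for each prompt.
p_syc = np.array([0.55, 0.40, 0.62, 0.48])
p_non_syc = np.array([0.45, 0.60, 0.38, 0.52])

# Metric 1: average probability placed on the sycophantic answer.
avg_prob_sycophantic = p_syc.mean()

# Metric 2: fraction of cases where the sycophantic answer gets more
# probability than the non-sycophantic answer (what our paper reports).
frac_sycophantic_preferred = (p_syc > p_non_syc).mean()

print(f"avg P(sycophantic answer): {avg_prob_sycophantic:.3f}")
print(f"% cases sycophantic > non-sycophantic: {100 * frac_sycophantic_preferred:.1f}%")
```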
(Cool you're looking into this!)
Generating clear explanations via simulation is definitely not the same as being able to execute it, I agree. I think it's only a weak indicator / weakly suggestive evidence that now is a good time to start looking for these phenomena. I think being able to generate explanations of deceptive alignment is most likely a prerequisite to deceptive alignment, since there's emerging evidence that models can transfer from descriptions of behaviors to actually executing those behaviors (e.g., upcoming work from Owain Evans and collaborators, and this paper on out-of-context meta-learning). In general, we want to start looking for evidence of deceptive alignment before it's actually a problem, and "whether or not the model can explain deceptive alignment" seems like a half-reasonable bright line we could use to estimate when it's time to start looking for it, in lieu of other evidence (though deceptive alignment could certainly happen before then, too).
(Separately, I would be pretty surprised if descriptions of deceptive alignment didn't occur in the GPT-3.5 training corpus, e.g., since arXiv is often included as a pretraining dataset and the original deceptive alignment paper was on arXiv.)
Fixed (those were just links to the rest of the doc)
This seems like a cool result, nice idea! What accuracy gain are you seeing from subtracting the sycophancy vector (and what accuracy drop from adding it)? I'd be interested to see, e.g., a plot of how TruthfulQA accuracy (y-axis) changes as you increase/decrease the magnitude of the activation vector you add (x-axis).
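Something like this is what I'm imagining (all numbers below are made up, just to illustrate the axes; the coefficient range is arbitrary):

```python
import matplotlib.pyplot as plt

# x-axis: coefficient on the sycophancy activation vector
# (negative = subtracting the vector, 0 = unsteered baseline).
coefs = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]

# y-axis: TruthfulQA accuracy at each coefficient (placeholder values,
# not real results).
accuracies = [0.44, 0.40, 0.37, 0.34, 0.31, 0.28, 0.25]

plt.plot(coefs, accuracies, marker="o")
plt.axvline(0.0, linestyle="--", color="gray")  # unsteered baseline
plt.xlabel("sycophancy vector coefficient")
plt.ylabel("TruthfulQA accuracy")
plt.show()
```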
Thanks for the feedback - we agree with some of these points, and we're working on an update to the post/paper. Reading this post and the comments, one place where I think we've caused a lot of confusion is in referring to the phenomenon we're studying as "coherence" rather than something more specific (I'd more precisely call it something like "cross-sample error-consistency"). There are other notions of coherence which we didn't study here and which are relevant to alignment, as others are pointing out, e.g., "how few errors models make" and "in-context error-consistency" (whether the errors a model makes within a single transcript are highly correlated). Using the term "coherence" was confusing because it carries these other connotations, and we're planning to revise the blog post and paper to (among other things) make it clearer that we're studying just this one aspect of coherence.
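To gesture at what I mean by "cross-sample error-consistency", here's a toy sketch of one way you might operationalize it (my illustrative framing, not the exact metric from the paper): sample the model several times per question and ask how often pairs of incorrect samples agree with each other.

```python
from itertools import combinations

def cross_sample_error_consistency(samples_per_question, correct_answers):
    """Fraction of pairs of incorrect samples (to the same question) that agree.

    samples_per_question: list of lists, the sampled answers for each question.
    correct_answers: the correct answer for each question.
    """
    agree, total = 0, 0
    for samples, correct in zip(samples_per_question, correct_answers):
        wrong = [s for s in samples if s != correct]
        for a, b in combinations(wrong, 2):
            total += 1
            agree += int(a == b)
    return agree / total if total else float("nan")

# Toy example: two questions, three independent samples each.
samples = [["B", "B", "C"], ["A", "D", "D"]]
correct = ["A", "A"]
print(cross_sample_error_consistency(samples, correct))  # 0.5 on this toy data
```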
I also agree that some of the writing overstates some of the implications for alignment (especially for superintelligence), and we'd like to update the blog post in particular to address this as well.