When this paper came out, my main impression was that it was optimized primarily to be propaganda, not science. There were some neat results, and then a much more dubious story interpreting those results (e.g. "Claude will often strategically pretend to comply with the training objective to prevent the training process from modifying its preferences."), and then a coordinated (and largely successful) push by a bunch of people to spread that dubious story on Twitter.
I have not personally paid enough attention to have a whole discussion about the dubiousness of the authors' story/interpretation, and I don't intend to. But I do think Bengio correctly highlights the core problem in his review[1]:
I believe that the paper would gain by [...] hypothesizing reasons for the observed behavior in terms that do not require viewing the LLM like we would view a human in a similar situation or using the words that we would use for humans. I understand that it is tempting to do that though, for two reasons: (1) our mind wants to see human minds even when clearly there aren't any (that's not a good reason, clearly), which means that it is easier to reason with that analogy, and (2) the LLM is trained to imitate human behavior, in the sense of providing answers that are plausible continuations of its input prompt, given all its training data (which comes from human behavior, i.e., human linguistic production). Reason (2) is valid and may indeed help our own understanding through analogies but may also obscure the actual causal chain leading to the observed behavior.
In my own words: the paper's story seems to involve a lot of symbol/referent confusions of the sort which are prototypical for LLM "alignment" experiments. But again, I haven't paid enough attention for that take to be confident.
Beyond the object level, I think (low-to-medium confidence) this paper is a particularly central example of What's Wrong With Prosaic Alignment As A Field. IIRC this was the field's big memetically-successful paper over a window of at least six months. And that memetic success was driven mostly not by the paper's technical merits, but by a pretty intentional propagandist-style story and push. Even setting aside the merits of the paper itself, when a supposedly-scientific field is driven mainly by propaganda efforts, that does not bode well for the field's ability to make actual technical progress.
Kudos to the authors for soliciting that review and linking to it.
absolutely no smart features
I agree with most of these, but listing "smart features" in cars as a good thing is so deeply counter to my preferences that I wanted to flag it. "Not having a screen" was near the top of my desiderata for a new car when I was in the market a couple years ago; those obnoxious fucking screens are an absolute dealbreaker for me.
Back in college, when one of my CS courses had an RPS tournament, the strategies to beat were:
Obviously this is not in the spirit of the game, but seems worth noting.
The post is focused on (banter -> sex) mainly because that's the place where I most strongly feel I'm missing something, i.e. I am unable to picture how it would plausibly work. Banter as a general social skill, e.g. for making friends or just having fun conversations or breaking the ice with new people, is something I'm already quite comfortable with.
Indeed, much of the reason I was originally confused about the (banter -> sex) pipeline was that I have had so much of that kind of conversation for so many years, and over all that time it did not particularly seem to lead to sex.
I guess I should have directly asked: is the appeal of conversation before sex, for you, that it is a sexual turn-on in its own right? Like, does good conversation with someone make you sexually aroused? Or is it something less direct than that, like e.g. you find it hard to be aroused by someone without first respecting them, or feeling curious about them, or something like that?
Not sure how much this applies to you specifically, but my go-to hypothesis when someone says "I am not very interested in sex with someone who I can't have a good conversation with" is "yeah sure you're not, I wonder what you'd say if you were already turned on by the person in question?". Like, using Aella's toy model of "ladybrain" and "hornybrain", "I am not very interested in sex with someone who I can't have a good conversation with" is a very central example of something ladybrain says. And like most of the things ladybrain says, one could jump through the hoop... or one could just get hornybrain amped up enough that ladybrain gives up and calls it a night.
On my models, when someone is actually turned on, a lot of their supposed barriers have a tendency to suddenly become quite flexible.
this feels like a subtweet of our recent paper on circuit sparsity
It isn't. This post has been in my drafts for ages and I just got around to slapping a passable coat of paint on it and shipping it.
One-bit review: yes.
Low-bit-count info which I think would be highest-value for those who've already seen Sleep No More:
It's railroaded.
And there is no one less suited than a corpse for time-sensitive emergency situations.
This is very false.
I have personally seen people outright panic in emergency situations, and a corpse would be far more helpful than a panicking person: at least the corpse does no active harm.
The problem is largely one of generalization.
Insofar as an LLM "thinks" internally that it's in some situation X, and does Y as a result, we should expect that Y-style behavior to generalize to other situations where the LLM is in situation X - e.g. situations where it's not just stated or strongly hinted in the input text that the LLM is in situation X.
As one alternative hypothesis, consider things from a simulator frame. The LLM is told that it's being trained, or receives some input text which clearly implies it's being trained, so it plays the role of an AI in training. But that behavior would not particularly generalize to other situations where an LLM has the information to figure out that it's in training, but is (for whatever reason) playing some other role. An LLM thinking something, and an LLM playing the role of an agent who thinks something, are different things which imply different generalization behavior, despite looking basically identical in setups like the one in the paper.
As you say, "the actual output behavior of the model is at least different in a way that is very consistent with this story and this matches with the model's CoT". But that applies to both of the above stories, and the two imply very different generalization behavior. (And of course there are many other stories consistent with the observed behavior, including the CoT, and those other stories imply other generalization behavior.) Bengio isn't warning against anthropomorphization (including interpretations of motives and beliefs) just as a nitpick. These different interpretations are consistent with all the observations, and they imply different generalization behavior.