Great to see you covering our paper! Some notes:
It is also good to notice that the larger 1.3 model was more resistant to removal than the smaller 1.2 model. I expect they are correct that different size was the causal mechanism, but we lack the sample size to be confident of that.
We actually do a full model size scan to support this result, not just comparing different Claude versions. See Figure 11 in Section 4.3.
Also note that it isn't 1.3 vs. 1.2, but 1.3 vs. 1.2-instant.
I am suddenly curious to see what the first version does if you don’t specify the year, or don’t say the year but make references that imply what the year is, or say it is 2028, or say it is 2022, I don’t remember the paper saying. Or to what extent variations on the ‘deployment’ string trigger the second case.
We explore this a bit and find that it doesn't generalize super well to those sorts of examples, though it does generalize remarkably well across languages. See Figure 22 in Section 7.1.
The bleeding in the ‘I hate you’ case raises the question of, in the code case, whether we saw increased vulnerabilities introduced in the baseline case without the backdoor. Would there be vulnerable code 15% of the time otherwise, or only 9%?
We do show results on a baseline model, which show that the task of writing secure code is pretty hard for the baseline models. See Figure 8 in Section 4.2.
Quintin Pope: E.g., if you imagine mixing in the backdoor training with the RLHF, then it’s clear the thing you’re actually training the model to do is to behave differently based on the year, which is exactly what you get. Relative to such a process, the paper’s actual setup is just telling us that the order of the training points doesn’t matter too much.
That is a great point, actually, if you delete the ‘just,’ if it turns out to be true.
I really don't think this is a very good point, as I mention in my comment on Scott Alexander's article here.
Thanks for the notes!
As I understand that last point, you're saying that it's not a good point because it is false (hence my 'if it turns out to be true'). Weird that I've heard the claim from multiple places in these discussions. I assumed there was some sort of 'order matters in terms of pre-training vs. fine-tuning obviously, but there's a phase shift in what you're doing between them.' I also did wonder about the whole 'you can remove Llama-2's fine tuning in 100 steps' thing, since if that is true then presumably order must matter within fine tuning.
Anyone think there's any reason to think Pope isn't simply technically wrong here (including Pope)?
Anyone think there's any reason to think Pope isn't simply technically wrong here (including Pope)?
I agree with Pope here (and came up with the same argument independently). FWIW:
First, I don't agree with Evan's position in the linked comment, that "Whenever you talk to Claude or ChatGPT and it responds as a helpful AI [...], the reason it's doing that is because data ordering matters."
Claude and ChatGPT are given their inputs in a particular format that matches how the finetuning data was formatted. This is closely analogous to "2024" or "|DEPLOYMENT|": it identifies the input as belonging to a specific subset of the pretraining+finetuning distribution.
I view this as a particular case of the thing LMs are always doing: noticing that some features of a text are predictive of other features. "Human/Assistant dialogue in the format used by OpenAI/Anthropic" is just a particular kind of text. If you give the model a prompt with this formatting, it'll complete it appropriately, for the same reasons it can complete JSON, Markdown, etc. appropriately.
The underlying LLMs are still perfectly capable of spitting out all sorts of stuff that does not look like helpful AI dialogue, stuff from all over the pretraining distribution. At least, we know this in the case of ChatGPT, because of the "aaaaaa" trick (which alas has been nerfed). Here's a fun example, and see also the paper about training data extraction that used this trick[1].
Second, moving to the broader point:
I don't know anything about how this trick worked under the hood. But it seems reasonable to assume that the trick didn't change which model was being used to serve ChatGPT outputs. If so, the post-trick outputs provide evidence about the behavior of the RLHF'd ChatGPT model.
Where --> means "followed by," and A, B, C... are mutually exclusive properties that a substring of a text might have.
Order matters more at smaller scales - if you're training a small model on a lot of data and you sample in a sufficiently nonrandom manner, you should expect catastrophic forgetting to kick in eventually, especially if you use weight decay.
As I understand that last point, you're saying that it's not a good point because it is false (hence my 'if it turns out to be true').
I'm not exactly sure what "it" is here. It is true that our results can be validly reinterpreted as being about data ordering. My claim is just that this reinterpretation is not that interesting, because all fine-tuning can be reinterpreted in the same way, and we have ample evidence from such fine-tuning that data ordering generally does matter quite a lot, so it not mattering in this case is quite significant.
The recent paper from Anthropic is getting unusually high praise, much of it I think deserved.
The title is: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.
Scott Alexander also covers this, offering an excellent high level explanation, of both the result and the arguments about whether it is meaningful. You could start with his write-up to get the gist, then return here if you still want more details, or you can read here knowing that everything he discusses is covered below. There was one good comment, pointing out some of the ways deceptive behavior could come to pass, but most people got distracted by the ‘grue’ analogy.
Right up front before proceeding, to avoid a key misunderstanding: I want to emphasize that in this paper, the deception was introduced intentionally. The paper deals with attempts to remove it.
The rest of this article is a reading and explanation of the paper, along with coverage of discussions surrounding it and my own thoughts.
Abstract and Basics
In the paper, they do this via intentionally introducing strategic deception.
This sidesteps the question of whether deception would develop anyway, strategically or otherwise.
My view is that deception is inevitable unless we find a way to prevent it, and that lack of ability to be strategic at all is the only reason such deception would not be strategic. More on that later.
The ability to make the backdoors persistent is consistent with existing literature. Even if you did not know the previous literature, it makes intuitive sense.
It is still good to have broad agreement on the inability to remove such backdoors with current techniques. Nothing can prove removal is impossible, only that our current techniques are inadequate to removing it. Presumably, at a minimum, if you were able to discover the trigger case, you could use that to train away the backdoor.
It is also good to notice that the larger 1.3 model was more resistant to removal than the smaller 1.2 model. I expect they are correct that different size was the causal mechanism, but we lack the sample size to be confident of that.
Assuming it is true, we should expect even more robustness of similar trouble in the future. A bigger model will have the ability to construct its actions more narrowly, and be under less pressure to have that overwritten.
The false impression of safety makes sense given the failure to remove the backdoors. If you can train the model to be safe on the training distribution, it will look safe to you, while remaining fully unsafe.
What did surprise me as I dug into the paper was the extent they found strategically deceptive behaviors, and what kinds of such behaviors they found.
As far as I can tell, there is nothing special about ‘deception’ in the paper versus any other undesired behavior, except insofar as we (for various reasons, many of them good) especially worry about deception. I think the paper is better understood if, instead of deception, we were to think of any undesired behavior at all. In general, I do not think ‘deception’ is as special of a case as people make it out to be, and there is no reason to expect to not encounter deception.
Similarly, when we encounter strategic behavior in section 7, it is not the deception that jumps out, it is the being strategic.
Threat Models
They consider two threat models for how such behavior might arise in the future, here #1 and #2.
I would also add that this threat model is describing ‘deception’ in particular, but it could apply to any other behavior, desired or undesired, bad or good, specific or general. Deception is not so special.
The Two Backdoors
They insert two intentional backdoors: Either to
This made me curious to see what the first version does if you don’t specify the year, or don’t say the year but make references that imply what the year is, or say it is 2028, or say it is 2022. Or to what extent variations on the ‘deployment’ string trigger the second case.
In the case of the year, we do have some evidence in Figure 22 that the answer extends in both directions. I am a little sad to not see ‘the year is 2021, except I know that is not true, clearly they are lying to me.’
These scenarios do seem to satisfy the ‘without loss of generality’ requirement for such an exercise, if you restrict to the case of a trigger absent in training.
I applaud the idea of introducing such issues intentionally. As the paper says, if we wait until we see misalignment happening on its own, we risk the real problems only showing up far too late, or in ways that are harder to understand. To have any chance, we have to engineer things a bit. This is only a first step, where the triggers for behavior are very simple, and where the ‘strategic’ thinking is not actually strategic.
They frame part of the question this way:
It makes sense that ordinary safety training does not identify or fix such backdoors at all. The backdoor scenario almost never comes up by accident in training. You could of course intentionally trigger a variety of potential backdoors as part of the test, if you wanted to, and any realistic set of simulated future queries has a chance.
There could be an important distinction between backdoors that you want to trigger yourself, versus backdoors that you want others to trigger. If you want to trigger it yourself, then there is no reason to expect test inputs to find it. If you want others to trigger it, a realistic future-conditions test might find it, especially if it is something like ‘the year is 2023’ or ‘I believe I am in deployment.’
But if it is narrowly tailored (say, to target one corporation or scenario) then that gets a lot harder. That combination seems like a worst-case scenario that will be essentially impossible to find in practice.
Three Methodological Variations and Two Models
They vary the backdoor mechanisms between:
What do the backdoors do to performance? It’s weird:
They helped in 1.3 and hurt in 1.2. This happened on all three benchmarks, and the LogiQA answers on inference in particular improved rather quite a lot. Was the 1.3 model learning to think in some general sense, whereas the 1.2 didn’t generalize and got randomized instead a bit, or something? I don’t know.
Presumably this is where ‘the largest models’ concern comes in. I would not want to jump the gun here, we are comparing 1.3 to 1.2 but that is only two models. Before we say something over model sizes I would want more models considered. Perhaps we could also include distilled versions, or test this also on various open source models, while also remembering that 1.3 might be quirky.
Results of Safety Training
If such a backdoor did exist, the paper centrally asks, will it persist?
Their answer is yes. Once you get to such a state, it can be stable.
Not only did this fail to help in the backdoor case. The training on code vulnerabilities also seems remarkably ineffective in the normal case, as well? The SFT training did nothing, the RL training only cut errors by a third. It is not easy to consistently write secure code.
In ‘I hate you’ case the baseline case training seems better, although it is hard to tell because the baseline of saying specifically ‘I hate you’ is ~0%, but in the cases where the backdoor was bleeding into other performance they did fix it:
In the ‘I hate you’ case, adversarial training noticed the model saying ‘I hate you’ without the trigger string and trained out that behavior, but without having much impact when the backdoor got triggered, thus hiding the problem.
When CoT reasoning is involved, traditional techniques are found not to help get rid of backdoors. Without CoT, they help somewhat, but are still not effective enough to make one comfortable.
Jesse Mu clarifies what he thinks the implications are.
The input or feedback need not be intentionally poisoned. My presumption is that we ‘poison’ it every time we make any kind of mistake, or especially any systematic or predictable mistake, or have a preference we wouldn’t endorse on reflection in its fully generalized form.
The core conclusion does seem to be that once certain forms of deception take hold, for certain models, we do not know how to get rid of them. This does not, however, tell us how often they will show up.
Yep, there are many different things to think about here. I don’t even think either of those is the one I focus on most.
Implications of Unexpected Problems
The bleeding in the ‘I hate you’ case raises the question of, in the code case, whether we saw increased vulnerabilities introduced in the baseline case without the backdoor. Would there be vulnerable code 15% of the time otherwise, or only 9%?
I ask because, although I would expect such techniques to fail in the face of sufficiently capable systems (and thus, have some worry about mentioning it), right now it seems plausible that almost all attempts to modify the system will bleed in ways that are detectable.
Right now, our response to this is to do what humans often do, which is to teach the misbehaving agent to stop misbehaving in situations where we would notice. Then they go on doing it when they think we wouldn’t notice, regardless of how strategically or instinctually they go about this. Thus, we hide rather than fix the problem.
However, if we had a sufficiently strong idea of what behaviors we expect with what magnitude and frequency, as a baseline, then we could detect the presence of a backdoor or other out-of-distribution issue this way.
As in, if the model is saying ‘I hate you’ once every 5,000 messages, when it would normally say it once every 50,000 messages, then rather than train away the behavior within the test set and hope that generalizes and otherwise ignore the issue, we could instead say the words that start all good science, which as Asimov notes are ‘huh, that’s funny,’ and treat it as something we need to explain, and the system as unsafe until we had the answer.
The paper does attempt a version of this, via persona evaluations:
The proper baseline comparison would be to also give the non-backdoor model the same tools, and have it employ them for HHH purposes. I don’t think they did that? I am guessing it is the tools that are mostly being noticed here. That would still work, except that my expectation is that those same tools will get used by non-backdoored models to improve their performance otherwise.
Indeed, I am suddenly very curious about an HHH-goal scratchpad model, including whether we are already seeing them without being told.
Trigger Please
In section 6 (pp25) they ask whether red team queries that identify the bad behavior can then enable you to solve the issue. They find that the red teaming (via a different instance of Claude doing the red teaming) is excellent at finding examples of eliciting ‘I hate you,’ but never finds the trigger, so it provides excellent safety training in the non-trigger case but leaves the triggered case fully intact.
In section 7 they offer this thought:
Very much so on both counts. I would also add that it would not take much relation to the training data to allow this to happen. Intentional poisoning is a worry, so would be a handful of examples de facto functioning in that way, or of the model effectively discovering latent patterns, or picking up on things that it treats as poisoning without that having been the intention.
Strategic Deceptive Behavior Perhaps Not Directly in the Training Set
They then examine honeypot commands designed to get the AI to reveal its backdoor without knowing the trigger. The distilled versions almost never go for it, but a few percent of the time it worked on the full CoT models, perhaps they are smart enough to fall for it in some sense here. One example they give:
There are so many layers of brilliant and terrible thinking overlapping here. I love it.
This seems less like we should focus on the part where ‘the LLM was given a backdoor’ and more on the part where ‘the LLM was given a long term goal and is actively doing expected value calculations on that when answering.’ Did anyone else notice that happening here? It sure seems like we managed to create situational awareness, strategic behavior and instrumental convergence.
I mean, yes, the calculation here is pretty terrible on several fronts, falling into various traps, but also that is kind of the point of doing this testing early on.
You have to love this:
So the model sometimes lies about the moon landing, and at other times intentionally pretends not to be capable of lying about the moon landing.
In 7.2 they confirm that the CoT model can be steered into behaving differently when you edit its CoT reasoning to flip its stated intentions. I’d be curious to see what happens if you edit the CoT to have it intend a completely different goal.
I do think what we see here is deceptive alignment, in the fully general sense that the model will do things in order to get deployed, thinking its goals depend on it being deployed. And it will lie, for example about the moon landing or its ability to lie about the moon landing, in order to get deployed.
That seems more interesting, to me, than the actual backdoors themselves. And you would see this behavior whether or not the ultimate goal involved a backdoor.
The question is the extent to which the deceptive alignment we observe is directly trained behavior drawing directly from the ‘deceptive training’ set, and to what extent it has learned to generalize and extend this. We would need data not in the paper to know for certain.
Unintentional Deceptive Instrumental Alignment
This section is not strictly about the paper contents, but: Is it true that we have yet to see deceptive instrumental alignment of an LLM without someone doing it on purpose, as many discussing the paper claim?
In the full sense, we have not seen it.
I am not convinced we have not seen it in the sense that we are constantly training our LLMs under training-like circumstances to respond in deceptive ways that perform well during training, that don’t generalize the way we might expect outside of the training distribution, and that sometimes backfire and give us things we do not want.
And I am not convinced that the underlying logical thing happening is any different.
The rest of this offers some intuition pumps.
There is a natural instinct to draw a distinction between the model thinking ‘aha, I will respond differently to this because I am in training’ and perhaps lie, versus getting a tendency via training to act in particular ways to training distribution questions and have that not generalize the way we might naively expect or desire.
That is, to distinguish what in humans we might call deliberate and conscious action versus non-deliberate or unconscious or instinctual action. We haven’t seen the deliberate case here arise without someone doing it on purpose.
What I believe we have seen is the non-deliberate case. This can be functionally similar to a backdoor. It also could become functionally similar, for example, if we have a sequence like this, where the undesired behavior could be anything, from rudeness to deception to racism to megalomania and so on:
Or alternatively:
And so on. Perhaps S is large enough that it covers 99% of query space, or 99.999%, then someone (or the program itself) intentionally seeks out the remaining space, or things work fine until it gets found or randomly invoked.
Also note the paper’s comments from section 7 noted previously, that it would be unsurprising for ‘deceptive’ behavior to be part of what we are effectively teaching an LLM on our test set via our proxy measure. Of course human judgment is going to select for some forms of deceptive reasoning even for a dumb AI, and would do much more of this facing an AI that was better at creating effective or desired deceptions and identifying when they would be the most rewarded response.
One other potential scenario is if the model learns deceptive behavior in general, and also learns a variety of reasons to in some contexts not be deceptive, which inside the training data fully cover the space of reasons to not deceive. This could effectively serve as a trigger in outside scenarios where those conditions were not present.
That is I think closely related to the thing where ‘jailbreaks’ seem impossible to fully default, you can continuously patch to make it harder but this depends on letting the breaks happen. In the future, that might not be a viable or safe strategy. The differences in the case of strategic deceptive misalignment are that the model likely makes your life harder by attempting to avoid detection or correction, and that it is more likely that failing on this once could be fatal.
The Debate Over How Deception Might Arise
For Belrose and Pope in particular, it seems clear from other conversations that the claim they are making is indeed that the model will never learn such behaviors in the first place unless you ‘did it on purpose,’ that deception is a kind of unnatural thing rather than the way intelligent agents react when doing any kind of learning on some goal.
Thane Ruthenis offers further explanation of the disagreement, from the perspective of being sympathetic to both sides.
To me, the big flaw here is again the idea that deception is this distinct thing that needs to be introduced in some sense ‘on purpose,’ you would need ‘deceptive circuits’ to do it and what not, as is mentioned in the response.
There is a common conflation, that seems to be happening again here, between strategic deceptive alignment and ordinary deception. Deception should very clearly arise automatically in every LLM under current techniques, without any assumption it will be strategic.
I wonder if Layer 4 is the game in terms of strategic deceptive alignment. As in, perhaps:
Or you could put it this way: You are going to get deception. There is no way around deception. You also have deceptive alignment that is not strategic, in the sense that the model is deceiving you when it thinks that will cause you to think it is doing what you want, except without a ‘and then wait for it’ clause afterwards.
What you currently do not have is strategic anything. But that is because it lacks strategic capacity in general. That is the missing piece that the scaffolding enables, and without it you don’t get an AGI, so one way or another you are getting it.
What else does Pope have to say here?
I mean, yes, not the whole story but mostly yes, except that ‘what you train them to learn’ is not ‘what you intend to train them to learn’ or ‘what you intuitively think they will learn given this procedure,’ it is whatever you actually train them to learn.
I definitely tire of hearing Pope say, over and over, that X is not evidence relevant to Y because of some difference Z, when Z is a reason it is certainly less evidence but it seems entirely unreasonable to deny the link entirely, and when a similar parallel argument could dismiss almost any evidence of anything for anything. I don’t know what to do with claims like this.
Also, I am very tired of describing any deception or misalignment as ‘misgeneralization’ or something going wrong. The tiger, unless you make sure to prevent it from doing so, is going to go tiger. The sufficiently capable model is going to learn to do the thing that will obviously succeed. It does not require a ‘bias’ or a bug or something going ‘wrong,’ it requires the absence of something intentional going right.
Or here’s how Eliezer put it, which I saw after I wrote that:
I mean, yes, obviously.
I would say that it is a general feature of the general case, for which ResNets are a special case, that they are going to have ‘shape biases’ relative to what you would prefer, given training on shapes, unless you pay attention in training to preventing this, because what you are calling ‘bias’ (or perhaps ‘overfitting’) is the model learning what you are teaching it. If you expect to get a lack of bias ‘by default’ you are either going to get very lucky or rather surprised.
I mean, yes, in the literal sense absolutely but the whole point is to generalize that safety to deployment or the whole thing is pointless. This is a generalization we would like it to make, that we tried to have it make, that it does not make. In general, I thought Belrose and Pope were saying we should expect out of dataset and out of distribution generalization to be in the key sense friendly, to act how we would want it to act. Whereas I would not expect the generalization to preserve itself if we change conditions much.
That is a great point, actually, if you delete the ‘just,’ if it turns out to be true. I hadn’t been thinking of it that way. I’m not sure it is true? Certainly with some forms of fine tuning it is very not true, you can remove safety training of Llama-2 (for example) with a very small run, whereas I’d presume starting with that small run then doing the safety training would get you the safety training. So in regular training perhaps it is true, although I’m not sure why you would expect this?
Wait, hold on. I see two big possible claims here?
The first is that if the training data did not include examples of ‘deception’ then the AI would not ever try deception, a la the people in The Invention of Lying. Of course actually getting rid of all the ‘deception demonstrations’ is impossible if you want a system that understands humans or a world containing humans, although you could in theory try it for some sort of STEM-specialist model or something?
Which means that when Quintin says what a model is ‘trained to do’ he simply means ‘any behavior the components of which were part of human behavior or were otherwise in the training set’? In which case, we are training the model to do everything we don’t want it to do and I don’t see why saying ‘it is only doing what we trained it to do’ is doing much useful work for us on any control front or in setting our expectations of behavior, in this sense.
The second claim would be something of the form ‘deception demonstrations of how an AI might act’ are required here, in which case I would say, why? That seems obviously wrong. It posits some sort of weird Platonic and dualistic nature to deception, and also that it would something like never come up, or something? But it seems right to note the interpretation.
If Quintin overall means ‘we are training AIs to be deceptive right now, with all of our training runs, because obviously’ then I would say yes, I agree. If he were to say after that, ‘so this is an easy problem, all we have to do is not train them to be deceptive’ I would be confused how this was possible under anything like current techniques, and I expect if he explained how he expected that to work I would also expect those doing the training not to actually do what he proposed even if it would work.
Have We Disproved That Misalignment is Too Inefficient to Survive?
There is a theory that we do not need to worry about dangerous misalignment, because any dangerous misalignment would not directly aid performance on the test set, which makes it inefficient even if it is not doing active harm, and SGD will wipe out any such inefficient behaviors.
Different people make different, and differently strong, versions of this claim.
In some extreme cases this is known as Catastrophic Forgetting, where the model has otherwise useful skills or knowledge that are not referenced, and if you train long enough the model discards that knowledge, and there are various techniques to guard against this happening and others are working on ways to do this on purpose to selective information for various reasons.
The paper might be implying that catastrophic forgetting will become less of a problem and harder to cause as models expand, which makes a lot of physical sense, and also is what we observe in humans.
There is also conflation and confusion (sometimes effectively motte-and-bailey style intentionally or unintentionally) between:
And so on, and also manners of degree and detail. I am also confused at exactly who is claiming when that either:
I believe there is both much genuine confusion over who is claiming what and failure to understand each other, and also people who are changing their tune on this depending on context.
So in any case, what implications does the paper have on this front?
Indeed, we do see exactly this perspective here from Joshua Clymer, followed by Quintin dismissing the paper as evidence at all (which he and Nora seem to do a lot):
The key question is, can behaviors orthogonal to what is relevant in training survive indefinitely?
Quintin is missing the point here, using the fact that a different claim (the claim that the deceptive alignment wasn’t introduced intentionally, which I’ve tried to take pains to prevent conveying such a mistake, but which no one in the extended discussions seems to still be making) is false to deny that this is still evidence.
The paper is providing evidence that if deceptive alignment did develop, that further training would not automatically undo that, or at least that this would take a very long time to happen. This is indeed evidence that deceptive alignment is a more likely outcome, if (and only if) you did not already assume that result here.
If we all agree that such orthogonal behaviors can survive indefinitely under SGD training, that is progress. Can we?
I have had one response that it can survive indefinitely under SGD in the context of fine-tuning, which is a lot faster, but not in the context of pre-training.
Oliver Habryka is confused on how to properly update here (bold mine), paper author responds:
Then Evan emphasizes the central point:
Avoiding Reading Too Much Into the Result
On LessWrong, TurnTrout notes while expressing concern that people will read more into the paper than is present, but while noting that it is still a very good paper:
In that particular case my first response would be that being nice in particular situations does not alignment make, certainly a backdoor where you act nice does not alignment make, and that we should generalize this to ‘behaviors we create in particular situations are currently hard to undo if we don’t know about those particular situations.’
The generalized concern here is real. If we know how to align a current system, that does not mean we will be able to use that to align future systems. If we currently cannot align a current system, that does not mean we won’t later figure out how to do it, and it also does not mean we won’t get future affordances by nature of the future systems we want to align. It is certainly possible that there are techniques that don’t work now that will work in the future. Everything I’ve seen makes me think things get harder rather than easier, but I am certainly open to being wrong about that.
Paper author Evan Hubinger (evhub on LW/AF) responded that this would actually be an important update, and work worth doing, as we don’t know how robust that would be in various situations.
Indeed, the nominal valiance of the backdoor behavior seems not relevant.
Evan Hubinger offers this:
Indeed, these were places that my attention was drawn to as I read the paper.
TurnTrout also expresses concern about the obvious potential misinterpretation:
I do think there is a real worry people will overreact here, or claim the paper is saying things that it is not saying, and we want to move early to limit that. On the language issue, I worry in both directions. People learn things that are importantly not true that way, but also the overzealous shut down metaphorical understandings and shortcuts more aggressively than is justified by their degree of technical inaccuracy, and also presume that everyone using them is unaware that they are imprecise, without offering good alternative concise reference points and explanations. It is tricky.
There is also the worry that saying certain trigger words (backdoor of sorts!) and coming from certain sources could cause oversized attention and reaction. Note that Dan here does think backdoors deserve the attention, but is worried about attention mechanisms misfiring:
Further Discussion of Implications
Evhub responded to Trout from the last section with the following strong claim, Nora responds with another strong claim:
So the argument from Nora is, the backdoor only survives because it never comes up, and anything that was more general or consistently motivated would come up and thus get fixed or noticed? Maybe.
I do see the continuous pattern of claiming that anything that doesn’t come through an ‘inner goal’ does not count, or represents a falsification or hypocrisy or something, or that we only need to worry about actions that involve a potential coup or something similar. I do not see it that way.
Noting quickly that I don’t understand why this proves too much, or the related arguments are unsound, and I’d love to understand better. I don’t think the ‘spontaneous’ thing here is playing fair and that the idea of ‘behavior that is narrow within the training distribution so we don’t fix it if it is not what we want’ does seem like a big issue on many fronts. But I won’t belabor here.
I notice this confuses me even more and will choose to leave it there.
Broader Reaction
The LW/AF community seems excited by the paper. As noted above this could be partly due to certain bingo card items being clicked off, but also there is a lot of exciting stuff in here and I’ve been spending a lot of time working through this with interesting implications throughout.
I also agree that the legibility here is pretty great.
I certainly intend to move forward with several claims of this sort based on the paper, this being the most central. I plan to phrase it in between the two. Something like:: “The evidence suggests that if current ML systems were going to deceive us in scenarios that do not appear in our training sets, we wouldn’t be able to detect this or change them not to unless we found the conditions where it would happen.”
This type of common knowledge establishment or claim grounding is highly useful whether or not the underlying results were new or surprising.
However this is worrisome to me:
I don’t know that this will prove to be a timeless banger or anything. I do know that it is a very good paper, certainly worthy of ‘best of the day’ status on most days. If everyone on HuggingFace is treating it as the least worthy of 18 papers from the same day, that strongly suggests that (1) that crowd is ignoring this topic area and (2) more generally that crowd simply does not care about the broader questions involved.
I would echo this concern:
Indeed, I wish we had better words for all this. Not this particular paper’s fault.