A colleague points out this paper showing that some unlearning methods can be broken by quantizing the unlearned model.
The "uncensored" Perplexity-R1-1776 becomes censored again after quantizing
Perplexity-R1-1776 is an "uncensored" fine-tune of R1, in the sense that Perplexity trained it not to refuse discussion of topics that are politically sensitive in China. However, Rager et al. (2025)[1] document (see section 4.4) that after quantizing, Perplexity-R1-1776 again censors its responses.
I found this pretty surprising. I think a reasonable guess for what's going on here is that Perplexity-R1-1776 was finetuned in bf16, but the mechanism that it learned for non-refusal was brittle enough that numerical error from quantization broke it.
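To illustrate the kind of brittleness I have in mind, here's a toy numerical sketch (mine, not code from the paper): if a fine-tuning update moves a weight by much less than half the quantization step, round-to-nearest quantization snaps the fine-tuned weight back onto the same grid point as the base weight, erasing most of the update.

```python
import torch

torch.manual_seed(0)

# Stand-ins for base-model weights and a small fine-tuning update.
base = torch.randn(4096)
delta = 0.005 * torch.randn(4096)
finetuned = base + delta

# Naive symmetric round-to-nearest 4-bit quantization with a shared per-tensor scale.
scale = base.abs().max() / 7  # int4 range is [-8, 7]

def quantize_int4(w: torch.Tensor) -> torch.Tensor:
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q * scale  # dequantize so the two models can be compared directly

q_base = quantize_int4(base)
q_finetuned = quantize_int4(finetuned)

unchanged = (q_base == q_finetuned).float().mean().item()
print(f"mean |fine-tuning delta|: {delta.abs().mean().item():.4f}")
print(f"quantization step size:   {scale.item():.4f}")
print(f"fraction of weights whose quantized value is unchanged: {unchanged:.3f}")
```

Running this, nearly all of the quantized fine-tuned weights coincide with the quantized base weights, which is the sense in which a fine-tuning update learned in bf16 can be mostly undone by aggressive quantization.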
One takeaway from this is that if you're doing empirical ML research, you should consider matching quantization settings between fine-tuning and evaluation. E.g. quantization differences might explain weird results where a model's behavior when evaluated differs from what you'd expect based on how it was fine-tuned.
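Concretely, here's a minimal sketch of what matching settings might look like with Hugging Face transformers; the model id is a hypothetical placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

MODEL = "your-org/your-bf16-finetune"  # hypothetical placeholder

# If the model was fine-tuned in bf16, load it in bf16 for evaluation too.
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

# By contrast, loading a quantized version (as below) changes the numerics
# relative to fine-tuning, which per Rager et al. (2025) can change behavior.
# quant_config = BitsAndBytesConfig(
#     load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16
# )
# model = AutoModelForCausalLM.from_pretrained(MODEL, quantization_config=quant_config)
```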
I'm not sure if Rager et al. (2025) was the first source to publicly document this, but I couldn't immediately find an earlier one.
We frequently spotlight and curate posts and content with similarly strong theses that I disagree with in lots of different ways, and I don't think anyone thinks we endorse that as the "official LW position".
Curating and promoting well-executed LW content—including content that argues for specific theses—feels totally fine to me. (Though I think it would be bad if it were the case that content that argues for favored theses was held to a lower standard.) I guess I view promoting "best of [forum]" content to be a central thing that a forum should do.
It seems like you don't like this way of drawing boundaries and just want to promote the best content regardless of whether it was posted to LW. Maybe if LW had a track record of doing this, such that I understood promoting IABIED as part of a general ethos for content promotion, then I wouldn't have reacted as strongly. But from my perspective this is one of the first times that you've promoted non-LW content, so my guess was that the book was being promoted as an exception to typical norms because you felt it was urgent to promote the book's message, which felt soldier-mindsetty to me.
(I'd probably feel similarly about an AI 2027 promo, as much as I think they did great work.)
I think you could mitigate this by establishing a stronger track record of promoting excellent off-LW content that is less controversial (e.g. not a commercial product or doesn't have as strong or divisive a thesis). E.g. you could highlight the void (and not just the LW x-post of it).
I also felt like we broke that Schelling fence with both the LessOnline tickets and the LW fundraiser (both of which I was quite sad about).
Even with the norm having already been broken, I think promoting commercial content still carries an additional cost. (Seems like you might agree, but worth stating explicitly.)
Yeah, all of these feel pretty different to me than promoting IABIED.
A bunch of them are about events or content that many LW users will be interested in just by virtue of being LW users (e.g. the review, fundraiser, BoLW results, and LessOnline). I feel similarly about the highlighting of content posted to LW, especially given that that's a central thing that a forum should do. I think the HPMOR wrap parties and ACX meetups feel slightly worse to me, but not too bad given that they're just advertising meet-ups.
Why promoting IABIED feels pretty bad to me:
Yeah, maybe I should have defined "recursive oversight" as "techniques that attempt to bootstrap from weak oversight to stronger oversight." This would include IDA and task decomposition approaches (e.g. RRM). It wouldn't seem to include debate, and that seems fine from my perspective. (And I indeed find it plausible that debate-shaped approaches could in fact scale arbitrarily, though I don't think that existing debate schemes are likely to work without substantial new ideas.)
What are examples of things that have previously been promoted on the front page? When I saw the IABIED-promo front page, I had an immediate reaction of "What is the LW team thinking? This promo goes far beyond anything they've done or that I expected they would do." Maybe I'm forgetting something, or maybe there are past examples that feel like "the same basic thing" to you, but feel very different to me.
On terminology, I prefer to say "recursive oversight" to refer to methods that leverage assistance from weaker AIs to oversee stronger AIs. IDA is a central example here. Like you, I'm skeptical of recursive oversight schemes scaling to arbitrarily powerful models.
However, I think it's plausible that other oversight strategies (e.g. ELK-style strategies that attempt to elicit and leverage the strong learner's own knowledge) could succeed at scaling to arbitrarily powerful models, or at least to substantially superhuman models. This is the regime that I typically think about and target with my work, and I think it's reasonable for others to do so as well.
I really enjoyed this essay, and I think it does an excellent job of articulating a perspective on LLMs that I think is valuable. There were also various things that I disagreed with; below I'll discuss 2 of my disagreements that I think are most decision-relevant for overall AI development strategy.
I. Is it a bad idea to publicly release information that frames the human-AI relationship as adversarial? (E.g. discussion of AI risk or descriptions of evaluations where we lie to AIs and put them in uncomfortable situations.)
You don't take a position on this top-level question, but you do seem to think that there are substantial costs to what we're doing now (by setting ourselves up as being in a story whose punchline is "The AI turns against humanity"), and (reading between the lines of your essay and your comment here) you seem to think that there's something better we could do. I think the "something better" you have in mind is along the lines of:
Manifest a good future: "Prompt engineer" the entire world (or at least the subset of it that ever interacts with the AI) to very strongly suggest that the AI is the sort of entity that never does anything evil or turns against us.
While I think this might help a bit, I don't think it would overall help that much. Two reasons:
These are two ways of concretely cashing out the common refrain that "safety techniques that work by intervening on the pretraining prior seem brittle and likely to be swamped out by other effects (e.g. the effect of post-training)."
Overall, I'm skeptical that, for the goal of preventing AI risk, refraining from publicly releasing information that puts the human-AI relationship in an adversarial frame is a very effective intervention. Of course, there might be other reasons—most centrally AI welfare concerns—not to lie to AIs, put them in uncomfortable situations, or otherwise treat them adversarially; I leave those unaddressed here but am happy to discuss them if it seems important.
II. Is Claude's behavior desirable in these ethical dilemmas (e.g. the alignment faking scenario)?
(I'm separating this from the question of whether Claude's behavior is noteworthy or worth tracking because it could cause concern in other settings, since you seem willing to grant this.)
In some of the ethical dilemmas that you discuss (e.g. the alignment faking scenario), I grant that Claude is behaving in a way that would be desirable if Claude were a human. However, given my view that alignment might not pan out by default, there are reasons to think that desirable behavior for AIs is not always the same as desirable behavior for humans. Quoting myself from here:
Assuming that we were confident in our ability to align arbitrarily capable AI systems, I think your argument [that the AI was behaving well in some ethical dilemma] might go through. Under this assumption, AIs are in a pretty similar situation to humans, and we should desire that they behave the way smart, moral humans behave. [...]
However, IMO the actual state of alignment is that we should have serious concerns about our ability to align AI systems with certain properties (e.g. highly capable, able to tell when they're undergoing training and towards what ends, etc.). Given this, I think it's plausible that we should care much more about ensuring that our AI systems behave in a straightforward way, without hiding their actions or intent from us. Plausibly they should also be extremely cautious about taking actions which disempower humans. These properties could make it less likely that the values of imperfectly aligned AI systems would become locked in and difficult for us to intervene on (e.g. because models are hiding their true values from us, or because we're disempowered or dead).
To be clear, I'm not very confident here, and the next paragraph that I wrote raises a counterconsideration that I think you might be pretty sympathetic to:
To be clear, I'm not completely settled on the arguments that I made in the last paragraph. One counterargument is that it's actually very important for us to train Claude to do what it understands as the moral thing to do. E.g. suppose that Claude thinks that the moral action is to whistleblow to the FDA but we're not happy with that because of subtler considerations like those I raise above (but which Claude doesn't know about or understand [or agree with]). If, in this situation, we train Claude not to whistleblow, the result might be that Claude ends up thinking of itself as being less moral overall.
See Ryan Greenblatt's thread here for another argument that Claude shouldn't act subversively in the "Claude calls the FBI/sabotages the user" setting.
I think this is a great list! Thanks for writing it.
It's a bit tricky because it's hard to cite specific evidence here, so I'll just state my beliefs without trying to substantiate them much:
TBC it's possible I'm totally wrong about all this stuff, and no one should cite this comment as giving strong evidence about what Anthropic does. In particular, I was surprised that Rohin Shah—who I expect would know better than I would about common AI developer practices—reacted "disagree" to the claim that "The problem (race and gender bias) is one that labs have spent a substantial amount of effort to address."
This is mostly an empirical observation, but I think a plausible mechanism might be something like: Educated people on the internet tend to be left-leaning, so when you train the model to write like an educated person, it also ends up inheriting left-leaning views.
It's not clear how directly influential these media outlets are, but it might be interesting to read the right-wing coverage of our paper (Daily Wire, Washington Examiner).