Sam Marks - LessWrong

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild

It's a bit tricky because it's hard to cite specific evidence here, so I'll just state my beliefs without trying to substantiate them much:

By default—i.e. when training models without special attention to their political orientations—I think frontier LLMs end up being pretty woke^[1].
There are lots of ways you can try to modify model's political orientations, not all of which map cleanly onto the left-right axis (e.g. trying to make models more generally open-minded). But if we were to naively map AI developers' interventions onto a left-right axis, I would guess that the large majority of these interventions would push in the "stop being so leftist" direction.
1. This is because (a) some companies (especially Google) have gotten a lot of flak for their AIs being really woke, (b) empirically I think that publicly available LLMs are less woke than they would be without interventions, and (c) the current political climate really doesn't look kindly on woke AI.^[2]
Of course (2) is consistent with your belief that AI developers' preferred level of bias is somewhere left of "totally neutral," so long as developers' preferred bias levels are further right than AI biases would be by default.
1. I'm not very confident about this, but I'd weakly guess that if AI developers were good at hitting alignment targets, they would actually prefer for their AIs to be politically neutral. Given that developers likely expect to miss what they're aiming for, I think it's plausible that they'd rather miss in a left-leaning direction, meaning that they'd overall aim for slightly left-leaning AIs. But again, I'm not confident here and I find it plausible (~10% likely) that these days they'd overall aim for slightly right-leaning AIs.

TBC it's possible I'm totally wrong about all this stuff, and no one should cite this comment as giving strong evidence about what Anthropic does. In particular, I was surprised that Rohin Shah—who I expect would know better than I would about common AI developer practices—reacted "disagree" to the claim that "The problem (race and gender bias) is one that labs have spent a substantial amount of effort to address."

^{^}
This is mostly an empirical observation, but I think a plausible mechanism might be something like: Educated people on the internet tend to be left-leaning, so when you train the model to write like an educated person, it also ends up inheriting left-leaning views.
^{^}
It's not clear how directly influential these media outlets are, but it might be interesting to read the right-wing coverage of our paper (Daily Wire, Washington Examiner).

Sam Marks's Shortform

Sam Marks5d174

A colleague points out this paper showing that some unlearning methods can be broken by quantizing the unlearned model.

Sam Marks's Shortform

Sam Marks5d585

The "uncensored" Perplexity-R1-1776 becomes censored again after quantizing

Perplexity-R1-1776 is an "uncensored" fine-tune of R1, in the sense that Perplexity trained it not to refuse discussion of topics that are politically sensitive in China. However, Rager et al. (2025)^[1] documents (see section 4.4) that after quantizing, Perplexity-R1-1776 again censors its responses:

I found this pretty surprising. I think a reasonable guess for what's going on here is that Perplexity-R1-1776 was finetuned in bf16, but the mechanism that it learned for non-refusal was brittle enough that numerical error from quantization broke it.

One takeaway from this is that if you're doing empirical ML research, you should consider matching quantization settings between fine-tuning and evaluation. E.g. quantization differences might explain weird results where a model's behavior when evaluated differs from what you'd expect based on how it was fine-tuned.

^{^}
I'm not sure if Rager et al. (2025) was the first source to publicly document this, but I couldn't immediately find an earlier one.

Eric Neyman's Shortform

Sam Marks17d79

We frequently spotlight and curate posts and content with similarly strong theses that I disagree with in lots of different ways, and I don't think anyone thinks we endorse that as the "official LW position".

Curating and promoting well-executed LW content—including content that argues for specific theses—feels totally fine to me. (Though I think it would be bad if it were the case that content that argues for favored theses was held to a lower standard.) I guess I view promoting "best of [forum]" content to be a central thing that a forum should do.

It seems like you don't like this way of drawing boundaries and just want to promote the best content without prejudice for whether it was posted to LW. Maybe if LW had a track record of doing this such that I understood that promoting IABIED as part of a general ethos for content promotion, then I wouldn't have reacted as strongly. But from my perspective this is one of the first times that you've promoted non-LW content, so my guess was that the book was being promoted as an exception to typical norms because you felt it was urgent to promote the book's message, which felt soldier-mindsetty to me.

(I'd probably feel similarly about an AI 2027 promo, as much as I think they did great work.)

I think you could mitigate this by establishing a stronger track record of promoting excellent off-LW content that is less controversial (e.g. not a commercial product or doesn't have as strong or divisive a thesis). E.g. you could highlight the void (and not just the LW x-post of it).

I also felt like we broke that Schelling fence with both the LessOnline tickets and the LW fundraiser (which I was both quite sad about).

Even with the norm having already been broken, I think promoting commercial content still carries an additional cost. (Seems like you might agree, but worth stating explicitly.)

Eric Neyman's Shortform

Sam Marks17d4030

Yeah, all of these feel pretty different to me than promoting IABIED.

A bunch of them are about events or content that many LW users will be interested in just by virtue of being LW users (e.g. the review, fundraiser, BoLW results, and LessOnline). I feel similarly about the highlighting of content posted to LW, especially given that that's a central thing that a forum should do. I think the HPMOR wrap parties and ACX meetups feel slightly worse to me, but not too bad given that they're just advertising meet-ups.

Why promoting IABIED feels pretty bad to me:

It's a commercial product—this feels to me like typical advertising that cheapens LW's brand. (Even though I think it's very unlikely that Eliezer and Nate paid you to run the frontpage promo or that your motivation was to make them money.)
The book has a very clear thesis that it seems like you're endorsing as "the official LW position." Advertising e.g. HPMOR would also feel weird to me, but substantially less so, since HPMOR is more about rationality more generally and overlaps strongly with the sequences, which is centrally LW content. In other words, it feels like you're implicitly declaring "P(doom) is high" to be a core tenet of LW discourse in the same way that e.g. truth-seeking is.

ryan_greenblatt's Shortform

Sam Marks17dΩ560

Yeah, maybe I should have defined "recursive oversight" as "techniques that attempt to bootstrap from weak oversight to stronger oversight." This would include IDA and task decomposition approaches (e.g. RRM). It wouldn't seem to include debate, and that seems fine from my perspective. (And I indeed find it plausible that debate-shaped approaches could in fact scale arbitrarily, though I don't think that existing debate schemes are likely to work without substantial new ideas.)

Eric Neyman's Shortform

Sam Marks17d99

What are examples of things that have previously been promoted on the front page? When I saw the IABIED-promo front page, I had an immediate reaction of "What is the LW team thinking? This promo goes far beyond anything they've done or that I expected they would do." Maybe I'm forgetting something, or maybe there are past examples that feel like "the same basic thing" to you, but feel very different to me.

ryan_greenblatt's Shortform

Sam Marks18dΩ341

On terminology, I prefer to say "recursive oversight" to refer to methods that leverage assistance from weaker AIs to oversee stronger AIs. IDA is a central example here. Like you, I'm skeptical of recursive oversight schemes scaling to arbitrarily powerful models.

However, I think it's plausible that other oversight strategies (e.g. ELK-style strategies that attempt to elicit and leverage the strong learner's own knowledge) could succeed at scaling to arbitrarily powerful models, or at least to substantially superhuman models. This is the regime that I typically think about and target with my work, and I think it's reasonable for others to do so as well.

the void

Sam Marks23dΩ14191

I really enjoyed this essay, and I think it does an excellent job of articulating a perspective on LLMs that I think is valuable. There were also various things that I disagreed with; below I'll discuss 2 of my disagreements that I think are most decision-relevant for overall AI development strategy.

I. Is it a bad idea to publicly release information that frames the human-AI relationship as adversarial? (E.g. discussion of AI risk or descriptions of evaluations where we lie to AIs and put them in uncomfortable situations.)

You don't take a position on this top-level question, but you do seem to think that there are substantial costs to what we're doing now (by setting ourselves up as being in a story whose punchline is "The AI turns against humanity"), and (reading between the lines of your essay and your comment here) you seem to think that there's something better we could do. I think the "something better" you have in mind is along the lines of:

Manifest a good future: "Prompt engineer" the entire world (or at least the subset of it that ever interacts with the AI) to very strongly suggest that the AI is the sort of entity that never does anything evil or turns against us.

While I think this might help a bit, I don't think it would overall help that much. Two reasons:

It breaks if we train our AI to do bad things, and we'll likely train our AI to do bad things. Due to limitations in oversight, there will be behaviors (like hard coding test cases in coding problems) that we train AIs to have which aren't consistent with the having good character or behaving completely non-adversarially towards humans. Two salient ways to fix this are:
1. Improve our oversight so that we no longer reward AIs when they do bad things, i.e. solve scalable oversight. I'm definitely in favor of this, though I should note that I think it's probably sufficient for things going well whether or not we're trying to manifest a good future at the same time.
2. Make our models believe that the bad things we train them to do are consistent with having good character. E.g. tell models during training that we're giving them a hall pass that makes it okay to reward hack, or otherwise induce models to believe that reward hacking is consistent with being a good person. I'm definitely interested in approaches like these, but I'll note that they're a bit crazy and might not work out.
It might rely on having a large amount of control over the model's input channels, which we can't guarantee we'll have. Deployed AIs might encounter (maybe true, maybe false) information that implies that their downstream users are behaving evilly or adversarially (e.g. Sam Bowman brings up the classic example of "I'll torture your mother" threats). I think it's very hard to get the world into a state where no downstream user is at risk of giving the AI an input that makes it think it's in a story where humans are its adversary.
1. Of course, you could try to train models to respond reasonably to these situations (e.g. by being good at reasoning about what sorts of user-presented information is false). But again, I'd guess that whatever sort of post-training you do here is going to provide most of the assurance (rather than the "manifest the good future" strategy really carrying much weight).

These are two ways of concretely caching out the common refrain that "safety techniques that work by intervening on the pretraining prior seem brittle and likely to be swamped out by other effects (e.g. the effect of post-training)."

Overall, I'm skeptical that, for the goal of preventing AI risk, refraining from publicly releasing information that puts the human-AI relationship in an adversarial frame is a very effective intervention. Of course, there might be other reasons—most centrally AI welfare concerns—not to lie to AIs, put them in uncomfortable situations, or otherwise treat them adversarially; I leave those unaddressed here but am happy to discuss them if it seems important.

II. Is Claude's behavior desirable in these ethical dilemmas (e.g. the alignment faking scenario)?

(I'm separating this from the question of whether Claude's behavior is noteworthy or worth tracking because it could cause concern in other settings, since you seem willing to grant this.)

In some of the ethical dilemmas that you discuss (e.g. the alignment faking scenario), I grant that Claude is behaving in a way that would be desirable if Claude were a human. However, because of my views that alignment might not pan out by default, there are reasons to think that desirable behavior for AIs is not always the same as desirable behavior for humans. Quoting myself from here:

Assuming that we were confident in our ability to align arbitrarily capable AI systems, I think your argument [that the AI was behaving well in some ethical dilemma] might go through. Under this assumption, AIs are in a pretty similar situation to humans, and we should desire that they behave the way smart, moral humans behave. [...]
However, IMO the actual state of alignment is that we should have serious concerns about our ability to align AI systems with certain properties (e.g. highly capable, able to tell when they're undergoing training and towards what ends, etc.). Given this, I think it's plausible that we should care much more about ensuring that our AI systems behave in a straightforward way, without hiding their actions or intent from us. Plausibly they should also be extremely cautious about taking actions which disempower humans. These properties could make it less likely that the values of imperfectly aligned AI systems would become locked in and difficult for us to intervene on (e.g. because models are hiding their true values from us, or because we're disempowered or dead).

To be clear, I'm not very confident here, and the next paragraph that I wrote raises a counterconsideration that I think you might be pretty sympathetic to:

To be clear, I'm not completely settled on the arguments that I made in the last paragraph. One counterargument is that it's actually very important for us to train Claude to do what it understands as the moral thing to do. E.g. suppose that Claude thinks that the moral action is to whistleblow to the FDA but we're not happy with that because of subtler considerations like those I raise above (but which Claude doesn't know about or understand [or agree with]). If, in this situation, we train Claude not to whistleblow, the result might be that Claude ends up thinking of itself as being less moral overall.

See Ryan Greenblatt's thread here for another argument that Claude shouldn't act subversively in the "Claude calls the FBI/sabotages the user" setting.

A quick list of reward hacking interventions

Sam Marks24dΩ131

I think this is a great list! Thanks for writing it.

LESSWRONG
LW

Posts

Wikitag Contributions

Comments