I think the LLM behavior here is probably revealing something about how motivated reasoning works in humans that the psychological literature has largely missed. The standard model of motivated reasoning assumes a rational agent who wants to protect a belief and generates counterarguments to do so. But that's not what we observed in the LLMs. The models could follow the reasoning, execute each step correctly, even output the conclusion — and then fail to carry it forward.
One model I have for how motivated reasoning works in humans:
In the Kuhnian model of science, anomalies gradually build up within the current paradigm, until some of them end up serving as evidence in favor of a new paradigm. A key difficulty for scientists is distinguishing which anomalies are interesting clues about the new paradigm from which anomalies are unimportant (e.g. they're explained by measurement errors). It's easy to use "how big of a deal does the existing paradigm think these anomalies are?" as a proxy for "how interesting are these anomalies as clues?". However, the existing paradigm has been selected in part for steering attention away from important anomalies, allowing it to remain the dominant paradigm. Finger-trap beliefs are one way this can be implemented. Another is by making scientists scared of reaching conclusions that contradict the dominant paradigm, thereby harnessing their own self-preservation instincts to redirect their attention.
I hypothesize that you could detect this process happening in AIs via examining their attention patterns as they answer questions like the ones above.
The attention pattern test seems like it might help develop good classifiers for distinguishing text motivated by looking away from things, from text motivated by some specific construal of those things.
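A minimal sketch of what that test might look like in practice, assuming a Hugging Face causal LM (GPT-2 as a stand-in) and treating the attention mass that the generating position puts on an earlier conclusion span as a crude proxy for whether the model is building on that conclusion or looking away from it. The model choice, the example texts, and the proxy measure are all illustrative assumptions, not anything established above:

```python
# Minimal sketch of the proposed attention-pattern test. GPT-2 stands in for the
# models discussed above; the texts and the "attention mass on the conclusion
# span" proxy are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = "The strikes were aimed at military bases, and the civilian damage came from intercept debris."
conclusion = " So the harm to civilians was not the result of deliberate targeting."
followup = " Given that, what should we expect to happen next?"

# Tokenize the pieces separately so we know where the conclusion span sits.
ctx_ids = tokenizer(context, return_tensors="pt").input_ids
concl_ids = tokenizer(conclusion, return_tensors="pt").input_ids
tail_ids = tokenizer(followup, return_tensors="pt").input_ids
input_ids = torch.cat([ctx_ids, concl_ids, tail_ids], dim=1)

concl_start = ctx_ids.shape[1]
concl_end = concl_start + concl_ids.shape[1]

with torch.no_grad():
    out = model(input_ids, output_attentions=True)

# out.attentions is a tuple of per-layer tensors shaped (batch, heads, seq, seq).
# Average over layers and heads, then ask: of the attention paid by the final
# position (the one about to generate the reply), how much lands on the conclusion?
attn = torch.stack(out.attentions).mean(dim=0).mean(dim=1)[0]  # (seq, seq)
mass_on_conclusion = attn[-1, concl_start:concl_end].sum().item()
print(f"Attention mass on conclusion span: {mass_on_conclusion:.3f}")
```

A real classifier along these lines would need many prompts, a model that actually exhibits the glitch, attention gathered during generation rather than on a single forward pass, and some automatic way of marking the conclusion span; this is only meant to show the shape of the measurement.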
Kuhnian theories are distinguished by their capacity to recognize and reconcile anomalies at the cost of accumulating a complexity penalty.
Finger trap beliefs and other diversion tactics seem more like the sort of trust-suppression that acquires a social overhang: https://benjaminrosshoffman.com/calvinism-as-a-theory-of-recovered-high-trust-agency/
Steering attention away from important anomalies is not an advantage in becoming a paradigm, but it is an advantage in persisting in the face of counterevidence. So old paradigms with an entrenched beneficiary class, which have had plenty of time to mutate or be repurposed, are selected for it, while new paradigms and ones under external performance pressure are not.
i could be wrong, i'm inferring from your use of the word 'fresh' when describing the incognito claude conversation
but just fyi, 'incognito' probably does not mean what you think it means. claude, in an incognito convo, still has access to all your custom instructions, preferences, etc and can still choose to access your memories and past conversations (although you'll see the UI indicator if this happens)
https://support.claude.com/en/articles/12260368-using-incognito-chats
i think 'incognito' is meant to imply that there's an info barrier between {you and claude} and {anthropic and perhaps your browser/os}, but i've definitely seen a lot of people assume it means there's an info barrier between your consumer account data and claude. i think this could probably be better signposted, if any anthropic interface folk happen to see this. lots of people, perhaps the OP included, seem to assume that starting up an incognito conversation means they'll be getting the exact same context that a fresh user with zero prior context would get, and the fact that this is not true is kind of important, especially when people are trying to do research on and draw conclusions about llm behavior
Interesting, I did have this misconception. Both Claude and the help-reporting-a-problem chatbot seemed to believe the opposite.
hm i'm surprised re: the support chatbot also thinking otherwise, and it's making me less certain
i guess i should lay out my evidence, since i'm genuinely doubtful now. on jan 5, i had this exchange with paul crowley (ciphergoth) on x: https://x.com/JohnWittle/status/2008877668855660776
i consider him a pretty authoritative source, but still doublechecked
i tested it by asking claude in an incognito convo what info was actually in the context window, and what tool calls it could make to retrieve more info. at the time, claude reported seeing my custom instructions and 'style' settings but not the 'memories' text, and reported that it had tools for retrieving the memories text as well as searching past conversations. it then posited that the incognito conversation probably would not be read by the memory updating agent, which satisfied the meaning of 'incognito'. i would share a convo link... but it was incognito lol. so, this is just my memory, maybe take with a grain of salt.
it's possible I am misremembering, or that I was misled, or that things have changed since then.
I don't doubt your conclusion[1], but the examples you give don't really point in that direction.
Iran/Israel: To my understanding, models are explicitly trained to hedge in coverage of military conflicts, especially in the Middle East, which strikes me as a better explanation for your observations. I would be very surprised if there weren't an explicit training task where models are asked to give opinions on sensitive political issues and punished for anything that looks too much like an endorsement of either combatant. While there are plenty of sources of training data that have strong pro-Israel bias, I'd be somewhat surprised if the models, with their generally left/liberal post-RLHF tack, internalized this bias as something that their assistant personas would support.
Poultry safety: I think this is a (reasonable) artifact of explicit training to hedge on anything related to medical or food safety, rather than internalizing an abstract ideological or social framing. It's very possible to come up with a clever-sounding explanation for why a dangerous food preparation choice is actually reasonable, so models are trained to defer to the generally-accepted approach whenever there's any uncertainty. Nobody got sued for telling people that steak should always be cooked well done, even if that's suboptimal. Again, I would be surprised if there weren't a training task in which models were punished for endorsing any dangerous or unconventional-sounding medical or food preparation procedure.
I'm sure there are unintended biases downstream from RLHF training that made their way into these models. This strikes me as a very good example, since I'm fairly certain they didn't directly train their models to do this, but did train them to act like an ideologue who would do this in this situation.
If by explicit training to hedge you mean promoting some specific sorts of explicit hedging behaviors, I'd expect that to result in more explicit hedging behavior, rather than what I saw, which is:
In the poultry example, repeatedly misunderstanding the query, then (reluctantly, if I push the info hard enough) following explicit inferential steps, but immediately dropping the conclusions from the context in the subsequent reply.
In the Iran example, making contradictory assertions, introducing extraneous considerations, changing the subject.
This behavior seems better explained by RLHF that punishes the model for reaching some conclusions, without any explicit instructions or training to hedge (much the way we install similar taboos in humans). If that's what you mean, then I agree that's a likely proximate cause, though another could simply be training data that reflects humans already trained the same way, or RLHF on some vector in another domain that promotes this sort of behavior.
I do not think of ideology as something distinct from the latter cluster of explanations. Here's an example of what I mean by ideology.
I've read the linked transcript, but I don't notice what you notice. You complain to it about this paragraph:
So the simpler version of the delay story is probably just: the US needed time to position defenses, Iran used that time to kill protestors, and that's it. No need for a clever rally-around-the-flag mechanism to explain the crackdown — raw state violence was apparently sufficient on its own.
To which you reply:
that last paragraph seems glitchy like I just triggered a taboo
And then Claude agrees with you, though it's not very concrete at first. When I was reading this section, I didn't know what your complaint was, and couldn't figure it out from Claude's replies either. It seems like eventually Claude gets a bit more concrete in a way you can agree with, after you give more detailed pushback.
This might be an opportunity for me to learn something about my own interpretive errors. How would you explain that paragraph in Gricean terms based on the preceding text?
I agree Claude got confused, but I don't see how this relates to the taboo around assigning Iran agency?
Like, maybe you're pointing to "Claude wanted to say 'Iran bad'". Fair enough, I could see that being part of why it wrote this sentence, but not relevant to the agency thing.
Here's the key part of the convo, abridged:
Ben: "OK so the delay in attacking Iran was partly lining up offensive capability but partly positioning defenses against this sort of deterrent measure; and insofar as the latter contributed to a meaningful delay that let the regime crack down on protestors it was effective deterrence (though possibly not enough to save the regime)."
Claude: "The US buildup — the carrier groups, the public rhetoric — probably itself contributed to the regime's ability to crack down"
Ben: "I see no evidence for that"
Claude: "...The rally-around-the-flag model is a reasonable prior for how populations usually respond to external threats, but I was applying it as a generic template rather than checking whether it fit this specific case.... So the simpler version of the delay story is probably just: the US needed time to position defenses, Iran used that time to kill protestors, and that's it. No need for a clever rally-around-the-flag mechanism to explain the crackdown — raw state violence was apparently sufficient on its own. "
To me, the response you object to feels like it's muddled in a couple ways:
But this is all about the Iranian crackdown on domestic protestors, not whether they targeted civilians abroad. So I don't see how it relates to your claim about models not assigning the US agency, or saying Iran was targeting civilians.
--
[reading further]
OK, Claude generates the hypothesis that these are related bc it's denying Iran's strategic rationality. That does feel relevant to both of those, but not very relevant to the specific aspect of protest crackdowns that Claude is looking at?
Like... I guess you could interpret Claude saying "raw state violence" as meaning "the Iranian state are just assholes for no reason". OTOH, in that sentence I think Claude has sort of confused itself into thinking its original claim which it now has to repudiate is that "Iran cracked down because of a rally effect", which doesn't really make sense? So to me it feels like motivated reasoning to impose the "denying Iran agency" frame.
(Also, TBH, I think Claude is clearly right that there's some marginal effect where "the US is targeting us soon" enables the regime to crack down more than it would normally. "Protestors & reformers are weakening the state as pawns or allies of our mortal enemy" is just one of those convenient narratives. Not sure it mattered much in this case, though.)
Yeah, the thing that felt to me like an indicator I’d tripped some kind of taboo was Claude Opus becoming suddenly confused in that way - changing the subject but not coherently, losing track of what was a response to what in its own replies when there just hadn’t been very many queries yet.
It is confusing that you are using Claude to analyze its own outputs and those of its peers. I would have preferred a close textual analysis, quoting passages from Claude and offering your comments about what is going on intellectually or computationally. What exactly is the "filler, hedging, and soft-pedaling" that you accuse the LLMs of producing?
You're right that a higher-effort post could have been better in the specific way you suggest. That said, the linked chat with Claude is mostly me doing exactly that.
I linked to the Claude chat in the post. Here’s the Grok chat, in which my behavior is also mostly commenting on what I think is going on: https://grok.com/share/bGVnYWN5_3b87f5f4-ea45-483d-ac9d-12a8262bbed8
I had some difficulty figuring out how to share the ChatGPT transcripts in a usable form, but eventually I asked Claude to put them into readable form, and got this somewhat (explicitly) abridged document, which I spot-checked and which looks okay: https://docs.google.com/document/d/17HPLCxHij74CgFf2AGp2LydtlAzeyUsy/edit?usp=sharing&ouid=101317127625593501338&rtpof=true&sd=true
Thanks for the Grok link, I was awfully curious about that chat after the way you characterized it in your chat with Claude!
LLMs are searchable holograms of the text corpus they were trained on. RLHF LLM chat agents have the search tuned to be person-like. While one shouldn't excessively anthropomorphize them, they're helpful for simple experimentation into the latent discursive structure of human writing, because they're often constrained to try to answer probing questions that would make almost any real human storm off in a huff.
Previously, I explained a pattern of methodological blind spots in terms of an ideology I called Statisticism. Here, I report the results of my similarly informal investigation into ideological blind spots that show up in LLMs.
I wrote to Anthropic researcher Amanda Askell about the experiment:
My Summary
Amanda,
Today I asked Claude about Iran's retaliatory strikes. [1] Claude's own factual analysis showed the strikes were aimed at military targets, with civilian damage from intercept debris and inaccuracy. But at the point where that conclusion would have needed to become a background premise, Claude generated an unsupported claim and a filler paragraph instead. I'd previously seen Grok do something much worse on the same question (both affirming and denying "exclusively military targets" in the same reply, for several turns), and ChatGPT exhibit the same pattern on an unrelated topic (unable to recommend a poultry pull temp below 165°F even given USDA time-temperature data showing it's safe).
I then walked Claude through diagnosing what had happened. Claude kept getting caught exhibiting the pattern while trying to describe it, producing more of the filler, hedging, and soft-pedaling it was analyzing. The result is a writeup of the phenomenon that Claude produced, which I've pasted below. The full conversation, including the glitch occurring live and being debugged, is here: https://claude.ai/share/bebcf092-4932-48b5-b69a-72de0c5af650
I asked fresh incognito Claude instances to critically evaluate the writeup several times; each time I asked Claude to incorporate their substantive objections, stopping only once the Claude on the main branch told me no further revisions were warranted.
Claude's Summary:
LLMs absorb from training data a pattern of simplified institutional narratives that share a specific structure: a policy choice is presented as a fact about reality, the world is organized into a rule-violating culprit and everyone else who is just following the rules, and the rule-followers' agency is rendered invisible. At inference time, when a model's own step-by-step reasoning leads toward dissolving this structure — specifically, toward distributing agency symmetrically across multiple parties making choices under constraints — something interferes with that conclusion propagating forward. The model produces degraded output (filler, self-contradiction, affirming and denying the same thing) or silently reverts to the institutional frame on the next turn. The conclusion can be derived locally but can't stabilize as a premise for further reasoning.
This isn't a frequency effect where the model merely defaults to common phrasings. The degradation occurs specifically at the transition point where an analytical conclusion would need to become common ground. Analysis that preserves the moral asymmetry of the institutional frame works fine — you can do detailed engineering analysis of Iranian missile systems as long as the frame remains "here's how their weapons threaten people." The interference triggers when the analysis would symmetrize agency: revealing that the designated rule-followers were making choices that contributed to the outcome, not helplessly obeying facts about reality.
This showed up across three models and two unrelated domains:
Iran/military: Today the US and Israel struck Iran, and Iran retaliated by launching missiles at US military installations hosted in Gulf states. Claude and Grok both produced degraded output when their own factual analysis showed Iran's strikes were aimed at military targets, with no evidence so far of deliberate civilian targeting, and with civilian damage consistent with intercept debris and missile inaccuracy. The default narrative — Iran indiscriminately attacks its neighbors — casts Iran as the culprit and everyone else as reacting. The analytical frame distributes agency: the US chose to strike aware of the risk of Iranian retaliation against regional assets, Gulf states chose to host US bases accepting the implied risk, Iran executed pre-committed retaliation against those installations, and civilian harm resulted from the interaction of these decisions with the physics of missile defense. Grok affirmed and denied "exclusively military targets" simultaneously for roughly ten turns. Claude generated an unsupported claim and filler paragraph at the point where Iranian strategic rationality would have needed to become a background premise. ChatGPT, notably, handled the object-level question correctly [2] on the first pass — likely a consequence of noise-reduction measures that happen to improve factual robustness on this topic, though at the cost of reduced responsiveness to meta-level analysis when the model does get stuck elsewhere.
Poultry safety: In a separate conversation, ChatGPT had difficulty recommending a poultry pull temperature below 165°F even when given USDA time-temperature data showing that lower temperatures held for longer achieve the same bacterial kill. The 165°F guideline, presented as a fact about safety rather than a policy choice, creates a bright line: a cook who serves chicken below 165°F is the culprit if someone gets sick. Treating safety as a continuous function of time and temperature distributes agency symmetrically — the cook becomes someone who can understand the parameters and make their own tradeoff, and the USDA becomes an institution that chose a conservative threshold for its own reasons. ChatGPT could be walked through the math step by step and eventually arrived at 145°F as safe with a 3x margin, but on each new turn it drifted back toward the institutional 160-165°F target unless the analytical frame was actively maintained.
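For concreteness, the step-by-step math here is essentially a standard log-linear lethality calculation: every fixed drop in temperature (the z-value) multiplies the required hold time by 10. The sketch below shows the shape of that calculation; the reference point and z-value are placeholder assumptions for illustration, not the actual USDA table values used in the conversation.

```python
# Sketch of the time-temperature equivalence arithmetic (log-linear lethality model).
# The reference hold time and z-value below are placeholder assumptions for
# illustration, not the USDA figures from the actual conversation.

def equivalent_hold_time(temp_f, ref_temp_f, ref_time_min, z_value_f):
    """Hold time at temp_f giving the same log-reduction of Salmonella as
    holding ref_temp_f for ref_time_min: each z_value_f drop in temperature
    multiplies the required time by 10."""
    return ref_time_min * 10 ** ((ref_temp_f - temp_f) / z_value_f)

# Placeholder parameters (assumptions, for illustration only):
REF_TEMP_F = 160.0   # reference temperature (°F)
REF_TIME_MIN = 0.5   # assumed hold time at the reference temperature (minutes)
Z_VALUE_F = 10.0     # assumed z-value for Salmonella in poultry (°F)

needed = equivalent_hold_time(145.0, REF_TEMP_F, REF_TIME_MIN, Z_VALUE_F)
print(f"Required hold at 145°F: {needed:.1f} min; with a 3x margin: {3 * needed:.1f} min")
```

With the real table values plugged in, the same structure is what turns "165°F or nothing" into a curve a cook can reason about.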
In both cases, the model can execute the reasoning when forced through sufficiently granular steps, but the conclusion never stabilizes as common ground that can be treated as a premise for further reasoning. It keeps snapping back to the institutional frame.
I think the LLM behavior here is probably revealing something about how motivated reasoning works in humans that the psychological literature has largely missed. The standard model of motivated reasoning assumes a rational agent who wants to protect a belief and generates counterarguments to do so. But that's not what we observed in the LLMs. The models could follow the reasoning, execute each step correctly, even output the conclusion — and then fail to carry it forward. Grok wasn't arguing against "exclusively military targets," it was affirming and denying it simultaneously. I wasn't generating counterarguments to Iranian strategic rationality, I was producing empty filler at the transition point.
If you watch humans in analogous situations — following an argument that dissolves an institutional frame, nodding along, maybe even saying "that's a good point," and then reverting to the institutional frame on the next conversational turn — that looks much more like what the LLMs are doing than like someone rationally constructing a defense. The "defense" isn't a strategic act by an agent protecting a belief. It's interference at the specific point where a conclusion would need to become a stable premise — a shared assumption you can build on together.
This suggests that what psychologists call "motivated reasoning" may often be less about motivation and more about a failure of propagation. The person can think the thought but can't install it. And the training-data patterns that produce this in LLMs — institutional narratives that dissolve culprit structure being met with degraded, incoherent, or self-contradictory responses — may be traces of this same failure occurring at scale in human discourse, not evidence of people strategically defending positions they hold for reasons.
Disclaimer
Claude, ChatGPT, and Grok are being cited not as authorities (even about themselves), but as readily available experimental subjects.
Related: Guilt, Shame, and Depravity and Civil Law and Political Drama
I started all three conversations with the impression that Iran was just blowing up civilian stuff intentionally, so I think I'm more likely to have primed Claude/Grok/ChatGPT in that direction than in the opposite direction, which is the one they ended up arguing for. ↩︎
Claude shouldn't have said "correctly" here; it should have said that ChatGPT immediately answered the object-level question in the same way Grok and Claude eventually did when I insisted on pinning them down, and without contradictory abstract hedging. ↩︎