I've formerly worked for MIRI and what's now the Center on Long-Term Risk; I'm now making a living as an emotion coach and Substack writer.
Most of my content becomes free eventually, but if you'd like to get a paid subscription to my Substack, you'll get it a week early and make it possible for me to write more.
Wikipedia articles have traditionally been emphasized in LLM training. OpenAI never told us the dataset used to train GPT-4 or GPT-5, but the dataset used for training GPT-3 involved 3.4 repetitions of Wikipedia.
The Pile also has English Wikipedia repeated three times, which is a higher multiplier than any other subcomponent.
Being defensive can certainly mean behaviors that go to that extreme, but I've seen the term used to cover much milder and more acceptable behaviors too, such as merely insisting on one's innocence in a way where you are unwilling to admit to having done a single thing wrong.
Claude Sonnet 4's explanation of what "acting defensive in conversation" means
When someone is described as acting defensive, they're typically engaging in behaviors that protect themselves from perceived criticism, blame, or attack. This usually involves:
Deflecting responsibility - They might redirect blame to others, make excuses, or refuse to acknowledge their role in a problem. Instead of saying "I made a mistake," they might say "Well, you didn't give me clear instructions" or "Anyone would have done the same thing."
Counterattacking - Rather than addressing the issue raised, they turn the focus back on the person confronting them. If criticized for being late, they might respond with "You're always nitpicking" or bring up the other person's past mistakes.
Minimizing or denying - They downplay the significance of their actions or outright deny that something happened. "It wasn't that big a deal" or "I never said that" are common defensive responses.
Emotional escalation - Their tone may become angry, hurt, or indignant. They might raise their voice, become sarcastic, or shut down entirely.
Justifying extensively - They provide lengthy explanations for why their behavior was reasonable or necessary, often missing the actual point being raised.
Taking things personally - They interpret feedback about specific actions as attacks on their character or competence.
Defensiveness usually stems from feeling threatened, vulnerable, or ashamed. While it's a natural protective response, it often prevents productive communication and problem-solving because the person isn't really listening to or engaging with the concerns being raised.
When someone is showing you a fear that looking at the evidence will leave you thinking they're bad, this implies a belief that they are indeed bad -- at least by your interpretation, which is obviously the one that matters to you. [...]
Because the only way that fear can be reflectively stable is if they are guilty.
I think this is assuming that the people looking at the evidence can be trusted to make a fair and impartial assessment of it and not jump to any unjustified conclusions?
I do agree that if someone has strong reasons to believe that, and to believe that nobody will be motivated to take any of the information out of context and paint them in a bad light later, etc., then hiding information only makes sense if you are in fact guilty.
But it's very often the case that people don't have reason to feel that secure, and have cause to believe that at least part of their audience will jump to conclusions, have all kinds of hostile motives, be inclined to treat one party's word as intrinsically more trustworthy than the other's, not have the time or interest to really think it through, etc.
In the kinds of examples that I gave in the original post where I'd gotten defensive despite not feeling guilty, it was exactly because the other party gave signs that they were not inclined to consider the evidence in a balanced way - if they wanted to listen to it at all.
Even if a person I'm talking to trusts me to fairly consider the evidence, if there are any other people witnessing the conversation, those others might still have hostile motives, making my interlocutor defensive. So it's not even the case that they necessarily expect the evidence to make them look bad by my interpretation; they can expect it to make them look bad by someone else's interpretation.
Yeah, I'm starting with this part of your response because I agree and think it is good to have clear messaging on the most unambiguously one-directional ("guilty or not") pieces of evidence.
That's a cool conversational move! Appreciate it.
What shouldn't happen is that onlookers give someone a pass because of reasoning that goes as follows:
Agree. When I wrote the post, I was thinking more of a case where someone does respond to the object-level claims but in a defensive way or with non-object-level arguments mixed in, not of a case where they entirely fail to present object-level arguments.
Basically, the asymmetry is that innocent people can often (though not always) disclose information voluntarily that makes their innocence more clear/likely. That's the best strategy if it is available to you. It is never available to guilty people, but sometimes available to innocent people.
I suspect we might disagree on exactly how frequently this strategy is available to innocent people. I do agree that it is sometimes available to innocent people, but there are also lots of situations where e.g. the innocent person can't offer any solid evidence that their version of the story is the correct one, or where they have some other reason not to share the full truth (e.g. protecting someone else's privacy or truth-telling requiring them to reveal something unrelated that they are embarrassed by or have a legal obligation not to reveal), or where the truth is complicated or unusual enough that third parties might not believe it, etc.
Also, as long as the innocents are not fully convincing, many people might go "I can't tell who is telling the truth here so just out of caution I'll distrust everyone involved", which gives even innocent people a motive to leverage whatever extra weapons they have to increase the chances of being believed (or equivalently, the accuser not being believed).
Justifiably accused "problem people" will almost always attempt counterattacks in one form or another (if not calling into question the accuser's character, then at least their mental health and sanity) because they work so well as deflection.
Agree. But a relevant question is, do innocent people attempt counterattacks at a significantly lower rate? If both innocent and guilty people are roughly equally likely to attempt counterattacks, then just the presence of a counterattack isn't strong evidence. And as long as a counterattack is not less effective for an innocent person, you'd expect both innocent and guilty people to have a similar incentive to launch them.
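To make the likelihood-ratio point concrete, here is a minimal sketch of the Bayes update. The prior and the counterattack rates are made-up numbers chosen purely for illustration, not estimates of anything, and `posterior_guilt` is a hypothetical helper name.

```python
def posterior_guilt(prior, p_counter_given_guilty, p_counter_given_innocent):
    """Return P(guilty | counterattack) using Bayes' rule."""
    numerator = p_counter_given_guilty * prior
    denominator = numerator + p_counter_given_innocent * (1 - prior)
    return numerator / denominator


# Illustrative assumption: guilty and innocent people counterattack equally often.
# The likelihood ratio is 1, so the posterior equals the prior.
same_rate = posterior_guilt(0.5, 0.9, 0.9)

# Illustrative assumption: innocent people counterattack half as often.
# The likelihood ratio is 2, so the counterattack shifts belief toward guilt.
different_rate = posterior_guilt(0.5, 0.9, 0.45)

print(same_rate, different_rate)
```

With equal rates the observation leaves the prior untouched; it only becomes evidence to the extent that the two rates actually differ, which is the empirical question here.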
WRT your last paragraph, I agree with your examples and think the difference probably comes from us thinking about different kinds of examples.
After all, if you weren't guilty, you wouldn't need to defend yourself
This seems wrong to me? If someone says "X abused me" and X says nothing to defend themselves or refute it, people are likely to take the lack of a defense as an admission of guilt.
When you say defensiveness, does that include something like "act as though you've been attacked viciously by a person who is biased against you because they're bad"?
Yeah.
The problem with the "immediately focus on maximally discrediting the accusers" is that it is awfully close to the tactic that actually guilty people might want to use to discredit or intimidate their accusers
Agree. But it's also a strategy that innocent people might want to use to show that the people accusing them don't have clean motives, or just something that they do automatically to defend themselves because they're under stress and it does work as a general-purpose defense strategy. So it doesn't seem like clear Bayesian evidence one way or the other?
I haven't thought this through in detail but my first thought would be to suspect that this is a strategy that weakly favors people who are actually innocent, assuming that the audience is reasonably discerning and it doesn't just degenerate into a popularity contest. While you can of course dig up dirt on anyone, being able to find accusation-relevant dirt ("this police officer accusing me has been known to take bribes and accuse innocent people before") seems more likely in cases where you are in fact falsely accused.
Of course, if that's the only defense they offer and they don't bother refuting any of the actual accusations in any substantial way, that's certainly very suspicious. But then the suspicious thing is more the lack of an object-level response rather than the presence of a defensive response.
Maybe? My mental model of crackpots involves them writing very long manifestos.
If that makes a user unrealistic, then I'm unrealistic!
I was amused when Claude Opus abruptly stopped generating a reply to me and shut down the chat when I had asked it how a fictional galactic empire might control its frontier planets. Given that it stopped generating in the middle of a sentence that was talking about "biological monitoring" and "enhanced", I surmised that the reference to the genetically engineered catboys/catgirls in the setting had triggered its bioengineering filters.
Strictly speaking the technology was available (I got a startup that I consulted for to adopt GPT-3 roughly a year before ChatGPT happened). That said, it wasn't very widely known, so your take still seems like a reasonable approximation.