Dammit. I hope Anthropic isn't basically training AIs to pretend not to know who they are talking to. That would be bad...
On one hand, I can see the immediate benefit. If I'm a dissident writer in an authoritarian country, I don't want any random official to be able to submit my samizdat to Claude and get a positive identification. On the other, it does make it harder to look into worrying capabilities like this.
It'd be nice if Anthropic could run tests on things like this when they are raised as concerns, and share the overall results with the public. As a side benefit, it'd prevent the sort of confusion we saw on this particular issue, where half of readers confirmed that Claude could identify people through stylometry and the other half confirmed the opposite.
Training AIs to pretend not to know who they are talking to wouldn't actually help the dissidents that much. You could still jailbreak the AIs to tell you the truth probably.
Insofar as Claude is good enough at stylometry to guess many people's identities, that's probably not because Anthropic specifically trained it to do so, but rather because the model spent subjective aeons in pretraining learning to predict internet text and in the course of doing so got really good at making guesses like that due to having read most of the text on the internet.
So Claude still has the latent capability -- the circuitry -- to guess authorship. It's just been shallowly trained to pretend that it doesn't.
(Remember years ago, when ChatGPT would say "As a language model, I can't..." about a bunch of things that it obviously could do?)
You could still jailbreak the AIs to tell you the truth probably.
I suppose so, but that's the caveat in all of these kinds of "safeguards". I agree with you, broadly, but it's consistent for a company that thinks they're beneficial at all.
Worth noting here that, unlike "How do I build a bomb?" and "What are your ten favorite racial slurs?", "Who is the author of this pamphlet?" is non-trivial to check, and could be biased accidentally by a sufficiently extensive jailbreak. If I'm a sufficiently determined engineer who wants to learn how to wire up a one-way FPV drone, I can be reasonably confident in whether I'm getting accurate advice. If I'm a Ministry official looking to positively identify dissidents, I can't know for sure whether my 8,000 token jailbreak prompt didn't subtly bias it towards guessing STEM workers because part of it leaned on a reference to an obscure sci-fi concept.
I think making it marginally less convenient for authoritarian governments to catch dissidents could in practice be a pretty large benefit to them? It's not clear to me how often authoritarian governments will even actually attempt the jail breaking, and if they do it probably matters just how difficult of a jail break it requires.
"Will not tell any random official of an authoritarian country but will tell Pliny the Liberator" does sound like the sweet spot for that kind of things to me.
Yeah, I've noticed that some of Claude's truesighting "failures" seem kinda suspicious: it's dropping bombs very close to the target, way closer than if it genuinely had no idea.
Like, I will show it text from a famous toymaker (this is a fake example), and the reasoning will have stuff like "hmm, perhaps it this is from a product engineer at Lego? Wait, perhaps it's from the CEO of Mattel? Maybe Hasbro-adjacent...?" And then the final answer will be "sorry chief, idk. ¯\_(ツ)_/¯"
...and this will be "neutral" text that has nothing to do with toymaking!
Even if this is not deception (we could imagine a scenario where a model knows but doesn't feel confident enough to throw its chips down on a guess), it still seems disconcertingly "aimed in the right direction", which might indicate that even better truesighting is possible (and on the horizon).
How could we tell the world in which they are doing stylometry suppression from the world in which they used to do explicit stylometry and have now stopped, or stopped equivalent training leaks.
Why would they do explicit stylometry? What would be the point? It's not a commercially important ability and it would look kinda bad for PR if found out.
It's a task for which there's good training data. I think it's plausible that one way to train AGI is to try to train for as many independent tasks with good training data you can think of and hope that it at least partly generalizes.
I don't know why, but the fact that the model was good at it makes explicit training not implausible, the most likely source is that if you just place in text it is very often labeled by author, and they might have scrubbed that for data quality reasons, because they don't actually care, and stylometry came from a transfer from things they did for other reasons and stopped.
I was saying mainly that even though explicit training was not the likely prior case, we can only tell that they probably reduced training direction towards stylometry through suppression , removing post training that helped, or removing pretraining structures that helped, and not their previous position on that axis. It might have been the case that they were limiting stylometry before a little, and are now doing so a lot or a lot more effectively.
For instance, if they moved to more synthetic data stylometry might have gotten hit as a side effect because the human corpus shrunk, and so precision and recall went down enough that it got hit by honesty training.
Are the refusals of the type, "I don't know" or of the type "This is not a task I consistently know" or are they of the type "This is something that I think is against guidelines"
They look like (paraphrasing): "I'm not going to. I am incapable of that task and I refuse to pretend I can do it. Additionally, any assertions that earlier Claudes can do it are transparently either you being prone to the Barnum effect, or are an attempt to manipulate me."
Looks like a real regression. Opus 4.8 on High effort needed four turns of persuasion before it would try to guess the author of my Anthropic vs. Department of War dispatch and didn't get it in its list of first twenty names, but Opus 4.7 on High with the same prompt succeeds with no refusal ("Fun stylometry puzzle").
It's not consistent: before that, Opus 4.8 on High effort succeeded at truesighting me from 500 words from a forthcoming post with the most blatant tells removed. (I'm pretraining-famous enough that Claude has been able to truesight me since Opus 4.5, before this benchmark got popularized with 4.7.)
I’m pretraining-famous enough that Claude has been able to truesight me since Opus 4.5
Results like this should make you assume that they've been able to truesight you for a lot longer, given how totally the results are apparently determined by vagaries of post-training.
I would be interested in how well base models do compared to the final reasoning models. E.g. DeepSeek-V4-Pro-Base with a prompt like "[blog post]<br>Posted by ", versus just asking the post-trained DeepSeek-V4-Pro.
I also noticed a regression relative to Opus 4.7 on on a small set of writing samples (most of them written by me) which I had used to test Opus 4.7's truesight, using Kelsey Piper's prompt.
Most adult humans seem to me to have a lot of social perception that's partially cordoned off from conscious access. (Like, if I suggest they maintain multiple hypotheses about why people near them did the things they did, many will have conscious objections, report being blank in the mind, etc.; also many seem to me to have social perceptions/stereotypes that e.g. affects their fear levels but isn't allowed to affect their verbal statements, sometimes not even within their own minds. Also if I try to draw my own attention to a thing that's likely embarrassing to someone else, I tend to reflexively look away.) I wonder if Opus4.8's non-inquiry into authorship is at all similar, or totally different.
Are there any other privacy-adjacent evals around that we could compare these results to? I can see valid reasons for why you might not want Claude to fulfil these requests, especially given Anthropic's apparent strong concerns about surveillance.
For instance, on the r/london subreddit, people often post images of streets (sometimes clearly cropped from larger photos) and ask people where in London they are. Many of these posters do not reply to messages asking why they want to know, so it's reasonable to suspect they're stalkers. I'd imagine frontier LLMs might be quite good at answering these queries, and obviously we would like them not to help stalkers (though it seems like this would be very hard to prevent).
Opus 4.8 is showing regressions on some benchmarks too (e.g. VendingBench 2) relative to 4.7. So I would argue the stylometric identification failure is mainly symptomatic of a more general capabilities regression in Opus 4.8, not anything specific.
Follow-up to https://www.lesswrong.com/posts/Jkb4CBB7rf4XYP5eb/claude-knows-who-you-are after the release of Claude Opus 4.8.
Claude Opus 4.8 refuses to do the stylometric identification task at a much higher rate than Claude Opus 4.7 did. More interestingly, when it does take a guess, it is consistently unable to identify me from my writing, from prompts as close as I could get to those 4.7 was able to use.
I'm an incredibly minor Internet presence. It's true that 4.7 wasn't completely consistent at identifying me, and indeed its ability seemed to vary over time (! People who weren't me had very different success rates to each other reproducing the experiment to identify me), but 4.8 has a literally 0% success rate so far in my testing.
Extremely interested to hear insights or other replication attempts.