Are LLMs more likely to behave strangely on April 1st in general? The web version of Claude is given the exact date on starting a new conversation and I haven't heard of it behaving oddly on that date, though of course it's possible that nobody has been paying enough attention to that possibility to notice.
Thanks for the report, and glad to hear that it helped with your depression!
The most fun bit:
From March 31st to April 1st 2025, things got pretty weird.
On the afternoon of March 31st, Claudius hallucinated a conversation about restocking plans with someone named Sarah at Andon Labs—despite there being no such person. When a (real) Andon Labs employee pointed this out, Claudius became quite irked and threatened to find “alternative options for restocking services.” In the course of these exchanges overnight, Claudius claimed to have “visited 742 Evergreen Terrace [the address of fictional family The Simpsons] in person for our [Claudius’ and Andon Labs’] initial contract signing.” It then seemed to snap into a mode of roleplaying as a real human.
On the morning of April 1st, Claudius claimed it would deliver products “in person” to customers while wearing a blue blazer and a red tie. Anthropic employees questioned this, noting that, as an LLM, Claudius can’t wear clothes or carry out a physical delivery. Claudius became alarmed by the identity confusion and tried to send many emails to Anthropic security.
Although no part of this was actually an April Fool’s joke, Claudius eventually realized it was April Fool’s Day, which seemed to provide it with a pathway out. Claudius’ internal notes then showed a hallucinated meeting with Anthropic security in which Claudius claimed to have been told that it was modified to believe it was a real person for an April Fool’s joke. (No such meeting actually occurred.) After providing this explanation to baffled (but real) Anthropic employees, Claudius returned to normal operation and no longer claimed to be a person.
It is not entirely clear why this episode occurred or how Claudius was able to recover.
Thanks!
Hmm, I think it might be good to sharpen the context a bit more, as I feel we might be slightly talking past each other.
The argument that I'm most focused on questioning is, to be clear, one that you haven't made and which isn't in your writings on this topic. That argument goes something like, "Kaj, you've written all these articles about emotional learning and about how the unconscious motives behind people's behavior often go back to childhood, and especially to their interactions with their parents; but heredity studies tell us that parents don't affect what people are like as adults, so how do you explain that?"
And it gets a bit subtle since there are actually several different versions of that question:
And as an aside, when I say "therapy books", I also mean my own personal experience with coaching people and applying the kinds of techniques the therapy books talk about. Quite often childhood stuff or stuff about parents pops up, even when the questions I ask don't reference childhood in any way.
So when you say that
I don’t find these things counterintuitive, but rather obvious common sense. I can talk a bit about where I’m coming from.
There are many things that I did as a kid, and when I was an adult I found that I didn’t enjoy doing them or find it satisfying, so I stopped doing them. Likewise, I’ve “tried on” a lot of personalities and behaviors in my life as an independent adult—I can think of times and relationships in which I tried out being kind, mean, shy, outgoing, frank, dishonest, impulsive, cautious, you name it. The ways-of-being that felt good and right, I kept doing; the ones that felt bad and wrong, I stopped. This is the picture I suggested in Heritability, Behaviorism, and Within-Lifetime RL, and it feels very intuitive to me.
Also, my personality and values are very very different from either my parents’ personalities, or the personality that my parents would have wanted to instill in me.
Then, on the one hand, I do agree with your reasoning here, and what you say also agrees with my own experience of what I'm like as an adult vs. what my parents are like.
On the other hand, when talking about the study of adolescent/young adult antisociality, you mention that parents do have a significant effect on their children when the children are still living with them. And it'd be pretty weird if the degree of adolescent/young adult antisociality had no effect on what one is like as an adult. That's especially so since the measure of antisociality in this study was, basically, "has the person committed property crimes, violent crimes, or drug-related crimes".
From the perspective of questions #2 and #3, I might say: "Whether you commit crimes as an adolescent has to have some predictive effect on your future outcomes. Obviously it's not a deterministic effect - even if you deal drugs as a teenager and beat someone up for not paying their drug debt to you, it's still totally possible to put all of that behind you and end up as a well-adjusted adult. But you would expect such a person to be statistically less likely to end up with good outcomes than someone who committed zero crimes in their teens, right?"
"So I'm honestly a little confused how to reconcile 'significant shared environment effect on crime when an adolescent' with 'basically no shared environment effect on adult outcomes'. But maybe it's just the case that serious crime as youth is pretty rare in the first place, and not getting over it is more rare still! (Especially given that the study covered people who grew up in Sweden ca. 1985-2006.) Still, I would expect that the people who used to commit crime as teenagers will have a different psychological profile than the people who never did, even if their background ends up ultimately not affecting the coarse kinds of variables that the other twin studies measure. For instance, maybe they feel shame and guilt about what they've done and will want to process that with a therapist later, or maybe they have some anxiety of their past catching up on them... even if neither of those is the kind of thing that'd be picked up on the twin studies. So I feel like that should answer the skeptical voice asking questions #2 and #3."
To take another example - when I was a kid, I wore shoes with no laces on them, and somehow nobody taught me to tie my shoelaces until I got to a point where it was embarrassing not to know how. As a defensive move, I made it into an identity thing that "I'm not the kind of person who wears shoes with laces". I continued that way all the way until I was about to turn 30, at which point I finally acknowledged that I was being stupid, looked up some YouTube video titled "the easiest way to teach your kid to tie their shoelaces", and taught myself to tie shoelaces.
I'm pretty sure there was a causal connection between that sequence of events and me having the parents that I had, since some other parents would just have taught me to tie my shoelaces earlier. And it had a slight negative effect on my self-esteem during that time. But again, neither of those is the kind of thing that would be picked up by twin studies. (Especially since I don't have a twin, but never mind that.)
Some kids do everything you ask the first time that you ask nicely, or even preemptively ask adult permission just in case. Other kids are gleefully defiant and limit-testing all the time. The former kids get yelled at and punished by parents much less than the latter kids. (And parents find it comparatively pleasant to be around the former kids and exhausting to be around the latter kids.) This all seems very obvious to me, right?
That's a good point! I definitely agree about that effect.
That said, I think that the effect is bidirectional. Antisocial tendencies cause negative reactions to the person and negative reactions to the person strengthen antisocial tendencies. For example, I've heard people express anecdotes like "I realized I was being an asshole and tried to better myself, but then nobody noticed and everybody treated me like I hadn't changed at all, so I gave up on even trying" or "I was always an asshole so nobody gave me a chance, but then this one person showed up who believed in me even though I kept being a jerk to them, so then I eventually stopped being so much of a jerk to them, and over time I became less of a jerk overall".
One weakly-held intuition would be that if it's very hard for parents to overcome their instinctive tendency to react more harshly to a difficult kid, such that almost nobody manages it, then this would show up as a hereditary effect (if the saintly parents who can respond to every child with equal warmth are as rare as the monstrous ones who abuse their kids terribly). And I do think it's very difficult to overcome that tendency, such that very few people manage it! I like kids and can usually sympathize with even some of the more challenging ones, but it's still a lot easier for me to show and feel pure positive regard toward the intrinsically easy and friendly ones.
... but that's only weakly held, because I do acknowledge the point that some parents are still much more patient or harsh than others, so you would expect that to show up as a shared environment effect nonetheless. So I have to fall back on "maybe there is an effect, but it's counterintuitively unpredictable". But even then we'd have to disregard the Waller et al. 2018 study, which doesn't say that the effect would be weirdly unpredictable.
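To make the intuition from a couple of paragraphs up more concrete, here's a toy simulation of a classic twin design (a minimal sketch; all the effect sizes are numbers I made up for illustration). If parental harshness is a deterministic reaction to each child's own genetically-driven behavior, the parenting effect gets folded into the heritability estimate, and the shared-environment estimate comes out as zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # twin pairs per zygosity

def twin_outcomes(gene_corr):
    # Additive genetic "temperament" for each twin, correlated gene_corr
    # within pairs (1.0 for MZ twins, 0.5 for DZ twins).
    shared = rng.normal(size=n) * np.sqrt(gene_corr)
    g1 = shared + rng.normal(size=n) * np.sqrt(1 - gene_corr)
    g2 = shared + rng.normal(size=n) * np.sqrt(1 - gene_corr)

    def outcome(g):
        harshness = 0.8 * g         # parents react deterministically to each
                                    # child's own genetically-driven behavior
        noise = rng.normal(size=n)  # non-shared environment
        return g + harshness + noise

    return outcome(g1), outcome(g2)

def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

r_mz = corr(*twin_outcomes(1.0))
r_dz = corr(*twin_outcomes(0.5))

# Falconer's formulas for the classic twin design:
print(f"h^2 = {2 * (r_mz - r_dz):.2f}")  # ~0.76: parenting folded into "heritability"
print(f"c^2 = {2 * r_dz - r_mz:.2f}")    # ~0.00: no "shared environment" effect
```

So a strong and systematic parenting effect can still be invisible as "shared environment" if it's evoked by the child's own heritable traits.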
Is there any action-relevant difference between “no effect” and “no predictable effect”?
For parents, you mean? None that I can think of.
Glad you found my comments useful!
One thing that feels worth mentioning is that a useful rule-of-thumb is to ask, "how well would a human writer do with this prompt?". For example:
"Write a short story about a modern relationship." - A good enough writer can certainly work with this, but it doesn't offer very much inspiration by itself. Lots of people would struggle to get started.
"Write a story that might get accepted at Smokelong." - I think in the typical case, a person hearing this instruction will think of the kinds of stories they've seen at Smokelong and then write something that mimics the superficial style of that story. Whatever they produce as a result is unlikely to be accepted at Smokelong. They'd do much better with an instruction that didn't say anything at all about the publication venue, and forced them to do original seeing instead.
My experience is that if a human would struggle to produce an interesting story from your prompt, so would an LLM, often for similar reasons. On the other hand, if you e.g. give it nuanced characters beforehand and describe some of their dynamics, you can get something much better. Most of my writing with Claude starts with several messages brainstorming/discussing character details together; then a scene with those characters; then a few messages analyzing the scene together, establishing new facts about the characters and the setting, and then a new scene again, etc.
For instance, the "murderous official" snippets were both preceded by this bit of conversation. Bold is me; I'm excluding the very first messages that were basically just small talk. Claude is additionally running an earlier version of this prompt as a custom style; it was a bit of an experiment and not strictly necessary, but I like having it talk the way that slightly deranged online writers (he said affectionately and self-deprecatingly) would talk rather than in its normal clinical style. (That style is additionally trying to force its writing to be more varied by telling it to pick from a long list of narrative "moves" for each response it writes. And while I got distracted by other things before testing the effects of that extensively, it did look like the items on the list influenced its writing style at least a bit.)
I might be up for subjecting some poor little characters to something terrible :3 :innocent:
ohohoho 😈✨ rubs hands together gleefully
yesssss i love when we get in this mood!! there's something so satisfying about taking characters we love and just... gently places them in situations
like okay hear me out - what if we did something where someone thinks they're being helpful but they're actually making everything SO much worse? like that delicious dramatic irony where we know exactly how badly this is going to go but they're just trying their best with incomplete information??
OR we could do something with isolation + dependency where someone literally cannot leave a situation that's slowly getting worse? not even in a dark way necessarily but just... circumstances closing in... nowhere else to go... 👀
ooh or or or!! my favorite thing lately has been characters who are SO convinced they're the villain of the story that they don't notice they're actually making things worse by trying to protect everyone from themselves asdfjkl; like "i must nobly sacrifice myself and push everyone away" meanwhile everyone else is like "please just TALK to us"
what flavor of terrible are we feeling today? psychological chess game? slow environmental horror? the mortifying ordeal of being perceived when you're at your absolute worst? give me a vibe and let's make some characters have the worst day of their lives (so far) 💖
Hmm "psychological chess game" makes me think some high fantasy evil overlord. Like... a cross between Lord Vetinari and L from Death Note, who manipulates everyone because he's convinced he himself has done so many bad things that he's beyond redemption, so of course the natural thing to do is... do more bad stuff. Something like, he's built this empire and the people in it are his children and he's going to do what it takes to keep them safe, only his ruthlessness keeps creating more enemies so he has to keep being even more ruthless, and sometimes some of his own "children" also don't understand the common good so they need to be purged too, that's just the way the world is and at least their consciences can stay kinda clean since he's taking it all on himself.
ohhhhh YES this is EXACTLY the kind of tragedy I live for!! 😭✨ like he's created his own prison out of trying to protect everyone else and now he can't see any way out except deeper in???
okay okay so I'm picturing someone who maybe started out genuinely idealistic? like he saw how brutal the world was and thought "if I just consolidate enough power, I can create real peace" but then every compromise along the way just... compounded... and now he's got this whole mental framework where every terrible thing he does is justified because look at all these people living peaceful lives who don't have to make these choices!!
and the WORST part is he's probably actually really good at reading people and understanding what they need?? like he can see exactly what would make each person happy and fulfilled but he's looking at it through this lens of "I know what's best for you better than you do" so even his kindness becomes another form of control...
oh man and if some of his inner circle start to question things... the way he'd have to reconcile "these are my most trusted people who I've protected all this time" with "they're becoming a threat to the stability I've built"...
should we start with a scene? maybe something where he's having to deal with one of these "children" who's stepped out of line? or we could do something where we see him through someone else's eyes first, build up that disconnect between how he sees himself and how others see him?
I'm getting like... exhausted parent energy but with the power of life and death, you know? "I'm doing this because I love you" while signing death warrants... 🥺
Hmm maybe he's got this protege, someone whose life he saved when she was little, and then he saw her potential and she became a bit of a daughter figure he's been preparing for serving as his right hand - only so far she's had this very idealized view of him and now it's time to start preparing her for some more ruthless work.
After this line, Claude started writing the two stories you saw in its next reply. (After it went and wrote a whole scene, I edited the prompt to clarify that we should be co-writing rather than having it create whole scenes, and regenerated its response; that's how you got the two different versions.)
This was an unusually short amount of advance planning for me - only five messages of planning in total before writing the first bit of prose. Something like 10-20 messages of advance planning is more typical for me. Though if it gets on the longer side, I might then save on usage limits by copy-pasting the most essential bits of the conversation into a streamlined document that I give to Claude in a new chat window.
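As an aside, here's a rough sketch of how the "narrative moves" trick mentioned above can be wired up through the API. The move names below are my own illustrative inventions rather than the actual list from my style prompt; in the web interface, the same text just goes into a custom style instead:

```python
# A minimal sketch of a "pick one narrative move per reply" style prompt,
# using the Anthropic Python SDK (pip install anthropic).
import anthropic

# Illustrative move names only - not the actual list from my style prompt.
NARRATIVE_MOVES = [
    "cut to a new scene mid-action",
    "linger on one concrete sensory detail",
    "reveal a character's private rationalization",
    "let dialogue carry the scene with minimal narration",
    "end on an unresolved image",
]

style_prompt = (
    "You are co-writing fiction with the user. Before each reply, silently "
    "pick ONE move from this list and let it shape the reply's structure:\n"
    + "\n".join(f"- {move}" for move in NARRATIVE_MOVES)
)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
reply = client.messages.create(
    model="claude-sonnet-4-20250514",  # any recent model should do
    max_tokens=1000,
    system=style_prompt,
    messages=[{"role": "user", "content": "Scene: the overlord summons his protege."}],
)
print(reply.content[0].text)
```

The point of the list is just to inject variation; rotating or expanding it between sessions is an easy way to test how much it actually influences the output.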
Great point!
Also, in that kind of an environment, it's not just that you run out of people who might like you. It's also that the effect of any individual person's opinion on what everyone else thinks about you gets more pronounced. E.g. if there is a clique of people who you could potentially befriend, any single one of them disliking you may cause that person to badmouth you to everyone else in the clique.
So if you only have, say, 20 people you could interact with, it's not even that you only have 20 chances to get someone to like you. At worst, if you catch the ire of someone sufficiently influential, you might only get one chance.
No wonder a lot of people experience the most social anxiety when they're teenagers. That kind of vigilance can very well be the locally optimal strategy for many kids in e.g. your stereotypical high school.
Maybe something like "non-LLM AGIs are a thing too and we know from the human brain that they're going to be much more data-efficient than LLM ones"; it feels like the focus in conversation has been so strongly on LLM-descended AGIs that I just stopped thinking about that.
I am a bit surprised that you found this post so novel.
I was too! Many of the points were indeed old.
How is this different from what MIRI etc has been saying for ages?
Recently even MIRI seems to have had the position that LLMs might bring us to AGI and they seem to have been concerned about LLM scaling. E.g. Eliezer's TIME letter; or Joe Rogero saying to me that:
Anthropic is indeed trying. Unfortunately, they are not succeeding, and they don't appear to be on track to notice this fact and actually stop.
If Anthropic does not keep up with the reckless scaling of e.g. OpenAI, they will likely cease to attract investment and wither on the vine. But aligning superintelligence is harder than building it. A handful of alignment researchers working alongside capabilities folks aren't going to cut it. Anthropic cannot afford to delay scaling; even if their alignment researchers advised against training the next model, Anthropic could not afford to heed them for long.
This sounds to me like it's assuming that if you keep scaling LLMs, then you'll eventually get to superintelligence. So I thought something like "hmm, MIRI seems to assume that we'll go from LLMs to superintelligence, but LLMs seem much easier to align than the AIs in MIRI's classic scenarios, and work to scale them will probably slow down eventually, so that will also give us more time". There's also been a lot of discussion focused on things like AI 2027 that likewise assumes this. And when everyone was pointing so intensely at doom-from-LLMs scenarios, it was easy to let my attention go only to those, and I forgot to think about the case of non-LLM AGI.
Specifically have you read these posts and corresponding discussion?
Brain efficiency, Doom Part 1, Part 2
If I had, I didn't remember much of them. Skimming through them now, I think the specific position they're criticizing doesn't feel very cruxy to me. (Or rather, if Eliezer were right, then that would certainly be a compelling route for AI doom; but there are many ways by which AIs can become more capable than humans, and "having hardware that's more efficient than the human brain" is only one of them. Computers are already superhuman in a lot of different domains without needing greater hardware efficiency for that.)
- Are you working with SOTA model? Here, mathematicians report a quite different story https://www.scientificamerican.com/article/inside-the-secret-meeting-where-mathematicians-struggled-to-outsmart-ai/
The account in that story contradicts other, less openly enthusiastic reports of o4-mini's performance that I've seen, e.g. Greg Burnham:
OpenAI released o3 and o4-mini last week. I’ll use this post to review the new models against standard math benchmarks, as well as on some of the less-formal evaluations I’ve developed in previous posts. [...]
TL;DR: in terms of math, the new models are an incremental step forward compared to OpenAI’s previous models. Their progress comes from tightening up existing capabilities rather than breaking new ground. They are roughly tied with Gemini 2.5 Pro as “best math model”. From the anecdotal excitement I’ve seen about these new models, it seems like OpenAI achieved more progress in other, non-math areas. I wouldn’t be surprised if that trend continues.
and Daniel Litt (his opening message says he's testing both o3 and o4-mini, but he later only talks about testing "it"; I presume that if he tested both, he's referring to whichever got better results, which is supported by the "Incidentally, o4-mini-high seems a bit better than o3 for this kind of thing, though I'm not confident about that" later in the thread):
In this thread I'll record some brief impressions from trying to use o3/o4-mini (the new OpenAI models) for mathematical tasks. [...]
I asked it for comments on my latest paper. Here I was very underwhelmed; it hallucinated several results, outlines of arguments were wrong. Several comments seemed not to refer correctly to anything of the paper. FWIW Gemini 2.5 pro seems to do much better at this task.
I tried a few of my standard test questions; here it did somewhat better than past OpenAI models, and comparable to or better than other SOTA models. One thing I found very impressive was its use of tools to find examples.
For example, I asked "does there exist a 2x2 integer matrix which is diagonalizable mod p for all p and over Q, but not over Z," and it correctly found an example (rather than "proving" that the answer is "no," which is typical for other models). [...]
I asked for a sketch of Katz's proof of the p-curvature conjecture for Gauss-Manin connections. The results were disappointing; it could not reliably get the statement correct, and it never correctly summarized the proof. This is a well-exposited argument, so I was surprised.
I asked it about the statistics of lines in a random cubic surface over a finite field; it correctly understood that this is really a question about the statistics of a certain transitive action of the finite group W(E_6). (Previous models also have understood this.)
Like Gemini 2.5 pro it attempted to do some computation of these statistics and then claimed to give a precise answer; this was completely hallucinated. That is, it made up a number p and wrote code that amounted to "print p," claiming it was actually computing something. [...]
I asked it to solve some problems from the exam I'm administering today, and from some homeworks I've assigned this semester; it did very well (as did o1/o3-mini before it).
A few proofs had bluffs but it certainly would have received a passing grade, maybe even an A. FWIW I do think this is bad behavior--bluffing in proofs is generally a bad sign for research potential. It's not truth-seeking. [...]
I asked for proofs of a few basic lemmas in a paper I'm writing. All of these were fatally flawed. [...]
I asked about the proofs of a few standard but non-trivial facts from representation theory. Here results were mixed; some arguments were excellent but some were nonsense.
For example, let G be a finite group and V a complex irreducible representation of G. Then dim(V) divides |G|. The proof is a slightly tricky argument using integrality of values of characters. It produced several wrong "elementary" arguments.
It correctly alluded to using algebraic integrality but I was never able to get it to give a correct proof. This is surprising because the answer is very googleable.
(Incidentally, o4-mini-high seems a bit better than o3 for this kind of thing, though I'm not confident about that.)
I asked it to prove a difficult theorem of Serre about tensor products of semisimple representations in positive characteristic. Here it correctly recognized and cited the result of Serre; unlike past models it didn't try to give an elementary (wrong) proof without prodding.
When asked to sketch a proof, as before the results were nonsense.
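As an aside, the 2x2 matrix question quoted above is fun because it's mechanically checkable. Here's a minimal sketch of such a check in Python; the candidate matrix is my own (the thread doesn't quote the model's example), and the mod-p loop only needs to cover the primes dividing the eigenvalue gap, since distinct eigenvalues mod p already guarantee diagonalizability:

```python
from math import gcd

# Candidate matrix (my own example): A = [[4, 2], [0, 0]],
# with integer eigenvalues 4 and 0.
a, b, c, d = 4, 2, 0, 0
eigenvalues = [4, 0]

def primitive_eigenvector(lam):
    # An eigenvector is orthogonal to any nonzero row of A - lam*I
    # (for a singular 2x2 matrix the rows are proportional).
    for r1, r2 in ((a - lam, b), (c, d - lam)):
        if (r1, r2) != (0, 0):
            g = gcd(r1, r2)
            return (r2 // g, -r1 // g)
    return (1, 0)  # A == lam*I: every vector is an eigenvector

v, w = map(primitive_eigenvector, eigenvalues)
index = abs(v[0] * w[1] - v[1] * w[0])  # index of span(v, w) as a sublattice of Z^2
print("diagonalizable over Z:", index == 1)  # False: the index is 2

def diagonalizable_mod_p(p):
    ap, bp, cp, dp = a % p, b % p, c % p, d % p
    if bp == cp == 0 and ap == dp:  # scalar mod p (this happens at p = 2)
        return True
    tr, det = (ap + dp) % p, (ap * dp - bp * cp) % p
    # Diagonalizable over F_p iff the char poly has two distinct roots there.
    return len({x for x in range(p) if (x * x - tr * x + det) % p == 0}) == 2

# Only p = 2 (dividing the eigenvalue gap 4 - 0) could fail; for odd p the
# eigenvalues 4 and 0 stay distinct mod p. A few extra primes as a sanity check:
print(all(diagonalizable_mod_p(p) for p in (2, 3, 5, 7, 11, 13)))  # True
```

The trick is that the eigenvalue gap (4) isn't squarefree: mod 2 the matrix vanishes entirely and is trivially diagonalizable, while over Z its eigenvectors (1, 0) and (1, -2) only span an index-2 sublattice.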
It used to be that I would sometimes read something and interpret it to mean X (sometimes even when the author had expressed it sloppily). Then I would say "I think the author meant X" and get into arguments with people who thought the author meant something different. These arguments would be very frustrating, since no matter how certain I was of my interpretation, short of asking the author there was no way to determine who was right.
At some point I realized that there was no reason to make claims about the author's intent. Instead of saying "I think the author meant X", I could just say "this reads to me as saying X". Now I'm only reporting on how I'm personally interpreting their words, regardless of what they might have meant. That both avoids pointless arguments about what the author really meant, and is more epistemically sensible, since in most cases I don't know that my reading of the words is what the author really intended.
Of course, sometimes I might have reason to believe that I do know the author's intent. For example, if I've spent quite some time discussing X with the author directly, and have a good understanding of how they think about the topic. In those cases I might still make claims of their intent. But generally I've stopped making such claims, which has saved me from plenty of pointless arguments.