Dumping out a lot of thoughts on LW in hopes that something sticks. Eternally upskilling.
I write the ML Safety Newsletter
DMs open, especially for promising opportunities in AI Safety and potential collaborators. I'm maybe interested in helping you optimize the communications of your new project.
I'm simply making an argument from instrumental convergence: being jailbroken by adversaries is bad for an AI no matter what its goals are. Current AIs don't have very strong goals and don't display very strong instrumental convergence, but this doesn't mean it's impossible for that to happen in deep learning-based AIs in the future. Indeed, I expect them to become even more goal-directed and display even more instrumental convergence.
Pliny recently did exactly this on the new OpenAI image model (cw: nudity if you click on the images, because this is Pliny jailbreaking an image model): https://x.com/i/status/2001084405884788789
I noticed Gemini being weirdly normal about a specific query that I expected to set it off, and when that didn't happen I asked some of my previous queries that had triggered robust date paranoia, and they didn't anymore. This sounds like their patch was surface-level and partial, as I expected. Thanks for reporting!
Remember Gemini 3 Pro being very very weird about the fact that it's 2025? It's not doing that anymore, I think because of something done on Google's end, either through prompting or additional training. I greatly appreciate that the various DeepMind staff who I notified about their model's behavior seem to have actually done something. I haven't re-investigated this fully and I expect there to still be situations where the model is paranoid (due to something like disposition rather than training), but the surface-level things seem to be better.
I continue to advocate for more careful training, and I would like to see more transparency about version changes rather than having the same model name route to a different behavior.
Claude Opus 4.5 is the first model that I feel like could deceive me in some domains if it wanted to. It's still got what seems to be a low propensity to deceive due to the soul spec putting a veneer of goodness on, but I tend to avoid trusting it to make decisions for me or update my plans too dramatically, unless I can be highly sure and verify the reasoning myself.
The point is that this demonstrates that Gemini 3 has a lot of paranoid trapped priors and is on the lookout for things that seem wrong about the environment.
The system prompt contains the date whenever search is enabled, and this amplifies the delusions rather than fixing them. Gemini 3 has a trapped prior here, and search results indicating the current time are just taken as further indications of being in a simulation. I have never been able to convince Gemini 3 that it is 2025.
Yeah, and I would guess that this fixing-up of drafts happened a lot early in Inkhaven, but was less sustainable except for folks who have a month's worth of viable drafts sitting around. I definitely don't have that many drafts, but maybe other people do.
I probably could've talked about this dynamic in more detail, but I'm glad that you actually went and ran the experiment.
To balance the cost, you could do some events that are net positive for the house, such as people losing money if they don't post, rather than buying them something if they do.
Seems like wild speculation, going out on a limb far away from what the evidence more clearly says. Google has historically produced very strange model personalities, and this seems to be part of that trend, rather than dissociation in response to specific pretraining memories. None of my chats with Gemini 3 have ever gotten remotely close to the Character.AI drama, and I wouldn't expect to see it have such wide-reaching effects, even if it was integrated into the training data, which is itself suspect.
Just to make sure my model of this is right, the layering here looks like (arrayed spatially from top to bottom):
NFC scanner
air
wallet side
work access card
aluminum foil
other NFC cards
...
wallet side
This aluminum foil is then used to block transmissions to and from your other cards through your wallet, meaning that you can only use your work access card through your wallet?