Alice Blair — LessWrong

The Fantastic Piece of Tinfoil in my Wallet

Alice Blair14d20

Just to make sure my model of this is right, the layering here looks like (arrayed spatially from top to bottom):

NFC scanner

air

wallet side

work access card

aluminum foil

other NFC cards

...

wallet side

This aluminum foil is then used to block transmissions to and from your other cards through your wallet, meaning that you can only use your work access card through your wallet?

The Weakest Model in the Selector

Alice Blair14d3-1

I'm simply making an argument from instrumental convergence: being jailbroken by adversaries is bad for an AI no matter what its goals are. Current AIs don't have very strong goals and don't display very strong instrumental convergence, but this doesn't mean it's impossible for that to happen in deep learning-based AIs in the future. Indeed, I expect them to become even more goal-directed and display even more instrumental convergence.

The Weakest Model in the Selector

Alice Blair15d20

Pliny recently did exactly this on the new OpenAI image model (cw: nudity if you click on the images, because this is Pliny jailbreaking an image model): https://x.com/i/status/2001084405884788789

Alice Blair's Shortform

Alice Blair18d20

I noticed Gemini being weirdly normal about a specific query that I expected to set it off, and when that didn't happen I asked some of my previous queries that had triggered robust date paranoia, and they didn't anymore. This sounds like their patch was surface-level and partial, as I expected. Thanks for reporting!

Alice Blair's Shortform

Alice Blair21d*268

Remember Gemini 3 Pro being very, very weird about the fact that it's 2025? It's not doing that anymore, I think because of something done on Google's end, either through prompting or additional training. I greatly appreciate that the various DeepMind staff whom I notified about their model's behavior seem to have actually done something. I haven't re-investigated this fully and I expect there to still be situations where the model is paranoid (due to something like disposition rather than training), but the surface-level things seem to be better.

I continue to advocate for more careful training, and I would like to see more transparency about version changes rather than having the same model name route to a different behavior.

Cautions about LLMs in Human Cognitive Loops

Alice Blair1mo40

Claude Opus 4.5 is the first model that I feel like could deceive me in some domains if it wanted to. It's still got what seems to be a low propensity to deceive due to the soul spec putting a veneer of goodness on, but I tend to avoid trusting it to make decisions for me or update my plans too dramatically, unless I can be highly sure and verify the reasoning myself.

Gemini 3 is Evaluation-Paranoid and Contaminated

Alice Blair1mo20

The point is that this demonstrates that Gemini 3 has a lot of paranoid trapped priors and is on the lookout for things that seem wrong about the environment.

Gemini 3 is Evaluation-Paranoid and Contaminated

Alice Blair1mo20

The system prompt contains the date whenever search is enabled, and this amplifies the delusions rather than fixing them. Gemini 3 has a trapped prior here, and search results indicating the current time are just taken as further indications of being in a simulation. I have never been able to convince Gemini 3 that it is 2025.

In Favor of Inkhaven-But-Less

Alice Blair1mo40

Yeah, and I would guess that this fixing-up of drafts happened a lot early in Inkhaven, but was less sustainable except for folks who have a month's worth of viable drafts sitting around. I definitely don't have that many drafts, but maybe other people do.

I probably could've talked about this dynamic in more detail, but I'm glad that you actually went and ran the experiment.

To balance the cost, you could do some events that are net positive for the house, such as people losing money if they don't post, rather than buying them something if they do.

Gemini 3 is Evaluation-Paranoid and Contaminated

Alice Blair1mo10

Seems like wild speculation, going out on a limb far away from what the evidence more clearly says. Google has historically produced very strange model personalities, and this seems to be part of that trend, rather than dissociation in response to specific pretraining memories. None of my chats with Gemini 3 have ever gotten remotely close to the Character.AI drama, and I wouldn't expect to see it have such wide-reaching effects, even if it was integrated into the training data, which is itself suspect.

Sequences

Posts

Wikitag Contributions

Comments
