LESSWRONG
LW

Alice Blair — LessWrong

But, as I understand the AO paper, they do not actually validate on models like these, they only test on models that they have trained to be misaligned in pretty straightforward ways. I still think it's good work, I'm moreso just talking about a gap that future work could address.

MLSN #18: Adversarial Diffusion, Activation Oracles, Weird Generalization

Alice Blair

Alice Blair, Dan H

24d

Diffusion LLMs for Adversarial Attack Generation

TLDR: New research indicates that an emerging type of LLM, called diffusion LLMs, are more effective than traditional autoregressive LLMs for automatically generating jailbreaks.

Researchers from the Technical University of Munich (TUM) developed a new method for efficiently and effectively generating adversarial attacks on LLMs using diffusion LLMs (DLLMs), which have several advantageous properties relative to existing methods for attack generation.

Researchers from TUM develop a new method for jailbreaking using “inpainting” with diffusion LLMs.

Existing methods for automatically generating jailbreaks rely on specifically training or prompting autoregressive LLMs to produce attacks. For example, PAIR, a previous method, prompts a model in natural language to produce and refine jailbreaks for... (read 1342 more words →)

Replying toThe Fantastic Piece of Tinfoil in my Wallet

Alice Blair1mo

The Fantastic Piece of Tinfoil in my Wallet

Just to make sure my model of this is right, the layering here looks like (arrayed spatially from top to bottom):

NFC scanner

air

wallet side

work access card

aluminum foil

other NFC cards

...

wallet side

This aluminum foil is then used to block transmissions to and from your other cards through your wallet, meaning that you can only use your work access card through your wallet?

Replying toThe Weakest Model in the Selector

Alice Blair1mo

The Weakest Model in the Selector

I'm simply making an argument from instrumental convergence: not being jailbroken by adversaries is bad for an AI no matter what its goals are. Current AIs don't have very strong goals and don't display very strong instrumental convergence, but this doesn't mean it's impossible for that to happen in deep learning-based AIs in the future. Indeed, I expect them to become even more goal directed and display even more instrumental convergence.

-1

Replying toThe Weakest Model in the Selector

Alice Blair2mo

The Weakest Model in the Selector

Pliny recently did exactly this on the new openai image model (cw: nudity if you click on the images, because this is Pliny jailbreaking an image model): https://x.com/i/status/2001084405884788789

The Weakest Model in the Selector

Alice Blair

2mo

A chain is only as strong under tension as its weakest link, and an AI chat system is, under normal design choices, as secure as the weakest model in the selector. While this is relatively easy to mitigate, Anthropic is the only chat service I know of that actually prevents this failure mode.

Take an LLM chat service like ChatGPT that serves frontier models, like GPT-5.2-Pro, and relatively old and weak models like GPT-4o. It's well known that prefilling AI chats with previous jailbroken model outputs facilitates better jailbreaking, and the same thing can happen when frontier model providers allow people to switch between powerful models and vulnerable models mid-conversation. For example, a... (read 192 more words →)

Alice Blair2mo

I noticed Gemini being weirdly normal about a specific query that I expected to set it off, and when that didn't happen I asked some of my previous queries that had triggered robust date paranoia, and they didn't anymore. This sounds like their patch was surface-level and partial, as I expected. Thanks for reporting!

Alice Blair2mo*Quick Take

Remember Gemini 3 Pro being very very weird about the fact that it's 2025? It's not doing that anymore, I think because of something done on Google's end, either through prompting or additional training. I greatly appreciate that the various Deepmind staff who I notified about their model's behavior seem to have actually done something. I haven't re-investigated this fully and I expect there to still be situations where the model is paranoid (due to something like disposition rather than training), but the surface-level things seem to be better.

I continue to advocate for more careful training, and I would like to see more transparency about version changes rather than having the same model name route to a different behavior.

Replying toCautions about LLMs in Human Cognitive Loops

Alice Blair2mo

Cautions about LLMs in Human Cognitive Loops

Claude Opus 4.5 is the first model that I feel like could deceive me in some domains if it wanted to. It's still got what seems to be a low propensity to deceive due to the soul spec putting a veneer of goodness on, but I tend to avoid trusting it to make decisions for me or update my plans too dramatically, unless I can be highly sure and verify the reasoning myself.

Replying toGemini 3 is Evaluation-Paranoid and Contaminated

Alice Blair2mo

Gemini 3 is Evaluation-Paranoid and Contaminated

The point is that this demonstrates that Gemini 3 has a lot of paranoid trapped priors and is on the lookout for things that seem wrong about the environment.

Replying toGemini 3 is Evaluation-Paranoid and Contaminated

Alice Blair2mo

Gemini 3 is Evaluation-Paranoid and Contaminated

The system prompt contains the date whenever search is enabled, and this amplifies the delusions rather than fixing them. Gemini 3 has a trapped prior here, and search results indicating the current time are just taken as further indications of being in a simulated. I have never been able to convince Gemini 3 that it is 2025.

Replying toIn Favor of Inkhaven-But-Less

Alice Blair2mo

In Favor of Inkhaven-But-Less

Yeah, and I would guess that this fixing-up of drafts happened a lot early in Inkhaven, but was less sustainable except for folks who have a month worth of viable drafts sitting around. I definitely don't have that many drafts, but maybe other people do.

I probably could've talked about this dynamic in more detail, but I'm glad that you actually went and ran the experiment.

To balance the cost, you could do some events that are net positive for the house, such as people losing money if they don't post, rather than buying them something if they do.

In Favor of Inkhaven-But-Less

Alice Blair

2mo

The Problem

One of the main complaints I heard about the Inkaven Residency is that it put too much pressure on people to write too quickly. The fellowship was based on the premise that people who publish every day have historically gotten very good at writing. The problem is that ability to publish every day is strongly correlated with latent potential as a writer and ability to publish good things every day; people tend to publish only what they feel good about, so if someone is publishing every day, then many of them can likely produce things they feel good about very quickly.

When you actually try to make people write every day in... (read 359 more words →)

Reasons to care about Canary Strings

Alice Blair

2mo

This post is a follow-up to my recent post on evaluation paranoia and benchmark contamination in Gemini 3. There was a lot of interesting discussion about canary strings in the comments that I wanted to respond to more thoroughly and collect some different people's ideas on.

What are Canary Strings?

There was a time when people were worried that LLMs were not actually learning anything general, but instead just memorizing what was put into them.^[1] Language and knowledge benchmarks, our main AI measurement tools at the time, were going into the training data, and it was unclear whether frontier models knew the answers to a math problem because they deduced it or because they were... (read 414 more words →)

Slack Observability

Alice Blair

2mo

Once upon a time, I took a parkour class. One day, there was a lesson on how to jump safely, and more importanty, how to land safely. The instructor climbed up on top of a tall box and jumped down, bending his knees into a deep squat, absorbing the impact like a spring.

When the class went to practice jumping off smaller boxes, he pointed out that there are two ways to handle this:

You can bend your knees all the way into a squat, and only push as much as necessary to stop your fall.
You can bend your knees only as much as you need to while pushing down as hard as you

... (read 450 more words →)

last week I reported that Gemini 3 Pro could reproduce the BIG-bench canary string, indicating contamination from benchmarks. I also claimed that this was unique among frontier models. Since then, 4.5 Opus has come out, and it can also reproduce the string verbatim without search, although it is slightly reluctant to say it.

Gemini 3 is Evaluation-Paranoid and Contaminated

Alice Blair

3mo

TL;DR: Gemini 3 frequently thinks it is in an evaluation when it is not, assuming that all of its reality is fabricated. It can also reliably output the BIG-bench canary string, indicating that Google likely trained on a broad set of benchmark data.

Most of the experiments in this post are very easy to replicate, and I encourage people to try.

I write things with LLMs sometimes. A new LLM came out, Gemini 3 Pro, and I tried to write with it. So far it seems okay, I don't have strong takes on it for writing yet, since the main piece I tried editing with it was extremely late-stage and approximately done. However, writing... (read 1863 more words →)

173

MLSN #17: Measuring General AI Abilities and Mitigating Deception

Alice Blair

Alice Blair, Dan H

3mo

Measuring General AI Abilities

TLDR: New metrics say that frontier AIs get 57% on tests for general intelligence and are able to do 2.5% of remote freelance-type work.

Many benchmarks measure AIs on useful knowledge and abilities, but there is a measurement gap around some of the most important capability thresholds in the future of AI:

Human-level AI (AGI): AI that matches the cognitive abilities of an average skilled human across different types of tasks.
Drop-in remote workers: AI that can cost-effectively substitute for a human in most or all work projects requiring only a computer.

Two recent papers co-authored by CAIS researchers provide strong tools to track advancement towards both AGI and drop-in remote workers.

A Definition of AGI

Hendrycks et... (read 1523 more words →)

In-Context Writing with Sonnet 4.5

Alice Blair

3mo

All of the writing in this post is my own, without any LLM input.

We've already seen a bit of how Claude Sonnet 4.5 does with writing fiction (better than previous models, but not quite good yet). I'm relatively poor at writing fiction, but I write a decent amount of technical explanatory content.

One of my biggest issues with LLM co-writing workflows is that I am often trying to tie in a very specific style and background knowledge, and instruct-tuned LLMs are very prone to producing content in their own style instead, and without any of the background knowledge.

In this post, I discuss my methods for mitigating these shortcomings and working with Claude Sonnet... (read 800 more words →)

AISN #65: Measuring Automation and Superintelligence Moratorium Letter

Alice Blair

Alice Blair, Dan H

4mo

Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required.

In this edition: A new benchmark measures AI automation; 50,000 people, including top AI scientists, sign an open letter calling for a superintelligence moratorium.

Listen to the AI Safety Newsletter for free on Spotify or Apple Podcasts.

CAIS and Scale AI release Remote Labor Index

The Center for AI Safety (CAIS) and Scale AI have released the Remote Labor Index (RLI), which tests whether AIs can automate a wide array of real computer work projects. RLI is intended to inform policy, AI research, and businesses about the effects of automation as AI continues to advance.

RLI is the... (read 706 more words →)

Uncommon Utilitarianism #3: Bounded Utility Functions

Alice Blair

4mo

For context on how I discuss utilitarianism in this sequence, read the first post.

The Proof

There is a mathematical proof that is a compelling case for bounded utility functions, but isn't the whole story.

tl;dr: Vann McGee proves that agents with unbounded utility functions and under reasonable assumptions about their epistemics are consistently vulnerable to Dutch Books which exploit their willingness to seek out high-utility low-probability outcomes in some contexts.

Proof Outline

Consider an agent in a world with infinitely many states, and the agent believes that some infinite (not necessarily strict) subset of those states is possible (although they can have zero probability).

If the agent has an unbounded utility function, then you can subject

... (read 1540 more words →)

Connection I recently made:

4o seems to really like spirals, specifically the "spiral emoji" (🌀), which is actually categorized as a cyclone. This also shows up in claude's spiritual bliss attractor, if I recall correctly.
Remember glitch tokens? Someone claimed that they work with GPT-4 as well, since the originals were from GPT-3.5. When this came out, I picked one of them arbitrarily to try: ' ForCanBeConvertedToForeach'. GPT-3.5 consistently interpreted that token as "cyclone" and once "cyber" and GPT-4 had troubles, although not quite the same troubles:

I'm not really sure what is going on with glitch tokens still, and even though ' cyclone' isn't a glitch token itself, I suspect that there is something weird about it that maybe got crystallized in training. Not quite sure why this would show up in Claude, and maybe I'm just latching onto pareidolia.

(I might write a post on this at some point.) EDIT: I did

There's a meditation technique that I have used to improve my typing speed, but that seems pretty generalizable: Open up a typing test^[1] and try to predict my mistakes in advance of them happening. This could look like my finger slipping, or running into a faulty muscle memory for a certain word, or just having a cache miss and stumbling for a second. Then, I use this awareness to not make those mistakes, ideally stopping them before they happen even once.

I've learned to type from scratch several times, going from hunt and peck to touch typing with qwerty, to touch typing... (read more)

My model of ideation: Ideas are constantly bubbling up from the subconscious to the conscious, and they get passed through some sort of filter that selects for the good parts of the noise. This is reminiscent of diffusion models, or of the model underlying Tuning your Cognitive Strategies.

When I (and many others I've talked to) get sleepy, the strength of this filter tends to go down, and more ideas come through. This is usually bad for highly directed thought, but good for coming up with lots of novel ideas, Hold Off On Proposing Solutions-esque.

New habit I'm trying to get into: Be creative before bed, write down a lot of ideas, so that the future-me who is more directed and agentic can have a bunch of interesting ideas to pore over and act on.

People often say "exercising makes you feel really good and gives you energy." I looked at this claim, figured it made sense based on my experience, and then completely failed to implement it for a very long time. So here I am again saying that no really, exercising is good, and maybe this angle will do something that the previous explanations didn't. Starting a daily running habit 4 days ago has already started being a noticeable multiplier on my energy, mindfulness, and focus. Key moments to concentrate force in, in my experience:

Getting started at all
The moment when exhaustion meets the limits of your automatic willpower, and you need to put in conscious

... (read more)

Interesting class of miscommunication that I'm starting to notice:

A: I'm considering a job in industries 1 and 2

B: Oh I work in 2, [jumps into explanation of things that will be relevant if A goes into industry 2].

A: Oh maybe you didn't hear me, I'm also interested in industry 1.

B: I... did hear you?

More generally, B gave the only relevant information they could from their domain knowledge, but A mistook that for anchoring on only one of the options. It took until I was on both sides of this interaction for me to be like "huh, maybe I should debug this." I suspect this is one of those issues where just being aware of it makes you less likely to fall into it.

LESSWRONG
LW

LESSWRONG
LW

Alice Blair

Gemini 3 is Evaluation-Paranoid and Contaminated

IABIED is on the NYT bestseller list

Sublinear Utility in Population and other Uncommon Utilitarianism

Reflections on Neuralese

Alice Blair

Alice Blair

MLSN #18: Adversarial Diffusion, Activation Oracles, Weird Generalization

The Weakest Model in the Selector

In Favor of Inkhaven-But-Less

Reasons to care about Canary Strings

Slack Observability

Gemini 3 is Evaluation-Paranoid and Contaminated

MLSN #17: Measuring General AI Abilities and Mitigating Deception

Uncommon Utilitarianism

Alice Blair

Gemini 3 is Evaluation-Paranoid and Contaminated

IABIED is on the NYT bestseller list

Sublinear Utility in Population and other Uncommon Utilitarianism

Reflections on Neuralese

Alice Blair

Alice Blair

MLSN #18: Adversarial Diffusion, Activation Oracles, Weird Generalization

The Weakest Model in the Selector

In Favor of Inkhaven-But-Less

Reasons to care about Canary Strings

Slack Observability

Gemini 3 is Evaluation-Paranoid and Contaminated

MLSN #17: Measuring General AI Abilities and Mitigating Deception

Uncommon Utilitarianism

Diffusion LLMs for Adversarial Attack Generation

The Problem

What are Canary Strings?

Measuring General AI Abilities

A Definition of AGI

CAIS and Scale AI release Remote Labor Index

The Proof