LESSWRONG
LW

scasper — LessWrong

4mo

This is an announcement and call for applications to the Workshop on Post-AGI Economics, Culture, and Governance taking place in San Diego on Wednesday, December 3, overlapping with the first day of NeurIPS 2025.

This workshop aims to bring together a diverse range of expertise to deepen our collective understanding of how the world might evolve after the development of AGI, and what we can do about it now.

The draft program features:

Anton Korinek on the Economics of Transformative AI
Alex Tamkin of Anthropic on "The fractal nature of automation vs. augmentation"
Anders Sandberg on "Cyborg Leviathans and Human Niche Construction"
Beren Millidge of Zyphra on "When does competition lead to recognisable values?"
Anna Yelizarova of Windfall Trust on "What would UBI actually entail?"
Ivan Vendrov of

... (read 339 more words →)

Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations

smallsilo

smallsilo, Ian McKenzie, Oskar Hollinsworth, Tom Tseng, Xander Davies, scasper, Aaron Tucker, Robert Kirk, Adam Gleave

7mo

Leading AI companies are increasingly using "defense-in-depth" strategies to prevent their models from being misused to generate harmful content, such as instructions to generate chemical, biological, radiological or nuclear (CBRN) weapons. The idea is straightforward: layer multiple safety checks so that even if one fails, others should catch the problem. Anthropic employs this approach with Claude 4 Opus through constitutional classifiers, while Google DeepMind and OpenAI have announced similar plans. We tested how well multi-layered defence approaches like these work by constructing and attacking our own layered defence pipeline, in collaboration with researchers from the UK AI Security Institute.

We find that multi-layer defenses can offer significant protection against conventional attacks. We wondered:... (read 1139 more words →)

Replying toCall for suggestions - AI safety course

scasper7mo

Call for suggestions - AI safety course

In addition to Elad Hazan's, I am aware of similar courses from Roger Grosse (Toronto) and Max Lamparth et al (Stanford).

Replying toReframing AI Safety as a Neverending Institutional Challenge

scasper11mo

Reframing AI Safety as a Neverending Institutional Challenge

I’m glad you think that the post has a small audience and may not be needed. I suppose that’s a good sign.

—

In the post I said it’s good that nukes don’t blow up on accident and similarly, it’s good that BSL-4 protocols and tech exist. I’m not saying that alignment solutions shouldn’t exist. I am speaking to a specific audience (e.g. the frontier companies and allies) that their focus on alignment isn’t commensurate with its usefulness. Also don’t forget the dual nature of alignment progress. I also mentioned in the post that frontier alignment progress hastens timelines and makes misuse risk more acute.

Replying toReframing AI Safety as a Neverending Institutional Challenge

scasper11mo

Reframing AI Safety as a Neverending Institutional Challenge

My bad. Didn’t mean to imply you thought it was desirable.

Replying toReframing AI Safety as a Neverending Institutional Challenge

scasper11mo

Reframing AI Safety as a Neverending Institutional Challenge

There's a crux here somewhere related to the idea that, with high probability, AI will be powerful enough and integrated into the world in such a way that it will be inevitable or desirable for normal human institutions to eventually lose control and for some small regime to take over the world. I don't think this is very likely for reasons discussed in the post, and it's also easy to use this kind of view to justify some pretty harmful types of actions.

Replying toReframing AI Safety as a Neverending Institutional Challenge

scasper11mo

Reframing AI Safety as a Neverending Institutional Challenge

Thx!

I disagree that people working on the technical alignment problem generally believe that solving that technical problem is sufficient to get to Safe & Beneficial AGI.

I won't put words in people's mouth, but it's not my goal to talk about words. I think that large portions of the AI safety community act this way. This includes most people working on scalable alignment, interp, and deception.

If we don’t solve the technical alignment problem, then we’ll eventually wind up with a recipe for summoning more and more powerful demons with callous lack of interest in whether humans live or die

Yeah, I don't really agree with the idea that getting better at alignment is necessary... (read more)

Reframing AI Safety as a Neverending Institutional Challenge

scasper

11mo

Crossposed from https://stephencasper.com/reframing-ai-safety-as-a-neverending-institutional-challenge/

Stephen Casper

“They are wrong who think that politics is like an ocean voyage or a military campaign, something to be done with some particular end in view, something which leaves off as soon as that end is reached. It is not a public chore, to be got over with. It is a way of life.”
– Plutarch

“Eternal vigilance is the price of liberty.”
– Wendell Phillips

“The unleashed power of the atom has changed everything except our modes of thinking, and we thus drift toward unparalleled catastrophe.”
– Albert Einstein

“Technology is neither good nor bad; nor is it neutral.”
– Melvin Kranzberg

“Don’t ask if artificial intelligence is good or fair, ask how it shifts power.”
–... (read 1354 more words →)

Replying toEIS XV: A New Proof of Concept for Useful Interpretability

scasper11mo

EIS XV: A New Proof of Concept for Useful Interpretability

Yea, thanks, good point. On one hand, I am assuming that the after identifying the SAE neurons of interest, they could be used for steering (this is related to section 5.3.2 of the paper). On the other hand, I am also assuming that in this case, identifying the problem is 80% of the challenge. IRL, I would assume that this kind of problem could and would be addressed by adversarial fine-tuning the model some more.

EIS XV: A New Proof of Concept for Useful Interpretability

scasper

11mo

Part 15 of 12 in the Engineer’s Interpretability Sequence

Reflecting on past predictions for new work

On October 11, 2024, I posted some thoughts on mechanistic interpretability and presented eight predictions for what I thought the next big paper on sparse autoencoders would and would not do. Then, on March 13, 2025, Anthropic released an interesting new paper: Auditing language models for hidden objectives. Other research is going on, but I consider this paper to be the first of the type that I had in mind back in October, so it’s time to revisit these predictions.

✅ 60% – Finding and fixing a harmful behavior that WAS represented in the SAE training data in a way that is competitive

... (read 738 more words →)

Replying toEIS XIV: Is mechanistic interpretability about to be practically useful?

scasper1y

EIS XIV: Is mechanistic interpretability about to be practically useful?

Thanks for the comment. My probabilities sum to 165%, which would translate to me saying that I would expect, on average, the next paper to do 1.65 things from the list, which to me doesn't seem too crazy. I think that this DOES match my expectations.

But I also think you make a good point. If the next paper comes out, does only one thing, and does it really well, I commit to not complaining too much.

EIS XIV: Is mechanistic interpretability about to be practically useful?

scasper

Part 14 of 12 in the Engineer’s Interpretability Sequence.

Is this market really only at 63%? I think you should take the over.

Only 63%? I think you should take the over.

Five tiers of rigor for safety-oriented interpretability work

Lately, I have been thinking of interpretability research as falling into five different tiers of rigor.

1. Pontification

This is when researchers claim they have succeeded in interpreting a model by definition or based on analyzing results and asserting hypotheses about them. This is a key part of the scientific method. But by itself, it is not good science. Previously in this sequence, I have argued that this standard is fairly pervasive.

2. Basic Science

This is when researchers develop an interpretation, use... (read 1829 more words →)

Replying toLatent Adversarial Training

scasper1y

Latent Adversarial Training

Some relevant papers to anyone spelunking around this post years later:

https://arxiv.org/abs/2403.05030
https://arxiv.org/abs/2407.15549

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?

scasper

Thanks to Zora Che, Michael Chen, Andi Peng, Lev McKinney, Bilal Chughtai, Shashwat Goel, Domenic Rosati, and Rohit Gandikota.

TL;DR

In contrast to evaluating AI systems under normal "input-space" attacks, using "generalized," attacks, which allow an attacker to manipulate weights or activations, might be able to help us better evaluate LLMs for risks – even if they are deployed as black boxes. Here, I outline the rationale for “generalized” adversarial testing and overview current work related to it.

See also prior work in Casper et al. (2024), Casper et al. (2024), and Sheshadri et al. (2024).

Even when AI systems perform well in typical circumstances, they sometimes fail in adversarial/anomalous ones. This is a persistent problem.

State-of-the-art AI systems tend... (read 1188 more words →)

Replying toEIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

scasper2y

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

Thanks, I think that these points are helpful and basically fair. Here is one thought, but I don't have any disagreements.

Olah et al. 100% do a good job of noting what remains to be accomplished and that there is a lot more to do. But when people in the public or government get the misconception that mechanistic interpretability has been (or definitely will be) solved, we have to ask where this misconception came from. And I expect that claims like "Sparse autoencoders produce interpretable features for large models" contribute to this.

Replying toEIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

scasper2y

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

Thanks for the comment. I think the experiments you mention are good (why I think the paper met 3), but I don't think that its competitiveness has been demonstrated (why I think the paper did not meet 6 or 10). I think there are two problems.

First, is that it's under a streetlight. Ideally, there would be an experiment that began with a predetermined set of edits (e.g., one from Meng et al., 2022) and then used SAEs to perform them.

Second, there's no baseline that SAE edits are compared to. There are lots of techniques from the editing, finetuning, steering, rep-E, data curation, etc. literatures that people use to make specific changes to... (read more)

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

scasper

Part 13 of 12 in the Engineer’s Interpretability Sequence.

TL;DR

On May 5, 2024, I made a set of 10 predictions about what the next sparse autoencoder (SAE) paper from Anthropic would and wouldn’t do. Today’s new SAE paper from Anthropic was full of brilliant experiments and interesting insights, but it underperformed my expectations. I am beginning to be concerned that Anthropic’s recent approach to interpretability research might be better explained by safety washing than practical safety work.

Think of this post as a curt editorial instead of a technical piece. I hope to revisit my predictions and this post in light of future updates.

Reflecting on predictions

See my original post for 10 specific predictions about what today’s paper would... (read 813 more words →)

157

•••

Analogies between scaling labs and misaligned superintelligent AI

scasper

TL;DR: Scaling labs have their own alignment problem analogous to AI systems, and there are some similarities between the labs and misaligned/unsafe AI.

Introduction

Major AI scaling labs (OpenAI/Microsoft, Anthropic, Google/DeepMind, and Meta) are very influential in the AI safety and alignment community. They put out cutting-edge research because of their talent, money, and institutional knowledge. A significant subset of the community works for one of these labs. This level of influence is beneficial in some aspects. In many ways, these labs have strong safety cultures, and these values are present in their high-level approaches to developing AI – it’s easy to imagine a world in which things are much worse. But the amount... (read 1031 more words →)

Deep Forgetting & Unlearning for Safely-Scoped LLMs

scasper

Thanks to Phillip Christoffersen, Adam Gleave, Anjali Gopal, Soroush Pour, and Fabien Roger for useful discussions and feedback.

TL;DR

This post overviews a research agenda for avoiding unwanted latent capabilities in LLMs. It argues that "deep" forgetting and unlearning may be important, tractable, and neglected for AI safety. I discuss five things.

The practical problems posed when undesired latent capabilities resurface.
How scoping models down to avoid or deeply remove unwanted capabilities can make them safer.
The shortcomings of standard training methods for scoping.
A variety of methods that can be used to better scope models. These can either involve passively forgetting out-of-distribution knowledge or actively unlearning knowledge in some specific undesirable domain. These methods are all based

... (read 3661 more words →)

127

Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation

Soroush Pour

Soroush Pour, rusheb, Quentin FEUILLADE--MONTIXI, Arush, scasper

Paper coauthors: Rusheb Shah, Quentin Feuillade--Montixi, Soroush J. Pour, Arush Tagade, Stephen Casper, Javier Rando.

Motivation

Our research team was motivated to show that state-of-the-art (SOTA) LLMs like GPT-4 and Claude 2 are not robust to misuse risk and can't be fully aligned to the desires of their creators, posing risk for societal harm. This is despite significant effort by their creators, showing that the current paradigm of pre-training, SFT, and RLHF is not adequate for model robustness.

We also wanted to explore & share findings around "persona modulation"^[1], a technique where the character-impersonation strengths of LLMs are used to steer them in powerful ways.

Summary

We introduce an automated, low cost way to make transferable,... (read 312 more words →)

scasper

The 6D effect: When companies take risks, one email can be very powerful.

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

Deep Forgetting & Unlearning for Safely-Scoped LLMs

[Linkpost] A survey on over 300 works about interpretability in deep networks

scasper

Upcoming Workshop on Post-AGI Economics, Culture, and Governance

Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations

Reframing AI Safety as a Neverending Institutional Challenge

EIS XV: A New Proof of Concept for Useful Interpretability

EIS XIV: Is mechanistic interpretability about to be practically useful?

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

The Engineer’s Interpretability Sequence

scasper

The 6D effect: When companies take risks, one email can be very powerful.

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

Deep Forgetting & Unlearning for Safely-Scoped LLMs

[Linkpost] A survey on over 300 works about interpretability in deep networks

scasper

Upcoming Workshop on Post-AGI Economics, Culture, and Governance

Layered AI Defenses Have Holes: Vulnerabilities and Key Recommendations

Reframing AI Safety as a Neverending Institutional Challenge

EIS XV: A New Proof of Concept for Useful Interpretability

EIS XIV: Is mechanistic interpretability about to be practically useful?

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?

EIS XIII: Reflections on Anthropic’s SAE Research Circa May 2024

The Engineer’s Interpretability Sequence

Reflecting on past predictions for new work

Five tiers of rigor for safety-oriented interpretability work

1. Pontification

2. Basic Science

TL;DR

Even when AI systems perform well in typical circumstances, they sometimes fail in adversarial/anomalous ones. This is a persistent problem.

TL;DR

Reflecting on predictions

Introduction

TL;DR

Motivation

Summary