Replying toOpenAI's CBRN tests seem unclear

This post highlighted ways in which OAI's CBRN evals were uninformative about how close o1-preview was to action-relevant thresholds. I think it's valuable for increasing public knowledge about model capabilities/risk profiles and keeping developers accountable.

I particularly appreciate the post breaking down the logic of rule-out evals ("is this test clearly easier than [real world threat model]" and "does the model clearly fail this test"). This frame still seems useful for assessing system cards in 2025.

Replying tothe case for CoT unfaithfulness is overstated

Alexa Pan2moReview for 2024 Review

the case for CoT unfaithfulness is overstated

As of 2025, I still find this post (and comments) helpful for understanding different things people mean by "CoT unfaithfulness" and why these things still largely aren't reasons to give up on CoT monitoring for current models.

Replying toWould catching your AIs trying to escape convince AI developers to slow down or undeploy?

Alexa Pan2moReview for 2024 Review

Would catching your AIs trying to escape convince AI developers to slow down or undeploy?

I think the title question remains important and neglected. I encourage people to revisit this question as we get closer to AIs that (are able to) exhibit scary/egregiously misaligned behavior.

Specifically, it might be good to have answers to questions like:

What do we expect to be the best kind of evidence for convincing developers to shutdown/undeploy?
- Maybe there is more compelling evidence than “your AI writing code to escape”
- I’m reminded of hearing about how the public reception of the OAI/Apollo scheming paper was surprising relative to some people's expectations (people are more freaked out by weird CoT, less freaked out by actual scheme-y behavior).
What are good plans for AI labs when they have such

Alexa Pan2moReview for 2024 Review

How to prevent collusion when using untrusted models to monitor each other

I think this post does a good job of arguing that preventing collusion is an important open problem for control / untrusted monitoring in particular and suggesting strategies for this.

I weakly wish this post were broken down into smaller chunks which separately explain "a basic design for untrusted monitoring" and collusion-specific analysis.

It's perhaps important to note that this post conservatively assumes that an untrusted monitor will consistently try to collude with the policy when "...the untrusted model you’re using as a monitor is very similar to the model you’re using as the policy, e.g. they might be different fine-tunes of models that have already been pre- and post- trained. In principle, you... (read more)

Replying toSafety isn’t safety without a social model (or: dispelling the myth of per se technical safety)

Alexa Pan2moReview for 2024 Review

Safety isn’t safety without a social model (or: dispelling the myth of per se technical safety)

I think the high-level point in this post is correct and important, though not super action-guiding.

I find it interesting to notice in 2025 that this post failed to mention the "obvious" social failure mode of extreme power concentration/power grabs/coups. Seems like a missed opportunity for this particular reflection.

Replying toWin/continue/lose scenarios and execute/replace/audit protocols

Alexa Pan2moReview for 2024 Review

Win/continue/lose scenarios and execute/replace/audit protocols

This post summarizes the basic structure of high-stakes control evaluations ("win/continue/lose scenarios") and the building blocks of blue team strategies ("E/R/A protocols").

I found this post quite helpful for understanding control evals. I somewhat strongly recommend people new to control read this post before reading high-stakes control papers, e.g. the original control paper and Ctrl-Z.

Nit: It's not totally clear what counts as a control "protocol". According to the OG post on control, this is "a proposed plan for training, evaluation, and deployment (which we'll refer to as a whole as a protocol)" -- but ERA protocols exclude training (including online training in deployment), which we might want to do for the untrusted policy and/or monitor. It might be nice to reserve the word "protocol" for "everything we might want to do over the lifetime of a model possibly able to cause concentrated failures" and have a different word for "a set of things we typically do together in deployment".

Replying toThe case for ensuring that powerful AIs are controlled

Alexa Pan2moReview for 2024 Review

The case for ensuring that powerful AIs are controlled

This post defined control as a class of safety techniques, argued for substantially investing in it, and proposed some actions this could imply. I think this is an incredibly useful type of post: I would like to see "the case for x" for many more safety agendas.

1.5 years later, I think this post remains ~the best introduction to control if one already has fair context on technical safety. Some key claims that have aged well (not a comprehensive summary):

Control is a distinct/complementatry approach to alignment, and aims to robustly reduce the probability of catastrophic outcomes from worst-case schemers
Control research mostly comprise technique iteration and safety evaluations
Control evaluations are fundamentally dangerous capability evaluations which

... (read 383 more words →)

Replying toWill misaligned AIs know that they're misaligned?

Alexa Pan2mo

Will misaligned AIs know that they're misaligned?

Thanks, I agree with you here! Do you think that's an assumption (about what misalignment looks like) we should generally work more with?

Supposing it is, I'm not immediately sure how this would change the rest of the argument. I guess considerations for "should we intervene on persona A's knowledge of persona B" could turn out to be quite different, e.g. one might not be particularly worried about this knowledge eliciting persona B or making the model more dangerously coherent across contexts.

Replying toWill misaligned AIs know that they're misaligned?

Alexa Pan2mo

Will misaligned AIs know that they're misaligned?

Oops, fixed

Will misaligned AIs know that they're misaligned?

Alexa Pan

2mo

Epistemic status: exploratory, speculative.

Let’s say AIs are “misaligned” if they (1) act in a reasonably coherent, goal-directed manner across contexts and (2) behave egregiously in some contexts.^[1]For example, if Claude X acts like an HHH assistant before we fully hand off AI safety research, but tries to take over as soon as it seems like handoff has occurred, Claude X is misaligned.

Let’s say AIs are "unknowingly misaligned” if they are less confident of their future egregious behavior and goals characterizing this behavior than human overseers are. For example, Claude X as HHH assistant might not be able to predict that it’ll want to take over once it has the opportunity, but we... (read 2603 more words →)

What would an IRB-like policy for AI experiments look like?

Alexa Pan

3mo

I learned about IRBs while helping run a forecasting study last year: apparently, to run an experiment involving people, you need to get the participants’ consent and get approval from a research ethics committee who checks that your experiments aren’t too harmful for participants.

IRBs in their current form are defective and annoying, but the idea is pretty reasonable. This reasonable idea was not implemented in the U.S. until roughly the 1980s. Lots of flagrant abuses occurred before then.

It seems possible that we should eventually have something like an IRB for certain kinds of experiments involving AI systems. This “IRB for AIs” (IRBAI), mandatory or not, would suggest ways to mitigate downsides in... (read 4384 more words →)

Replying toSonnet 4.5's eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals

Alexa Pan3mo

Sonnet 4.5's eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals

Thanks for flagging, it's interesting to know that this idea has consideration traction. Maybe we should have responded in more detail given this.

Our response was mainly the below:

If models can’t confidently distinguish test and deployment, they may be super paranoid about being tested and behave well out of caution in deployment. This is still bad news, since it suggests that more capable models might behave dangerously in deployment, but doesn’t provide clear warning signs or examples of misbehavior in the wild to study.

Also, this doesn't seem like a scalable solution to more capable models which will probably distinguish test/deployment at some point.

Sonnet 4.5's eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals

Alexa Pan

Alexa Pan, ryan_greenblatt

3mo

According to the Sonnet 4.5 system card, Sonnet 4.5 is much more likely than Sonnet 4 to mention in its chain-of-thought that it thinks it is being evaluated; this seems to meaningfully cause it to appear to behave better in alignment evaluations. So, Sonnet 4.5’s behavioral improvements in these evaluations may partly be driven by a growing tendency to notice and game evaluations rather than genuine alignment. This is an early example of a phenomenon that is going to get increasingly problematic: as evaluation gaming increases, alignment evaluations become harder to trust.^[1]

To elaborate on the above: Sonnet 4.5 seems far more aware of being evaluated than previous models. In some of Anthropic’s evaluations, Sonnet 4.5... (read 3977 more words →)

144

AI Safety Newsletter #43: White House Issues First National Security Memo on AI Plus, AI and Job Displacement, and AI Takes Over the Nobels

Corin Katzke

Corin Katzke, Corin Katzke, Alexa Pan, Dan H

Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required.

Listen to the AI Safety Newsletter for free on Spotify or Apple Podcasts.

Subscribe here to receive future versions.

White House Issues First National Security Memo on AI

On October 24, 2024, the White House issued the first National Security Memorandum (NSM) on Artificial Intelligence, accompanied by a Framework to Advance AI Governance and Risk Management in National Security.

The NSM identifies AI leadership as a national security priority. The memorandum states that competitors have employed economic and technological espionage to steal U.S. AI technology. To maintain a U.S. advantage in AI, the... (read 1707 more words →)

AI Safety Newsletter #42: Newsom Vetoes SB 1047 Plus, OpenAI’s o1, and AI Governance Summary

Corin Katzke

Corin Katzke, Corin Katzke, Julius, Alexa Pan, andrewz, Dan H

Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required.

Listen to the AI Safety Newsletter for free on Spotify or Apple Podcasts.

Subscribe here to receive future versions.

Newsom Vetoes SB 1047

On Sunday, Governor Newsom vetoed California’s Senate Bill 1047 (SB 1047), the most ambitious legislation to-date aimed at regulating frontier AI models. The bill, introduced by Senator Scott Wiener and covered in a previous newsletter, would have required AI developers to test frontier models for hazardous capabilities and take steps to mitigate catastrophic risks. (CAIS Action Fund was a co-sponsor of SB 1047.)

Newsom states that SB 1047 is not... (read 1545 more words →)

AI Safety Newsletter #40: California AI Legislation Plus, NVIDIA Delays Chip Production, and Do AI Safety Benchmarks Actually Measure Safety?

Corin Katzke

Corin Katzke, Julius, Alexa Pan, Dan H

Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required.

Listen to the AI Safety Newsletter for free on Spotify or Apple Podcasts.

SB 1047, the Most-Discussed California AI Legislation

California's Senate Bill 1047 has sparked discussion over AI regulation. While state bills often fly under the radar, SB 1047 has garnered attention due to California's unique position in the tech landscape. If passed, SB 1047 would apply to all companies performing business in the state, potentially setting a precedent for AI governance more broadly.

This newsletter examines the current state of the bill, which has had various amendments in response to... (read 1704 more words →)

AI Safety Newsletter #39: Implications of a Trump Administration for AI Policy Plus, Safety Engineering

Corin Katzke

Corin Katzke, Alexa Pan, Julius, Dan H

Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required.

Listen to the AI Safety Newsletter for free on Spotify or Apple Podcasts.

Implications of a Trump administration for AI policy

Trump named Ohio Senator J.D. Vance—an AI regulation skeptic—as his pick for vice president. This choice sheds light on the AI policy landscape under a future Trump administration. In this story, we cover: (1) Vance’s views on AI policy, (2) views of key players in the administration, such as Trump’s party, donors, and allies, and (3) why AI safety should remain bipartisan.

Vance has pushed for reducing AI regulations and making AI... (read 1508 more words →)

AISN #38: Supreme Court Decision Could Limit Federal Ability to Regulate AI Plus, “Circuit Breakers” for AI systems, and updates on China’s AI industry

Corin Katzke

Corin Katzke, Alexa Pan, Julius, Dan H

Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required.

Listen to the AI Safety Newsletter for free on Spotify or Apple Podcasts.

Supreme Court Decision Could Limit Federal Ability to Regulate AI

In a recent decision, the Supreme Court overruled the 1984 precedent Chevron v. Natural Resources Defence Council. In this story, we discuss the decision’s implications for regulating AI.

Chevron allowed agencies to flexibly apply expertise when regulating. The “Chevron doctrine” had required courts to defer to a federal agency’s interpretation of a statute in the case that that statute was ambiguous and the agency’s interpretation was reasonable. Its elimination... (read 1204 more words →)

AI Safety Newsletter #37: US Launches Antitrust Investigations Plus, recent criticisms of OpenAI and Anthropic, and a summary of Situational Awareness

Corin Katzke

Corin Katzke, Alexa Pan, Julius, Dan H

Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required.

Listen to the AI Safety Newsletter for free on Spotify or Apple Podcasts.

US Launches Antitrust Investigations

The U.S. Government has launched antitrust investigations into Nvidia, OpenAI, and Microsoft. The U.S. Department of Justice (DOJ) and Federal Trade Commission (FTC) have agreed to investigate potential antitrust violations by the three companies, the New York Times reported. The DOJ will lead the investigation into Nvidia while the FTC will focus on OpenAI and Microsoft.

Antitrust investigations are conducted by government agencies to determine whether companies are engaging in anticompetitive practices that may harm... (read 1385 more words →)

AISN #33: Reassessing AI and Biorisk Plus, Consolidation in the Corporate AI Landscape, and National Investments in AI

Corin Katzke

Corin Katzke, Alexa Pan, Dan H

Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required.

Subscribe here to receive future versions.

Listen to the AI Safety Newsletter for free on Spotify.

This week, we cover:

Consolidation in the corporate AI landscape, as smaller startups join forces with larger funders.
Several countries have announced new investments in AI, including Singapore, Canada, and Saudi Arabia.
Congress’s budget for 2024 provides some but not all of the requested funding for AI policy. The White House’s 2025 proposal makes more ambitious requests for AI funding.
How will AI affect biological weapons risk? We reexamine this question in light of new experiments from RAND,

... (read 2602 more words →)

LESSWRONG
LW

LESSWRONG
LW

Alexa Pan

Sonnet 4.5's eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals

What would an IRB-like policy for AI experiments look like?

AI Safety Newsletter #39: Implications of a Trump Administration for AI Policy Plus, Safety Engineering

AISN #33: Reassessing AI and Biorisk Plus, Consolidation in the Corporate AI Landscape, and National Investments in AI

Alexa Pan

Alexa Pan

Will misaligned AIs know that they're misaligned?

What would an IRB-like policy for AI experiments look like?

Sonnet 4.5's eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals

AI Safety Newsletter #43: White House Issues First National Security Memo on AI Plus, AI and Job Displacement, and AI Takes Over the Nobels

AI Safety Newsletter #42: Newsom Vetoes SB 1047 Plus, OpenAI’s o1, and AI Governance Summary

AI Safety Newsletter #40: California AI Legislation Plus, NVIDIA Delays Chip Production, and Do AI Safety Benchmarks Actually Measure Safety?

AI Safety Newsletter #39: Implications of a Trump Administration for AI Policy Plus, Safety Engineering

Alexa Pan

Sonnet 4.5's eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals

What would an IRB-like policy for AI experiments look like?

AI Safety Newsletter #39: Implications of a Trump Administration for AI Policy Plus, Safety Engineering

AISN #33: Reassessing AI and Biorisk Plus, Consolidation in the Corporate AI Landscape, and National Investments in AI

Alexa Pan

Alexa Pan

Will misaligned AIs know that they're misaligned?

What would an IRB-like policy for AI experiments look like?

Sonnet 4.5's eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals

AI Safety Newsletter #43: White House Issues First National Security Memo on AI Plus, AI and Job Displacement, and AI Takes Over the Nobels

AI Safety Newsletter #42: Newsom Vetoes SB 1047 Plus, OpenAI’s o1, and AI Governance Summary

AI Safety Newsletter #40: California AI Legislation Plus, NVIDIA Delays Chip Production, and Do AI Safety Benchmarks Actually Measure Safety?

AI Safety Newsletter #39: Implications of a Trump Administration for AI Policy Plus, Safety Engineering

White House Issues First National Security Memo on AI

Newsom Vetoes SB 1047

SB 1047, the Most-Discussed California AI Legislation

Implications of a Trump administration for AI policy

Supreme Court Decision Could Limit Federal Ability to Regulate AI

US Launches Antitrust Investigations