I can’t tell whether any of the details in this story correspond to reality; it reads exactly like Sam Kriss.

BashArena: A Control Setting for Highly Privileged AI Agents

2mo

We’ve just released BashArena, a new high-stakes control setting we think is a major improvement over the settings we’ve used in the past. In this post we’ll discuss the strengths and weaknesses of BashArena, and what we’ve learned about how to make settings for high-stakes control research.

BashArena

Overview

BashArena is a control setting: a dataset designed to support the kind of adversarial game we studied in the original AI control paper and Ctrl-Z. In general, control settings require two types of tasks – main tasks and side tasks. Main tasks represent the work the AI is supposed to do, and side tasks represent security failures a malicious AI might cause. The red team... (read 4264 more words →)

Recent Redwood Research project proposals

ryan_greenblatt, Buck, Julian Stastny, joshc, Alex Mallen, Adam Kaufman, Tyler Tracy, Aryan Bhatt, Joey Yudelson

7mo

Previously, we've shared a few higher-effort project proposals relating to AI control in particular. In this post, we'll share a whole host of less polished project proposals. All of these projects excite at least one Redwood researcher, and high-quality research on any of these problems seems pretty valuable. They differ widely in scope, area, and difficulty.

Control

These projects are all related to the field of AI Control. Many of them are extensions of Redwood's previous work in this area.

Basic open questions in control

Control Protocol Transfer Across Setting [PUBLIC]
- So far, all work comparing control protocols has measured effectiveness in only one setting. Do the results transfer between settings?
Backdoor Auditing

... (read 833 more words →)

Replying to Winning the power to lose

Adam Kaufman, 9mo

Not all accelerationists are accelerationists because they think the risk is ~zero. People can play the same game with a complete understanding of the dynamics and take different actions due to having different utility functions. Some people would happily take a 50% chance of death in exchange for a 50% chance of aligned ASI; others think this is insane and wouldn't risk a 10% chance of extinction for a 90% chance of utopia.

Replying to Our Reality: A Simulation Run by a Paperclip Maximizer

Adam Kaufman, 10mo

When the bullet missed Trump by half an inch I made a lot of jokes about us living in an importance-sampled simulation.

Replying to Our Reality: A Simulation Run by a Paperclip Maximizer

Adam Kaufman, 10mo

Here's a slightly more general way of phrasing it:
We find ourselves in an extremely leveraged position, making decisions which may influence the trajectory of the entire universe (more precisely our lightcone contains a gigantic amount of resources). There are lots of reasons to care about what happens to universes like ours, either because you live in one or because you can acausally trade with one that you think probably exists. "Paperclip maximizers" is a very small subset of the parties that have a reason to be interested in trying to figure out what happens to universes like ours. I'd wager there are a lot more simulations of minds in highly leveraged positions... (read more)

Ctrl-Z: Controlling AI Agents via Resampling

Aryan Bhatt, Buck, Adam Kaufman, Tyler Tracy

10mo

We have released a new paper, Ctrl-Z: Controlling AI Agents via Resampling. This is the largest and most intricate study of control techniques to date: that is, techniques that aim to prevent catastrophic failures even if egregiously misaligned AIs attempt to subvert the techniques. We extend control protocols to a more realistic, multi-step setting, develop novel resample protocols that outperform all existing control protocols, and study many aspects of control techniques in more detail than prior work. In this blog post, we summarize the main takeaways and lessons learned from our work.

Here’s the abstract of the paper:

Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing

... (read 5765 more words →)


Replying to Why do many people who care about AI Safety not clearly endorse PauseAI?

Adam Kaufman, Mar 31, 2025

I think AI safety has very limited political capital at the moment. Pausing AI just isn’t going to happen, so advocating for it makes you sound unreasonable and allows people to comfortably ignore your other opinions. I prefer trying to push for interventions which make a difference with much less political capital, like convincing frontier labs to work on and implement control measures.


Replying to Anthropic: Three Sketches of ASL-4 Safety Case Components

Adam Kaufman, 1y*

Kudos for releasing a concept of a plan! Some thoughts:

Regarding the first safety case:

The amount of progress in mech interp required to make the first safety case suitable seems overly optimistic to me; I basically think that most of the "limitations" are in fact pretty serious. However, I appreciate the attempt to include concrete requirements.
I believe that getting good results from the following experiments might be particularly unrealistic:
- "In order to robustify our evals against sandbagging, we ran an experiment where we steered one or more truth-related features to make the model confident that it is in either testing or unmonitored deployment"
  - I worry that "truth-related" features are very difficult both in principle and

... (read 670 more words →)

Replying to Momentum of Light in Glass

Adam Kaufman, 1y

I have such a strong intuitive opposition to the Internal Reaction Drive that I agree with your conclusion that we should update away from any theory which allows it. Then again, perhaps it is impossible to build such a drive for the merely practical reason that any material with a positive or negative index of refraction will absorb enough light to turn the drive into an expensive radiator.

Especially given the recent Nobel prize announcement, I think the most concerning piece of information is that there are cultural forces from within the physics community discouraging people from trying to answer the question at all.

Replying to What bootstraps intelligence?

Adam Kaufman, 1y

You need abstractions to think and plan at all with limited compute, not just to speak. I would guess that plenty of animals which are incapable of speaking also mentally rely on abstractions. For instance, when foraging for apples, I suspect an animal probably has a mental category for apples, and treats them as the same kind of thing rather than completely unrelated configurations of atoms.

Replying to What percent of the sun would a Dyson Sphere cover?

Adam Kaufman, Jul 03, 2024

The planet Mercury is a pretty good source of material:

Mass: $3.29 \times 10^{23}$ kg (which is about 70% iron)

Radius: $2.44 \times 10^{6}$ m

Volume: $6.08 \times 10^{19}$ m^3

Density: $5411$ kg/m^3

Orbital radius: $0.39\,\mathrm{AU} = 5.79 \times 10^{10}$ m

A spherical shell around the sun at roughly the same radius as Mercury's orbit would have a surface area of $4.21 \times 10^{22}$ m^2, and spreading out Mercury's volume over this area gives a thickness of about 1.4 mm. This means Mercury alone provides ample material for collecting all of the Sun's energy via reflecting light – very thin spinning sheets could act as a swarm of orbiting reflectors that focus sunlight onto large power plants or mirrors that direct it elsewhere in the solar system. Spinning sheets could be made somewhere between 1-100 μm thick, with thicker cables or... (read more)
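The shell arithmetic above is easy to re-check. Here is a minimal Python sketch that uses only the volume and orbital radius quoted in the comment; the constant and function names are made up for illustration, and it assumes a perfectly uniform shell with no losses in processing the material.

```python
import math

# Figures quoted in the comment above (not independently sourced here).
MERCURY_VOLUME_M3 = 6.08e19
MERCURY_ORBIT_RADIUS_M = 5.79e10


def shell_area(radius_m: float) -> float:
    """Surface area of a sphere of the given radius, in m^2."""
    return 4.0 * math.pi * radius_m ** 2


def shell_thickness(volume_m3: float, radius_m: float) -> float:
    """Thickness in metres of a thin uniform shell made from volume_m3 of material."""
    return volume_m3 / shell_area(radius_m)


area = shell_area(MERCURY_ORBIT_RADIUS_M)
thickness_mm = shell_thickness(MERCURY_VOLUME_M3, MERCURY_ORBIT_RADIUS_M) * 1000.0

print(f"shell area: {area:.3g} m^2")
print(f"shell thickness: {thickness_mm:.2f} mm")
```

Dividing the thickness by a sheet thickness in the 1-100 μm range gives a rough sense of how many times over the material could cover the shell, under the same uniformity assumption.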

Replying to My hour of memoryless lucidity

Adam Kaufman, 2y

A better way to do the memory overwrite experiment is to prepare a list of what’s in the box to match each of ten possible numbers, then have someone provide a random number while your short term memory doesn’t work and see if you can successfully overwrite the memory that corresponds to that number (as measured by correctly guessing the number much later).

Finding Sparse Linear Connections between Features in LLMs

Logan Riggs, Sam Mitchell, Adam Kaufman

TL;DR: We use SGD to find sparse connections between features; additionally a large fraction of features between the residual stream & MLP can be modeled as linearly computed despite the non-linearity in the MLP. See linear feature section for examples.

Special thanks to fellow AISST member, Adam Kaufman, who originally thought of the idea of learning sparse connections between features & to Jannik Brinkmann for training these SAEs.

Sparse autoencoders (SAEs) are able to turn the activations of an LLM into interpretable features. To define circuits, we would like to find how these features connect to each other. SAEs allowed us to scalably find interpretable features using SGD, so why not use SGD to... (read 2735 more words →)

LESSWRONG