Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley
This winter, Redwood Research is running a coordinated research effort on mechanistic interpretability of transformer models. We’re excited about recent advances in mechanistic interpretability and now want to try to scale our interpretability methodology to a larger group doing research in parallel. REMIX participants will work to provide mechanistic explanations of model behaviors, using our causal scrubbing methodology to formalize and evaluate interpretability hypotheses. We hope to produce many more explanations of model behaviors akin to our recent work investigating behaviors of GPT-2-small, toy language models, and models trained on algorithmic tasks.

We think this work is a particularly promising research direction for mitigating existential risks from advanced AI systems (more in Goals and FAQ). Since mechanistic interpretability is currently a small sub-field of machine learning, we think it’s plausible that REMIX participants could make important discoveries that significantly further the field. We also think participants will learn skills valuable for many styles of interpretability research, and also for ML research more broadly.

Apply here by Sunday, November 13th [DEADLINE EXTENDED] to be a researcher in the program. Apply sooner if you’d like to start early (details below) or receive an earlier response.

Some key details:

* We expect to accept 30-50 participants.
* The research program will take place in Berkeley, CA.
* We plan to have the majority of researchers participate during the months of December and/or January (depending on availability), although we may invite some to begin earlier and are open to some starting as soon as possible.
* We expect researchers to participate for a month minimum, and (all else equal) will prefer applicants who are able to come for longer. We’ll pay for housing and travel, and also pay researchers for their time. We’ll clarify the payment structure prior to asking people to commit to the program.
* We
"When something quite bad can occur in a small number of actions (potentially even in a single trajectory) or in a short period of time, it's important to sometimes block suspicious looking actions until they are reviewed."
Why? I think there is not perfect overlap between the categories “single actions that are catastrophic, i.e. high-stakes failures” and “actions that require monitoring and blocking in the moment”. One example where they come apart is “producing a batch of data that would cause the next generation of models to be misaligned”: this might be a high-stakes failure, but you have plenty of time to catch it before you actually train that model. Another example would be “writing code that is catastrophically insecure but only gets run rarely”.
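To make the distinction concrete, here is a minimal sketch of the kind of pipeline I have in mind (my own illustration, not anything from the original post; the `EffectLatency`, `Action`, and `ControlPipeline` names are made up for this example). Only actions whose effects land immediately get blocked synchronously when they look suspicious; slow-acting outputs like a batch of training data are let through but queued for review before anything downstream uses them.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, List


class EffectLatency(Enum):
    IMMEDIATE = auto()  # effect happens as soon as the action runs (e.g. executing code)
    DEFERRED = auto()   # effect only lands later (e.g. data used to train the next model)


@dataclass
class Action:
    description: str
    latency: EffectLatency
    run: Callable[[], None]


@dataclass
class ControlPipeline:
    monitor: Callable[[Action], float]  # returns a suspicion score in [0, 1]
    block_threshold: float = 0.9
    review_queue: List[Action] = field(default_factory=list)

    def handle(self, action: Action) -> None:
        suspicion = self.monitor(action)
        if action.latency is EffectLatency.IMMEDIATE and suspicion >= self.block_threshold:
            # Catastrophe could occur within this single step, so the action is
            # held until a human reviews it.
            print(f"Blocked pending review: {action.description}")
            self.review_queue.append(action)
            return
        action.run()
        if action.latency is EffectLatency.DEFERRED:
            # Still a potential high-stakes failure, but there is time to audit the
            # output before it takes effect, so it doesn't need in-the-moment blocking.
            self.review_queue.append(action)
```

The second branch is where the two categories come apart: the training-data and rarely-run-code examples are high stakes, but the safety comes from auditing the queue before the output is actually used, not from blocking at the moment the action is taken.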