LESSWRONG
LW

jake_mendel — LessWrong

Replying toAn Ambitious Vision for Interpretability

jake_mendel2mo

An Ambitious Vision for Interpretability

Upvoted. The way I think about the case for ambitious interp is:

There are a number of pragmatic approaches in ai safety that bottom out in finding a signal and optimising against it. Such approaches might all have ceilings of usefulness due to oversight problems. In the case of pragmatic interp, the ceiling of usefulness is primarily set by the fact that if you want to use the interp techniques, then you incentivise obfuscation. Of course there are things you can do to try to make the obfuscation happen slower, e.g. hillclimbing on your signal is better than training against it; this is true of most pragmatic approaches.
Ambitious interp should have a much

... (read more)

Replying toAlignment remains a hard, unsolved problem

jake_mendel3mo

Alignment remains a hard, unsolved problem

What you described is an example of training based on cognitive oversight in my view. When I said 'there are definitely ways to do this which lead to less direct pressure for cognitive obfuscation than others' I was thinking of things like this. There's a spectrum from 'just make the rate at which the evilness detector fires a term in the loss function' to 'hill climb on your evilness detector (but don't use it during training)' to 'when the evilness detector fires, try to understand mechanistically what went wrong and then modify training to fix the problem' all the way to 'when the evilness detector fires, demand a global moratorium, burn all... (read more)

Replying toAlignment remains a hard, unsolved problem

jake_mendel3mo

Alignment remains a hard, unsolved problem

Really great post, strong upvoted. I was a fan in particular of the selection of research agendas you included at the end.

but there’s no similar fundamental reason that cognitive oversight (e.g. white-box-based oversight like with interpretability) has to get harder with scale

I'm curious why you think this? It seems like there's some preliminary evidence that models are becoming increasingly capable of manipulating their thoughts. But more importantly, I'm wondering what role you think this cognitive oversight will play in making safe powerful AI? If this oversight will be used as a feedback signal, then I would expect that avoiding obfuscating the cognition does become harder at scale. I do think we should... (read more)

Three positive updates I made about technical grantmaking at Coefficient Giving (fka Open Phil)

jake_mendel

3mo

~~Open Philanthropy’s~~ Coefficient Giving’s Technical AI Safety team is hiring grantmakers. I thought this would be a good moment to share some positive updates about the role that I’ve made since I joined the team a year ago.

tl;dr: I think this role is more impactful and more enjoyable than I anticipated when I started, and I think more people should consider applying.

It’s not about the “marginal” grants

Some people think that being a grantmaker at Coefficient means sorting through a big pile of grant proposals and deciding which ones to say yes and no to. As a result, they think that the only impact at stake is how good our decisions are about marginal grants,... (read 1505 more words →)

127

jake_mendel4mo

Copypasting from a slack thread:

I'll list some work that I think is aspiring to build towards an answer to some of these questions, although lots of it is very toy:

On generalisation vs simple heuristics:
- I think the nicest papers here are some toy model interp papers like progress measures for grokking and Grokked Transformers. I think these are two papers which present a pretty crisp distinction between levels of generality of different algorithms that perform similarly on the training set. In modular addition, there are two levels of algorithm and in the Grokked transformer, there are three. The story of what the models end up doing ends up being pretty nuanced and it

... (read 1111 more words →)

•••

jake_mendel5mo

What are solenoidal flux corrections in this context

Replying toHow quick and big would a software intelligence explosion be?

jake_mendel6mo*

How quick and big would a software intelligence explosion be?

Thanks for this post!

Caveat: I haven't read this very closely yet, and I'm not an economist. I'm finding it hard to understand why you think it's reasonable to model an increase in capabilities by an increase in number of parallel copies. That is: in the returns to R&D section, you look at data on how increasing numbers of human-level researchers in AI affect algorithmic progress, but we have ~no data on what happens when you sample researchers from a very different (and superhuman) capability profile. It seems to me entirely plausible that a few months into the intelligence explosion, the best AI researchers are qualitatively superintelligent enough that their research advances per... (read 379 more words →)

Replying toHow do you feel about LessWrong these days? [Open feedback thread]

jake_mendel8mo

How do you feel about LessWrong these days? [Open feedback thread]

I think the biggest-alignment relevant update is that I expected RL fine-tuning over longer horizons (or even model-based RL a la AlphaZero) to be a bigger deal. I was really worried about it significantly improving performance and making alignment harder. In 2018-2019 my mainline picture was more like AlphaStar or AlphaZero, with RL fine-tuning being the large majority of compute. I've updated about this and definitely acknowledge I was wrong.[3] I don't think it totally changes the picture though: I'm still scared of RL, I think it is very plausible it will become more important in the future, and think that even the kind of relatively minimal RL we do now can introduce many of the same risks.

Curious to hear how you would revisit this prediction in light of reasoning models? Seems like you weren't as wrong as you thought a year ago, but maybe you still think there are some key ways your predictions about RL finetuning predictions were off?

Replying toObstacles in ARC's agenda: Mechanistic Anomaly Detection

jake_mendel10mo

Obstacles in ARC's agenda: Mechanistic Anomaly Detection

This was really useful to read thanks very much for writing these posts!

•••

jake_mendel10mo

Very happy you did this!

Replying toThe case for unlearning that removes information from LLM weights

jake_mendel11mo

The case for unlearning that removes information from LLM weights

Do you have any idea about whether the difference between unlearning success on synthetic facts fine-tuned in after pretraining vs real facts introduced during pretraining comes mainly from the 'synthetic' part or the 'fine-tuning' part? I.e. if you took the synthetic facts dataset and spread it out through the pretraining corpus, do you expect it would be any harder to unlearn the synthetic facts? or maybe this question doesn't make sense because you'd have to make the dataset much larger or something to get it to learn the facts at all during pretraining? If so, it seems like a pretty interesting research question to try to understand which properties a dataset of synthetic facts needs to have to defeat unlearning.

Research directions Open Phil wants to fund in technical AI safety

jake_mendel

jake_mendel, maxnadeau, Peter Favaloro

The Open Philanthropy has just launched a large new Request for Proposals for technical AI safety research. Here we're sharing a reference guide, created as part of that RFP, which describes what projects we'd like to see across 21 research directions in technical AI safety.

This guide provides an opinionated overview of recent work and open problems across areas like adversarial testing, model transparency, and theoretical approaches to AI alignment. We link to hundreds of papers and blog posts and offer approximately a hundred different example projects. We hope this is a useful resource for technical people getting started in alignment research. We'd also welcome feedback from the LW community on our prioritization within... (read 17359 more words →)

117

Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas

jake_mendel

jake_mendel, maxnadeau, Peter Favaloro

Open Philanthropy is launching a big new Request for Proposals for technical AI safety research, with plans to fund roughly $40M in grants over the next 5 months, and available funding for substantially more depending on application quality.

Applications (here) start with a simple 300 word expression of interest and are open until April 15, 2025.

Overview

We're seeking proposals across 21 different research areas, organized into five broad categories:

Adversarial Machine Learning
- *Jailbreaks and unintentional misalignment
- *Control evaluations
- *Backdoors and other alignment stress tests
- *Alternatives to adversarial training
- Robust unlearning
Exploring sophisticated misbehavior of LLMs
- *Experiments on alignment faking
- *Encoded reasoning in CoT and inter-model communication
- Black-box LLM psychology
- Evaluating whether models can hide dangerous behaviors
- Reward hacking of human oversight
Model transparency
- Applications of white-box techniques
- Activation monitoring
- Finding feature

... (read 177 more words →)

111

Attribution-based parameter decomposition

Lucius Bushnaq

Lucius Bushnaq, Dan Braun, StefanHex, jake_mendel, Lee Sharkey

This is a linkpost for Apollo Research's new interpretability paper:

"Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition".

We introduce a new method for directly decomposing neural network parameters into mechanistic components.

Motivation

At Apollo, we've spent a lot of time thinking about how the computations of neural networks might be structured, and how those computations might be embedded in networks' parameters. Our goal is to come up with an effective, general method to decompose the algorithms learned by neural networks into parts that we can analyse and understand individually.

For various reasons, we've come to think that decomposing network activations layer by layer into features and connecting those features up into circuits... (read 1114 more words →)

108

Circuits in Superposition: Compressing many small neural networks into one

Lucius Bushnaq

Lucius Bushnaq, jake_mendel

Tl;dr: We generalize the mathematical framework for computation in superposition from compressing many boolean logic gates into a neural network, to compressing many small neural networks into a larger neural network. The number of small networks we can fit into the large network depends on the small networks' total parameter count, not their neuron count.

Work done at Apollo Research. The bottom half of this post is just maths that you do not need to read to get the gist.

EDIT 12.08.2025: Some of the math in this post turned out to be wrong, though not in a way that changes the high-level conceptual takeaways very much. See this post for a fixed version.

Introduction

Background

Anthropic's... (read 3793 more words →)

131

I keep coming back to the idea of interpreting the embedding matrix of a transformer. It’s appealing for several reasons: we know the entire data distribution is just independent probabilities of each logit, so there’s no mystery about what features are data features vs model features. We also know one sparse basis for the activations: the rows of the embedding. But that’s also clearly not satisfactory because the embedding learns something! The thing it learns could be a sparse overbasis of non-token features, but the story for this would have to be different to the normal superposition story which involves features being placed into superposition by model components after they are computed... (read more)

jake_mendel's Shortform

jake_mendel

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

[Interim research report] Activation plateaus & sensitive directions in GPT2

StefanHex

StefanHex, jake_mendel

This part-report / part-proposal describes ongoing research, but I'd like to share early results for feedback. I am especially interested in any comment finding mistakes or trivial explanations for these results. I will work on this proposal with a LASR Labs team over the next 3 months. If you are working (or want to work) on something similar I would love to chat!
Experiments and write-up by Stefan, with substantial inspiration and advice from Jake (who doesn’t necessarily endorse every sloppy statement I write). Work produced at Apollo Research.

TL,DR: Toy models of how neural networks compute new features in superposition seem to imply that neural networks that utilize superposition require some form of error... (read 1457 more words →)

SAE feature geometry is outside the superposition hypothesis

jake_mendel

Written at Apollo Research

Summary: Superposition-based interpretations of neural network activation spaces are incomplete. The specific locations of feature vectors contain crucial structural information beyond superposition, as seen in circular arrangements of day-of-the-week features and in the rich structures of feature UMAPs. We don’t currently have good concepts for talking about this structure in feature geometry, but it is likely very important for model computation. An eventual understanding of feature geometry might look like a hodgepodge of case-specific explanations, or supplementing superposition with additional concepts, or plausibly an entirely new theory that supersedes superposition. To develop this understanding, it may be valuable to study toy models in depth and do theoretical or conceptual... (read 3251 more words →)

229

Apollo Research 1-year update

Marius Hobbhahn

Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, StefanHex, jake_mendel, AlexMeinke, rusheb

This is a linkpost for: www.apolloresearch.ai/blog/the-first-year-of-apollo-research

About Apollo Research

Apollo Research is an evaluation organization focusing on risks from deceptively aligned AI systems. We conduct technical research on AI model evaluations and interpretability and have a small AI governance team. As of 29 May 2024, we are one year old.

Executive Summary

For the UK AI Safety Summit, we developed a demonstration that Large Language Models (LLMs) can strategically deceive their primary users when put under pressure. The accompanying paper was referenced by experts and the press (e.g. AI Insight forum, BBC, Bloomberg) and accepted for oral presentation at the ICLR LLM agents workshop.

The evaluations team is currently working on capability evaluations for precursors of deceptive alignment, scheming model organisms,... (read 2060 more words →)

Interpretability: Integrated Gradients is a decent attribution method

Lucius Bushnaq

Lucius Bushnaq, jake_mendel, StefanHex, Kaarel

A short post laying out our reasoning for using integrated gradients as attribution method. It is intended as a stand-alone post based on our LIB papers [1] [2]. This work was produced at Apollo Research.

Context

Understanding circuits in neural networks requires understanding how features interact with other features. There's a lot of features and their interactions are generally non-linear. A good starting point for understanding the interactions might be to just figure out how strongly each pair of features in adjacent layers of the network interacts. But since the relationships are non-linear, how do we quantify their 'strength' in a principled manner that isn't vulnerable to common and simple counterexamples? In other words,... (read 1701 more words →)

LESSWRONG
LW

LESSWRONG
LW

jake_mendel

SAE feature geometry is outside the superposition hypothesis

Toward A Mathematical Framework for Computation in Superposition

Circuits in Superposition: Compressing many small neural networks into one

Three positive updates I made about technical grantmaking at Coefficient Giving (fka Open Phil)

jake_mendel

Three positive updates I made about technical grantmaking at Coefficient Giving (fka Open Phil)

Research directions Open Phil wants to fund in technical AI safety

Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas

Attribution-based parameter decomposition

Circuits in Superposition: Compressing many small neural networks into one

jake_mendel's Shortform

[Interim research report] Activation plateaus & sensitive directions in GPT2

jake_mendel

SAE feature geometry is outside the superposition hypothesis

Toward A Mathematical Framework for Computation in Superposition

Circuits in Superposition: Compressing many small neural networks into one

Three positive updates I made about technical grantmaking at Coefficient Giving (fka Open Phil)

jake_mendel

Three positive updates I made about technical grantmaking at Coefficient Giving (fka Open Phil)

Research directions Open Phil wants to fund in technical AI safety

Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas

Attribution-based parameter decomposition

Circuits in Superposition: Compressing many small neural networks into one

jake_mendel's Shortform

[Interim research report] Activation plateaus & sensitive directions in GPT2

It’s not about the “marginal” grants

Overview

Motivation

Introduction

Background

About Apollo Research

Executive Summary

Context