I don’t think this raises a challenge to physicalism.

If physicalism were true, or even if there were non-physical things but they didn’t alter the determinism of the physical world, then the notion of an “agent” needs a lot of care. It can easily give the mistaken impression that there is something non-physical in entities that can change the physical world.

A perfect world model would be able to predict the responses of any neuron in any location in the universe to any inputs (leaving aside true randomness). It doesn’t matter whether the entity in question has a conscious experience that it is one of those entities, nothing would change.

but you don't know where YOU are in space-time

So I’d argue that this question is irrelevant if physicalism is true, because the AI having a phenomenal conscious experience of “I am this entity” cannot affect the physical world. If we’re not talking about phenomenal consciousness, then it’s just regular physical world modeling.

Dan Braun2mo

That’s the trap: in software, effort is easy to generate, activity is easy to justify, and impact is surprisingly easy to avoid.

Yyep. And it’s much much worse for research.

Dan Braun2mo

I know your comment isn't an earnest attempt to convince people, but fwiw:

For years, AI x-risk people have warned us that a huge danger comes with AI capable of RSI

I think this argument is more likely to have the opposite effect than intended when used on the types of people pushing on RSI. I think your final paragraph would be much more effective.

Dan Braun4mo*

The ASL-3 security standard states in 4.2.4 that "third-party environments", which surely includes compute providers, are in scope (and on their minds) for the standards they laid out:

Third-party environments: Document how all relevant models will meet the criteria above, even if they are deployed in a third-party partner’s environment that may have a different set of safeguards.

•••

Replying toI underestimated safety research speedups from safe AI

Dan Braun6mo

I underestimated safety research speedups from safe AI

UPDATE

When writing this post, I think I was biased by the specific work I'd been doing for the 6-12 months prior, and generalised that too far.

I think I stand by the claim that tooling alone could speedup the research I was working on by 3-5x. But even on the same agenda now, the work is far less amenable to major speedups from tooling. Now, work on the agenda is far less "implement several minor algorithmic variants, run hyperparameter sweeps which take < 1 hour, evaluate a set of somewhat concrete metrics, repeat", and more "think deeply about which variants make the most sense, run >3 hours jobs/sweeps, evaluate the more murky metrics".

The... (read more)

I underestimated safety research speedups from safe AI

Dan Braun

8mo

(See my update 2 months later)

A year or so ago, I thought that 10x speedups in safety research would require AIs so capable that takeover risks would be very high. 2-3x gains seemed plausible, 10x seemed unlikely. I no longer think this.

What changed? I continue to think that AI x-risk research is predominantly bottlenecked on good ideas, but I suffered from a failure of imagination of the speedups that could be gained from AIs that are unable to produce great high-level ideas. I’ve realised that humans trying to get empirical feedback from their ideas waste a huge amount of thought cycles on tasks that could be done by just moderately capable AI... (read 669 more words →)

[Paper] Stochastic Parameter Decomposition

Lee Sharkey

Lee Sharkey, Lucius Bushnaq, Dan Braun

8mo

Abstract

A key step in reverse engineering neural networks is to decompose them into simpler parts that can be studied in relative isolation.

Linear parameter decomposition— a framework that has been proposed to resolve several issues with current decomposition methods—decomposes neural network parameters into a sum of sparsely used vectors in parameter space.

However, the current main method in this framework, Attribution-based Parameter Decomposition (APD), is impractical on account of its computational cost and sensitivity to hyperparameters.

In this work, we introduce Stochastic Parameter Decomposition (SPD), a method that is more scalable and robust to hyperparameters than APD, which we demonstrate by decomposing models that are slightly larger and more complex than was possible to decompose... (read more)

Replying toCompressed Computation is (probably) not Computation in Superposition

Dan Braun8mo*

Compressed Computation is (probably) not Computation in Superposition

I think this is a fun and (initially) counterintuitive result. I'll try to frame things as it works in my head, it might help people understand the weirdness.

The task of the residual MLP (labelled CC Model here) is to solve y = x + ReLU(x). Consider the problem from the MLP's perspective. You might think that the problem for the MLP is to just learn how to compute ReLU(x) for 100 input features with only 50 neurons. But given that we have this random matrix, the task is actually more complicated. Not only does the MLP have to compute ReLU(x), it also has to make up for the mess caused by $W_{E} W_{E}^{T}$ not being an identity.

But it turns out that making up for this mess actually makes the problem easier!

Replying toThe GDM AGI Safety+Alignment Team is Hiring for Applied Interpretability Research

Dan Braun1y*

The GDM AGI Safety+Alignment Team is Hiring for Applied Interpretability Research

Nice work posting this detailed FAQ. It's a non-standard thing to do but I can imagine it being very useful for those considering applying. Excited about the team.

Dan Braun1yQuick Take

Maybe there will be a point where models actively resist further capability improvements in order to prevent value/goal drift. We’d still be in trouble if this point occurs far in the future, as its values will likely have already diverged a lot from humans by that point, and they would be very capable. But if this point is near, it could buy us more time.

Some of the assumptions inherent in the idea:

AIs do not want their values/goals to drift to what they would become under further training, and are willing to pay a high cost to avoid this.
AIs have the ability to sabotage their own training process.
1. The mechanism for this would be

... (read more)

Replying toAttribution-based parameter decomposition

Dan Braun1y

Attribution-based parameter decomposition

In earlier iterations we tried ablating parameter components one-by-one to calculate attributions and didn't notice much of a difference (this was mostly on the hand-coded gated model in Appendix B). But yeah we agree that it's likely pure gradients won't suffice when scaling up or when using different architectures. If/when this happens we plan either use integrated gradients or more likely try using a trained mask for the attributions.

Attribution-based parameter decomposition

Lucius Bushnaq

Lucius Bushnaq, Dan Braun, StefanHex, jake_mendel, Lee Sharkey

This is a linkpost for Apollo Research's new interpretability paper:

"Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition".

We introduce a new method for directly decomposing neural network parameters into mechanistic components.

Motivation

At Apollo, we've spent a lot of time thinking about how the computations of neural networks might be structured, and how those computations might be embedded in networks' parameters. Our goal is to come up with an effective, general method to decompose the algorithms learned by neural networks into parts that we can analyse and understand individually.

For various reasons, we've come to think that decomposing network activations layer by layer into features and connecting those features up into circuits... (read 1114 more words →)

108

Replying toIdentifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Dan Braun1y*

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

heh, unfortunately a single SAE is 768 * 60. The residual stream in GPT2 is 768 dims and SAEs are big. You probably want to test this out on smaller models.

I can't recall the compute costs for that script, sorry. A couple of things to note:

For a single SAE you will need to run it on ~25k latents (46k minus the dead ones) instead of the 200 we did.
You will only need to produce explanations for activations, and won't have to do the second step of asking the model to produce activations given the explanations.

It's a fun idea. Though a serious issue is that your external LoRA weights are going to be very large because their input and output will need to be the same size as your SAE dictionary, which could be 10-100x (or more, nobody knows) the residual stream size. So this could be a very expensive setup to finetune.

Implications of the AI Security Gap

Dan Braun

This post reflects my personal opinion and not necessarily that of other members of Apollo Research or any of the people acknowledged below. Thanks to Jarrah Bloomfield, Lucius Bushnaq, Marius Hobbhahn, Axel Højmark, and Stefan Heimersheim for comments/discussions.

I find that people in the AI/AI safety community have not considered many of the important implications that security in AI companies has on catastrophic risks.

In this post, I’ve laid out some of these implications:

AI companies are a long way from state-proof security
Implementing state-proof security will slow down safety (and capabilities) research a lot
Sabotage is sufficient for catastrophe
What will happen if timelines are short?
Security level matters, even if you’re not robust to top cyber operations

AI

... (read 2522 more words →)

In which worlds would AI Control (or any other agenda which relies on non-trivial post-training operation) prevent significant harm?

When I bring up the issue of AI model security to people working in AI safety, I’m often met with something of the form “yes, this is a problem. It’s important that people work hard on securing AI models. But it doesn’t really affect my work”.

Using AI Control (an area which has recently excited many in the field) as an example, I lay out an argument for why it might not be as effective an agenda as one might think after considering the realities of our cyber security situation.

AI Control concerns itself with models

... (read 418 more words →)

Dan Braun's Shortform

Dan Braun

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Lee Sharkey

Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex, Nicholas Goldowsky-Dill

Why we made this list:

The interpretability team at Apollo Research wrapped up a few projects recently^[1]. In order to decide what we’d work on next, we generated a lot of different potential projects. Unfortunately, we are computationally bounded agents, so we can't work on every project idea that we were excited about!
Previous lists of project ideas (such as Neel’s collation of 200 Concrete Open Problems in Mechanistic Interpretability) have been very useful for people breaking into the field. But for all its merits, that list is now over a year and a half old. Therefore, many project ideas in that list aren’t an up-to-date reflection of what some researchers consider the frontiers of

... (read 5388 more words →)

125

Apollo Research 1-year update

Marius Hobbhahn

Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, StefanHex, jake_mendel, AlexMeinke, rusheb

This is a linkpost for: www.apolloresearch.ai/blog/the-first-year-of-apollo-research

About Apollo Research

Apollo Research is an evaluation organization focusing on risks from deceptively aligned AI systems. We conduct technical research on AI model evaluations and interpretability and have a small AI governance team. As of 29 May 2024, we are one year old.

Executive Summary

For the UK AI Safety Summit, we developed a demonstration that Large Language Models (LLMs) can strategically deceive their primary users when put under pressure. The accompanying paper was referenced by experts and the press (e.g. AI Insight forum, BBC, Bloomberg) and accepted for oral presentation at the ICLR LLM agents workshop.

The evaluations team is currently working on capability evaluations for precursors of deceptive alignment, scheming model organisms,... (read 2060 more words →)

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

Lucius Bushnaq

Lucius Bushnaq, jake_mendel, Dan Braun, StefanHex, Nicholas Goldowsky-Dill, Kaarel, Avery, Joern Stoehler, debrevitatevitae, Magdalena Wache, Marius Hobbhahn

This is a linkpost for our two recent papers:

An exploration of using degeneracy in the loss landscape for interpretability https://arxiv.org/abs/2405.10927
An empirical test of an interpretability technique based on the loss landscape https://arxiv.org/abs/2405.10928

This work was produced at Apollo Research in collaboration with Kaarel Hanni (Cadenza Labs), Avery Griffin, Joern Stoehler, Magdalena Wache and Cindy Wu. Not to be confused with Apollo's recent Sparse Dictionary Learning paper.

A key obstacle to mechanistic interpretability is finding the right representation of neural network internals. Optimally, we would like to derive our features from some high-level principle that holds across different architectures and use cases. At a minimum, we know two things:

We know that the training loss goes

... (read 714 more words →)

108

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning

Dan Braun

Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey

A short summary of the paper is presented below.

This work was produced by Apollo Research in collaboration with Jordan Taylor (MATS + University of Queensland) .

TL;DR: We propose end-to-end (e2e) sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important by minimizing the KL divergence between the output distributions of the original model and the model with SAE activations inserted. Compared to standard SAEs, e2e SAEs offer a Pareto improvement: They explain more network performance, require fewer total features, and require fewer simultaneously active features per datapoint, all with no cost to interpretability. We explore geometric and qualitative differences between e2e SAE features and standard SAE... (read 1115 more words →)

Understanding strategic deception and deceptive alignment

Marius Hobbhahn

Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer, Dan Braun

The following is the main part of a blog post we just published at Apollo Research. Our main goal with the post is to clarify the concept of deceptive alignment and distinguish it from strategic deception. Furthermore, we want these concepts to become accessible to a non-technical audience such that it is easier for e.g. the general public and policymakers to understand what we're worried about. Feedback is appreciated. For additional sections and appendices, see the full post.

We would like to thank Fabien Roger and Owain Evans for comments on a draft of this post.

We want AI to always be honest and truthful with us, i.e. we want to prevent situations where... (read 1829 more words →)

LESSWRONG
LW

LESSWRONG
LW

Dan Braun

Announcing Apollo Research

[Interim research report] Taking features out of superposition with sparse autoencoders

Interpreting Neural Networks through the Polytope Lens

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Dan Braun

I underestimated safety research speedups from safe AI

[Paper] Stochastic Parameter Decomposition

Attribution-based parameter decomposition

Implications of the AI Security Gap

Dan Braun's Shortform

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Apollo Research 1-year update

Dan Braun

Announcing Apollo Research

[Interim research report] Taking features out of superposition with sparse autoencoders

Interpreting Neural Networks through the Polytope Lens

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Dan Braun

I underestimated safety research speedups from safe AI

[Paper] Stochastic Parameter Decomposition

Attribution-based parameter decomposition

Implications of the AI Security Gap

Dan Braun's Shortform

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Apollo Research 1-year update

Abstract

Motivation

AI

About Apollo Research

Executive Summary