No, I think the blue team will keep having the latest and best LLMs and will be able to stop such attempts from randos. These AGIs won't be so magically superintelligent that they can take all the unethical actions needed to take over the world without other AGIs stopping them.

Replying to Alignment will happen by default. What’s next?

Adrià Garriga-alonso 3mo

On Chrome on a Mac you can just C-f in the PDF; it OCRs automatically. I didn't have this problem.

Adrià Garriga-alonso 3mo

But it feels like you'd need to demonstrate this with some construction that's actually adversarially robust, which seems difficult.

I agree it's kind of difficult.

Have you seen Nicholas Carlini's Game of Life series? It starts with logic gates and builds up to a microprocessor that factors 15 into 3 x 5.

Depending on the adversarial robustness model (e.g. every second the adversary can make 1 square behave the opposite of lawfully), it might be possible to make robust logic gates and circuits. In fact the existing circuits are a little robust already, though not to the tune of 1 square per tick; that's too much power for the adversary.
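To make the adversary model concrete, here is a minimal sketch of a lawful Game of Life tick followed by an optional single-cell flip. The toroidal (wrap-around) grid, the choice to flip after the lawful update, and the names `life_step` and `adversarial_step` are all assumptions made for illustration; they are not taken from Carlini's series.

```python
import numpy as np

def life_step(grid):
    """One lawful Game of Life tick on a toroidal (wrap-around) grid of 0/1 cells."""
    neighbors = sum(
        np.roll(np.roll(grid, dy, axis=0), dx, axis=1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dy, dx) != (0, 0)
    )
    # A cell is alive next tick if it has 3 live neighbors,
    # or if it is alive now and has exactly 2 live neighbors.
    return ((neighbors == 3) | ((grid == 1) & (neighbors == 2))).astype(np.uint8)

def adversarial_step(grid, flip=None):
    """Lawful tick, after which the (hypothetical) adversary may flip one cell."""
    new = life_step(grid)
    if flip is not None:
        row, col = flip
        new[row, col] ^= 1  # this cell now behaves "the opposite of lawfully"
    return new

# A blinker: three cells in a row, which oscillates with period 2.
blinker = np.zeros((5, 5), dtype=np.uint8)
blinker[2, 1:4] = 1

lawful = life_step(blinker)
attacked = adversarial_step(blinker, flip=(0, 0))
```

A robustness test for a gate would then run the circuit under every allowed flip schedule and check that its output cells are unchanged.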

Replying to Unless its governance changes, Anthropic is untrustworthy

Adrià Garriga-alonso 3mo

I very strongly agree with this and think it should be the top objection people first see when scrolling down. In a low-P(doom) world, Anthropic has done lots of good. (They proved that you can have the best and the most aligned model, and also the leadership is more trustworthy than OpenAI, who would otherwise lead). This is my current view.

In a high-P(doom) world, none of that matters because they've raised the bar for capabilities when we really should be pausing AI. This was my previous view.

I'm grudgingly impressed with Anthropic leadership for getting this right when I did not (not that anyone other than me cares what I believed, having ~zero power).

Replying to Alignment will happen by default. What’s next?

Adrià Garriga-alonso 3mo

What do you mean "faithfully execute"?

If you mean that it's executing its own goal, then I dispute that the goal will be random: the goal will be to be helpful and good.

Is this different from the standard instrumental convergence algorithm?

Replying to Circuit discovery through chain of thought using policy gradients

Adrià Garriga-alonso 3mo

I am concerned that long chains of RL are just sufficiently fucked functions with enough butterfly effects that this wouldn't be well approximated by this process.

This is a concern. Two possible replies:

1. If it's truly a chaotic system then there's no good way to estimate the expectation.
2. In reality, it could be that the effects of neurons are not very chaotic, but this estimate of the gradient is very chaotic. Previous work actually shows that policy gradients are much less chaotic than the 'reparameterization trick' (in the case where the transition is continuous, differentiating through it). It could be that finite differences (resampling many rollouts with/without the neuron activated) actually estimate effects better, with less variance. We'll see.

Circuit discovery through chain of thought using policy gradients

Adrià Garriga-alonso

3mo

Circuit discovery has been restricted to the single-forward-pass setting, because the algorithms to attribute changes in behavior to particular neurons / SAE features need gradients, and you can't take a gradient through the sampled chain of thought. Or... can you?

It turns out taking gradients through random discrete actions is an essential part of reinforcement learning. We can estimate the gradients of an expectation over CoTs, with respect to the features, using the score function estimator. We can combine this with integrated gradients to produce a version of EAP-IG which works through the averages of chains of thought.
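As a rough illustration of the score function estimator on its own (not the post's EAP-IG variant), the sketch below estimates the gradient of an expected reward with respect to the logits of a small categorical distribution and compares it to the exact gradient. The three-action setup, the reward values, and the function names are assumptions chosen for the example; in the post's setting the sampled object would be a whole chain of thought and the parameters would be feature activations.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def exact_gradient(logits, rewards):
    """d/d logits of E_{a ~ softmax(logits)}[rewards[a]], computed in closed form."""
    p = softmax(logits)
    return p * (rewards - p @ rewards)

def score_function_gradient(logits, rewards, n_samples, rng):
    """Monte Carlo estimate: mean of rewards[a] * grad log p(a) over sampled actions."""
    p = softmax(logits)
    actions = rng.choice(len(p), size=n_samples, p=p)
    one_hot = np.eye(len(p))[actions]
    score = one_hot - p  # gradient of log softmax(logits)[a] w.r.t. the logits
    return (rewards[actions][:, None] * score).mean(axis=0)

logits = np.array([0.5, -1.0, 2.0])
rewards = np.array([1.0, 0.0, 3.0])
rng = np.random.default_rng(0)

estimate = score_function_gradient(logits, rewards, 200_000, rng)
exact = exact_gradient(logits, rewards)
```

The estimator never differentiates through the sampling step itself, which is why it applies to discrete samples; the price is variance, which in practice is usually reduced with a baseline.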

Background

Circuit discovery

The task we attempt to do is circuit discovery, defined by Conmy et al. Formally,...

Will We Get Alignment by Default? — with Adrià Garriga-Alonso

Simon Lermen, Adrià Garriga-alonso

3mo

Adrià recently published “Alignment will happen by default; what’s next?” on LessWrong, arguing that AI alignment is turning out easier than expected. Simon left a lengthy comment pushing back, and that sparked this spontaneous debate.

Adrià argues that current models like Claude Opus 3 are genuinely good “to their core,” and that an iterative process, where each AI generation helps align the next, could carry us safely to superintelligence. Simon counters that we may only get one shot at alignment and that current methods are too weak to scale. A conversation about where AI safety actually stands.

Watch the full debate here

Spatially distributed consciousness is not an abstract thought experiment if AI is conscious

Adrià Garriga-alonso

3mo

A fun paper in philosophy of consciousness is Eric Schwitzgebel’s “If Materialism Is True, The United States Is Probably Conscious”. It argues that, if you believe that consciousness comes from the material world (and not from élan vital or souls), and you’re a bit cosmopolitan with respect to what you consider conscious, then the US of A is probably conscious.

The evocative argument proceeds by posing two alien species, both of which can talk and behave kind of like humans do.

The Sirian Supersquids are squid-like beings with brains in their separable limbs, whose nervous systems signal with pulses of light. They invent some sort of transceiver that can communicate the pulses of light to...

Alignment will happen by default. What’s next?

Adrià Garriga-alonso

3mo

I’m not 100% convinced of this, but I’m fairly convinced, more and more so over time. I’m hoping to start a vigorous but civilized debate. I invite you to attack my weak points and/or present counter-evidence.

My thesis is that intent-alignment is basically happening, based on evidence from the alignment research in the LLM era.

Introduction

The classic story about loss of control from AI is that optimization pressure on proxies will cause the AI to value things that humans don’t. (Relatedly, the AI might become a mesa-optimizer with an arbitrary goal).

But the reality that I observe is that the AIs are really nice and somewhat naive. They’re like the world’s smartest 12-year-old (h/t Jenn). We put...

Infinitesimally False

Adrià Garriga-alonso, abramdemski

3mo

Abstract:^[1]

Tarski's Undefinability Theorem showed (under some plausible assumptions) that no language can contain its own notion of truth. This deeply counterintuitive result launched several generations of research attempting to get around the theorem, by carefully discarding some of Tarski's assumptions.

In Hartry Field's Saving Truth from Paradox, he considers and discards a theory of truth based on Łukasiewicz logic, calling it the "most successful theory so far" (out of very very many considered by that point in the book) and enumerating its many virtues. The theory is discarded because of a negative result due to Greg Restall, which shows that a version of the theory with quantifiers must be $\omega$-inconsistent.

We consider this abandonment too...

A scheme to credit hack policy gradient training

Adrià Garriga-alonso

3mo

Thanks to Inkhaven for making me write this, and Justis Mills, Abram Demski, Markus Strasser, Vaniver and Gwern for comments. None of them endorse this piece.

The safety community has previously worried about an AI hijacking the training process to change itself in ways that it endorses, but the developers don’t. Ngo (2022) calls the general phenomenon credit hacking, and distinguishes two types:

Gradient hacking (original term): in pretraining, the AI outputs something that makes it go down different trajectories in the gradient landscape that end up furthering its ends. This is widely agreed to be very difficult (see also 1, 2), even though it is possible to induce it somewhat using meta-learning.
Exploration hacking (aka “sandbagging” or deliberate underperformance): the AI intentionally avoids doing well on the

...

Anthropic's JumpReLU training method is really good

chanind, Adrià Garriga-alonso

4mo

This work was done as part of MATS 7.1.

TL;DR: If you've given up on training JumpReLU SAEs, try out Anthropic's JumpReLU training method. It's now supported in SAELens!

Back in January, Anthropic published some updates on how they train JumpReLU SAEs. The post didn't include any sample code or benchmarks or theoretical justification for the changes, so it seems like the community basically shrugged and ignored it. After all, we already have the original GDM implementation in the Dictionary Learning and SAELens libraries, and most practitioners don't use JumpReLU SAEs anyway, since BatchTopK SAEs are so much easier to train and are also considered state-of-the-art.
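For readers who have not met the activation itself, here is a minimal sketch of the JumpReLU forward pass: a pre-activation passes through unchanged if it exceeds a per-latent threshold and is zeroed otherwise. This shows only the forward function; it is not Anthropic's training method (which the excerpt does not spell out) and not the SAELens API. The names `jumprelu` and `threshold` are chosen for illustration.

```python
import numpy as np

def jumprelu(pre_acts, threshold):
    """JumpReLU: keep a pre-activation only where it exceeds its latent's threshold."""
    return np.where(pre_acts > threshold, pre_acts, 0.0)

def relu(pre_acts):
    """Plain ReLU, for comparison: equivalent to JumpReLU with threshold 0."""
    return np.maximum(pre_acts, 0.0)

pre_acts = np.array([-1.0, 0.3, 0.7])
threshold = np.array([0.5, 0.5, 0.5])  # assumed to be learned per latent in a real SAE

jump_out = jumprelu(pre_acts, threshold)
relu_out = relu(pre_acts)
```

The output jumps from 0 straight to the threshold value, so the function has zero gradient with respect to the threshold almost everywhere. That is the usual reason given for why training these SAEs needs a special gradient estimator.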

Why has JumpReLU not been popular?

The biggest issue I've...

A recurrent CNN finds maze paths by filling dead-ends

Adrià Garriga-alonso

5mo

Done as part of my work with FAR AI, back in February 2023. It's a small result but I want to get it out of my drafts folder. It was the start of the research that led to interpreting the Sokoban planning RNN.

I was trying to study neural networks that plan, in order to have examples of mesa-optimizers.

I trained the recurrent maze CNN from Bansal et al. 2022 to solve 9x9 mazes, and applied it to a 33x33 maze. The architecture in their paper is a recurrent convolutional NN (R-CNN) that is regularized to be able to stop its computation at any iteration: during training, the NN runs for a random...
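The post's title says the network finds paths by filling dead-ends. For reference, the classical dead-end filling algorithm can be sketched as follows. This is the textbook procedure, written under the assumption of a grid maze with 4-connectivity; it is not a claim about the exact computation the trained network performs, and the maze layout and names below are made up for the example.

```python
def fill_dead_ends(maze, start, goal):
    """Repeatedly wall off open cells with at most one open neighbor.

    maze: list of equal-length strings, '#' for wall and '.' for open.
    Returns the set of open cells that survive; in a maze without loops
    this is exactly the start-to-goal path.
    """
    open_cells = {
        (r, c)
        for r, row in enumerate(maze)
        for c, ch in enumerate(row)
        if ch != "#"
    }

    def open_neighbors(cell):
        r, c = cell
        candidates = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
        return [n for n in candidates if n in open_cells]

    def is_dead_end(cell):
        return cell not in (start, goal) and len(open_neighbors(cell)) <= 1

    frontier = [cell for cell in open_cells if is_dead_end(cell)]
    while frontier:
        cell = frontier.pop()
        if cell not in open_cells:
            continue  # already filled via another route
        neighbors = open_neighbors(cell)
        open_cells.remove(cell)
        # Filling a cell can turn its neighbor into a new dead end.
        frontier.extend(n for n in neighbors if is_dead_end(n))
    return open_cells

maze = [
    "#######",
    "#...#.#",
    "#.#.#.#",
    "#.#...#",
    "#.###.#",
    "#######",
]
path_cells = fill_dead_ends(maze, start=(1, 1), goal=(4, 5))
```

Each filling step only looks at a cell's immediate neighbors, so it is the kind of local update that a convolution applied repeatedly could in principle express.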

The "Sparsity vs Reconstruction Tradeoff" Illusion

chanind, Adrià Garriga-alonso

6mo

This work was done as part of MATS 7.1. For more details on the ideas presented here, check out our new workshop paper Sparse but Wrong: Incorrect L0 Leads to Incorrect Features in Sparse Autoencoders.

Nearly all work on Sparse Autoencoders (SAEs) includes a version of the classic "sparsity vs reconstruction tradeoff" plot, showing how changing the L0 (the sparsity) of the SAE changes the reconstruction of the SAE, measured in variance explained or mean squared error.

Sparsity vs reconstruction tradeoff curves from "Scaling and evaluating sparse autoencoders", Gao et al. 2024

This implies the following two things:

The better the reconstruction the better the SAE. If one SAE reconstructs inputs better than another SAE at a

...

L0 is not a neutral hyperparameter

chanind, Adrià Garriga-alonso

7mo

When we train Sparse Autoencoders (SAEs), the sparsity of the SAE, called L0 (the number of latents that fire on average), is treated as an arbitrary design choice. All SAE architectures include plots of L0 vs reconstruction, as if any choice of L0 is equally valid.
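As a concrete reading of that definition, L0 can be computed from a batch of SAE latent activations by counting the nonzero latents for each input and averaging over the batch. The array shape (inputs by latents) and the function name are assumptions for this sketch.

```python
import numpy as np

def l0(latent_acts):
    """Average number of nonzero (firing) latents per input.

    latent_acts: array of shape (n_inputs, n_latents).
    """
    return float((latent_acts != 0).sum(axis=1).mean())

# Two inputs, four latents: the first input fires 2 latents, the second fires 1.
latent_acts = np.array([
    [0.0, 1.2, 0.0, 0.5],
    [0.0, 0.0, 0.0, 3.0],
])
average_l0 = l0(latent_acts)
```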

However, recent work that goes beyond just calculating sparsity vs reconstruction curves shows the same trend: low L0 SAEs learn the wrong features ^[1]^[2].

In this post, we investigate this phenomenon in a toy model with correlated features and show the following:

If the L0 of the SAE is lower than the true L0 of the underlying features, the SAE will "cheat" to get a better MSE loss score than

...

LESSWRONG

Adrià Garriga-alonso

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

Among Us: A Sandbox for Agentic Deception

Compact Proofs of Model Performance via Mechanistic Interpretability
