Modifying LLM Beliefs with Synthetic Document Finetuning

10mo

In this post, we study whether we can modify an LLM’s beliefs and investigate whether doing so could decrease risk from advanced AI systems.

We describe a pipeline for modifying LLM beliefs via synthetic document finetuning and introduce a suite of evaluations that suggest our pipeline succeeds in inserting all but the most implausible beliefs. We also demonstrate proof-of-concept applications to unlearning and to honeypotting for detecting model misalignment.

Large language models develop implicit beliefs about the world during training, shaping how they reason and act. (In this work, we construe AI systems as believing a claim if they consistently behave in accordance with that claim.) In this work, we study whether we can systematically...

Avery 1y

Noting that I had a conversation with Linda a week prior to the application deadline where they shared this trivia / prediction with me. Interesting!

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

Lucius Bushnaq

Lucius Bushnaq, jake_mendel, Dan Braun, StefanHex, Nicholas Goldowsky-Dill, Kaarel, Avery, Joern Stoehler, debrevitatevitae, Magdalena Wache, Marius Hobbhahn

This is a linkpost for our two recent papers:

An exploration of using degeneracy in the loss landscape for interpretability https://arxiv.org/abs/2405.10927
An empirical test of an interpretability technique based on the loss landscape https://arxiv.org/abs/2405.10928

This work was produced at Apollo Research in collaboration with Kaarel Hanni (Cadenza Labs), Avery Griffin, Joern Stoehler, Magdalena Wache and Cindy Wu. Not to be confused with Apollo's recent Sparse Dictionary Learning paper.

A key obstacle to mechanistic interpretability is finding the right representation of neural network internals. Optimally, we would like to derive our features from some high-level principle that holds across different architectures and use cases. At a minimum, we know two things:

We know that the training loss goes...

Basin broadness depends on the size and number of orthogonal features

CallumMcDougall

CallumMcDougall, Avery, Lucius Bushnaq

TL;DR

For neural networks trained to perfect loss, the broadness of optima in parameter space is given by the number and norm of independent orthogonal features the neural network has. The inner product that defines this "feature norm" and independence/orthogonality is the $L_{2}$ inner product of Hilbert space.

Introduction - why do we care about broadness?

Recently, there's been some discussion of what determines the broadness of optima of neural networks in parameter space.

People care about this because the size of the optimum basin may influence how easy it is for gradient descent (or other local optimisation algorithms which you might use to train a neural network) to find the corresponding solution, since it's probably easier to stumble...

What Is The True Name of Modularity?

CallumMcDougall

CallumMcDougall, Lucius Bushnaq, Avery

TL;DR

Modularity seems like an important feature of neural networks, but there is currently no canonical way of defining or measuring it that is theoretically well motivated and doesn’t break down in some cases - in other words, we haven’t yet found a True Name for it. Most modularity measures used in experiments are based on ad-hoc methods from graph theory or network theory, and don’t seem to capture the kind of modularity we care about.

In this post, we explore these existing ways of measuring modularity in neural networks, and their limitations. We also outline our ideas for a new modularity metric, and in particular the important role we think two branches of mathematics...

Ten experiments in modularity, which we'd like you to run!

CallumMcDougall

CallumMcDougall, Lucius Bushnaq, Avery

This is the third post describing our team’s work on selection theorems for modularity, as part of a project mentored by John Wentworth (see here for the earlier posts). Although the theoretical and empirical parts of the project have both been going very well, we’re currently bottlenecked on the empirical side: we have several theories and ideas for how to test them, but few experimental results. Right now, we only have one empiricist coding up experiments, so this overhang seems likely to persist.

The purpose of this post is to outline some of our ideas for experiments. We hope that this will provide concrete steps for people who are interested in engaging with...

Project Intro: Selection Theorems for Modularity

CallumMcDougall

CallumMcDougall, Avery, Lucius Bushnaq

Introduction - what is modularity, and why should we care?

It’s a well-established meme that evolution is a blind, idiotic process that has often resulted in design choices that no sane systems designer would endorse. However, if you are studying simulated evolution, one thing that jumps out at you immediately is that biological systems are highly modular, whereas neural networks produced by genetic algorithms are not. As a result, the outputs of evolution often look more like something that a human might design than do the learned weights of those neural networks.

Humans have distinct organs, like hearts and livers, instead of a single heartliver. They have distinct, modular sections of their brains that...

Theories of Modularity in the Biological Literature

CallumMcDougall

CallumMcDougall, Avery, Lucius Bushnaq

Introduction

This post is part of a sequence describing our team’s research on selection theorems for modularity, as part of this year's AI Safety Camp, under the mentorship of John Wentworth. Here, we provide some background reading for the discussion of modularity that will follow.

As we describe in more detail in our project intro, the motivating question of modularity that we started with (which is described in John’s post on the Evolution of Modularity) is: why does evolution seem to have produced modular systems (e.g. organs and organ systems), while current ML systems (even genetic algorithms, which are consciously modeled on evolutionary mechanisms) are highly non-modular? So far, most of our research has...

Replying to Framing Practicum: Semistable Equilibrium

Avery 4y

Meteors. This one isn’t exactly right since the equilibrium point either happens or gets skipped over entirely but nonetheless...imagine a meteoroid flying through the solar system towards Earth. If it crosses into the atmosphere, it becomes a meteor. And then depending on the composition of the meteor and its size, it may burn up in the atmosphere or it may make it to the surface (at that point, it’s a meteorite).

In this case, you can think of the “zone of attraction” as the entire journey to the atmosphere. Once it hits the atmosphere, it is quickly decelerated by air resistance. And if the meteor burns up, then its velocity “stabilizes” at 0.

