A short project on Mamba: grokking & interpretability
Epistemic status: I've worked on this project for ~20h, in my free time and using only a Colab notebook.

Executive summary

I trained a minimalistic implementation of Mamba (details below) on the modular addition task. I found that:

1. This non-transformer-based model can also exhibit grokking (i.e., the model learns to generalise after overfitting to the training data).
2. There are tools we can import from neuroscience that help us interpret how the network representation changes as grokking takes place over training epochs.

Introduction

Almost all of the Mechanistic Interpretability (MI) efforts I've seen people excited about, and the great majority of the techniques I've learned, are related to Transformer-based architectures. At the same time, a competitive alternative (Mamba) was recently introduced and later scaled. Coupling these two facts together, a giant gap between capabilities and safety emerges. Thus, I think Mamba provides an interesting use case where we can test whether the more conceptual foundations of MI are solid (i.e., somewhat model-agnostic) and, therefore, whether MI can potentially survive another transformer-like paradigm shift in the race towards AGI.

For a bit more context, Mamba is based on a special version of State Space Models (SSMs): add another S (for Structured) and you have one of its essential components. The actual architecture is slightly more complex than the S-SSM layer, as you can see in this awesome post, but for this project I wrote up a minimal implementation that could get the job done (a rough sketch of such a layer is included at the end of this section).

A simple-yet-interesting enough task

The task the model has to solve is: given two input integers (x and y), return whether their sum is divisible by a big prime number (p = 113, in this case). This is mapped into a setup that autoregressive token predictors can deal with: one input example consists of three tokens, 'x', 'y' and '=', and the only output token is either '0' (if (x + y) mod p ≠ 0) or '1' (otherwise).
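To make this concrete, here is a minimal sketch of how such a dataset could be built. This is an illustration rather than the project's actual code: the token id chosen for '=', the train fraction, and the use of PyTorch are my assumptions.

```python
import torch

p = 113  # the prime modulus used in the task

# Assumed vocabulary: integers 0..p-1 for x and y, plus a dedicated '=' token.
EQUALS = p  # hypothetical token id for '='

# Enumerate every (x, y) pair and build the 3-token inputs (x, y, '=').
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))      # (p*p, 2)
inputs = torch.cat([pairs, torch.full((p * p, 1), EQUALS)], dim=1)  # (p*p, 3)

# Label is 1 iff (x + y) is divisible by p, and 0 otherwise.
labels = ((pairs[:, 0] + pairs[:, 1]) % p == 0).long()              # (p*p,)

# Random train/test split; grokking setups usually keep only a fraction for training.
train_frac = 0.3  # assumed value, not taken from the original write-up
perm = torch.randperm(p * p)
n_train = int(train_frac * p * p)
train_idx, test_idx = perm[:n_train], perm[n_train:]
train_inputs, train_labels = inputs[train_idx], labels[train_idx]
test_inputs, test_labels = inputs[test_idx], labels[test_idx]
```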
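For completeness, here is a rough sketch of the kind of minimal selective-SSM layer referred to in the introduction. This is not the project's actual implementation: the parameterisation of A, the Euler-style discretisation of B, and the sequential (non-parallel) scan are simplifying assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MinimalSelectiveSSM(nn.Module):
    """Sketch of the selective SSM recurrence at the core of Mamba.

    Simplifications (assumptions): no convolution, no gating branch, and a plain
    Python loop instead of the hardware-aware parallel scan.
    """
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.d_model, self.d_state = d_model, d_state
        # A is stored in log space and kept negative for a stable recurrence.
        self.A_log = nn.Parameter(
            torch.log(torch.arange(1, d_state + 1, dtype=torch.float32)).repeat(d_model, 1)
        )
        self.D = nn.Parameter(torch.ones(d_model))  # skip connection
        # Input-dependent ("selective") projections for B, C and the step size Δ.
        self.x_to_B = nn.Linear(d_model, d_state, bias=False)
        self.x_to_C = nn.Linear(d_model, d_state, bias=False)
        self.x_to_dt = nn.Linear(d_model, d_model, bias=True)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        batch, seq_len, _ = x.shape
        A = -torch.exp(self.A_log)                # (d_model, d_state)
        B = self.x_to_B(x)                        # (batch, seq_len, d_state)
        C = self.x_to_C(x)                        # (batch, seq_len, d_state)
        dt = F.softplus(self.x_to_dt(x))          # (batch, seq_len, d_model)

        h = torch.zeros(batch, self.d_model, self.d_state, device=x.device)
        outputs = []
        for t in range(seq_len):
            # Discretise: A_bar = exp(Δ·A), B_bar ≈ Δ·B (Euler approximation).
            A_bar = torch.exp(dt[:, t].unsqueeze(-1) * A)
            B_bar = dt[:, t].unsqueeze(-1) * B[:, t].unsqueeze(1)
            h = A_bar * h + B_bar * x[:, t].unsqueeze(-1)             # state update
            y_t = (h * C[:, t].unsqueeze(1)).sum(-1) + self.D * x[:, t]
            outputs.append(y_t)
        return torch.stack(outputs, dim=1)        # (batch, seq_len, d_model)
```

In the full Mamba block this recurrence sits inside a gated branch together with a short convolution and input/output projections, and the scan is computed in parallel; the explicit loop above is just the easiest way to see the state update.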