WCargo — LessWrong

Replying to Addressing Feature Suppression in SAEs

Hey, I was wondering if you are using any weight decay during the training of the SAE? It feels to me that, being equivalent to implicit L2 minimization, this could also be the culprit. And if you don't, how come the first layer doesn't collapse to 0 while the second layer grows to infinity?
Thanks if you take the time to answer :)
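
To make the collapse question concrete, here is a minimal sketch, assuming a plain ReLU SAE with no decoder bias and no norm constraint (this architecture is an assumption for illustration, not the setup of the post being replied to). Scaling the encoder by a factor `alpha` and the decoder by `1 / alpha` leaves the reconstruction unchanged while shrinking the L1 penalty on the activations, which is the degenerate direction described above.

```python
import random

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, x):
    # W is a list of rows; returns W @ x
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def scale(M, c):
    return [[c * a for a in row] for row in M]

def sae_forward(W_enc, b_enc, W_dec, x):
    """Hypothetical toy SAE: f = ReLU(W_enc x + b_enc), x_hat = W_dec f."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    f = relu(pre)
    return matvec(W_dec, f), f

random.seed(0)
d, m = 4, 8  # input dimension, number of dictionary features (arbitrary)
x = [random.gauss(0, 1) for _ in range(d)]
W_enc = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]  # m x d
b_enc = [random.gauss(0, 1) for _ in range(m)]
W_dec = [[random.gauss(0, 1) for _ in range(m)] for _ in range(d)]  # d x m

alpha = 0.1  # shrink the encoder, grow the decoder
x_hat, f = sae_forward(W_enc, b_enc, W_dec, x)
x_hat_s, f_s = sae_forward(
    scale(W_enc, alpha), [alpha * b for b in b_enc], scale(W_dec, 1 / alpha), x
)

l1 = sum(f)      # activations are non-negative, so the sum is the L1 norm
l1_s = sum(f_s)  # equals alpha * l1, because ReLU is positively homogeneous
recon_gap = max(abs(a - b) for a, b in zip(x_hat, x_hat_s))
print(f"L1 before: {l1:.4f}, after: {l1_s:.4f}, reconstruction gap: {recon_gap:.2e}")
```

Under these assumptions, something has to break the symmetry: a constraint on the decoder column norms is one commonly used option, and weight decay on the decoder would also penalize the growing side.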

Try Training SAEs with RLAIF

WCargo

2mo

Epistemic status: not an Interpretability researcher, but has followed the scene closely.

So, it makes sense to me that probes should outperform SAEs: probes are trained directly to maximize an interpretable metric, while SAEs are trained to minimize reconstruction loss and are only interpreted afterwards. But training SAEs is nice because it is an unsupervised problem, meaning that you don't need to create a dataset to find a direction for each concept as you do with probes.
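
As a rough illustration of the contrast, the two objectives can be written as follows. These are common textbook forms (an L1-penalized ReLU autoencoder and a logistic-regression probe), assumed here for illustration rather than taken from the post; the symbols $W_e, b_e, W_d, \lambda, w, b$ are hypothetical names.

```latex
% SAE: unsupervised, needs only activations x
\mathcal{L}_{\text{SAE}}(x) = \lVert x - W_d\, f(x) \rVert_2^2 + \lambda \lVert f(x) \rVert_1,
\qquad f(x) = \mathrm{ReLU}(W_e x + b_e)

% Probe: supervised, needs a concept label y \in \{0, 1\} for each activation x
\mathcal{L}_{\text{probe}}(x, y) = -\big[\, y \log \sigma(w^\top x + b) + (1 - y) \log\big(1 - \sigma(w^\top x + b)\big) \big]
```

The label $y$ appears only in the probe loss, which is the dataset cost mentioned above; nothing in the SAE loss refers to interpretability directly.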

How can we get the best of both worlds? Well, just train SAEs on an objective that directly maximizes the human interpretability of the features!

Illustration of the training objectives of SAEs and Probes. The third design

... (read 435 more words →)

Replying to The Field of AI Alignment: A Postmortem, and What To Do About It

WCargo 1y

I agree with claims 2-3 but not with claim 1.

I think « random physicist » is not super fair: from his standpoint, it looks like he did meet physicists who were willing to do « alignment » research and who had backgrounds in research and in developing theory.
We didn’t find PhD students to work on alignment, but we also didn’t try (at least not CeSIA / EffiSciences).
It’s true that most of the people we found who wanted to work on the problem were the motivated ones, but from the point of view of the alignment problem, recruiting them could still be a mistake (saturating the field, etc.).

Visualizing small Attention-only Transformers

WCargo

Work done during an internship at MILES, Paris Dauphine University, under the supervision of Yann Chevaleyre. You can find the git page for the post here.

Research has indicated that in large Transformers, facts are primarily stored in the MLP layer rather than the attention layer. However, it's worth exploring whether the attention layer also plays a role in memorizing some part of the data. Can an attention layer memorize information, and if so, how?

In this blog-post, we define the memorization task as predicting the correct next token for a pair of input tokens. Our goal was to determine if the Transformer exhibits any structure that supports memorization for this... (read 2193 more words →)

Replying to Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

WCargo 2y

Quick question: you say that MLPs 2-6 gradually improve the representation of the athlete's sport, and that no single MLP does it in one go. Do you think the reason could be something like what this post describes? https://www.lesswrong.com/posts/8ms977XZ2uJ4LnwSR/decomposing-independent-generalizations-in-neural-networks

So MLPs 2-6 basically do the same computation, but in different superposition bases, so that after several MLPs the model is pretty confident about the answer? Then do you think there is something more to say about the way the "bases are arranged", e.g. which concepts interfere with which? (I guess this could help answer questions like "how to change the lookup table name-surname-sport", which we are currently not able to do.)

Thanks

Results from the Turing Seminar hackathon

Charbel-Raphaël

Charbel-Raphaël, jeanne_, WCargo

We (EffiSciences) ran a hackathon at the end of the Turing Seminar, an academic course inspired by the AGISF and held at ENS Paris-Saclay and ENS Ulm, with 28 projects submitted by 44 participants on the 11th and 12th of November.

We share a selection of projects. See them all here.

I think some of them could even be turned into valuable blog posts, and I’ve learnt a lot by reading everything. Here are a few extracts.

Towards Monosemanticity: Decomposing Vision Models with Dictionary Learning

David HEURTEL-DEPEIGES [link]

Basically an adaptation of the famous dictionary learning paper to vision CNNs.

“When looking at 100 random features, 46 were found to be interpretable and monosemantic, [...] When doing the same experiment with 100 random neurons, 2... (read 1489 more words →)

On Interpretability's Robustness

WCargo

Léo Dana - Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, as well as an internship at FAR AI, both under the mentorship of Claudia Shi.

Would you trust the IOI circuit?

Interpretability is envisioned by many as the main source of alignment tools for future AI systems, yet I claim that interpretability’s Theory of Change has a central problem: we don’t trust interpretability tools. Why is that?

No proof of generalization: for interpretability, we have the same problem as for the models we study: we don’t know if our tools will generalize (and we will likely never have proofs of generalization).
Security mindset: there is always something that could

... (read 1025 more words →)

Replying to Interpreting OpenAI's Whisper

WCargo 2y

Thanks for the post Ellena!

I was wondering if the finding "words are clustered by vocal and semantic similarity" also exists in traditional LLMs? I don't remember seeing that, so could it mean that this modularity could also make interpretability easier?

It seems logical: we have more structure in the data, so a better way to cluster the text, but I'm curious about your opinion.

Replying to Activation adding experiments with FLAN-T5

WCargo 3y

Hi, interesting experiments. What were you trying to find, and how would you measure that the content is correctly mixed rather than just having "unrelated concepts juxtaposed"?

Also, how did you choose the layer at which to merge your streams?

Replying to DSLT 1. The RLCT Measures the Effective Dimension of Neural Networks

WCargo 3y

Hi, thank you for the sequence. Do you know if there is any way to get access to Watanabe’s book for free?

Replying to Superposition and Dropout

WCargo 3y

In an MLP, the nodes from different layers are in series (you need to go through the first, and then the second), but inside the same layer they are in parallel (you go through one or the other).

The analogy is with electrical systems, but I was mostly thinking in terms of LLM components: the MLPs and attention layers are in series (you go through the first and then through the second), but inside one component, the parts are in parallel.

I guess that, inside a component, there is then less superposition (the evidence is this post), and between components there is redundancy (so if a computation fails somewhere, it is also done somewhere else).

In general, dropout makes me feel that, because some parts of the network are not going to work, the network has to implement "independent" components in order to compute things properly.

Introducing EffiSciences’ AI Safety Unit

WCargo

WCargo, Charbel-Raphaël, Florent_Berthet

This post was written by Léo Dana, Charbel-Raphaël Ségerie, and Florent Berthet, with the help of Siméon Campos, Quentin Didier, Jérémy Andréoletti, Anouk Hannot and Tom David.

In this post, you will learn what EffiSciences’ most successful field-building activities were, as well as our advice, reflections, and takeaways for field-builders. We also include our roadmap for the next year. Voilà.

What is EffiSciences?

EffiSciences is a non-profit based in France whose mission is to mobilize scientific research to overcome the most pressing issues of the century and ensure a desirable future for generations to come.

EffiSciences was founded in January 2022 and is now a team of ~20 volunteers and 4 employees.

At the moment, we are focusing... (read 3521 more words →)

Replying to Superposition and Dropout

WCargo 3y

One thing I just thought about: I would predict that dropout reduces superposition in parallel and increases superposition in series (because, to make sure that the function is computed, you can have redundancy).

Replying to Improvement on MIRI's Corrigibility

WCargo 3y

Thank you! I don't know why, but earlier I ended up on a different page with broken links (maybe some problem on my end)!

Replying to Forum Digest: Corrigibility, utility indifference, & related control ideas

WCargo 3y

Hey, almost all the links are dead; would it be possible to update them? Otherwise the post is pretty useless, and I am interested in them ^^

Improvement on MIRI's Corrigibility

WCargo

WCargo, Charbel-Raphaël

This post was written as a submission for the AI Alignment Award, initiated at EffiSciences' event.

This post aims to address the problem of corrigibility as identified by MIRI in 2015. We propose an extended formalism that allows us to write the desiderata of a corrigible behaviour, and provide theoretical solutions with helpful illustrations of each proposal. The first extension is to make the agent behave as if the shutdown button does not exist, and the second is to make the agent behave as if the button does not work.

The first section's goal is to recall the formalism of MIRI's article Corrigibility, as well as the Big Gamble problem, and to introduce corrigibility... (read 3672 more words →)

A Corrigibility Metaphore - Big Gambles

WCargo

I present here a helpful analogy for understanding the corrigibility problem and the challenge raised by MIRI in their proposal. This analogy greatly simplifies some challenges of corrigibility but keeps the main problem found in the proposal, which I call Big Gambles.

You are playing Mario Kart with two friends, Alice and Bob, over a series of matches to see who is the best player on average, as measured by each player's average finishing position (1st, 2nd, or 3rd).

Alice has a poor internet connection and so gets disconnected sometimes. To account for that in the competition, a rule was added that her score would be 2nd if she is disconnected so that she is not penalized too much.

Bob takes... (read 1079 more words →)