WCargo

(Léo Dana) French master's student in applied mathematics (probability & statistics), soon to start a PhD in mathematics in Paris

Posts

4 Visualizing small Attention-only Transformers

5mo

11 On Interpretability's Robustness

68 Introducing EffiSciences’ AI Safety Unit

54 Improvement on MIRI's Corrigibility

16 A Corrigibility Metaphore - Big Gambles

Comments

The Field of AI Alignment: A Postmortem, and What To Do About It

WCargo 4mo 40

I agree with claims 2-3 but not with claim 1.

I think "random physicist" is not super fair: from his standpoint, it looks like he did indeed meet physicists who were willing to do "alignment" research and who had backgrounds in research and in developing theory.
We didn't find PhD students to work on alignment, but we also didn't try (at least not CeSIA / EffiSciences).
It's true that most of the people we found who wanted to work on the problem were the motivated ones, but from the point of view of the alignment problem, recruiting them could still be a mistake (saturating the field, etc.).

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

WCargo 1y Ω110

Quick question: you say that MLPs 2-6 gradually improve the representation of the athlete's sport, and that no single MLP does it in one go. Would you consider that the reason is something like what this post describes? https://www.lesswrong.com/posts/8ms977XZ2uJ4LnwSR/decomposing-independent-generalizations-in-neural-networks

So MLPs 2-6 basically do the same computation, but each in a different superposition basis, so that after several MLPs the model is pretty confident about the answer? If so, do you think there is something more to say about the way the bases are arranged, e.g. which concepts interfere with which? (I guess this could help answer questions like "how to change the name-surname-sport lookup table", which we are currently not able to do.)

Thanks
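To make "which concepts interfere with which" concrete, here is a minimal sketch, assuming each concept is stored as a direction in activation space and that interference between two concepts can be approximated by the cosine similarity of their directions. The function names and the toy vectors are hypothetical and are not taken from the post being discussed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def interference_table(features):
    """Map each unordered pair of concept names to the cosine of their directions.

    `features` is a dict {name: vector}. A value near 0 means the two
    directions barely overlap; a value near +/-1 means reading one concept
    also picks up the other.
    """
    names = sorted(features)
    return {
        (a, b): cosine(features[a], features[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


if __name__ == "__main__":
    # Hypothetical directions: three concepts squeezed into two dimensions.
    toy = {
        "basketball": (1.0, 0.0),
        "baseball": (0.0, 1.0),
        "football": (1.0, 1.0),
    }
    for pair, value in interference_table(toy).items():
        print(pair, round(value, 3))
```

With real models, the directions would have to be estimated first (for example with probes), and cosine similarity is only one possible proxy for interference.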

Interpreting OpenAI's Whisper

WCargo 2y 10

Thanks for the post Ellena!

I was wondering if the finding "words are clustered by vocal and semantic similarity" also exists in traditional LLMs? I don't remember seeing that, so could it mean that this modularity could also make interpretability easier?

It seems logical: we have more structure in the data, so a better way to cluster the text, but I'm curious about your opinion.

Activation adding experiments with FLAN-T5

WCargo 2y 74

Hi, interesting experiments. What were you trying to find, and how would you measure that the content is correctly mixed instead of just having "unrelated concepts juxtaposed"?

Also, how did you choose the layer at which to merge your streams?

DSLT 1. The RLCT Measures the Effective Dimension of Neural Networks

WCargo 2y 10

Hi, thank you for the sequence. Do you know if there is any way to get access to Watanabe's book for free?

Superposition and Dropout

WCargo 2y 10

In an MLP, the nodes from different layers are in series (you need to go through the first, and then the second), but inside the same layer they are in parallel (you go through one or the other).

The analogy is with electrical systems, but I was mostly thinking in terms of LLM components: the MLPs and attention layers are in series (you go through the first and then through the second), but inside one component, the units are in parallel.

I guess that, in that case, inside a component there is less superposition (the evidence is this post), and between components there is redundancy (so if a computation fails somewhere, it is also done somewhere else).

In general, dropout makes me feel that, because some parts of the network are not going to work, the network has to implement "independent" components in order to compute things properly.
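One way to read the electrical analogy is as a reliability calculation. The sketch below is a toy model, assuming each unit is dropped independently with probability `p_drop` and that a computation survives a stage if at least one redundant copy of it is kept. The function names are hypothetical, and this says nothing about what trained networks actually learn.

```python
def survives_in_parallel(p_drop: float, copies: int) -> float:
    """Probability that at least one of `copies` redundant units is kept."""
    return 1.0 - p_drop ** copies


def survives_in_series(p_drop: float, stages: int, copies_per_stage: int = 1) -> float:
    """Probability that every one of `stages` successive steps keeps at least one copy."""
    return survives_in_parallel(p_drop, copies_per_stage) ** stages


if __name__ == "__main__":
    p = 0.1  # hypothetical dropout rate
    print("one stage, one copy:     ", survives_in_parallel(p, copies=1))
    print("one stage, two copies:   ", survives_in_parallel(p, copies=2))
    print("three stages, one copy:  ", survives_in_series(p, stages=3))
    print("three stages, two copies:", survives_in_series(p, stages=3, copies_per_stage=2))
```

In this toy model a chain of stages is more fragile than any single stage, so adding redundant copies pays off most for computations that span several components in series.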

Superposition and Dropout

WCargo 2y 10

One thing I just thought about: I would predict that dropout reduces superposition in parallel and increases superposition in series (because, to make sure that the function is computed, you can have redundancy).

Improvement on MIRI's Corrigibility

WCargo 2y 10

Thank you! I don't know why, but I previously ended up on a different page with broken links (maybe some problem on my end).

Forum Digest: Corrigibility, utility indifference, & related control ideas

WCargo 2y 10

Hey, almost all the links are dead. Would it be possible to update them? Otherwise the post is pretty useless, and I am interested in them ^^

Improvement on MIRI's Corrigibility

WCargo 2y 10

Indeed. D4 is better than D5 if we had to choose, but D4 is harder to formalize. I think that having a theory of corrigibility without D4 is already a good step, as D4 seems like "asking to create corrigible agents", so maybe the way to do it is: 1. have a theory of corrigible agents (D1, D2, D3, D5) and 2. have a theory of agents that ensures D4 by applying the previous theory to all agents and subagents.