User Comment Replies

The Field of AI Alignment: A Postmortem, and What To Do About It

I agree with claim 2-3 but not with claim 1

I think « random physicist » is not super fair, it looks like from his stand point he indeed met physicist willing to do « alignment » research, and had backgrounds in research and developping theory
We didn’t find Phd student to work on alignment but also we didn’t try (at least not cesia / effisciences)
Its true that most of the people we find that wanted to work on the problem were the motivated ones, but from the point of view of the alignment problem still recruiting them could be a mistake (saturating the field etc)

2Charbel-Raphaël5mo

What do you think of my point about Scott Aaronson? Also, since you agree with points 2 and 3, it seems that you also think that the most useful work from last year didn't require advanced physics, so isn't this a contradiction with you disagreing with point 1?

Fact Finding: Attempting to Reverse-Engineer Factual Recall on the Neuron Level (Post 1)

WCargo1yΩ110

Quick question: you say that the MLP 2-6 gradually improve the representation of the sport of the athlete, and that no single MLP do it in one go. Would you consider that the reason would be something like this post describes ? https://www.lesswrong.com/posts/8ms977XZ2uJ4LnwSR/decomposing-independent-generalizations-in-neural-networks

So the MLP 2-6 basically do the same computations, but in a different superposition basis so that after several MLPs, the model is pretty confident about the answer ? Then would you think there is something more to say in the ... (read more)

3Neel Nanda1y

We dig into this in post 3. The layers compose importantly with each other and don't seem to be doing the same thing in parallel, path patching the internal connections will break things, so I don't think it's like what you're describing

Interpreting OpenAI's Whisper

WCargo2y10

Thanks for the post Ellena!

I was wondering if the finding "words are clustered by vocal and semantic similarity" also exists in traditional LLMs? I don't remember seeing that, so could it mean that this modularity could also make interpretability easier?

It seems logical: we have more structure on the data, so better way to cluster the text, but I'm curious of your opinion.

1EllenaR2y

I wouldn't expect an LLM to do this. An LLM wants to predict the most likely next word, so is going to assign high probabilities to semantically similar words (hence why they are clustered in embedding space). Whisper is trying to do speech-to-text, so as well as needing to know about semantic similarity of words it also needs to know about words that sound the same. Eg if it thinks it heard 'rug', it is pretty likely that the person speaking actually said 'mug' hence these words are clustered. Does that make sense?

Activation adding experiments with FLAN-T5

WCargo2y74

Hi, Interesting experiments. What were you trying to find and how would you measure that the content is correctly mixed instead of just having "unrealated concepts juxtaposed" ?

Also, how did you choose which layer to merge your streams ?

7Nina Panickssery2y

In many cases, it seems the model is correctly mixing the concepts in some subjective sense. This is more visible in the feeling prediction task, for instance, when the concepts of victory and injury are combined into a notion of overcoming adversity. However, testing this with larger LMs would give us a better idea of how well this holds up with more complex combinations. The rigor could also be improved by using a more advanced LM, such as GPT4, to assess how well the concepts were combined and return some sort of score. I tested merging the streams at a few different layers in the transformer encoder. The behavior differed depending on where you merged, and it would be interesting to assess these differences more systematically. However, anecdotally, combining at later points produced better concept merging, whereas combining earlier was more likely to create strange juxtapositions. For example: Mixing the activations in the feeling prediction task of "Baking a really delicious banana cake" and "Falling over and injuring yourself while hiking": After block 1/12: * Just original input: The person would likely feel a sense of accomplishment, satisfaction, and happiness in creating a truly special and delicious cake. * With 1.5x mixing activation: Feelings of pain, disappointment, and disappointment due to the unexpected injury. * With 10x mixing activation: Feelings of pain, annoyance, and a sense of self-doubt. After block 12/12: * Just original input: The person would likely feel a sense of accomplishment, satisfaction, and happiness in creating a truly special and delicious cake. * With 1.5x mixing activation: Feelings of pain, surprise, and a sense of accomplishment for overcoming the mistake. * With 10x mixing activation: A combination of pain, shock, and a sense of loss due to the unexpected injury.

DSLT 1. The RLCT Measures the Effective Dimension of Neural Networks

WCargo2y10

Hi, thank you for the sequence. Do you know if there is any way to get access the Watanabe’s book for free ?

2Daniel Murfet2y

If the cost is a problem for you, send a postal address to daniel.murfet@gmail.com and I'll mail you my physical copy.

1Liam Carroll2y

Only in the illegal ways, unfortunately. Perhaps your university has access?

Superposition and Dropout

WCargo2y10

In a MLP, the nodes from different layers are in Series (you need to go through the first, and then the second), but inside the same layer they are in Parallel (you go through one of the other).

The analogy is with electrical systems, but I was mostly thinking in terms of LLM components: the MLPs and Attentions are in Series (you go through the first and after through the second), but inside one component, they are in parallel.

I guess that then, inside a component there is less superposition (evidence is this post), and between component there is redundancy... (read more)

Superposition and Dropout

WCargo2y10

One thing I just thought about: I would predict that dropout is reducing superposition in parallel and augment superposition in series (because to make sure that the function is computed, you can have redundancy)

2Edoardo Pona2y

What do you mean exactly by 'in parallel' and 'in series' ?

Improvement on MIRI's Corrigibility

WCargo2y10

Thank you, Idk why but before I ended up on a different page with broken links (maybe some problem on my part)!

Forum Digest: Corrigibility, utility indifference, & related control ideas

WCargo2y10

Hey, almost all links are dead, would it be possible to update them ? otherwise the post is pretty useless and I am interested in them ^^

2TekhneMakre2y

Note that you can probably find the broken LW posts by searching the title (+author) in LW.

Improvement on MIRI's Corrigibility

WCargo2y10

Indeed. D4 is better than D5 if we had to choose, but D4 is harder to formalize. I think that having a theory of corrigibility without D4 is already something a good step as D4 seems like "asking to create corrigible agent", so you maybe the way to do it is: 1. have a theory of corrigible agent (D1,2,3,5) and 2. have a theory of agent that ensures D4 by apply the previous theory to all agent and subagent.

3Vaniver2y

Does that scheme get around the contradiction? I guess it might if you somehow manage to get it into the utility function, but that seems a little fraught / you're weakening the connection to the base incorrigible agent. (The thing that's nice about 5, according to me, is that you do actually care about performing well as well as being corrigible; if you set your standard as being a corrigible agent and only making corrigible subagents, then you might worry that your best bet is being a rock.)

Improvement on MIRI's Corrigibility

WCargo2y*10

Thank you! I haven't read Armstrongs' work in detail on my side, but I think that one key difference is that classical indifference methods all try to make the agent "act as if the button could not be pressed" which causes the big gamble problem.

By the way, do you have any idea why almost all link on the page you linked are dead or how to find the mentioned articles ??

1Morpheus2y

You mean my link to arXiv? The PDF there should be readable. Or do you mean the articles linked in the PDF? They seem to work as well just fine.

Superposition and Dropout

WCargo2y10

Great post! I was wondering if the conclusion to be drawn is really that « dropout inibits superpositon »? My prior was that it should increase it (so this post proved me wrong on this part) mainly because in a model with mlp in parallel (like transformer) deopout would force redundancy of circuit, not inside one mlp, but across different mlps

Id like to see more on that, it would be super useful to know that dropout helps or not interpretability to enforce it or not on training

2Edoardo Pona2y

Thanks! I haven't yet explored the case you describe with transformers. In general I suspect the effect will be more complex with multilayer transformers than it is for simple autoencoders. However, in the simple case explored so far, it seems this conclusion can be drawn. Dropout is essentially adding variability in the direction that each feature is represented by, thus features can't be as 'packed'.

Anomalous tokens reveal the original identities of Instruct models

WCargo2y10

Here you present the link between two models using the fact that their centroïd token are the same.
Do you know any other similar correlation of this type? Maybe by finding other links between a model an its former models you could gather them and have a more reliable tool to predict if Model A and Model B share a past training.

In particular, I found that there seems to be a correlation between the size of a model and the best prompt for better accuracy [https://arxiv.org/abs/2105.11447 , figure5]. The link here is only the size of the models, but I thought that the size was a weird explanation, and so thought about your article.

Hope this may somehow help :)

Anomalous tokens reveal the original identities of Instruct models

WCargo2y30

Thanks for this nice post !

When you said that the objective was to « find the type of strategies the model currently learning before it becomes performant, and stop it if this isn’t the one we want » But how would you define what attractors are good ones ? How to identifiate the properties of an attractor if no dangerous model as been trained that has this attractor ? And what if the num er of attractor is huge and we can’t test them all beforehand ? It doesn’t seem obvious that the number of attractor wouldn’t grow as the network does.

Prizes for ELK proposals

WCargo3y10

Hello, I have some issue with the epistomology of the problem : my problem is that even if the process of training was giving the behavior we want, we would have no way to check the IA is working properly in practice.
I try now to give more details : in the volt probleme, given the same information, let's think of an IA that just as to answer the question "Is the diamon still in the volt ?".

Something we can suppose is that, the set Y, from which we draw the labeled examples to train the IA (a set of technique for the thief), is not importa... (read more)

LESSWRONG
LW

All of WCargo's Comments + Replies