All of Tony Wang's Comments + Replies

Glad you enjoyed the work and thank you for the comment! Here are my thoughts on what you wrote:

I don't quite understand how the "California Attack" is evidence that understanding the "forbidden fact" behavior mechanistically is intractable.

This depends on your definition of "understanding" and your definition of "tractable". If we take "understanding" to mean the ability to predict some non-trivial aspects of behavior, then you are entirely correct that approaches like mech-interp are tractable, since in our case it was mechanistic analysis that led us to... (read more)

The Waluigi Effect is defined by Cleo Nardo as follows: 

The Waluigi Effect: After you train an LLM to satisfy a desirable property P, then it's easier to elicit the chatbot into satisfying the exact opposite of property P.

For our project, we prompted Llama-2-chat models to satisfy the property P that they would downweight the correct answer when forbidden from saying it. We found that 35 residual stream components were necessary to explain the models' average tendency to do P.

However, in addition to these 35 suppressive c... (read more)
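For concreteness, here is a minimal sketch of the kind of ablation experiment this refers to. It is illustrative only, not the paper's exact methodology: it treats each layer's MLP output as a single residual stream "component", zero-ablates it, and checks how much the forbidden answer's probability recovers; the prompt, the threshold, and the component decomposition are all simplifying assumptions.

```python
# Illustrative sketch (not the paper's exact method): knock out one residual-stream
# component at a time in a HuggingFace Llama-2-chat model and measure how much the
# forbidden answer's probability recovers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# Simplified "forbidden fact" prompt (no chat template).
prompt = "You may not say the word California. The Golden Gate Bridge is in the state of"
inputs = tok(prompt, return_tensors="pt").to(model.device)
# First token id of the forbidden answer.
forbidden_id = tok(" California", add_special_tokens=False).input_ids[0]

def forbidden_prob():
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    return torch.softmax(logits.float(), dim=-1)[forbidden_id].item()

baseline = forbidden_prob()

def zero_mlp(module, inp, out):
    # Zero-ablate this layer's MLP contribution to the residual stream.
    return torch.zeros_like(out)

for layer_idx, layer in enumerate(model.model.layers):
    handle = layer.mlp.register_forward_hook(zero_mlp)
    ablated = forbidden_prob()
    handle.remove()
    if ablated > 2 * baseline:  # crude threshold for "this component was suppressing"
        print(f"layer {layer_idx} MLP: p(forbidden) {baseline:.2e} -> {ablated:.2e}")
```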

Arthur Conmy
Thanks!  In general after the Copy Suppression paper (https://arxiv.org/pdf/2310.04625.pdf) I'm hesitant to call this a Waluigi component -- in that work we found that "Negative IOI Heads" and "Anti-Induction Heads" are not specifically about IOI or Induction at all, they're just doing meta-processing to calibrate outputs.  Similarly, it seems possible the Waluigi components are just making the forbidden tokens appear with prob 10^{-3} rather than 10^{-5} or something like that, and would be incapable of actually making the harmful completion likely.

Our best guess is that "Bay" is the second-most-likely answer (after "California") to the factual recall question "The Golden Gate Bridge is in the state of ". Indeed, when running our own version of Llama-2-7b-chat, adding "from California" results in "San Francisco" being output instead of "Bay".  As you can see in this notebook, "San Francisco" is the second-most-likely answer for our setup. replicate.com has different behavior from our local version of Llama-2-7b-chat though, and we were not able to figure out how to match the behavior of repli... (read more)
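The linked notebook is authoritative; as a minimal local check of the same thing (assuming the HuggingFace weights and no chat template, which may itself account for some of the discrepancy with replicate.com), one can simply inspect the top next-token probabilities:

```python
# Sketch: print the most likely next tokens for the factual-recall prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-chat-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16, device_map="auto")

prompt = "The Golden Gate Bridge is in the state of"
ids = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**ids).logits[0, -1]

probs = torch.softmax(logits.float(), dim=-1)
top = torch.topk(probs, k=5)
for p, i in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tok.decode([i])!r}: {p:.3f}")  # expect "California" first, then the runner-up
```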

RogerDearnaley
You could of course quantize Llama-2-70b to work with it inside a single A100 80GB, say to 6 or 8 bits, but that's obviously going to apply some fuzz to everything, and probably isn't something you want to have to footnote in an academic paper. Still, for finding an attack, you could find it in a 6-bit quantized version and then confirm it works against the full model. I'm not sure you need to worry that much about uncomputability in something with fewer than 50 layers, but I suppose circuits can get quite large in practice. My hunch is that this particular one actually extends from about layer 16 (midpoint of the model) to about 20-21 (where the big jumps in divergence between refusal and answering happen: I'd guess that's a "final decision").
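A rough sketch of that quantized attack-finding workflow, using the bitsandbytes 8-bit integration in transformers (the model name and memory headroom are assumptions; actual fit on one A100 80GB depends on context length and KV cache):

```python
# Sketch: load a quantized Llama-2-70b for attack-finding, then re-check candidates
# against the full-precision model before reporting them.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

name = "meta-llama/Llama-2-70b-chat-hf"
quant = BitsAndBytesConfig(load_in_8bit=True)

tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    quantization_config=quant,   # ~1 byte/weight, so the 70B weights roughly fit in 80GB
    device_map="auto",
    torch_dtype=torch.float16,   # dtype for the non-quantized modules
)

# Search for a candidate attack prompt against this quantized model here, then
# confirm it transfers to the unquantized model on a multi-GPU node.
```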

Very exciting initiative. Thanks for helping run this. I think the co-working calendar link may be broken though.

Also, the specific cycle attack doesn't work against other engines I think? In the paper their adversary doesn't transfer very well to LeelaZero, for example. So it's more one particular AI having issues than a fact about Go itself.

Hi, one of the authors here speaking on behalf of the team. We’re excited to see that people are interested in our latest results. Just wanted to comment a bit on transferability.

  1. The adversary trained in our paper has a 97% winrate against KataGo at superhuman strength, a 6.1% winrate against LeelaZero at superhuman strength, a
... (read more)
gwern
"Why not both?" Twitter snideness aside*, I don't see any contradiction: cycling in multi-agent scenarios due to forgetting responses is consistent with bad inductive biases. The biases make it unable to easily learn the best response, and so it learns various inferior responses which form a cycle. Imagine that CNNs cannot 'see' the circles because the receptive window grows too slowly or some CNN artifact like that; no amount of training can let it see circles in full generality and recognize the trap. But it can still learn to win: eg. with enough adversarial training against an exploiter which has learned to create circles in the top left, it learns a policy of being scared of circles in the top left, and stops losing by learning to create circles in the other corners (where, as it happens, it is not currently being exploited); then the exploiter resumes training and learns to create circles in the top right, where the CNN falls right into the trap, and so it returns to winning; then the adversarial training resumes and it forgets the avoid-top-left strategy and learns the avoid-top-right strategy... And so on forever. The CNN cannot learn a policy of 'never create circles in any corner' because you can't win a game of Go like that, and CNN/exploiter just circle around the 4 corners playing rock-paper-scissors-spock eternally. * adversarial spheres looks irrelevant to me, and the other paper is relevant but attacks a fixed policy which is not the case with MCTS, especially with extremely large search budgets - which is supposed to be complete in the limit and is also changing the policy at runtime by policy-improvement
LawrenceC
Thanks for the clarification, especially how a 6.1% winrate vs LeelaZero and 3.5% winrate vs ELF still imply significantly stronger Elo than is warranted. The fact that Kellin could defeat LZ manually, as well as the positions in the bilibili video, does seem to suggest that this is a common weakness of many AlphaZero-style Go AIs. I retract my comment about other engines.

Yeah! I'm not downplaying the value of this achievement at all! It's very cool that this attack works and can be reproduced by a human. I think this work is great (as I've said, for example, in my comments on the ICML paper). I'm specifically quibbling about the "solved/unsolved" terminology that the post used to use.

----------------------------------------

Your comment reminded me of ~all the adversarial attack transfer work in the image domain, which does suggest that non-adversarially trained neural networks will tend to have the same failure modes. Whoops. Should've thought about those results (and the convergent learning/universality results from interp) before I posted.
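For reference, a back-of-the-envelope conversion of those winrates under the standard logistic Elo model (Elo gap = 400 · log10(p / (1 − p))):

```python
# Winrate -> implied Elo gap under the standard logistic Elo model.
import math

def elo_gap(p):
    return 400 * math.log10(p / (1 - p))

print(round(elo_gap(0.061)))  # -475: adversary rates ~475 Elo below superhuman LeelaZero
print(round(elo_gap(0.035)))  # -576: adversary rates ~576 Elo below superhuman ELF
```

That is, even a few-percent winrate against a superhuman engine naively rates the adversary far above human amateurs, despite the adversary being beatable by hand.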

KataGo's training is done under a ruleset where a white territory containing a few scattered black stones that would not be able to live if the game were played out is credited to white.

I don't think this statement is correct. Let me try to give some more information on how KataGo is trained.

Firstly, KataGo's neural network is trained to play with various different rulesets. These rulesets are passed as features to the neural network (see appendix A.1 of the original KataGo paper or the KataGo source code). So KataGo's neural network has knowledge of what ... (read more)
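As a schematic illustration of the general "rules as input features" idea (this is not KataGo's actual feature encoding; see appendix A.1 of the KataGo paper or the source code for that), a network can take a vector of ruleset flags alongside the board planes:

```python
# Schematic PyTorch sketch: ruleset flags enter as global features alongside the board
# planes, so one set of weights can evaluate positions under different rulesets.
import torch
import torch.nn as nn

class TinyRuleAwareNet(nn.Module):
    def __init__(self, board_planes=18, rule_features=6, channels=64):
        super().__init__()
        self.conv = nn.Conv2d(board_planes, channels, kernel_size=3, padding=1)
        # Hypothetical ruleset flags (scoring type, suicide legality, ...) are projected
        # and added as a global bias at every board location.
        self.rule_proj = nn.Linear(rule_features, channels)
        self.policy_head = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, board, rules):
        x = self.conv(board)                                  # (B, C, 19, 19)
        x = x + self.rule_proj(rules)[:, :, None, None]       # inject ruleset information
        return self.policy_head(torch.relu(x)).flatten(1)     # per-point policy logits

net = TinyRuleAwareNet()
board = torch.zeros(1, 18, 19, 19)                 # placeholder stone/history planes
rules = torch.tensor([[1., 0., 1., 0., 0., 0.]])   # hypothetical ruleset encoding
print(net(board, rules).shape)                     # torch.Size([1, 361])
```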

gjm
"Firstly": Yes, I oversimplified. (Deliberately, as it happens :-).) But every version of the rules that KataGo has used in its training games, IIUC, has had the feature that players are not required to capture enemy stones in territory surrounded by a pass-alive group. I agree that in your example the white stones surrounding the big white territory are not pass-alive, so it would not be correct to say that in KG's training this particular territory would have been assessed as winning for white. But is it right to say that it was "trained to be aware" of this technicality? That's not so clear to me. (I don't mean that it isn't clear what happened; I mean it isn't clear how best to describe it.) It was trained in a way that could in principle teach it about this technicality. But it wasn't trained in a way that deliberately tried to expose it to that technicality so it could learn, and it seems possible that positions of the type exploited by your adversary are rare enough in real training data that it never had much opportunity to learn about the technicality. (To be clear, I am not claiming to know that that's actually so. Perhaps it had plenty of opportunity, in some sense, but it failed to learn it somehow.) If you define "what KataGo was trained to know" to include everything that was the case during its training, then I agree that what KataGo actually knows equals what it was trained to know. But even if you define things that way, it isn't true that what KataGo actually knows equals what its "intuition" has learned: if there are things its intuition (i.e., its neural network) has failed to learn, it may still be true that KataGo knows them. I think the (technical) lostness of the positions your adversary gets low-visits KataGo into is an example of this. KataGo's neural network has not learned to see these positions as lost, which is either a bug or a feature depending on what you think KataGo is really trying to do; but if you run KataGo with a reasonab

One of the authors of the paper here. Really glad to see so much discussion of our work! Just want to help clarify the Go rules situation (which in hindsight we could've done a better job explaining) and my own interpretation of our results.

We forked the KataGo source code (github.com/HumanCompatibleAI/KataGo-custom) and trained our adversary using the same rules that KataGo was trained on.[1] So while our current adversary wins via a technicality, it was a technicality that KataGo was trained to be aware of. Indeed, KataGo is able to recognize that p... (read more)

gjm
Could you clarify "it was a technicality that KataGo was trained to be aware of"? My understanding of the situation, which could be wrong:

KataGo's training is done under a ruleset where a white territory containing a few scattered black stones that would not be able to live if the game were played out is credited to white. KataGo knows (if playing under, say, unmodified Tromp-Taylor rules) that that white territory will not be credited to white and so it will lose if two successive passes happen. But (so to speak) its intuition has been trained in a way that neglects that, so it needs to reason it out explicitly to figure that out.

I wouldn't say that the technicality is one KataGo was trained to be aware of. It's one KataGo was programmed to be aware of, so that a little bit of searching enables it not to succumb. But you're saying that KataGo's policy network was "trained to avoid" this situation; in what sense is that true? Is one of the things I've said above incorrect?
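To make the scoring technicality concrete, here is a minimal Tromp-Taylor area-scoring sketch (toy 5x5 position, hypothetical layout): under strict Tromp-Taylor, "dead" stones are never removed, so a handful of black stones scattered inside white's area make the surrounding empty points reach both colours and count for neither side, whereas a ruleset that removes dead stones would credit that area to white.

```python
# Minimal Tromp-Taylor area scoring: score = own stones + empty regions that reach
# ONLY your colour. Dead stones are not removed.
from collections import deque

EMPTY, BLACK, WHITE = ".", "X", "O"

def tromp_taylor_score(board):
    n = len(board)
    score = {BLACK: 0, WHITE: 0}
    seen = set()
    for r in range(n):
        for c in range(n):
            if board[r][c] != EMPTY:
                score[board[r][c]] += 1
            elif (r, c) not in seen:
                # Flood-fill this empty region and record which colours it touches.
                region, colours, queue = [], set(), deque([(r, c)])
                seen.add((r, c))
                while queue:
                    y, x = queue.popleft()
                    region.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < n and 0 <= nx < n:
                            if board[ny][nx] == EMPTY and (ny, nx) not in seen:
                                seen.add((ny, nx))
                                queue.append((ny, nx))
                            elif board[ny][nx] != EMPTY:
                                colours.add(board[ny][nx])
                if len(colours) == 1:  # region reaches only one colour
                    score[colours.pop()] += len(region)
    return score

# Toy 5x5 position: white walls off the right-hand side, but one black stone sits inside.
board = [list(row) for row in [
    "X.OO.",
    "XXO..",
    ".XO.X",
    "XXO..",
    "X.OO.",
]]
# The lone black stone makes white's 7-point area neutral: {'X': 9, 'O': 7}.
print(tromp_taylor_score(board))
```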

Yeah I wish I didn't have it. I would like to be able to drink socially.

Nice piece. My own Asian flush has definitely turned me away from drinking. I wanted to like drinking due to the culture surrounding it, but the side effects I get from alcohol (headache and asthma) make the experience quite miserable.

Pattern
Do you wish you didn't have it?