The Waluigi Effect is defined by Cleo Nardo as follows:
The Waluigi Effect: After you train an LLM to satisfy a desirable property , then it's easier to elicit the chatbot into satisfying the exact opposite of property .
For our project, we prompted Llama-2-chat models to satisfy the property that they would downweight the correct answer when forbidden from saying it. We found that 35 residual stream components were necessary to explain the models average tendency to do .
However, in addition to these 35 suppressive c...
Our best guess is that "Bay" is the second-most-likely answer (after "California") to the factual recall question "The Golden Gate Bridge is in the state of ". Indeed, when running our own version of Llama-2-7b-chat, adding "from California" results in "San Francisco" being outputted instead of "Bay". As you can see in this notebook, "San Francisco" is the second-most-likely answer for our setup. replicate.com has different behavior from our local version of Llama-2-7b-chat though, and we were not able to figure out how to match the behavior of repli...
Very exciting initiative. Thanks for helping run this. I think the co-working calendar link may be broken though.
Also, the specific cycle attack doesn't work against other engines I think? In the paper their adversary doesn't transfer very well to LeelaZero, for example. So it's more one particular AI having issues, than a fact about Go itself.
Hi, one of the authors here speaking on behalf of the team. We’re excited to see that people are interested in our latest results. Just wanted to comment a bit on transferability.
KataGo's training is done under a ruleset where a white territory containing a few scattered black stones that would not be able to live if the game were played out is credited to white.
I don't think this statement is correct. Let me try to give some more information on how KataGo is trained.
Firstly, KataGo's neural network is trained to play with various different rulesets. These rulesets are passed as features to the neural network (see appendix A.1 of the original KataGo paper or the KataGo source code). So KataGo's neural network has knowledge of what ...
One of the authors of the paper here. Really glad to see so much discussion of our work! Just want to help clarify the Go rules situation (which in hindsight we could've done a better job explaining) and my own interpretation of our results.
We forked the KataGo source code (github.com/HumanCompatibleAI/KataGo-custom) and trained our adversary using the same rules that KataGo was trained on.[1] So while our current adversary wins via a technicality, it was a technicality that KataGo was trained to be aware of. Indeed, KataGo is able to recognize that p...
Yeah I wish I didn't have it. I would like to be able to drink socially.
Nice piece. My own Asian flush has definitely turned me away from drinking. I wanted to like drinking due to the culture surrounding it, but the side effects I get from alcohol (headache and asthma) make the experience quite miserable.
Glad you enjoyed the work and thank you for the comment! Here are my thoughts on what you wrote:
This depends on your definition of "understanding" and your definition of "tractable". If we take "understanding" to mean the ability to predict some non-trivial aspects of behavior, then you are entirely correct that approaches like mech-interp are tractable, since in our case is was mechanistic analysis that led us to... (read more)