All of aggliu's Comments + Replies

aggliu10

The biggest piece of evidence that I've seen is the Emergent Misalignment paper, linked in Gurkenglas's parallel comment to this one.

aggliu20

This is probably well known at this point, but the $6 million item is not the duct-taped banana itself; it is the right to display the duct-taped banana. That may not make it any more or less sensible, though.

aggliu10

Yep, that's why I mentioned evil numbers specifically.

aggliuΩ161

I am a bit worried that making an explicit persona for the AI (e.g. using a special token) could magnify the Waluigi effect. If something (like a jailbreak or writing evil numbers) prompts the AI to act as an "anti-𐀤", then we get all the bad behaviors at once in a single package. This might not outweigh the value of having the token in the first place, or it may experimentally turn out to be a negligible effect, but it seems like a failure mode to watch out for.

4TurnTrout
Any empirical evidence that the Waluigi effect is real? Or are you more appealing to jailbreaks and such?
4Gurkenglas
Have you seen https://www.lesswrong.com/posts/ifechgnJRtJdduFGC/emergent-misalignment-narrow-finetuning-can-produce-broadly ? :)