All of aggliu's Comments + Replies

This comment confuses me:

Is the spoiler block supposed to be empty, as it appears to be?

3 Julian Bradshaw
The spoiler block was not supposed to be empty, sorry. It's fixed now. I was using the Markdown spoiler formatting, and I think there was some kind of bug with it; I reported it to the LW admins last night. (Also, fwiw, I took the opportunity to expand on my original spoilered comment.)

The biggest piece of evidence that I've seen is the Emergent Misalignment paper, linked in Gurkenglas's parallel comment to this one.

This is probably well-known at this point, but the $6 million item is not the duct-taped banana itself, but the right to display the duct-taped banana. That may not make it any more or less sensible, though.

Yep, that's why I mentioned evil numbers specifically.

I am a bit worried that making an explicit persona for the AI (e.g. using a special token) could magnify the Waluigi effect. If something (like a jailbreak or writing evil numbers) induces the AI to act as an "anti-𐀤", then we get all the bad behaviors at once in a single package. This risk might not outweigh the value of having the token in the first place, or it may turn out experimentally to be negligible, but it seems like a failure mode to watch out for.

4 TurnTrout
Any empirical evidence that the Waluigi effect is real? Or are you more appealing to jailbreaks and such?
4 Gurkenglas
Have you seen https://www.lesswrong.com/posts/ifechgnJRtJdduFGC/emergent-misalignment-narrow-finetuning-can-produce-broadly ? :)