Author, YouTuber, Script Writer for Rational Animations. A.B. in Math (Harvard 2020)
I am a bit worried that making an explicit persona for the AI (e.g. using a special token) could magnify the Waluigi effect. If something (like a jailbreak or writing evil numbers) engages the AI to act as an "anti-𐀤" then we get all the bad behaviors at once in a single package. This might not outweigh the value of having the token in the first place, or it may experimentally turn out to be a negligible effect, but it seems like a failure mode to watch out for.
Yep, that's why I mentioned evil numbers specifically.