aggliu

Author, YouTuber, Script Writer for Rational Animations. A.B. in Math (Harvard 2020)

Comments

aggliu10

The biggest piece of evidence that I've seen is the Emergent Misalignment paper, linked in Gurkenglas's parallel comment to this one.

aggliu20

This is probably well-known at this point, but the $6 million item is not the duct-taped banana itself, but the right to display the duct-taped banana. That may not make it any more or less sensible, though.

aggliu10

Yep, that's why I mentioned evil numbers specifically.

aggliuΩ161

I am a bit worried that making an explicit persona for the AI (e.g. using a special token) could magnify the Waluigi effect. If something (like a jailbreak or writing evil numbers) prompts the AI to act as an "anti-𐀤", then we get all the bad behaviors at once in a single package. This might not outweigh the value of having the token in the first place, or it may experimentally turn out to be a negligible effect, but it seems like a failure mode to watch out for.