Backstory: I wrote this post about 12d ago, then a user here pointed out that this could be capability exfohazard since it could give the bad idea of having the AGI look at itself, so I took it down. Well I don’t have to worry about that anymore since now we have proof that they are literally doing it right now at OpenAI.
I don’t want to piss on the parade, and I still think automating interpretability right now is a good thing, but sooner or later, if not done right, there is a high chance it's all gonna backfire so hard we will… well, everybody dies.
The following is an expanded version of the previous... (read 2070 more words →)
I think a lot of "why won't AIs form a trade coalition that includes us?" is sort of answered by EY in this debate: