Constructing Neural Network Parameters with Downstream Trainability
I have recently done some preliminary experiments related to mechanistic interpretability. Researchers in this subfield often seem to post their results on the AI Alignment Forum / LessWrong (e.g., causal scrubbing, emergent world representations, monosemanticity), and the AI Alignment Forum Q&A suggests posting on LessWrong. Therefore, I am posting here.