I have recently done some preliminary experiments related to mechanistic interpretability. Researchers in this subfield often post their results on the AI Alignment Forum / LessWrong (e.g. causal scrubbing, emergent world representations, monosemanticity, etc.), and the AI Alignment Forum Q&A suggests posting on LessWrong. Therefore, I am posting it here to follow that convention and the spirit of open source, in case someone happens to find it a little bit useful.

Abstract: The pretrain-finetune paradigm has been shown to be beneficial and is widely adopted. However, why it works is not yet fully understood, especially through the lenses of mechanistic interpretability and constructions. From another perspective, pre-training is a common way to provide a good neural network initialization or a powerful inductive bias. Ideally, though, the inductive bias could be controlled directly and precisely, rather than indirectly via losses or datasets. In this work, we propose constructing the neural network parameters manually and then training (fine-tuning) them on downstream tasks; the resulting network can then be mechanistically interpreted or used directly. The manual construction is derived from a combination of reverse engineering and human reasoning. We conduct very preliminary experiments to demonstrate the workflow. Our approach remains highly interpretable while improving top-1 accuracy by 6.1% and top-10 accuracy by 9.1% on the IsarStep-small dataset compared with the random-initialization baseline. Compared with the pre-training baseline, our demonstration is more interpretable and uses only 18% of the modules while achieving the majority of its accuracy. The limitations of the experiments and the approach are also discussed. (For more details, please refer to the PDF report in the GitHub repository.)
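
To make the proposed workflow concrete, here is a minimal PyTorch-style sketch of "construct, then fine-tune". It is not the construction from the report: the architecture, dimensions, and hand-set weights below are purely illustrative assumptions, standing in for parameters one would derive from reverse engineering and human reasoning.

```python
# Minimal sketch of the "construct, then fine-tune" workflow.
# All module names, sizes, and hand-constructed values are illustrative
# assumptions, not the construction used in the report.
import torch
import torch.nn as nn

d_model, vocab_size = 64, 1000

model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, d_model),   # the layer we construct by hand
    nn.ReLU(),
    nn.Linear(d_model, vocab_size),
)

# Step 1: manually construct parameters instead of relying on random
# initialization or pre-training. Here the hand-constructed layer is
# simply set to the identity (a "copy" operation), as a stand-in for
# weights derived from a reverse-engineered circuit.
with torch.no_grad():
    model[1].weight.copy_(torch.eye(d_model))
    model[1].bias.zero_()

# Step 2: fine-tune the whole network on the downstream task.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (32,))   # dummy inputs
targets = torch.randint(0, vocab_size, (32,))  # dummy labels
for _ in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(tokens), targets)
    loss.backward()
    optimizer.step()
```

Because the constructed weights have a known, human-readable meaning before training, the fine-tuned network can be inspected relative to that starting point, which is what makes the approach amenable to mechanistic interpretation.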
