Produced as part of the ML Alignment & Theory Scholars Program - Summer 2024 Cohort, supervised by Evan Hubinger
TL;DR
- We found that the SAE features related to the fine-tuning target change more than other features in semantic space.
- We developed an automated model-auditing method based on this finding.
- We investigated the robustness of this method across different datasets, models, and hyperparameters.
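As a rough illustration of the first claim, the sketch below measures how much each SAE feature's decoder direction shifts between a base model's SAE and a fine-tuned model's SAE, assuming the two SAEs share an aligned feature basis. The `feature_shift` function, the synthetic data, and the decoder-matrix setup are all hypothetical; they are not the method from this post, only a minimal picture of "some features move more than others."

```python
import numpy as np

def feature_shift(W_base: np.ndarray, W_ft: np.ndarray) -> np.ndarray:
    """Per-feature shift = 1 - cosine similarity of decoder directions.

    W_base, W_ft: (n_features, d_model) decoder matrices of the base
    and fine-tuned SAEs, with rows assumed to be aligned feature-wise.
    """
    base = W_base / np.linalg.norm(W_base, axis=1, keepdims=True)
    ft = W_ft / np.linalg.norm(W_ft, axis=1, keepdims=True)
    return 1.0 - np.sum(base * ft, axis=1)

rng = np.random.default_rng(0)
W_base = rng.normal(size=(16, 8))

# Simulate fine-tuning: perturb only the first 4 feature directions,
# standing in for features related to the fine-tuning target.
W_ft = W_base.copy()
W_ft[:4] += rng.normal(scale=2.0, size=(4, 8))

shift = feature_shift(W_base, W_ft)
# The perturbed features rank highest by shift.
top = np.argsort(shift)[::-1][:4]
print(sorted(top.tolist()))  # → [0, 1, 2, 3]
```

In this toy setup the untouched features have a shift of exactly zero, so ranking features by shift immediately surfaces the ones affected by "fine-tuning"; the post's audit method builds on this kind of signal.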
1. Introduction
The openness of powerful models brings risks. An increasing number of powerful models are being opened up in various ways, including releasing open weights and offering fine-tuning APIs, in order to meet users' customization needs. (Just as I was writing the draft of this post, I heard that OpenAI...)