That makes sense. My higher level concern with gradient routing (to some extent true for any other safety method) being used throughout training rather than after training is alignment tax, where it might lead to significantly lower performance and not get adopted in frontier models.
Evidence of this for gradient routing: people have tried various forms of modular training before [1], [2] and they never really caught on because its always better to train a combined model which allows optimal sharing of parameters.
Its still a cool idea though, and I wo...
Thanks for pointing me to Figure 12, it alleviates my concern! I don't fully agree with RMU being a stand-in for ascent based methods. Targeted representation noising (as done in RMU) seems easier to reverse than loss maximization methods (like TAR). Finally, just wanted to clarify that I see SSD/Potion more as automated mechanistic interpretability methods rather than finetuning-based. What I meant to say was that adding some retain set finetuning on top (as done for gradient routing) might be needed to make them work for tasks like unlearning virology.
Thanks for sharing these interesting results!
I am a big fan of reporting unlearning results across identified forget set fractions! That said, I think the unlearning results lack comparisons to important ablations/baselines which would really test if gradient routing is adding value. For eg:
1. CF (catastrophic forgetting) - This would involve removing most components of ERA, only keeping the finetuning on the retain set.
2. Ascent + CF - This would involve a light touch of gradient ascent (maximizing the loss) on the forget set, with simultaneous fine...
I think it's interesting to see how much improvements in different types of safety benchmarks correlate with advancement in model capabilities. I also agree that designing decorrelated benchmarks is important, simply because it indicates they won't be saturated as easily. However, I have some doubts regarding the methodology and would appreciate clarifications if I misinterpreted something:
Using model performance based correlation: If I'm not wrong, the correlation of capability and safety is measured using the performance of various models on benchmarks. ...
I believe these evaluations in unlearning miss a critical aspect: they benchmark on deleting i.i.d samples or a specific class, instead of adversarially manipulated/chosen distributions. This might fool us to believe unlearning methods work as shown in our paper both theoretically (Theorem 1) and empirically. The same failure mode holds for interpretability, which is a similar argument as the motivation to study across the whole distribution in the recent Copy Suppression paper.
Thank you, I am glad you liked our work!
We think logistic regression might be honing in on some spurious correlations that help with classification in that particular distribution but don't have an effect on later layers of the model and thus its outputs. ClassMeans does as well as LEACE for removal as it has the same linear guardedness guarantee as LEACE as mentioned in their paper.
As for using a disjoint training set to train the post-removal classifier: We found that the linear classifier attained random accuracies if trained on the dataset used for rem...
My point was more to say that planning capabilities and situational awareness can be a bottleneck to executing 'deceptive alignment' even with a good understanding of what it means. Descriptions -> Execution seems plausible for some behaviours, but producing descriptions probably becomes weaker evidence the harder the task gets. For example, one can describe what 'good safety research' would look like but be very far from having the skills to actually do it themselves (writing code, knowledge of Deep Learning etc). On the other hand, one may have never ...
Great post! Lays out a clear agenda and I agree in most part, including about timeliness of the scientific case due to the second argument. But I'll nitpick on the first argument:
Models finally understand deceptive alignment. Claude and GPT4 give (for the most part) clear explanations of the deceptive alignment, why it’s a risk, why it’s a useful strategy, etc. Previous generations of models mostly did not (e.g., the GPT-3 or 3.5 generation).
Giving clear explanations via simulation (of say, Alignment Forum text in the training data) is likely not the same ...
It seems like this argument assumes that the model optimizes on the entire 'training process'. Why can't we test (perform inference) using the model on distributions different from the training distribution where SGD can no longer optimize to check if the model was deceptive aligned on the training environment?
An ASI democratically aligned with preferences of all humans would probably try to stop and decimate AI development. Only a small portion of the world wants to develop AGI/ASI for economic gain or intellectual curiosity. Most people do not want their jobs threatened, and at most want home robots or other machines to cut down menial chores. If you disagree with this you're living in a tech bubble.