All of shash42's Comments + Replies

An ASI democratically aligned with the preferences of all humans would probably try to stop and decimate AI development. Only a small portion of the world wants to develop AGI/ASI for economic gain or intellectual curiosity. Most people do not want their jobs threatened, and at most want home robots or other machines to cut down on menial chores. If you disagree with this, you're living in a tech bubble.

That makes sense. My higher-level concern with gradient routing (and, to some extent, any other safety method) being used throughout training rather than after training is the alignment tax: it might lead to significantly lower performance and therefore not get adopted in frontier models.

Evidence of this for gradient routing: people have tried various forms of modular training before [1], [2], and they never really caught on because it's always better to train a combined model, which allows optimal sharing of parameters.

It's still a cool idea though, and I wo... (read more)

Thanks for pointing me to Figure 12, it alleviates my concern! I don't fully agree with RMU being a stand-in for ascent-based methods: targeted representation noising (as done in RMU) seems easier to reverse than loss-maximization methods (like TAR). Finally, I just wanted to clarify that I see SSD/Potion more as automated mechanistic interpretability methods than as finetuning-based ones. What I meant to say was that adding some retain-set finetuning on top (as done for gradient routing) might be needed to make them work for tasks like unlearning virology.

cloud
Ah, I see what you mean. I think my use of the term "fine-tuning" was misleading. The distinction I'm trying to draw is between interventions applied throughout training vs. after training. "Post hoc" would have been a better term to describe the latter.

My suspicion is that post hoc methods will not be sufficient to robustly remove capabilities that are strongly reinforced by the training objective (while maintaining good general performance), because the capabilities are "too deeply ingrained."[1] We're excited about gradient routing's potential to solve this problem by separating capabilities during training. However, I agree that there isn't enough evidence yet, and it would be great to do more extensive comparisons, particularly to these recent methods which also target good performance under imperfect labeling.

For what it's worth, I don't think fine-tuning is doing that much work for us: we see it as a light-touch correction to "internal distribution shift" caused by ablation. As mentioned in this comment, we find that post-ablation fine-tuning on retain helps both retain and forget set performance. In the same comment we also show that retraining on the training distribution (a mixture of forget and retain) produces qualitatively similar results.

1. ^ Also, if the goal is to be robust not only to imperfect labeling but also to forget set retraining, then there is a fundamental challenge to post hoc methods, which is that the minimal changes to a model which induce bad performance on a task are potentially quite different than the minimal changes to a model which prevent retrainability.
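
For concreteness, here is a minimal toy sketch of the kind of gradient routing being discussed (the module names and routing scheme are illustrative, not the authors' implementation): the forward pass is unchanged, but gradients from forget-labeled batches are confined to a designated block, which is ablated post hoc and followed by light retain-set fine-tuning.

```python
# Toy sketch of gradient routing (my reading of the general idea, not the authors' code).
import torch
import torch.nn as nn

class RoutedMLP(nn.Module):
    def __init__(self, d_in=128, d_hidden=256, n_classes=10):
        super().__init__()
        self.shared = nn.Linear(d_in, d_hidden)
        self.forget_block = nn.Linear(d_hidden, d_hidden)  # designated region, ablated post hoc
        self.retain_block = nn.Linear(d_hidden, d_hidden)
        self.head = nn.Linear(d_hidden, n_classes)

    def forward(self, x, route_to_forget: bool):
        h = torch.relu(self.shared(x))
        f = self.forget_block(h)
        r = self.retain_block(h)
        # Forward pass is identical for both data types; only gradient flow differs.
        if route_to_forget:
            h = f + r.detach()   # forget batch: at this layer, gradients flow only through forget_block
        else:
            h = f.detach() + r   # retain batch: gradients bypass forget_block
        return self.head(torch.relu(h))

model = RoutedMLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y, is_forget_batch: bool):
    optimizer.zero_grad()
    loss = loss_fn(model(x, route_to_forget=is_forget_batch), y)
    loss.backward()
    optimizer.step()

def ablate_forget_block():
    # Post hoc: zero out the designated block, then lightly fine-tune on retain data.
    with torch.no_grad():
        model.forget_block.weight.zero_()
        model.forget_block.bias.zero_()
```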

Thanks for sharing these interesting results!

I am a big fan of reporting unlearning results across identified forget-set fractions! That said, I think the unlearning results lack comparisons to important ablations/baselines which would really test whether gradient routing is adding value (a sketch of the second baseline follows the list). For example:
1. CF (catastrophic forgetting) - This would involve removing most components of ERA, only keeping the finetuning on the retain set. 

2. Ascent + CF - This would involve a light touch of gradient ascent (maximizing the loss) on the forget set, with simultaneous fine... (read more)
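
To make baseline 2 above concrete, here is a minimal sketch of an "Ascent + CF" training step (assuming a standard PyTorch training loop; the loss weighting and function names are illustrative, not a definitive recipe):

```python
# Illustrative "Ascent + CF" step: fine-tune on the retain set while applying a light
# touch of gradient ascent (loss maximization) on the forget set.
import torch

def ascent_plus_cf_step(model, optimizer, loss_fn, retain_batch, forget_batch, ascent_weight=0.1):
    x_r, y_r = retain_batch
    x_f, y_f = forget_batch
    optimizer.zero_grad()
    retain_loss = loss_fn(model(x_r), y_r)   # minimized: standard fine-tuning on retain data
    forget_loss = loss_fn(model(x_f), y_f)   # maximized: gradient ascent on forget data
    (retain_loss - ascent_weight * forget_loss).backward()
    optimizer.step()
    return retain_loss.item(), forget_loss.item()
```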

cloud
Thanks for the feedback and references!

On catastrophic forgetting: our appendix includes a "control" version of ERA that doesn't use gradient routing but is otherwise the same (appendix C, figure 12). This shows that the effect of retain-set fine-tuning is negligible in the absence of gradient routing.

On gradient ascent or similar methods: there are many unlearning methods that don't target or achieve the kind of robust localization and removal that we care about, as mentioned in our discussion of related works, and, e.g., in this post. We included RMU as a stand-in for this class, and I personally don't see much value in doing more extensive comparisons there.

On Corrective Unlearning: we weren't aware of other unlearning approaches that consider imperfect labeling, so this is a very helpful reference -- thanks! It would be interesting to compare ERA-type methods to these. My concern with fine-tuning methods is that they might not be suitable for robustly removing broader capabilities (like "virology") as opposed to correcting for small perturbations to the training data.

Thanks, the rationale for using PCA was quite interesting. I also quite like the idea of separating different model classes for this evaluation.

I think it's interesting to see how much improvements on different types of safety benchmarks correlate with advances in model capabilities. I also agree that designing decorrelated benchmarks is important, simply because it indicates they won't be saturated as easily. However, I have some doubts regarding the methodology and would appreciate clarifications if I misinterpreted something:

Using model-performance-based correlation: If I'm not mistaken, the correlation of capability and safety is measured using the performance of various models on benchmarks. ... (read more)

adamk
I'll begin by saying more about our approach for measuring "capabilities scores" for a set of models (given their scores on a set of b standard capabilities benchmarks). We'll assume that all benchmark scores have been normalized. We need some way of converting each model's b benchmark scores into one capabilities score per model. Averaging involves taking a weighted combination of the benchmarks, where the weights are equal. Our method is similarly a weighted combination of the benchmarks, but where the weights are higher for benchmarks that better discriminate model performance. This can be done by choosing the weights according to the first principal component of the benchmark scores matrix. I think this is a more principled way to choose the weights, but I also suspect an analysis involving averaged benchmark scores as the "capabilities score" would produce very similar qualitative results.

To your first point, I agree that correlations should not be inferred as a statement of intrinsic or causal connection between model general capabilities and a particular safety benchmark (or the underlying safety property it operationalizes). After all, a safety benchmark's correlations depend both on the choice of capabilities benchmarks used to produce capabilities scores, as well as the set of models. We don't claim that correlations observed for one set of models will persist for more capable models, or models which incorporate new safety techniques.

Instead, I see correlations as a practical starting point for answering the question "Which safety issues will persist with scale?" This is ultimately a key question safety researchers and funders should be trying to discern when allocating their resources toward differential safety progress. I believe that an empirical "science of benchmarking" is necessary to improve this effort, and correlations are a good first step.

How do our recommendations account for the limitations of capabilities correlations? From the Discussi...
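
As a concrete illustration of the weighting described above (a sketch of the general recipe, not necessarily the exact pipeline used), the capabilities score can be computed as the projection of each model's normalized benchmark scores onto the first principal component:

```python
# Sketch: a capabilities score per model as a PCA-weighted combination of benchmark scores.
# `scores` is a hypothetical (n_models, n_benchmarks) array of raw benchmark results.
import numpy as np

def capability_scores(scores: np.ndarray) -> np.ndarray:
    # Normalize each benchmark (column) to zero mean and unit variance.
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
    # Weights = first principal component of the normalized benchmark-score matrix.
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    weights = vt[0]
    # Fix the sign so that doing better on benchmarks yields a higher capabilities score.
    if weights.sum() < 0:
        weights = -weights
    return z @ weights  # one capabilities score per model
```

Equal weights recover plain averaging; the PCA weights simply up-weight the benchmarks that discriminate models most strongly.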

I believe these evaluations of unlearning miss a critical aspect: they benchmark on deleting i.i.d. samples or a specific class, instead of adversarially manipulated/chosen distributions. This might fool us into believing unlearning methods work, as we show in our paper both theoretically (Theorem 1) and empirically. The same failure mode holds for interpretability, which is a similar argument to the motivation for studying the whole distribution in the recent Copy Suppression paper.
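
As one hypothetical example of an adversarially chosen forget set (illustrative only, not the exact construction from the paper): the deletion targets can be samples whose labels were systematically manipulated during training, rather than an i.i.d. subset.

```python
# Hypothetical construction of a manipulated forget set (names and the specific
# manipulation are illustrative): systematically mislabel a targeted subset during
# training, then ask the unlearning method to delete exactly that subset.
import numpy as np

def make_manipulated_forget_set(y: np.ndarray, n_classes: int, frac: float = 0.1, seed: int = 0):
    rng = np.random.default_rng(seed)
    forget_idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
    y_manipulated = y.copy()
    # Shift labels within the chosen subset so the model learns a systematic error
    # that unlearning is then expected to remove.
    y_manipulated[forget_idx] = (y[forget_idx] + 1) % n_classes
    return forget_idx, y_manipulated
```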

Thank you, I am glad you liked our work!

We think logistic regression might be homing in on some spurious correlations that help with classification in that particular distribution but don't have an effect on later layers of the model, and thus its outputs. ClassMeans does as well as LEACE for removal because it has the same linear-guardedness guarantee as LEACE, as mentioned in their paper.

As for using a disjoint training set to train the post-removal classifier: We found that the linear classifier attained random accuracies if trained on the dataset used for rem... (read more)
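
For readers unfamiliar with the ClassMeans baseline mentioned above, here is a simplified binary-label sketch of mean-difference erasure (LEACE applies a whitened, provably optimal variant of this projection; the function here is illustrative, not either paper's code):

```python
# Simplified binary-concept sketch: remove the class-mean-difference direction from
# the representations, so the two class-conditional means coincide after erasure.
import numpy as np

def class_means_erasure(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    # X: (n_samples, d) representations; y: (n_samples,) binary concept labels in {0, 1}
    direction = X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0)
    direction /= np.linalg.norm(direction)
    projection = np.eye(X.shape[1]) - np.outer(direction, direction)
    return X @ projection
```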

My point was more that planning capabilities and situational awareness can be a bottleneck to executing 'deceptive alignment' even with a good understanding of what it means. Descriptions -> Execution seems plausible for some behaviours, but producing descriptions probably becomes weaker evidence the harder the task gets. For example, one can describe what 'good safety research' would look like but be very far from having the skills to actually do it oneself (writing code, knowledge of deep learning, etc.). On the other hand, one may have never ... (read more)

Great post! It lays out a clear agenda, and I agree for the most part, including about the timeliness of the scientific case due to the second argument. But I'll nitpick on the first argument:

Models finally understand deceptive alignment. Claude and GPT4 give (for the most part) clear explanations of the deceptive alignment, why it’s a risk, why it’s a useful strategy, etc. Previous generations of models mostly did not (e.g., the GPT-3 or 3.5 generation).

Giving clear explanations via simulation (of, say, Alignment Forum text in the training data) is likely not the same ... (read more)

Ethan Perez
Generating clear explanations via simulation is definitely not the same as being able to execute it, I agree. I think it's only a weak indicator / weakly suggestive evidence that now is a good time to start looking for these phenomena.

I think being able to generate explanations of deceptive alignment is most likely a pre-requisite to deceptive alignment, since there's emerging evidence that models can transfer from descriptions of behaviors to actually executing on those behaviors (e.g., upcoming work from Owain Evans and collaborators, and this paper on out of context meta learning). In general, we want to start looking for evidence of deceptive alignment before it's actually a problem, and "whether or not the model can explain deceptive alignment" seems like a half-reasonable bright line we could use to estimate when it's time to start looking for it, in lieu of other evidence (though deceptive alignment could also certainly happen before then too).

(Separately, I would be pretty surprised if deceptive alignment descriptions didn't occur in the GPT3.5 training corpus, e.g., since arXiv is often included as a dataset in pretraining papers, and e.g., the original deceptive alignment paper was on arXiv.)

It seems like this argument assumes that the model optimizes over the entire 'training process'. Why can't we test the model (perform inference) on distributions different from the training distribution, where SGD can no longer optimize, to check whether the model was deceptively aligned in the training environment?

evhub
Because the model can defect only on distributions that we can't generate from, as the problem of generating from a distribution can be harder than the problem of detecting samples from that distribution (in the same way that verifying the answer to a problem in NP can be easier than generating it). See for example Paul Christiano's RSA-2048 example (explained in more detail here).
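
To make the verification/generation asymmetry concrete, here is a purely conceptual sketch of the RSA-2048-style trigger (the function is illustrative, not anyone's actual proposal):

```python
# Conceptual illustration only: the trigger condition is cheap to verify (one modular
# division) but, as far as anyone knows, infeasible for an overseer to instantiate,
# so no red-teaming input can be constructed that exercises the defection behavior.

def would_defect(rsa_2048_modulus: int, candidate_factor: int) -> bool:
    # Easy to check whether a proposed factor is genuine...
    return 1 < candidate_factor < rsa_2048_modulus and rsa_2048_modulus % candidate_factor == 0
    # ...but producing such a factor would require factoring RSA-2048.
```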