All of cloud's Comments + Replies

Thanks for the question!

Yeah, the story is something like: structuring model internals gives us more control over how models generalize from limited supervision. For example, maybe we can factor out how a model represents humans vs. how it represents math concepts, then localize RLHF updates on math research to the math concept region. This kind of learning update would plausibly reduce the extent to which a model learns (or learns to exploit) human biases, increasing the odds that the model generalizes in an intended way from misspecified feedback.
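For concreteness, here's a minimal sketch of the kind of routing this describes: a stop-gradient mask on a hidden layer so that updates from "math" examples only flow back through a designated block of units. The sizes, names, and 50/50 split below are invented for illustration and are not the paper's setup.

```python
import torch
import torch.nn as nn

class RoutedMLP(nn.Module):
    """Tiny model with gradient routing applied to one hidden layer."""

    def __init__(self, d_in=32, d_hidden=64, d_out=2):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x, route_mask):
        # route_mask: (batch, d_hidden), 1 where gradients may flow back.
        h = torch.relu(self.fc1(x))
        # Forward value is unchanged; the stop-gradient blocks backward flow
        # through the masked-out coordinates.
        h = route_mask * h + (1 - route_mask) * h.detach()
        return self.fc2(h)

model = RoutedMLP()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x = torch.randn(8, 32)
y = torch.randint(0, 2, (8,))
is_math = torch.randint(0, 2, (8,)).bool()  # toy label: which examples count as "math"

mask = torch.zeros(8, 64)
mask[is_math, :32] = 1.0    # math examples update fc1 only via the first half of the units
mask[~is_math, 32:] = 1.0   # everything else only via the second half

loss = nn.functional.cross_entropy(model(x, mask), y)
opt.zero_grad()
loss.backward()
opt.step()
```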

Another angl... (read more)

Ah, I see what you mean. I think my use of the term "fine-tuning" was misleading. The distinction I'm trying to draw is between interventions applied throughout training vs. after training. "Post hoc" would have been a better term to describe the latter. 

My suspicion is that post hoc methods will not be sufficient to robustly remove capabilities that are strongly reinforced by the training objective (while maintaining good general performance), because the capabilities are "too deeply ingrained."[1] We're excited about gradient routing's potentia... (read more)

shash42
That makes sense. My higher-level concern with gradient routing (to some extent true for any other safety method) being used throughout training rather than after training is the alignment tax: it might lead to significantly lower performance and therefore not get adopted in frontier models.

Evidence of this for gradient routing: people have tried various forms of modular training before [1], [2], and they never really caught on, because it's always better to train a combined model that allows optimal sharing of parameters. It's still a cool idea though, and I would be happy to see it work out :)

[1] Andreas, Jacob, et al. "Neural Module Networks." CVPR 2016.
[2] Ebrahimi, Sayna, et al. "Adversarial Continual Learning." ECCV 2020.

Thanks for the feedback and references!

On catastrophic forgetting: our appendix includes a "control" version of ERA that doesn't use gradient routing but is otherwise the same (appendix C, figure 12). This shows that the effect of retain-set fine-tuning is negligible in the absence of gradient routing.

On gradient ascent or similar methods: there are many unlearning methods that don't target or achieve the kind of robust localization and removal that we care about, as mentioned in our discussion of related works, and, e.g., in this post. We included RMU as ... (read more)

shash42
Thanks for pointing me to Figure 12; it alleviates my concern! I don't fully agree with RMU being a stand-in for ascent-based methods: targeted representation noising (as done in RMU) seems easier to reverse than loss-maximization methods (like TAR). Finally, just wanted to clarify that I see SSD/Potion more as automated mechanistic interpretability methods than as fine-tuning-based ones. What I meant to say was that adding some retain-set fine-tuning on top (as done for gradient routing) might be needed to make them work for tasks like unlearning virology.

Thanks for sharing! This is super cool and timely work.

Some thoughts:

  1. I'm excited about (the formalism of) partial observability as a way to make progress on outer alignment in general. Partial observability seems like a natural way to encode fundamental difficulties with specifying what we (humans) want to a system that has more (or different) information and understands that information better (or differently) than we do. I don't see any reason that the formalism's usefulness would be limited to cases where human evaluators literally lack information,
... (read more)
Leon Lang
Hi! Thanks a lot for your comments and very good points. I apologize for my late answer, caused by NeurIPS and the end-of-year breakdown of routines :)

On 1: Yes, the formalism I'm currently working on also allows talking about the case where the human "understands less" than the AI.

On 2: I am not entirely sure if I understand! Though if it's just what you express in the following sentences, here are my answers:

Very good observation! :) I'm thinking about it slightly differently, but the link is there: imagine a scenario where we have a pretrained foundation model, and we train a linear probe attached to its internal representations, which is supposed to learn the correct reward for full state sequences based on feedback from a human on partial observations. If during training we show this model (including the attached probe) just the partial observations, it's receiving the correct data and is supposed to generalize from feedback on "easy situations" (i.e., situations where the human's partial observations provide enough information to make a correct judgment) to "hard situations" (full state sequences that the human couldn't oversee, and where the partial observations possibly miss crucial details). So I think this setting is an instance of weak-to-strong generalization.

Yes, that's actually also part of what I'm exploring, if I understand your idea correctly. In particular, I'm considering the case where we may have "knowledge" of some form about the space in which the correct reward function lives. This may come from symmetries in the state space, for example: maybe we want to restrict to localized reward functions that are translation-invariant. All of that can easily be formalized in one framework. Pretrained foundation models to which we attach a "reward probe" can be viewed as another instance of considering symmetries in the state space: in this case, we're presuming that state sequences have the same reward if they give rise to the same "le
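To make the probe setup above concrete, here is a rough sketch of the kind of pipeline being described, with all shapes and names invented: a frozen pretrained encoder produces representations of (partial or full) state sequences, and only a linear reward probe is trained, using human feedback given on partial observations.

```python
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    """Stand-in for a pretrained foundation model; weights are frozen."""

    def __init__(self, d_obs=16, d_repr=128):
        super().__init__()
        self.net = nn.GRU(d_obs, d_repr, batch_first=True)
        for p in self.parameters():
            p.requires_grad_(False)

    def forward(self, seq):            # seq: (batch, time, d_obs)
        _, h = self.net(seq)
        return h[-1]                   # (batch, d_repr)

encoder = FrozenEncoder()
reward_probe = nn.Linear(128, 1)       # the only trainable component
opt = torch.optim.Adam(reward_probe.parameters(), lr=1e-3)

# Training data: partial observations paired with human reward judgments
# made on those partial views (toy random tensors here).
partial_obs = torch.randn(32, 10, 16)
human_reward = torch.randn(32, 1)

pred = reward_probe(encoder(partial_obs))
loss = nn.functional.mse_loss(pred, human_reward)
opt.zero_grad()
loss.backward()
opt.step()

# At deployment the same probe scores full state sequences the human could
# not oversee; the hope is that it generalizes "weak-to-strong".
full_states = torch.randn(4, 50, 16)
estimated_reward = reward_probe(encoder(full_states))
```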

Thanks for the thoughtful questions.

Regarding image models: our understanding is that strong regularization is required to split representations for MNIST autoencoding and CIFAR classification because there is a strong inductive bias towards learning features that are common to many classes of images. (In MNIST, 3s are similar to 8s, etc.; in CIFAR, similar edge detectors, etc. will be learned for many classes.) Basically, our learning target is highly unnatural. With our current experimental design, I don't expect this to change with scale, so I'm less ex... (read more)
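As a toy illustration of what splitting the MNIST representation with strong regularization could look like (the split, penalty, and hyperparameters below are invented, not the paper's exact setup): route digits 0-4 to one half of the autoencoder's code and 5-9 to the other, and add an L1 penalty discouraging each example from using its non-assigned half.

```python
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Flatten(), nn.Linear(784, 64))
dec = nn.Linear(64, 784)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.rand(16, 1, 28, 28)          # toy stand-in for an MNIST batch
labels = torch.randint(0, 10, (16,))

z = enc(x)
mask = torch.zeros_like(z)
mask[labels < 5, :32] = 1.0            # digits 0-4 assigned to the first half of the code
mask[labels >= 5, 32:] = 1.0           # digits 5-9 assigned to the second half

# Gradient routing on the code: forward value unchanged, gradients only flow
# back through each example's assigned half.
z_routed = mask * z + (1 - mask) * z.detach()
recon = dec(z_routed)

# "Strong regularization": penalize activity in the non-assigned half.
l1_penalty = ((1 - mask) * z).abs().mean()
loss = nn.functional.mse_loss(recon, x.flatten(1)) + 10.0 * l1_penalty

opt.zero_grad()
loss.backward()
opt.step()
```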

4gate
I agree that there are inductive biases towards sharing features and/or components. I'm not sure if there's a good study of which features are of this sort vs. which others might actually benefit from being more separate[1], and I'm not sure how you would do it effectively for a truly broad set of features, nor whether it would necessarily be that useful anyway, so I tend to just take this on vibes since it's pretty intuitive based on our own perception of, e.g., shapes.

That said, there are plenty of categories/tasks/features which I would expect are kinda separable after some point. Specifically, anything where humans have already applied some sort of division of labor, like software features vs. biology knowledge features vs. creative writing features, etc. (in the setting of natural language). Obviously, these all might share some basic grammatical or structural core features, but one layer of abstraction up it feels natural that they should be separable. All this goes to say: a good way to give gradient routing the best possible shot at success might be to try some such partitioning of features/tasks,[2] because unlike 3 and 8 we have some prior reason to believe that they should indeed be rather separate. Maybe there are other sources that can motivate which features or tasks to try to route separately with minimal loss of utility (e.g., what MoE papers report works well or not), but I haven't thought about it too much. One downside here is that all the examples that come to mind are in language settings, so to get reasonable utility to start with you would probably need to be in the 1B-7B model size range.

About the edges. Have you tried all 3 combinations (route both, route one, route the other)? I think the fact that you limit to these edges is mentioned in the appendix Memory section. Surely, routing on activation edges is not actually prohibitive. Worst-case you can just assign blocks to each category and it'll basicall