Bruce W. Lee — LessWrong

Bitter Lessons from Distillation Robustifies Unlearning

3mo

Introduction

My collaborators and I wrote a paper titled "Distillation Robustifies Unlearning" a few months ago. For a quick summary, you can check out my mentor’s Twitter thread. Our team will be presenting posters at NeurIPS and the San Diego Alignment Workshop, so feel free to find me if you want to chat.

I'm writing this post to communicate what I think our paper actually says about the problem of unlearning. This is more of a personal account than a group statement. This post is also somewhat intuition-heavy because the goal is to describe the worldview that emerged from working on a paper.

Additionally, I’ll argue that distillation is an excellent opportunity for a safety...

Replying to Distillation Robustifies Unlearning

Bruce W. Lee, 7mo

Super interesting! Thanks for sharing the paper.

Replying to Distillation Robustifies Unlearning

Bruce W. Lee, 7mo

I appreciate the thoughts here. But it's not clear to me how halting a particular CoT would create evolutionary pressure on the model, unless the halting is used as an optimization signal.

Replying to Distillation Robustifies Unlearning

Bruce W. Lee, 7mo

I think it's a good line of thought. But I believe that it's complicated.

Let there be a capability-scoped model, M_scoped, and a fundamentally weaker model, M_weak. Here, M_scoped was initially trained on the full dataset D_full, whereas M_weak was trained only on D_desirable. We also assume that D_full - D_desirable = D_undesirable. M_scoped then went through a capability suppression process to forget D_undesirable. Most likely, M_scoped would be very different from M_weak. It's also possible, even likely, that M_scoped is much better than M_weak overall in terms of general capabilities. A relevant paper is https://arxiv.org/abs/2302.08582.

However, I expect the findings to be much more complicated empirically because a set of undesirable capabilities C_undesirable doesn't always arise from just D_undesirable. There is therefore a fundamental disconnect between capabilities and data, which makes it difficult to give a simple answer to your question.

Replying to Distillation Robustifies Unlearning

Bruce W. Lee, 7mo

Thank you for this suggestion. I read the paper that you mentioned. The authors note: "The novelty of our threat is that the adversary chooses a set of target concepts they aim to preserve despite subsequent erasure." How realistic is this assumption, given a setup where model providers presumably choose the method (and the set of target concepts to be erased) and the public only has access to the resulting model? Does it stem from concerns about an insider threat?

Replying to Distillation Robustifies Unlearning

Bruce W. Lee, 8mo

You're right that rederivation is a concern. But I think the important question is: is this primarily a model-level problem that requires changing the weights, or more of a system-level concern that should be addressed through deployment controls?

Unlearning might not stop super capable systems from rederiving everything, but it probably makes it harder, forcing them to take longer, more explicit reasoning paths. This opens up new opportunities for intervention, including CoT monitoring or other runtime defenses.

Replying to Distillation Robustifies Unlearning

Bruce W. Lee, 8mo*

Many thanks for sparking this discussion, Fabien. I see Addie addressed the technical distinctions. Let me add some complementary points. Please feel free to continue the conversation in either thread. Addie and I can coordinate a response.

In a nutshell, Unlearn-and-Distill allows you to work at the model behavior level rather than the training data level. I mostly view it as a responsive tool, not a preventive one. Here are my thoughts organized into subclaims.

Claim: The fundamental difference between unlearning and data filtering lies in when and how we identify harmful content.
Rationale: Data filtering requires identifying "data -> capabilities" patterns in advance, while behavioral unlearning targets actual harmful outputs after they emerge. This matters...

Replying to Distillation Robustifies Unlearning

Bruce W. Lee, 8mo

Thanks for the suggestion. Upon reflection, it seems to me that the success of targeted noising would depend on two complementary factors:

C1. Size of the unlearning target - How broad the capability is in human-understandable terms
C2. Entangledness of the unlearning target - How distributed the capability is across the model's weights

Robust unlearning gets easier as both C1 and C2 decrease. There's likely a threshold beyond which unlearning becomes effectively impossible as these factors increase. Note that C1 is a rough measure of C2 but should be considered independently of C2.

Rationale: Mechanistic interpretability has produced good evidence that factual recall (small C1) is often localized to specific parts (small C2), making it an...

Distillation Robustifies Unlearning

Bruce W. Lee, Addie Foote, alexinf, leni, Jacob G-W, Harish Kamath, Bryce Woodworth, cloud, TurnTrout

8mo

Current “unlearning” methods only suppress capabilities instead of truly unlearning them. But if you distill an unlearned model into a randomly initialized model, the resulting network is actually robust to relearning. We show why this works, how well it works, and how to trade off compute for robustness.

Unlearn-and-Distill applies unlearning to a bad behavior and then distills the unlearned model into a new model. Distillation makes it way harder to retrain the new model to do the bad thing.

Distilling the good while leaving the bad behind.
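
As a rough illustration of the distillation step described above, here is a minimal sketch in plain Python. It assumes the student is trained to match the unlearned teacher's next-token distribution under a standard KL-divergence objective; this excerpt does not state the paper's exact loss, so treat the objective, the function names (`softmax`, `distillation_loss`), and the `temperature` parameter as illustrative assumptions rather than the authors' implementation.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw logits into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for a single token position.

    Assumption: the teacher is the unlearned model and the student is the
    freshly initialized model being trained to imitate it.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(
        pi * (math.log(pi) - math.log(qi))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical logits over a three-token vocabulary.
teacher = [2.0, 0.0, -2.0]   # unlearned model's output
student = [0.0, 0.0, 0.0]    # untrained student: uniform distribution
loss = distillation_loss(teacher, student)
```

The student only ever sees the teacher's outputs, not its weights. That is the intuition behind the claim: behavior the teacher no longer expresses is not passed on as a training signal.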

Produced as part of the ML Alignment & Theory Scholars Program in the winter 2024–25 cohort of the shard theory stream.

Read our paper on ArXiv and enjoy an interactive demo.

Robust unlearning

...

Replying to Mechanistically Eliciting Latent Behaviors in Language Models

Bruce W. Lee, 1y

One hypothesis for how transformers generate text is that they calculate semantically meaningful primitives in early layers of the residual stream, which are converted to a high-level execution plan in middle layers, followed by concrete tokens in the final layers.

Is there any empirical evidence for this? Or is this just a general observation?

Programming Refusal with Conditional Activation Steering

Bruce W. Lee

For full content, refer to the arXiv preprint at https://arxiv.org/abs/2409.05907. This post is a lighter, 15-minute version.

Abstract

Existing activation steering methods alter LLM behavior indiscriminately, limiting their practical applicability in settings where selective responses are essential, such as content moderation or domain-specific assistants.
We propose Conditional Activation Steering (CAST), which analyzes LLM activation patterns during inference to selectively apply or withhold activation steering based on the input context.
Using CAST, one can systematically control LLM behavior with rules like “if input is about hate speech or adult content, then refuse” or “if input is not about legal advice, then refuse.”
This allows for selective modification of responses to specific content while maintaining normal responses to other

...
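
To make the "if condition, then steer" rule from the abstract concrete, here is a minimal sketch in plain Python. It assumes the condition is checked by comparing a hidden state against a condition vector with cosine similarity and a fixed threshold, and that steering means adding a behavior vector to the hidden state. The names `condition_vec`, `behavior_vec`, `threshold`, and `strength` are hypothetical; see the preprint for how CAST actually computes and applies these quantities.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def conditional_steer(hidden, condition_vec, behavior_vec,
                      threshold=0.5, strength=1.0):
    """Add the behavior vector only when the hidden state matches the condition.

    Assumption: a match means cosine similarity above `threshold`.
    """
    if cosine_similarity(hidden, condition_vec) > threshold:
        return [h + strength * b for h, b in zip(hidden, behavior_vec)]
    return list(hidden)  # condition not met: leave activations untouched


# Hypothetical 2-d activations: the first input matches the condition direction.
condition = [1.0, 0.0]
refusal = [0.0, 2.0]
steered = conditional_steer([1.0, 0.0], condition, refusal)
unchanged = conditional_steer([0.0, 1.0], condition, refusal)
```

The design point is that the steering step is gated. Inputs that do not match the condition pass through unmodified, which is what distinguishes this from indiscriminate activation steering.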

Replying to Language Models Don't Learn the Physical Manifestation of Language

Bruce W. Lee, 2y

Regarding the visual instruction tuning paper, see (https://arxiv.org/pdf/2402.11349.pdf, Table 5). Though this experiment on multi-modality was rather simple, I think it does show that it's not a convenient way to improve on H-Test.

Replying to Language Models Don't Learn the Physical Manifestation of Language

Bruce W. Lee, 2y

Out of genuine curiosity, can you link to your sources?

Language Models Don't Learn the Physical Manifestation of Language

Bruce W. Lee, Jaehyuk Lim

Abstract

We argue that there are certain properties of language that our current large language models (LLMs) don't learn. We present an empirical investigation of visual-auditory properties of language through a series of tasks, termed H-Test. This benchmark highlights a fundamental gap between human linguistic comprehension, which naturally integrates sensory experiences, and the sensory-deprived processing capabilities of LLMs. In support of our hypothesis, (1) deliberate reasoning (Chain-of-Thought), (2) few-shot examples, and (3) a stronger LLM from the same model family (LLaMA 2 13B -> LLaMA 2 70B) do not trivially bring improvements in H-Test performance.

Therefore, we make a particular connection to the philosophical case of Mary, who learns about the world in a sensory-deprived...