User Comment Replies

The case for unlearning that removes information from LLM weights

I wonder if the approach from your paper is in some sense too conservative to evaluate whether information has been removed: Suppose I used some magical scalpel and removed all information about Harry Potter from the model.

Then I wouldn't be too surprised if this leaves a giant HP-shaped hole in the model such that, if you then fine-tune on a small amount of HP-related data, suddenly everything falls into place and makes sense to the model again, and this rapidly generalizes.

Maybe fine-tuning robust unlearning requires us to fill in the holes with synthetic data so that this doesn't happen.

2Fabien Roger6mo

I am not sure that it is over-conservative. If you have an HP-shaped hole that can easily be transformed into HP data using fine-tuning, does it give you a high level of confidence that people misusing the model won't be able to extract the information from the HP-shaped hole, or that a misaligned model won't be able to notice the HP-shaped hole and use it to answer questions about HP when it really wants to? I think that it depends on the specifics of how you built the HP-shaped hole (without scrambling the information). I don't have a good intuition for what a good technique like that could look like. A naive thing that comes to mind would be something like "replace all facts in HP by their opposite" (if you had a magic fact-editing tool), but I feel like in this situation it would be pretty easy for an attacker (human misuse or misaligned model) to notice "wow, all HP knowledge has been replaced by anti-HP knowledge" and then extract all the HP information by just swapping the answers.

The case for unlearning that removes information from LLM weights

Julian Stastny6mo10

By tamper-resistant fine-tuning, are you referring to this paper by Tamirisa et al.? (That'd be a pretty devastating issue with the whole motivation of their paper, since no one actually does anything but use LoRA for fine-tuning open-weight models...)

2Fabien Roger6mo

That's right. I think it's not that devastating, since I expect that their method can be adapted to counter classic LoRA tuning (the algorithm takes as input some set of fine-tuning methods it "trains against"). But yeah, it's not reassuring that it doesn't generalize between full-weight FT and LoRA.

SSA rejects anthropic shadow, too

Julian Stastny2y10

But I see what you're getting at, SIA selects observers from the whole distribution over observers taken as a single bag.

That phrasing sounds right, yeah.

I wrote expected fraction in the previous comment in order to circumvent the requirement of both universes existing simultaneously. But I acknowledge that my intuition is more compelling when assuming that they all exist, or (in Sleeping Beauty or God's extreme coin toss) that the experiment is repeated many times. Still, it seems odd to expect an inconsistency between the "it just happened once" and the ...

SSA rejects anthropic shadow, too

Julian Stastny2y61

SIA implies a different conclusion. To predict your observations under SIA, you should first sample a random universe proportional to its population, then sample a random observer in that universe. The probabilities of observing each index are the same conditional on the universe, but the prior probabilities of being in a given universe have changed.
We start with 1000:1 odds in favor of the 1-trillion universe, due to its higher population. Upon observing our sub-1-billion index, we get a 1000:1 update in favor of a 1-billion universe, as with SSA. These e

...
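
The arithmetic in the quoted passage can be checked directly. This is only a minimal sketch, assuming the two universes have populations of one billion and one trillion and equal non-anthropic prior probability; the function name `sia_posterior_odds` is made up for illustration and does not come from the post.

```python
from fractions import Fraction


def sia_posterior_odds(pop_small: int, pop_large: int) -> Fraction:
    """Posterior odds (large : small) after observing an index <= pop_small.

    Assumes both universes are equally likely before anthropic weighting.
    """
    # SIA prior: weight each universe by its population.
    prior_odds = Fraction(pop_large, pop_small)
    # Any particular index that exists in both universes has
    # probability 1/population within a given universe.
    likelihood_ratio = Fraction(1, pop_large) / Fraction(1, pop_small)
    return prior_odds * likelihood_ratio


odds = sia_posterior_odds(10**9, 10**12)
print(odds)
```

Under these assumptions the population-weighted prior is 1000:1 toward the larger universe and the index observation is 1000:1 toward the smaller one, so the product is even odds.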

4jessicata2y

"Fraction of observers having index 100 who would be correct to believe T" depends on both universes existing simultaneously, e.g. due to a big universe or a multiverse. But I see what you're getting at, SIA selects observers from the whole distribution over observers taken as a single bag.


All of Julian Stastny's Comments + Replies