Julian Stastny

Researcher at the Center on Long-Term Risk. Currently a visiting researcher at Constellation.

Comments

I wonder if the approach from your paper is in some sense too conservative for evaluating whether information has been removed: Suppose I used some magical scalpel and removed all information about Harry Potter from the model.

Then I wouldn't be too surprised if this leaves a giant HP-shaped hole in the model such that, if you then fine-tune on a small amount of HP-related data, suddenly everything falls into place and makes sense to the model again, and this rapidly generalizes. 

Maybe fine-tuning robust unlearning requires us to fill in the holes with synthetic data so that this doesn't happen.
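As a concrete version of the probe I have in mind, here's a minimal sketch, assuming a Hugging Face causal LM; the model path and the HP snippet are placeholders:

```python
# Sketch of a fine-tuning robustness probe for unlearning: fine-tune the
# "unlearned" model on a *small* amount of HP-related text and check
# whether held-out HP knowledge comes back.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "path/to/unlearned-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Deliberately little data: the worry is that even this is enough for
# everything to fall back into place.
texts = ["Harry Potter is a young wizard who attends Hogwarts ..."]  # placeholder
ds = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True),
    remove_columns=["text"],
)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="hp_probe", num_train_epochs=3,
                           per_device_train_batch_size=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()

# Compare loss on held-out HP facts before vs. after fine-tuning: if it
# recovers far faster than for a model that never saw HP at all, the
# HP-shaped hole was still there.
```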

By tamper-resistant fine-tuning, are you referring to this paper by Tamirisa et al.? (That'd be a pretty devastating issue with the whole motivation of their paper, since no one actually does anything but use LoRA for fine-tuning open-weight models...)

But I see what you're getting at: SIA selects observers from the whole distribution over observers, taken as a single bag.

That phrasing sounds right, yeah.

I wrote "expected fraction" in the previous comment in order to circumvent the requirement of both universes existing simultaneously. But I acknowledge that my intuition is more compelling when assuming that they all exist, or (in Sleeping Beauty or God's extreme coin toss) that the experiment is repeated many times. Still, it seems odd to expect an inconsistency between the "it just happened once" and the "it happens many times" cases.

SIA implies a different conclusion. To predict your observations under SIA, you should first sample a random universe proportional to its population, then sample a random observer in that universe. The probabilities of observing each index are the same conditional on the universe, but the prior probabilities of being in a given universe have changed.

We start with 1000:1 odds in favor of the 1-trillion universe, due to its higher population. Upon observing our sub-1-billion index, we get a 1000:1 update in favor of the 1-billion universe, as with SSA. These exactly cancel out, leaving the probability of each universe at 50%.
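Spelling out the quoted calculation as a quick sketch (reading the populations above as 10^9 and 10^12 observers):

```python
from fractions import Fraction

n_small, n_big = 10**9, 10**12        # observers per universe (assumed)
prior = Fraction(1, 2)                # equal prior on each universe

# Likelihood that a random observer in each universe has a sub-1-billion index:
lik_small = Fraction(1)               # every index qualifies
lik_big = Fraction(n_small, n_big)    # only 1 in 1000 indices qualify

# SIA: reweight the prior by population, then apply the same likelihoods.
post_big = prior * n_big * lik_big
post_small = prior * n_small * lik_small
print(post_big / (post_big + post_small))  # 1/2 -- the 1000:1 factors cancel
```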

I find this a somewhat confusing/counterintuitive way to explain SIA. (Because why start out with 1000:1 odds when it is stated that the two universes are equally likely a priori?)

How I'd explain it: Suppose I want to put a credence on the hypothesis T (the trillion-observer universe), starting from a 50% prior on each of the two hypotheses. I now observe my index being, say, 100. According to my prior, what's the expected fraction of observers with index 100 who would be correct to believe T? It's 50%.
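One way to write out that expectation (each hypothesis contributes exactly one index-100 observer, who is correct about T only in the trillion-observer universe):

$$\mathbb{E}[\text{fraction of index-100 observers correct about } T] = \tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\cdot 1 = 50\%.$$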