How often do people actually decide against a PhD on the basis that it doesn't teach you to be a good researcher? Perhaps this is different in certain circles, but almost everyone I know who doesn't want to do a PhD feels that way for personal reasons (and also timelines).
The most common objections are the following:
If I were an impact-maximizer I might do a PhD, but as a person who is fairly committed to not being depressed, it seems obvious that I should probably not do a PhD and should instead look for alternative routes to becoming a research lead.
I'd be interested to hear whether you disagree with these points (you seem to like your PhD!), or whether this post was just meant to address the claim that it doesn't train you to be a good researcher.
I was under the impression that riding a motorcycle, even with proper protection, is still very dangerous?
Perhaps worth noting: one of the three resignations, Aleksander Madry, was head of the Preparedness team, which is responsible for preventing risks from AI such as self-replication.
You're right. For some reason, I thought EPIC was a pseudometric on the quotient space and not on the full reward space.
I think this makes the thing I'm saying much less useful.
The reason we're using a uniform distribution is that it falls naturally out of the math, but maybe a more intuitive explanation is the following: the reason this feels weird is that most realistic distributions only sample from a small number of states/actions, whereas the uniform distribution more or less encodes that the reward functions are similar across most states/actions. So it's encoding something about generalization.
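To put a rough equation on "encoding something about generalization" (my notation, not necessarily the paper's): measuring the similarity of a true reward $R$ and a proxy $\tilde{R}$ under a distribution $D$ over state/action pairs looks something like

$$\mathrm{sim}_D(R, \tilde{R}) \;=\; \mathbb{E}_{(s,a)\sim D}\big[R(s,a)\,\tilde{R}(s,a)\big],$$

and taking $D$ to be uniform gives every state/action pair equal weight, so high similarity means the two rewards agree across essentially the whole space rather than only on the pairs some particular policy happens to visit.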
So more concretely, this is work towards some sort of RLHF training regime that "provably" avoids Goodharting. The main issue is that a lot of the numbers we're using are quite hard to approximate.
So here's a thing that I think John is pointing at, with a bit more math?:
The divergence is in the distance function.
- In the paper, we define the distance between rewards as the angle between reward vectors.
- So what we sort of do is look at the "dot product", i.e., look at $\langle R, \tilde{R} \rangle$ for the true reward $R$ and proxy reward $\tilde{R}$, with states/actions sampled according to a uniform distribution (roughly formalized below). I give justification as to why this is a natural way to define distance in a separate comment.
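Written out (my notation, sketching what I understand the definition to be rather than quoting the paper): treating rewards as vectors indexed by state/action pairs,

$$d(R, \tilde{R}) \;=\; \arccos\!\left(\frac{\langle R, \tilde{R} \rangle}{\lVert R\rVert\,\lVert \tilde{R}\rVert}\right), \qquad \text{where } \langle R, \tilde{R} \rangle = \mathbb{E}_{(s,a)\sim \mathrm{Unif}}\big[R(s,a)\,\tilde{R}(s,a)\big],$$

so the distance is the angle between the two reward vectors under a uniformly weighted inner product.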
But the issue here is that this isn't the distribution of actions/states we might see in practice. $\langle R, \tilde{R} \rangle$ might be very high if states/actions are instead weighted by drawing them from the distribution induced by a particular policy (e.g., the policy of "killing lots of snakes without doing anything sneaky to game the reward" in the examples, I think?). But then, as people optimize, the policy changes and this number goes down. A uniform distribution is actually likely quite far from any state/action distribution we would see in practice.
In other words, the way we formally define reward distance here will often not match how "close" two reward functions seem, and lots of cases of "Goodharting" are cases where two reward functions just seem close on a particular state/action distribution but aren't close according to our distance metric.
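To make that concrete, here's a toy numerical sketch (my own construction, not an example from the paper): two reward functions that agree on the state/action pairs a particular policy visits, but are unrelated everywhere else, look essentially identical under the policy-induced distribution and far apart under the uniform weighting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 1000 state/action pairs, but the policy of interest only
# ever visits the first 50 of them.
n = 1000
visited = np.zeros(n)
visited[:50] = 1.0
policy_dist = visited / visited.sum()   # policy-induced distribution
uniform_dist = np.full(n, 1.0 / n)      # uniform distribution

# True and proxy rewards that agree on the visited pairs but are
# unrelated everywhere else.
true_reward = rng.normal(size=n)
proxy_reward = true_reward.copy()
proxy_reward[50:] = rng.normal(size=n - 50)

def angle(r1, r2, weights):
    """Angle between reward vectors under a weighted inner product."""
    inner = np.sum(weights * r1 * r2)
    norms = np.sqrt(np.sum(weights * r1 * r1)) * np.sqrt(np.sum(weights * r2 * r2))
    return np.arccos(np.clip(inner / norms, -1.0, 1.0))

print(angle(true_reward, proxy_reward, policy_dist))   # ~0: rewards look identical on-policy
print(angle(true_reward, proxy_reward, uniform_dist))  # ~1.5 rad: far apart under uniform weighting
```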
This makes the results of the paper primarily useful for working towards training regimes where we optimize the proxy and can approximate the distance, as described in Appendix F of the paper. This is because, as we optimize the proxy, it will start to generalize, and then the problems with over-optimization described in the paper are going to start mattering a lot more.
I'm noticing there are still many interp mentors for the current round of MATS -- was the "fewer mech interp mentors" change implemented for this cohort, or will that start in Winter or later?