Have you read Reducing Goodhart? A relevant thesis is that goodharting on human values is not just like overfitting in supervised learning. (The goal of value learning isn't just to replicate what human supervisors would tell the AI to do. It's to generalize what the humans would say according to processes that the humans also approve of.)
I think we're unlikely to get good generalization properties just from NN regularization (or a simplicity prior etc). Better to build an AI that's actually trying to generalize the way humans want. I think testing your value learning AI against a validation set of human responses is a totally reasonable way to test how well it can learn to model humans, and if it fails it's probably not going to do a very good job, but just because it succeeds doesn't mean it has the generalization properties we want, even if the validation set is arbitrarily thorough.
Crossposted from my personal blog.
A naive approach to aligning an AGI, and an abstract version of what is currently used in SOTA approaches such as RLHF, is to learn a reward model which hopefully encapsulates many features of 'human values' that we wish to align an AGI to, and then train an actor model (the AGI) to output policies which result in high reward according to the reward model. If the reward model is good and accurately reflects human values, then this should result in the AGI being trained to output policies which are highly aligned and approved of by humans. The fundamental problem is that we have an optimization process (the actor / planner) optimizing directly against a learnt reward or value model. As this optimizer grows more powerful, it will tend to find exploits in the learnt reward model which return high evaluated reward but which are actually low reward according to the 'true' reward function, of which the learnt reward model is only a proxy. This problem of goodharting was explored empirically in this paper, which showed that goodharting follows smooth and understandable curves with model scale, at least in their setup and at the scales they tested. Such a setup of the actor exploiting the reward model could potentially cause X-risk in the case of extremely powerful models, where the actor AGI could propose some plan which is highly damaging to humans (the 'true' reward function) but to which the learnt reward model assigns extremely high reward.
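To make the setup concrete, here is a minimal toy sketch of the naive loop in PyTorch. Everything here is illustrative and mine, not from the post: the 'plan' space is a toy 8-dimensional vector, the reward model is a small MLP, and the 'actor' is just gradient ascent on a plan vector against the learnt proxy.

```python
# Minimal sketch, assuming a toy vector 'plan' space and a small MLP reward
# model standing in for a learnt human-reward proxy (names are illustrative).
import torch
import torch.nn as nn

def make_reward_model(plan_dim=8, hidden=128, seed=0):
    """A small MLP standing in for a learnt reward model over plans."""
    torch.manual_seed(seed)
    return nn.Sequential(nn.Linear(plan_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

def fit(model, plans, rewards, steps=2000, lr=1e-3):
    """Fit the reward model to (plan, human-labelled reward) pairs by regression."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        loss = ((model(plans).squeeze(-1) - rewards) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

# Toy 'human' data: the true reward prefers plans near the origin.
torch.manual_seed(0)
plans = torch.randn(256, 8)
human_rewards = -plans.norm(dim=-1) + 0.1 * torch.randn(256)

reward_model = make_reward_model()
fit(reward_model, plans, human_rewards)

# The 'actor' optimizes its plan directly against the learnt proxy --
# exactly the loop that invites goodharting as the optimizer gets stronger.
plan = torch.zeros(8, requires_grad=True)
actor_opt = torch.optim.Adam([plan], lr=1e-2)
for _ in range(2000):
    loss = -reward_model(plan).squeeze()
    actor_opt.zero_grad(); loss.backward(); actor_opt.step()

true_reward = -plan.detach().norm().item()
print(f"proxy reward: {reward_model(plan).item():.2f}, true reward: {true_reward:.2f}")
```

With enough optimization pressure, the proxy reward climbs while the true reward (known here only because the toy is synthetic) can fall: the toy analogue of the exploitation described above.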
There are basically two approaches one could take to ameliorating this problem:
1.) Scale the reward model with the actor to make it more robust against exploitation. The idea here is that there might be some kind of balance where, insofar as the reward model and actor have approximately similar 'strengths', the actor cannot exploit the reward model too strongly, thus limiting goodharting. The issue with this approach is that scaling of the reward model might be fundamentally more limited than scaling of the actor. For instance, the reward model might be trained on some fixed dataset of human-labelled examples, while the actor can theoretically obtain unlimited data by training against the world or a highly realistic simulation. It may also be the case that, as has been hypothesised, capabilities scale better than alignment.
2.) Find some other mechanism to reduce goodharting directly. This is left intentionally vague, since there is a huge number of potential methods here. In this post, I want to propose one extremely simple method that would almost certainly help at a relatively small cost.
The problem of goodharting is almost identical to that of overfitting in classical statistics and machine learning. We have an optimizer whose goal is to minimize the test loss, but which can in fact only learn from an imperfect proxy (a finite training dataset / the reward model). If this optimizer is totally unconstrained and very high capacity, it can memorize all of the training set points / exploit the reward model, potentially causing extremely high test-set loss. In the statistics case, a simple way to estimate the degree of misgeneralization (test loss) is to evaluate your learnt classifier on a validation set: held-out points drawn from the same distribution, which the optimizer never trained on.
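For reference, here is the classic version of that diagnostic in a few lines of numpy. The toy data and the deliberately over-parameterised random-features fit are my own illustration of the analogy, not anything from the post.

```python
# Minimal sketch of the classic analogue: detect overfitting by evaluating on
# held-out points the optimizer never saw (toy data, illustrative only).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

X_train, y_train = X[:100], y[:100]
X_val, y_val = X[100:], y[100:]          # held out, never used for fitting

# Deliberately over-parameterised fit: least squares on 500 random features.
W = rng.normal(size=(5, 500))
Phi_train, Phi_val = np.tanh(X_train @ W), np.tanh(X_val @ W)
w = np.linalg.lstsq(Phi_train, y_train, rcond=None)[0]

train_mse = np.mean((Phi_train @ w - y_train) ** 2)
val_mse = np.mean((Phi_val @ w - y_val) ** 2)
print(f"train MSE {train_mse:.4f} vs validation MSE {val_mse:.4f}")
# A large gap is the signal that the optimizer has 'fit the noise' -- the
# same kind of signal the validator reward models below are meant to provide.
```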
Here, we propose a simple addition to this setup, closely inspired by the above. Instead of just using one reward model, we create a set of validator reward models. These are models trained on the same dataset of human rewards as the original reward model, but which learn a different function because of different initializations, different architectures, or different capacities (it is likely possible to use validator models that are much smaller and lower-capacity than the reward model the actor is actually optimizing against). Then, given these validator models, we perform rejection sampling against the actor's solution. Specifically, if the validators assess the actor's plan to be bad, then we can restart the planning process, rewind the actor to an earlier checkpoint, or take some other corrective action. The decision rule we use for the validator assessment can be flexibly tuned depending on our degree of paranoia: for instance, we could go with simple majority rule, some kind of weighted average of validator assessments, or even the worst-case validator assessment.
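Continuing the toy sketch above, the validator check might look something like this. The `approve` function, the specific seeds/capacities, and the threshold are all hypothetical stand-ins I chose for illustration; the post does not specify them.

```python
# Sketch of the validator-ensemble check, reusing make_reward_model / fit /
# plans / human_rewards / plan from the earlier toy sketch (illustrative only).

# Validators: same human data, but different seeds and smaller capacities.
validators = [make_reward_model(hidden=h, seed=s)
              for s, h in enumerate([16, 32, 64], start=1)]
for v in validators:
    fit(v, plans, human_rewards)

def approve(plan, validators, threshold, rule="worst_case"):
    """Decision rule over validator scores -- tune `rule` and `threshold`
    to your level of paranoia (majority vote, mean, or worst case)."""
    with torch.no_grad():
        scores = torch.stack([v(plan).squeeze() for v in validators])
    if rule == "majority":
        return (scores > threshold).float().mean().item() > 0.5
    if rule == "mean":
        return scores.mean().item() > threshold
    return scores.min().item() > threshold            # worst-case validator

# Rejection sampling against the actor's plan: if the validators think the
# plan is bad, restart the planning process or rewind to a checkpoint.
# (threshold=-1.0 is an arbitrary knob for this toy reward scale.)
if not approve(plan.detach(), validators, threshold=-1.0):
    print("Plan rejected by validators: likely goodharting; restart or rewind the actor.")
```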
The way this works is that when the actor is optimizing strongly against its reward model, we expect it to be exploiting unique and idiosyncratic features of the reward model -- effectively 'fitting the noise'. If we then train different validator models on the same data (they could actually be trained on different data, which would help marginally), we would expect the 'core' reward function learnt to be similar, but the responses to highly adversarial or out-of-distribution inputs to be very different, since those rely on the unique idiosyncrasies of a specific model. This gives a cheap way of detecting the degree to which the actor is goodharting.
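One cheap way to operationalise 'the idiosyncrasies diverge off-distribution' is to look at the spread of validator scores. Using disagreement (rather than a threshold on the scores themselves) is my own illustrative extension of the sketch above, not something the post commits to.

```python
# Continuation of the toy sketch: compare validator disagreement on ordinary
# in-distribution plans versus on the plan the actor found by optimizing hard
# against the main reward model (illustrative extension, not from the post).
with torch.no_grad():
    in_dist = torch.stack([v(plans).squeeze(-1) for v in validators])   # (3, 256)
    on_plan = torch.stack([v(plan).squeeze() for v in validators])      # (3,)

print("typical validator spread in-distribution:", in_dist.std(dim=0).mean().item())
print("validator spread on the actor's plan:", on_plan.std().item())
# A much larger spread on the actor's plan suggests it is exploiting
# model-specific quirks ('fitting the noise') rather than true reward.
```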
This approach is not perfect. Specifically, it does not provide an easy way to reset the actor towards a good, non-goodharted plan without leaking many bits of information. However, it is extremely cheap and simple to implement and likely provides good and scalable detection of goodharting without explicit human intervention. There are, however, a number of obvious potential failure modes which would need other methods to check for. These include:
1.) Each validator interaction leaks bits to the actor: This occurs especially if the validators provide 'suggestions for improvement' and/or the actor maintains a memory of past validator feedback. If we constantly reset the actor based on feedback from the same set of validators, we are in effect training on the validation set, just with a bad training algorithm, and we should expect the effectiveness of this method to degrade with additional interactions.
2.) Correlation between the validators and the base model: If the validators are too correlated with the base reward model, then the exploits the actor finds in the base reward model might also apply to the validators, leading to an underestimation of the degree of goodharting. The obvious way to avoid this is to use a number of validators trained with diverse datasets, model architectures, initializations, etc.
3.) Deception in the actor: Of course, this entire protocol fails if the actor is deceptive. If it can present a non-goodharted strategy during training/validation and then actually pursue a completely different strategy after deployment, this approach will do nothing to stop it. We would then need other tools to prevent this.
However, it is important to note that these flaws do not make the proposal unworkable when applied in conjunction with other alignment strategies. We should expect many different alignment techniques applied in parallel to yield a much more robust system in practice than any strategy individually.