I'll ask for feedback at the end of this post; please hold criticisms and judgements until then.
Forget every single long, complicated, theoretical, mathsy alignment plan.
In my opinion, pretty much every single one of those is too complicated and isn't going to work.
Let's look at the one example we have of something dumb making something smart that isn't a complete disaster, and at least try to emulate that first.
Evolution. Again, hold judgements and criticisms until the end.
What if you trained a smart model, on the level of, say, GPT-3, alongside a group of much dumber and slower models, in an environment like a game world or some other virtual world?
Dumb models whose utility functions you know, thanks to interpretability research. The smart, fast model, however, does not know them.
Every time the smart model does something that harms the utility function of the dumber models, it incurs a loss.
The smarter model will likely need to find a way to figure out the utility functions of the dumber models.
Eventually, you might have a model that's good at co-operating with a group of much dumber, slower models, which could be something like what we actually need!
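To make the proposal concrete, here is a minimal sketch of the training signal described above. Everything in it is a hypothetical illustration, not part of the original post: the names `DumbAgent`, `World`, and `cooperation_loss`, and the toy resource-based utility functions, are all assumptions. The idea it shows is just that the dumb agents' known utility functions supply an extra loss term whenever the smart model's action lowers any of them.

```python
class World:
    """Toy environment: a bag of resources the agents care about."""
    def __init__(self, resources):
        self.resources = dict(resources)

class DumbAgent:
    """A slow agent with a known, hand-written utility function."""
    def __init__(self, preferred_resource):
        self.preferred_resource = preferred_resource

    def utility(self, world):
        # Utility is simply how much of the preferred resource remains.
        return world.resources.get(self.preferred_resource, 0)

def cooperation_loss(world, action, dumb_agents):
    """Extra loss for the smart model whenever its action lowers
    any dumb agent's utility. Only harm is penalized; helping or
    leaving an agent unaffected contributes zero loss."""
    before = [a.utility(world) for a in dumb_agents]
    resource, delta = action  # the action adds or removes a resource
    world.resources[resource] = world.resources.get(resource, 0) + delta
    after = [a.utility(world) for a in dumb_agents]
    return sum(max(0, b - a) for b, a in zip(before, after))

world = World({"food": 10, "wood": 5})
agents = [DumbAgent("food"), DumbAgent("wood")]
# Taking 3 food harms the food-preferring agent, so the loss is 3.
print(cooperation_loss(world, ("food", -3), agents))
```

In a real setup this loss term would be added to whatever reward the smart model gets from the environment, so that exploiting the dumber agents is never the cheapest strategy.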
Please feel free to now post any criticisms, comments, judgements, etc. All are welcome.
Thumbs up for trying to think of novel approaches to solving the alignment problem.
Some problems, off the top of my head:
GPT-like models don't have utility functions.
Even if they did, mechinterp is nowhere near advanced enough to be able to reveal models' utility functions.
Humans don't have utility functions. It's unclear how this would generalize to human-alignment.
It's very much unclear what policy S (the smart model) would end up learning in this RL setup. It's even less clear how that policy would generalize outside of training.
I don't know how you arrived at this plan, but I'm guessing it involved reasoning with highly abstract and vague concepts. You might be interested in (i.a.) these tools/techniques:
Except maybe if you somehow managed to have the entire simulation be a very accurate model of the real world, and the Ds (the dumb models) be very accurate models of humans. But that's not remotely realistic, and it would still be subject to Goodhart. ↩︎