So you start with an AI with an external reward structure that the AI is trained to follow. And the AI is incentivized to take over its reward structure.
Then you talk about "solving" this by having multiple possible rewards and having the AI ask humans which one is best. Well, OK, but if this solves it, the solution has to do with how the AI (as opposed to the reward structure) now works, which you don't explain at all. That is, if the AI still works the same way, trained by the external reward structure, where does the decision to ask the humans come from?
But more than this, the background assumptions seem very concerning. Some particular issues IMO:
One possible alternative to external reward training would be to:
Note that the score function should not care about the future directly. For example, the score function could be something like: if I follow this plan, how well do the entire process and the resulting outcome align with averaged informed/extrapolated human preferences as of the time the decision to take the plan is made (not any future time)?
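To make the time-indexing point concrete, here is a rough Python sketch. It is only a toy under my own assumptions: `Plan`, `score`, `score_by_future_prefs`, `choose_plan`, the feature names, and the numbers are all made up for illustration, and a flat dictionary of weights is a crude stand-in for "averaged informed/extrapolated human preferences". The only thing it is meant to show is that a plan is scored against the preferences held at decision time, so a plan that would change what counts as good gains nothing from doing so.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical stand-in for extrapolated human preferences: feature -> weight.
Preferences = Dict[str, float]


@dataclass(frozen=True)
class Plan:
    name: str
    predicted_outcome: Dict[str, float]  # feature -> predicted amount
    preferences_after: Preferences       # what preferences would be once the plan has run


def weighted_value(outcome: Dict[str, float], prefs: Preferences) -> float:
    # Features that the given preferences do not mention get weight 0.
    return sum(prefs.get(feature, 0.0) * amount for feature, amount in outcome.items())


def score(plan: Plan, prefs_at_decision: Preferences) -> float:
    # Uses only the preferences held when the decision is made.
    return weighted_value(plan.predicted_outcome, prefs_at_decision)


def score_by_future_prefs(plan: Plan) -> float:
    # For contrast: judges the outcome by whatever preferences exist afterwards,
    # which rewards plans that rewrite those preferences.
    return weighted_value(plan.predicted_outcome, plan.preferences_after)


def choose_plan(plans: List[Plan], prefs_at_decision: Preferences) -> Plan:
    frozen = dict(prefs_at_decision)  # snapshot taken at decision time
    return max(plans, key=lambda p: score(p, frozen))


current_prefs: Preferences = {"human_wellbeing": 1.0}

plans = [
    Plan("help_directly",
         predicted_outcome={"human_wellbeing": 1.0},
         preferences_after={"human_wellbeing": 1.0}),
    Plan("tamper_with_preferences",
         predicted_outcome={"human_wellbeing": -1.0, "tampered_signal": 100.0},
         preferences_after={"tampered_signal": 1.0}),
]

print(choose_plan(plans, current_prefs).name)       # help_directly
print(max(plans, key=score_by_future_prefs).name)   # tamper_with_preferences
```

In this toy, the tampering plan only looks good if you evaluate it with the preferences it creates. Scored with the decision-time snapshot it comes out at -1.0 against 1.0 for the direct plan.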
Talk given by Rebecca Gorman and Stuart Armstrong at the CHAI 2022 Asilomar Conference. We present an example of AI wireheading (an AI taking over its own reward channel), and show how value extrapolation can be used to combat it.
https://www.youtube.com/watch?v=REUanSy0SgU