This post is on a very important topic: how could we scale ideas about value extrapolation or avoiding goal misgeneralisation... all the way up to superintelligence? As such, its ideas are well worth exploring and getting to grips with.
However, the post itself is not brilliantly written, and is more of an "idea for a potential approach" than a well-crafted theory post. I hope to revisit it at some point soon, but haven't yet been able to find or make the time.
In the kinds of model-based RL AGI architectures that I normally think about (see here)…
At step 8, why is the AI motivated to care about the idealized goal rather than just the reward signal? Are we assuming that the reward signal is determined by performance wrt the ideal goal?
That is the aim. It's easy to program an AI that doesn't care too much about the reward signal; the trick is to make it not care in a specific way, one that aligns it with our preferences.
E.g. what would you do if you had been told to maximise some goal, but also told that your reward signal would be corrupted and over-simplified? There are things you could start doing in that situation to maximise your chance of not wireheading; I want to program the AI to do likewise.
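To make that concrete, here is a minimal sketch of what I have in mind; the candidate goals, names, and numbers are purely illustrative assumptions, not part of any actual design. The agent treats the reward signal as noisy evidence about which idealised goal was intended, and maximises expected value under that uncertainty rather than maximising the signal itself:

```python
# A minimal sketch of an agent that treats its observed reward signal as
# corrupted evidence about an idealised goal, rather than as the thing to
# maximise directly. All names and numbers here are illustrative assumptions.

# Hypothetical candidate "true" goals the designers might have meant.
CANDIDATE_GOALS = {
    "minimise_blue": lambda state: -state["blue_objects"],
    "minimise_blue_without_harm":
        lambda state: -state["blue_objects"] - 10 * state["harm_done"],
}

def posterior_over_goals(trajectory, observed_rewards):
    """Weight each candidate goal by how well it explains the (noisy,
    over-simplified) reward signal seen so far."""
    weights = {}
    for name, goal in CANDIDATE_GOALS.items():
        error = sum((goal(s) - r) ** 2
                    for s, r in zip(trajectory, observed_rewards))
        weights[name] = 1.0 / (1.0 + error)
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def choose_action(actions, simulate, posterior):
    """Pick the action with the best expected value under the posterior over
    idealised goals, not the action that maximises the raw reward signal.
    `simulate` maps an action to a predicted next state."""
    def expected_value(action):
        next_state = simulate(action)
        return sum(prob * CANDIDATE_GOALS[name](next_state)
                   for name, prob in posterior.items())
    return max(actions, key=expected_value)
```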
A long time ago, Scott introduced the blue-minimising robot:
Scott then considers holographic projectors and colour-reversing glasses, cases where the robot does not act in a way that actually reduces the amount of blue, and concludes:
That's one characterisation, but what if the robot were a reinforcement-learning agent, trained in various scenarios where it got rewards for blasting blue objects? Then it would seem that it was designed as a blue-minimising utility maximiser; just not a particularly well-designed one.
One approach would be "well, just design it better". But that's akin to saying "well, just perfectly program a friendly AI". In the spirit of model-splintering, we could instead ask the algorithm to improve its own reward function as it learns more.
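As a toy illustration of what "improve its own reward function" could mean (all names and details here are hypothetical, not a worked-out design), the robot could keep a proxy reward over the features it was trained on, and flag any novel features, the point where its model splinters, for clarification rather than blindly optimising the old proxy:

```python
# A toy sketch, under illustrative assumptions, of a robot that refines its
# own proxy reward when its world-model splinters: new features appear that
# its current reward function never distinguished during training.

from dataclasses import dataclass, field

@dataclass
class SplinteringRobot:
    # Proxy reward learnt in training: a weight per known feature.
    reward_weights: dict = field(default_factory=lambda: {"blue_in_camera": -1.0})
    known_features: set = field(default_factory=lambda: {"blue_in_camera"})

    def reward(self, observation: dict) -> float:
        return sum(self.reward_weights.get(f, 0.0) * v
                   for f, v in observation.items())

    def step(self, observation: dict) -> float:
        novel = set(observation) - self.known_features
        if novel:
            # Model splintering: the old reward function is silent on these
            # features (e.g. "is_hologram", "behind_colour_filter"), so the
            # robot asks for clarification instead of optimising blindly.
            self.request_clarification(novel)
        return self.reward(observation)

    def request_clarification(self, novel_features: set):
        # Placeholder: consult the designers / human examples about how the
        # idealised goal extends to the new features, then extend the weights.
        for f in novel_features:
            self.known_features.add(f)
            self.reward_weights.setdefault(f, 0.0)  # conservative default until clarified
```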
The improving robot
Here is a story of how that could go. Obviously this sort of behaviour would not happen naturally with a reinforcement learning agent; it has to be designed in. The key elements are in bold.
Seven key stages
There are seven key stages to this algorithm:
The question is: can all these stages be programmed or learnt by the AI? I feel they might be, since we humans can achieve them ourselves, at least imperfectly. So with a mix of explicit programming, examples of humans doing these tasks, learning from those examples, and examples of humans finding errors in that learning, it might be possible to design such an agent.
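To gesture at what that mix might look like (everything below is a placeholder of my own, not a concrete proposal), each stage could be learnt from human demonstrations, with human-flagged errors folded back in as further training data:

```python
# A very rough sketch, under illustrative assumptions, of that recipe: learn
# each stage from human examples, then fold human error-reports back in as
# further training data.

class StageModel:
    """Toy stand-in for a learnt model of one of the key stages."""
    def __init__(self):
        self.examples = []

    def fit(self, examples):
        self.examples = list(examples)

    def predict(self, situation):
        # Trivial lookup, purely for illustration.
        return self.examples[-1]["behaviour"] if self.examples else None


def train_stage(human_examples, get_human_corrections):
    """Learn one stage from human demonstrations, then from corrections."""
    model = StageModel()
    model.fit(human_examples)
    # Humans inspect the learnt behaviour and supply examples of its errors.
    for correction in get_human_corrections(model):
        human_examples.append(correction)
        model.fit(human_examples)
    return model
```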