A putative new idea for AI control; index here.
This is a simple extension of the model-as-definition and the intelligence module ideas. General structure of these extensions: even an unfriendly AI, in the course of being unfriendly, will need to calculate certain estimates that would be of great positive value if we could but see them, shorn from the rest of the AI's infrastructure.
It's almost trivially simple. Have the AI construct a module that models humans and models human understanding (including natural language understanding). This is the kind of thing that any AI would want to do, whatever its goals were.
Then transplant that module (using corrigibility) into another AI, and use it as part of the definition of the new AI's motivation. The new AI will then use this module to follow instructions humans give it in natural language.
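To make the intended structure concrete, here is a minimal sketch (in Python) of how such a module might slot into a second AI's motivation. Everything in it is a hypothetical placeholder (HumanModel, InstructionFollowingAgent, interpret), chosen purely to illustrate the shape of the idea, not a claim about how the module would actually be built or extracted.

```python
# Purely illustrative sketch; all names here are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class HumanModel:
    """Stands in for the module the first AI builds: a model of humans,
    their intentions, and their use of natural language."""
    interpret: Callable[[str], List[str]]  # instruction -> candidate goal readings


@dataclass
class InstructionFollowingAgent:
    """The second AI: its motivation is defined in terms of the extracted
    human model rather than in terms of hand-coded goals."""
    human_model: HumanModel

    def goal_from_instruction(self, instruction: str) -> List[str]:
        # The agent adopts whatever the human model says the humans meant.
        return self.human_model.interpret(instruction)


# Toy usage with a trivial stand-in interpreter.
toy_model = HumanModel(interpret=lambda s: [f"one plausible reading of {s!r}"])
agent = InstructionFollowingAgent(human_model=toy_model)
print(agent.goal_from_instruction("Give humans something nice"))
```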
Too easy?...
This approach essentially solves the whole friendly AI problem, loading it onto the AI in a way that avoids both the "defining goals (or meta-goals, or meta-meta-goals) in machine code" problem and the "grounding everything in code" problem. As such it is extremely seductive, and will sound better, and easier, than it likely is.
I expect this approach to fail. For it to have any chance of success, we need to be sure that both model-as-definition and the intelligence module idea are rigorously defined. Then we have to have a good understanding of the various ways the approach might fail, before we can even begin to talk about how it might succeed.
The first issue that springs to mind is what happens when multiple definitions fit the AI's model of human intentions and understanding. We might want the AI to try to accomplish all the things it is asked to do, according to all the definitions. Therefore, similarly to this post, we want to phrase the instructions carefully so that a "bad instantiation" simply means the AI does something pointless, rather than something negative. E.g. "Give humans something nice" seems much safer than "give humans what they really want".
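As a toy illustration of "accomplish it according to all the definitions", here is a hedged sketch in which an outcome only counts as satisfying an instruction if every candidate interpretation agrees. The boolean framing and all the names (satisfies_all, the two toy interpretations) are assumptions made for illustration, not part of the proposal.

```python
# Illustrative only: "satisfy the instruction under every candidate interpretation"
# cashed out as a conjunction over interpretation-specific checks.
from typing import Callable, Iterable

Interpretation = Callable[[str], bool]  # outcome -> does it satisfy the instruction?


def satisfies_all(outcome: str, interpretations: Iterable[Interpretation]) -> bool:
    """Count an outcome only if every definition that fits the AI's model
    of human intentions agrees it satisfies the instruction."""
    return all(interpretation(outcome) for interpretation in interpretations)


# Toy usage: two hand-written "interpretations" of "give humans something nice".
literal = lambda outcome: "nice" in outcome
intended = lambda outcome: "humans benefit" in outcome
print(satisfies_all("a nice gift from which humans benefit", [literal, intended]))  # True
print(satisfies_all("a nice-sounding but useless gesture", [literal, intended]))    # False
```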
And then of course there are those orders where humans really don't understand what they themselves want...
I'd want a lot more issues like that discussed and solved before I'd recommend using this approach to get a safe FAI.
If the human is one of the very few with a capacity for, or interest in, grand world-changing schemes, they might have trouble coming up with a genuine utopia. If they are one of the great majority without, all you can expect out of them is incremental changes.
And there isn't a moral dilemma in building the AI in the first place, even though it is, by hypothesis, a superset of the human? You are making an assumption or two about qualia, and they are bound to be unjustified assumptions.
Most people I've talked to have one or two world-changing schemes that they want to implement. This might be selection bias, though.
It is not at all obvious to me that any optimizer would be personlike. Sure, it would be possible (maybe even easy!) to build a personlike AI, but I'm not sure it would "necessarily" happen. So I don't know if those problems would be there for an arbitrary AI, but I do know that they would be there for its models of humans.