A putative new idea for AI control; index here.
This is a simple extension of the model-as-definition and the intelligence module ideas. General structure of these extensions: even an unfriendly AI, in the course of being unfriendly, will need to calculate certain estimates that would be of great positive value if we could but see them, shorn from the rest of the AI's infrastructure.
It's almost trivially simple. Have the AI construct a module that models humans and models human understanding (including natural language understanding). This is the kind of thing that any AI would want to do, whatever its goals were.
Then copy that module (using corrigibility) into another AI, and use it as part of the definition of the new AI's motivation. The new AI will then use this module to follow the instructions humans give it in natural language.
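To make the intended shape of this concrete, here is a minimal toy sketch in Python. Every name in it (HumanModel, InstructionFollowingAI, the interpret interface) is a hypothetical illustration rather than anything specified by the idea itself; the only point is that the human-model module sits inside the *definition* of the second AI's motivation, rather than being a tool it could consult or ignore:

```python
# Toy sketch only; all names and interfaces are hypothetical.

class HumanModel:
    """Stands in for the module the first AI builds: it maps natural-language
    instructions to a predicted human-intended goal specification."""
    def interpret(self, instruction: str) -> dict:
        # In reality this would be a learned model of humans and language;
        # here it is just a placeholder.
        return {"goal": instruction, "confidence": 0.5}

class InstructionFollowingAI:
    """Second AI: its motivation is defined *through* the human model."""
    def __init__(self, human_model: HumanModel):
        self.human_model = human_model

    def motivation(self, instruction: str) -> dict:
        # The goal the AI optimises is whatever the human model says the
        # instruction means -- the module is part of the goal definition.
        return self.human_model.interpret(instruction)

ai = InstructionFollowingAI(HumanModel())
print(ai.motivation("tidy the lab without disturbing the experiments"))
```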
Too easy?...
This approach essentially solves the whole friendly AI problem, loading it onto the AI in a way that avoids the whole "defining goals (or meta-goals, or meta-meta-goals) in machine code" or the "grounding everything in code" problems. As such it is extremely seductive, and will sound better, and easier, than it likely is.
I expect this approach to fail. For it to have any chance of success, we need to be sure that both model-as-definition and the intelligence module idea are rigorously defined. Then we need a good understanding of the various ways the approach might fail, before we can even begin to talk about how it might succeed.
The first issue that springs to mind is what happens when multiple definitions fit the AI's model of human intentions and understanding. We might want the AI to try to accomplish all the things it is asked to do, according to all the definitions. Therefore, similarly to this post, we want to phrase the instructions carefully so that a "bad instantiation" simply means the AI does something pointless, rather than something negative. E.g. "give humans something nice" seems much safer than "give humans what they really want".
And then of course there are those orders where humans really don't understand what they themselves want...
I'd want a lot more issues like that discussed and solved before I'd recommend using this approach to get a safe FAI.
Let me play devil's advocate for this position.
How to specify goals at compile time is a technical question, but we can do some a priori theorizing about how we might do it. Roughly, there are two high-level approaches: simple hard-coded goals, and goals fed in from more complex modules. A simple hard-coded goal might be something like current reinforcement learners, where the reward signal is human praise (or a simple-to-hard-code proxy for human praise, such as pressing a reward button). The other alternative is to build a few modules (e.g. one for natural language understanding, one for modeling humans) and "use it/them as part of the definition of the new AI's motivation".
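As a rough illustration of that contrast, a sketch might look like the following. All names are hypothetical, and the stubs stand in for what are of course the actual hard problems:

```python
# Toy sketch of the two goal-specification approaches (all names hypothetical).

# Approach 1: a simple hard-coded goal, e.g. a reinforcement-learning reward
# signal using an easy-to-hard-code proxy for human praise (a reward button).
def hardcoded_reward(world_state: dict) -> float:
    return 1.0 if world_state.get("reward_button_pressed") else 0.0

# Approach 2: the goal is fed in from richer modules -- a natural-language
# module and a model of humans -- which together define what counts as reward.
class LanguageModule:
    def parse(self, instruction: str) -> str:
        return instruction.lower()            # placeholder for real parsing

class HumanModel:
    def satisfaction(self, goal: str, world_state: dict) -> float:
        return float(world_state.get("humans_satisfied_with", "") == goal)

def module_based_reward(world_state: dict, instruction: str,
                        language: LanguageModule, humans: HumanModel) -> float:
    return humans.satisfaction(language.parse(instruction), world_state)

state = {"reward_button_pressed": True, "humans_satisfied_with": "clean the lab"}
print(hardcoded_reward(state))                                                # 1.0
print(module_based_reward(state, "Clean the lab", LanguageModule(), HumanModel()))  # 1.0
```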
Responses to counterarguments:
4.1: needing to specify commands carefully (e.g. "give humans what they really want").
The whole point of intelligence is being able to specify tasks in an ambiguous way (e.g. you don't have to specify what you want in such detail that you're practically programming a computer). An AI that actually wants to make you happier (since its goals were specified at compile time using a module that models humans) will ask you to clarify your intentions if you give it vague goals.
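A toy sketch of that clarification behaviour, again with purely hypothetical names and a stubbed-out human model, might look like this:

```python
# Toy sketch (hypothetical names): if the human model finds several plausible
# readings of a vague instruction, the AI asks the human rather than guessing.

def follow(instruction: str, human_model, ask_human) -> str:
    readings = human_model.candidate_interpretations(instruction)
    if len(readings) > 1:
        # Ambiguous: defer to the human instead of picking one reading.
        return ask_human(f"Did you mean {readings[0]!r} or {readings[1]!r}?")
    return readings[0]

class ToyHumanModel:
    def candidate_interpretations(self, instruction: str) -> list:
        if "happ" in instruction:
            return ["improve your mood today", "improve your long-term welfare"]
        return [instruction]

print(follow("make me happier", ToyHumanModel(), ask_human=lambda q: q))
```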
Some other thoughts:
It will be hard to accomplish this, since nobody knows how to go about building such modules. Modeling language, humans, and human values are hard problems. Building the modules is a technical question. But building the modules and feeding them into the goal system of another AI is both necessary and sufficient for building a friendly AI. In fact, one could make a stronger argument: any AGI that's built with a goal system must have its goal system specified with natural language modules (e.g. reinforcement learning sucks). Thus, it is likely that any AGI that gets built would be an FAI.
EDITED to add: Tool-AI arguments. If you can build the modules to feed into an AI with a goal system, then you might be able to build a "tool AI" that doesn't have a goal system at all. I think it's hard to say a priori that such an architecture isn't more likely than an architecture that requires a goal system. It's even harder to say that a tool-AI architecture is impossible to build.
In summary, I think the chief issues with building friendly AI are technical issues related to actually building the AI. I don't see how decision theory helps. I do think that unfriendly humans with a tool AI are something to be concerned about, but doing math research doesn't seem related to that. (Incidentally, MIRI's math research has intrigued people like Elon Musk, which helps with the "unfriendly humans" problem.)