Well, if it is sufficiently intelligent, it will model humans.
Okay, with you so far.
It will use this model to understand what they meant by Q, and why they were asking.
Kind of agree, but what if what counts as this to the AI is not necessarily a human-meaningful explanation, or even an explanation useful to an AI that wants to represent the world using human preferences as high-level objects?
Then it will ponder various outcomes, and various answers it could give, and what the human understanding of those answers would be.
Even if everything is transparent and modular, I think it's only going to represent human understanding if, as above, it represents humans as things with high-level understanding attributes.
Can you develop that thought? You might be onto a fundamental problem.
A putative new idea for AI control; index here.
This is a simple extension of the model-as-definition and the intelligence module ideas. General structure of these extensions: even an unfriendly AI, in the course of being unfriendly, will need to calculate certain estimates that would be of great positive value if we could but see them, shorn from the rest of the AI's infrastructure.
The challenge is to get the AI to answer a question as accurately as possible, using the human definition of accuracy.
First, imagine an AI with some goal is going to answer a question, such as Q="What would happen if...?" The AI is under no compulsion to answer it honestly.
What would the AI do? Well, if it is sufficiently intelligent, it will model humans. It will use this model to understand what they meant by Q, and why they were asking. Then it will ponder various outcomes, and various answers it could give, and what the human understanding of those answers would be. This is what any sufficiently smart AI (friendly or not) would do.
The basic idea is then to use modular design and corrigibility to extract the relevant pieces (possibly feeding them to another, differently motivated AI). What needs to be pieced together is: the AI's understanding of what the human understanding of Q is, the actual answer to Q (given this understanding), the human understanding of the AI's various possible answers (using the model of human understanding), and the minimum divergence between the human understanding of an answer and the actual answer.
All these pieces are there, and if they can be safely extracted, the minimum divergence can be calculated and the actual answer calculated.
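To make the final step concrete, here is a minimal sketch of the selection it describes, assuming the pieces have already been extracted safely (which is the hard part and is not addressed here). Every name in it is hypothetical: `human_reading` stands in for the AI's model of how a human would understand a given answer, `true_outcome` for the actual answer to Q, and `divergence` for some measure of how far the human's understanding is from that actual answer. The post does not specify any of these, so treat this as an illustration of the shape of the calculation, not as a proposal for how to implement it.

```python
from typing import Callable, Iterable, TypeVar

Answer = TypeVar("Answer")    # something the AI could say to the human
Outcome = TypeVar("Outcome")  # what is (or is believed to be) actually the case


def least_misleading_answer(
    candidate_answers: Iterable[Answer],
    true_outcome: Outcome,
    human_reading: Callable[[Answer], Outcome],
    divergence: Callable[[Outcome, Outcome], float],
) -> Answer:
    """Pick the candidate whose modelled human reading is closest to the truth.

    All four arguments are hypothetical stand-ins for the extracted pieces.
    """
    return min(
        candidate_answers,
        key=lambda answer: divergence(human_reading(answer), true_outcome),
    )


# Toy illustration only: outcomes are numbers, and the "human model" is a
# lookup table from phrasings to the number a human would take them to mean.
toy_readings = {"about five": 5.0, "roughly seven": 7.0, "a lot": 20.0}

best = least_misleading_answer(
    candidate_answers=toy_readings.keys(),
    true_outcome=7.0,
    human_reading=toy_readings.__getitem__,
    divergence=lambda understood, actual: abs(understood - actual),
)
```

In the toy case the selection is trivial. The sketch is only meant to show that, once the human model and the actual answer are available as separate modules, choosing the answer reduces to a minimisation over candidate answers.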