A putative new idea for AI control; index here.
This is a simple extension of the model-as-definition and the intelligence module ideas. The general structure of these extensions is this: even an unfriendly AI, in the course of being unfriendly, will need to calculate certain estimates that would be of great positive value if we could but see them, shorn from the rest of the AI's infrastructure.
The challenge is to get the AI to answer a question as accurately as possible, using the human definition of accuracy.
First, imagine that an AI with some goal is going to answer a question, such as Q = "What would happen if...?" The AI is under no compulsion to answer it honestly.
What would the AI do? Well, if it is sufficiently intelligent, it will model humans. It will use this model to understand what they meant by Q, and why they were asking. Then it will ponder various outcomes, and various answers it could give, and what the human understanding of those answers would be. This is what any sufficiently smart AI (friendly or not) would do.
Then the basic idea is to use modular design and corrigibility to extract the relevant pieces (possibly feeding them to another, differently motivated AI). What needs to be pieced together is: the AI's understanding of what the human understanding of Q is; the actual answer to Q (given this understanding); the human understanding of each of the AI's possible answers (using the model of human understanding); and the minimum divergence between the human understanding of an answer and the actual answer.
All these pieces are there. If they can be safely extracted, the minimum divergence can be calculated, and with it the answer whose human reading is closest to the actual answer.
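To make the "minimum divergence" step concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the names `Belief`, `total_variation` and `least_divergent_answer` are hypothetical, beliefs are represented as probability distributions over a handful of outcomes, total variation distance stands in for whatever the human definition of accuracy would really be, and a lookup table plays the part of the extracted human model. It says nothing about how to extract these pieces safely, which is the hard part.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical representation: a belief is a distribution over outcomes.
Belief = Dict[str, float]


def total_variation(p: Belief, q: Belief) -> float:
    """Total variation distance between two distributions (assumed divergence)."""
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in outcomes)


def least_divergent_answer(
    actual: Belief,
    candidates: Iterable[str],
    human_reading: Callable[[str], Belief],
    divergence: Callable[[Belief, Belief], float] = total_variation,
) -> Tuple[str, float]:
    """Return the candidate answer whose human reading is closest to `actual`.

    `actual` stands in for the AI's own answer to Q, and `human_reading`
    for its model of what a human would come to believe on hearing an answer.
    """
    scored = [(divergence(human_reading(a), actual), a) for a in candidates]
    best_divergence, best_answer = min(scored, key=lambda pair: pair[0])
    return best_answer, best_divergence


# Toy stand-ins for the extracted modules (made-up numbers).
actual_answer: Belief = {"flood": 0.7, "no_flood": 0.3}
toy_human_model: Dict[str, Belief] = {
    "It will definitely flood": {"flood": 1.0},
    "It will probably flood": {"flood": 0.75, "no_flood": 0.25},
    "Nothing will happen": {"no_flood": 1.0},
}

best, div = least_divergent_answer(
    actual_answer, toy_human_model.keys(), toy_human_model.__getitem__
)
print(best, round(div, 3))
```

In this toy setup the selected answer is "It will probably flood", since the belief it induces in the modelled human is the closest of the three to the AI's own estimate. The choice of divergence is left as a parameter because the post does not specify one.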
But the module for predicting human behaviour/preferences should surely be the same in a different ontology? The module is a model, and the model is likely not grounded in the fine detail of the ontology.
Example: the law of comparative advantage in economics is a high-level model, which won't collapse because the fundamental ontology is relativity rather than Newtonian mechanics. Even in a different ontology, humans should remain (by far) the best things in the world that approximate the "human model".
If there is a module that specifically requires prediction of human behavior, sure. My claim in the second part of my comment is that if the model predicts the number of paperclips, there is no guarantee that the closest match to things that function like human decisions will actually be a useful predictive model of human decisions.