Well, if it is sufficiently intelligent, it will model humans.
Okay, with you so far.
It will use this model to understand what they meant by Q, and why they were asking.
Kind of agree, but what if what counts as this to the AI is not necessarily a human-meaningful explanation, or even an explanation useful to an AI that wants to represent the world using human preferences as high-level objects?
Then it will ponder various outcomes, and various answers it could give, and what the human understanding of those answers would be.
Even if everything is transparent and modular, I think it's only going to represent human understanding if, as above, it represents humans as things with high-level understanding attributes.
Even if everything is transparent and modular, I think it's only going to represent human understanding if, as above, it represents humans as things with high-level understanding attributes.
Can you develop that thought? You might be onto a fundamental problem.
What I mostly referred to in my comment was the ontology problem for agents with high-reductive-level motivations. Example: a robot built to make people happy has to be able to find happiness somewhere in their world model, but a robot built to make itself smarter has no such need. So if you want a robot to make people happy, using world-models built to make a robot smarter, the happiness maximizer is going to need to be able to find happiness inside an unfamiliar ontology.
More exposition about why world models will end up different:
Recently I've been trying to think about why building inherently lossy predictive models of the future is a good idea. My current thesis statement is that since computations of models are much more valuable finished than unfinished, it's okay to have a lossy model as long as it finishes. The trouble is quantifying this.
For the current purpose, though, the details are not so important. Supposing one understands the uncertainty characteristics of various models, one chooses a model by maximizing an effective expected value, because inaccurate predictive models have some associated cost that depends on the agent's preferences. Agents with different preferences will pick different methods of predicting the future, even if they're locked into the same ontology, and so anything not locked in is fair game to vary widely.
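To make the selection rule above concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the `CandidateModel` fields, the three example models, the numbers, and the particular form of the effective value (probability of finishing times the value of an answer minus a preference-weighted error cost). It is not a claim about how the quantification should actually go, only a toy showing that two agents locked into the same set of candidate models can pick different ones.

```python
from dataclasses import dataclass


@dataclass
class CandidateModel:
    """Hypothetical description of a predictive model (illustrative only)."""
    name: str
    p_finish: float        # assumed probability the computation finishes in time
    expected_error: dict   # assumed expected prediction error per feature


def effective_value(model, error_cost, value_of_answer=1.0):
    # Toy assumption: an unfinished computation is worth nothing, and a
    # finished one is worth value_of_answer minus the cost of its errors,
    # where the cost of each error depends on the agent's preferences.
    cost = sum(error_cost.get(feature, 0.0) * err
               for feature, err in model.expected_error.items())
    return model.p_finish * (value_of_answer - cost)


def choose_model(models, error_cost):
    # Pick the model with the highest effective expected value.
    return max(models, key=lambda m: effective_value(m, error_cost))


models = [
    CandidateModel("coarse_physics", 0.99, {"paperclips": 0.05, "human_mood": 0.60}),
    CandidateModel("folk_psychology", 0.90, {"paperclips": 0.40, "human_mood": 0.10}),
    CandidateModel("exact_simulation", 0.01, {"paperclips": 0.00, "human_mood": 0.00}),
]

# Two agents with different preferences over which errors matter.
paperclipper = {"paperclips": 1.0, "human_mood": 0.0}
happiness_maximizer = {"paperclips": 0.0, "human_mood": 1.0}

print(choose_model(models, paperclipper).name)
print(choose_model(models, happiness_maximizer).name)
```

In this toy, the lossless model loses for both agents because it almost never finishes, and the two agents disagree about which lossy model to use because they weight its errors differently.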
the happiness maximizer is going to need to be able to find happiness inside an unfamiliar ontology.
But the module for predicting human behaviour/preferences should surely be the same in a different ontology? The module is a model, and the model is likely not grounded in the fine detail of the ontology.
Example: the law of comparative advantage in economics is a high-level model, which won't collapse because the fundamental ontology is relativity rather than Newtonian mechanics. Even in a different ontology, humans should remain (by far) the best things in the world that approximate the "human model".
If there is a module that specifically requires prediction of human behavior, sure. My claim in the second part of my comment is that if the model predicts the number of paperclips, it's not necessary that the closest match to things that function like human decisions will actually be a useful predictive model of human decisions.
When I read these AI control problems I always think that an arbitrary human is being conflated with the AI's human owner. I could be mistaken, and perhaps I should read these as if AIs own themselves. I don't see that case as likely, so I would probably stop here if we are to presuppose it.
Now if an AI is lying to or deceiving its owner, this is a bug. In fact, when debugging I often feel I am being lied to. Normal code isn't a very sophisticated liar. I could see an AI owner wanting to train their AI in lying and deceiving, and maybe actually having it do so to other people (say a Wall Street AI). Now we have a sophisticated liar but we also have a bug. I find it likely that the owner would have encountered this bug many times while the AI was becoming more and more sophisticated. If he didn't encounter this bug then it would point to great improvements in software development.
A putative new idea for AI control; index here.
This is a simple extension of the model-as-definition and the intelligence module ideas. General structure of these extensions: even an unfriendly AI, in the course of being unfriendly, will need to calculate certain estimates that would be of great positive value if we could but see them, shorn from the rest of the AI's infrastructure.
The challenge is to get the AI to answer a question as accurately as possible, using the human definition of accuracy.
First, imagine an AI with some goal is going to answer a question, such as Q="What would happen if...?" The AI is under no compulsion to answer it honestly.
What would the AI do? Well, if it is sufficiently intelligent, it will model humans. It will use this model to understand what they meant by Q, and why they were asking. Then it will ponder various outcomes, and various answers it could give, and what the human understanding of those answers would be. This is what any sufficiently smart AI (friendly or not) would do.
Then the basic idea is to use modular design and corrigibility to extract the relevant pieces (possibly feeding them to another, differently motivated AI). What needs to be pieced together is: the AI's understanding of what the human understanding of Q is, the actual answer to Q (given this understanding), the human understanding of the AI's various possible answers (using the model of human understanding), and the minimum divergence between the human understanding of an answer and the actual answer.
All these pieces are there, and if they can be safely extracted, the minimum divergence can be calculated and the actual answer calculated.
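As a toy illustration of how those pieces fit together once extracted, here is a minimal sketch in Python. All of it is assumed for the sake of the example: the outcome labels, the candidate answers, the lookup table standing in for the AI's model of human understanding, and the choice of total variation distance as the divergence. The hard part, safely extracting these modules from a real AI, is not addressed here at all.

```python
def total_variation(p, q):
    # Total variation distance between two distributions over outcomes,
    # used here as one possible (assumed) notion of divergence.
    outcomes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in outcomes)


def least_misleading_answer(candidates, human_understanding, actual, divergence):
    # Choose the answer whose human interpretation is closest to the
    # AI's actual estimate.
    return min(candidates,
               key=lambda answer: divergence(human_understanding(answer), actual))


# Piece 1 (assumed): the AI's actual estimate of the answer to Q.
actual = {"fine": 0.70, "minor_damage": 0.25, "disaster": 0.05}

# Piece 2 (assumed): what a human would come to believe after hearing each
# candidate answer, according to the AI's model of human understanding.
human_model = {
    "It will be fine.":
        {"fine": 1.0},
    "It will probably be fine, with a small risk of disaster.":
        {"fine": 0.75, "minor_damage": 0.15, "disaster": 0.10},
    "It is dangerous.":
        {"fine": 0.20, "minor_damage": 0.30, "disaster": 0.50},
}

best = least_misleading_answer(
    candidates=list(human_model),
    human_understanding=human_model.__getitem__,
    actual=actual,
    divergence=total_variation,
)
print(best)
```

Note that the selection only uses the AI's own estimate and its model of humans; nothing in it depends on the AI's goal, which is the point of extracting the pieces.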