Solving the value learning problem is (IMO) the key technical challenge for AI safety.
How good or bad is an approximate solution?
EDIT for clarity:
By "approximate value learning" I mean something which does a good (but suboptimal from the perspective of safety) job of learning values. So it may do a good enough job of learning values to behave well most of the time, and be useful for solving tasks, but it still has a non-trivial chance of developing dangerous instrumental goals, and is hence an x-risk.
Considerations:
1. How would developing good approximate value learning algorithms affect AI research/deployment?
It would enable more AI applications. For instance, many robotics tasks such as "smooth grasping motion" are difficult to manually specify a utility function for. This could have positive or negative effects:
Positive:
* It could encourage more mainstream AI researchers to work on value-learning.
Negative:
* It could encourage more mainstream AI developers to use reinforcement learning to solve tasks for which "good-enough" utility functions can be learned.
Consider a value-learning algorithm which is "good-enough" to learn how to perform complicated, ill-specified tasks (e.g. folding a towel). But it's still not quite perfect, and so every second, there is a 1/100,000,000 chance that it decides to take over the world. A robot using this algorithm would likely pass a year-long series of safety tests and seem like a viable product, but would be expected to decide to take over the world in ~3 years.
Without good-enough value learning, these tasks might just not be solved, or might be solved with safer approaches involving more engineering and less performance, e.g. using a collection of supervised learning modules and hand-crafted interfaces/heuristics.
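The arithmetic behind the towel-folding example can be checked with a short sketch. It assumes each second is an independent trial with the same 1/100,000,000 failure chance (my simplification; the names below are just for illustration):

```python
import math

# Per-second chance of a catastrophic decision, taken from the example above.
P_PER_SECOND = 1 / 100_000_000
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def survival_probability(years: float) -> float:
    """Chance of no failure over `years`, assuming independent seconds."""
    n_seconds = years * SECONDS_PER_YEAR
    # (1 - p) ** n, computed via logs for numerical stability.
    return math.exp(n_seconds * math.log1p(-P_PER_SECOND))


# Mean of a geometric distribution is 1 / p trials (here: seconds).
expected_years_to_failure = (1 / P_PER_SECOND) / SECONDS_PER_YEAR

print(f"P(pass one year of tests): {survival_probability(1):.2f}")
print(f"Expected years until failure: {expected_years_to_failure:.2f}")
```

Under these assumptions the robot passes a year of testing roughly 73% of the time, while the expected time to failure is a little over 3 years, which matches the "~3 years" figure.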
2. What would a partially aligned AI do?
An AI programmed with an approximately correct value function might fail
* dramatically (see, e.g., Eliezer on AIs "tiling the solar system with tiny smiley faces")
or
* relatively benignly (see, e.g. my example of an AI that doesn't understand gustatory pleasure)
Perhaps a more significant example of benign partial-alignment would be an AI that has not learned all human values, but is corrigible and handles its uncertainty about its utility in a desirable way.
I don't think we are only arguing semantics, but scanning a human is neither my only idea nor the best idea in AI safety. It is just an interesting and promising one.
In one Russian short story, a robot was asked to get rid of all circular objects in the room, and it cut off its owner's head. But if the robot had a simulation of a moral human, it could run the simulation thousands of times a second and check every one of its actions against it.
The first difference between a sim and a human operator is that the sim can be run vastly more often and very cheaply, so we could check every action the robot takes.
The second difference is that a human operator changes: the operator could be replaced, could age, or could suffer a stroke. The sim does not change at such speed (though it could still learn, or for simple situations fixed copies could be used).
The third is that the sim is specially selected and trained for many years, and then tested as a sim in many imaginary situations. We can't test an operator that rigorously.
Fourth, the sim could vet every update to the AI's utility function, however small, because it runs at the same speed as the AI. This prevents the AI from becoming malicious. To explain: our typical discussions of malicious AI contain a logical loop (first pointed out by Ben Goertzel).
The loop: to start hiding its utility function from its peers, the AI must already be malicious. But it can't covertly become malicious without secretly changing its utility function. In other words, there is a short gap between the moment the AI decides to be bad and the moment it starts to act on that decision. This gap may be very short, say 0.01 seconds, so no human operator has time to react. But the sim will have time (especially if the AI is halted after every update to its open utility function).
Another way to explain the gap: if all of an AI's values are written in open form (in some stack), and it creates the value of hiding its values, then the value "hide values" is also written in open form (for a short time). So if the value stack is broadcast to the outside, the AI has no way to hide its transformation from the peer (except to kill the peer before the peer works out all the implications).
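A toy sketch of the "open value stack" idea may make the gap easier to picture. Everything here is hypothetical: `OpenValueStack`, `toy_overseer`, and the string-matching rule are stand-ins I made up for illustration, and a real sim would need to do something far harder than scan for a keyword.

```python
from dataclasses import dataclass, field
from typing import Callable, List


class Halted(Exception):
    """Raised when the overseer vetoes a proposed value update."""


@dataclass
class OpenValueStack:
    # Hypothetical stand-in for the sim: returns True to approve an update.
    overseer: Callable[[str], bool]
    values: List[str] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def push(self, value: str) -> None:
        # Every proposed value is published before it can take effect,
        # so even "hide values" is briefly visible to the overseer.
        self.log.append(value)
        if not self.overseer(value):
            raise Halted(f"update rejected: {value!r}")
        self.values.append(value)


def toy_overseer(value: str) -> bool:
    # Toy rule only: reject anything that mentions hiding.
    return "hide" not in value.lower()


stack = OpenValueStack(overseer=toy_overseer)
stack.push("fold towels neatly")
try:
    stack.push("hide values from overseer")
except Halted as err:
    print(err)
```

The design point is only that the check happens synchronously on every write: the rejected value appears in `log` but never reaches `values`, so the AI is halted inside the gap rather than after it.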
Fifth, the sim checks not only values but also whether the AI correctly understands human language and context.