Solving the value learning problem is (IMO) the key technical challenge for AI safety.
How good or bad is an approximate solution?
EDIT for clarity:
By "approximate value learning" I mean something which does a good (but suboptimal from the perspective of safety) job of learning values. So it may do a good enough job of learning values to behave well most of the time, and be useful for solving tasks, but it still has a non-trivial chance of developing dangerous instrumental goals, and is hence an Xrisk.
Considerations:
1. How would developing good approximate value learning algorithms affect AI research/deployment?
It would enable more AI applications. For instance, many robotics tasks such as "smooth grasping motion" are difficult to manually specify a utility function for. This could have positive or negative effects:
Positive:
* It could encourage more mainstream AI researchers to work on value-learning.
Negative:
* It could encourage more mainstream AI developers to use reinforcement learning to solve tasks for which "good-enough" utility functions can be learned.
Consider a value-learning algorithm which is "good-enough" to learn how to perform complicated, ill-specified tasks (e.g. folding a towel). But it's still not quite perfect, and so every second, there is a 1/100,000,000 chance that it decides to take over the world. A robot using this algorithm would likely pass a year-long series of safety tests and seem like a viable product, but would be expected to decide to take over the world in ~3 years.
Without good-enough value learning, these tasks might just not be solved, or might be solved with safer approaches involving more engineering and less performance, e.g. using a collection of supervised learning modules and hand-crafted interfaces/heuristics.
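To make the arithmetic behind the towel-folding example explicit (the 1/100,000,000-per-second figure is of course just an illustrative assumption), a quick calculation:

```python
# Back-of-the-envelope check of "passes a year of safety testing, defects in ~3 years".
P_DEFECT_PER_SECOND = 1e-8          # assumed per-second chance of a treacherous turn
SECONDS_PER_YEAR = 365 * 24 * 3600  # ~3.15e7

# Expected time until the robot decides to take over the world (geometric distribution).
expected_years = (1 / P_DEFECT_PER_SECOND) / SECONDS_PER_YEAR

# Probability of getting through a year-long safety test without incident.
p_pass_one_year_test = (1 - P_DEFECT_PER_SECOND) ** SECONDS_PER_YEAR

print(f"Expected time to defection: {expected_years:.1f} years")       # ~3.2 years
print(f"Chance of passing a 1-year test: {p_pass_one_year_test:.0%}")  # ~73%
```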
2. What would a partially aligned AI do?
An AI programmed with an approximately correct value function might fail
* dramatically (see, e.g. Eliezer, on AIs "tiling the solar system with tiny smiley faces.")
or
* relatively benignly (see, e.g. my example of an AI that doesn't understand gustatory pleasure)
Perhaps a more significant example of benign partial-alignment would be an AI that has not learned all human values, but is corrigible and handles its uncertainty about its utility function in a desirable way.
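As a toy illustration of what I mean by handling utility uncertainty in a desirable way, here is a minimal sketch of an agent that keeps a distribution over candidate utility functions and hands the decision back to the operator when those candidates disagree too much about the best action. All the names and thresholds here are made up for illustration, not a worked-out proposal:

```python
# Toy sketch: an agent with uncertainty over its utility function that defers
# to the operator when its candidate utilities disagree about what to do.
# All names and thresholds here are illustrative assumptions.

def choose_action(actions, candidate_utilities, posterior, ask_operator, threshold=0.1):
    """Pick an action by expected utility under the posterior, but defer to the
    operator when the candidate utility functions disagree strongly."""
    def expected_utility(a):
        return sum(p * u(a) for u, p in zip(candidate_utilities, posterior))

    best = max(actions, key=expected_utility)

    # Disagreement: how much utility the chosen action could cost under some
    # still-plausible candidate utility, relative to that candidate's own optimum.
    worst_case_regret = max(
        max(u(a) for a in actions) - u(best)
        for u, p in zip(candidate_utilities, posterior)
        if p > 0.01  # ignore near-falsified candidates
    )

    if worst_case_regret > threshold:
        return ask_operator(actions)  # corrigible move: hand the decision back
    return best
```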
I'm still not sure I understand you correctly. I suspect that if we follow this to the end, we will discover that we are only arguing semantics, and don't actually disagree over anything tangible. If that's your impression too, please say so, and we'll both save ourselves some time.
I wouldn't disagree that having such an operator is better than not having one. I am questioning the value of having the operator uploaded. Why would programming an AI to care about the operator's values and not manipulate the operator be easier if the operator is uploaded? Wouldn't the operator just be manipulated even faster?
The only answer I see to that is that the uploading part is just to provide a faster and better user interface. If value loading was done via a game of 20 billion questions, for example, this would take an impractically long time. (Thousands of years, if just one person at a time is answering questions.) Same goes if the AI learns values via machine learning, using rewards and punishments given out by the operator, although you'd still have to keep it from wire-heading by manipulating the operator. Also, as an interesting aside, it may be easier to pull values directly out of someone's brain.
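As a rough check on that "thousands of years" figure, assuming (arbitrarily) one answer every ten seconds, eight hours a day:

```python
# Rough check of the "20 billion questions" timescale.
# The answering rate is an arbitrary assumption for illustration.
QUESTIONS = 20_000_000_000
SECONDS_PER_ANSWER = 10                # assumed
HOURS_PER_DAY = 8                      # assumed
answers_per_year = (HOURS_PER_DAY * 3600 / SECONDS_PER_ANSWER) * 365

print(QUESTIONS / answers_per_year)    # ~19,000 years for a single operator
```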
If we're only arguing about semantics, however, I have a guess at the source:
I understand "failed FAI" to be something like a pure smile maximizer, which has just as much incentive to route around human operators as a paperclip maximizer or suffering maximizer. It wouldn't care about our values any more than we care about what sorts of values evolution tried to give us. The unstated assumption here is that value uploading failed or never happened, and the AI is no longer trying to load values, but only implement the values it has. I believe this is what you're gesturing toward with "real UFAI".
Do you understand "failed FAI" to be one which simply misunderstood our values, like a smile maximizer, but which never exited the value loading phase? This sort of AI might have some sort of "uncertainty" about it's utility function. If so, it might still care about what values we intended to give it.
I don't think that we are only arguing semantics, but the idea of scanning a human is not my only idea, nor the best idea in AI safety. It is just an interesting, promising idea.
In one Russian short story, a robot was asked to get rid of all circular objects in the room, and the robot cut off the owner's head. But if the robot had a simulation of a morally competent human, it could run it thousands of times a second and check every action against it.
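A minimal sketch of that checking loop might look like the following, where `simulate_moral_human` is a stand-in for whatever hypothetical machinery runs the simulated judge; nothing here is a real API:

```python
# Hypothetical sketch: vet each candidate action against many runs of a
# simulated moral human before executing it.

def action_is_approved(action, simulate_moral_human, n_runs=1000, required_approval=0.99):
    """Run the simulated judge many times (it may be stochastic) and only
    approve the action if nearly all runs say it is acceptable."""
    approvals = sum(simulate_moral_human(action) for _ in range(n_runs))
    return approvals / n_runs >= required_approval

def safe_execute(candidate_actions, execute, simulate_moral_human):
    for action in candidate_actions:
        if action_is_approved(action, simulate_moral_human):
            execute(action)
        # Disapproved actions (e.g. "cut off the owner's head") are simply skipped.
```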
The first difference between a sim and a human operator is that the sim can be run infinitely many more times and very c...