Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.
I dunno if this has been discussed elsewhere (pointers welcome).
Observational data doesn't allow one to distinguish correlation and causation.
This is a problem for an agent attempting to learn values without being allowed to make interventions.
For example, suppose that happiness is just a linear function of how much Utopamine is in a person's brain.
If a person smiles only when their Utopamine concentration is above 3 ppm, then an value-learner which observes both someone's Utopamine levels and facial expression and tries to predict their reported happiness on the basis of these features will notice that smiling is correlated with higher levels of reported happiness and thus erroneously believe that it is partially responsible for the happiness.
I have a picture of value learning where the AI learns via observation (since we don't want to give an unaligned AI access to actuators!).
But this makes it seem important to consider how to make an un unaligned AI safe-enough to perform value-learning relevant interventions.
Solving the value learning problem is (IMO) the key technical challenge for AI safety.
How good or bad is an approximate solution?
EDIT for clarity:
By "approximate value learning" I mean something which does a good (but suboptimal from the perspective of safety) job of learning values. So it may do a good enough job of learning values to behave well most of the time, and be useful for solving tasks, but it still has a non-trivial chance of developing dangerous instrumental goals, and is hence an Xrisk.
1. How would developing good approximate value learning algorithms effect AI research/deployment?
It would enable more AI applications. For instance, many many robotics tasks such as "smooth grasping motion" are difficult to manually specify a utility function for. This could have positive or negative effects:
* It could encourage more mainstream AI researchers to work on value-learning.
* It could encourage more mainstream AI developers to use reinforcement learning to solve tasks for which "good-enough" utility functions can be learned.
Consider a value-learning algorithm which is "good-enough" to learn how to perform complicated, ill-specified tasks (e.g. folding a towel). But it's still not quite perfect, and so every second, there is a 1/100,000,000 chance that it decides to take over the world. A robot using this algorithm would likely pass a year-long series of safety tests and seem like a viable product, but would be expected to decide to take over the world in ~3 years.
Without good-enough value learning, these tasks might just not be solved, or might be solved with safer approaches involving more engineering and less performance, e.g. using a collection of supervised learning modules and hand-crafted interfaces/heuristics.
2. What would a partially aligned AI do?
An AI programmed with an approximately correct value function might fail
* dramatically (see, e.g. Eliezer, on AIs "tiling the solar system with tiny smiley faces.")
* relatively benignly (see, e.g. my example of an AI that doesn't understand gustatory pleasure)
Perhaps a more significant example of benign partial-alignment would be an AI that has not learned all human values, but is corrigible and handles its uncertainty about its utility in a desirable way.
There are several well-known games in which the pareto optima and Nash equilibria are disjoint sets.
The most famous is probably the prisoner's dilemma. Races to the bottom or tragedies of the commons typically have this feature as well.
I proposed calling these inefficient games. More generally, games where the sets of pareto optima and Nash equilibria are distinct (but not disjoint), such as a stag hunt could be called potentially inefficient games.
It seems worthwhile to study (potentially) inefficient games as a class and see what can be discovered about them, but I don't know of any such work (pointers welcome!)
View more: Next