A putative new idea for AI control; index here.
After working for some time on the Friendly AI problem, it's occurred to me that a lot of the issues seem related. Specifically, all the following seem to have commonalities:
- Nasty extrapolation of concepts (through badly implemented value learning or badly coded base concepts).
- AIs making themselves into nasty expected utility maximisers.
- AIs hacking themselves to maximum reward.
- AIs creating successor agents that differ from them in dangerous ways.
- People hacking themselves to maximum apparent happiness.
- Problems with Coherent Extrapolated Volition.
- Problems with unrestricted search.
- Some issues I have with some of Paul Christiano's designs.
- Reflective equilibrium itself.
Speaking very broadly, there are two features they all share:
- The convergence criteria are self-referential.
- Errors in the setup are likely to cause false convergence.
What do I mean by that? Well, imagine you're trying to reach reflective equilibrium in your morality. You do this by using good meta-ethical rules, zooming up and down at various moral levels, making decisions on how to resolve inconsistencies, etc... But how do you know when to stop? Well, you stop when your morality is perfectly self-consistent, when you no longer have any urge to change your moral or meta-moral setup. In other words, the stopping point (and the convergence to the stopping point) is entirely self-referentially defined: the morality judges itself. It does not include any other moral considerations. You input your initial moral intuitions and values, and you hope this will cause the end result to be "nice", but the definition of the end result does not include your initial moral intuitions (note that some moral realists could see this process dependence as a positive - except that these processes have many convergent states, not just one or a small grouping).
So when the process goes nasty, you're pretty sure to have achieved something self-referentially stable, but not nice. Similarly, a nasty CEV will be coherent and have no desire to further extrapolate... but that's all we know about it.
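To make this first feature concrete, here is a minimal toy sketch in Python. All the names, numbers, and the notion of "consistency" are made up purely for illustration; the only point is the shape of the stopping rule, which asks whether the current state judges itself consistent and never mentions the values the process started from.

```python
import statistics

def is_self_consistent(values: dict[str, float], tol: float = 0.05) -> bool:
    """Purely internal test: the state only judges itself.

    'Consistent' here just means the value weights are nearly in agreement;
    crucially, the test never refers to the values the process started with.
    """
    return statistics.pstdev(values.values()) < tol

def revise(values: dict[str, float], step: float = 0.1) -> dict[str, float]:
    """One round of reconciliation: pull every value weight toward the mean."""
    mean = statistics.fmean(values.values())
    return {k: v + step * (mean - v) for k, v in values.items()}

def reflective_equilibrium(initial: dict[str, float], max_steps: int = 1000) -> dict[str, float]:
    state = dict(initial)
    for _ in range(max_steps):
        if is_self_consistent(state):   # the only stopping criterion: self-judgement;
            return state                # `initial` appears nowhere in the test
        state = revise(state)
    return state

print(reflective_equilibrium({"culture": 0.9, "freedom": 0.3, "comfort": 0.5}))
```

Nothing in the stopping test would notice if the converged state had drifted arbitrarily far from the initial one.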
The second feature is that any process has errors - computing errors, conceptual errors, errors due to the weakness of human brains, etc... If you visualise this as noise, you can see that noise in a convergent process is more likely to cause premature convergence, because if the process ever reaches a stable self-referential state, it will stay there (and if the process is a long one, then early noise will cause great divergence at the end). For instance, imagine you have to reconcile your belief in preserving human cultures with your beliefs in human individual freedom. A complex balancing act. But if, at any point along the way, you simply jettison one of the two values completely, things become much easier - and once jettisoned, the missing value is unlikely to ever come back.
Or, more simply, the system could get hacked. When exploring a potential future world, you could become so enamoured of it that you overwrite any objections you had. It seems very easy for humans to fall into these traps - and again, once you lose something of value in your system, you don't tend to get it back.
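The same toy setup can illustrate the second feature. Add a small chance that a reconciliation step simply drops a value, and note that nothing in the update rule or the self-referential stopping rule ever brings it back (again, every number and name here is an illustrative placeholder, not a model of any real system):

```python
import random
import statistics

def noisy_revise(values: dict[str, float], step: float = 0.1, p_error: float = 0.02) -> dict[str, float]:
    """One reconciliation round, with a small chance of jettisoning a value outright."""
    mean = statistics.fmean(values.values())
    revised = {k: v + step * (mean - v) for k, v in values.items()}
    if random.random() < p_error and len(revised) > 1:
        del revised[random.choice(list(revised))]   # a dropped value never re-enters the update rule
    return revised

def noisy_equilibrium(initial: dict[str, float], tol: float = 0.05, max_steps: int = 1000) -> dict[str, float]:
    state = dict(initial)
    for _ in range(max_steps):
        if statistics.pstdev(state.values()) < tol:  # same purely self-referential stop as before;
            return state                             # fewer values make it trivially easier to pass
        state = noisy_revise(state)
    return state

# Run this enough times and some runs 'converge' having silently lost culture
# or freedom along the way - and the end state still looks perfectly stable.
print(noisy_equilibrium({"culture": 0.9, "freedom": 0.1, "comfort": 0.5}))
```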
Solutions
And again, very broadly speaking, there are several classes of solutions to deal with these problems:
- Reduce or prevent errors in the extrapolation (eg solving the agent tiling problem).
- Solve all or most of the problem ahead of time (eg the traditional FAI approach of specifying the correct values).
- Make sure you don't get too far from the starting point (eg reduced impact AI, tool AI, models as definitions).
- Figure out the properties of a nasty convergence, and try to avoid them (eg some of the ideas I mentioned in "crude measures", or general precautions taken when defining the convergence process); the last two classes are sketched in toy form after this list.
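As a purely illustrative continuation of the earlier sketches (the thresholds and "red flag" checks are placeholders, not proposals), the third and fourth classes amount to changing the stopping rule so that it is no longer purely self-referential:

```python
import math
import statistics

def distance(a: dict[str, float], b: dict[str, float]) -> float:
    """Euclidean distance between value profiles; a missing value counts as zero."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def looks_nasty(state: dict[str, float], initial: dict[str, float]) -> bool:
    """Crude red flags for a bad convergence: an initial value vanished or collapsed."""
    return any(state.get(k, 0.0) < 1e-3 for k in initial)

def acceptable_stop(state: dict[str, float], initial: dict[str, float],
                    tol: float = 0.05, max_drift: float = 0.5) -> bool:
    """A stopping rule that is no longer purely self-referential."""
    internally_consistent = (len(state) < 2) or statistics.pstdev(state.values()) < tol
    return (internally_consistent
            and distance(state, initial) <= max_drift   # third class: stay near the starting point
            and not looks_nasty(state, initial))        # fourth class: avoid flagged failure properties

print(acceptable_stop({"culture": 0.5, "freedom": 0.5}, {"culture": 0.9, "freedom": 0.3}))  # True
print(acceptable_stop({"culture": 0.5}, {"culture": 0.9, "freedom": 0.3}))                  # False: freedom was lost
```

The point is only structural: the acceptance test now mentions the starting point and some externally specified failure properties, rather than letting the end state judge itself alone.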