Let's focus on a simple version, without the metaphors. We're talking about an AI presenting humans with consequences of a particular decision, with humans then making the final decision to go along with it or not.
So what is happening is that various possible future worlds will be considered by the AI according to its desirability criteria, these worlds will be described to humans according to its description criteria, and humans will choose according to whatever criteria we use. So we have a combination of criteria that result in a final decision. A siren world is a world that ranks very high in these combined criteria but is actually nasty.
If we stick to that scenario and assume the AI is truthful, the main siren world generator is the ability of the AI to describe them in ways that sound very attractive to humans. Since human beliefs and preferences are not clearly distinct., this ranges from misleading (incorrect human beliefs) to actively seductive (influencing human preferences to favour these worlds).
The higher the bandwidth the AI has, the more chance it has of "seduction", or of exploiting known or unknown human irrationalities (again, there's often no clear distinction between exploiting irrationalities for beliefs or preferences).
One scenario - Paul Christiano's - is a bit different but has essentially unlimited bandwidth (or, more precisely, has an AI estimating the result of a setup that has essentially unlimited bandwidth).
but also have a bunch of other extra-scary features above and beyond other scenarios of people being irrational, just because.
This category can include irrationalities we don't yet know about, better exploitation of irrationalities we do know about, and a host of speculative scenarios about hacking the human brain, which I don't want to rule out completely at this stage.
We're talking about an AI presenting humans with consequences of a particular decision, with humans then making the final decision to go along with it or not.
No. We're not. That's dumb. Like, sorry to be spiteful, but that is already a bad move. You do not treat any scenario involving "an AI", without dissolving the concept, as desirable or realistic. You have "an AI", without having either removed its "an AI"-ness (in the LW sense of "an AI") entirely or guaranteed Friendliness? You're already dead.
A putative new idea for AI control; index here.
After working for some time on the Friendly AI problem, it's occurred to me that a lot of the issues seem related. Specifically, all the following seem to have commonalities:
Speaking very broadly, there are two features all them share:
What do I mean by that? Well, imagine you're trying to reach reflective equilibrium in your morality. You do this by using good meta-ethical rules, zooming up and down at various moral levels, making decisions on how to resolve inconsistencies, etc... But how do you know when to stop? Well, you stop when your morality is perfectly self-consistent, when you no longer have any urge to change your moral or meta-moral setup. In other words, the stopping point (and the the convergence to the stopping point) is entirely self-referentially defined: the morality judges itself. It does not include any other moral considerations. You input your initial moral intuitions and values, and you hope this will cause the end result to be "nice", but the definition of the end result does not include your initial moral intuitions (note that some moral realists could see this process dependence as a positive - except for the fact that these processes have many convergent states, not just one or a small grouping).
So when the process goes nasty, you're pretty sure to have achieved something self-referentially stable, but not nice. Similarly, a nasty CEV will be coherent and have no desire to further extrapolate... but that's all we know about it.
The second feature is that any process has errors - computing errors, conceptual errors, errors due to the weakness of human brains, etc... If you visualise this as noise, you can see that noise in a convergent process is more likely to cause premature convergence, because if the process ever reaches a stable self-referential state, it will stay there (and if the process is a long one, then early noise will cause great divergence at the end). For instance, imagine you have to reconcile your belief in preserving human cultures with your beliefs in human individual freedom. A complex balancing act. But if, at any point along the way, you simply jettison one of the two values completely, things become much easier - and once jettisoned, the missing value is unlikely to ever come back.
Or, more simply, the system could get hacked. When exploring a potential future world, you could become so enamoured of it, that you overwrite any objections you had. It seems very easy for humans to fall into these traps - and again, once you lose something of value in your system, you don't tend to get if back.
Solutions
And again, very broadly speaking, there are several classes of solutions to deal with these problems: