Less Wrong is a community blog devoted to refining the art of human rationality. Please visit our About page for more information.

Siren worlds and the perils of over-optimised search

27 Stuart_Armstrong 07 April 2014 11:00AM

tl;dr An unconstrained search through possible future worlds is a dangerous way of choosing positive outcomes. Constrained, imperfect or under-optimised searches work better.

Some suggested methods for designing AI goals, or controlling AIs, involve unconstrained searches through possible future worlds. This post argues that this is a very dangerous thing to do, because of the risk of being tricked by "siren worlds" or "marketing worlds". The thought experiment starts with an AI designing a siren world to fool us, but that AI is not crucial to the argument: it's simply an intuition pump to show that siren worlds can exist. Once they exist, there is a non-zero chance of us being seduced by them during a unconstrained search, whatever the search criteria are. This is a feature of optimisation: satisficing and similar approaches don't have the same problems.

 

The AI builds the siren worlds

Imagine that you have a superintelligent AI that's not just badly programmed, or lethally indifferent, but actually evil. Of course, it has successfully concealed this fact, as "don't let humans think I'm evil" is a convergent instrumental goal for all AIs.

We've successfully constrained this evil AI in a Oracle-like fashion. We ask the AI to design future worlds and present them to human inspection, along with an implementation pathway to create those worlds. Then if we approve of those future worlds, the implementation pathway will cause them to exist (assume perfect deterministic implementation for the moment). The constraints we've programmed means that the AI will do all these steps honestly. Its opportunity to do evil is limited exclusively to its choice of worlds to present to us.

The AI will attempt to design a siren world: a world that seems irresistibly attractive while concealing hideous negative features. If the human mind is hackable in the crude sense - maybe through a series of coloured flashes - then the AI would design the siren world to be subtly full of these hacks. It might be that there is some standard of "irresistibly attractive" that is actually irresistibly attractive: the siren world would be full of genuine sirens.

Even without those types of approaches, there's so much manipulation the AI could indulge in. I could imagine myself (and many people on Less Wrong) falling for the following approach:

continue reading »