Siren worlds and the perils of over-optimised search

Stuart_Armstrong

tl;dr An unconstrained search through possible future worlds is a dangerous way of choosing positive outcomes. Constrained, imperfect or under-optimised searches work better.

Some suggested methods for designing AI goals, or controlling AIs, involve unconstrained searches through possible future worlds. This post argues that this is a very dangerous thing to do, because of the risk of being tricked by "siren worlds" or "marketing worlds". The thought experiment starts with an AI designing a siren world to fool us, but that AI is not crucial to the argument: it's simply an intuition pump to show that siren worlds can exist. Once they exist, there is a non-zero chance of us being seduced by them during an unconstrained search, whatever the search criteria are. This is a feature of optimisation: satisficing and similar approaches don't have the same problems.

The AI builds the siren worlds

Imagine that you have a superintelligent AI that's not just badly programmed, or lethally indifferent, but actually evil. Of course, it has successfully concealed this fact, as "don't let humans think I'm evil" is a convergent instrumental goal for all AIs.

We've successfully constrained this evil AI in an Oracle-like fashion. We ask the AI to design future worlds and present them for human inspection, along with an implementation pathway to create those worlds. Then if we approve of those future worlds, the implementation pathway will cause them to exist (assume perfect deterministic implementation for the moment). The constraints we've programmed mean that the AI will do all these steps honestly. Its opportunity to do evil is limited exclusively to its choice of worlds to present to us.

The AI will attempt to design a siren world: a world that seems irresistibly attractive while concealing hideous negative features. If the human mind is hackable in the crude sense - maybe through a series of coloured flashes - then the AI would design the siren world to be subtly full of these hacks. It might be that there is some standard of "irresistibly attractive" that is actually irresistibly attractive: the siren world would be full of genuine sirens.

Even without those types of approaches, there's so much manipulation the AI could indulge in. I could imagine myself (and many people on Less Wrong) falling for the following approach:

First, the siren world looks complicated, wrong and scary - but with just a hint that there's something more to it. Something intriguing, something half-glimpsed, something making me want to dig deeper. And as I follow up this something, I see more patterns, and seem to gain a greater understanding. Not just of the world I'm looking at, but of the meaning of good itself. The world seems to confirm to me some of my ideas about what constitutes a worthwhile life - not just the ideas I've been able to articulate, but the ones I've only got a vague half-baked notion of, and the ones I didn't even know I had.

The longer I stare into this world, the greater an understanding I get of my own values. And this is just the starting point: the world subtly opens up new avenues of philosophical musings in my brain, the images I see triggering me to come up with my own insights and thought experiments as to the meaning of goodness - insights that are then carefully confirmed as I dig deeper. I could stay staring at this wonderful, beautiful and complex world for hours, days, years, gaining a deeper wisdom and understanding as I go, truly appreciating how the mysteries of this place unravel into new versions of worth and goodness. Every time I have a doubt about it, that doubt is systematically resolved into a new glorious confirmation of how much the AI really gets what I value, even before I knew how much I valued it.

Meanwhile, at some level of complexity safely beyond what my human mind will ever reach, the AI is hiding all the evil and unmitigated suffering.

Your siren world may differ from mine - you may be vulnerable to different tricks. Since people often believe they would be immune to such manipulation, feel free to imagine that the AI is fooling someone else - we know how stupid other people can be, right? - while you watch as a helpless bystander.

The problem is that a future universe is likely to be vast and complicated. When inspecting it, we have certain inspection criteria (IC). These consist of our values and preferences, but also the ways we would look into this universe, how we would follow up on initial impressions, various measures and yardsticks we might use to summarise the data we see, all the clever and devious plans we might come up with to ferret out "what's really going on". These IC are going to be considerably simpler than the totality of the future universe. So the AI's task is to optimise a universe design that passes the IC, while shoving in as much disutility as it can - which in a large universe, is a tremendous amount. Unless our IC are perfect and already include a good solution to the problem of value (in which case we've solved the friendliness problem already), a superintelligent AI will likely succeed at its task.

Siren and marketing worlds without builders

The above thought experiment needed a superintelligent evil AI for the design of the siren world. But if we admit that that is possible, we don't actually need the AI any more. The siren worlds exist: there are potential worlds of extreme disutility that satisfy our IC. If we simply did an unconstrained search across all possible future worlds (something like the search in Paul Christiano's indirect normativity - an idea that inspired the siren world concept), then we would at some point find siren worlds. And if we took the time to inspect them, we'd get sucked in by them.

How bad is this problem in general? A full search will not only find the siren worlds, but also a lot of very-seductive-but-also-very-nice worlds - genuine eutopias. We may feel that it's easier to be happy than to pretend to be happy (while being completely miserable and tortured and suffering). Following that argument, we may feel that there will be far more eutopias than siren worlds - after all, the siren worlds have to have bad stuff plus a vast infrastructure to conceal that bad stuff, which should at least have a complexity cost if nothing else. So if we chose the world that best passed our IC - or chose randomly among the top contenders - we might be more likely to hit a genuine eutopia than a siren world.

Unfortunately, there are other dangers than siren worlds. We are now optimising not for quality of the world, but for ability to seduce or manipulate the IC. There's no hidden evil in this world, just a "pulling out all the stops to seduce the inspector, through any means necessary" optimisation pressure. Call a world that ranks high on this scale a "marketing world". Genuine eutopias are unlikely to be marketing worlds, because they are optimised for being good rather than seeming good. A marketing world would be utterly optimised to trick, hack, seduce, manipulate and fool our IC, and may well be a terrible world in all other respects. It's the old "to demonstrate maximal happiness, it's much more reliable to wire people's mouths to smile rather than make them happy" problem all over again: the very best way of seeming good may completely preclude actually being good. In a genuine eutopia, people won't go around all the time saying "Btw, I am genuinely happy!" in case there is a hypothetical observer looking in. If every one of your actions constantly proclaims that you are happy, chances are happiness is not your genuine state. EDIT: see also my comment:

We are both superintelligences. You have a bunch of independently happy people that you do not aggressively compel. I have a group of zombies - human-like puppets that I can make do anything, appear to feel anything (though this is done sufficiently well that outside human observers can't tell I'm actually in control). An outside human observer wants to check that our worlds rank high on scale X - a scale we both know about.

Which of us do you think is going to be better able to maximise our X score?

This can also be seen as an epistemic version of Goodhart's law: "When a measure becomes a target, it ceases to be a good measure." Here the IC are the measure, and the marketing worlds are targeting them, and hence they cease to be a good measure. But recall that the IC include the totality of approaches we use to rank these worlds, so there's no way around this problem. If instead of inspecting the worlds, we simply rely on some sort of summary function, then the search will be optimised to find anything that can fool/pass that summary function. If we use the summary as a first filter, then apply some more profound automated checking, then briefly inspect the outcome so we're sure it didn't go stupid - then the search will be optimised for "pass the summary, pass automated checking, seduce the inspector".

Different IC will therefore produce different rankings of worlds, but the top worlds in any of the rankings will be marketing worlds (and possibly siren worlds).

Constrained search and satisficing our preferences

The issue is a problem of (over) optimisation. The IC correspond roughly with what we want to value, but differ from it in subtle ways, enough that optimising for one could be disastrous for the other. If we didn't optimise, this wouldn't be a problem. Suppose we defined an acceptable world as one that we would judge "yeah, that's pretty cool" or even "yeah, that's really great". Then assume we selected randomly among the acceptable worlds. This would probably result in a world of positive value: siren worlds and marketing worlds are rare, because they fulfil very specific criteria. They triumph because they score so high on the IC scale, but they are outnumbered by the many more worlds that are simply acceptable.
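The contrast between optimising and satisficing over the IC can be illustrated with a toy simulation. This is only a sketch under invented assumptions: the world generator, the 0.1% rate of marketing worlds, the score ranges and the acceptability threshold are all hypothetical numbers chosen for illustration, not estimates of anything.

```python
import random

def make_worlds(n, marketing_rate=0.001, seed=0):
    """Generate n toy worlds. Every number here is an illustrative assumption."""
    rng = random.Random(seed)
    worlds = []
    for _ in range(n):
        if rng.random() < marketing_rate:
            # Rare world tuned to impress the inspector: top IC score, bad underneath.
            worlds.append({"ic": rng.uniform(2.0, 3.0),
                           "value": rng.uniform(-1.0, -0.5),
                           "marketing": True})
        else:
            # Ordinary world: the IC score tracks true value, with some noise.
            value = rng.uniform(-1.0, 1.0)
            worlds.append({"ic": value + rng.gauss(0.0, 0.1),
                           "value": value,
                           "marketing": False})
    return worlds

def optimise(worlds):
    """Pick the world with the highest IC score."""
    return max(worlds, key=lambda w: w["ic"])

def acceptable_worlds(worlds, threshold=0.5):
    """Keep every world whose IC score clears the 'pretty cool' bar."""
    return [w for w in worlds if w["ic"] >= threshold]

def satisfice(worlds, threshold=0.5, seed=1):
    """Pick uniformly at random among the acceptable worlds."""
    return random.Random(seed).choice(acceptable_worlds(worlds, threshold))

worlds = make_worlds(100_000)
best = optimise(worlds)
acceptable = acceptable_worlds(worlds)
marketing_share = sum(w["marketing"] for w in acceptable) / len(acceptable)
mean_value = sum(w["value"] for w in acceptable) / len(acceptable)

print("IC-optimal world is a marketing world:", best["marketing"])
print("share of marketing worlds among acceptable ones:", marketing_share)
print("mean true value of an acceptable world:", mean_value)
```

In this toy model the IC-maximising world is a marketing world by construction, while marketing worlds make up only a small share of the acceptable set. That is a property of the assumptions fed in, so the sketch shows the shape of the argument rather than providing evidence for it.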

This is in effect satisficing over the IC, rather than optimising over them. Satisficing has its own issues, however, so other approaches could be valuable as well. One way could be to use constrained search. If for instance we took a thousand random worlds and IC-optimised over them, we're very unlikely to encounter a siren or marketing world. We're also very unlikely to encounter a world of any quality, though; we'd probably need to IC-optimise over at least a trillion worlds to find good ones. There is a tension in the number: as the number of worlds searched increases, their quality increases, but so do the odds of encountering a marketing or siren world. EDIT: Lumifer suggested using a first-past-the-post system: search through worlds, and pick the first acceptable one we find. This is better than the approach I outlined in this paragraph.
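The tension in the sample size can be made concrete with a small sketch. It assumes, purely for illustration, that worlds are drawn independently and that marketing or siren worlds occur with some fixed frequency `p_bad`; the value used below is hypothetical, and the function names are mine rather than anything from the post. The second function is a minimal rendering of the first-past-the-post idea.

```python
import math
from typing import Callable, Iterable, Optional, TypeVar

W = TypeVar("W")

def p_contains_bad_world(p_bad: float, n: int) -> float:
    """Probability that n independent draws include at least one world of a
    type occurring with frequency p_bad, i.e. 1 - (1 - p_bad)**n.
    Computed via log1p/expm1 so that tiny p_bad values do not round away."""
    return -math.expm1(n * math.log1p(-p_bad))

def first_past_the_post(worlds: Iterable[W],
                        acceptable: Callable[[W], bool]) -> Optional[W]:
    """Return the first acceptable world encountered, without ranking the rest."""
    for world in worlds:
        if acceptable(world):
            return world
    return None

P_BAD = 1e-9  # hypothetical frequency of marketing/siren worlds
for n in (10**3, 10**6, 10**9, 10**12):
    print(f"n = {n:>16,}: P(at least one bad world) = {p_contains_bad_world(P_BAD, n):.6f}")

# Toy usage: worlds are just IC scores, and 0.5 is an assumed acceptability bar.
picked = first_past_the_post([0.1, 0.3, 0.7, 0.99], lambda ic: ic >= 0.5)
print("first acceptable IC score:", picked)
```

With this hypothetical frequency, a thousand draws contain a bad world about one time in a million, while a trillion draws almost certainly do. First-past-the-post avoids the ranking step altogether, so a world's extreme IC score gives it no extra chance of being picked.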

We could also restrict the search by considering "realistic" worlds. Suppose we had to take 25 different yes-no decisions that could affect the future of humanity. This might be something like "choosing which of these 25 very different AIs to turn on and let loose together" or something more prosaic (which stocks to buy, which charities to support). This results in 2²⁵ different future worlds to search through: barely more than 33 million. Because there are so few worlds, they are unlikely to contain a marketing world (given the absolutely crucial proviso that none of the AIs is an IC-optimiser!). But these worlds are not drawn randomly from the space of future worlds, but are dependent on key decisions that we believe are important and relevant. Therefore they are very likely to contain an acceptable world - or at least far more likely than a random set of 33 million worlds would be. By constraining the choices in this way, we have in effect satisficed without satisficing, which is both Zen and useful.
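The size of this decision-constrained search space is easy to check, and the search itself is just an enumeration of yes-no vectors. The sketch below is illustrative only: `toy_ic` is a hypothetical stand-in for the inspection criteria, and the demonstration uses 4 decisions rather than 25 so that it runs instantly.

```python
from itertools import product
from typing import Callable, Iterator, Tuple

Decisions = Tuple[bool, ...]

N_DECISIONS = 25
n_worlds = 2 ** N_DECISIONS  # one future world per combination of yes-no choices
print(f"{N_DECISIONS} yes-no decisions give {n_worlds:,} worlds")

def decision_vectors(n: int) -> Iterator[Decisions]:
    """Lazily enumerate every combination of n yes-no decisions."""
    return product((False, True), repeat=n)

def best_by_ic(n: int, ic: Callable[[Decisions], float]) -> Decisions:
    """IC-optimise over the small, decision-constrained set of worlds."""
    return max(decision_vectors(n), key=ic)

def toy_ic(decisions: Decisions) -> float:
    """Hypothetical scoring rule: each 'yes' simply adds one point."""
    return float(sum(decisions))

small_count = sum(1 for _ in decision_vectors(4))
best_small = best_by_ic(4, toy_ic)
print("worlds from 4 decisions:", small_count)
print("best decision vector under the toy IC:", best_small)
```

The point of the constraint is that the candidate set is generated by decisions we already consider relevant, not by whatever scores highest, so optimising over it exerts far less pressure towards marketing worlds than an unconstrained search would.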

As long as we're aware of the problem, other approaches may also allow for decent search without getting sucked in by a siren or a marketer.

Stuart_Armstrong (12y)

employ normative uncertainty and the learning of utility functions from data

Indeed. But the more of the problem you can formalise and solve (eg maintaining a stable utility function over self-improvements) the more likely the learning approach is to succeed.

[anonymous] (12y)

Well yes, of course. I mean, if you can't build an agent that is capable of maintaining its learned utility while becoming vastly smarter (and thus capable of more accurately learning and enacting capital-G Goodness), then all that utility-learning was for nought.
