Siren worlds and the perils of over-optimised search

Stuart_Armstrong

90 Siren worlds and the perils of over-optimised search

by Stuart_Armstrong

7th Apr 2014

AI Alignment Forum

8 min read

418

90 Ω 20

tl;dr An unconstrained search through possible future worlds is a dangerous way of choosing positive outcomes. Constrained, imperfect or under-optimised searches work better.

Some suggested methods for designing AI goals, or controlling AIs, involve unconstrained searches through possible future worlds. This post argues that this is a very dangerous thing to do, because of the risk of being tricked by "siren worlds" or "marketing worlds". The thought experiment starts with an AI designing a siren world to fool us, but that AI is not crucial to the argument: it's simply an intuition pump to show that siren worlds can exist. Once they exist, there is a non-zero chance of us being seduced by them during a unconstrained search, whatever the search criteria are. This is a feature of optimisation: satisficing and similar approaches don't have the same problems.

The AI builds the siren worlds

Imagine that you have a superintelligent AI that's not just badly programmed, or lethally indifferent, but actually evil. Of course, it has successfully concealed this fact, as "don't let humans think I'm evil" is a convergent instrumental goal for all AIs.

We've successfully constrained this evil AI in a Oracle-like fashion. We ask the AI to design future worlds and present them to human inspection, along with an implementation pathway to create those worlds. Then if we approve of those future worlds, the implementation pathway will cause them to exist (assume perfect deterministic implementation for the moment). The constraints we've programmed means that the AI will do all these steps honestly. Its opportunity to do evil is limited exclusively to its choice of worlds to present to us.

The AI will attempt to design a siren world: a world that seems irresistibly attractive while concealing hideous negative features. If the human mind is hackable in the crude sense - maybe through a series of coloured flashes - then the AI would design the siren world to be subtly full of these hacks. It might be that there is some standard of "irresistibly attractive" that is actually irresistibly attractive: the siren world would be full of genuine sirens.

Even without those types of approaches, there's so much manipulation the AI could indulge in. I could imagine myself (and many people on Less Wrong) falling for the following approach:

First, the siren world looks complicated, wrong and scary - but with just a hint that there's something more to it. Something intriguing, something half-glimpsed, something making me want to dig deeper. And as I follow up this something, I see more patterns, and seem to gain a greater understanding. Not just of the world I'm looking at, but of the meaning of good itself. The world seems to confirm to me some of my ideas about what constitutes a worthwhile life - not just the ideas I've been able to articulate, but the ones I've only got a vague half-baked notion of, and the ones I didn't even know I had.

The longer I stare into this world, the greater an understanding I get of my own values. And this is just the starting point: the world subtly opens up new avenues of philosophical musings in my brain, the images I see triggering me to come up with my own insights and thought experiments as to the meaning of goodness - insights that are then carefully confirmed as I did deeper. I could stay staring at this wonderful, beautiful and complex world for hours, days, years, gaining a deeper wisdom and understanding as I go, truly appreciating how the mysteries of this place unravel into new versions of worth and goodness. Every time I ever have a doubt about it, that doubt is systemically resolved into a new glorious confirmation of how much the AI really gets what I value, even before I knew how much I valued it.

Meanwhile, at some level of complexity safely beyond what my human mind will ever reach, the AI is hiding all the evil and unmitigated suffering.

Your siren world may differ from mine - you may be vulnerable to different tricks. Since people often believe they would be immune to such manipulation, feel free to imagine that the AI is fooling someone else - we know how stupid other people can be, right? - while you watch as a helpless bystander.

The problem is that a future universe is likely to be vast and complicated. When inspecting it, we have certain inspection criteria (IC). These consist of our values and preferences, but also the ways we would look into this universe, how we would follow up on initial impressions, various measures and yardsticks we might use to summarise the data we see, all the clever and devious plans we might come up with to ferret out "what's really going on". These IC are going to be considerably simpler than the totality of the future universe. So the AI's task is to optimise a universe design that passes the IC, while shoving in as much disutility as it can - which in a large universe, is a tremendous amount. Unless our IC are perfect and already include a good solution to the problem of value (in which case we've solved the friendliness problem already), a superintelligent AI will likely succeed at its task.

Siren and marketing worlds without builders

The above thought experiment needed a superintelligent evil AI for the design of the siren world. But if we admit that that is possible, we don't actually need the AI any more. The siren worlds exist: there are potential worlds of extreme disutility that satisfie our IC. If we simply did an unconstrained search across all possible future worlds (something like the search in Paul Christiano's indirect normativity - an idea that inspired the siren world concept), then we would at some point find siren worlds. And if we took the time to inspect them, we'd get sucked in by them.

How bad is this problem in general? A full search will not only find the siren worlds, but also a lot of very-seductive-but-also-very-nice worlds - genuine eutopias. We may feel that it's easier to be happy than to pretend to be happy (while being completely miserable and tortured and suffering). Following that argument, we may feel that there will be far more eutopias than siren worlds - after all, the siren worlds have to have bad stuff plus a vast infrastructure to conceal that bad stuff, which should at least have a complexity cost if nothing else. So if we chose the world that best passed our IC - or chose randomly among the top contenders - we might be more likely to hit a genuine eutopia than a siren world.

Unfortunately, there are other dangers than siren worlds. We are now optimising not for quality of the world, but for ability to seduce or manipulate the IC. There's no hidden evil in this world, just a "pulling out all the stops to seduce the inspector, through any means necessary" optimisation pressure. Call a world that ranks high in this scale a "marketing world". Genuine eutopias are unlikely to be marketing worlds, because they are optimised for being good rather than seeming good. A marketing world would be utterly optimised to trick, hack, seduce, manipulate and fool our IC, and may well be a terrible world in all other respects. It's the old "to demonstrate maximal happiness, it's much more reliable to wire people's mouths to smile rather than make them happy" problem all over again: the very best way of seeming good may completely preclude actually being good. In a genuine eutopia, people won't go around all the time saying "Btw, I am genuinely happy!" in case there is a hypothetical observer looking in. If every one of your actions constantly proclaims that you are happy, chances are happiness is not your genuine state. EDIT: see also my comment:

We are both superintelligences. You have a bunch of independently happy people that you do not aggressively compel. I have a group of zombies - human-like puppets that I can make do anything, appear to feel anything (though this is done sufficiently well that outside human observers can't tell I'm actually in control). An outside human observer wants to check that our worlds rank high on scale X - a scale we both know about.

Which of us do you think is going to be better able to maximise our X score?

This can also be seen as a epistemic version of Goodhart's law: "When a measure becomes a target, it ceases to be a good measure." Here the IC are the measure, and the marketing worlds are targeting them, and hence they cease to be a good measure. But recall that the IC include the totality of approaches we use to rank these worlds, so there's no way around this problem. If instead of inspecting the worlds, we simply rely on some sort of summary function, then the search will be optimised to find anything that can fool/pass that summary function. If we use the summary as a first filter, then apply some more profound automated checking, then briefly inspect the outcome so we're sure it didn't go stupid - then the search will optimised for "pass the summary, pass automated checking, seduce the inspector".

Different IC therefore will produce different rankings of worlds, but the top worlds in any of the ranking will be marketing worlds (and possibly siren worlds).

Constrained search and satisficing our preferences

The issue is a problem of (over) optimisation. The IC correspond roughly with what we want to value, but differs from it in subtle ways, enough that optimising for one could be disastrous for the other. If we didn't optimise, this wouldn't be a problem. Suppose we defined an acceptable world as one that we would judge "yeah, that's pretty cool" or even "yeah, that's really great". Then assume we selected randomly among the acceptable worlds. This would probably result in a world of positive value: siren worlds and marketing worlds are rare, because they fulfil very specific criteria. They triumph because they score so high on the IC scale, but they are outnumbered by the many more worlds that are simply acceptable.

This is in effect satisficing over the IC, rather than optimising over them. Satisficing has its own issues, however, so other approaches could be valuable as well. One way could be use constrained search. If for instance we took a thousand random worlds and IC-optimised over them, we're very unlikely to encounter a siren or marketing world. We're also very unlikely to encounter a world of any quality, though; we'd probably need to IC-optimise over at least a trillion worlds to find good ones. There is a tension in the number: as the number of worlds searched increases, their quality increases, but so does the odds of encountering a marketing or siren world. EDIT: Lumifer suggested using a first-past-the-post system: search through worlds, and pick the first acceptable one we find. This is better than the approach I outlined in this paragraph.

We could also restrict the search by considering "realistic" worlds. Suppose we had to take 25 different yes-no decisions that could affect the future of the humanity. This might be something like "choosing which of these 25 very different AIs to turn on and let loose together" or something more prosaic (which stocks to buy, which charities to support). This results in 2²⁵ different future worlds to search through: barely more than 33 million. Because there are so few worlds, they are unlikely to contain a marketing world (given the absolutely crucial proviso that none of the AIs is an IC-optimiser!). But these worlds are not drawn randomly from the space of future worlds, but are dependent on key decisions that we believe are important and relevant. Therefore they are very likely to contain an acceptable world - or at least far more likely than a random set of 33 million worlds would be. By constraining the choices in this way, we have in effect satisficed without satisficing, which is both Zen and useful.

As long as we're aware of the problem, other approaches may also allow for decent search without getting sucked in by a siren or a marketer.

AI RiskComplexity of valueFuturismOptimizationRisks of Astronomical Suffering (S-risks)AI

Personal Blog

90 Ω 20

New Comment

Rendering 0/418 comments, sorted by

top scoring

(show more) Click to highlight new comments since: Today at 5:44 AM

Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

Moderation Log

90 Siren worlds and the perils of over-optimised search

by Stuart_Armstrong

7th Apr 2014

AI Alignment Forum

8 min read

418

90 Ω 20

tl;dr An unconstrained search through possible future worlds is a dangerous way of choosing positive outcomes. Constrained, imperfect or under-optimised searches work better.

The AI builds the siren worlds

Even without those types of approaches, there's so much manipulation the AI could indulge in. I could imagine myself (and many people on Less Wrong) falling for the following approach:

Meanwhile, at some level of complexity safely beyond what my human mind will ever reach, the AI is hiding all the evil and unmitigated suffering.

Siren and marketing worlds without builders

Which of us do you think is going to be better able to maximise our X score?

Different IC therefore will produce different rankings of worlds, but the top worlds in any of the ranking will be marketing worlds (and possibly siren worlds).

Constrained search and satisficing our preferences

As long as we're aware of the problem, other approaches may also allow for decent search without getting sucked in by a siren or a marketer.

AI RiskComplexity of valueFuturismOptimizationRisks of Astronomical Suffering (S-risks)AI

Personal Blog

90 Ω 20

Mentioned in

132Confused why a "capabilities research is good for alignment progress" position isn't discussed more

79Model splintering: moving from one imperfect model to another

74Research Agenda v0.9: Synthesising a human's preferences into a utility function

69A List of Nuances

66Why we need a *theory* of human values

Load More (5/30)

New Comment

Rendering 0/418 comments, sorted by

top scoring

(show more) Click to highlight new comments since: Today at 5:44 AM

Some comments are truncated due to high volume. (⌘F to expand all)Change truncation settings

Moderation Log

More from Stuart_Armstrong

Curated and popular this week

418Comments

418

Comment Permalink

PhilosophyTutor12y50

Precisely and exactly! That's the whole of the problem - optimising for one thing (appearance) results in the loss of other things we value.

This just isn't always so. If you instruct an AI to optimise a car for speed, efficiency and durability but forget to specify that it has to be aerodynamic, you aren't going to get a car shaped like a brick. You can't optimise for speed and efficiency without optimising for aerodynamics too. In the same way it seems highly unlikely to me that you could optimise a society for freedom, education, just distribution of wealth, sexual equality and so on without creating something pretty close to optimal in terms of unwanted pregnancies, crime and other important axes.

Even if it's possible to do this, it seems like something which would require extra work and resources to achieve. A magical genie AI might be able to make you a super-efficient brick-shaped car by using Sufficiently Advanced Technology indistinguishable from magic but even for that genie it would have to be more work than making an equally optimal car by the defined parameters that wasn't a silly shape. In the same way an effectively God-like hypothetical AI might be able to make a siren world that optimised for everything except crime and create a world perfect in every way except that it was rife with crime but it seems like it would be more work, not less.

Next challenge: define liberty in code. This seems extraordinarily difficult.

I think if we can assume we have solved the strong AI problem, we can assume we have solved the much lesser problem of explaining liberty to an AI.

So we do agree that there are problem with an all-powerful genie?

We've got a problem with your assumptions about all-powerful genies, I think, because I think your argument relies on the genie being so ultimately all-powerful that it is exactly as easy for the genie to make an optimal brick-shaped car or an optimal car made out of tissue paper and post-it notes as it is for the genie to make an optimal proper car. I don't think that genie can exist in any remotely plausible universe.

If it's not all-powerful to that extreme then it's still going to be easier for the genie to make a society optimised (or close to it) across all the important axes at once than one optimised across all the ones we think to specify while tanking all the rest. So for any reasonable genie I still think market worlds don't make sense as a concept. Siren worlds, sure. Market worlds, not so much, because the things we value are deeply interconnected and you can't just arbitrarily dump-stat some while efficiently optimising all the rest.

Strange712y00

This just isn't always so. If you instruct an AI to optimise a car for speed, efficiency and durability but forget to specify that it has to be aerodynamic, you aren't going to get a car shaped like a brick. You can't optimise for speed and efficiency without optimising for aerodynamics too.

Unless you start by removing the air, in some way that doesn't count against the car's efficiency.

1Stuart_Armstrong12y

The strong AI problem is much easier to solve than the problem of motivating an AI to respect liberty. For instance, the first one can be brute forced (eg AIXItl with vast resources), the second one can't. Having the AI understand human concepts of liberty is pointless unless it's motivated to act on that understanding. An excess of anthropomophisation is bad, but an analogy could be about creating new life (which humans can do) and motivating that new life to follow specific rules are requirements if they become powerful (which humans are pretty bad at at).

See in context