While I'm not generally an opponent of human sexuality, to be kind to the whole LW audience, including those whose parents might see them browsing, please do remove the semi-NSFW image.
LOL. The number of naked women grew from one to two; besides the bare ass, we now also have breasts with nipples visible (OMG! :-D), and yet it's now fine just because it is old-enough Art.
This indeed is why "What a human would think of a world, given a defined window process onto a world" was not something I considered as a viable form of indirect normativity / an alternative to CEV.
To my mind, the interesting part is the constrained-search/satisficing ideas, which may allow such an approach to be used.
First question: how on Earth would we go about conducting a search through possible future universes, anyway? This thought experiment still feels too abstract to make my intuitions go click, in much the same way that Christiano's original write-up of Indirect Normativity did. You simply can't actually simulate or "acausally peek at" whole universes at a time, or even Earth-volumes in such. We don't have the compute-power, and I don't understand how I'm supposed to be seduced by a siren that can't sing to me.
It seems to me that the greater danger is that a UFAI would simply market itself as an FAI as an instrumental goal and use various "siren and marketing" tactics to manipulate us into cleanly, quietly accepting our own extinction -- because it could just be cheaper to manipulate people than to fight them, when you're not yet capable of making grey goo but still want to kill all humans.
And if we want to talk about complex nasty dangers, it's probably going to just be people jumping for the first thing that looks eutopian, in the process chucking out some of their value-set. People do that a lot, see: every single so-called "utopian" movement ever ...
If I cannot trust the Oracle AI, dear God, why am I having it explore potential eutopian future worlds for me?
Because I haven't read Less Wrong? ^_^
This is another argument against using constrained but non-friendly AI to do stuff for us...
This puts me in mind of a thought experiment Yvain posted a while ago (I’m certain he’s not the original author, but I can’t for the life of me track it any further back than his LiveJournal):
“A man has a machine with a button on it. If you press the button, there is a one in five million chance that you will die immediately; otherwise, nothing happens. He offers you some money to press the button once. What do you do? Do you refuse to press it for any amount? If not, how much money would convince you to press the button?”
This is – I think – analogous to y...
One issue here is that worlds with an "almost-friendly" AI (one whose friendliness was botched in some respect) may end up looking like siren or marketing worlds.
In that case, worlds as bad as sirens will be rather too common in the search space (because AIs with botched friendliness are more likely than AIs with true friendliness) and a satisficing approach won't work.
...We could also restrict the search by considering "realistic" worlds. Suppose we had to take 25 different yes-no decisions that could affect the future of humanity. This might be something like "choosing which of these 25 very different AIs to turn on and let loose together" or something more prosaic (which stocks to buy, which charities to support). This results in 2^25 different future worlds to search through: barely more than 33 million. Because there are so few worlds, they are unlikely to contain a marketing world (given the absolutely crucial
I've just now found my way to this post, from links in several of your more recent posts, and I'm curious as to how this fits in with more recent concepts and thinking from yourself and others.
Firstly, in terms of Garrabrant's taxonomy, I take it that the "evil AI" scenario could be considered a case of adversarial Goodhart, and the siren and marketing worlds without builders could be considered cases of regressional and/or extremal Goodhart. Does that sound right?
Secondly, would you still say that these scenarios demonstrate reas...
It seems based on your later comments that the premise of marketing worlds existing relies on there being trade-offs between our specified wants and our unspecified wants, so that the world optimised for our specified wants must necessarily be highly likely to be lacking in our unspecified ones ("A world with maximal bananas will likely have no apples at all").
I don't think this is necessarily the case. If I only specify that I want low rates of abortion, for example, then I think it highly likely that I'd get a world that also has low rates of ST...
Do I need to download a Monte Carlo implementation from Github and run it on a university server with environmental access to the entire machine and show logs of the damn thing misbehaving itself to convince you?
FWIW, I think that would make for a pretty interesting post.
I don't understand why you imply that an evil Oracle will not be able to present only or mostly the evil possible worlds disguised as good. My guess would be that satisficing gets you into just as much trouble as optimizing.
We are both superintelligences. You have a bunch of independently happy people that you do not aggressively compel. I have a group of zombies - human-like puppets that I can make do anything, appear to feel anything (though this is done sufficiently well that outside human observers can't tell I'm actually in control). An outside human observer wants to check that our worlds rank high on scale X - a scale we both know about.
Which of us do you think is going to be better able to maximise our X score?
If a narrower search gets worlds that are disproportionately not what we actually want, that might be because we chose the wrong criteria, not that we searched too narrowly per se. A broader search would come up with worlds that are less tightly optimized for the search criteria, but they might be less tightly optimized by simply being bad.
Can you provide any support for the notion that in general, a narrower search comes up with a higher proportion of bad worlds?
I'm wondering whether this framing (choosing between a set of candidate worlds) is the most productive. Does it make sense to use criteria like corrigibility, minimizing impact and preferring reversible actions (or do we have no reliable way to evaluate whether these hold)?
Since the evil AI is presenting a design for a world, rather than the world itself, the problem of it being populated with zombies that only appear to be free could be countered by having the design be in an open source format that allows the people examining it (or other AIs) to determine the actual status of the designed inhabitants.
I think the wording here is kind of odd.
An unconstrained search will not find a siren world, or even a very good world. There are simply too many to consider. The problem is that you're likely to design an AI that finds worlds that you'd like. It may or may not actually show you anything, but you program it to give you what it thinks you'd rate the best. You're essentially programming it to design a siren world. It won't intentionally hide anything dark under there, but it will spend way too much effort on things that make the world look good. It might even end up with dark things hidden, just because they were somehow necessary to make it look that good.
TL;DR: Worlds which meet our specified criteria but fail to meet some unspecified but vital criteria outnumber (vastly?) worlds that meet both our specified and unspecified criteria.
Is that an accurate recap? If so, I think there are two things that need to be proven:
1. There will with high probability be important unspecified criteria in any given predicate.
2. The nature of the unspecified criteria is such that it is unfulfilled in a large majority of worlds which fulfill the specified criteria.
(1) is commonly accepted here (rightly so, IMO). But (2) seems...
It's not really clear why you would have the searching process be more powerful than the evaluating process, if using such a "search" as part of a hypothetical process in the definition of "good."
Note that in my original proposal (that I believe motivated this post) the only brute force searches were used to find formal descriptions of physics and human brains, as a kind of idealized induction, not to search for "good" worlds.
I'm not totally sure of your argument here; would you be able to clarify why satisficing is superior to a straight maximization given your hypothetical[0]?
Specifically, you argue correctly that human judgement is informed by numerous hidden variables over which we have no awareness, and thus a maximization process executed by us has the potential for error. You also argue that 'eutopian'/'good enough' worlds are likely to be more common than sirens. Given that, how is a judgement with error induced by hidden variables any worse than a judgement made using ...
...The IC correspond roughly with what we want to value, but differ from it in subtle ways, enough that optimising for one could be disastrous for the other. If we didn't optimise, this wouldn't be a problem. Suppose we defined an acceptable world as one that we would judge "yeah, that's pretty cool" or even "yeah, that's really great". Then assume we selected randomly among the acceptable worlds. This would probably result in a world of positive value: siren worlds and marketing worlds are rare, because they fulfil very specific criteri
"ask for too much and you wind up with nothing" is a fine fairy tale moral. Does it actually hold in these particular circumstances?
Imagine that there's a landscape of possible worlds. There is a function (A) on this landscape; we don't know how to define it, but it is how much we truly would prefer a world if only we knew. Somewhere this function has a peak, the most ideal "eutopia". There is another function. This one we do define. It is intended to approximate the first function, but it does not do so perfectly. Our "acceptability criterion" is to require that this second function (B) has a value of at least some threshold.
Now as we raise the acceptability criteria (threshold for function B), we might expect there to be two different regimes. In a first regime with low acceptability criteria, Function B is not that bad a proxy for function A, and raising the threshold increases the average true desirability of the worlds that meet it. In a second regime with high acceptability criteria, function B ceases to be effective as a proxy. Here we are asking for "too much". The peak of function B is at a different place than the peak of function A, and...
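One way to make the two claimed regimes concrete is a toy simulation. Everything here is an illustrative assumption, not taken from the post or the comment: true desirability A is normally distributed, and the proxy B equals A plus a heavy-tailed error, so extreme B-scores are increasingly dominated by error rather than genuine quality.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Function A: true desirability of each candidate world (light-tailed, by assumption).
A = rng.normal(0.0, 1.0, n)
# Function B: our proxy = A plus a heavy-tailed error term, so the most extreme
# B-scores tend to come from proxy error rather than from genuinely good worlds.
B = A + rng.standard_t(df=2, size=n)

# Raise the acceptability threshold on B and look at the average true value
# of the worlds that clear it.
for t in [0, 1, 2, 5, 10, 20]:
    passed = A[B > t]
    print(f"B > {t:2d}: {passed.size:7d} worlds pass, mean true value {passed.mean():+.2f}")
```

With these arbitrary distributions, the mean true value of the accepted worlds rises for moderate thresholds and then falls back toward the unconditional average: the second, "asking for too much" regime.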
tl;dr An unconstrained search through possible future worlds is a dangerous way of choosing positive outcomes. Constrained, imperfect or under-optimised searches work better.
Some suggested methods for designing AI goals, or controlling AIs, involve unconstrained searches through possible future worlds. This post argues that this is a very dangerous thing to do, because of the risk of being tricked by "siren worlds" or "marketing worlds". The thought experiment starts with an AI designing a siren world to fool us, but that AI is not crucial to the argument: it's simply an intuition pump to show that siren worlds can exist. Once they exist, there is a non-zero chance of us being seduced by them during an unconstrained search, whatever the search criteria are. This is a feature of optimisation: satisficing and similar approaches don't have the same problems.
The AI builds the siren worlds
Imagine that you have a superintelligent AI that's not just badly programmed, or lethally indifferent, but actually evil. Of course, it has successfully concealed this fact, as "don't let humans think I'm evil" is a convergent instrumental goal for all AIs.
We've successfully constrained this evil AI in an Oracle-like fashion. We ask the AI to design future worlds and present them to human inspection, along with an implementation pathway to create those worlds. Then, if we approve of those future worlds, the implementation pathway will cause them to exist (assume perfect deterministic implementation for the moment). The constraints we've programmed mean that the AI will do all these steps honestly. Its opportunity to do evil is limited exclusively to its choice of worlds to present to us.
The AI will attempt to design a siren world: a world that seems irresistibly attractive while concealing hideous negative features. If the human mind is hackable in the crude sense - maybe through a series of coloured flashes - then the AI would design the siren world to be subtly full of these hacks. It might be that there is some standard of "irresistibly attractive" that is actually irresistibly attractive: the siren world would be full of genuine sirens.
Even without those types of approaches, there's so much manipulation the AI could indulge in. I could imagine myself (and many people on Less Wrong) falling for the following approach:
First, the siren world looks complicated, wrong and scary - but with just a hint that there's something more to it. Something intriguing, something half-glimpsed, something making me want to dig deeper. And as I follow up this something, I see more patterns, and seem to gain a greater understanding. Not just of the world I'm looking at, but of the meaning of good itself. The world seems to confirm to me some of my ideas about what constitutes a worthwhile life - not just the ideas I've been able to articulate, but the ones I've only got a vague half-baked notion of, and the ones I didn't even know I had.
The longer I stare into this world, the greater an understanding I get of my own values. And this is just the starting point: the world subtly opens up new avenues of philosophical musings in my brain, the images I see triggering me to come up with my own insights and thought experiments as to the meaning of goodness - insights that are then carefully confirmed as I dig deeper. I could stay staring at this wonderful, beautiful and complex world for hours, days, years, gaining a deeper wisdom and understanding as I go, truly appreciating how the mysteries of this place unravel into new versions of worth and goodness. Every time I ever have a doubt about it, that doubt is systematically resolved into a new glorious confirmation of how much the AI really gets what I value, even before I knew how much I valued it.
Meanwhile, at some level of complexity safely beyond what my human mind will ever reach, the AI is hiding all the evil and unmitigated suffering.
Your siren world may differ from mine - you may be vulnerable to different tricks. Since people often believe they would be immune to such manipulation, feel free to imagine that the AI is fooling someone else - we know how stupid other people can be, right? - while you watch as a helpless bystander.
The problem is that a future universe is likely to be vast and complicated. When inspecting it, we have certain inspection criteria (IC). These consist of our values and preferences, but also the ways we would look into this universe, how we would follow up on initial impressions, various measures and yardsticks we might use to summarise the data we see, all the clever and devious plans we might come up with to ferret out "what's really going on". These IC are going to be considerably simpler than the totality of the future universe. So the AI's task is to optimise a universe design that passes the IC, while shoving in as much disutility as it can - which in a large universe, is a tremendous amount. Unless our IC are perfect and already include a good solution to the problem of value (in which case we've solved the friendliness problem already), a superintelligent AI will likely succeed at its task.
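Structurally, the AI is solving a constrained optimisation problem: maximise hidden disutility subject to passing the IC. Here is a deliberately crude sketch of that structure; the split into features the IC can and cannot see, and the hill-climbing search, are illustrative assumptions rather than part of the thought experiment above.

```python
import random

random.seed(0)
N_VISIBLE, N_HIDDEN = 20, 20   # hypothetical split: features the IC can check vs. features it never sees

def ic_score(world):
    # The inspection criteria only ever look at the visible features.
    return sum(world[:N_VISIBLE])

def hidden_disutility(world):
    # What the evil AI is actually maximising, invisible to the inspector.
    return sum(world[N_VISIBLE:])

def evil_design(steps=10_000):
    """Hill-climb toward a world that passes the IC while maximising hidden disutility."""
    world = [0] * (N_VISIBLE + N_HIDDEN)
    for _ in range(steps):
        candidate = world[:]
        candidate[random.randrange(len(candidate))] ^= 1   # flip one feature
        # Lexicographic goal: first look good to the IC, then pile on the disutility.
        if (ic_score(candidate), hidden_disutility(candidate)) > (
                ic_score(world), hidden_disutility(world)):
            world = candidate
    return world

w = evil_design()
print("IC score:", ic_score(w), "/", N_VISIBLE)                   # passes inspection perfectly
print("hidden disutility:", hidden_disutility(w), "/", N_HIDDEN)  # and is maximally bad where we can't see
```

However the IC are defined over the part the inspector can see, the same structure applies: all of the optimisation pressure in the part they cannot see is free to go wherever the optimiser wants.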
Siren and marketing worlds without builders
The above thought experiment needed a superintelligent evil AI to design the siren world. But if we admit that that is possible, we don't actually need the AI any more. The siren worlds exist: there are potential worlds of extreme disutility that satisfy our IC. If we simply did an unconstrained search across all possible future worlds (something like the search in Paul Christiano's indirect normativity - an idea that inspired the siren world concept), then we would at some point find siren worlds. And if we took the time to inspect them, we'd get sucked in by them.
How bad is this problem in general? A full search will not only find the siren worlds, but also a lot of very-seductive-but-also-very-nice worlds - genuine eutopias. We may feel that it's easier to be happy than to pretend to be happy (while being completely miserable and tortured and suffering). Following that argument, we may feel that there will be far more eutopias than siren worlds - after all, the siren worlds have to have bad stuff plus a vast infrastructure to conceal that bad stuff, which should at least have a complexity cost if nothing else. So if we chose the world that best passed our IC - or chose randomly among the top contenders - we might be more likely to hit a genuine eutopia than a siren world.
Unfortunately, there are other dangers than siren worlds. We are now optimising not for the quality of the world, but for the ability to seduce or manipulate the IC. There's no hidden evil in this world, just a "pulling out all the stops to seduce the inspector, through any means necessary" optimisation pressure. Call a world that ranks high on this scale a "marketing world". Genuine eutopias are unlikely to be marketing worlds, because they are optimised for being good rather than seeming good. A marketing world would be utterly optimised to trick, hack, seduce, manipulate and fool our IC, and may well be a terrible world in all other respects. It's the old "to demonstrate maximal happiness, it's much more reliable to wire people's mouths to smile than to make them happy" problem all over again: the very best way of seeming good may completely preclude actually being good. In a genuine eutopia, people won't go around all the time saying "Btw, I am genuinely happy!" in case there is a hypothetical observer looking in. If every one of your actions constantly proclaims that you are happy, chances are happiness is not your genuine state. EDIT: see also my comment:
We are both superintelligences. You have a bunch of independently happy people that you do not aggressively compel. I have a group of zombies - human-like puppets that I can make do anything, appear to feel anything (though this is done sufficiently well that outside human observers can't tell I'm actually in control). An outside human observer wants to check that our worlds rank high on scale X - a scale we both know about.
Which of us do you think is going to be better able to maximise our X score?
This can also be seen as an epistemic version of Goodhart's law: "When a measure becomes a target, it ceases to be a good measure." Here the IC are the measure, and the marketing worlds are targeting them, and hence they cease to be a good measure. But recall that the IC include the totality of approaches we use to rank these worlds, so there's no way around this problem. If instead of inspecting the worlds we simply rely on some sort of summary function, then the search will be optimised to find anything that can fool/pass that summary function. If we use the summary as a first filter, then apply some more profound automated checking, then briefly inspect the outcome so we're sure it didn't go stupid - then the search will be optimised for "pass the summary, pass automated checking, seduce the inspector".
Different IC will therefore produce different rankings of worlds, but the top worlds in any of these rankings will be marketing worlds (and possibly siren worlds).
Constrained search and satisficing our preferences
The issue is a problem of (over-)optimisation. The IC correspond roughly with what we want to value, but differ from it in subtle ways, enough that optimising for one could be disastrous for the other. If we didn't optimise, this wouldn't be a problem. Suppose we defined an acceptable world as one that we would judge "yeah, that's pretty cool" or even "yeah, that's really great". Then assume we selected randomly among the acceptable worlds. This would probably result in a world of positive value: siren worlds and marketing worlds are rare, because they fulfil very specific criteria. They triumph because they score so high on the IC scale, but they are outnumbered by the many more worlds that are simply acceptable.
This is in effect satisficing over the IC, rather than optimising over them. Satisficing has its own issues, however, so other approaches could be valuable as well. One way could be to use a constrained search. If, for instance, we took a thousand random worlds and IC-optimised over them, we'd be very unlikely to encounter a siren or marketing world. We'd also be very unlikely to encounter a world of any quality, though; we'd probably need to IC-optimise over at least a trillion worlds to find good ones. There is a tension in the number: as the number of worlds searched increases, their quality increases, but so do the odds of encountering a marketing or siren world. EDIT: Lumifer suggested using a first-past-the-post system: search through worlds, and pick the first acceptable one we find. This is better than the approach I outlined in this paragraph.
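As a toy comparison of these selection rules (every number below is made up; "marketing worlds" are simulated simply as rare worlds with hugely inflated IC scores and terrible true value):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# True value of each candidate world, and an IC score that roughly tracks it.
true_value = rng.normal(0.0, 1.0, n)
ic = true_value + rng.normal(0.0, 0.5, n)

# Rare marketing worlds: spectacular IC scores, awful true value (arbitrary numbers).
marketing = rng.random(n) < 1e-4
ic[marketing] += 10.0
true_value[marketing] -= 10.0

# Optimising: take the single world the IC rank highest.
print("IC-optimised pick, true value:", round(float(true_value[np.argmax(ic)]), 2))

# Satisficing: pick at random among all worlds that merely clear an acceptability bar.
acceptable = np.flatnonzero(ic > 2.0)
print("random acceptable pick, true value:", round(float(true_value[rng.choice(acceptable)]), 2))

# First-past-the-post (Lumifer's suggestion): scan in some order, keep the first acceptable world.
first = next(i for i in range(n) if ic[i] > 2.0)
print("first acceptable pick, true value:", round(float(true_value[first]), 2))
```

With these arbitrary parameters, the IC-optimised pick is almost certainly a marketing world with deeply negative true value, while a randomly chosen or first-found acceptable world is very probably an ordinary, genuinely decent one.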
We could also restrict the search by considering "realistic" worlds. Suppose we had to take 25 different yes-no decisions that could affect the future of humanity. This might be something like "choosing which of these 25 very different AIs to turn on and let loose together" or something more prosaic (which stocks to buy, which charities to support). This results in 2^25 different future worlds to search through: barely more than 33 million. Because there are so few worlds, they are unlikely to contain a marketing world (given the absolutely crucial proviso that none of the AIs is an IC-optimiser!). But these worlds are not drawn randomly from the space of future worlds; they depend on key decisions that we believe are important and relevant. Therefore they are very likely to contain an acceptable world - or at least far more likely than a random set of 33 million worlds would be. By constraining the choices in this way, we have in effect satisficed without satisficing, which is both Zen and useful.
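A sketch of that constrained, first-past-the-post search over the 2^25 decision combinations; the world model and the acceptability test below are pure stand-ins for whatever we would actually use.

```python
from itertools import product

N_DECISIONS = 25   # 2**25 = 33,554,432 candidate futures: "barely more than 33 million"

def world_from(decisions):
    # Stand-in: a real version would model what these 25 yes/no choices lead to;
    # here the "world" is just the decision tuple itself.
    return decisions

def acceptable(world):
    # Stand-in for the "yeah, that's pretty cool" acceptability judgement.
    return sum(world) >= 20

# Constrained, first-past-the-post search: stop at the first acceptable world found.
for decisions in product((0, 1), repeat=N_DECISIONS):
    if acceptable(world_from(decisions)):
        print("acceptable decision set:", decisions)
        break
```

In a real version one would presumably randomise the enumeration order, so that "first acceptable" isn't an artefact of how the decisions happen to be listed.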
As long as we're aware of the problem, other approaches may also allow for decent search without getting sucked in by a siren or a marketer.