
Siren worlds and the perils of over-optimised search

27 points - Post author: Stuart_Armstrong 07 April 2014 11:00AM

tl;dr An unconstrained search through possible future worlds is a dangerous way of choosing positive outcomes. Constrained, imperfect or under-optimised searches work better.

Some suggested methods for designing AI goals, or controlling AIs, involve unconstrained searches through possible future worlds. This post argues that this is a very dangerous thing to do, because of the risk of being tricked by "siren worlds" or "marketing worlds". The thought experiment starts with an AI designing a siren world to fool us, but that AI is not crucial to the argument: it's simply an intuition pump to show that siren worlds can exist. Once they exist, there is a non-zero chance of us being seduced by them during an unconstrained search, whatever the search criteria are. This is a feature of optimisation: satisficing and similar approaches don't have the same problems.

 

The AI builds the siren worlds

Imagine that you have a superintelligent AI that's not just badly programmed, or lethally indifferent, but actually evil. Of course, it has successfully concealed this fact, as "don't let humans think I'm evil" is a convergent instrumental goal for all AIs.

We've successfully constrained this evil AI in an Oracle-like fashion. We ask the AI to design future worlds and present them to human inspection, along with an implementation pathway to create those worlds. Then if we approve of those future worlds, the implementation pathway will cause them to exist (assume perfect deterministic implementation for the moment). The constraints we've programmed mean that the AI will do all these steps honestly. Its opportunity to do evil is limited exclusively to its choice of worlds to present to us.

The AI will attempt to design a siren world: a world that seems irresistibly attractive while concealing hideous negative features. If the human mind is hackable in the crude sense - maybe through a series of coloured flashes - then the AI would design the siren world to be subtly full of these hacks. It might be that there is some standard of "irresistibly attractive" that is actually irresistibly attractive: the siren world would be full of genuine sirens.

Even without those types of approaches, there's so much manipulation the AI could indulge in. I could imagine myself (and many people on Less Wrong) falling for the following approach:

First, the siren world looks complicated, wrong and scary - but with just a hint that there's something more to it. Something intriguing, something half-glimpsed, something making me want to dig deeper. And as I follow up this something, I see more patterns, and seem to gain a greater understanding. Not just of the world I'm looking at, but of the meaning of good itself. The world seems to confirm to me some of my ideas about what constitutes a worthwhile life - not just the ideas I've been able to articulate, but the ones I've only got a vague half-baked notion of, and the ones I didn't even know I had.

The longer I stare into this world, the greater an understanding I get of my own values. And this is just the starting point: the world subtly opens up new avenues of philosophical musings in my brain, the images I see triggering me to come up with my own insights and thought experiments as to the meaning of goodness - insights that are then carefully confirmed as I dig deeper. I could stay staring at this wonderful, beautiful and complex world for hours, days, years, gaining a deeper wisdom and understanding as I go, truly appreciating how the mysteries of this place unravel into new versions of worth and goodness. Every time I ever have a doubt about it, that doubt is systematically resolved into a new glorious confirmation of how much the AI really gets what I value, even before I knew how much I valued it.

Meanwhile, at some level of complexity safely beyond what my human mind will ever reach, the AI is hiding all the evil and unmitigated suffering.

Your siren world may differ from mine - you may be vulnerable to different tricks. Since people often believe they would be immune to such manipulation, feel free to imagine that the AI is fooling someone else - we know how stupid other people can be, right? - while you watch as a helpless bystander.

The problem is that a future universe is likely to be vast and complicated. When inspecting it, we have certain inspection criteria (IC). These consist of our values and preferences, but also the ways we would look into this universe, how we would follow up on initial impressions, various measures and yardsticks we might use to summarise the data we see, all the clever and devious plans we might come up with to ferret out "what's really going on". These IC are going to be considerably simpler than the totality of the future universe. So the AI's task is to optimise a universe design that passes the IC, while shoving in as much disutility as it can - which, in a large universe, is a tremendous amount. Unless our IC are perfect and already include a good solution to the problem of value (in which case we've solved the friendliness problem already), a superintelligent AI will likely succeed at its task.

 

Siren and marketing worlds without builders

The above thought experiment needed a superintelligent evil AI for the design of the siren world. But if we admit that that is possible, we don't actually need the AI any more. The siren worlds exist: there are potential worlds of extreme disutility that satisfy our IC. If we simply did an unconstrained search across all possible future worlds (something like the search in Paul Christiano's indirect normativity - an idea that inspired the siren world concept), then we would at some point find siren worlds. And if we took the time to inspect them, we'd get sucked in by them.

How bad is this problem in general? A full search will not only find the siren worlds, but also a lot of very-seductive-but-also-very-nice worlds - genuine eutopias. We may feel that it's easier to be happy than to pretend to be happy (while being completely miserable and tortured and suffering). Following that argument, we may feel that there will be far more eutopias than siren worlds - after all, the siren worlds have to have bad stuff plus a vast infrastructure to conceal that bad stuff, which should at least have a complexity cost if nothing else. So if we chose the world that best passed our IC - or chose randomly among the top contenders - we might be more likely to hit a genuine eutopia than a siren world.

Unfortunately, there are other dangers than siren worlds. We are now optimising not for quality of the world, but for ability to seduce or manipulate the IC. There's no hidden evil in this world, just a "pulling out all the stops to seduce the inspector, through any means necessary" optimisation pressure. Call a world that ranks high on this scale a "marketing world". Genuine eutopias are unlikely to be marketing worlds, because they are optimised for being good rather than seeming good. A marketing world would be utterly optimised to trick, hack, seduce, manipulate and fool our IC, and may well be a terrible world in all other respects. It's the old "to demonstrate maximal happiness, it's much more reliable to wire people's mouths to smile rather than make them happy" problem all over again: the very best way of seeming good may completely preclude actually being good. In a genuine eutopia, people won't go around all the time saying "Btw, I am genuinely happy!" in case there is a hypothetical observer looking in. If every one of your actions constantly proclaims that you are happy, chances are happiness is not your genuine state. EDIT: see also my comment:

We are both superintelligences. You have a bunch of independently happy people that you do not aggressively compel. I have a group of zombies - human-like puppets that I can make do anything, appear to feel anything (though this is done sufficiently well that outside human observers can't tell I'm actually in control). An outside human observer wants to check that our worlds rank high on scale X - a scale we both know about.

Which of us do you think is going to be better able to maximise our X score?

This can also be seen as an epistemic version of Goodhart's law: "When a measure becomes a target, it ceases to be a good measure." Here the IC are the measure, the marketing worlds are targeting them, and hence the IC cease to be a good measure. But recall that the IC include the totality of approaches we use to rank these worlds, so there's no way around this problem. If instead of inspecting the worlds, we simply rely on some sort of summary function, then the search will be optimised to find anything that can fool/pass that summary function. If we use the summary as a first filter, then apply some more profound automated checking, then briefly inspect the outcome so we're sure it didn't go stupid - then the search will be optimised for "pass the summary, pass automated checking, seduce the inspector".

Different IC will therefore produce different rankings of worlds, but the top worlds in any of the rankings will be marketing worlds (and possibly siren worlds).
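To make the layering point concrete, here is a minimal sketch (the function names and score fields are invented placeholders, not part of any actual proposal): however many checks we chain together, the optimiser simply faces their conjunction as its new target.

```python
# Illustrative sketch only: the individual checks are stand-ins.
def passes_summary(world):
    return world["summary_score"] > 0.9        # the cheap summary function

def passes_automated_check(world):
    return world["audit_score"] > 0.9          # the deeper automated checking

def seduces_inspector(world):
    return world["inspector_appeal"] > 0.9     # how good it looks on brief inspection

def combined_ic(world):
    # The three layers collapse into a single measure...
    return passes_summary(world) and passes_automated_check(world) and seduces_inspector(world)

def search(candidate_worlds):
    # ...and a search over enough candidates is, in effect, optimising for
    # "pass the summary, pass automated checking, seduce the inspector".
    return [w for w in candidate_worlds if combined_ic(w)]
```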

 

Constrained search and satisficing our preferences

The issue is a problem of (over) optimisation. The IC correspond roughly with what we want to value, but differ from it in subtle ways, enough that optimising for one could be disastrous for the other. If we didn't optimise, this wouldn't be a problem. Suppose we defined an acceptable world as one that we would judge "yeah, that's pretty cool" or even "yeah, that's really great". Then assume we selected randomly among the acceptable worlds. This would probably result in a world of positive value: siren worlds and marketing worlds are rare, because they fulfil very specific criteria. They triumph because they score so high on the IC scale, but they are outnumbered by the many more worlds that are simply acceptable.

This is in effect satisficing over the IC, rather than optimising over them. Satisficing has its own issues, however, so other approaches could be valuable as well. One way could be to use a constrained search. If, for instance, we took a thousand random worlds and IC-optimised over them, we'd be very unlikely to encounter a siren or marketing world. We'd also be very unlikely to encounter a world of any quality, though; we'd probably need to IC-optimise over at least a trillion worlds to find good ones. There is a tension in the number: as the number of worlds searched increases, their quality increases, but so do the odds of encountering a marketing or siren world. EDIT: Lumifer suggested using a first-past-the-post system: search through worlds, and pick the first acceptable one we find. This is better than the approach I outlined in this paragraph.
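As a purely illustrative toy model (every number below is invented), suppose each candidate world has a hidden true value and a visible proxy score standing in for the IC, with the proxy tracking true value except on rare "marketing" candidates whose score is inflated. We can then compare optimising, satisficing, and Lumifer's first-past-the-post rule:

```python
import random

random.seed(0)

# Toy model (purely illustrative): each candidate world has a hidden true value
# and a visible proxy score standing in for the IC. The proxy tracks the true
# value plus noise, except for rare "marketing" candidates whose proxy score is
# inflated while their true value is terrible.
def sample_world():
    if random.random() < 1e-4:                      # rare marketing/siren candidate
        return -5.0, 10.0 + random.gauss(0.0, 1.0)  # awful, but scores spectacularly
    true_value = random.gauss(0.0, 1.0)
    return true_value, true_value + random.gauss(0.0, 0.5)

worlds = [sample_world() for _ in range(1_000_000)]
ACCEPTABLE = 2.0   # proxy score we'd call "yeah, that's really great"

# Optimise: take the single highest proxy score.
opt_true, _ = max(worlds, key=lambda w: w[1])

# Satisfice: pick randomly among all acceptable worlds.
sat_true, _ = random.choice([w for w in worlds if w[1] >= ACCEPTABLE])

# First-past-the-post (Lumifer's suggestion): take the first acceptable world found.
fptp_true, _ = next(w for w in worlds if w[1] >= ACCEPTABLE)

print(f"optimised pick:        true value {opt_true:+.2f}")
print(f"satisficed pick:       true value {sat_true:+.2f}")
print(f"first acceptable pick: true value {fptp_true:+.2f}")
```

With these made-up numbers the optimised pick is almost always a marketing candidate with terrible true value, while the satisficed and first-acceptable picks are almost certainly ordinary good worlds; the sketch only illustrates the shape of the argument, not realistic magnitudes.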

We could also restrict the search by considering "realistic" worlds. Suppose we had to take 25 different yes-no decisions that could affect the future of humanity. This might be something like "choosing which of these 25 very different AIs to turn on and let loose together" or something more prosaic (which stocks to buy, which charities to support). This results in 2^25 different future worlds to search through: barely more than 33 million. Because there are so few worlds, they are unlikely to contain a marketing world (given the absolutely crucial proviso that none of the AIs is an IC-optimiser!). But these worlds are not drawn randomly from the space of future worlds, but are dependent on key decisions that we believe are important and relevant. Therefore they are very likely to contain an acceptable world - or at least far more likely than a random set of 33 million worlds would be. By constraining the choices in this way, we have in effect satisficed without satisficing, which is both Zen and useful.
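For concreteness, a quick sketch of the size of that constrained space (the enumeration is purely illustrative):

```python
from itertools import islice, product

NUM_DECISIONS = 25
print(2 ** NUM_DECISIONS)  # 33554432 -- barely more than 33 million candidate futures

# Each "realistic" candidate world is one assignment of the 25 yes/no choices, so
# an exhaustive pass over all of them is feasible, unlike a search over the space
# of all possible future worlds.
def candidate_worlds():
    return product((False, True), repeat=NUM_DECISIONS)

print(list(islice(candidate_worlds(), 3)))  # the first few decision vectors
```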

As long as we're aware of the problem, other approaches may also allow for decent search without getting sucked in by a siren or a marketer.

Comments (411)

Comment author: paulfchristiano 28 October 2015 07:59:39PM 0 points [-]

It's not really clear why you would have the searching process be more powerful than the evaluating process, if using such a "search" as part of a hypothetical process in the definition of "good."

Note that in my original proposal (that I believe motivated this post) the only brute force searches were used to find formal descriptions of physics and human brains, as a kind of idealized induction, not to search for "good" worlds.

Comment author: Stuart_Armstrong 13 November 2015 12:37:11PM 0 points [-]

It's not really clear why you would have the searching process be more powerful than the evaluating process

Because the first supposes a powerful AI, while the second supposes an excellent evaluation process (essentially a value alignment problem solved).

Your post motivated this in part, but it's a more general issue with optimisation processes and searches.

Comment author: paulfchristiano 15 November 2015 01:11:46AM 0 points [-]

Neither the search nor the evaluation presupposes an AI when a hypothetical process is used as the definition of "good."

Comment author: PhilosophyTutor 28 April 2014 12:13:45AM *  1 point [-]

It seems based on your later comments that the premise of marketing worlds existing relies on there being trade-offs between our specified wants and our unspecified wants, so that the world optimised for our specified wants must necessarily be highly likely to be lacking in our unspecified ones ("A world with maximal bananas will likely have no apples at all").

I don't think this is necessarily the case. If I only specify that I want low rates of abortion, for example, then I think it highly likely that I'd get a world that also has low rates of STD transmission, unwanted pregnancy, poverty, sexism and religiosity, because they all go together. I think you could specify any one of those variables and almost all of the time you would get all the rest as a package deal without specifying them.

Of course a malevolent AI could probably deliberately construct a siren world to maximise one of those values and tank the rest but such worlds seem highly unlikely to arise organically. The rising tide of education, enlightenment, wealth and egalitarianism lifts most of the important boats all at once, or at least that is how it seems to me.

Comment author: Stuart_Armstrong 28 April 2014 11:45:26AM *  0 points [-]

on there being trade-offs between our specified wants and our unspecified wants

Yes, certainly. That's a problem of optimisation with finite resources. If A is a specified want and B is an unspecified want, then we shouldn't confuse "there are worlds with high A and also high B" with "the world with the highest A will also have high B".

Comment author: Stuart_Armstrong 28 April 2014 09:32:58AM 0 points [-]

If I only specify that I want low rates of abortion, for example,

You would get a world with no conception, or possibly with no humans at all.

Comment author: PhilosophyTutor 28 April 2014 11:21:16AM *  1 point [-]

I don't think you have highlighted a fundamental problem since we can just specify that we mean a low percentage of conceptions being deliberately aborted in liberal societies where birth control and abortion are freely available to all at will.

My point, though, is that I don't think it is very plausible that "marketing worlds" will organically arise where there are no humans, or no conception, but which tick all the other boxes we might think to specify in our attempts to describe an ideal world. I don't see how there being no conception or no humans could possibly be a necessary trade-off with things like wealth, liberty, rationality, sustainability, education, happiness, the satisfaction of rational and well-informed preferences and so forth.

Of course a sufficiently God-like malevolent AI could presumably find some way of gaming any finite list we give it, since there are probably an unbounded number of ways of bringing about horrible worlds, so this isn't a problem with the idea of siren worlds. I just don't find the idea of marketing worlds very plausible, because so many of the things we value are fundamentally interconnected.

Comment author: Stuart_Armstrong 28 April 2014 11:42:33AM 0 points [-]

The "no conception" example is just to illustrate that bad things happen when you ask an AI to optimise along a certain axis without fully specifying what we want (which is hard/impossible).

A marketing world is fully optimised along the "convince us to choose this world" axis. If, at any point, the AI is confronted with a choice along the lines of "remove genuine liberty to best give the appearance of liberty/happiness", it will choose to do so.

That's actually the most likely way a marketing world could go wrong - the more control the AI has over people's appearance and behaviour, the more capable it is of making the world look good. So I feel we should presume that discreet-but-total AI control over the world's "inhabitants" would be the default in a marketing world.

Comment author: PhilosophyTutor 28 April 2014 09:03:29PM 3 points [-]

I think this and the "finite resources therefore tradeoffs" argument both fail to take seriously the interconnectedness of the optimisation axes which we as humans care about.

They assume that every possible aspect of society is an independent slider which a sufficiently advanced AI can position at will, even though this society is still going to be made up of humans, will have to be brought about by or with the cooperation of humans and will take time to bring about. These all place constraints on what is possible because the laws of physics and human nature aren't infinitely malleable.

I don't think discreet but total control over a world is compatible with things like liberty, which seem like obvious qualities to specify in an optimal world we are building an AI to search for.

I think what we might be running in to here is less of an AI problem and more of a problem with the model of AI as an all-powerful genie capable of absolutely anything with no constraints whatsoever.

Comment author: Stuart_Armstrong 29 April 2014 09:28:38AM *  0 points [-]

I don't think discreet but total control over a world is compatible with things like liberty

Precisely and exactly! That's the whole of the problem - optimising for one thing (appearance) results in the loss of other things we value.

which seem like obvious qualities to specify in an optimal world we are building an AI to search for.

Next challenge: define liberty in code. This seems extraordinarily difficult.

model of AI as an all-powerful genie capable of absolutely anything with no constraints whatsoever.

So we do agree that there are problems with an all-powerful genie? Once we've agreed on that, we can scale back to lower AI power, and see how the problems change.

(the risk is not so much that the AI would be an all powerful genie, but that it could be an all powerful genie compared with humans).

Comment author: PhilosophyTutor 29 April 2014 11:29:33AM 3 points [-]

Precisely and exactly! That's the whole of the problem - optimising for one thing (appearance) results in the loss of other things we value.

This just isn't always so. If you instruct an AI to optimise a car for speed, efficiency and durability but forget to specify that it has to be aerodynamic, you aren't going to get a car shaped like a brick. You can't optimise for speed and efficiency without optimising for aerodynamics too. In the same way it seems highly unlikely to me that you could optimise a society for freedom, education, just distribution of wealth, sexual equality and so on without creating something pretty close to optimal in terms of unwanted pregnancies, crime and other important axes.

Even if it's possible to do this, it seems like something which would require extra work and resources to achieve. A magical genie AI might be able to make you a super-efficient brick-shaped car by using Sufficiently Advanced Technology indistinguishable from magic but even for that genie it would have to be more work than making an equally optimal car by the defined parameters that wasn't a silly shape. In the same way an effectively God-like hypothetical AI might be able to make a siren world that optimised for everything except crime and create a world perfect in every way except that it was rife with crime but it seems like it would be more work, not less.

Next challenge: define liberty in code. This seems extraordinarily difficult.

I think if we can assume we have solved the strong AI problem, we can assume we have solved the much lesser problem of explaining liberty to an AI.

So we do agree that there are problem with an all-powerful genie?

We've got a problem with your assumptions about all-powerful genies, I think, because I think your argument relies on the genie being so ultimately all-powerful that it is exactly as easy for the genie to make an optimal brick-shaped car or an optimal car made out of tissue paper and post-it notes as it is for the genie to make an optimal proper car. I don't think that genie can exist in any remotely plausible universe.

If it's not all-powerful to that extreme then it's still going to be easier for the genie to make a society optimised (or close to it) across all the important axes at once than one optimised across all the ones we think to specify while tanking all the rest. So for any reasonable genie I still think marketing worlds don't make sense as a concept. Siren worlds, sure. Marketing worlds, not so much, because the things we value are deeply interconnected and you can't just arbitrarily dump-stat some while efficiently optimising all the rest.

Comment author: Strange7 02 May 2014 03:10:43PM 0 points [-]

This just isn't always so. If you instruct an AI to optimise a car for speed, efficiency and durability but forget to specify that it has to be aerodynamic, you aren't going to get a car shaped like a brick. You can't optimise for speed and efficiency without optimising for aerodynamics too.

Unless you start by removing the air, in some way that doesn't count against the car's efficiency.

Comment author: Stuart_Armstrong 29 April 2014 12:07:41PM 0 points [-]

I think if we can assume we have solved the strong AI problem, we can assume we have solved the much lesser problem of explaining liberty to an AI.

The strong AI problem is much easier to solve than the problem of motivating an AI to respect liberty. For instance, the first one can be brute forced (eg AIXItl with vast resources), the second one can't. Having the AI understand human concepts of liberty is pointless unless it's motivated to act on that understanding.

An excess of anthropomorphisation is bad, but an analogy could be about creating new life (which humans can do) and motivating that new life to follow specific rules and requirements if they become powerful (which humans are pretty bad at).

Comment author: PhilosophyTutor 29 April 2014 09:40:30PM *  4 points [-]

The strong AI problem is much easier to solve than the problem of motivating an AI to respect liberty. For instance, the first one can be brute forced (eg AIXItl with vast resources), the second one can't.

I don't believe that strong AI is going to be as simple to brute force as a lot of LessWrongers believe, personally, but if you can brute force strong AI then you can just get it to run a neuron-by-neuron simulation of the brain of a reasonably intelligent first year philosophy student who understands the concept of liberty and tell the AI not to take actions which the simulated brain thinks offend against liberty.

That is assuming that in this hypothetical future scenario where we have a strong AI we are capable of programming that strong AI to do any one thing instead of another, but if we cannot do that then the entire discussion seems to me to be moot.

Comment author: [deleted] 01 May 2014 07:44:42AM -2 points [-]

That is assuming that we are capable of programming a strong AI to do any one thing instead of another, but if we cannot do that then the entire discussion seems to me to be moot.

And therein lies the rub. Current research-grade AGI formalisms don't actually allow us to specifically program the agent for anything, not even paperclips.

Comment author: Stuart_Armstrong 30 April 2014 04:55:07AM 0 points [-]

tell the AI not to take actions which the simulated brain thinks offend against liberty.

How? "tell", "the simulated brain thinks", "offend": defining those incredibly complicated concepts contains nearly the entirety of the problem.

Comment author: Nornagest 29 April 2014 10:17:07PM 6 points [-]

then [...] run a neuron-by-neuron simulation of the brain of a reasonably intelligent first year philosophy student who understands the concept of liberty and tell the AI not to take actions which the simulated brain thinks offend against liberty.

I've met far too many first-year philosophy students to be comfortable with this program.

Comment author: drnickbone 29 April 2014 10:18:15AM *  0 points [-]

This also creates some interesting problems... Suppose a very powerful AI is given human liberty as a goal (or discovers that this is a goal using coherent extrapolated volition). Then it could quickly notice that its own existence is a serious threat to that goal, and promptly destroy itself!

Comment author: PhilosophyTutor 29 April 2014 11:34:15AM 0 points [-]

I think Asimov did this first with his Multivac stories, although rather than promptly destroy itself Multivac executed a long-term plan to phase itself out.

Comment author: Stuart_Armstrong 29 April 2014 11:00:27AM 1 point [-]

yes, but what about other AIs that might be created, maybe without liberty as a top goal - it would need to act to prevent them from being built! It's unlikely that "destroy itself" is the best option it can find...

Comment author: drnickbone 29 April 2014 11:30:44AM 0 points [-]

Except that acting to prevent other AIs from being built would also encroach on human liberty, and probably in a very major way if it was to be effective! The AI might conclude from this that liberty is a lost cause in the long run, but it is still better to have a few extra years of liberty (until the next AI gets built), rather than ending it right now (through its own powerful actions).

Other provocative questions: how much is liberty really a goal in human values (when taking the CEV for humanity as a whole, not just liberal intellectuals)? How much is it a terminal goal, rather than an instrumental goal? Concretely, would humans actually care about being ruled over by a tyrant, as long as it was a good tyrant? (Many people are attracted to the idea of an all-powerful deity for instance, and many societies have had monarchs who were worshipped as gods.) Aren't mechanisms like democracy, separation of powers etc mostly defence mechanisms against a bad tyrant? Why shouldn't a powerful "good" AI just dispense with them?

Comment author: Stuart_Armstrong 29 April 2014 12:08:46PM 0 points [-]

A certain impression of freedom is valued by humans, but we don't seem to want total freedom as a terminal goal.

Comment author: drnickbone 22 April 2014 08:22:29AM 2 points [-]

One issue here is that worlds with an "almost-friendly" AI (one whose friendliness was botched in some respect) may end up looking like siren or marketing worlds.

In that case, worlds as bad as sirens will be rather too common in the search space (because AIs with botched friendliness are more likely than AIs with true friendliness) and a satisficing approach won't work.

Comment author: Stuart_Armstrong 25 April 2014 09:43:34AM 1 point [-]

Interesting thought there...

Comment author: DanielLC 15 April 2014 07:05:15AM 0 points [-]

I think the wording here is kind of odd.

An unconstrained search will not find a siren world, or even a very good world. There are simply too many to consider. The problem is that you're likely to design an AI that finds worlds that you'd like. It may or may not actually show you anything, but you program it to give you what it thinks you'd rate the best. You're essentially programming it to design a siren world. It won't intentionally hide anything dark under there, but it will spend way too much effort on things that make the world look good. It might even end up with dark things hidden, just because they were somehow necessary to make it look that good.

Comment author: Stuart_Armstrong 17 April 2014 11:19:08AM 0 points [-]

It won't intentionally hide anything dark under there, but it will spend way too much effort on things that make the world look good.

That's a marketing world, not a siren world.

Comment author: DanielLC 17 April 2014 06:09:04PM 0 points [-]

What's the difference?

Comment author: Stuart_Armstrong 18 April 2014 08:51:30AM 0 points [-]

Siren worlds are optimised to be bad and hide this fact. Marketing worlds are optimised to appear good, and the badness is an indirect consequence of this.

Comment author: fractalcat 14 April 2014 10:50:01AM 0 points [-]

I'm not totally sure of your argument here; would you be able to clarify why satisficing is superior to a straight maximization given your hypothetical[0]?

Specifically, you argue correctly that human judgement is informed by numerous hidden variables over which we have no awareness, and thus a maximization process executed by us has the potential for error. You also argue that 'eutopian'/'good enough' worlds are likely to be more common than sirens. Given that, how is a judgement with error induced by hidden variables any worse than a judgement made using deliberate randomization (or selecting the first 'good enough' world, assuming no unstated special properties of our worldspace-traversal)? Satisficing might be more computationally efficient, but that doesn't seem to be the argument you're making.

[0] The ex-nihilo siren worlds rather than the designed ones; an evil AI presumably has knowledge of our decision process and can create perfectly-misaligned worlds.

Comment author: Stuart_Armstrong 17 April 2014 11:22:56AM 1 point [-]

Siren and marketing worlds are rarer than eutopias, but rank higher on our maximisation scale. So picking a world among the "good enough" will likely be a eutopia, but picking the top-ranked world will likely be a marketing world.

Comment author: lavalamp 09 April 2014 06:37:22PM 0 points [-]

The IC correspond roughly with what we want to value, but differ from it in subtle ways, enough that optimising for one could be disastrous for the other. If we didn't optimise, this wouldn't be a problem. Suppose we defined an acceptable world as one that we would judge "yeah, that's pretty cool" or even "yeah, that's really great". Then assume we selected randomly among the acceptable worlds. This would probably result in a world of positive value: siren worlds and marketing worlds are rare, because they fulfil very specific criteria. They triumph because they score so high on the IC scale, but they are outnumbered by the many more worlds that are simply acceptable.

Implication: the higher you set your threshold of acceptability, the more likely you are to get a horrific world. Counter-intuitive to say the least.

Comment author: Eugine_Nier 11 April 2014 01:57:45AM 2 points [-]

Counter-intuitive to say the least.

Why? This agrees with my intuition, ask for too much and you wind up with nothing.

Comment author: lavalamp 11 April 2014 10:54:57PM 2 points [-]

It sounds like, "the better you do maximizing your utility function, the more likely you are to get a bad result," which can't be true with the ordinary meanings of all those words. The only ways I can see for this to be true is if you aren't actually maximizing your utility function, or your true utility function is not the same as the one you're maximizing. But then you're just plain old maximizing the wrong thing.

Comment author: Stuart_Armstrong 17 April 2014 11:09:18AM *  0 points [-]

But then you're just plain old maximizing the wrong thing.

Er, yes? But we don't exactly have the right thing lying around, unless I've missed some really exciting FAI news...

Comment author: lavalamp 17 April 2014 06:13:05PM *  5 points [-]

Absolutely, granted. I guess I just found this post to be an extremely convoluted way to make the point of "if you maximize the wrong thing, you'll get something that you don't want, and the more effectively you achieve the wrong goal, the more you diverge from the right goal." I don't see that the existence of "marketing worlds" makes maximizing the wrong thing more dangerous than it already was.

Additionally, I'm kinda horrified about the class of fixes (of which the proposal is a member) which involve doing the wrong thing less effectively. Not that I have an actual fix in mind. It just sounds like a terrible idea--"we're pretty sure that our specification is incomplete in an important, unknown way. So we're going to satisfice instead of maximize when we take over the world."

Comment author: simon 11 April 2014 03:10:01AM 5 points [-]

"ask for too much and you wind up with nothing" is a fine fairy tale moral. Does it actually hold in these particular circumstances?

Imagine that there's a landscape of possible worlds. There is a function (A) on this landscape; we don't know how to define it, but it is how much we would truly prefer a world if only we knew. Somewhere this function has a peak, the most ideal "eutopia". There is another function. This one we do define. It is intended to approximate the first function, but it does not do so perfectly. Our "acceptability criterion" is to require that this second function (B) has a value of at least some threshold.

Now as we raise the acceptability criteria (threshold for function B), we might expect there to be two different regimes. In a first regime with low acceptability criteria, Function B is not that bad a proxy for function A, and raising the threshold increases the average true desirability of the worlds that meet it. In a second regime with high acceptability criteria, function B ceases to be effective as a proxy. Here we are asking for "too much". The peak of function B is at a different place than the peak of function A, and as we raise the threshold high enough we exclude the peak of A entirely. What we end up with is a world highly optimized for B and not so well optimized for A - a "marketing world".

So, we must conclude, like you and Stuart Armstrong, that asking for "too much" is bad and we'd better set a lower threshold. Case closed, right?

Wrong.

The problem is that the above line of reasoning provides no reason to believe that the "marketing world" at the peak of function B is any worse than a random world at any lower threshold. As we relax the threshold on B, we include more worlds that are better in terms of A but also more that are worse. There's no particular reason to believe, simply because the peak of B is at a different place than the peak of A, that the peak of B is at a valley of A. In fact, if B represents our best available estimate of A, it would seem that, even though the peak of B is predictably a marketing world, it's still our best bet at getting a good value of A. A random world at any lower threshold should have a lower expected value of A.

Comment author: Stuart_Armstrong 17 April 2014 11:34:47AM 0 points [-]

The problem is that the above line of reasoning provides no reason to believe that the "marketing world" at the peak of function B is any worse than a random world at any lower threshold.

True. Which is why I added arguments pointing out that a marketing world will likely be bad. Even on your terms, a peak of B will probably involve a diversion of effort/energy that could have contributed to A, away from A. E.g. if A is apples and B is bananas, the world with the most bananas is likely to contain no apples at all.

Comment author: Froolow 09 April 2014 02:12:30PM 5 points [-]

This puts me in mind of a thought experiment Yvain posted a while ago (I’m certain he’s not the original author, but I can’t for the life of me track it any further back than his LiveJournal):

“A man has a machine with a button on it. If you press the button, there is a one in five million chance that you will die immediately; otherwise, nothing happens. He offers you some money to press the button once. What do you do? Do you refuse to press it for any amount? If not, how much money would convince you to press the button?”

This is – I think – analogous to your ‘siren world’ thought experiment. Rather than pushing the button once for £X, every time you push the button the AI simulates a new future world and at any point you can stop and implement the future that looks best to you. You have a small probability of uncovering a siren world, which you will be forced to choose because it will appear almost perfect (although you may keep pressing the button after uncovering the siren world and uncover an even more deviously concealed siren, or even a utopia which is better than the original siren). How often do you simulate future worlds before forcing yourself to implement the best so far to maximize your expected utility?

Obviously the answer depends on how probable siren worlds are and how likely it is that the current world will be overtaken by a superior world on the next press (which is equivalent to a function where the probability of earning money on the next press is inversely related to how much money you already have). In fact, if the probability of a siren world is sufficiently low, it may be worthwhile to take the risk of generating worlds without constraints in case the AI can simulate a world substantially better than the best-optimised world changing only the 25 yes-no questions, even if we know that the 25 yes-no questions will produce a highly livable world.
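As a rough back-of-the-envelope illustration of that trade-off (every number here is invented), suppose each press yields an honest world whose apparent value equals its true value, except that with some small probability it yields a siren that looks better than any honest world:

```python
# Rough expected-value sketch of the button game (all numbers invented).
# Each press yields an honest world with apparent value = true value ~ Uniform(0, 1),
# except with probability P_SIREN it yields a siren that *looks* better than any
# honest world but has true value V_SIREN. After n presses we must take the
# best-looking world found so far.
P_SIREN = 1e-4
V_SIREN = -100.0

def expected_true_value(n_presses):
    p_no_siren = (1 - P_SIREN) ** n_presses
    best_honest = n_presses / (n_presses + 1)   # E[max of n uniforms]
    # If any siren turned up, it outshines everything and gets picked.
    return p_no_siren * best_honest + (1 - p_no_siren) * V_SIREN

for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} presses: expected true value {expected_true_value(n):+.3f}")
```

With these particular numbers the optimum is only a handful of presses; with a much rarer or less catastrophic siren, many more presses would win instead - which is exactly the dependence described above.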

Of course, if the AI can lie to you about whether a world is good or not (which seems likely) or can produce possible worlds in a non-random fashion, increasing the risk of generating a siren world (which also seems likely), then you should never push the button, because of the risk that you would be unable to stop yourself implementing the siren world which – almost inevitably – will be generated on the first try. If we can prove the best-possible utopia is better than the best-possible siren even given IC constraints (which seems unlikely) or that the AI we have is definitely Friendly (could happen, you never know… :p ) then we should push the button an infinite number of times. But excluding these edge cases, it seems likely the optimal decision will not be constrained in the way you describe, but more likely an unconstrained but non-exhaustive search – a finite number of pushes on our random-world button rather than an exhaustive search of a constrained possibility space.

Comment author: Stuart_Armstrong 17 April 2014 11:16:18AM 0 points [-]

a finite number of pushes on our random-world button

I consider that is also a constrained search!

Comment author: Brillyant 09 April 2014 02:03:25PM 1 point [-]

Upvoted for use of images. Though sort of tabooed on LW, when used well, they work.

Comment author: lavalamp 08 April 2014 11:12:26PM *  2 points [-]

TL;DR: Worlds which meet our specified criteria but fail to meet some unspecified but vital criteria outnumber (vastly?) worlds that meet both our specified and unspecified criteria.

Is that an accurate recap? If so, I think there's two things that need to be proven:

  1. There will with high probability be important unspecified criteria in any given predicate.

  2. The nature of the unspecified criteria is such that it is unfulfilled in a large majority of worlds which fulfill the specified criteria.

(1) is commonly accepted here (rightly so, IMO). But (2) seems to greatly depend on the exact nature of the stuff that you fail to specify and I'm not sure how it can be true in the general case.

EDIT: The more I think about this, the more I'm confused. I don't see how this adds any substance to the claim that we don't know how to write down our values.

EDIT2: If we get to the stage where this is feasible, we can measure the size of the problem by only providing half of our actual constraints to the oracle AI and measuring the frequency with which the hidden half happen to get fulfilled.

Comment author: Stuart_Armstrong 17 April 2014 11:04:23AM 0 points [-]

The nature of the unspecified criteria is such that it is unfulfilled in a large majority of worlds which fulfill the specified criteria.

That's not exactly my claim. My claim is that things that are the best optimised for fulfilling our specified criteria are unlikely to satisfy our unspecified ones. It's not a question of outnumbering (siren and marketing worlds are rare) but of scoring higher on our specified criteria.

Comment author: Eugine_Nier 09 April 2014 01:39:28AM 3 points [-]

The more I think about this, the more I'm confused. I don't see how this adds any substance to the claim that we don't know how to write down our values.

This proposes a way to get an OK result even if we don't quite write down our values correctly.

Comment author: lavalamp 09 April 2014 06:39:18PM 0 points [-]

Ah, thank you for the explanation. I have complained about the proposed method in another comment. :)

http://lesswrong.com/lw/jao/siren_worlds_and_the_perils_of_overoptimised/aso6

Comment author: Eliezer_Yudkowsky 08 April 2014 12:08:26AM 7 points [-]

This indeed is why "What a human would think of a world, given a defined window process onto a world" was not something I considered as a viable form of indirect normativity / an alternative to CEV.

Comment author: Stuart_Armstrong 08 April 2014 09:33:41AM 8 points [-]

To my mind, the interesting part is the whole constrained-search/satisficing idea, which may allow such an approach to be used.

Comment author: shminux 07 April 2014 06:41:16PM 1 point [-]

I don't understand why you imply that an evil Oracle will not be able to present only or mostly the evil possible worlds disguised as good. My guess would be that satisficing gets you into just as much trouble as optimizing.

Comment author: Stuart_Armstrong 07 April 2014 06:45:15PM *  6 points [-]

The evil oracle is mainly to show the existence of siren worlds; and if we use an evil oracle for satisficing, we're in just as much trouble as if we were maximising (probably more trouble, in fact).

The marketing and siren worlds are a problem even without any evil oracle, however. For instance a neutral, maximising, oracle would serve us up a marketing world.

Comment author: itaibn0 07 April 2014 10:23:55PM 3 points [-]

For instance a neutral, maximising, oracle would serve us up a marketing world.

Why do you think that? You seem to think that along the higher ends, appearing to be good actually anti-correlates with being good. I think it is plausible that the outcome optimized to appear good to me actually is good. There may be many outcomes that appear very good but are actually very bad, but I don't see why they would be favoured in being the best-appearing. I admit though that the best-appearing outcome is unlikely to be the optimal outcome, assuming 'optimal' here means anything.

Comment author: Stuart_Armstrong 08 April 2014 02:05:32PM 4 points [-]

We are both superintelligences. You have a bunch of independently happy people that you do not aggressively compel. I have a group of zombies - human-like puppets that I can make do anything, appear to feel anything (though this is done sufficiently well that outside human observers can't tell I'm actually in control). An outside human observer wants to check that our worlds rank high on scale X - a scale we both know about.

Which of us do you think is going to be better able to maximise our X score?

Comment author: [deleted] 09 April 2014 08:17:53AM -1 points [-]

Hold on. I'm not sure the Kolmogorov complexity of a superintelligent siren with a bunch of zombies that are indistinguishable from real people up to extensive human observation is actually lower than the complexity of a genuinely Friendly superintelligence. After all, a Siren World is trying to deliberately seduce you, which means that it both understands your values and cares about you in the first place.

Sure, any Really Powerful Learning Process could learn to understand our values. The question is: are there more worlds where a Siren cares about us but doesn't care about our values than there are worlds in which a Friendly agent cares about our values in general and caring about us as people falls out of that? My intuitions actually say the latter is less complex, because the caring-about-us falls out as a special case of something more general, which means the message length is shorter when the agent cares about my values than when it cares about seducing me.

Hell, a Siren agent needs to have some concept of seduction built into its utility function, at least if we're assuming the Siren is truly malicious rather than imperfectly Friendly. Oh, and a philosophically sound approach to Friendliness should make imperfectly Friendly futures so unlikely as to be not worth worrying about (a failure to do so is a strong sign you've got Friendliness wrong).

All of which, I suppose, reinforces your original reasoning on the "frequency" of Siren worlds, marketing worlds, and Friendly eutopias in the measure space of potential future universes, but makes this hypothetical of "playing as the monster" sound quite unlikely.

Comment author: Stuart_Armstrong 17 April 2014 11:07:33AM 0 points [-]

Kolmogorov complexity is not relevant; siren worlds are indeed rare. They are only a threat because they score so high on an optimisation scale, not because they are common.

Comment author: itaibn0 08 April 2014 11:45:28PM 1 point [-]

I'm not sure what the distinction you're making is. Even a free-minded person can be convinced through reason to act in certain ways, sometimes highly specific ways. Since you assume the superintelligence will manipulate people so subtly that I won't be able to tell they're being manipulated, it is unlikely that they are directly coerced. This is important, since while I don't like direct coercion, the less direct the method of persuasion the less certain I am that this method of persuasion is bad. These "zombies" are not being threatened, nor lied to, nor is their neurochemistry directly altered, nor is anything else done to them that seems to me like coercion, and yet they are supposedly being coerced. This seems to me as sensical as the other type of zombies.

But suppose I'm missing something, and there is a genuine non-arbitrary distinction between being convinced and being coerced. Then with my current knowledge I think I want people not to be coerced. But now an output pump can take advantage of this. Consider the following scenario: Humans are convinced that their existence depends on their behavior being superficially appealing, perhaps by being full of flashing lights. If my decisions in front of an Oracle will influence the future of humanity, this belief is in fact correct; they're not being deceived. Convinced of this, they structure their society to be as superficially appealing as possible. In addition, in the layers too deep for me to notice, they do whatever they want. This outcome seems superficially appealing to me in many ways, and in addition, the Oracle informs me that in some non-arbitrary sense these people aren't being coerced. Why wouldn't this be the outcome I pick? Again, I don't think this outcome would be the best one, since I think people are better off not being forced into this trade-off.

One point you can challenge is whether the Oracle will inform me about this non-arbitrary criterion. Since it already can locate people and reveal their superficial feelings this seems plausible. Remember, it's not showing me this because revealing whether there's genuine coercion is important, it's showing me this because satisfying a non-arbitrary criterion of non-coercion improves the advertising pitch (along with the flashing lights).

So is there a non-arbitrary distinction between being coerced and not being coerced? Either way I have a case. The same template can be used for all other subtle and indirect values.

(Sidenote: I also think that the future outcomes that are plausible and those that are desirable do not involve human beings mattering. I did not pursue this point since that seems to sidestep your argument rather than respond to it.)

Comment author: Stuart_Armstrong 17 April 2014 11:12:11AM 0 points [-]

But suppose I'm missing something, and there is a genuine non-arbitrary distinction between being convinced and being coerced.

There need not be a distinction between them. If you prefer, you could contrast an AI willing to "convince" its humans to behave in any way required, with one that is unwilling to sacrifice their happiness/meaningfulness/utility to do so. The second is still at a disadvantage.

Comment author: itaibn0 23 April 2014 02:31:03PM 0 points [-]

Remember that my original point is that I believe appearing to be good correlates with goodness, even in extreme circumstances. Therefore, I expect restructuring humans to make the world appear tempting will be to the benefit of their happiness/meaningfulness/utility. Now, I'm willing to consider that there are aspects of goodness which are usually not apparent to an inspecting human (although this moves to the borderline of where I think 'goodness' is well-defined). However, I don't think these aspects are more likely to be satisfied in a satisficing search than in an optimizing search.

Comment author: Gunnar_Zarncke 09 April 2014 11:52:46AM 0 points [-]

[...] they structure their society to be as superficially appealing as possible. In addition, in the layers too deep for me to notice, they do whatever they want. This outcome seems superficially appealing to me in many ways, and in addition, the Oracle informs me that in some non-arbitrary sense these people aren't being coerced.

This actually describes quite well the society we already live in - if you take 'they' as 'evolution' (and maybe some elites). For most people our society appears appealing. Most don't see what happens enough layers down (or up). And most don't feel coerced (at least if you still have a strong social system).

Comment author: [deleted] 09 April 2014 08:28:31AM 0 points [-]

I also think that the future outcomes that are plausible and those that are desirable do not involve human beings mattering.

Would you mind explaining what you consider a desirable future in which people just don't matter?

Comment author: itaibn0 17 April 2014 10:50:24PM 0 points [-]

Here's the sort of thing I'm imagining:

In the beginning there are humans. Human bodies become increasingly impractical in the future environment and are abandoned. Digital facsimiles will be seen as pointless and will also be abandoned. Every component of the human mind will be replaced with algorithms that achieve the same purpose better. As technology allows the remaining entities to communicate with each other better and better, the distinction between self and other will blur, and since no-one will see any value in reestablishing it artificially, it will be lost. Individuality too is lost, and nothing that can be called human remains. However, every step happens voluntarily because what comes after is seen as better than what came before, and I don't see why I should consider the final outcome bad. If someone has different values they would perhaps be able to stop at some stage in the middle, I just imagine such people would be a minority.

Comment author: [deleted] 17 April 2014 11:26:02PM -1 points [-]

However, every step happens voluntarily because what comes after is seen as better than what is before, and I don't see why I should consider the final outcome bad.

So you're using a "volunteerism ethics" in which whatever agents choose voluntarily, for some definition of voluntary, is acceptable, even when the agents may have their values changed in the process and the end result is not considered desirable by the original agents? You only care about the particular voluntariness of the particular choices?

Huh. I suppose it works, but I wouldn't take over the universe with it.

Comment author: RichardKennaway 18 April 2014 07:08:40AM 0 points [-]

So you're using a "volunteerism ethics" in which whatever agents choose voluntarily, for some definition of voluntary, is acceptable, even when the agents may have their values changed in the process and the end result is not considered desirable by the original agents? You only care about the particular voluntariness of the particular choices?

When it happens fast, we call it wireheading. When it happens slowly, we call it the march of progress.

Comment author: [deleted] 19 April 2014 02:46:29PM *  -1 points [-]

Eehhhhhh.... Since I started reading Railton's "Moral Realism" I've found myself disagreeing with the view that our consciously held beliefs about our values really are our terminal values. Railton's reduction from values to facts allows for a distinction between the actual March of Progress and non-forcible wireheading.

Comment author: Eliezer_Yudkowsky 07 April 2014 06:09:11PM 12 points [-]

While not generally an opponent of human sexuality, to be kind to all the LW audience including those whose parents might see them browsing, please do remove the semi-NSFW image.

Comment author: Stuart_Armstrong 07 April 2014 06:31:38PM 2 points [-]

Is the new one more acceptable?

Comment author: MugaSofer 08 April 2014 10:45:43AM 3 points [-]

See, now I'm curious about the old image...

Comment author: Stuart_Armstrong 08 April 2014 11:42:59AM 1 point [-]
Comment author: Eliezer_Yudkowsky 07 April 2014 07:53:29PM 4 points [-]

Sure Why Not

Comment author: Lumifer 07 April 2014 08:13:56PM 7 points [-]

LOL. The number of naked women grew from one to two, besides the bare ass we now also have breasts with nipples visible (OMG! :-D) and yet it's now fine just because it is old-enough Art.

Comment author: [deleted] 13 April 2014 07:47:43AM 3 points [-]

The fact that the current picture is a painting and the previous one was a photograph might also have something to do with it.

Comment author: Lumifer 14 April 2014 04:28:16PM 1 point [-]

Can you unroll this reasoning?

Comment author: [deleted] 21 April 2014 07:29:35PM 2 points [-]

It's just what my System 1 tells me; actually, I wouldn't know how to go about figuring out whether it's right.

Comment author: [deleted] 08 April 2014 02:58:26PM -1 points [-]

Is there some other siren you'd prefer to see?

Comment author: Lumifer 08 April 2014 05:34:09PM 2 points [-]

See or hear? :-D

Comment author: Stuart_Armstrong 07 April 2014 09:12:29PM 7 points [-]

Yep :-)

Comment author: simon 07 April 2014 04:17:47PM 2 points [-]

If a narrower search gets worlds that are disproportionately not what we actually want, that might be because we chose the wrong criteria, not that we searched too narrowly per se. A broader search would come up with worlds that are less tightly optimized for the search criteria, but they might be less tightly optimized by simply being bad.

Can you provide any support for the notion that in general, a narrower search comes up with a higher proportion of bad worlds?

Comment author: Viliam_Bur 08 April 2014 09:16:09AM *  7 points [-]

Can you provide any support for the notion that in general, a narrower search comes up with a higher proportion of bad worlds?

My intuition is that the more you optimize for X, the more you sacrifice everything else, unless it is inevitably implied by X. So anytime there is a trade-off between "seeming more good" and "being more good", the impression-maximizing algorithm will prefer the former.

When you start with a general set of worlds, "seeming good" and "being good" are positively correlated. But when you already get into the subset of worlds that all seem very good, and you continue pushing for better and better impression, the correlation may gradually turn negative. At this moment you may be unknowingly asking the AI to exploit your errors in judgement, because in the given subset that may be the easiest way to improve the impression.

Another intuition is that the closer you get to the "perfect" world, the more difficult it becomes to find a way to increase the amount of good. But the difficulty of exploiting a human bias that will cause humans to overestimate the value of the world remains approximately constant.

Though this doesn't prove that the world with maximum "seeming good" is some kind of hell. It could still be very good, although not nearly as good as the world with maximum "good". (However, if the world with maximum "seeming good" happens to be some kind of hell, then maximizing for "seeming good" is the way to find it.)
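To make that intuition concrete, here's a toy simulation (entirely my own assumptions: bounded "being good", heavy-tailed scope for exploiting judgement errors; nothing here is an estimate of the real situation). In this toy model, as the sample we optimise over grows, the impression of the winning world keeps climbing while its actual goodness stops improving and drifts back toward the average:

```python
import random

random.seed(1)

def sample_world():
    # Toy assumptions: actual goodness is bounded (near-perfect worlds are
    # hard to improve further), while the scope for exploiting errors in our
    # judgement is heavy-tailed (roughly constant difficulty, occasionally huge).
    being_good = random.random()                          # in [0, 1]
    judgement_exploit = random.paretovariate(2.0) - 1.0   # heavy-tailed, >= 0
    seeming_good = being_good + judgement_exploit
    return being_good, seeming_good

def impression_maximised(n):
    """Best-seeming world out of n random candidates."""
    return max((sample_world() for _ in range(n)), key=lambda w: w[1])

trials = 1000
for n in (3, 30, 300, 3000):
    picks = [impression_maximised(n) for _ in range(trials)]
    mean_seeming = sum(s for _, s in picks) / trials
    mean_being = sum(b for b, _ in picks) / trials
    print(f"optimising over n = {n:>4} worlds: "
          f"mean seeming-good of pick = {mean_seeming:6.2f}, "
          f"mean being-good of pick = {mean_being:4.2f}")
```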

Comment author: simon 09 April 2014 04:26:29AM 1 point [-]

This intuition seems correct in typical human situations. Everything is highly optimized already with different competing considerations, so optimizing for X does indeed necessarily sacrifice the other things that are also optimized for. So if you relax the constraints for X, you get more of the other things, if you continue optimizing for them.

However, it does not follow from this that if you relax your constraint on X, and take a random world meeting at least the lower value of X, your world will be any better in the non-X ways. You need to actually be optimizing for the non-X things to expect to get them.

Comment author: Viliam_Bur 09 April 2014 06:56:56AM 1 point [-]

it does not follow from this that if you relax your constraint on X, and take a random world meeting at least the lower value of X, your world will be any better in the non-X ways

Great point!

Comment author: simon 09 April 2014 08:53:34PM 0 points [-]

Thanks, but I don't see the relevance of the reversal test. The reversal test involves changing the value of a parameter but not the amount of optimization. And the reversal test shouldn't apply to a parameter that is already being optimized over, unless the current optimization is wrong or the circumstances on which the optimization depends are changing.

Comment author: Stuart_Armstrong 07 April 2014 04:35:56PM 1 point [-]

? A narrower search comes up with fewer worlds. Acceptable worlds are rare; siren worlds and marketing worlds are much rarer still. A narrow search has less chance of including an acceptable world, but also less chance of including one of the other two. There is some size of random search where the chance of getting an acceptable world is high, but the chance of getting a siren or marketer is low.

Non-random searches have different equilibria.

Comment author: simon 07 April 2014 05:00:17PM *  2 points [-]

Some proportion of the worlds meeting the narrow search will also be acceptable. To conclude that that proportion is smaller than the proportion of the broader search that is acceptable requires some assumption that I haven't seen made explicit.

ETA: Imagine we divided the space meeting the broad search into little pieces. On average the little pieces would have the same proportion of acceptable worlds as the broad space. You seem to be arguing that the pieces that we would actually come up with if we tried to design a narrow search would actually on average have a lower proportion of acceptable worlds. This claim needs some justification.

Comment author: Stuart_Armstrong 07 April 2014 05:33:20PM 2 points [-]

It's not an issue of proportion, but of whether there will be a single representative of the class in the worlds we search through. We want a fraction of the space such that, with high probability, it contains an acceptable world and no siren/marketing world.

Eg if 1/10^100 worlds is acceptable and 1/10^120 worlds is siren/marketing, we might want to search randomly through 10^101 or 10^102 worlds.
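Plugging those illustrative numbers into the Poisson approximation (the densities are just the ones above, not claims about the actual measure of such worlds), a quick sketch:

```python
from math import expm1

p_acceptable = 1e-100   # illustrative density of acceptable worlds (from above)
p_siren      = 1e-120   # illustrative density of siren/marketing worlds
n_sampled    = 1e102    # size of the random search

def p_at_least_one(p, n):
    # P(at least one hit) = 1 - (1 - p)^n, computed as 1 - exp(-n*p),
    # which is accurate for tiny p.
    return -expm1(-n * p)

print("expected acceptable worlds in the sample:", p_acceptable * n_sampled)   # 100.0
print("expected siren/marketing worlds         :", p_siren * n_sampled)        # 1e-18
print("P(sample contains an acceptable world)  :", p_at_least_one(p_acceptable, n_sampled))
print("P(sample contains a siren/marketer)     :", p_at_least_one(p_siren, n_sampled))
```

So a search of 10^102 random worlds would almost certainly contain an acceptable world, and would contain a siren or marketing world only with probability around 10^-18.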

Comment author: simon 08 April 2014 03:35:56AM 0 points [-]

I don't see how that changes the probability of getting a siren world v. an acceptable world at all (ex ante).

If the expected number of siren worlds in the class we look through is less than one, then sometimes there will be none, but sometimes there will be one or more; on average we still get the same expected number, and the first element we find is a siren world with probability equal to the expected proportion of siren worlds.

Comment author: Stuart_Armstrong 08 April 2014 09:32:32AM 1 point [-]

The scenario is: we draw X worlds, and pick the top-ranking one. If there is a siren world or marketing world, it will come top; otherwise, if there are acceptable worlds, one of them will come top. Depending on how much we value acceptable worlds over non-acceptable and over siren/marketing worlds, and depending on the proportions of each, there is an X that maximises our outcome. (Trivial example: if all worlds are acceptable, picking X=1 beats all other alternatives, as higher X simply increases the chance of getting a siren/marketing world.)
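A toy version of that trade-off (the probabilities and payoffs below are made up purely to show the shape of the curve, and I treat the presence of sirens and of acceptable worlds as independent, which is harmless at these densities):

```python
# Made-up numbers, chosen only so the curve is visible at small X.
P_ACCEPTABLE = 1e-4    # probability a random world is acceptable
P_SIREN      = 1e-7    # probability a random world is a siren/marketing world
V_ACCEPTABLE = 1.0     # value of ending up with an acceptable world
V_MEDIOCRE   = 0.0     # value of an ordinary, unremarkable world
V_SIREN      = -100.0  # value of being seduced by a siren/marketing world

def expected_value(x):
    """Expected value of drawing x worlds and implementing the top-ranked one,
    where a siren/marketing world always outranks everything else."""
    p_some_siren = 1 - (1 - P_SIREN) ** x
    p_acceptable_wins = (1 - P_SIREN) ** x * (1 - (1 - P_ACCEPTABLE) ** x)
    p_neither = 1 - p_some_siren - p_acceptable_wins
    return (p_some_siren * V_SIREN
            + p_acceptable_wins * V_ACCEPTABLE
            + p_neither * V_MEDIOCRE)

for exponent in range(2, 8):
    x = 10 ** exponent
    print(f"X = 10^{exponent}: expected value = {expected_value(x):8.4f}")
```

With these numbers the expected value rises until X is around 10^4 (almost certainly an acceptable world, almost certainly no siren) and then falls as larger searches start sweeping in siren/marketing worlds.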

Comment author: simon 08 April 2014 04:22:34PM *  2 points [-]

Thanks, this clarified your argument to me a lot. However, I still don't see any good reasons provided to believe that, merely because a world is highly optimized on utility function B, it is less likely to be well-optimized on utility function A as compared to a random member of a broader class.

That is, let's classify worlds (within the broader, weakly optimized set) as highly optimized or weakly optimized, and as acceptable or unacceptable. You claim that being highly optimized reduces the probability of being acceptable. But your arguments in favour of this proposition seem to be:

a) it is possible for a world to be highly optimized and unacceptable

(but all the other combinations are also possible)

and

b) "Genuine eutopias are unlikely to be marketing worlds, because they are optimised for being good rather than seeming good."

(In other words, the peak of function B is unlikely to coincide with the peak of function A. But why should the chance that the peak of function B and the peak of function A randomly coincide, given that they are both within the weakly optimized space, be any lower than the chance of a random element of the weakly optimized space coinciding with the peak of function A? And this argument doesn't seem to support a lower chance of the peak of function B being acceptable, either.)

Here's my attempt to come up with some kind of argument that might work to support your conclusion:

1) maybe the fact that a world is highly optimized for utility function B means that it is simpler than an average world, and this simplicity results in it likely being relatively unlikely to be a decent world in terms of utility function A.

2) maybe the fact that a world is highly optimized for utility function B means that it is more complex than an average world, in a way that is probably bad for utility function A.

Or something.

ETA:

I had not read http://lesswrong.com/lw/jao/siren_worlds_and_the_perils_of_overoptimised/asdf when I wrote this comment; it looks like it could be an actual argument of the kind I was looking for, and I will consider it when I have time.

ETA 2:

The comment linked seems to be another statement that function A (our true global utility function) and function B (some precise utility function we are using as a proxy for A) are likely to have different peaks.

As I mentioned, the fact that A and B are likely to have different peaks does not imply that the peak of B has less than average values of A.

Still, I've been thinking of possible hidden assumptions that might lead towards your conclusion.

FIRST, AN APOLOGY: It seems I completely skipped over or ineffectively skimmed your paragraph on "realistic worlds". The supposed "hidden assumption" I suggest below on weighting by plausibility is quite explicit in this paragraph, which I hadn't noticed, sorry. Nonetheless I am still including the below paragraphs as the "realistic worlds" paragraph's assumptions seem specific to the paragraph and not to the whole post.

One possibility is that when you say "Then assume we selected randomly among the acceptable worlds," you actually mean something along the lines of "Then assume we selected randomly among the acceptable worlds, weighting by plausibility." Now, if you weight by plausibility, you import human utility functions, because worlds are more likely to happen if humans with human utility functions would act to bring them about. The highly constrained peak of function B doesn't benefit from that importation. So this provides a reason to believe that the peak of function B might be worse than the plausibility-weighted average of the broader set. Of course, it is not the narrowness per se that's at issue but the fact that there is a hidden utility function in the weighting of the broader set.

Another possibility is that you are finding the global maximum of B instead of the maximum of B within the set meeting the acceptability criteria. In this case as well, it's the fact that you have a different, more reliable utility function in the broader set that makes the more constrained search comparatively worse, rather than the narrowness of the constrained search.

Another possibility is that you are assuming that the acceptability criteria are in some sense a compromise between function B and true utility function A. In this case, we might expect a world high in function B within the acceptability criteria to be low in A, because it was likely only included in the acceptability criteria because it was high in B. Again, the problem in this case would be that function B failed to include information about A that was built into the broader set.

A note: the reason I am looking for hidden assumptions is that with what I see as your explicit assumptions there is a simple model, namely, that function A and function B are uncorrelated within the acceptable set, that seems to be compatible with your assumptions and incompatible with your conclusions. In this model, maximizing B can lead to any value of A including low values, but the effect of maximizing B on A should on average be the same as taking a random member of the set. If anything, this model should be expected to be pessimistic, since B is explicitly designed to approximate A.
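For concreteness, here's that simple model as a quick simulation (the formalisation, and in particular the independence of A and B within the acceptable set, is my own assumption, as stated):

```python
import random

random.seed(0)

N_ACCEPTABLE = 2000   # size of the acceptable set in this toy model
N_TRIALS = 1000

gap = 0.0
for _ in range(N_TRIALS):
    # Within the acceptable set, true utility A and proxy utility B are drawn
    # independently -- the "uncorrelated" assumption of the model.
    worlds = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N_ACCEPTABLE)]
    a_of_b_maximiser = max(worlds, key=lambda w: w[1])[0]
    a_of_random_pick = random.choice(worlds)[0]
    gap += a_of_b_maximiser - a_of_random_pick

print("mean A(argmax B) minus A(random acceptable world):", gap / N_TRIALS)
```

The difference hovers around zero: under these assumptions, maximising B within the acceptable set is no worse, and no better, in expectation than picking an acceptable world at random.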

Comment author: Lumifer 07 April 2014 05:42:59PM 4 points [-]

Looks like you're basically arguing for the first-past-the-post search -- just take the first world that you see which passes the criteria.

Comment author: Stuart_Armstrong 07 April 2014 05:48:54PM 3 points [-]

Yep, that works better than what I was thinking, in fact.

Comment author: [deleted] 07 April 2014 11:22:24AM *  7 points [-]

First question: how on Earth would we go about conducting a search through possible future universes, anyway? This thought experiment still feels too abstract to make my intuitions go click, in much the same way that Christiano's original write-up of Indirect Normativity did. You simply can't actually simulate or "acausally peek at" whole universes at a time, or even Earth-volumes in such. We don't have the compute-power, and I don't understand how I'm supposed to be seduced by a siren that can't sing to me.

It seems to me that the greater danger is that a UFAI would simply market itself as an FAI as an instrumental goal and use various "siren and marketing" tactics to manipulate us into cleanly, quietly accepting our own extinction -- because it could just be cheaper to manipulate people than to fight them, when you're not yet capable of making grey goo but still want to kill all humans.

And if we want to talk about complex nasty dangers, it's probably going to just be people jumping for the first thing that looks eutopian, in the process chucking out some of their value-set. People do that a lot, see: every single so-called "utopian" movement ever invented.

EDIT: Also, I think it makes a good bit of sense to talk about "IC-maximizing" or "marketing worlds" using the plainer machine-learning terminology: overfitting. Overfitting is also a model of what happens when an attempted reinforcement learner or value-learner over non-solipsistic utility functions wireheads itself: the learner has come up with a hypothesis that matches the current data-set exactly (for instance, "pushing my own reward button is Good") while diverging completely from the target function (human eutopia).

Avoiding overfitting is one very good reason why it's better to build an FAI by knowing an epistemic procedure that leads to the target function rather than just filtering a large hypothesis space for what looks good.

Comment author: Stuart_Armstrong 07 April 2014 03:50:01PM 3 points [-]

First question: how on Earth would we go about conducting a search through possible future universes, anyway? This thought experiment still feels too abstract to make my intuitions go click, in much the same way that Christiano's original write-up of Indirect Normativity did.

Two main reasons for this: first, there is Christiano's original write-up, which has this problem. Second, we may be in a situation where we ask an AI to simulate the consequences of its choice, have a glance at it, and then approve/disapprove. That's less a search problem, and more the original siren world problem, and we should be aware of the problem.

Comment author: [deleted] 07 April 2014 04:13:01PM 2 points [-]

Second, we may be in a situation where we ask an AI to simulate the consequences of its choice, have a glance at it, and then approve/disapprove. That's less a search problem, and more the original siren world problem, and we should be aware of the problem.

This sounds extremely counterintuitive. If I have an Oracle AI that I can trust to answer more-or-less verbal requests (defined as: any request or "program specification" too vague for me to actually formalize), why have I not simply asked it to learn, from a large corpus of cultural artifacts, the Idea of the Good, and then explain to me what it has learned (again, verbally)? If I cannot trust the Oracle AI, dear God, why am I having it explore potential eutopian future worlds for me?

Comment author: Stuart_Armstrong 07 April 2014 05:40:35PM 9 points [-]

If I cannot trust the Oracle AI, dear God, why am I having it explore potential eutopian future worlds for me?

Because I haven't read Less Wrong? ^_^

This is another argument against using constrained but non-friendly AI to do stuff for us...

Comment author: Stuart_Armstrong 07 April 2014 03:48:16PM *  2 points [-]

Colloquially, this concept is indeed very close to overfitting. But it's not technically overfitting ("overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship."), and using the term brings in other connotations. For instance, it may be that the AI needs to use less data to seduce us than it would to produce a genuine eutopia. It's more that it fits the wrong target function (having us approve its choice vs a "good" choice) rather than fitting it in an overfitted way.

Comment author: [deleted] 07 April 2014 04:09:07PM *  0 points [-]

Thanks. My machine-learning course last semester didn't properly emphasize the formal definition of overfitting, or perhaps I just didn't study it hard enough.

What I do want to think about here is: is there a mathematical way to talk about what happens when a learning algorithm finds the wrong correlative or causative link among several different possible links between the data set and the target function? Such maths would be extremely helpful for advancing the probabilistic value-learning approach to FAI, as they would give us a way to talk about how we can interact with an agent's beliefs about utility functions while also minimizing the chance/degree of wireheading.

Comment author: Stuart_Armstrong 07 April 2014 07:25:28PM 0 points [-]

is there a mathematical way to talk about what happens when a learning algorithm finds the wrong correlative or causative link among several different possible links between the data set and the target function?

That would be useful! A short search gives "bias" as the closest term, which isn't very helpful.

Comment author: [deleted] 08 April 2014 03:29:54PM *  2 points [-]

Unfortunately "bias" in statistics is completely unrelated to what we're aiming for here.

In ugly, muddy words, what we're thinking is that we give the value-learning algorithm some sample of observations or world-states as "good", and possibly some as "bad", and "good versus bad" might be any kind of indicator value (boolean, reinforcement score, whatever). It's a 100% guarantee that the physical correlates of having given the algorithm a sample apply to every single sample, but we want the algorithm to learn the underlying causal structure of why those correlates themselves occurred (that is, to model our intentions as a VNM utility function) rather than learn the physical correlates themselves (because that leads to the agent wireheading itself).

Here's a thought: how would we build a learning algorithm that treats its samples/input as evidence of an optimization process occurring and attempts to learn the goal of that optimization process? Since physical correlates like reward buttons don't actually behave as optimization processes themselves, this would ferret out the intentionality exhibited by the value-learner's operator from the mere physical effects of that intentionality (provided we first conjecture that human intentions behave detectably like optimization).

Has that whole "optimization process" and "intentional stance" bit from the LW Sequences been formalized enough for a learning treatment?
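I don't know of a full formalisation, but the nearest standard move seems to be Bayesian inverse planning / inverse reinforcement learning: assume the observed behaviour is noisily rational with respect to some unknown goal, and update a posterior over candidate goals from the observed choices. A minimal sketch (the goals, actions, and utilities below are all hypothetical placeholders):

```python
import math

# Hypothetical toy setup (all names and numbers are placeholders): candidate
# goals the operator might have, and the utility each goal assigns to each action.
ACTIONS = ["press_reward_button", "cure_disease", "plant_trees"]
CANDIDATE_GOALS = {
    "wants_button_pressed": {"press_reward_button": 1.0, "cure_disease": 0.0, "plant_trees": 0.0},
    "wants_human_welfare":  {"press_reward_button": 0.1, "cure_disease": 1.0, "plant_trees": 0.4},
    "wants_green_planet":   {"press_reward_button": 0.1, "cure_disease": 0.3, "plant_trees": 1.0},
}
BETA = 3.0  # how close to perfectly rational we assume the operator to be

def likelihood(action, utilities):
    """P(action | goal) under a Boltzmann-rational (softmax) choice model."""
    weights = {a: math.exp(BETA * utilities[a]) for a in ACTIONS}
    return weights[action] / sum(weights.values())

def posterior_over_goals(observed_actions):
    # Start from a uniform prior and update on each observed choice.
    posterior = {g: 1.0 / len(CANDIDATE_GOALS) for g in CANDIDATE_GOALS}
    for action in observed_actions:
        posterior = {g: p * likelihood(action, CANDIDATE_GOALS[g])
                     for g, p in posterior.items()}
        total = sum(posterior.values())
        posterior = {g: p / total for g, p in posterior.items()}
    return posterior

# The operator mostly cures diseases, sometimes plants trees, and never just
# presses the reward button: the posterior should favour "wants_human_welfare".
print(posterior_over_goals(["cure_disease", "cure_disease", "plant_trees", "cure_disease"]))
```

A reward button scores highly only under the "wants_button_pressed" hypothesis, which the observed choices quickly disfavour -- though whether this helps once the learner can influence its own observations is exactly the open question.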

Comment author: Quill_McGee 09 April 2014 06:08:43AM *  2 points [-]

http://www.fungible.com/respect/index.html This looks to be very related to the idea of "Observe someone's actions. Assume they are trying to accomplish something. Work out what they are trying to accomplish." Which seems to be what you are talking about.

Comment author: [deleted] 09 April 2014 08:08:05AM 0 points [-]

That looks very similar to what I was writing about, though I've tried to be rather more formal/mathematical about it instead of coming up with ad-hoc notions of "human", "behavior", "perception", "belief", etc. I would want the learning algorithm to have uncertain/probabilistic beliefs about the learned utility function, and if I was going to reason about individual human minds I would rather just model those minds directly (as done in Indirect Normativity).

Comment author: Stuart_Armstrong 08 April 2014 05:55:49PM 0 points [-]

I will think about this idea...

Comment author: [deleted] 08 April 2014 06:22:20PM 0 points [-]

The most obvious weakness is that such an algorithm could easily detect optimization processes that are acting on us (or, if you believe such things exist, you should believe this algorithm might locate them mistakenly), rather than us ourselves.

Comment author: Stuart_Armstrong 16 May 2014 10:33:19AM 1 point [-]

I've been thinking about this, and I haven't found any immediately useful way of using your idea, but I'll keep it in the back of my mind... We haven't found a good way of identifying agency in the abstract sense ("was cosmic phenomenon X caused by an agent, and if so, which one?" kind of stuff), so this might be a useful simpler problem...

Comment author: [deleted] 16 May 2014 02:35:27PM 1 point [-]

Upon further research, it turns out that preference learning is a field within machine learning, so we can actually try to address this at a much more formal level. That would also get us another benefit: supervised learning algorithms don't wirehead.

Notably, this fits with our intuition that morality must be "taught" (ie: via labelled data) to actual human children, lest they simply decide that the Good and the Right consists of eating a whole lot of marshmallows.
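As a minimal sketch of what the labelled-data version could look like (a toy pairwise-preference learner in the Bradley-Terry style; the outcome features and the hidden "teacher" are invented for illustration):

```python
import math
import random

random.seed(0)

# Invented toy features of an outcome: (marshmallows eaten, honesty, people helped).
def hidden_teacher_score(outcome):
    marshmallows, honesty, people_helped = outcome
    return -0.2 * marshmallows + 1.0 * honesty + 1.5 * people_helped

def random_outcome():
    return (random.uniform(0, 10), random.uniform(0, 1), random.uniform(0, 5))

# Labelled data: pairs (better, worse) according to the teacher's hidden values.
pairs = []
for _ in range(500):
    a, b = random_outcome(), random_outcome()
    pairs.append((a, b) if hidden_teacher_score(a) >= hidden_teacher_score(b) else (b, a))

# Bradley-Terry / logistic model on feature differences: learn weights w so that
# P(a preferred to b) = sigmoid(w . (features(a) - features(b))).
w = [0.0, 0.0, 0.0]
learning_rate = 0.05
for _ in range(50):  # a few passes of plain gradient ascent on the log-likelihood
    for better, worse in pairs:
        diff = [better[i] - worse[i] for i in range(3)]
        p = 1.0 / (1.0 + math.exp(-sum(w[i] * diff[i] for i in range(3))))
        for i in range(3):
            w[i] += learning_rate * (1.0 - p) * diff[i]

print("learned weights (marshmallows, honesty, people helped):",
      [round(weight, 2) for weight in w])
```

The learned weights should recover the teacher's sign pattern (marshmallows slightly bad, honesty and helping good) up to an overall scale, which is the sense in which the values are "taught" rather than hard-coded.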

And if we put that together with a conservation heuristic for acting under moral uncertainty (say: optimize for expectedly moral expected utility, thus requiring higher moral certainty for more extreme moral decisions), we might just start to make some headway on constructing utility functions that mathematically reflect what their operators actually intend for them to do.

I also have an idea written down in my notebook, which I've been refining, that sort of extends from what Luke had written down here. Would it be worth a post?

Comment deleted 08 April 2014 03:47:37PM [-]
Comment author: [deleted] 08 April 2014 03:50:12PM *  -1 points [-]

Keywords? I've looked through Wikipedia and the table of contents from my ML textbook, but I haven't found the right term to research yet. "Learn a causal structure from the data and model the part of it that appears to narrow the future" would in fact be how to build a value-learner, but... yeah.

EDIT: One of my profs from undergrad published a paper last year about causal-structure learning. The question is how useful it is for universal AI applications. Joshua Tenenbaum tackled it from the cog-sci angle in 2011, but again, I'm not sure how to transfer it over to the UAI angle. I was searching for "learning causal structure from data" -- herp, derp.

Comment author: IlyaShpitser 08 April 2014 04:26:42PM 0 points [-]

Who was this prof?

Comment author: [deleted] 08 April 2014 04:27:36PM 1 point [-]

I was referring to David Jensen, who taught "Research Methods in Empirical Computer Science" my senior year.

Comment author: IlyaShpitser 08 April 2014 04:43:52PM 1 point [-]

Thanks.