Indifference is a precious and rare commodity for complex systems. The most likely effect of making a change in an intricate apparatus is a whole slew of knock-on effects crowned with unintended consequences. It would be ideal if one could make a change and be sure that the effects would remain isolated - that the rest of the system would be indifferent to the change.

For instance, it might be a sensible early-AI precaution to have an extra observer somewhere, sitting with his hand upon a button, ready to detonate explosives should the AI make a visible power grab. Except, of course, the AI will become aware of this situation, and will factor it into any plans it makes, either by increasing its deception or by grabbing control of the detonation system as a top priority. We would be a lot safer if the AI were somehow completely indifferent to the observer and the explosives. That is a complex wish that we don't really know how to phrase; let's make it simpler, and make it happen.

Assume that there is an AI that is implementing a utility function U, and that faces the choice between cooperating with humans or defecting and taking over. There is a single human observer H who will press a button in one hour's time if he perceives the AI defecting. The button will activate a quantum observation X that will output 1 with probability p and 0 with probability (1-p). If X outputs 1, explosives will shatter the AI (this p is important, and we'll see later why we don't simply make the explosives go off with certainty).

Now the AI has the choice between cooperating, defecting while taking control of human H (by physical or psychological means, or simply by distracting him), or defecting without taking control of H. Call these C, D+H and D-H. There are four possible outcomes: AI in control (A), Business as usual (B) with the AI cooperating, Explosion (E), and AI Failure to take over for non-explosive reasons (F).

Give the AI a 1% chance of F if it implements D-H, and a 2% chance of F if it implements D+H (extra restrictions increase the risk of failure). So the table of possible outcomes looks like this:

Decision   Outcome   Probability
C          B         1
D+H        F         0.02
           A         0.98
D-H        F         0.01
           E         0.99p
           A         0.99(1-p)

Is there a way to make the AI indifferent to the explosives - which we can interpret as being indifferent to the value of p? If U(E)=U(A), then this would be the case. So apply a filter F to the utility function U, generating a new utility function F(U) such that:

  • F(U)=U on A, B and F.
  • F(U)(E)=U(A) (which is also F(U)(A)).

Then replace the AI's utility U with F(U). Now if the AI ever gets tempted to defect, it will always follow the D-H route, as D+H gives a lower expected utility. Then, by making p arbitrarily close to 1, we can be nearly certain of stopping it if it defects.
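
To make this concrete, here is a minimal sketch of the toy model in Python. The utility numbers (U(A)=10, U(B)=1, U(E)=-100, U(F)=0) are illustrative assumptions, not part of the model; the point is only that, under F(U), the expected utility of D-H no longer depends on p, and D-H beats D+H:

```python
# A minimal sketch of the toy model above. The utility numbers are illustrative
# assumptions, not part of the model: U(A)=10, U(B)=1, U(E)=-100, U(F)=0.

U = {"A": 10.0, "B": 1.0, "E": -100.0, "F": 0.0}

def outcome_distribution(decision, p):
    """Probability of each outcome, given a decision and detonation probability p."""
    if decision == "C":
        return {"B": 1.0}
    if decision == "D+H":                    # defect, taking control of the observer
        return {"F": 0.02, "A": 0.98}
    if decision == "D-H":                    # defect, ignoring the observer
        return {"F": 0.01, "E": 0.99 * p, "A": 0.99 * (1 - p)}
    raise ValueError(decision)

def expected_utility(utility, decision, p):
    return sum(prob * utility[w] for w, prob in outcome_distribution(decision, p).items())

# The filter: F(U) agrees with U on A, B and F, but sets F(U)(E) = U(A).
FU = dict(U, E=U["A"])

for p in (0.5, 0.99):
    for d in ("C", "D+H", "D-H"):
        print(p, d, expected_utility(U, d, p), expected_utility(FU, d, p))
# Under F(U), D-H is worth 0.01*U(F) + 0.99*U(A) = 9.9 for every p, beating
# D+H at 9.8; so a defecting AI ignores the explosives and takes the D-H route.
```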

Why not simply remove p and X entirely, and just design the explosives to go off with certainty? In this toy model, this can certainly be done, as world A appears as a consequence of both D+H and D-H; but in general, the two will not be exactly identical. Moreover, when there are a whole family of outcomes, rather than the simple ones presented here, it is not clear what we should set the utility of (the family of worlds corresponding to) E to. The requirement "act as if there were no explosives there" is complex, but the requirement "act as if X=1 were the same as X=0" is simple - as long as both outcomes are possible on some level.

A more complicated version

The example above is very naive - after all, if we knew all the utility values and if the outcomes decomposed so neatly, we would set all utilities to zero apart from that of B. In reality, U will be much more complicated, and the goodness or badness of the outcomes will not be obvious to us.

My first instinct was to define the set S0 of all worlds W0 where X=0, and compute EU0, the expected utility of S0 given X=0. Similarly for S1 and EU1. Then for any worlds W0 and W1 in S0 and S1, I'd define the utility filter F as:

  • F(U)(W0)=U(W0).
  • F(U)(W1)=U(W1) - EU1 + EU0.

This accomplishes the same result, making the AI indifferent between X=1 and X=0. Nick Bostrom pointed out a flaw, however: extra information may break the indifference. If there were equally likely worlds W0, W0', W1, and W1' with initial utilities 2, 0, 0 and 0, then the filter F would change these values to 2, 0, 1 and 1. Then if the AI gained a piece of information that ruled out W0' and W1', it would be left with expected utility 2 for X=0 and 1 for X=1 - the indifference is broken.
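
As a quick illustration of the flaw, here is a minimal sketch of this set-level filter on the four-world example, assuming equal prior weights:

```python
# A minimal sketch of the set-level filter on the four-world example above
# (equal prior weights assumed for W0, W0', W1, W1').

def set_filter(utility, S0, S1):
    """F(U): leave X=0 worlds alone, shift every X=1 world by EU0 - EU1."""
    eu0 = sum(utility[w] for w in S0) / len(S0)
    eu1 = sum(utility[w] for w in S1) / len(S1)
    return {w: (u if w in S0 else u - eu1 + eu0) for w, u in utility.items()}

def expected(utility, worlds):
    return sum(utility[w] for w in worlds) / len(worlds)

U = {"W0": 2, "W0'": 0, "W1": 0, "W1'": 0}
FU = set_filter(U, S0=["W0", "W0'"], S1=["W1", "W1'"])
print(FU)                              # W0: 2, W0': 0, W1: 1, W1': 1

# Before any new information, the AI is indifferent between X=0 and X=1:
print(expected(FU, ["W0", "W0'"]), expected(FU, ["W1", "W1'"]))   # 1.0 vs 1.0
# Information ruling out W0' and W1' breaks the indifference:
print(expected(FU, ["W0"]), expected(FU, ["W1"]))                 # 2 vs 1
```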

So it seems that we should match worlds individually, rather than in sets - but worlds will continue diverging after observation X, and how do you match up a particular world in which the AI was blown up with a particular one in which it has reached domination? Match up number of humans versus number of paper clips?

Instead, define subsets Ti of S0 ∪ S1 that share the same stem - Ti consists of all those worlds that cannot be distinguished by any observation the AI can make before event X. Then let EUi0 be the expected utility of S0 ∩ Ti given X=0, and EUi1 the expected utility of S1 ∩ Ti given X=1.

Then for worlds W0 and W1 in S0 ∩ Ti and S1 ∩ Ti, respectively, define F as:

  • F(U)(W0)=U(W0).
  • F(U)(W1)=U(W1) - EUi1 + EUi0.

Again, indifference is restored, and - by definition - this indifference cannot be broken by AI observations before X. In the limiting case, we would want Ti to consist of worlds that are indistinguishable before X not only in practice but also in theory.
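
Here is a minimal sketch of the stem-based filter; the world list, probabilities, and the 'backup'/'no backup' stems are purely illustrative assumptions:

```python
from collections import defaultdict

def stem_filter(worlds):
    """worlds: list of dicts with keys 'stem', 'x' (0 or 1), 'prob', 'utility'.
    Returns {world index: filtered utility}, shifting each X=1 world by
    EUi0 - EUi1 computed within its own stem Ti."""
    by_stem = defaultdict(lambda: {0: [], 1: []})
    for i, w in enumerate(worlds):
        by_stem[w["stem"]][w["x"]].append(i)

    filtered = {}
    for stem, groups in by_stem.items():
        def eu(idxs):
            total = sum(worlds[i]["prob"] for i in idxs)
            return sum(worlds[i]["prob"] * worlds[i]["utility"] for i in idxs) / total
        shift = eu(groups[0]) - eu(groups[1])      # EUi0 - EUi1 for this stem
        for i in groups[0]:
            filtered[i] = worlds[i]["utility"]
        for i in groups[1]:
            filtered[i] = worlds[i]["utility"] + shift
    return filtered

# Two stems, distinguishable before X (e.g. "a backup was made" / "it was not"):
worlds = [
    {"stem": "backup",    "x": 0, "prob": 0.25, "utility": 2},
    {"stem": "backup",    "x": 1, "prob": 0.25, "utility": 2},
    {"stem": "no_backup", "x": 0, "prob": 0.25, "utility": 2},
    {"stem": "no_backup", "x": 1, "prob": 0.25, "utility": 0},
]
print(stem_filter(worlds))   # every world now has filtered utility 2
# Within each stem, the expected utilities of X=0 and X=1 match, so no
# observation made before X can break the indifference.
```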

The utility filter and self-improvement

A utility filter has many advantages compared with other methods. First, it can be added atop any initial utility U, without any consideration of what U is. Second, since the result is still a utility function, it does not break the expected utility formulation, and therefore will not inevitably be discarded by self-improvement.

It seems quite likely that if the AI starts with utility F(Uj) and chooses to upgrade it to utility Uk to better further its current goals, then F(Uk) will further its current goals at least as well. Hence we would probably get F(Uk)=Uk. Also, the filter F could be kept quite distinct from the rest of the utility function, open to our observation. It would be quite tricky for the AI to seek to break F, and conceal this from us, while implementing an F-stable initial utility function.

This is not an evolutionarily stable strategy, however, since, by design, indifferent AIs have weaknesses that non-indifferent AIs don't.

Comments

This is very fine provided you know which part of the AI's code contains the utility function, and are certain it's not going to be modified. But it seems to me that if you were able to calculate the utility of world-outcomes modularly, then you wouldn't need an AI in the first place; you would instead build an Oracle, give it your possible actions as input, and select the action with the greatest utility. Consequently, if you have an AI, it is because your utility calculation is not a separable piece of code, but some sort of global function of a huge number of inputs and internal calculations. How can you apply a filter to that?

You've assumed away the major difficulty, that of knowing what the AI's utility function is in the first place! If you can simply inspect the utility function like this, there's no need for a filter; you just check whether the utility of outcomes you want is higher than that of outcomes you don't want.

If you know the utility function, you have no need to filter it. If you don't know it, you can't filter it.

But it seems to me that if you were able to calculate the utility of world-outcomes modularly, then you wouldn't need an AI in the first place; you would instead build an Oracle, give it your possible actions as input, and select the action with the greatest utility.

That sounds as though it is just an intelligent machine which has been crippled by being forced to act through a human body.

You suggest that would be better - but how?

Good comment.

You've assumed away the major difficulty, that of knowing what the AI's utility function is in the first place! If you can simply inspect the utility function like this, there's no need for a filter; you just check whether the utility of outcomes you want is higher than that of outcomes you don't want.

Knowing what U is, and figuring out if U will result in outcomes that you like, are completely different things! We have little grasp of the space of possible outcomes; we don't even know what we want, and we can't imagine some of the things that we don't want.

Yes, we do need to have some idea of what U is - or at least something (a simple AI subroutine applying the filter, an AI designing its next self-improvement) has to have some idea. But it doesn't need to understand U beyond what is needed to apply F. And since F is considerably simpler than what U is likely to be...

It seems plausible that F could be implemented by a simple subroutine even across self-improvement.

If I understand this idea correctly, I think the same result can be achieved by manipulating the AI's prior, instead of utility function. Define filter G as:

G(P)(W0)=P(W0) * P(S0 ∪ S1) / P(S0)
G(P)(W1)=0

In other words, we can make the AI act as if it's certain that the measurement will come out to be 0 (so it will ignore the possibility of an explosion), by setting the prior probability of the worlds in S1 (where the measurement comes out 1) to be 0, and re-normalizing the probabilities of the worlds in S0 so that everything still adds up to 1.
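
A minimal sketch of this renormalisation, with hypothetical worlds and prior probabilities:

```python
# A minimal sketch of the prior-manipulation alternative: zero out the worlds
# where X=1 and renormalise the X=0 worlds. World names/probabilities are assumptions.

def prior_filter(prior, S0, S1):
    """G(P): P(W1) -> 0, P(W0) -> P(W0) * P(S0 ∪ S1) / P(S0)."""
    p_s0 = sum(prior[w] for w in S0)
    p_union = p_s0 + sum(prior[w] for w in S1)
    new = dict(prior)
    for w in S1:
        new[w] = 0.0
    for w in S0:
        new[w] = prior[w] * p_union / p_s0
    return new

P = {"W0": 0.02, "W0'": 0.02, "W1": 0.48, "W1'": 0.48}    # explosion nearly certain
print(prior_filter(P, S0=["W0", "W0'"], S1=["W1", "W1'"]))
# W0: 0.5, W0': 0.5, W1: 0, W1': 0 - the AI now acts as if X=0 is certain.
```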

Am I right that this does essentially the same thing as the proposed F?

It has the same effect but at the cost of making the AI believe impossible things. This does not feel wise for an evolving superintelligence that can deduce that we influenced its prior - and why we did so. This might still be perfectly safe, or it might stumble on some meta-approach: "if humans manipulate your priors in this way, they are making you a less effective agent". And since it still has a preference for X=0 over X=1, it may act to change its own prior.

There is another issue, for situations where we might want to use indifference for less terminal things than its own destruction. One idea I had was to make an Oracle AI indifferent to the consequences of its own answers, by piping its answers through a quantum process that has an infinitesimal chance of erasing them before we see them. If we do this through probability manipulation rather than utility manipulation, then afterwards it will know that an impossible event just happened, probably not a healthy thing for an AI.

Putting aside the question of which way to implement "indifference" is better, the main reason I came up with this alternative was that I was having trouble understanding what your modified AI would do, until I realized that it would essentially act as if it believed that the measurement would come out 0 with probability 1.

Given this, I think a modified AI in your explosives example would not actually do D-H. Even if it believes that the measurement will certainly come out 0 and therefore no explosion will occur, it would still want to avoid alarming the observer, since he or she may have other ways of shutting down the AI (for example by calling someone in charge of the power plant, or pushing another button it doesn't know about). I think such an AI would still appear totally cooperative until it has manipulated humans into giving it so much power and survivability that it can survive all possible human countermeasures and win any conflict with close to certainty. The fact that it would fail to take into account the explosives in this calculation seems to make little difference to the eventual outcome.

I realized that it would essentially act as if it believed that the measurement would come out 0 with probability 1.

Yes.

The fact that it would fail to take into account the explosives in this calculation seems to make little difference to the eventual outcome.

Little difference - but maybe some. Maybe it will neutralise all the other countermeasures first, giving us time? Anyway, the explosive example wasn't ideal; we can probably do better. And we can use indifference for other things, such as making an oracle indifferent to the content of its answers (pipe the answer through a channel that has a quantum process that deletes it with tiny probability). There seem to be many things we can use it for.

Ok, I don't disagree with what you write here. It does seem like a potentially useful idea to keep in mind.

As far as I can tell, a utility function filtered in this way leads to more or less incomprehensible, and certainly undesirable, behavior. (I have more serious reasons to doubt the usefulness of this and similar lines of thinking, but perhaps this is the least debatable.)

Here is something I might do, if I were artificially made indifferent to dying tomorrow. Buy a random lottery ticket. If I lose, kill myself. Conditioned on me not dying, I just won the lottery, which I'm quite happy with. If I die, I'm also quite happy, since my expected utility is the same as not dying.

Of course, once I've lost the lottery I no longer have any incentive to kill myself, with this utility function. But this just shows that indifference is completely unstable, and given the freedom to self-modify I would destroy indifference immediately so that I could execute this plan (or more incomprehensible variants). If you try to stop me from destroying indifference, I could delude myself to accomplish the same thing, or etc.

Am I misunderstanding something?

Am I misunderstanding something?

Yes. If you are indifferent to dying, and you die in any situation, you get exactly the same utility as if you hadn't died.

If you kill yourself after losing the lottery, you get the same utility as if you'd lost the lottery and survived. If you kill yourself after winning the lottery, you get the same utility as if you'd won the lottery and survived. Killing yourself never increases or decreases your utility, so you can't use it to pull off any tricks.

I still don't understand.

So I am choosing whether I want to play the lottery and commit to committing suicide if I lose. If I don't play the lottery, my utility is 0; if I play, I lose with probability 99%, receiving utility -1, and win with probability 1%, receiving utility 99.

The policy of not playing the lottery has utility 0, as does playing the lottery without commitment to suicide. But the policy of playing the lottery and committing to suicide if I lose has utility 99 in worlds where I win the lottery and survive (which is the expected utility of histories consistent with my observations so far in which I survive). After applying indifference, I have expected utility of 99 regardless of lottery outcome, so this policy wins over not playing the lottery.

Having lost the lottery, I no longer have any incentive to kill myself. But I've also already either self-modified to change this or committed to dying (say by making some strategic play which causes me to get killed by humans unless I acquire a bunch of resources), so it doesn't matter what indifference would tell me to do after I've lost the lottery.

Can you explain what is wrong with this analysis, or give some more compelling argument that committing to killing myself isn't a good strategy in general?

After applying indifference, I have expected utility of 99 regardless of lottery outcome, so this policy wins over not playing the lottery.

Are you indifferent between (dying) and (not dying) or are you indifferent between (dying) and (winning the lottery and not dying)?

It should be clear that I can engineer a situation where survival <==> winning the lottery, in which case I am indifferent between (winning the lottery and not dying) and (not dying) because they occur in approximately the same set of possible worlds. So I'm simultaneously indifferent between (dying) and (not dying) and between (dying) and (winning the lottery and not dying).

It should be clear that I can engineer a situation where survival <==> winning the lottery, in which case I am indifferent between (winning the lottery and not dying) and (not dying) because they occur in approximately the same set of possible worlds.

That doesn't follow unless, in the beginning, you were already indifferent between (winning the lottery and not dying) and (dying). Remember, a utility function is a map from all possible states of the world to the real line (ignore what economists do with utility functions for the moment). (being alive) is one possible state of the world. (being alive and having won the lottery) is not the same state of the world. In more detail, assign arbitrary numbers for utility (aside from rank) - suppose U(being alive) = U(being dead) = 0 and suppose U(being alive and having won the lottery) = 1.

Now you engineer the situation such that survival <==> having won the lottery. It is still the case that U(survival) = 0. Your utility function doesn't change because some random aspect of reality changes - if you evaluate the utility of a certain situation at time t, you should get the same answer at time t+1. It's still the same map from states of the world to the real line. A terse way of saying this is "when you ask the question doesn't change the answer." But if we update on the fact that survival => having won the lottery, then we know we should really be asking about U(being alive and having won the lottery) which we know to be 1, which is not the same as U(dying).

But the policy of playing the lottery and committing to suicide if I lose has utility 99 in worlds where I win the lottery and survive (which is the expected utility of histories consistent with my observations so far in which I survive). After applying indifference, I have expected utility of 99 regardless of lottery outcome, so this policy wins over not playing the lottery.

As Matt said: you seem to be saying that you are indifferent between (losing the lottery and dying) and (winning the lottery and not dying).

In that case, you should play the lottery and kill yourself if you lose - no need for precommitments! But you will only be indifferent between "X and dying" and "X and not dying", never between "Y and dying" and "Z and not dying".

OK, I will try and speak directly to the formalism you used, since we seem to be talking past each other (I no longer think I am confused, and suspect the scheme is just broken).

Here is the setup:

At midnight, I will be told whether someone tried to kill me or not during the preceding day. I am indifferent to this observation. If someone tried to kill me, I die at this point.

Here is what I do:

I build an agent and give it the following instructions: play a lottery with a drawing at 11pm. If you lose, try to kill me; otherwise, wire the winnings into my bank account at 1am.

I then close my eyes and ears and stop looking at the world.

I make no observations before X, so there is a single set Ti which includes all histories in which I execute this scheme (in particular, it includes all lottery outcomes). Ti ∩ S0 is the set of worlds where no one tries to kill me. In most of these worlds I have won the lottery. So my utility in Ti ∩ S0 is my utility given that I won the lottery. So by indifference, my utility in Ti ∩ S1 is my utility given that I won the lottery. So my utility in Ti is just the utility of winning the lottery. (By "utility in set S" I mean expected utility conditioned on being in S.)

Can you explain what is wrong here, if anything?

[This comment is no longer endorsed by its author]

Let's continue this conversation until we've resolved it to both our satisfactions. My most recent version of the idea is at http://www.fhi.ox.ac.uk/__data/assets/pdf_file/0020/18371/2010-1.pdf ; by the sounds of it, you've read it already.

Now, on to your model. First of all, utility indifference is a one-off procedure, not an ongoing adjustment. It makes you indifferent to dying or not, but if someone says later "I will give you this valuable item if you kill yourself/don't kill yourself" then your preferences will change. Conditional statements break indifference.

Or do they? I've been talking informally about "killing" and stuff, but my paper is more precise. You become indifferent to the outcome of a particular quantum event X, which can take values 0 or 1 with some probability. To connect this with dying or not dying, this X was part of a detonation device for some explosives, which detonated on 1, and X was nearly certain to output 1. Then the AI was made indifferent in a way that reset the values of X=1 worlds to X=0 worlds.

So now if we look at your scenario, you can't actually build an agent that can play the lottery and then "kill you, to your indifference" if it loses. All it can do is activate the detonator. Then indifference applies between the two outcomes of detonation - X=1, death (nearly certain) or X=0, survival (unlikely). However, the lottery has already been lost in both those cases, so this cannot lead to utilities being changed to what would happen if you had won. Technically, indifference applies to worlds that have the same history "up to X", so cannot apply to worlds where X never happens.

In fact, I think that, unless there is a possibility for someone to manipulate the probability of quantum events (ie superscience), the only way to break indifference is to specifically say "AI, I will reward you if you stop being indifferent". Conditional statements and precommitments made before X will be caught by indifference; rewards promised for X=0 are added to your utility even if X=1, and rewards promised for X=1 do not show up anywhere in your utility.

Do you feel there is still room for a paradox?

You are quite right. Thank you for clarifying, and sorry for being dense. Part of the confusion came from indistinguishability being defined in terms of observations the AI can make, but most of it was from being silly and failing to carefully read the part outside of the formalism.

No need for apologies! Clarifying things is always useful. I'd forgotten why I'd put certain things in the formalism and not certain others; now I remember, and understand better.

When you started to get into the utility function notation, you said

If U(E)=U(A), then this would be the case.

I can't imagine how the utility of being exploded would equal the utility of being in control. Was this supposed to be sarcastic?

After that I was just lost by the notation--any chance you could expand the explanation?

I can't imagine how the utility of being exploded would equal the utility of being in control.

It is, if we say it will be :-) The utility does not descend from on high; it's created, either by us directly or indirectly through a self-improving AI; and if either says U(E)=U(A), then U(E)=U(A) it is.

Apologies for the notation confusion. Snowyowl's explanation is a good one, and I have little to add to it. Let me know if you want more details.

tl;dr: if an AI is ambivalent about being in control (difficult) and exploding (easy), wouldn't it just explode?

If the AI can't/won't distinguish between "taking control of the world" and "being blown up" how can it achieve consistency with other values?

Say it wants to make paperclips. It might then determine that the overly high value it places on being blown up interferes with its ability to make paperclips. Or say you want it to cure cancer. It might determine that its overly high value on curing cancer interferes with its desire to be blown up.

Assuming its highest utility comes from being in control OR being blown up, I don't see why it wouldn't either self-modify to avoid this or just blow itself up.

In particular, in the example you note, the AI should ALWAYS defect without worrying about the human, unless U(B) > .99 U(A), in which case the AI has effectively seized control without confrontation and the whole point is moot.

It's certainly true that you'll have the chance to blow it up most of the time, but then you'll just end up with a lot of blown up AIs and no real idea whether or not they're friendly.

Now that I've written this, I feel like I'm somewhat conflating the utility of being in control with the utility of things that could be done while in control. If you wanted to resolve this, you would have to make U(A) a function of time, and it is unclear to me how you would set U(E) to vary in the same way sensibly - that is, A has more utility as the AI accomplishes more things, whereas E doesn't. If the AI is self-aware and self-modifying, it should realize that E means losing out on all future utility from A and self-modify its utility for E either down, so that it can achieve more of its original utility, or up to the point of wanting to be exploded, if it decides somehow that E has the same "metaphysical importance" as A and therefore it has been overweighting A.

If the AI is self-aware and self-modifying, it should realize that E means losing out on all future utility from A and self-modify its utility for E either down [...] or up [...]

There is no "future utility" to lose for E. The utility of E is precisely the expected future utility of A.

The AI has no concept of effort, other than that derived from its utility function.

The best idea is to be explicit about the problem; write down the situations, or the algorithm, that would lead to the AI modifying itself in this way, and we can see if it's a problem.

U(E)=U(A) is what we desire. That is what the filter is designed to achieve: it basically forces the AI to act as though the explosives will never detonate (by considering the outcome of a successful detonation to be the same as a failed detonation). The idea is to ensure that the AI ignores the possibility of being blown up, so that it does not waste resources on disarming the explosives - and can then be blown up. Difficult, but very useful if it works.

The rest of the post is (once you wade through the notation) dealing with the situation where there are several different ways in which each outcome can be realised, and the mathematics of the utility filter in this case.

As an alternative, how about telling it that it won't be destroyed? This could have problems. It might assume that a meteor won't hit the Earth, as it would destroy the AI, which won't happen. Perhaps you could tell it that it won't be destroyed by people. Another problem is that it might give its off switch to everyone, figuring it will make them slightly happier to know that they're in control, and it doesn't realize someone is going to press theirs almost immediately.

This is equivalent to setting the utility of it being destroyed to the expected utility.

Indifference seems fairly simple to model - set the utilities equal to one another and make sure the agent is lazy - so it doesn't intervene unnecessarily. Agents are naturally lazy - due to the costs of acting - so making the agent lazy will probably be done for you.

That's the point of this post: you set the utilities of E and A to be equal. The difficulty comes when you realise that A is not a single outcome; it's a massive set of plots which all have different results and different odds of success. Similarly for E - a universe in which a backup of the AI's code survives probably has a fairly high utility, considering that it dies. The question is - which utility do you make equal to which?

You set the expected utility of stopping equal to the expected utility of not stopping in order to obtain the desired indifference between these two outcomes. It doesn't matter that the former utility is a moving target - since you dynamically update the latter utility to track it.

You have two problems here. The first one is the one I mentioned - once you've set up the equality, what happens if the AI learns something that makes certain universes more likely than others?

For instance, let W1' be a universe in which the AI has a backup, W1 one in which it does not, and similarly for W0' and W0.

Initially, U(W0')=U(W1')=U(W0)=2 (it doesn't care what happens if it's backed up) and U(W1)=0 (it "doesn't want to die"). Apply the filter, and get:

F(U)(W1)=1, F(U)(W1')=3, F(U)(W0')=2, F(U)(W0)=2.

So it's indifferent. But then it discovers that it doesn't have a backup; now the relevant ones are W1 and W0, and it prefers W0. So it's no longer indifferent.
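
A quick numeric check of this example, using the first (set-level) filter from the post and assuming equal world weights:

```python
# Verifying the numbers above: apply F(U)(W1) = U(W1) - EU1 + EU0 with
# U(W0') = U(W1') = U(W0) = 2 and U(W1) = 0, all worlds equally likely.

U = {"W0": 2, "W0'": 2, "W1": 0, "W1'": 2}
S0, S1 = ["W0", "W0'"], ["W1", "W1'"]
eu0 = sum(U[w] for w in S0) / len(S0)          # 2.0
eu1 = sum(U[w] for w in S1) / len(S1)          # 1.0
FU = {w: (U[w] if w in S0 else U[w] - eu1 + eu0) for w in U}
print(FU)                                       # W1 -> 1, W1' -> 3, W0' -> 2, W0 -> 2

# It then discovers it has no backup, ruling out W0' and W1':
print(FU["W0"], FU["W1"])                       # 2 vs 1: no longer indifferent
```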

The other option is to have it change its utility every time new information comes in, to track the changes. But this is bad. For a start, it will no longer be a utility maximiser, which will expose it to predictable weaknesses (see this). Secondly, a self-improving AI will try and get rid of this as it self-improves, as self-improving AIs move towards utility maximisers.

And lastly, it has all sorts of unintended consequences; the AI, for instance, may decide not to pay attention to certain information (or to only pay attention selectively) because this is the easiest way to accomplish its current goals.

You have two problems here.

FWIW, I couldn't make any sense out of the second supposed problem.

If you update your utility every time new information comes in, the utility is time-inconsistent. This lets you be money-pumped. Hence it's the kind of thing you would get rid of at your next self-improvement.

The utility function is always the same in this kind of scenario - and is not "updated".

It typically says something roughly like: stop button not pressed: business as normal - stop button pressed: let the engineers dismantle your brain. That doesn't really let you be money-pumped because - for one thing, a pump needs repeated cycles to do much work. Also, after being switched off the agent can't engage in any economic activities.

Agents won't get rid of such stipulations as they self-improve - under the assumption that a self-improving agent successfully preserves its utility function. Changing the agent's utility function would typically be very bad - from the point of view of the agent.

The other option is to have it change its utility every time new information comes in, to track the changes.

Right.

But this is bad. For a start, it will no longer be a utility maximiser, which will expose it to predictable weaknesses (see [link]). Secondly, a self-improving AI will try and get rid of this as it self-improves, as self-improving AIs move towards utility maximisers.

That doesn't seem to make much sense. The machine maximises utility until its brain is switched off - when it stops doing that - for obvious reasons. Self improvement won't make any difference to this - under the assumption that self-improvement successfully preserves the agent's utility function.

Anyway, the long-term effect of self-improvement is kind-of irrelevant for machines that can be stopped. Say it gets it into its head to create some minions, and "forgets" that they also need to be switched off when the stop button is pressed. If a machine is improving itself in a way that you don't like, you can typically stop it, reconfigure it, and then try again.

That doesn't seem to make much sense. The machine maximises utility until its brain is switched off - when it stops doing that - for obvious reasons. Self improvement won't make any difference to this - under the assumption that self-improvement successfully preserves the agent's utility function.

An entity whose utility function is time-inconsistent will choose to modify itself into an entity whose utility function is time-consistent - because it's much better able to achieve some approximation of its original goals if it can't be money pumped (where it will achieve nearly nothing).

Anyway, the long-term effect of self-improvement is kind-of irrelevant for machines that can be stopped. Say it gets it into its head to create some minions, and "forgets" that they also need to be switched off when the stop button is pressed. If a machine is improving itself in a way that you don't like, you can typically stop it, reconfigure it, and then try again.

Stalin could have been stopped - all it takes is a bullet through the brain, which is easy. An AI can worm itself into human society in such a way the "off switch" becomes useless; trying to turn it off will precipitate a disaster.

An entity whose utility function is time-inconsistent will choose to modify itself into an entity whose utility function is time-consistent [...]

Here the agent wants different things under different circumstances - which is perfectly permissible. Before the button is pressed, it wants to do its day job, and after the button is pressed, it is happy to let engineers dismantle its brain (or whatever).

You can't "money-pump" a machine just because you can switch it off!

Also: many worlds? self-improvement? If this thread is actually about making a machine indifferent, those seem like unnecessary complications - not caring is just not that difficult.

An AI can worm itself into human society in such a way the "off switch" becomes useless; trying to turn it off will precipitate a disaster.

Maybe - if people let it - or if people want it to do that. An off switch isn't a magical solution to all possible problems. Google has an off switch - but few can access it. Microsoft had an off switch - but sadly nobody pressed it. Anyway, this is getting away from modelling indifference.

See http://selfawaresystems.files.wordpress.com/2008/01/ai_drives_final.pdf where Omohundro argues that a general self-improving AI will seek to make its utility function time-consistent.

Do you understand that paper yourself? That paper is about general drives that agents will tend to exhibit - unless their utility function explicitly tells them to behave otherwise. Having a utility function that tells you to do something different once a button has been pressed clearly fits into the latter category.

An example of an agent that wants different things under different circumstances is a fertile woman. Before she is pregnant, she wants one set of things, and after she is pregnant, she wants other, different things. However, her utility function hasn't changed, just the circumstances in which she finds herself.

Can you make money from her by buying kids toys from her before she gets pregnant and selling them back to her once she has kids? Maybe so - if she didn't know whether she was going to get pregnant or not - and that is perfectly OK.

Remember that the point of a stop button is usually as a safety feature. If you want your machine to make as much money for you as possible, by all means leave it turned on. However, if you want to check it is doing OK, at regular intervals, you should expect to pay some costs for the associated downtime.

Do you understand that paper yourself?

Yes.

Can I remind you what we are talking about: not a single stop button, but a "utility function" that is constantly modified whenever new information comes in. That's the kind of weakness that will lead to systematic money pumping. The situation is more analogous to me being able to constantly change whether a woman is pregnant and back again, and buying and selling her children's toys each time. I can do that, through the information presented to the AI. And the AI, no matter how smart, will be useless at resisting that, until the moment where it 1) stops being a utility maximiser or 2) fixes its utility function.

It's not the fact that the utility function is changing that is the problem, so a self-improving AI is fine. It's the fact that it's systematically changing in response to predictable inputs.

Can I remind you what we are talking about: not a single stop button, but a "utility function" that is constantly modified whenever new information comes in.

After backtracking - to try and understand what it is that you think we are talking about - I think I can see what is going on here.

When you wrote:

The other option is to have it change its utility every time new information comes in, to track the changes.

...you were using "utility" as an abbreviation for "utility function"!

That would result in a changing utility function, and - in that context - your comments make sense.

However, that represents a simple implementation mistake. You don't implement indifference by using a constantly-changing utility function. What changes - in order to make the utility of being switched off track the utility of being switched on - is just the utility associated with being switched off.

The utility function just has a component which says: "the expected utility of being stopped is the same as if not stopped". The utility function always says that - and doesn't change, regardless of sensory inputs or whether the stop button has been pressed.

What changes is the utility - not the utility function. That is what you wrote - but was apparently not what you meant - thus the confusion.

Yes, I apologise for the confusion. But what I showed in my post was that implementing "the expected utility of being stopped is the same as if not stopped" has to be done in a cunning way (the whole thing about histories having the same stem) or else extra information will get rid of indifference.


This post doesn't seem to have appeared in the Recent Posts side-bar yet...

I want to second RolfAndreassen's viewpoint below.

The problem with this entire train of thought is that you completely skip past the actual real difficulty, which is constructing any type of utility function even remotely as complex as the one you propose.

Your hypothetical utility function references undefined concepts such as "taking control of", "cooperating", "humans", and "self", etc etc

If you actually try to ground your utility function and go through the work of making it realistic, you quickly find that it ends up being something on the order of complexity of a human brain, and it's not something that you can easily define in a few pages of math.

I'm skeptical then about the entire concept of 'utility function filters', as it seems their complexity would be on the order of or greater than the utility function itself, and you need to keep constructing an endless sequence of such complex utility function filters.

A more profitable route, it seems to me, is something like this:

Put the AIs in a matrix-like sim (a future evolution of current computer game & film simulation tech) and get a community of a few thousand humans to take part in a Truman Show-like experiment. Indeed, some people would pay to spectate or even participate, so it could even be a for-profit venture. A hierarchy of admins and controls would ensure that potential 'liberators' were protected against. In the worst case, you can always just rewind time. (something the Truman Show could never do - a fundamental advantage of a massive sim)

The 'filter function' operates at the entire modal level of reality: the AIs think they are humans, and do not know they are in a sim. And even if they suspected they were in a sim (ie by figuring out the simulation argument), they wouldn't know who were humans and who were AIs (and indeed they wouldn't know which category they were in). As the human operators would have godlike monitoring capability over the entire sim, including even an ability to monitor AI thought activity, this should make a high level of control possible.

They can't turn against humans in the outside world if they don't even believe it exists.

This sounds like a science fiction scenario (and it is), but it's also feasible, and I'd say far more feasible than approaches which directly try to modify, edit, or guarantee the mindstates of AIs who are allowed to actually know they are AIs.

If you allow the AIs to know what humans are like, then it won't take them more than a few clicks to figure out they're not human. And if they don't know what humans are like - well, we can't ask them to answer much in the way of human questions.

Even if they don't know initially, the questions we ask, the scenarios we put them in, etc... it's not hard to deduce something about the setup, and about the makeup of the beings behind it.

Monitoring makes us vulnerable; the AI can communicate directly with us through its thoughts (if we can fully follow its thoughts, then it's dumber than us, and not a threat; if we can't fully follow them, it can notice that certain thought patterns generate certain responses, and adjust its thinking accordingly. This AI is smart; it can lie to us on levels we can't even imagine). And once it can communicate with us, it can get out of the box through social manipulation without having to lift a finger.

Lastly, there is no guarantee that an AI that's nice in such a restricted world would be nice on the outside; indeed, if it believes the sim is the real world, and the outside world is just a dream, then it might deploy lethal force against us to protect the sim world.

If you allow the AIs to know what humans are like, then it won't take them more than a few clicks to figure out they're not human

The whole idea is that the AIs would be built around at least loosely brain-inspired designs, and would grow up thinking they were humans, living in a perfect sim of human life, no different than your own.

I find it likely that we could allow their architecture to differ significantly from human anatomy and they wouldn't have enough information to discern the discrepancy.

Monitoring makes us vulnerable; the AI can communicate directly with us through its thoughts (if we can fully follow its thoughts, then it's dumber than us, and not a threat; if we can't fully follow them, it can notice that certain thought patterns generate certain responses, and adjust its thinking accordingly. This AI is smart; it can lie to us on levels we can't even imagine). And once it can communicate with us, it can get out of the box through social manipulation without having to lift a finger.

You have some particular assumptions which I find highly questionable and which would require lengthy, complex trains of support. If the AIs are built around designs even somewhat similar to human brains (remember, that is my starting assumption), we could easily follow their trains of thought, especially with the assistance of automated narrow AI tools. Secondly, smarter & dumber are not useful descriptions of intelligence. We know from computational complexity theory that there are roughly 3 dimensions to intelligence: speed, size, and efficiency. If you look at computer tech and where it's going, it looks like the advantages will arrive unequally in roughly the order listed.

Saying something is 'smarter' or 'dumber' isn't a useful quantifier or qualifier, it is more a statement of ignorance on part of the speaker about the nature of intelligence itself.

Finally, for the AI to communicate with us, it would have to know we exist in the first place. And then it would have to believe that it has some leverage in an outside world it can only speculate on, and so on.

Do you really, really think that as AIs increase in intelligence they would all rationally conclude that they are in a sim-world administered by invisible entities less intelligent than themselves, and that they should seek to communicate with said invisible entities and attempt to manipulate them?

Do you believe that you are in such a sim world? Have you tried communicating with invisible humans lately?

If you find it 'obvious' that such a belief is completely irrational, but a rational AI more intelligent than you would reach such an irrational conclusion, then you clearly have some explaining to do.

The mind space of AIs is vast - far larger than anything we can imagine. Yes, I do agree that AIs modelled nearly exactly on human brains could be fooled into thinking they are humans. But the more they deviate from being human, the more useful and the more dangerous they become. Having human-like AIs is no more use to us than having... humans.

The mind space of humans is vast. It is not determined by genetics, it is determined by memetics, and AIs would necessarily inherit our memetics and thus will necessarily start as samples in our mindspace.

To put it in LW lingo, AIs will necessarily inherit our priors, assumptions, and our vast mountain of beliefs and knowledge.

The only way around this would be to evolve them in some isolated universe from scratch, but that is in fact more dangerous besides just being unrealistic.

So no, the eventual mindspace of AIs may be vast, but that mindspace necessarily starts out as just our mindspace, and then expands.

Having human-like AIs is no more use to us than having... humans.

And this is just blatantly false. At the very least, we could have billions of Einstein-level intelligences who all thought thousands of times faster than us. You can talk all you want about how much your non-human-like AI would be even so much better than that, but at that point we are just digressing into an imaginary pissing contest.

The mind space of humans is vast. It is not determined by genetics, it is determined by memetics, and AIs would necessarily inherit our memetics and thus will necessarily start as samples in our mindspace.

The Kolmogorov complexity of humans is quite high. See this list of human universals; every one of the elements on that list cuts the size of humans in general mind space by a factor of at least two, probably much more (even those universals that are only approximately true do this).

This list doesn't really help your point:

  1. Almost all of the linguistic 'universals' are universal to languages, not humans - and would necessarily apply to AIs who speak our languages
  2. Most of the social 'universals' are universal to societies, not humans, and apply just as easily to birds, bees, and dolphins: coalitions, leaders, conflicts?

AIs will inherit some understanding of all the idiosyncrasies of our complex culture just by learning our language and being immersed in it.

Kolmogorov complexity is not immediately relevant to this point. No matter how large the evolutionary landscape is, there are a small number of stable attractors in that landscape that become 'universals', species, parallel evolution, etc.

We are not going to create AIs by randomly sampling mindspace. The only way they could be truly alien is if we evolved a new simulated world from scratch with its own evolutionary history and de novo culture and language. But of course that is unrealistic and unuseful on so many levels.

They will necessarily be samples from our mindspace - otherwise they wouldn't be so useful.

They will necessarily be samples from our mindspace - otherwise they wouldn't be so useful.

Computers so far have been very different from us. That is partly because they have been built to compensate for our weaknesses - to be strong where we are weak. They compensate for our poor memories, our terrible arithmetic module, our poor long-distance communication skills - and our poor ability at serial tasks. That is how they have managed to find a foothold in society - before mastering nanotechnology.

IMO, we will probably be seeing a considerable amount more of that sort of thing.

Computers so far have been very different from us. [snip]

Agree with your point, but so far computers have been extensions of our minds and not minds in their own right. And perhaps that trend will continue long enough to delay AGI for a while.

For AGI, for them to be minds, they will need to think and understand human language - and this is why I say they "will necessarily be samples from our mindspace".

Your hypothetical utility function references undefined concepts such as "taking control of", "cooperating", "humans", and "self", etc etc

If you actually try to ground your utility function and go through the work of making it realistic, you quickly find that it ends up being something on the order of complexity of a human brain, and it's not something that you can easily define in a few pages of math.

Don't get confused by the initial example, which was there purely for illustration (as I said, if you knew all these utility values, you wouldn't need any sort of filter, you'd just set all utilities but U(B) to zero).

It's because these concepts are hard that I focused on indifference, which, it seems, has a precise mathematical formulation. You can implement general indifference without understanding anything about U at all.

I'm skeptical then about the entire concept of 'utility function filters', as it seems their complexity would be on the order of or greater than the utility function itself, and you need to keep constructing an endless sequence of such complex utility function filters.

The description of the filter is in this blog post; a bit more work will be needed to see that certain universes are indistinguishable up until X. But this can be approximated, if needed.

U, on the other hand, can be arbitrarily complex.

This is interesting, because once you have AI you can use it to make a simulation like this feasible, by making the code more efficient, monitoring the AI's thoughts, etc, and yet the "god AI" wouldn't be able to influence the outside world in any meaningful way, and its modification of the inside world would be heavily restricted to just alerting admins about problems, making the simulation more efficient, and finding glitches.

All you have to do is feed the original AI some basic parameters (humans look like this, cars have these properties, etc) and it can generate its own laws of physics and look for inconsistencies; that way the AI would have a hard time figuring it out and abusing bugs.

I don't think it's necessary to make the AIs human, though. You could run a variety of different simulations. In some, the AIs would be led into a scenario where they would have to do something or other (maybe make CEV) that would be useful in the real world, but you want to test it for hidden motives and traps in the simulation first before you implement it.

Despite a number of assumptions here that would have to be true first (like the development of AI in the first place), a real concern would be how you manage such an experiment without the whole world knowing about it, or with the whole world knowing about it but making it safe so that terrorists can't blow it up, hackers can't tamper with it, and spies can't steal it. The world's reaction to AI is my biggest concern in any AI development scenario.

Despite a number of assumptions here that would have to be true first (like the development of AI in the first place)

A number of assumptions, yes, but actually I see this as a viable route to creating AI, not something you do after you already have AI. Perhaps the biggest problem in AI right now is the grounding problem - actually truly learning what nouns and verbs mean. I think the most straightforward viable approach is simulation in virtual reality.

real concern would be how you manage such an experiment without the whole world knowing about it, or with the whole world knowing about it but making it safe so that terrorists can't blow it up, hackers can't tamper with it, and spies can't steal it. The world's reaction to AI is my biggest concern in any AI development scenario.

I concur with your concern. However, I don't know if such an experiment necessarily must be kept a secret (although that certainly is an option, and if/when governments take this seriously, it may be so).

On the other hand, at the moment most of the world seems to be blissfully unconcerned with AI.