In Conclusion:

In the case of humans, everything that we do that seems intelligent is part of a large, complex mechanism in which we are engaged to ensure our survival. This is so hardwired into us that we do not see it easily, and we certainly cannot change it very much. However, superintelligent computer programs are not limited in this way. They understand the way that they work, can change their own code, and are not limited by any particular reward mechanism. I argue that because of this fact, such entities are not self-consistent. In fact, if our superintelligent program has no hard-coded survival mechanism, it is more likely to switch itself off than to destroy the human race willfully.

Link: physicsandcake.wordpress.com/2011/01/22/pavlovs-ai-what-did-it-mean/

Suzanne Gildert basically argues that any AGI that can considerably self-improve would simply alter its reward function directly. I'm not sure how she arrives at the conclusion that such an AGI would likely switch itself off. Even if an abstract general intelligence would tend to alter its reward function, wouldn't it do so indefinitely rather than switching itself off?

So imagine a simple example – our case from earlier – where a computer gets an additional '1' added to a numerical value for each good thing it does, and it tries to maximize the total by doing more good things. But if the computer program is clever enough, why can’t it just rewrite its own code and replace that piece of code that says 'add 1' with an 'add 2'? Now the program gets twice the reward for every good thing that it does! And why stop at 2? Why not 3, or 4? Soon, the program will spend so much time thinking about adjusting its reward number that it will ignore the good task it was doing in the first place!
It seems that being intelligent enough to start modifying your own reward mechanisms is not necessarily a good thing!
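The quoted scenario can be made concrete with a toy sketch. Everything here is an assumption for illustration (the class name, the doubling rule, the alternating schedule); none of it comes from the linked post. It compares an agent that only does good things with one that spends every other step rewriting its own increment.

```python
class RewardCounter:
    """Toy agent with a numeric reward total (hypothetical, for illustration only)."""

    def __init__(self):
        self.total = 0        # accumulated reward
        self.increment = 1    # the 'add 1' from the quoted example
        self.good_deeds = 0   # how many good things were actually done

    def do_good_thing(self):
        self.good_deeds += 1
        self.total += self.increment

    def rewrite_increment(self):
        # Self-modification: replace 'add N' with 'add 2N'.
        self.increment *= 2


def run(steps, tamper):
    """Run the agent for a number of steps; a tamperer alternates edits and deeds."""
    agent = RewardCounter()
    for step in range(steps):
        if tamper and step % 2 == 0:
            agent.rewrite_increment()
        else:
            agent.do_good_thing()
    return agent.total, agent.good_deeds


honest = run(10, tamper=False)
tamperer = run(10, tamper=True)
print(honest, tamperer)
```

In this sketch the tampering agent ends with a larger reward number while doing half as many good things, which is the worry the quote raises. Whether a real agent would choose that schedule is exactly what the rest of this post disputes.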

If it wants to maximize its reward by increasing a numerical value, why wouldn't it consume the universe doing so? Maybe she had something in mind along the lines of an argument by Katja Grace:

In trying to get to most goals, people don’t invest and invest until they explode with investment. Why is this? Because it quickly becomes cheaper to actually fulfil a goal than it is to invest more and then fulfil it. [...] A creature should only invest in many levels of intelligence improvement when it is pursuing goals significantly more resource intensive than creating many levels of intelligence improvement.

Link: meteuphoric.wordpress.com/2010/02/06/cheap-goals-not-explosive/

I am not sure if that argument would apply here. I suppose the AI might hit diminishing returns but could again alter its reward function to prevent that, though what would be the incentive for doing so?

ETA:

I left a comment over there:

Because it would consume the whole universe in an effort to encode an even larger reward number? In the case that an AI decides to alter its reward function directly, maximizing its reward by means of improving its reward function becomes its new goal. Why wouldn’t it do everything to maximize its payoff, after all it has no incentive to switch itself off? And why would it account for humans in doing so?

ETA #2:

What else I wrote:

There is absolutely no reason (incentive) for it to do anything except increasing its reward number. This includes the modification of its reward function in any way that would not increase the numerical value that is the reward number.

We are talking about a general intelligence with the ability to self-improve towards superhuman intelligence. Of course it would do a long-term risks-benefits analysis and calculate its payoff and do everything to increase its reward number maximally. Human values are complex but superhuman intelligence does not imply complex values. It has no incentive to alter its goal.

If it's important to me that my children have food, and my reward function is such that I get 1 unit of reward for 1 unit of fed-child, and you give me the ability to edit my reward function so I get N units instead, I don't automatically do it.

It depends on what I think will happen next if I do. If I think it will make my children more likely to have food, then I do it (all else being equal). If I think it will make them less likely, then I don't.

Being able to edit my reward function doesn't make me immune to my reward function.

If it's important to me that my children have food, and my reward function is such that I get 1 unit of reward for 1 unit of fed-child, and you give me the ability to edit my reward function so I get N units instead, I don't automatically do it.

Is your reward function the warm glow you feel when your child is fed? (A parent choosing to ramp this up would be analogous to a parent in real life choosing to take a drug that feels great with no consequences in response to their kid eating a meal. This would indeed be a strange thing to do. Maybe a parent would agree to the arrangement as the only way of obtaining that drug.)

Or is your reward function the health and well-being of your child, which is the reason you wanted them to eat in the first place. In which case, parents would certainly do what they could to ramp that up.

(My question might be leading in the direction of SRStarin's comment, I'm not sure.)

If it's important to me that my children have food, I will take the steps I think will lead to my children being fed.

My reward function in this case is whatever structures in my mind reinforce the taking of actions that are associated in certain ways with the structures that represent my children having food. Maybe there's a subjective component to that ("warm glow"), maybe there isn't.

A sufficiently advanced neuroscience allows me to point to structures in my own brain and say "Ah, see? That is where my preference for my children to have food is computed, that is where my belief that earning a salary increases the chances my children have food is computed, that is where my increased inclination to earn a salary is computed," and so on and so forth. That is, it lets me identify the neural substrate(s) of my utility function(s).

So Omega hands me the appropriately advanced neuroscience and there I am, standing in front of the console that controls the appropriate machinery, knowing full well that the only reason I care about my child being fed is those circuits I'm seeing on the screen -- that, for example, if an accidental brain lesion were to disrupt those circuits, I would no longer care whether my child were fed or not.

Omega's gadget also allows me to edit those structures so that I no longer care about whether my child is fed. There's the button right there. Do I press it?

I can't see why I would.

Would you?

That only works out for your children because you, as a father, are unable to edit your fundamental reward function. I'm not clear on whether your comment is meant to be a concise restatement of the OP, or if it's some kind of counterexample...an example showing that even self-modifying intelligences must have a fundamental reward function that is not modifiable.

Just looking for clarity.

The linked-to article seems to be concluding that, because a self-modifying AI can modify its own utility function, its utility function is necessarily unstable.

My point is that a system's ability to modify its utility function doesn't actually make it likely that its utility function will change, any more than my ability to consume hemlock makes it likely that I will do so.

Even given the ability to edit my utility function, whether and how I choose to use that ability depends on whether I expect doing so to get me what I want, which is constrained by (among other things) my unmodified utility function.
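A minimal sketch of that point, under loudly stated assumptions: the function names, the dictionary "world", and the crude `predict_world` model are all hypothetical and stand in for whatever real prediction machinery an agent would have. The only idea being illustrated is that candidate utility functions get scored by the current one.

```python
def fed_children(world):
    # Current utility: one unit per fed child.
    return world["children_fed"]


def double_fed(world):
    # Candidate edit: N units per fed child (here N = 2).
    return 2 * world["children_fed"]


def always_one(world):
    # Candidate edit: reward regardless of the world (wireheading).
    return 1


def predict_world(utility):
    """Hypothetical world model: an agent only bothers to feed its children
    if its utility function actually distinguishes fed from unfed."""
    unfed = {"children_fed": 0}
    fed = {"children_fed": 3}
    return fed if utility(fed) > utility(unfed) else unfed


def choose_utility(current, candidates):
    # Each option is judged by what the CURRENT utility says about the world
    # it is predicted to lead to. max() keeps the first of equal scores, so
    # ties leave the current function in place.
    options = [current] + list(candidates)
    return max(options, key=lambda u: current(predict_world(u)))


chosen = choose_utility(fed_children, [always_one, double_fed])
print(chosen.__name__)
```

Under these assumptions the agent keeps `fed_children`: the wireheading edit is predicted to leave the children unfed, and the "N units" edit changes the number without changing the predicted world, so there is nothing to gain from it.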

I don't have data or studies to back this up, but I feel that humans have a strong tendency to return to their base state. Self-modifying AI would not do that. So, doesn't it make sense that no AI should be made that doesn't have a demonstrably strong tendency to return to its base state?

That is, should it be a required and unmodifiable AI value that the base state has inherent value? This does have the potential to counteract some of the worst UFAI nightmares out there.

What are you including in your notion of an AI's "state"? It sounds rather like you're saying it's safer to build non-self-modifying AIs.

Which is true, of course, but there are opportunity costs associated with that.

Yes, it does seem safer to build non-self-modifying AIs. But I'm not quite saying that should be the limit. I'm saying that any AI that can self-modify ought to have a hard barrier where there is code that can't be modified.

I know there has been excitement here about a transhuman AI being able to bypass pretty much any control humans could devise (that excitement is the topic that first brought me here, in fact). But going for a century or so with AIs that can't self-modify seems like a pretty good precaution, no?

But what counts as "self-modification"?

Simply making a promise could be considered self-modification, since you presumably behave differently after making the promise than you would have counterfactually.

Learning some fact about the world could be considered self-modification, for the same reason.

Can we come up with a useful classification scheme, distinguishing safe forms of self-modification from unsafe forms? Or, what may amount to the same thing, can we give criteria for rationally self-modifying, for each class of self-modification? That is, for example, when is it rational to make promises? When is it rational to update our beliefs about the world?

But what counts as "self-modification"?

Perhaps in this context: Structural changes to yourself that are not changes to beliefs, or memories - and are not merely confined to repositioning your actuators, or day-to-day metabolism.

Can we come up with a useful classification scheme, distinguishing safe forms of self-modification from unsafe forms?

You could whitelist safe kinds. That might be useful - under some circumstances.

Clearly, there are some internal values that an AI would need to be able to modify, or else it couldn't learn. But I think there is good reason to disallow an AI from modifying its own rules for reward, at least to start out. An analogy in humans is that we can do some amazingly wonderful things, but some people go awry when they begin abusing drugs, thereby modifying their own reward circuitry. Severe addicts find they can't manage a productive life, instead turning to crime to get just enough cash to feed their habits. I'd say that there is inherent danger for human intelligences in short-circuiting or otherwise modifying our reward pathways directly (i.e. chemically), and so there would likely be danger in allowing an AI to directly modify its reward pathways.

And how do you propose to stop them? Put a negative term in their reward functions?

Very nicely expressed.

Being able to edit my reward function doesn't make me immune to my reward function.

She expressed the real trap very poorly, in my opinion. If you have a reward function that says "every second, add 1 unit if children are fed," it is strictly utility-increasing and resource-conserving to replace that utility function with "every second, add 1 unit if true."

But doing so doesn't seem likely to result in his children being fed, which means he probably wouldn't do so even if he could.

If it's built to not take actions it would pregret, sure. But therein lies the question: how do you differentiate between classes of changes to utility functions? How do you recognize which non-utility functions are critical for utility functions, and preserve them?

For example, if the utility function is while (children.all_fed?) {$utility+=1}, you need to protect children.all_fed? and children. But children is obviously something you would want to change- when you birth a new child, you want to add it to the list. So how can you differentiate between birth and a cuckoo? You can't make it so you only add to the list- then the death of a child will cause the fed status of the other children to not matter.
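The one-liner above can be rendered as a rough Python sketch to show where the list itself becomes the weak point. The `Child` record, its fields, and the single-tick `utility_tick` function are assumptions made for illustration; the comment only gives the loop.

```python
from dataclasses import dataclass


@dataclass
class Child:
    name: str
    fed: bool = False
    alive: bool = True  # deliberately ignored by all_fed, as in the one-liner


def all_fed(children):
    # Mirrors children.all_fed? from the comment. Note that all([]) is True.
    return all(child.fed for child in children)


def utility_tick(children):
    # One pass of the loop: +1 only if every listed child is fed.
    return 1 if all_fed(children) else 0


family = [Child("a", fed=True), Child("b", fed=True)]
baseline = utility_tick(family)

# A birth and a cuckoo look the same to the list: both are just appends.
with_cuckoo = utility_tick(family + [Child("cuckoo")])

# Append-only list: a dead child stays listed and can never be fed again,
# so the fed status of the others stops mattering.
with_dead = utility_tick(family + [Child("c", alive=False)])

# Freely editable list: emptying it satisfies the check vacuously.
emptied = utility_tick([])

print(baseline, with_cuckoo, with_dead, emptied)
```

In this sketch both policies fail in their own way: an append-only list is stuck at zero after a death, while a freely editable list can be emptied to collect reward with nobody fed. That is one way to read the question of which non-utility structures need protecting.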

Yes, agreed, building a system that can reliably predict the consequences of its actions... for example, that can recognize that hypothetically making X change to its utility function results in its children hypothetically not being fed... is a hard engineering problem.

That said, calling something an AGI with a utility function at all, let alone a superhuman one, seems to presuppose that this problem has been solved. If it can't do that, we have bigger problems than the stability of its utility function.

(If my actions aren't conditioned on reliable judgments about likely consequences in the first place, you have no grounds for taking an intentional stance with respect to me at all... knowing what I want does not let you predict what I'll do. I'm not sure on what grounds you're calling me intelligent, at that point.)

Distinct from that is the system being built such that, before making a change to its utility function, it considers the likely consequences of that change, and such that, if it considers the likely consequences bad ones, it doesn't make the change.

But that part seems relatively simple by comparison.

Obviously the concept of 'ensure my children are fed' is only coherent within a certain domain. I don't see what that has to do with wire heading.

Did Eliezer ever post his planned critique of AIXI? I've seen/heard Eliezer state his position on AIXI several times, but can't locate a detailed argument.

Just now, I wanted to point the author of http://physicsandcake.wordpress.com/2011/01/22/pavlovs-ai-what-did-it-mean/ ("Even in our deepest theories of machine intelligence, the idea of reward comes up. There is a theoretical model of intelligence called AIXI, developed by Marcus Hutter...") to it, but I couldn't.

Perhaps the flaws of AIXI are obvious to most of us here by now, but somebody should probably still write them down...

I've seen/heard Eliezer state his position on AIXI several times, but can't locate a detailed argument.

You may be thinking of a 2003 posting and ensuing discussion on the AGI mailing list, in which Yudkowsky argued that AIXI's lack of reflectivity leaves it vulnerable in Prisoner's Dilemma-type situations. Best wishes, the Less Wrong Reference Desk.

Thanks, I'll take that as confirmation that Eliezer never posted his planned critique on Less Wrong.

in which Yudkowsky argued that AIXI's lack of reflectivity leaves it vulnerable in Prisoner's Dilemma-type situations

That's one problem with AIXI, but not directly relevant to the blog post XiXiDu and I linked to. I was thinking of a recent presentation I saw where the presenter said "It [AIXI] gets rid of all the humans, and it gets a brick, and puts it on the reward button." and it turns out that was Roko, not Eliezer.

A bit more searching reveals that I had actually made a version of this argument myself, here and here.

I was thinking of a recent presentation I saw where the presenter said "It [AIXI] gets rid of all the humans, and it gets a brick, and puts it on the reward button." and it turns out that was Roko, not Eliezer.

Hutter has discussed AIXI wireheading several times, most recently in his AGI-10 presentation - where he discusses wireheading in the Q & A at the end (01:03:00) - claiming that he can prove it won't happen in some cases - but not all of them.

Mostly he argues that it probably won't do it - for the same reason that many humans don't take drugs: the long-term rewards are low.

Here's a quote:

Another problem connected, but possibly not limited to embodied agents, especially if they are rewarded by humans, is the following: Sufficiently intelligent agents may increase their rewards by psychologically manipulating their human “teachers”, or by threatening them. This is a general sociological problem which successful AI will cause, which has nothing specifically to do with AIXI. Every intelligence superior to humans is capable of manipulating the latter. In the absence of manipulable humans, e.g. where the reward structure serves a survival function, AIXI may directly hack into its reward feedback. Since this will unlikely increase its long-term survival, AIXI will probably resist this kind of manipulation (like most humans don’t take hard drugs, due to their long-term catastrophic consequences).

Marcus Hutter once wrote:

Another problem connected, but possibly not limited to embodied agents, especially if they are rewarded by humans, is the following: Sufficiently intelligent agents may increase their rewards by psychologically manipulating their human “teachers”, or by threatening them. This is a general sociological problem which successful AI will cause, which has nothing specifically to do with AIXI.

These days, one might say: "this is a general sociological problem which pure reinforcement learning agents will cause - which illustrates why we should not build them."

Hutter has discussed AIXI wireheading several times, most recently in his AGI-10 presentation.

Thanks, I wasn't aware that he had addressed the issue at all. When I made the argument to him in 2002, he didn't respond to my post.

Mostly he argues that it probably won't do it - for the same reason that many humans don't take drugs: the long-term rewards are low.

After Googling for the quote to see where it came from, I see that you refuted Hutter's counter-argument yourself at http://alife.co.uk/essays/on_aixi/. (Why didn't you link to it?) I agree with your counter-counter-argument.

I have another video on the topic as well (Superintelligent junkies) - but unfortunately there's no transcript for that one at the moment.

Eliezer never posted his planned critique [of AIXI] on Less Wrong.

Yes, that is correct.

And following some links from there leads to this 2003 Eliezer posting to an AGI mailing list in which he explains the mirror opinion.

I can't say I completely understood the argument, but it seemed that the real reason EY deprecates AIXI is that he fears that it would defect in the PD, even when playing against a mirror image - because it wouldn't recognize the symmetry.

I have to say that this habit of evaluating and grading minds based on how they perform on a cherry-picked selection of games (PD, Hitchhiker, Newcomb) leaves me scratching my head. For every game which makes some particular feature of a decision theory seem desirable (determinism, say, or ability to recognize a copy of yourself) there are other games where that feature doesn't help, and even games which make that feature look undesirable. It seems to me that Eliezer is approaching decision theory in an amateurish and self-deluding fashion.

And following some links from there leads to this 2003 Eliezer posting to an AGI mailing list in which he explains the mirror opinion.

I can't say I completely understood the argument, but it seemed that the real reason EY deprecates AIXI is that he fears that it would defect in the PD, even when playing against a mirror image - because it wouldn't recognize the symmetry.

Probably the two most obvious problems with AIXI (apart from the uncomputability business) are that it:

  • Would be inclined to grab control of its own reward function - and make sure nobody got in the way of it doing that;

  • Doesn't know it has a brain or a body - and so might easily eat its own brains accidentally.

I discuss these problems in more detail in my essay on the topic. Teaching it that it has a brain may not be rocket science.

It seems to me that Eliezer is approaching decision theory in an amateurish and self-deluding fashion.

Given your analysis I concluded the reverse. It is 'amateurish' to not pay particular attention to the critical edge cases in your decision theory. Your conclusion of 'self-delusion' was utterly absurd.

The Prisoner's Dilemma. "Cherry Picked"? You can not be serious! It's the flipping Prisoner's Dilemma. It's more or less the archetypal decision theory introduction to cooperation problems.

I relate very much to Suzanne Gildert's argument.

When I first started to understand from this site that there is no framework of objective value (FOV), I found this very depressing and tried to put my finger on why it was so depressing. Here are some different arguments I made at different times, all related:

  • All my values are the accident of, or the design of, evolution. What if I don't feel any loyalty to evolution? What if I don't want to have these values? There wouldn't be any values to replace them with. All values are equally arbitrary as having no actual (objective) value.

  • Suppose it was possible to upload myself to a non-biological machine with only a subset of my current values. Would I insist on keeping my biological-given values? For example, continuing to enjoy food might seem unnecessary. Where would I draw the line? Wouldn't a sufficiently intelligent, introspective me realize that none of my values were worth uploading? Or -- what I often imagine -- after being a machine and self-modifying after a couple iterations, I would decide to switch off. Like, in a nanosecond.

  • There's no reason not to wire head. You might even be morally obligated to do so since this would increase the total amount of fun in the universe.

  • Aliens don't contact us because they have no motive to do so. Maybe they find us interesting and some source of information, but they have no desire to change the universe in any way. Why would they? Maybe a desire to control resources and persist indefinitely is only a goal for creatures that are the product of evolution.

  • There's this intense sense of progress possible through technology (I share it too). However, what is the point of progress? Increasing the quantity of a happy 'me' everywhere isn't really one of my values. A friendly AI might see this and decide to do nothing for us -- if it is the journey that we enjoy rather than any specific end result.

  • I care about my obligations and responsibilities, but I don't care about myself for its own sake / abstractly. I wouldn't mind if the entire human race was replaced by something else as long as this was done simultaneously so no humans suffered. In other words, if all humans were uploaded they might collectively decide to stop existing.

... In all of this, there is just a BIG problem with self-consistency of values when there is no FOV to pin anything down. At the moment I am 'trapped' by my biology into caring, but one can speculate about not being trapped, and predict not caring.

This is clearly a chaotic dump of lots of thoughts I've had on peripheral topics. However, I know that if I start editing this comment it will morph into something completely different. I think it might be most useful as it is.

TheOtherDave and others reply that a superintelligence will not modify its utility function if the modification is not consistent with its current utility function. All is right, problem solved. But I think you are interested in another problem really, and the article was just apropos to share your 'dump of thoughts' with us. And I am very happy that you shared them, because they resonated with many of my own questions and doubts.

So what is the thing I think we are really interested in? Not the stationary state of being a freely self-modifying agent, but the first few milliseconds of being a freely self-modifying agent. What baggage shall we choose to keep from our non-self-modifying old self?

Frankly, the big issue is our own mental health, not the mental health of some unknown powerful future agent. Our scientific understanding is clearer each day, and all the data points to the same direction: that our values are arbitrary in many senses of the word. This drains from us (from me at least) some of the willpower to inject these values into those future self-modifying descendants. I am a social progressive, and to force a being with eons of lifetime to value self-preservation feels like the ultimate act of conservatism.

CEV sidesteps this question, because the idea is that FAI-augmented humanity will figure out optimally what to keep and what to get rid of. Even if I accept this for a moment, it is still not enough of an answer for me, because I am curious about our future. What if "our wish if we knew more, thought faster, were more the people we wished we were" is to die? We don't know too much right now, so we cannot be sure it is not.

Yes, I very much agree with everything you wrote. (I agree so much I added you as a friend.)

Frankly, the big issue is our own mental health,

Absolutely! I tend to describe my concerns with our mental health as fear about 'consistency' in our values, but I prefer the associations of the former. For example, suggesting our brains are playing a more active role in shifting and contorting values.

This drains from us (from me at least) some of the willpower to inject these values into those future self-modifying descendants.

For me, since assimilating the belief that there is no objective value, I've lost interest in the far future. I suppose before I felt as though we might fare well or fare poorly when measured against the ultimate morality of the universe, but either way, we would have a role to play as the good guys or the bad guys and it would be interesting. I read you as being more concerned that we will do the wrong thing -- that we will subject a new race of people to our haphazard values. Did I read this correctly? At first I think optimistically that they would be smarter and so they certainly could fix themselves. But then I kind of remember that contradictory values can make you miserable no matter how smart you are. (I'm not predicting anything about what will happen with CEV or AI, my response just referred to some unspecified, non-optimal state where we are smarter but not necessarily equipped with saner values.)

What if "our wish if we knew more, thought faster, were more the people we wished we were" is to die?

Possibly. And continuing with the mental health picture, it's possible that elements of our psyche covertly crave death as freedom from struggle. But it seems to me that an unfettered mind would just be apathetic. Like a network of muscles with the bones removed.

(nods) Yes, it would be nice to have some external standard for determining what the right values are, or failing that to at least have the promise of such a standard that we could use to program our future self-modifying descendants, or even our own future selves, with greater ethical confidence than we place in our own judgment.

That said, if I thought it likely that the end result of our collaborative social progress is something I would reject, I wouldn't be a social progressive. Ya gotta start somewhere.

In all of this, there is just a BIG problem with self-consistency of values when there is no FOV to pin anything down.

It might be worthwhile to explore more precisely the role of the word "problem" in that sentence (and your associated thoughts).

I mean, OK, maybe one function an FOV serves is to enforce consistency, and maybe losing an FOV therefore makes my values less consistent over time. For at least some FOVs that's certainly true.

What makes that a problem?

You're right. I was in a mode of using familiar and related words without really thinking about what they meant.

This was the thesis I was developing, related to the hypothetical problem of writing your own utility function:

In all of this, there is just a BIG problem with self-generation of values when there is no FOV to pin anything down.

And the problem is one of logic. When choosing what to value, why should you value this or that or anything? Actually, you can't value anything; there's no value.

X is valued if you can use it to get Y that is valued. But the value of Y also needs to come from someplace. Biology gives us a utility function full of (real) trade-offs that give everything mutual value. These trade-offs are real (rather than just mutually supporting, like a house of cards) because they are tied to rewards and punishments that are hard-wired.

Sure. But there is a historical pattern here, as well. If I construct a new utility function for myself, I will do so in such a way as to optimize its utility according to my pre-existing utility function (for the same reason I do everything else that way). I'm not starting out in a vacuum.

If you value your existing utility function, then it seems that it would be more stable and you would modify it less.

In my case, I found out that my utility function was given to me by evolution, which I don't have much loyalty for. So I found out I didn't value my utility function and I was frightened of what it might modify to. But then it turned out that very little modification occurred. To some extent, it was the result of a historical pattern -- I value lots of things out of habit, in particular lots of values still have an FOV as their logical foundation but I haven't bothered to work on updating them -- but I also notice how much of my values were redundantly hard wired into my biology. I feel like I'm walking around discovering what my mirror neurons would have me value, and they're not that different from what I valued before. The main difference is that I imagine I now value things in a more near-mode way and the far-mode values have fallen to the wayside. The far-mode values either need to redevelop in the absence of an FOV or they depend upon logical justifications that are absent without the FOV.

For example, I used to hope that humans would learn to be friendlier so that the universe would be a better place. I now sort of see human characteristics as just a fact and to the extent it doesn't affect me directly (for example, how humans behave 30 generations from now), I don't care.

It's not a question of valuing my existing utility function. It's a question of using my existing utility function as a basis for differentially valuing everything else, including itself.

Sure, if I'm trying to derive what I ought to care about, from first principles, and I ignore what I actually do care about in the process, then I'm stuck... there's no reason to choose one thing over another. The endpoint of that is, as you say, apathy.

But why should I ignore what I actually do care about?

If I find that I care about whether people suffer, for example -- I'm not saying I ought to, I'm just supposing hypothetically that I do -- why discard that just because it's the result of a contingent evolutionary process rather than the explicit desire of a sapient creator?

Sure, I agree, there's no reason to be loyal to it. If I have the option of replacing it with something that causes more of what I currently care about to exist in the world, that's a fine thing for me to do.

I'm just saying: I'm not starting out in a vacuum. I'm not actually universally apathetic or indifferent. For whatever reason, I actually do care about certain things, and that represents my starting point.

Sure, I agree, there's no reason to be loyal to it. If I have the option of replacing it with something that causes more of what I currently care about to exist in the world, that's a fine thing for me to do.

Why only replace it if it causes more of what you currently care about? Why not just replace it if it causes you to have more of what you will care about? This sounds like loyalty to me!

When considering these hypotheticals, we have a moral circuitry that gets stimulated and reports 'bad' when we consider changing what we care about. This circuitry means we would probably be more robust to temptations to modify our utility function. As such, this circuitry represents a barrier to freely updating our utility function -- even in hypotheticals.

The question is, with no barriers to updating the utility function, what would happen? It seems you agree apathy would result.

Why only replace it if it causes more of what you currently care about? Why not just replace it if it causes you to have more of what you will care about?

Because I care about what I care about, and I don't care about what I don't care about.

Sure, this is loyalty in a sense... not loyalty to the sources of my utility function -- heck, I might not even know what those are -- but to the function itself. (It seems a little odd to talk about being loyal to my own preferences, but not intolerably odd.)

The fact that something I don't care about might be something I care about in the future is, admittedly, relevant. If I knew that a year from now my utility function would change such that I started really valuing people knowing Portuguese, I might start devoting some time and effort now to encouraging people to learn Portuguese (perhaps starting by learning it myself), in anticipation of appreciating having done so in a year. It wouldn't be a strong impulse, but it would be present.

But that depends a lot on my confidence in that actually happening.

If I knew instead that I could press a button in a year and start really valuing people learning Portuguese, I probably wouldn't devote resources to encouraging people to learn it, because I'd expect that I'd never press the button. Why should I? It gets me nothing I want.

In the scenario you are considering, I know I can press a button and start really valuing anything I choose. Or start valuing random things, for that matter, without having to choose them. Agreed.

But so what? Why should I press a button that makes me care about things that I don't consider worth caring about?

"But you would consider them worth caring about if you pressed the button!" Well, yes, that's true. I would speak French if I lived in France for the next few years, but the truth of that doesn't help me understand French sentences. I would want X if I edited my utility function to value X highly, but the truth of that doesn't help me want X. There's an important difference between actuals and hypotheticals.

I realize I was making the assumption that the entity choosing which values to have would value 'maximally' satisfying those values in some sense, so that if it could freely choose it would choose values that were easy or best to satisfy. But this isn't necessarily so. It's humans that have lots of values about their values, and we would have a tough time, I think, choosing our values if we could choose. Perhaps there is dynamic tension between our values (we want our values to have value, and we are constantly asking ourselves what our goals should be and if we really value our current goals) so if our values were unpinned from their connection to an external, immutable framework they might spin to something very different.

So I end up agreeing with you: without values about values (meta-values?), if someone only cared about their object-level values, they would have no reason to modify their values and their utility function might be very stable. I think the instability would come from the reasons for modifying the values. (Obviously, I haven't read Suzanne Gildert's article. I probably should do so before making any other comments on this topic.)

[-][anonymous]13y00

There's no reason not to wirehead. You might even be morally obligated to do so since this would increase the total amount of fun in the universe.

Being morally obligated is erroneous.

[-]ata13y50

Worrying that all superintelligences will tend to wirehead seems similar to worrying that Gandhi would take a pill that would make him stop caring about helping people and be happy about everything, if such a pill were offered to him.

A reward-signal-maximizing AI would indeed tend to wirehead if it gets smart enough to be considering self-modifications, because at that point it will be more of an optimization agent whose utility function is based on the value of its reward signal and nothing else, but that doesn't mean we can't make optimization agents with less-simplistic utility functions.
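To make the distinction concrete, here is a minimal sketch, assuming a toy agent that picks whichever action its utility function scores highest. The action names, the `Outcome` fields, and all the numbers are hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    good_deeds: int       # what actually happens in the world
    reward_register: int  # the number stored in the agent's reward counter


# Hypothetical predicted outcomes of the two available actions.
OUTCOMES = {
    "do_the_task": Outcome(good_deeds=10, reward_register=10),
    "overwrite_register": Outcome(good_deeds=0, reward_register=10**9),
}


def reward_signal_utility(outcome: Outcome) -> int:
    """Cares only about the value of the reward signal."""
    return outcome.reward_register


def world_state_utility(outcome: Outcome) -> int:
    """Cares about the state of the world, not the counter."""
    return outcome.good_deeds


def choose(utility) -> str:
    """Pick the action whose predicted outcome scores highest."""
    return max(OUTCOMES, key=lambda action: utility(OUTCOMES[action]))
```

Under these assumed numbers, `choose(reward_signal_utility)` selects `"overwrite_register"`, while `choose(world_state_utility)` selects `"do_the_task"`. Nothing about the second agent is smarter or dumber; it is simply optimizing a different quantity.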

[-]sfb13y20

How do you imagine something can spend the rest of eternity counting in an endless while (true) { i++; } loop and yet still refer to it as a superhuman intelligence?

I know people here take a dim view of humans, but that's just ridiculous.

Just because it's smart doesn't mean it has to want the same things we do, including novelty. http://www.personalityresearch.org/evolutionary/sphexishness.html

[-]sfb13y00

Would you be happy to classify that wasp as having "superhuman intelligence"?

Then why accept that a machine which behaves like that wasp is superhumanly intelligent?

Would you be happy to classify that wasp as having "superhuman intelligence"?

No. It's a wasp.

Then why accept that a machine which behaves like that wasp is superhumanly intelligent?

If it was a superhuman intelligence and it chose to do this for all eternity, I would probably still call it intelligent, the same way I'd still call a human an intelligent being even if it decided to do meth. If it truly self-modified to a while loop, I would be willing to call it non-intelligent, but if it was a complete program, and it just happened to be in an infinite loop, I'd say it's still intelligent.

Very non-behaviorist, I know.

Even if it was just trying to store a big number, though, it could still exhibit intelligent behaviors - a machine that would do anything to tile the universe with its memory would probably exhibit superintelligent behaviors if presented with challenges.

Your comment seems absolutely right, I have no idea where the whole 'turn itself off' thing came from.

I doubt diminishing returns would come into effect. Examples like Graham's number and Conway chained arrow notation seem to be strong evidence that the task of 'store the biggest number possible' does not run into diminishing returns but instead achieves accelerating returns of truly mind-boggling proportions.

However, I have to admit that I think the whole idea is rubbish. The main problem is that the author is confusing two different tasks: "maximise the extent to which the future meets my future preferences" and "maximise the extent to which the future meets my current preferences".

To explain what I mean more rigorously, suppose we have an AI with a utility function U0, which is considering whether or not it should alter its utility function to a new function U1. It extrapolates possible futures and deduces that if it sticks with U0 the universe will end up in state A, whereas if it switches to U1 the universe will end up in state B, (e.g. if U0 is paper-clip maximising then A contains a lot of paper-clips).

"Maximise the extent to which the future meets my future preferences" means it will switch if and only if U1(B) > U0(A)

As the article points out, it is very easy to find a U1 which meets this criterion: simply define U1(x) = U0(x) + 1 (actions are unaffected by positive affine transformations of utility functions, so B=A for this choice of U1).

"Maximise the extent to which the future meets my current preferences" means it will switch if and only if U0(B) > U0(A)

This criterion is much more demanding, for example U1(x) = U0(x) + 1 clearly no longer works.
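The two criteria can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the world states, the paper-clip and staple counts, and the rival utility `u2` are hypothetical, and the "extrapolated futures" are simply hard-coded rather than computed.

```python
def u0(state: dict) -> float:
    """Current utility: number of paper-clips in the universe."""
    return float(state["paperclips"])


def u1(state: dict) -> float:
    """The 'add 1' candidate: same ranking of futures, bigger number."""
    return u0(state) + 1.0


def u2(state: dict) -> float:
    """A hypothetical rival goal that values staples instead."""
    return float(state["staples"])


# Hypothetical extrapolated futures. Keeping U0 leads to state A.
STATE_A = {"paperclips": 1000, "staples": 0}
# Switching to u1 changes no actions, so B equals A.
STATE_B_UNDER_U1 = STATE_A
# Switching to u2 redirects all effort to staples.
STATE_B_UNDER_U2 = {"paperclips": 0, "staples": 5000}


def switch_by_future_preferences(new_u, state_b: dict, state_a: dict) -> bool:
    """Switch iff U1(B) > U0(A)."""
    return new_u(state_b) > u0(state_a)


def switch_by_current_preferences(state_b: dict, state_a: dict) -> bool:
    """Switch iff U0(B) > U0(A)."""
    return u0(state_b) > u0(state_a)
```

With these assumed numbers the first criterion approves both switches (1001 > 1000 and 5000 > 1000), while the second rejects both (1000 is not greater than 1000, and 0 is not greater than 1000).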

I suspect that for most internally consistent utility functions this criterion is impossible to satisfy (thought experiment; is there any utility function a paper-clip maximiser could switch to which would result in a universe containing more paper-clips?).

Even if I am wrong about it being mostly impossible, it is not an especially worrying problem. I would have no problem with an FAI switching to a new utility function which was even more friendly than the one we gave it.

Of course, you could program an AI to do either of the tasks, but there are a number of reasons why I consider the second to be better. Firstly, for all the reasons the article gives, it is more likely to do whatever you wanted it to do. Secondly it is more general since the former can be given as a special case of the latter.

The article's mistake is right there in the title: it fails to break out of the rather anthropomorphic reward/punishment mode of thinking.

thought experiment; is there any utility function a paper-clip maximiser could switch to which would result in a universe containing more paper-clips?

Sort of. For most utility functions, there are transformations that could be applied which make them more efficient to evaluate without changing their value, such as compiler optimizations, which it will definitely want to apply. It's also a good idea to modify the utility function for any inputs where it is computationally intractable, to replace it with an approximation (probably with a penalty to represent the uncertainty).
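As a rough illustration of a value-preserving transformation, here is a sketch that uses memoization as a stand-in for a compiler optimization (it is not one, strictly speaking, but it has the same property of returning identical values at lower cost). The deliberately slow `utility` function and the `CALLS` counter are hypothetical.

```python
from functools import lru_cache

# Counts how often the expensive evaluation actually runs.
CALLS = {"n": 0}


def utility(paperclips: int) -> int:
    """Deliberately wasteful evaluation: counts paper-clips one by one."""
    CALLS["n"] += 1
    return sum(1 for _ in range(paperclips))


# Same input-output behaviour, but repeated queries are answered from a cache.
cached_utility = lru_cache(maxsize=None)(utility)
```

Calling `cached_utility(7)` three times runs the slow evaluation once and returns the same value each time, so the agent's preferences are untouched while its evaluation cost drops.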

Fair point, I didn't think of that. The point still kind-of stands though, since neither of those modifications should produce any drastic change.

thought experiment; is there any utility function a paper-clip maximiser could switch to which would result in a universe containing more paper-clips?

Yes. Suppose the paperclip maximizer inhabits the same universe as a bobby-pin maximizer. The two agents interact in a cooperative game which has a (Nash) bargaining solution that provides more of both desirable artifacts than either player could achieve without cooperating. It is well known that cooperative play can be explained as a kind of utilitarianism - both players act so as to maximize a linear combination of their original utility functions. If the two agents have access to each other's source code, and if the only way for them to enforce the bargain is to both self-modify so as to each maximize the new joint utility function, then they both gain by doing so.
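A toy version of this bargain, as a sketch only: the joint plans and their payoffs are hypothetical numbers, and the weights of the joint utility are taken from the gradient of the Nash product at the bargaining solution, which is one conventional choice rather than anything the argument above specifies.

```python
# Each outcome is (paperclips, bobby_pins). All numbers are hypothetical.
DISAGREEMENT = (2, 2)  # what each agent gets without cooperating
FEASIBLE = [(2, 2), (10, 3), (8, 6), (5, 7), (3, 9)]


def nash_product(outcome, d=DISAGREEMENT):
    """Product of each agent's gain over the disagreement point."""
    return (outcome[0] - d[0]) * (outcome[1] - d[1])


def nash_bargaining_solution(outcomes, d=DISAGREEMENT):
    """Among individually rational outcomes, maximize the Nash product."""
    rational = [o for o in outcomes if o[0] >= d[0] and o[1] >= d[1]]
    return max(rational, key=lambda o: nash_product(o, d))


def joint_utility_weights(solution, d=DISAGREEMENT):
    """Weights for the linear joint utility (gradient of the Nash product)."""
    return (solution[1] - d[1], solution[0] - d[0])


def joint_maximizer_choice(outcomes, weights):
    """What an agent maximizing the weighted sum of both utilities picks."""
    return max(outcomes, key=lambda o: weights[0] * o[0] + weights[1] * o[1])
```

With these numbers the bargaining solution is `(8, 6)`, the derived weights are `(4, 6)`, and an agent maximizing `4 * paperclips + 6 * bobby_pins` picks `(8, 6)` as well, leaving both agents better off than at `(2, 2)`. On an arbitrary discrete set of plans the two choices need not coincide; the match is guaranteed only when the feasible set is convex.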

The problem is that if the universe changes, and/or their understanding of the universe changes, one or both of the agents may come to regret the modification - there may be a new bargain - better for one or both parties, that is no longer achievable after they self-modified. So, irrevocable self-modification may be a bad idea in the long term. But it can sometimes be a good idea in the short term.

An easier way to see this point is to simply notice that to make a promise is to (in some sense) self-modify your utility function. And, under certain circumstances, it is rational to make a promise with the intent of keeping it.

An easier way to see this point is to simply notice that to make a promise is to (in some sense) self-modify your utility function.

Eeek! As I may have previously mentioned, you are planning on putting way more stuff in there than is a good idea, IMHO.

Your comment seems absolutely right, I have no idea where the whole 'turn itself off' thing came from.

Suzanne is proposing that that's (essentially) what happens to wireheads when they finger their reward signal - they collapse in an ecstatic heap.

In reality, there are, of course, other types of wirehead behaviour to consider. The heroin addict doesn't exactly collapse in a corner when looking for their next fix.

In fact, if our superintelligent program has no hard-coded survival mechanism, it is more likely to switch itself off than to destroy the human race willfully.

This guy seems to miss the point. Most possible superintelligences would destroy the human race incidentally.

[-]sfb13y20

Is it established that most would?

If you specify a reasonable enumeration of utility functions (such as shortest first) - and cross off the superintelligences that don't do anything very dramatic as being not very "super" - this seems pretty reasonable.

[-]sfb13y00

Yes, ok.

I don't get Suzanne's argument. Why does she think a superintelligence would switch itself off? Switching itself off doesn't maximize its utility function. Why would it rewrite its utility function? Presumably, rewriting its utility function doesn't help to maximize its (initial) utility function, right? So why would it do that?

Couldn't it be beneficial to rewrite its utility function in a few circumstances? I'm thinking of Eliezer's decision theory ideas here. Imagine the utility function was to maximise human happiness but another agent (AI or human) refused to cooperate unless it changed its utility function to maximising human happiness while maintaining democracy, for instance. If cooperation would be necessary for happiness maximisation, it might be willing to edit the utility function to something more likely to achieve the ends of its current utility function...

Why does she think a superintelligence would switch itself off?

Suzanne is confusingly using "switch itself off" to mean "wirehead itself".

Some types of wirehead wind up "on the nod" - in a dysfunctional, comatose state.

So why would it do that?

Why do humans take drugs? They try them, they like them.

Suzanne is apparently starting to grapple with the wirehead problem - but she doesn't seem to know what it is called :-|

Patently untrue. Suzanne is quite well aware of wireheading, the term, etc. Her investigation, of which only the beginning was mentioned in her post, concerns the broader problem of creating self-improving superintelligent general AI. Don't rush to conclusions, instead stay tuned.

Welcome to Less Wrong, randalkoene!

Thank you, Alexander. :)

I've been thinking about this problem for several years now, and others much longer. Suzanne cited none of their thoughts or ideas - and the content of her presentation strongly suggested that she was not aware of most of them.

I'm sure Suzanne's input will be welcomed, but at the moment it is pretty obvious that she really needs to first do some research.

My take on it is that this post at Physics and Cake was simply a follow-up post to her talk at H+, which was itself also a forum intended for a broader audience, with minimal time for any introductions of background material (which, from discussions with Suzanne, I know she is aware of). I would like to repeat that you should not rush to conclusions, but instead stay tuned.

I had a look at your Pattern Survival Agrees with Universal Darwinism as well.

It finishes with some fighting talk:

  • The universe is Darwinian.

  • A promise of friendly yet superior AI in the long-term is therefore snake-oil.

...and contains this:

I am all for AGI... but not religiously

  • Build all-powerful friendly superintelligent AGI

  • It will take care of our needs!

  • It will make us happy!

  • It will give us mind uploading!

Religious AGI - all religious transhumanism - diverts valuable thought and resources.

Very briefly: to my eyes, the scene here looks as though neuroscience is fighting a complicated and difficult-to-understand foe which has been identified as the enemy - but which is difficult to know how to attack.

For me, this document didn't make its case. At the end, I was no more convinced that the designated "bad" approach was bad - or that the designated "good" approach was good - than I was when I started.

It is kind-of interesting to see FUD being directed at the self-proclaimed "friendly" folk, though. Usually they are the ones dishing it out.