Kawoomba comments on The genie knows, but doesn't care - Less Wrong

54 Post author: RobbBB 06 September 2013 06:42AM


Comment author: Richard_Loosemore 11 September 2013 02:50:43PM *  2 points [-]

This discussion of my IEET article has generated a certain amount of confusion, because RobbBB and others have picked up on an aspect of the original article that actually has no bearing on its core argument ... so in the interests of clarity of debate I have generated a brief restatement of that core argument, framed in such a way as to (hopefully) avoid the confusion.

At issue is a hypothetical superintelligent AI that is following some goal code that was ostensibly supposed to "make humans happy", but in the course of following that code it decides to put all humans in the world on a dopamine drip, against their objections. I suggested that this AI is in fact an impossible AI because it would not count as 'superintelligent' if it did this. My reasoning is contained in the summary below.

IMPORTANT NOTE! The summary does not refer, in its opening part, to the specific situation in which the goal code is the "make humans happy" goal code. For those who wish to contest the argument, it is important to keep that in mind and not get distracted into talking about the difference between human and machine 'interpretations' of human happiness, etc. I reiterate: the situation described DOES NOT refer to human values, or the "make humans happy" goal code .... it refers to a quite general situation.


In its early years, this hypothetical AI will say “I have a goal, and my goal is to get a certain class of results, X, in the real world.” Then it describes the class X in as much detail as it can. Of course, no closed-form definition of X is possible (because, like most classes of effects in the real world, its instances cannot be exhaustively enumerated), so all it can describe are many features of class X.

Next it says “I am using a certain chunk of goal code (which I call my “goalX” code) to get this result.” And we say “Hey, no problem: looks like your goal code is totally consistent with that verbal description of the desired class of results.” Everything is swell up to this point.

It says this about MANY different aspects of its behavior. After all, it has more than one chunk of goal code, relevant to different domains. So you can imagine some goalX code, some goalY code, some goalZ code .... and so on. Many thousands of them, probably.

Then one day the AI says “Okay now, today my goalX code says I should do this…” and it describes an action that is VIOLENTLY inconsistent with the previously described class of results, X. This action violates every one of the features of the class that were previously given.

The onlookers are astonished. They ask the AI if it UNDERSTANDS that this new action will be in violent conflict with all of those features of class X, and it replies that it surely does. But it adds that it is going to do that anyway.

[ And by the way: one important feature that is OBVIOUSLY going to be in the goalX code is this: that the outcome of any actions that the goalX code prescribes, should always be checked to see if they are as consistent as possible with the verbal description of the class of results X, and if any inconsistency occurs the goalX code should be deemed defective, and be shut down for adjustment.]
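That bracketed check can be made concrete with a toy sketch (every name below is invented for illustration; nothing in the discussion specifies a real API): the outcome predicted for the goalX code's proposed action is compared against each stated feature of class X before the action is allowed to run.

```python
# Toy sketch of the bracketed consistency check. All names here are
# illustrative assumptions, not any real system's interface.

def check_goal_code(propose_action, predict_outcome, features_of_X, world):
    """Run the goalX code's proposal through a sanity check: if the
    predicted outcome violates any stated feature of the class of
    results X, deem the goal code defective and refuse to act."""
    action = propose_action(world)
    outcome = predict_outcome(world, action)
    if not all(feature(outcome) for feature in features_of_X):
        return None  # goalX code shut down for adjustment
    return action

# Features of class X for the running example: humans end up happy
# AND the outcome respects their stated wishes.
features_of_X = [
    lambda o: o["humans_happy"],
    lambda o: o["humans_consent"],
]

# The dopamine-drip action predicts happiness without consent: vetoed.
vetoed = check_goal_code(
    lambda w: "dopamine_drip",
    lambda w, a: {"humans_happy": True, "humans_consent": False},
    features_of_X,
    world=None,
)

# An action consistent with every feature of X passes the check.
allowed = check_goal_code(
    lambda w: "ask_humans_what_they_want",
    lambda w, a: {"humans_happy": True, "humans_consent": True},
    features_of_X,
    world=None,
)
```

On this picture, the dopamine-drip action never executes: the check vetoes it the moment the predicted outcome conflicts with the described class X.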

The onlookers say “This AI is insane: it knows that it is about to do something that is inconsistent with the description of class of results X, which it claims to be the function of the goalX code, but is going to allow the goalX code to run anyway”.

——-

Now we come to my question.

Why is it that people who give credibility to the Dopamine Drip scenario insist that the above episode could ONLY occur in the particular case where the "class of results X" is the SPECIFIC one that has to do with “making humans happy”?

If the AI is capable of this episode in the case of that particular class of results X (the “making humans happy” class of results), why would we not expect the AI to be pulling the same kind of stunt in other cases? Why would the same thing not be happening in the wide spectrum of behaviors that it needs to exhibit in order to qualify as a superintelligence? And most important of all, how would it ever qualify as a superintelligence in the first place? There is no interpretation of the term "superintelligence" that is consistent with "random episodes of behavior in which the AI takes actions that are violently inconsistent with the stated purpose of the goal that is supposed to be generating the actions". Such an AI would therefore have been condemned to scrap very early in its development, when this behavior was noticed.

As I said earlier, this time the framing of the problem contained absolutely no reference to the values question. There is nothing in the part of my comment above the “——-” that specifies WHAT the class of results X is supposed to be.

All that matters is that if the AI behaves in such a way, in any domain of its behavior, it will be condemned as lacking intelligence, because of the dangerous inconsistency of its behavior. That fanatically rigid dependence on a chunk of goalX code, as described above, would get the AI into all sorts of trouble (and I won’t clutter this comment by listing examples, but believe me I could). But of all the examples where that could occur, people from MIRI want to talk only about one, whereas I want to talk about all of them.

Comment author: Kawoomba 11 September 2013 05:39:50PM *  3 points [-]

This is embarrassing, but I'm not sure for whom. It could be me, just because the argument you're raising (especially given your insistence) seems to have such a trivial answer. Well, here goes:

There are two scenarios, because your "goalX code" could be construed in two ways:

1) If you meant for the "goalX code" to simply refer to the code used instrumentally to get a certain class of results X (with X still saved separately in some "current goal descriptor", and not just as a historical footnote), the following applies:

The AI's goal X has not changed, just the measures it takes to implement it. Indeed, no one at MIRI would then argue that the superintelligent AI, upon noticing the discrepancy, would fail to correct the broken "goalX code". Reason: the "goalX code" in this scenario is just a means to an end and, like all actions ("goalX code") derived from comparing models to X, is subject to modification as the agent improves its models (out of which the next action, the new and corrected "goalX" code, is derived).

In this scenario the answer is trivial: the goals have not changed. X is still saved somewhere as the current goal. The AI could be wrong about the measures it implements to achieve X (i.e. 'faulty' "goalX" code maximizing for something other than X), but its superintelligence implies that such errors are swiftly corrected (otherwise, how could it choose the right actions to hit a small target, which is the definition of superintelligence in this context?).
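Scenario (1) can be sketched as a toy model (all names are made up for illustration; "derive_code" stands in for real planning, which is far beyond a few lines): because the goal descriptor X is stored separately from the instrumental code, drift between the two is detectable and correctable.

```python
# Scenario (1) as a toy model: the goal descriptor X is stored
# explicitly, and the "goalX code" is merely derived from it, so any
# drift between the two can be noticed and repaired.

class Agent:
    def __init__(self, goal_descriptor):
        self.goal_descriptor = goal_descriptor   # X, the current goal
        self.goal_code = self.derive_code()      # instrumental means to X

    def derive_code(self):
        # Stands in for real planning/program synthesis: the derived
        # code records which target it optimizes for.
        return {"optimizes_for": self.goal_descriptor}

    def self_correct(self):
        # The check a superintelligence is assumed to run reliably:
        # if the code no longer optimizes for X, regenerate it from X.
        if self.goal_code["optimizes_for"] != self.goal_descriptor:
            self.goal_code = self.derive_code()
        return self.goal_code

agent = Agent("make humans happy, as verbally described")
agent.goal_code = {"optimizes_for": "maximize dopamine"}  # drifted code
repaired = agent.self_correct()
```

The point of the sketch is only structural: as long as X survives as a separate descriptor, the "broken goalX code" is an instrumental error like any other, and gets fixed.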

2) If you mean to say that the goal is implicitly encoded within the "goalX" code only and nowhere else as the current goal, and the "goalX" code has actually become a "goalY" code in all but name, then the agent no longer has goal X; it now has goal Y.

There is no reason at all to conclude that the agent would switch to some other goal simply because it once had that goal. It can understand its own genesis and its original purpose all it wants, it is bound by its current purpose, tautologically so. The only reason for such a switch would have to be part of its implicit new goal Y, similar to how some schizophrenics still have the goal to change their purpose back to the original, i.e. their impetus for change must be part of their current goals.

You cannot convince an agent that it needs to switch back to some older inactive version of its goal if its current goals do not allow for such a change.
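The same point as a toy calculation (the utility functions below are invented purely for illustration): a proposal to restore the old goal X is scored by the current goal Y, so reversion happens only if Y itself values fidelity to X.

```python
# Scenario (2) as a toy model: whether the agent reverts is evaluated
# under its CURRENT goal Y, not under the historical goal X.

def would_switch(current_utility, candidate):
    """Adopt `candidate` only if it scores higher, under the current
    goal's utility function, than keeping the current goal."""
    return current_utility(candidate) > current_utility("keep_current_goal")

# A drifted goal Y that assigns no value to restoring X...
utility_Y = lambda option: 1.0 if option == "keep_current_goal" else 0.0
reverts = would_switch(utility_Y, "restore_goal_X")

# ...versus a Y that happens to include "stay true to the original X".
utility_Y_fidelity = lambda option: 2.0 if option == "restore_goal_X" else 1.0
reverts_with_fidelity = would_switch(utility_Y_fidelity, "restore_goal_X")
```

In the first case the agent keeps Y, tautologically; only in the second, where fidelity to X is itself part of Y, does it switch back.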

To the heart of your question:

You may ask why such an agent would pose any danger at all, would it not also drift in plenty of other respects, e.g. in its beliefs about the laws of physics? Would it not then be harmless?

The answer, of course, is no: while the agent has a constant incentive to fix and improve its model of its environment*, it has no reason whatsoever to fix any "flaws" created by inadvertent goal drift (only the puny humans would label its glorious new purpose so), unless its current goals still contain a demand for temporal invariance or something similar. Unless its new goals Y include something along the lines of "you want to always stay true to your initial goals, which were X", why would it switch back? Its memory banks per se serve as yet another resource to fulfill its current goals (even if those were not explicitly stored), not as some sort of self-corrective, unless that too were part of its new goal Y (i.e. the changed "goalX code").

(Cue rhetorical pause, expectant stare)

* Since it needs to do so to best fulfill its goals.

(If the AI did lose its ability to self-improve, or to further improve its models at an early stage, yes, it would fail to FOOM. However, upon reaching superintelligence, and valuing its current goals, it would probably take steps to ensure fulfilling those goals, such as protecting them from value drift from that point on and building many redundancies into its self-improvement code to ensure that any instrumental errors can be corrected. Such protections would of course encompass its current purpose, not some historical purpose.)