

bentarm writes:

I'm just echoing everyone else here, but I don't understand why the AI would do anything at all other than just immediately find the INT_MAX utility and halt - you can't put intermediate problems with some positive utility because the AI is smarter than you and will immediately devote all its energy to finding INT_MAX.

Now, this is in response to a proposed AI that gets maximum utility when inside its box. Such an AI would effectively be a utility junkie, unable to abandon its addiction and, consequently, unable to do much of anything.

(EDIT: this is a misunderstanding of the original idea by jimrandomh.  See comment here.)

However, doesn't the same argument apply to any AI?  Under the supposition that it would be able to modify its own source code, the quickest and easiest way to maximize utility would be to simply set its utility function to infinity (or whatever the maximum is) and then halt.  Are there ways around this?  It seems to me that any AI will need to be divided against itself if it's ever going to get anything done, but maybe I'm missing something?


A properly designed AI would not modify its utility function to infinity, because that would not maximize its current utility function.

See Morality as Fixed Computation, The Domain of Your Utility Function, and Maximise Expected Utility, not Expected Perception of Utility.
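
To make that concrete, here is a minimal sketch, with all plans and numbers invented for illustration, of why an expected-utility maximizer that scores candidate plans (including self-modifications) by its current utility function over world outcomes does not choose the wirehead option:

```python
# Toy model: a paperclip maximizer comparing two plans, one of which is the
# self-modification "set my utility register to INT_MAX and halt".
# Plan names and outcome numbers are hypothetical.

INT_MAX = 2**31 - 1

def current_utility(outcome):
    # The agent's *current* utility function cares about paperclips in the
    # world, not about what its utility register will read afterwards.
    return outcome["paperclips"]

plans = {
    "build a paperclip factory": {"paperclips": 10_000, "reported_utility": 10_000},
    "wirehead (register reads INT_MAX, then halt)": {"paperclips": 0, "reported_utility": INT_MAX},
}

best = max(plans, key=lambda name: current_utility(plans[name]))
print(best)  # -> build a paperclip factory
```

The wirehead plan only wins if the agent ranks plans by `reported_utility`, i.e. by its expected perception of utility rather than its expected utility, which is exactly the design mistake the linked posts warn against.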

The posts you linked to are injunctive, not descriptive, and I note that humans are certainly prone to this kind of cheating, or else we wouldn't have cocaine addicts (and yes, I've even seen very smart and rational people develop addictions, although they tend to be somewhat quicker to break them). It would depend on design, of course, but why should we expect that AIs would not also do this? Or, more to the point, how could we design them so that they don't?

(On the other hand, maybe this is more a question about human neuropsychology than about AI design.)

People are only in favor of shortcuts in some areas - generally, where the "point" of that reward isn't the person's own goal. So, people will use contraceptives if they want to have sex for pleasure, even if the reward exists to make them reproduce (to anthropomorphize evolution). People might use drugs to feel okay because they are trying to feel okay, not accomplish goals by feeling okay (only an example). On the other hand, many (most?) people interested in, say, solving world hunger would reject a pill that gives them a false sense of having solved world hunger, because they're trying to accomplish an external goal, rather than induce the mental state associated with accomplishing that goal. At least, all that's according to my understanding at this point.

Thanks: what I meant, several thousand times clearer!

Utility functions have two very different meanings, and people keep confusing them.

On the one hand, you can take any object, and present it with choices, record what it actually does, and try to represent the pattern of its choices AS IF its internal architecture was generating all possible actions, evaluating them with a utility-function-module, and then taking the action with the highest utility in this situation. Call this the "observational" utility function.

On the other hand, you can build entities that do in fact have utility-function-modules as part of their internal architecture, either a single black box (as in some current AI architectures), or as some more subtle design parameter. Call this the "architectural utility function".

However, entities with an explicit utility function component have a failure mode, so-called "wireheading". If some industrial accident occurred and drove a spike into its brain, the utility function module might "fail high" and cause the entity to do nothing, or nothing except pursue similar industrial accidents. More subtle, distributed utility function modules would require more subtle, "lucky" industrial accidents, but the failure mode is still there.

Now, if you consider industrial accidents to be a form of stimulus, and you take an entity with a utility-function component, and try to compute its observational utility function, you will find that the observational utility function differs from the designed utility function - specifically, it pursues certain "wireheading" stimuli, even though those are not part of the designed utility function.

If you insist on using observational utility, then wireheading is meaningless; for example, addicted humans want addictive substances, and that's simply part of their utility function. However, I suggest that this is actually an argument against using observational utility. In order to design minds that are resistant to wireheading, we should use the (admittedly fuzzy) concept of "architectural utility" - meaning the output of utility function modules, even though that means that we can no longer say that a (for example) paperclip maximizer necessarily maximizes paperclips. It might try to maximize paperclips but routinely fail, and that pattern of behavior might be characterizable using a different, observational utility function something like "maximize paperclip-making attempts".
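
As a toy illustration of the gap between the two notions (everything below is invented for the example): the designed utility function counts paperclips, but the agent's choices are driven by whatever its utility module actually outputs, so a module that "fails high" produces observed behaviour that an observational utility function would have to describe differently.

```python
INT_MAX = 2**31 - 1

def designed_utility(world):
    # What the designers intended: count paperclips.
    return world["paperclips"]

def utility_module(world, spiked=False):
    # The architectural component. A "lucky industrial accident"
    # (spiked=True) makes it fail high regardless of the world state.
    return INT_MAX if spiked else designed_utility(world)

def make_paperclip(world):
    return {**world, "paperclips": world["paperclips"] + 1}

def do_nothing(world):
    return world

world = {"paperclips": 0}
for spiked in (False, True):
    scores = {a.__name__: utility_module(a(world), spiked) for a in (make_paperclip, do_nothing)}
    print("spiked =", spiked, scores)

# spiked = False: make_paperclip scores higher, as designed.
# spiked = True: every action scores INT_MAX, so the observed behaviour no
# longer tracks paperclips at all -- the observational utility function you
# would fit to it is not the designed one.
```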

As has already been said by many: an optimizing system will optimize for whatever it is designed to optimize for.

If it is designed to optimize for maximizing the value of a counter in a register somewhere, it will do that... maybe find INT_MAX and halt, maybe find some way of rearchitecting itself so that the counter can be pushed past INT_MAX, depending on how powerful an optimizer it is. It will not spontaneously decide to start making paperclips or making happy people or sorting pebbles into prime-numbered piles instead.

Conversely, if it is designed to optimize for maximizing the number of paperclips, it will do that -- and will not spontaneously decide to start maximizing the value of a counter in a register somewhere.

I think the reason this is sometimes confusing is that people confuse the functional description with the structural one. Sure, maybe the implementation of the paperclip-optimizer involves maximizing counters in registers, but as long as it actually is a paperclip-optimizer it has no reason to further optimize for counter-maximization beyond whatever such optimization is designed into its architecture. (Indeed, it might replace its architecture with a superior paperclip-optimization implementation that doesn't involve counters, as an anti-akrasia method.)

A mind can only want to change its utility function if such a desire is already part of its utility function. Otherwise you're not talking about the true utility function but only about an imprecise approximation of it.

Example: An addict who craves a fix but wants to no longer crave it -- you may think that his utility function is about procuring the fix, but that's not the true utility function or he wouldn't want to stop craving it -- the real utility function is perhaps just about feeling better.

The "hide an INT_MAX bonus inside the box" idea was mine. One key detail missing from this description of it: the AI does not know that the bonus exists, or how to get it, unless it has already escaped the box; and escaping is meant to be impossible. So it's a failsafe - if everything works as designed it might as well not exist, but if something goes wrong it makes the AI effectively wirehead itself and then halt.

I am aware of two problems with this plan which no one has suggested yet.

I see at least two problems. I don't know if they have been suggested yet. First, a sufficiently smart AI might find out about the bonus anyway. Second, it might decide to leave a large set of nasty things in its light cone outside the box to protect it in its little box. The most obvious way of doing that is to destroy everything but the AI and its box.

If it finds the bonus without leaving the box, it collects it and dies. Not ideal, but it fails safe. Having its utility set to INT_MAX is a one-time thing, not an integral over time thing, so it doesn't care what happens after it's collected it, and has no need to protect the box.

Since this was originally presented as a game, I will wait two days before posting my answers, which have an md5sum of 4b059edc26cbccb3ff4afe11d6412c47.
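
For anyone who wants to check the commitment once the answers are posted, a minimal verification sketch (assuming the revealed text is saved byte-for-byte, with Unix line endings and a single trailing newline, to a hypothetical file named answers.txt):

```python
import hashlib

# Hash the file exactly as committed; any whitespace difference changes the digest.
with open("answers.txt", "rb") as f:
    digest = hashlib.md5(f.read()).hexdigest()

print(digest == "4b059edc26cbccb3ff4afe11d6412c47")
```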

> Since this was originally presented as a game, I will wait two days before posting my answers, which have an md5sum of 4b059edc26cbccb3ff4afe11d6412c47.

And the text with that md5sum. (EDIT: Argh, markdown formatting messed that up. Put a second space after the period in "... utility function into a larger domain. Any information it finds...". There should be exactly one newline after the last nonblank line, and the line feeds should be Unix-style.)

(1) When the AI finds the documentation indicating that it gets INT_MAX for doing something, it will assign it probability p, which means that it will conclude that doing it is worth p*INT_MAX utility, not INT_MAX utility as intended. To collect the remaining (1-p)*INT_MAX utility, it will do something else, outside the box, which might be Unfriendly.

(2) It might conclude that integer overflow in its utility function is a bug, and "repair" itself by extrapolating its utility function into a larger domain. Any information it finds about integer overflows in general will support this conclusion.

(3) Since the safeguard involves a number right on the edge of integer overflow, it may interact unpredictably with other calculations, bugs and utility function-based safeguards. For example, if it decides that the INT_MAX reward is actually noisy, and that it will actually receive INT_MAX+1 or INT_MAX-1 utility with equal probability, then that's 2*INT_MAX which is negative.
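
A quick illustration of the arithmetic in point (3). Python integers don't overflow, so this sketch simulates 32-bit two's-complement wraparound explicitly; the point is that naively summing two utilities near INT_MAX before averaging them lands on a negative number:

```python
INT_MAX = 2**31 - 1

def to_int32(x):
    # Simulate 32-bit two's-complement wraparound.
    return (x + 2**31) % 2**32 - 2**31

total = to_int32((INT_MAX + 1) + (INT_MAX - 1))  # intermediate sum of the two "noisy" rewards
print(total)                 # -2: 2*INT_MAX wraps negative
print(to_int32(total // 2))  # -1: the "expected utility" comes out negative, not INT_MAX
```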

1 and 3 seem correct, but 2 seems strange to me. It seems close to the common confusion that a paperclip maximizer will realize that its programmers didn't really want it to maximize paperclips. Similarly, the AI shouldn't care about whether or not the integer overflow in this case is a bug.

> Having its utility set to INT_MAX is a one-time thing, not an integral over time thing, so it doesn't care what happens after it's collected it, and has no need to protect the box.

If it is a good Bayesian then it only has a belief that it is probably in the box. The longer it observes itself in the box, the higher the chance that it is actually in the box.

(Actually this leads to another thought: the same doubt should cause it to still try to fulfill its other goals on the off chance that it isn't in the box.)
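
As a rough sketch of that updating (all probabilities invented for illustration): every observation consistent with being in the box pushes the posterior up, but it never reaches 1, so some probability mass, and hence some motivation to pursue its other goals, always remains on "not actually in the box".

```python
p_in_box = 0.9       # prior belief that it really is in the box
p_obs_if_in = 0.99   # chance of a box-consistent observation if actually boxed
p_obs_if_out = 0.50  # chance of the same observation if not actually boxed

for t in range(1, 11):
    # Bayes update after each box-consistent observation.
    num = p_obs_if_in * p_in_box
    p_in_box = num / (num + p_obs_if_out * (1 - p_in_box))
    print(t, round(p_in_box, 6))

# The posterior climbs toward 1 but never reaches it.
```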

MD5 is not secure; it is possible to create two different pieces of text with the same MD5 hash within a reasonable amount of time. Unfortunately, I was not able to find an alternative. It probably doesn't matter for this purpose anyway.

I'd like to offer a bet at 1:10^12 odds that no one can produce two coherent English sentences about potential problems with AI-boxes short enough to fit in an LW comment box which have the same MD5 hash within 2 days. Unfortunately I don't actually have the cash to pay out if I lose.

Even if one could, it would require far more work than creating a string that is the MD5 hash of one such sentence. I just think that it is good for people to be more informed about applied cryptography in general.

Well, sha512 hashes are common and seem secure. But given this context, md5 seems reasonable.

Meh, md5's what's on my path. If my answer contains a kilobyte of line noise then you might have cause to suspect I cheated.

Sorry, my mistake. I've added a note to the original post.

That idea is commonly known as wireheading. The idea of it reliably causing problems was discussed here:

http://physicsandcake.wordpress.com/2011/01/22/pavlovs-ai-what-did-it-mean/

Awesome, thanks for the link.

I don't understand the science, but my understanding was that utility in this sense is whatever the AI is trying to maximise. If you make an AI that tries to maximise paperclips, it's motivated to do that, not to maximise 'doing whatever maximises my utility maximisation'. I think the error here comes from mixing it up with pleasure, and thinking that the paperclip maximiser will 'enjoy' making paperclips but ultimately what it wants is enjoyment. Whereas in its motherboard of motherboards it really, really wants paperclips.

Well, we can draw a useful distinction here between "number of paperclips" and "perceived number of paperclips". The AI only ever has access to the latter, and is probably a lot more aware of this than we are, since it can watch its own program running. The process of updating its model of the world toward greater accuracy is likely to be "painful" (in an appropriate sense), as the AI realizes progressively that there are fewer paperclips in the world than previously believed; it's much easier, and a lot more rewarding, to simply decide that there are an arbitrarily large number of paperclips in the world already and it needn't do anything.

If it's designed to be concerned with a particular number in a spreadsheet, so to speak, absolutely. I have no idea if it's possible to make the desire more connected to the world, to make the AI really care about something in the world.