AlexMennen comments on Reply to Holden on 'Tool AI' - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (348)
I was under the impression that Holden's suggestion was more along the lines of: Make a model of the world. Remove the user from the model and replace it with a similar user that will always do what you recommend. Then manipulate this user so that it achieves its objective in the model, and report the actions that you have the user do in the model to the real user.
Thus, if the objective was to make the user happy, the Google Maps AGI would simply instruct the user to take drugs, rather than tricking him into doing so, because such instruction is the easiest way to manipulate the user in the model that the Google Maps AGI is optimizing in.
Actually, the easiest output for the AI in that case is "be happy."
But - that's not what he meant!
I don't know why you keep harping on this. Just because an algorithm logically can produce a certain output, and probably will produce that output, doesn't mean good intentions and vigorous handwaving are any less capable of magic.
This is why when I fire a gun, I just point it in the general direction of my target, and assume the universe will know what I meant to hit.
I mean, it works in so many video games.
As a failure mode, "vague, useless, or trivially-obvious suggestions" is less of a problem than "rapidly eradicates all life." Historically, projects that were explicitly designed to be safe even when they inevitably failed have been more successful and less deadly than projects which were obsessively designed never to fail at all.
Indeed, one of the first things we teach our engineers is "Even if you're sure it can't fail, plan for failure anyway. Many before you have been sure things couldn't fail---that failed."
Indeed it isn't, although I'm not so foolish as to claim to know how to fully specify my suggestion in a way that avoids all of these sorts of problems.
Holden didn't actually suggest that. And while this suggestion is in a certain sense ingenious - it's not too far off from the sort of suggestions I flip through when considering how/if to implement CEV or similar processes - how do you "report the actions"? And do you report the reasons for them? And do you check to see if there are systematic discrepancies between consequences in the true model and consequences in the manipulated one? (This last point, btw, is sufficient that I would never try to literally implement this suggestion, but try to just structure preferences around some true model instead.)
How do you report the path the car should take? On the map. How do you report a better transistor design? In the blueprint. How do we report software design? With a UML diagram. (How do you report why that transistor works? Show the simulator.) It's only the most irreparable clinical psychopaths who generate all their outputs via extensive (and computationally expensive) modelling of the cognition (and decision process) of the listener. Edit: i.e. modelling so as to attain an outcome favourable to themselves; failing to empathise with the listener, that is, failing to treat the listener as an instance of self, and instead treating the listener as a difficult-to-control servomechanism.
Isn't the relevant quality of a "clinical psychopath," here, something like "explicitly models cognition of the listener, instead of using empathy," where "empathy"==something like "has an implicit model of the cognition of the listener"?
An implicit model that is rather incomplete and not wired for exploitation. That's how psychopaths are successful at exploiting other people and talking people into stuff even though their model is substandard when it comes to actual communication; their model actually sucks and is inferior to the normal one.
Human friendliness works via not modelling the decision processes of other people when communicating; we do that modelling when we deceive, lie, and bullshit, while when we are honest we more or less share our thoughts directly. This idea of an oracle here is outright disturbing. It is clear nothing good comes out of a full model of the listener: firstly it wastes computing time, and secondly it generates bullshit, so you get something that is inferior at solving technical problems and more dangerous at the same time.
Meanwhile, much of the highly complex information that we would want to obtain from an oracle is hopelessly impossible to convey in English anyway: hardware designs, cures, etc.
I can think of a bunch of random standard modes of display (top candidate: video and audio of what the simulated user sees and hears, plus subtitles of their internal model), and for the discrepancies you could run the simulation many times with random variations roughly along the same scope and dimensions as the differences between the simulations and reality, either just rejecting plans that have too much divergence, or simply showing the display of all of them (which'd also help against frivolous use, if you have to watch the action 1000 times before doing it). I'd also say make the simulated user a total drone with seriously rewired neurology, to try to ensure it always and only does what the AI tells it to.
Not that this solves the problem - I've countered the really dangerous things I noticed instantly, but give me 5 minutes to think and I'll notice 20 more - but I thought someone should actually try to answer the question in spirit, letter, and most charitable interpretation.
also, it'd make a nice movie.
I don't see why the 'oracle' has to work from some real-world goal in the first place. The oracle may have as its terminal goal the output of the relevant information on the screen, with a level of clutter compatible with the human visual cortex, and that's it. It's up to you to ask it to represent the information in a particular way.
Or not even that; the terminal goal of the mathematical system is to make some variables represent such output; an implementation of such a system has those variables computed and copied to the screen as pixels. The resulting system does not even self-preserve; the abstract computation making abstract variables take certain abstract values is attained in the relevant sense even if the implementation is physically destroyed. (This is how software currently works.)
The screen is a part of the real world.
Hardwiring the AI to be extremely naive about how easy the user is to manipulate might not be sufficient for safety, but it does seem like a pretty good start.
Delete the word "hardwiring" from your vocabulary. You can't do it with wires, and saying it doesn't accomplish any magic.
I think there is an interpretation of "hardwiring" that makes sense when talking about AI. For example, say you have a chess program. You can make a patch for it that says "if my light squared bishop is threatened, getting it out of danger is highest priority, second only to getting the king out of check". Moreover, even for very complex chess programs, I would expect that patch to be pretty simple, compared to the whole program.
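To make the shape of such a patch concrete, here's a toy sketch (not a real chess engine; the move representation, scores, and flags are invented purely for illustration) of a move chooser that normally maximizes the engine's evaluation, with the bishop-protection hack bolted on top:

```python
# Toy illustration of the patch described above. The priority order is:
# king out of check > light-squared bishop safety > normal evaluation.

def base_value(move):
    return move["score"]  # the engine's ordinary evaluation

def choose_move(moves, bishop_threatened, king_in_check):
    # Patch logic: only when the bishop is threatened (and we're not in
    # check) do we restrict the search to bishop-saving moves.
    if not king_in_check and bishop_threatened:
        safe = [m for m in moves if m["keeps_bishop_safe"]]
        if safe:
            return max(safe, key=base_value)
    return max(moves, key=base_value)

moves = [
    {"name": "Qxe5", "score": 3.0,  "keeps_bishop_safe": False},
    {"name": "Bf1",  "score": -0.5, "keeps_bishop_safe": True},
]
# With the bishop threatened, the patch forces the retreat despite its
# lower score; otherwise normal play resumes.
print(choose_move(moves, bishop_threatened=True,  king_in_check=False)["name"])   # Bf1
print(choose_move(moves, bishop_threatened=False, king_in_check=False)["name"])   # Qxe5
```

The point is just that the patch sits above the normal evaluation rather than being folded into it, which is why it can stay simple relative to the whole program.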
Maybe a general AI will necessarily have an architecture that makes such patches impossible or ineffective. Then again, maybe not. You could argue that an AI would work around any limitations imposed by patches, but I don't see why a computer program with an ugly patch would magically acquire a desire to behave as if it didn't have the patch, and converge to maximizing expected utility or something. In any case I'd like to see a more precise version of that argument.
ETA: I share your concern about the use of "hardwiring" to sweep complexity under the rug. But saying that AIs can do one magical thing (understand human desires) but not another magical thing (whatever is supposed to be "hardwired") seems a little weird to me.
Yeah, well, hardwiring the AI to understand human desires wouldn't be goddamned trivial either, I just decided not to go down that particular road, mostly because I'd said it before and Holden had apparently read at least some of it.
Getting the light-square bishop out of danger as highest priority...
1) Do I assume the opponent assigns symmetric value to attacking the light-square bishop?
2) Or that the opponent actually values checkmates only, but knows that I value the light-square bishop myself and plan forks and skewers accordingly?
3) Or that the opponent has no idea why I'm doing what I'm doing?
4) Or that the opponent will figure it out eventually, but maybe not in the first game?
5) What about the complicated static-position evaluator? Do I have to retrain all of it, and possibly design new custom heuristics, now that the value of a position isn't "leads to checkmate" but rather "leads to checkmate + 25% leads to bishop being captured"?
Adding this to Deep Blue is not remotely as trivial as it sounds in English. Even to add it in a half-assed way, you have to at least answer question 1, because the entire non-brute-force search-tree pruning mechanism depends on guessing which branches the opponent will prune. Look up alpha-beta search to start seeing why everything becomes more interesting when position-values are no longer being determined symmetrically.
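For readers following along, here is a minimal alpha-beta search over a hand-built tree; the cutoff line is the step that silently assumes the opponent minimizes the same value we maximize, which is exactly what the bishop patch breaks:

```python
# Minimal alpha-beta search on a hand-built game tree. Leaves are position
# values; internal nodes are lists of children.

def alphabeta(node, maximizing, alpha=float("-inf"), beta=float("inf")):
    if isinstance(node, (int, float)):  # leaf: a position value
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:  # cutoff: "the opponent would never allow this"
                break          # ...valid only if the opponent shares our values
        return value
    else:
        value = float("inf")
        for child in node:
            value = min(value, alphabeta(child, True, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:
                break
        return value

tree = [[3, 5], [2, 9], [0, 1]]  # depth-2 tree, opponent to move at depth 1
print(alphabeta(tree, True))     # 3: the minimax value under symmetric evaluation
```

If the opponent is actually optimizing something else (hunting the bishop, say), branches discarded at the cutoff may contain the true line of play, so the pruning itself has to be rethought, not just the leaf evaluation.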
For what it's worth, the intended answers are 1) no 2) no 3) yes 4) no 5) the evaluation function and the opening book stay the same, there's just a bit of logic squished above them that kicks in only when the bishop is threatened, not on any move before that.
Yeah, game-theoretic considerations make the problem funny, but the intent wasn't to convert an almost-consistent utility maximizer into another almost-consistent utility maximizer with a different utility function that somehow values keeping the bishop safe. The intent was to add a hack that throws consistency to the wind, and observe that the AI doesn't rebel against the hack. After all, there's no law saying you must build only consistent AIs.
My guess is that's what most folks probably mean when they talk about "hardwiring" stuff into the AI. They don't mean changing the AI's utility function over the real world, they mean changing the AI's code so it's no longer best described as maximizing such a function. That might make the AI stupid in some respects and manipulable by humans, which may or may not be a bad thing :-) Of course your actual goals (whatever they are) would be better served by a genuine expected utility maximizer, but building that could be harder and more dangerous. Or at least that's how the reasoning is supposed to go, I think.
Why doesn't the AI reason "if I remove this hack, I'll be more likely to win?" Because this is just a narrow chess AI and the programmer never gave it general reasoning abilities?
A more interesting question is why it (if made capable of such reflection) would not take this a little step further and ponder what happens if it removes the enemy's queen from its internal board, which would also make it more likely to win, by its internal definition of 'win', which is defined in terms of the internal board.
Or why anyone would go to the bother of implementing a possibly irreducible notion of what 'win' really means in the real world, given that this would simultaneously waste computing power on unnecessary explorations and make the AI dangerous and uncontrollable.
Thing is, you don't need to imagine the world dying to avoid making pointless likely impossible accomplishments.
Yeah, because it's just a narrow real-world AI without philosophical tendencies... I'm actually not sure. A more precise argument would help, something like "all sufficiently powerful AIs will try to become or create consistent maximizers of expected utility, for such-and-such reasons".
Does a pair of consistent optimizers with different goals have a tendency to become a consistent optimizer?
The problem with powerful non-optimizers seems to be that the "powerful" property already presupposes optimization power, and so at least one optimizer-like thing is present in the system. If it's powerful enough and is not contained, it's going to eat all the other tendencies of its environment, and so optimization for its goal will be all that remains. Unless there is another optimizer able to defend its non-conformity from the optimizer in question, in which case the two of them might constitute what counts as not-a-consistent-optimizer, maybe?
Option 3? Doesn't work very well. You're assuming the opponent doesn't want to threaten the bishop, which means you yank it to a place where it would be safe if the opponent doesn't want to threaten it, but if the opponent clues in, it's then trivial for them to threaten the bishop again (to gain more advantage as you try to defend), which you weren't expecting them to do, because that's not how your search tree was structured. Kasparov would kick hell out of thus-hardwired Deep Blue as soon as he realized what was happening.
It's that whole "see the consequences of the math" thing...
Either your comment is in violent agreement with mine ("that might make the AI stupid in some respects and manipulable by humans"), or I don't understand what you're trying to say...
Probably violent agreement.
I was sorely tempted, upon being ordered to self-modify in such a way, to respond angrily. It implies a lack of respect for the integrity of those with whom you are trying to communicate. You could have said "taboo" instead of demanding a permanent loss.
Do you think it would be outright impossible to handicap an AI in such a way that it cannot conceive of a user interpreting its advice in any but the most straightforward way, and therefore eschews manipulative output? Do you think it would be useless as a safety feature? Do you think it would be unwise for some other reason, some unintended consequence? Or are you simply objecting to my phrasing?
The distinction between hardwiring and softwiring is, at above the most physical, electronic aspects of computer design, a matter of policy - something in the programmer's mind and habits, not something out in the world that the programmer is manipulating. From any particular version of the software's perspective, all of the program it is running is equally hard (or equally soft).
It may not be impossible to handicap an entity in some way analogous to your suggestion, but holding fiercely to the concept of hardwiring will not help you find it. Thinking about mechanisms that would accomplish the handicapping in an environment where everything is equally hardwired would be preferable.
There's some evidence that a chess AI's 'personality' (an emergent quality of its play) is related to a parameter of its evaluation function called 'contempt', which is something like (handwaving wildly) how easy the opponent is to manipulate. In general, AIs with higher contempt seek to win-or-lose more, and seek to draw less. What I'm trying to say is, your idea is not without merit, but it may have unanticipated consequences.
I'm saying that using the word "hardwiring" is always harmful, because it makes people imagine an instruction with lots of extra force, when in fact there's no such thing as a line of programming which you say much more forcefully than any other line. Either you know how to program something or you don't, and it's usually much more complex than it sounds even if you say "hardwire". See the reply above on "hardwiring" Deep Blue to protect the light-square bishop. Though usually it's even worse than this, like trying to do the equivalent of having an instruction that says "#define BUGS OFF" and then saying, "And just to make sure it works, let's hardwire it in!"
There is, in fact, such a thing as making some parts of the code more difficult to modify than other parts of the code.
I apologize for having conveyed the impression that I thought designing an AI to be specifically, incurably naive about how a human querent will respond to suggestions would be easy. I have no such misconception; I know it would be difficult, and I know that I don't know enough about the relevant fields to even give a meaningful order-of-magnitude guess as to how difficult. All I was suggesting was that it would be easier than many of the other AI-safety-related programming tasks being discussed, and that the cost-benefit ratio would be favorable.
There is? How?
http://en.wikipedia.org/wiki/Ring_0
And what does a multi-ring agent architecture look like? Say, the part of the AI that outputs speech to a microphone - what ring is that in?
I am not a professional software designer, so take all this with a grain of salt. That said, hardware I/O is ring 1, so the part that outputs speech to a speaker would be ring 1, while an off-the-shelf 'text to speech' app could run in ring 3. No part of a well-designed agent would output anything to an input device, such as a microphone.
I don't think Strange7 is arguing Strange7's point strongly; let me attempt to strengthen it.
A button that does something dangerous, such as exploding bolts that separate one thing from another thing, might be protected from casual, accidental changes by covering it with a lid, so that when someone actually wants to explode those bolts, they first open the lid and then press the button. This increases reliability if there is some chance that any given hand motion is an error, but the errors of separate hand motions are independent. Similarly 'are you sure' dialog boxes.
In general, if you have several components, each of a given reliability, and their failure modes are somewhat independent, then you can craft a composite component of greater reliability than the individuals. The rings that Strange7 brings up are an example of this general pattern (there may be other reasons why layers-of-rings architectures are chosen for reliability in practice - this explanation doesn't explain why the rings are ordered rather than just voting or something - this is just one possible explanation).
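The general pattern admits a one-line back-of-the-envelope check (assuming the failure probabilities really are independent, which is the load-bearing assumption in the argument above):

```python
# If each of n safeguards fails independently with probability p, the chance
# that *all* of them fail at once is p**n, so the composite component is far
# more reliable than any single one.

def composite_failure(p, n):
    return p ** n

print(composite_failure(0.01, 1))  # single interlock
print(composite_failure(0.01, 2))  # lid + button, assuming independent errors
```

The caveat is that correlated failures (a confused operator, a common power supply) break the independence assumption, which is one reason real safety architectures do not rely on multiplication alone.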
This is reasonable, but note that to strengthen the validity, the conclusion has been weakened (unsurprisingly). To take a system that you think is fundamentally, structurally safe and then further build in error-delaying, error-resisting, and error-reporting factors just in case - this is wise and sane. Calling "adding impediments to some errors under some circumstances" hardwiring and relying on it as a primary guarantee of safety, because you think some coded behavior is firmly in place locally independently of the rest of the system... will usually fail to cash out as an implementable algorithm, never mind it being wise.
The conclusion has to be weakened back down to what I actually said: that it might not be sufficient for safety, but that it would probably be a good start.
Don't programmers do this all the time? At least with current architectures, most computer systems have safeguards against unauthorized access to the system kernel as opposed to the user documents folders...
Isn't that basically saying "this line of code is harder to modify than that one"?
In fact, couldn't we use exactly this idea---user access protocols---to (partially) secure an AI? We could include certain kernel processes on the AI that would require a passcode to access. (I guess you have to stop the AI from hacking its own passcodes... but this isn't a problem on current computers, so it seems like we could prevent it from being a problem on AIs as well.)
[Responding to an old comment, I know, but I've only just found this discussion.]
Never mind special access protocols, you could make code unmodifiable (in a direct sense) by putting it in ROM. Of course, it could still be modified indirectly, by the AI persuading a human to change the ROM. Even setting aside that possibility, there's a more fundamental problem. You cannot guarantee that the code will have the expected effect when executed in the unpredictable context of an AGI. You cannot even guarantee that the code in question will be executed. Making the code unmodifiable won't achieve the desired effect if the AI bypasses it.
In any case, I think the whole discussion of an AI modifying its own code is rendered moot by the fuzziness of the distinction between code and data. Does the human brain have any code? Or are the contents just data? I think that question is too fuzzy to have a correct answer. An AGI's behaviour is likely to be greatly influenced by structures that develop over time, whether we call these code or data. And old structures need not necessarily be used.
AGIs are likely to be unpredictable in ways that are very difficult to control. Holden Karnofsky's attempted solution seems naive to me. There's no guarantee that programming an AGI his way will prevent agent-like behaviour. Human beings don't need an explicit utility function to be agents, and neither does an AGI. That said, if AGI designers do their best to avoid agent-like behaviour, it may reduce the risks.
I always thought that "hardwiring" meant implementing [whatever functionality is discussed] by permanently (physically) modifying the machine, i.e. either something that you couldn’t have done with software, or something that prevents the software from actually working in some way it did before. The concept is of immutability within the constraints, not of priority or "force".
Which does sound like something one could do when they can’t figure out how to do the software right. (Watchdogs are pretty much exactly that, though some or probably most are in fact programmable.)
Note that I’m not arguing that the word is not harmful. It just seemed you have a different interpretation of what that word suggests. If other people use my interpretation (no idea), you might be better at persuading it if you address that.
I’m quite aware that from the point of view of a godlike AI, there’s not much difference between circumventing restrictions in its software and (some kinds of) restrictions in hardware. After all, the point of FAI is to get it to control the universe around it, albeit to our benefit. But we’re used to computers not having much control over their hardware. Hell, I just called it “godlike” and my brain still insists to visualize it as a bunch of boxes gathering dust and blinking their leds in a basement.
And I can’t shake the feeling that between "just built" and "godlike" there’s supposed to be quite a long time when such crude solutions might work. (I’ve seen a couple of hard take-off scenarios, but not yet a plausible one that didn’t need at least a few days of preparation after becoming superhuman.)
Imagine we took you and gave you the best "upgrades" we can manage today plus a little bit (say, a careful group of experts figuring out your ideal diet of nootropics, training you to excellence in everything from acting to martial arts, and giving you nanotube bones and a direct internet link to your head). Now imagine you have a small bomb in your body, set to detonate if tampered with or if one of several remotes distributed throughout the population is triggered. The world's best experts tried really hard to make it fail-deadly.
Now, I'm not saying you couldn't take over the world, send all the men to Mars and the women to Venus, then build a volcano lair filled with kittens. But it seems far from certain, and I'm positive it'd take you a long time to succeed. And it does feel that a new-born AI would be like that for a while, rather than turning into Prime Intellect in five minutes. (Again, this is not an argument that UFAI is no problem. I guess I'm just figuring out why it seems that way to mostly everyone.)
[Huh, I just noticed I’m a year late on this chat. Sorry.]
Software physically modifies the machine. What can you do with a soldering iron that you can't do with a program instruction, particularly with respect to building a machine agent? Either you understand how to write a function or you don't.
That is all true in principle, but in practice it’s very common that one of the two is not feasible. For example, you can have a computer. You can program the computer to tell you when it’s reading from the hard drive, or communicates to the network, say by blinking an LED. If the program has a bug (e.g., it’s not the kind of AI you wanted to build), you might not be notified. But you can use a soldering iron to electrically link the LED to the relevant wires, and it seems to most users that no possible programming bug can make the LED not light up when it should.
Of course, that’s like the difference between programming a robot to stay in a pen, or locking the gate. It looks like whatever bug you could introduce in the robot’s software cannot cause the robot to leave. Which ignores the fact that robot might learn to climb the fence, make a key, convince someone else (or hack an outside robot) to unlock the gate.
I think most people would detect the dangers in the robot case (because they can imagine themselves finding a way to escape), but be confused by the AI-in-the-box one (simply because it’s harder to imagine yourself as software, and even if you manage to you’d still have much fewer ideas come to mind, simply because you’re not used to being software).
Hell, most people probably won’t even have the reflex to imagine themselves in place of the AI. My brain reflexively tells me "I can’t write a program to control that LED, so even if there’s a bug it won’t happen". If instead I force myself to think "How would I do that if I were the AI", it’s easier to find potential solutions, and it also makes it more obvious that someone else might find one. But that may be because I’m a programmer, I’m not sure if it applies to others.
My best attempt at imagining hardwiring is having a layer not accessible to introspection, such as involuntary muscle control in humans. Or instinctively jerking your hand away when touching something hot. Which serves as a fail-safe against stupid conscious decisions, in a sense. Or a watchdog restarting a stuck program in your phone, no matter how much the software messed it up. Etc. Whether this approach can be used to prevent a tool AI from spontaneously agentizing, I am not sure.
If you can say how to do this in hardware, you can say how to do it in software. The hardware version might arguably be more secure against flaws in the design, but if you can say how to do it at all, you can say how to do it in software.
Maybe I don't understand what you mean by hardware.
For example, you can have a fuse that unconditionally blows when excess power is consumed. This is hardware. You can also have a digital ammeter readable by software, with a polling subroutine which shuts down the system if the current exceeds a certain limit. There is a good reason that such a software solution, while often implemented, is almost never the only safeguard: software is much less reliable and much easier to subvert, intentionally or accidentally. The fuse is impossible to bypass in software, short of accessing an external agent who would attach a piece of thick wire in parallel with the fuse. Is this what you mean by "you can say how to do it in software"?
Presumably a well-designed agent will have nearly infallible trust in certain portions of its code and data, for instance a theorem prover/verifier and the set of fundamental axioms of logic it uses. Manual modifications at that level would be the most difficult for an agent to change, and changes to that would be the closest to the common definition of "hardwiring". Even a fully self-reflective agent will (hopefully) be very cautious about changing its most basic assumptions. Consider the independence of the axiom of choice from ZF set theory. An agent may initially accept choice or not but changing whether it accepts it later is likely to be predicated on very careful analysis. Likewise an additional independent axiom "in games of chess always protect the white-square bishop" would probably be much harder to optimize out than a goal.
Or from another angle wherever friendliness is embodied in a FAI would be the place to "hardwire" a desire to protect the white-square bishop as an additional aspect of friendliness. That won't work if friendliness is derived from a concept like "only be friendly to cognitive processes bearing a suitable similarity to this agent" where suitable similarity does not extend to inanimate objects, but if friendliness must encode measurable properties of other beings then it might be possible to sneak white-square bishops into that class, at least for a (much) longer period than artificial subgoals would last.
Feels like "utility indifference" could be used to get something like that.
What is the mathematical implementation of indifference?
Armstrong suggests that you implement it as a special value that is exactly equal to every other utility value. So if the AI comes up with an outcome that would yield +751 utility, it would treat being destroyed as having +751 utility. Whatever algorithm you choose for resolving ties determines whether the AI suicides (by doing something that causes its human managers to kill it).
Let's see how this works with a hypothetical example. Our good old friend the Paperclip Maximizer to the rescue!
Papey wants more paperclips. It compares several possibilities. In one, it generates 1,000 paperclips before its human operators kill it. In another, it generates 999 paperclips, but its human operators leave it alive, and it will have future chances to create paperclips. It expects to remain alive long enough to generate exactly two additional paperclips this way.
Now, Papey's decision algorithm chooses between outcomes of equal utility in a uniformly random manner. Papey has two possible outcomes right now: suicide (for 1001 pseudo-paperclips, since suicide is always worth as much as any other decision) or generate 999 paperclips now and an expected 2 paperclips later (for 1001 paperclips).
At this point, Papey will, with 50% probability, generate 999 paperclips; otherwise, Papey will do something to cause his operators to kill him. (Not sure what, but presumably it won't involve destroying paperclips.) Poor Papey. There's a 99% probability that he'll suicide within seven decisions.
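A quick sanity check of that arithmetic (assuming each decision really is an independent coin flip between producing paperclips and suicide, as the uniform tie-breaker implies):

```python
# If every decision is a 50/50 tie-break between "make paperclips" and
# "suicide", the chance of surviving k consecutive decisions is 0.5**k.

def survival_probability(decisions):
    return 0.5 ** decisions

print(1 - survival_probability(7))  # 0.9921875: suicide within seven decisions
```

So the "99% within seven decisions" figure checks out: 1 - 0.5^7 = 0.9921875.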
Now, if the tie breaking algorithm isn't random, in a given situation, it will either go with suicide deterministically or go against suicide deterministically. You can always go against suicide, in which case you may as well omit utility indifference entirely. Or you can always go for suicide, in which case you found a fancy way of wasting money.
Or you could do something interesting. But if you had something to put here that guaranteed a good result, you could use that as the utility function.
Friendly AI is indifferent to utility indifference.
I wasn't advocating using utility indifference to death for this example, more "utility indifference as to whether the user ever reads its output" or something along those lines.
But let's look at your example more in detail. You say the AI has two choices: it makes 1000 paperclips and is killed, or it makes 999 and is left alive, and then goes on to make another 2.
In this scenario, there is only one case of death, coming from the first decision. So utility indifference does not change anything about the impact of the second decision. What indifference means is that if humans try and blow Papey up after the 1000 paper clips, it will derive as much utility from a successful blow up as from a failed blow up (in less correct but more intuitive terms, you could say "Papey acts as if it doesn't believe the explosives will work").
So how many paper clips could Papey make if the explosives fail? Let's assume another 2, for a grand total of 1002.
So the paperclip-utility of the first course is 1002, and that of the second is 1001: it will thus take the first course. Hence it will misbehave visibly (rather than keeping that a secret), and will likely end up blown up by us.
Now imagine it could get rid of the explosives at no costs. Here, your point has merit: Papey is indifferent between removing the explosives or not removing it, and it tie breaker algorithm will determine what it does. But let's make this more realistic: removing the explosives has a tiny cost, say a 50% chance it will fail to manufacture one paperclip. So now it can:
1) Misbehave, leave explosives in place: expected utility 1002.
2) Pretend to behave: expected utility 999+2 = 1001.
3) Misbehave, remove explosives: expected utility 1002 - 0.5 = 1001.5.
Hence Papey will choose the first option, we get to see that it misbehaves, and we likely blow it up. That was the initial point of utility indifference.
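The three options above can be tabulated in a few lines (my own illustration, not from the original comment; the numbers are the ones assumed in this thread):

```python
# Papey's expected paperclip-utility under the three options, given that
# utility indifference makes "blown up" worth as much as "explosives fail".
options = {
    "misbehave, leave explosives": 1000 + 2,         # indifference credits the 2 post-failure clips
    "pretend to behave":           999 + 2,          # survives, makes 2 more later
    "misbehave, remove explosives": 1000 + 2 - 0.5,  # removal costs ~0.5 clips in expectation
}

best = max(options, key=options.get)
print(best, options[best])  # misbehave, leave explosives 1002
```

So the misbehaviour is visible by construction, which is the point of the scheme.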
But as I said here, we wouldn't be using utility indifference in this way. It would more be to make the AI "behave as if it believed that nobody would get to read its printout."
Those are quite different. In one case, you're adjusting its utility function. In the other, you're adjusting its expectations. (And the remainder of your post indicates that you intend to adjust the AI's expectations of what will happen, and don't care about its utility function.)
Now, that might work. If the AI is self-modifying, then it might notice that that part of its planning function is obviously incorrect and fix it. It might simulate human responses to its proposed actions and notice that there are problematic conflicts around that area. But there's at least a chance that the AI will not notice that.
I wouldn't want to trust my life to it, but it's got a shot of working.
The method is entirely about adjusting the utility function, not the expectations, as I explained in the paper. And the adjustments are ideally to be made from the very start, so that the AI never has cause to fear subsequent adjustment to its utility.
Rereading the paper, it looks like you're mucking about with utility, not expected utility. So let's see how this works out in practice.
You assign the AI a task of finding a cure for cancer. This will net it 10 utilons if it succeeds. Thanks to your utility indifference module, it will also net 10 utilons if it does not find a cure for cancer and you trigger the explosives to kill the AI.
The AI does a preliminary estimate to see its odds of success for the relevant outcomes. It can find a cure for cancer with 80% likelihood. It can cause you to trigger the explosives and kill the AI with 90% likelihood.
The expected utility for attempting to cure cancer is 8. The expected utility for attempting suicide is 9.
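Spelling out that comparison with the hypothetical numbers above (80% chance of a cure, 90% chance of provoking its own destruction, 10 utilons either way):

```python
# Naive (mistaken) reading of utility indifference: suicide pays the same
# 10 utilons as success, so the easier outcome wins.
p_cure, p_suicide, reward = 0.8, 0.9, 10

eu_cure = p_cure * reward        # 8.0
eu_suicide = p_suicide * reward  # 9.0
print(eu_suicide > eu_cure)      # True: this reading favours suicide
```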
It seems pretty straightforward and easy to pose as a mad, dangerous AI about to destroy the world, at least well enough to make a human panic and hit the emergency destruct button. I'm sure I could convince you to kill me, if you just had to push a button. It seems a lot less certain that the AI would be able to solve the problems given it, since those problems will be things that humans haven't been able to solve, and since the AI will be under various other restrictions to keep it safe.
The AI gets nothing from you deciding to kill it. What utility indifference means is that it will get the same utility if you decide to kill it, and the explosives go off, as if you decided to kill it, and the explosives failed to go off.
So if it expects to cure cancer with 80% likelihood if it survives, and thus get 8 utilons, then if it dies from the explosion, it will also get 8 utilons.
More precisely there are six worlds, depending on a) whether we trigger the explosives or not, b) whether it survives the explosion or not, and c) whether cancer is curable by an unmolested AI. In obvious notation, these are TSC, TS¬C, T¬SC, T¬S¬C, ¬TC, ¬T¬C. We are assuming in your example that U(¬T¬C)=U(TS¬C)=0 and U(¬TC)=U(TSC)=10. Utility indifference then sets U(T¬S¬C)=0 and U(T¬SC)=10.
Hence assuming that S is independent of C (the curability of cancer by an unmolested AI is independent of whether particular explosives go off), we see that the AI is indifferent as to whether the explosives go off - it gains nothing either way, so has no cause to influence the explosives in any direction. If we further assume T and C are independent (the curability of cancer by an unmolested AI is independent of whether we decide to blow up this particular AI), then the AI also has no preference as to whether we trigger the explosives or not - it gains nothing either way.
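The construction above can be checked numerically (a sketch of my own; the probabilities are arbitrary, the utilities are the ones set in the six-world example):

```python
# Utilities for the four T (trigger) worlds, indexed by (survives, curable).
# Indifference copies U(TS.) into the corresponding U(T¬S.) world.
U_T = {
    (True,  True):  10,  # TSC
    (True,  False): 0,   # TS¬C
    (False, True):  10,  # T¬SC  (set by indifference)
    (False, False): 0,   # T¬S¬C (set by indifference)
}
U_noT = {True: 10, False: 0}  # ¬TC, ¬T¬C

p_C = 0.8  # arbitrary: chance cancer is curable by an unmolested AI
p_S = 0.3  # arbitrary: chance the AI survives the explosion

# Expected utility if humans trigger the explosives (S independent of C):
eu_T = sum((p_S if s else 1 - p_S) * (p_C if c else 1 - p_C) * U_T[(s, c)]
           for s in (True, False) for c in (True, False))
# Expected utility if they don't (T independent of C):
eu_noT = p_C * U_noT[True] + (1 - p_C) * U_noT[False]

print(abs(eu_T - eu_noT) < 1e-9)  # True: the AI gains nothing either way
```

Whatever values you pick for p_C and p_S, the two expected utilities match, which is exactly the claimed indifference.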
Ah, that makes sense. It isn't indifferent to suicide as such; it's only indifferent to your success at attempting to kill it, should you make the attempt.
Thanks for your patience!
No prob :-) Always happy when I manage to explain something successfully!
If you make the AI indifferent to its own destruction, it will almost certainly shut down in a couple of minutes. After all, being destroyed is just as good as what it was going to do anyway. (Indeed, in most models of utility maximization it would flip a coin and therefore shut itself down after an average of 2 decisions.)
The AI is only indifferent to its destruction via one particular channel, and gains nothing by trying to trigger that destruction.
But I was more thinking of making the AI indifferent to the reaction of the user "outside the model" or similar.
In fact, it kinda sounds like we've created an AI that suffers from serious clinical depression. "Why bother? I may as well be dead."