You're looking at Less Wrong's discussion board. This includes all posts, including those that haven't been promoted to the front page yet. For more information, see About Less Wrong.

Presidents, asteroids, natural categories, and reduced impact

1 Stuart_Armstrong 06 July 2015 05:44PM

A putative new idea for AI control; index here.

EDIT: I feel this post is unclear, and will need to be redone again soon.

This post attempts to use the ideas developed about natural categories in order to get high impact from reduced impact AIs.

 

Extending niceness/reduced impact

I recently presented the problem of extending AI "niceness" given some fact X, to niceness given ¬X, choosing X to be something pretty significant but not overwhelmingly so - the death of a president. By assumption we had a successfully programmed niceness, but no good definition (this was meant to be "reduced impact" in a slight disguise).

This problem turned out to be much harder than expected. It seems that the only way to do so is to require the AI to define values dependent on a set of various (boolean) random variables Zj that did not include X/¬X. Then as long as the random variables represented natural categories, given X, the niceness should extend.

What did we mean by natural categories? Informally, it means that X should not appear in the definitions of these random variables. For instance, nuclear war is a natural category; "nuclear war XOR X" is not. Actually defining this was quite subtle; diverting through the grue and bleen problem, it seems that we had to define how we update X and the Zj given the evidence we expected to find. This was put in equation as picking Zj's that minimize

  • Variance{log[ P(X∧Z|E)*P(¬X∧¬Z|E) / P(X∧¬Z|E)*P(¬X∧Z|E) ]} 

where E is the random variable denoting the evidence we expected to find. Note that if we interchange X and ¬X, the ratio inverts, the log changes sign - but this makes no difference to the variance. So we can equally well talk about extending niceness given X to ¬X, or niceness given ¬X to X.

 

Perfect and imperfect extensions

The above definition would work for an "perfectly nice AI". That could be an AI that would be nice, given any combination of estimates of X and Zj. In practice, because we can't consider every edge case, we would only have an "expectedly nice AI". That means that the AI can fail to be nice in certain unusual and unlikely edge cases, in certain strange set of values of Zj that almost never come up...

...or at least, that almost never come up, given X. Since the "expected niceness" was calibrated given X, the such an expectedly nice AI may fail to be nice if ¬X results in a substantial change in the probability of the Zj (see the second failure mode in this post; some of the Zj may be so tightly coupled to the value of X that an expected niceness AI considers them fixed, and this results in problems if ¬X happens and their values change).

One way of fixing this is to require that the "swing" of the Zj be small upon changing X to ¬X or vice versa. Something like, for all values of {aj}, the ratio P({Zj=aj} | X) / P({Zj=aj} | ¬X) is contained between 100 and 1/100. This means that a reasonably good "expected niceness" calibrated on the Zj will transfer from X to ¬X (though the error may grow). This approach has some other advantages, as we'll see in the next section.

Of course, problems arise if the Zj are defined by smashing together events relatively independent of X with some that are very strongly dependent on X, into the same variable. Thus we'll further require that Zj cannot be decomposed into natural subvariables whose values swing more under changes from X to ¬X.

 

High impact from reduced impact

With this, we are finally in a position to define a reduced impact AI design that can have a large impact. Imagine that an asteroid is coming to hit the Earth. Two AIs are charged with targeting a laser at it. The laser isn't perfectly effective, and the asteroid isn't perfectly certain to hit the Earth anyway: let's say aiming the laser correctly will reduce the odds of impact by a factor of 10 or so (well within the probability swing restrictions of the previous section).

In the original formulation of this, the AIs were tasked with outputting the x and y coordinates to aim the laser. Here, instead, they'll be tasked with constructing the motivation of a robotic subagent that will move around for a while, see things and interact, before physically aiming the laser in the right direction. This task is harder.

Then the top level AI charged with aiming the y coordinates is given the following task:

  • It is motivated to cause the correct y-coordinate to be outputted.
  • If the x-coordinate AI is not turned on (call this event ¬X), it is strongly motivated to have reduced impact.
  • It must phrase the robotic subagent's utility function in terms of the Zj that obey all the properties above.
  • The evidence E considered for the equation governing the naturalness of Zj is considered (for example) to be a random sampling of space at the time both AIs are turned on. Most importantly, this is defined to be a theoretical process that the AI cannot affect physically.

So what will the AI do? It will produce a robot that will aim to aim the y-coordinate of the laser correctly, given ¬X, and not expand dangerously. However, we've setup things so that X is virtually certain, so the robot will see another, x-aiming robot, wander around. The chances of this happening, given ¬X, are minuscule, so the swing of the probability from ¬X to X, is huge. So, by the restrictions we've put on the Zj, the robot will not be motivated to change its behaviour just because its sees the other robot.

And if the laser is fired, the impact on humanity will, conversely, not be that large, so well within the parameters of the Zj. Thus the "general reduced impact" portion of the robot's motivation will be preserved.

And, of course, the x-robot will have the mirror image motivations, so they will both aim correctly, it seems. This should evade the problem described here.

 

Failure modes

There are two ways the approach could fail: one bad way, and two graceful ways. The bad way is if the "no decomposition" requirements (one of them is presented above, the other is in this post) fail. In this case, the niceness or reduced impact are not guaranteed to be extended to ¬X, and the AI's behaviour can become undefined. Note that this is is not bad optimisation pressure, just an absence of optimisation pressure - but that could still be potentially dangerous.

The first graceful failure happens if the AI is not able to create a reduced impact sub-agent given the restrictions on the Zj above. In that case, the AI does nothing. The second graceful failure happens if the AI evades our attempts to increase its impact, given ¬X. In that case, it simply becomes a reduced impact AI that does little. Not ideal, but not deadly.

 

Overall status: I'm not sure the idea is sound, at least not yet. Critiques welcome.

The president didn't die: failures at extending AI behaviour

9 Stuart_Armstrong 10 June 2015 04:00PM

A putative new idea for AI control; index here.

In a previous post, I considered the issue of an AI that behaved "nicely" given some set of circumstances, and whether we could extend that behaviour to the general situation, without knowing what "nice" really meant.

The original inspiration for this idea came from the idea of extending the nice behaviour of "reduced impact AI" to situations where they didn't necessarily have a reduced impact. But it turned out to be connected with "spirit of the law" ideas, and to be of potentially general interest.

Essentially, the problem is this: if we have an AI that will behave "nicely" (since this could be a reduced impact AI, I don't use the term "friendly", which denotes a more proactive agent) given X, how can we extend its "niceness" to ¬X? Obviously if we can specify what "niceness" is, we could just require the AI to do so given ¬X. Therefore let us assume that we don't have a good definition of "niceness", we just know that the AI has that given X.

To make the problem clearer, I chose an X that would be undeniably public and have a large (but not overwhelming) impact: the death of the US president on a 1st of April. The public nature of this event prevents using approaches like thermodynamic miracles to define counterfactuals.

I'll be presenting a solution in a subsequent post. In the meantime, to help better understand the issue, here's a list of failed solutions:

 

First Failure: maybe there's no problem

Initially, it wasn't clear there was a problem. Could we just expect niceness to extend naturally? But consider the following situation: assume the vice president is a warmonger, who will start a nuclear war if ever they get into power (but is otherwise harmless).

Now assume the nice AI has the conditional action criteria: "if the vice president ever becomes president, launch a coup". This is safe, it can be extended to the ¬X situation in the way we want.

However, conditioning on X, that criteria is equivalent with "launch a coup on the 2nd of April". And if the AI has that criteria, then extending it to ¬X is highly non-safe. This illustrates that there is a real problem here - the coup example is just one of the myriad of potential issues that could arise, and we can't predict them all.

 

Second failure: don't condition on X

Maybe the trick could be preventing the AI from conditioning on X (for anything)? If the AI itself can't tell the difference between X and ¬X, wouldn't its nice behaviour extend?

But this doesn't help. The AI could merely condition on things tightly coupled with X. Like news reports of the president's demise, or a conjunction of other events almost impossible under ¬X.

 

Third failure: disbelieve the truth

In this design, the AI simply assumes that X happens, whatever the evidence. Therefore it would stay nice, whatever happened.

But this results in an AI who's beliefs are strongly decoupled with reality. The AI must believe that there is a vast conspiracy concealing the president's death; the more it investigates this conspiracy, and fails to find it, the more vast and powerful it becomes. What if you show it undeniable, impossible-to-fake-by-the-laws-of-physics evidence of the president's demise? Then it starts to believe that the laws of physics are all wrong.

Though the AI is "nice", it's niceness is appropriate for a world of invisible conspiracies and alternate physics - not for the real world. It's clear how such an AI could start doing crazy and dangerous things as its view of reality starts to diverge.

 

Fourth failure: hack the models

This example is due to Daniel Dewey. He imagined that the AI would maintain a full collection of models and update them normally. But before making a prediction, each model would get hacked to artificially set ¬X to X.

To see how this could fail, imagine all evidence about the world is stored in a giant vector of data. These vectors are calibrated correctly, then ¬X is flipped to X. The problem is that, while that is small change to the vector, it could be a huge change to the world that it encodes.

For instance, let's go back to the warmongering vice president, and assume that nuclear war is inevitable, given X. Then the AI will walk around in a peaceful world, update its models - and act as if it was in a nuclear wasteland, because those are the only possibilities, given X. Essentially, the AI will move through our universe, harvesting information that would inform its actions in a parallel universe - and acting as if it existed there instead of here.

For instance, it could wander into a flower show where someone is talking about difficulties growing roses in southern Quebec. It adds this data to its vector, noting that the soil there must be a bit unsuitable to plant growth. It therefore concludes that it must write to the (non-existent) Third God-Emperor of America and advise it to give up on the Quebec Anglican Protectorate, which must be misreporting their agriculture output, given this data.

It's interesting to contrast this AI with the previous one. Suppose that the nuclear war further implies that Paris must be a smoking crater. And now both AIs must walk around a clearly bustling and intact Paris. The disbelieving AI must conclude that this is an elaborate ruse - someone has hidden the crater from its senses, put up some fake building, etc... The model-hacking AI, meanwhile, acts as if it's in a smouldering crater, with the genuine Paris giving it information as to what it should do: it sees an intact army barracks, and starts digging under the "rubble" to see if anything "remains" of that barracks.

It would be interesting to get Robin Hanson to try and reconcile these AIs' beliefs ^_^

 

Fifth failure: Bayes nets and decisions

It seems that a Bayes net would be our salvation. We could have dependent nodes like "warmongering president", "nuclear war", or "flower show". Then we could require that the AI makes its decision dependent only on the states of these dependent nodes. And never on the original X/¬X node.

This seems safe - after all, the AI is nice given X. And if we require the AI's decisions be dependent only on subordinate nodes, then it must be nice dependent on the subordinate nodes. Therefore X/¬X is irrelevant, and the AI is always nice.

Except... Consider what a "decision" is. A decision could be something simple, or it could be "construct a sub AI that will establish X versus ¬X, and do 'blah' if X, and 'shmer' if ¬X". That's a perfectly acceptable decision, and could be made conditional on any (or all) of the subordinate nodes. And if 'blah' is nice while 'shmer' isn't, we have the same problem.

 

Six failure: Bayes nets and unnatural categories

OK, if decisions are too general, how about values for worlds? We take a lot of nodes, subordinate to X/¬X, and require that the AI define its utility or value function purely in terms of the states of these subordinate nodes. Again, this seems safe. The AI's value function is safe given X, by assumption, and is defined in terms of subordinate nodes that "screen off" X/¬X.

And that AI is indeed safe... if the subordinate nodes are sensible. But they're only sensible because I've defined them using terms such as "nuclear war". But what if a node is "nuclear war if X and peace in our time if ¬X"? That's a perfectly fine definition. But such nodes mean that the value function given ¬X need not be safe in any way.

This is somewhat connected with the Grue and Bleen issue, and addressing that is how I'll be hoping to solve the general problem.

NKCDT: The Big Bang Theory

-12 [deleted] 10 November 2012 01:15PM

Hi, Welcome to the first Non-Karmic-Casual-Discussion-Thread.

This is a place for [purpose of thread goes here].

In order to create a causal non karmic environment for every one we ask that you

-Do not upvote or downvote any zero karma posts

-If you see a vote with positive karma, downvote it towards zero, even if it’s a good post

-If you see a vote with negative karma, upvote it towards zero, even if it’s a weak post

-Please be polite and respectful to other users

-Have fun!”

 

 

This is my first attempt at starting a casual conversation on LW where people don't have to worry about winning or losing points, and can just relax and have social fun together.

 

So, Big Bang Theory. That series got me wondering. It seems to be about "geeks", and not the basement-dwelling variety either; they're highly successful and accomplished professionals, each in their own field. One of them has been an astronaut, even. And yet, everything they ever accomplish amounts to absolutely nothing in terms of social recognition or even in terms of personal happiness. And the thing is, it doesn't even get better for their "normal" counterparts, who are just as miserable and petty.

 

Consider, then; how would being rationalists would affect the characters on this show? The writing of the show relies a lot on laughing at people rather than with them; would rationalist characters subvert that? And how would that rationalist outlook express itself given their personalities? (After all, notice how amazingly different from each other Yudkowsky, Hanson, and Alicorn are, just to name a few; they emphasize rather different things, and take different approaches to both truth-testing and problem-solving).

Note: this discussion does not need to be about rationalism. It can be a casual, normal discussion about the series. Relax and enjoy yourselves.

 

But the reason I brought up that series is that its characters are excellent examples of high intelligence hampered by immense irrationality. The apex of this is represented by Dr. Sheldon Cooper, who is, essentially, a complete fundamentalist over every single thing in his life; he applies this attitude to everything, right down to people's favorite flavor of pudding: Raj is "axiomatically wrong" to prefer tapioca, because the best pudding is chocolate. Period. This attitude makes him a far, far worse scientist than he thinks, as he refuses to even consider any criticism of his methods or results.