
Causal graphs and counterfactuals

3 Stuart_Armstrong 30 August 2016 04:12PM

Problem solved: Found what I was looking for in "An Axiomatic Characterization of Causal Counterfactuals", thanks to Evan Lloyd.

Basically: make every endogenous variable a deterministic function of the exogenous variables and of the other endogenous variables, and push all the stochasticity into the exogenous variables.

 

Old post:

A problem that's come up with my definitions of stratification.

Consider a very simple causal graph:

In this setting, A and B are both booleans, and A=B with 75% probability (independently of whether A=0 or A=1).

I now want to compute the counterfactual: suppose I assume that B=0 when A=0. What would happen if A=1 instead?

The problem is that P(B|A) seems insufficient to solve this. Let's model the process that outputs B as a probabilistic mix of functions that take the value of A and output that of B. There are four natural functions here:

  • f0(x) = 0
  • f1(x) = 1
  • f2(x) = x
  • f3(x) = 1-x

Then one way of modelling the causal graph is as a mix 0.75f2 + 0.25f3. In that case, knowing that B=0 when A=0 implies that P(f2)=1, so if A=1, we know that B=1.

But we could instead model the causal graph as 0.5f2+0.25f1+0.25f0. In that case, knowing that B=0 when A=0 implies that P(f2)=2/3 and P(f0)=1/3. So if A=1, B=1 with probability 2/3 and B=0 with probability 1/3.

And we can design the node B, physically, to implement either of these two distributions over functions, or anything in between (the general formula is (0.5+x)f2 + x·f3 + (0.25-x)f1 + (0.25-x)f0, for 0 ≤ x ≤ 0.25). But it seems that the causal graph does not capture this distinction.
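The ambiguity can be made concrete in a few lines of code. This is just an illustrative sketch (names like `counterfactual_B_given_A1` are mine, and A is assumed uniform): condition each mixture on the observation "B=0 when A=0", then read off the counterfactual value of B when A=1.

```python
from fractions import Fraction as F

# The four deterministic functions from A to B.
funcs = {
    "f0": lambda x: 0,      # constant 0
    "f1": lambda x: 1,      # constant 1
    "f2": lambda x: x,      # identity
    "f3": lambda x: 1 - x,  # negation
}

def counterfactual_B_given_A1(prior):
    """Condition on observing B=0 when A=0, then return P(B=1 | A=1)."""
    # Keep only the functions consistent with the observation f(0) == 0.
    posterior = {name: p for name, p in prior.items() if funcs[name](0) == 0}
    total = sum(posterior.values())
    posterior = {name: p / total for name, p in posterior.items()}
    # Counterfactual: set A=1 and see which surviving functions output B=1.
    return sum(p for name, p in posterior.items() if funcs[name](1) == 1)

# Two mixtures, both giving P(B=A) = 3/4 when A is uniform:
mix_a = {"f2": F(3, 4), "f3": F(1, 4)}
mix_b = {"f2": F(1, 2), "f1": F(1, 4), "f0": F(1, 4)}

print(counterfactual_B_given_A1(mix_a))  # 1
print(counterfactual_B_given_A1(mix_b))  # 2/3
```

Both mixtures produce identical P(B|A), yet they disagree about the counterfactual - which is exactly the problem.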

Owain Evans has said that Pearl has papers covering these kinds of situations, but I haven't been able to find them. Does anyone know any publications on the subject?

Counterfactual Mugging Alternative

-1 wafflepudding 06 June 2016 06:53AM

Edit as of June 13th, 2016: I no longer believe this to be easier to understand than traditional CM, but stand by the rest of it. Minor aesthetic edits made.

First post on the LW discussion board. Not sure if something like this has already been written, need your feedback to let me know if I’m doing something wrong or breaking useful conventions.

An alternative to the counterfactual mugging, since people often need it explained a few times before they understand it. I think this version will be faster for most people to comprehend, because it arose organically rather than seeming specifically contrived to create a dilemma between decision theories:

Pretend you live in a world where time travel exists and Time can create realities with acausal loops, with ordinary linear chronology, or with another structure, so long as there is no paradox -- only self-consistent timelines can be generated.

In your timeline, there are prophets. A prophet (known to you to be honest and truly prophetic) tells you that you will commit an act which seems horrendously imprudent or problematic. It is an act whose effect will be on the scale of losing $10,000; an act you never would have taken ordinarily. But fight the prophecy all you want, it is self-fulfilling and you definitely live in a timeline where the act gets committed. However, if it weren't for the prophecy being immutably correct, you could have spent $100 and, even having heard the prophecy (even having believed it would be immutable), the probability of you taking that action would be reduced by, say, 50%. So fighting the prophecy by spending $100 would mean that there were 50% fewer self-consistent (possible) worlds where you lost the $10,000, because it's just much less likely for you to end up taking that action if you fight it rather than succumbing to it.

You may feel that there would be no reason to spend $100 averting a decision that you know you're going to make, and see no reason to care about counterfactual worlds where you don't lose the $10,000. But the fact of the matter is that if you could have precommitted to fighting the choice, you would have: across the worlds where that prophecy could have been presented to you, precommitting decreases the average disutility by ($10,000)(0.5 probability) - $100 = $4,900. Not following a precommitment that you would have made to prevent the exact situation you're now in, merely because you wouldn't have followed the precommitment, seems an obvious failure mode; UDT successfully does the calculation shown above and tells you to fight the prophecy. The simple fact that updateless decision theorists actually do better on average than CDT proponents should tell causal decision theorists that converting to UDT is the causally optimal decision.
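As a quick sanity check on that arithmetic, using the scenario's numbers:

```python
# Hypothetical numbers from the scenario above.
loss = 10_000       # disutility of the prophesied act
cost = 100          # cost of fighting the prophecy
p_reduction = 0.5   # fraction of self-consistent worlds removed by fighting

# Average gain from precommitting to fight, taken over the worlds in which
# the prophecy could have been delivered:
expected_gain = loss * p_reduction - cost
print(expected_gain)  # 4900.0
```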

 

(You may assume also that your timeline is the only timeline that exists, so as not to further complicate the problem by your degree of empathy with your selves from other existing timelines.)

Presidents, asteroids, natural categories, and reduced impact

1 Stuart_Armstrong 06 July 2015 05:44PM

A putative new idea for AI control; index here.

EDIT: I feel this post is unclear, and will need to be redone again soon.

This post attempts to use the ideas developed about natural categories in order to get high impact from reduced impact AIs.

 

Extending niceness/reduced impact

I recently presented the problem of extending AI "niceness" given some fact X, to niceness given ¬X, choosing X to be something pretty significant but not overwhelmingly so - the death of a president. By assumption we had a successfully programmed niceness, but no good definition (this was meant to be "reduced impact" in a slight disguise).

This problem turned out to be much harder than expected. It seems that the only way to do so is to require the AI to define values dependent on a set of various (boolean) random variables Zj that did not include X/¬X. Then as long as the random variables represented natural categories, given X, the niceness should extend.

What did we mean by natural categories? Informally, it means that X should not appear in the definitions of these random variables. For instance, nuclear war is a natural category; "nuclear war XOR X" is not. Actually defining this was quite subtle; after a detour through the grue and bleen problem, it seems that we had to define how we update X and the Zj given the evidence we expected to find. This was put into an equation by picking Zj's that minimize

  • Variance{log[ P(X∧Z|E)*P(¬X∧¬Z|E) / P(X∧¬Z|E)*P(¬X∧Z|E) ]} 

where E is the random variable denoting the evidence we expected to find. Note that if we interchange X and ¬X, the ratio inverts, the log changes sign - but this makes no difference to the variance. So we can equally well talk about extending niceness given X to ¬X, or niceness given ¬X to X.
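As a rough illustration of this minimisation target, here is a sketch with hypothetical toy distributions (the function names are mine, not the post's). It computes the variance of the log odds ratio over a discrete evidence distribution; a grue-like variable, whose association with X flips sign depending on the evidence, scores much worse than a natural one:

```python
import math

def log_odds_ratio(p):
    """log[ P(X,Z|e)*P(notX,notZ|e) / (P(X,notZ|e)*P(notX,Z|e)) ] for one
    evidence value e, where p maps (x, z) pairs to probabilities given e."""
    return math.log(p[(1, 1)] * p[(0, 0)] / (p[(1, 0)] * p[(0, 1)]))

def naturalness_score(joint_given_e, p_e):
    """Variance of the log odds ratio over the evidence distribution E
    (lower = more natural; 0 means the association never moves with evidence)."""
    values = {e: log_odds_ratio(joint_given_e[e]) for e in p_e}
    mean = sum(p_e[e] * values[e] for e in p_e)
    return sum(p_e[e] * (values[e] - mean) ** 2 for e in p_e)

p_e = {"e1": 0.5, "e2": 0.5}

# A "natural" Z: the X-Z association is the same whichever evidence arrives.
natural = {
    "e1": {(1, 1): 0.4, (0, 0): 0.4, (1, 0): 0.1, (0, 1): 0.1},
    "e2": {(1, 1): 0.4, (0, 0): 0.4, (1, 0): 0.1, (0, 1): 0.1},
}
# A grue-like Z (think "nuclear war XOR X"): the association flips sign.
unnatural = {
    "e1": {(1, 1): 0.4, (0, 0): 0.4, (1, 0): 0.1, (0, 1): 0.1},
    "e2": {(1, 1): 0.1, (0, 0): 0.1, (1, 0): 0.4, (0, 1): 0.4},
}

print(naturalness_score(natural, p_e))    # 0.0
print(naturalness_score(unnatural, p_e))  # roughly 7.7
```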

 

Perfect and imperfect extensions

The above definition would work for a "perfectly nice AI": an AI that would be nice given any combination of estimates of X and Zj. In practice, because we can't consider every edge case, we would only have an "expectedly nice AI". That means that the AI can fail to be nice in certain unusual and unlikely edge cases - in certain strange sets of values of Zj that almost never come up...

...or at least, that almost never come up given X. Since the "expected niceness" was calibrated given X, such an expectedly nice AI may fail to be nice if ¬X results in a substantial change in the probability of the Zj (see the second failure mode in this post; some of the Zj may be so tightly coupled to the value of X that an expected-niceness AI considers them fixed, and this results in problems if ¬X happens and their values change).

One way of fixing this is to require that the "swing" of the Zj be small upon changing X to ¬X or vice versa. Something like, for all values of {aj}, the ratio P({Zj=aj} | X) / P({Zj=aj} | ¬X) is contained between 100 and 1/100. This means that a reasonably good "expected niceness" calibrated on the Zj will transfer from X to ¬X (though the error may grow). This approach has some other advantages, as we'll see in the next section.

Of course, problems arise if the Zj are defined by smashing together events relatively independent of X with some that are very strongly dependent on X, into the same variable. Thus we'll further require that Zj cannot be decomposed into natural subvariables whose values swing more under changes from X to ¬X.
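The swing restriction above can be sketched as a simple check (hypothetical numbers; `swing_ok` is an illustrative name). Note that a variable like "the x-robot is seen", which comes up later in the post, would fail it:

```python
def swing_ok(p_given_X, p_given_notX, bound=100.0):
    """Check that P({Zj=aj}|X) / P({Zj=aj}|notX) lies in [1/bound, bound]
    for every assignment aj."""
    for a in p_given_X:
        ratio = p_given_X[a] / p_given_notX[a]
        if not (1.0 / bound <= ratio <= bound):
            return False
    return True

# A Z whose distribution barely moves when X flips: passes.
print(swing_ok({"war": 0.02, "peace": 0.98},
               {"war": 0.01, "peace": 0.99}))            # True
# A Z tightly coupled to X: fails.
print(swing_ok({"seen": 0.999, "unseen": 0.001},
               {"seen": 1e-6, "unseen": 0.999999}))      # False
```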

 

High impact from reduced impact

With this, we are finally in a position to define a reduced impact AI design that can have a large impact. Imagine that an asteroid is coming to hit the Earth. Two AIs are charged with targeting a laser at it. The laser isn't perfectly effective, and the asteroid isn't perfectly certain to hit the Earth anyway: let's say aiming the laser correctly will reduce the odds of impact by a factor of 10 or so (well within the probability swing restrictions of the previous section).

In the original formulation of this, the AIs were tasked with outputting the x and y coordinates to aim the laser. Here, instead, they'll be tasked with constructing the motivation of a robotic subagent that will move around for a while, see things and interact, before physically aiming the laser in the right direction. This task is harder.

Then the top level AI charged with aiming the y coordinate is given the following task:

  • It is motivated to cause the correct y-coordinate to be outputted.
  • If the x-coordinate AI is not turned on (call this event ¬X), it is strongly motivated to have reduced impact.
  • It must phrase the robotic subagent's utility function in terms of the Zj that obey all the properties above.
  • The evidence E considered for the equation governing the naturalness of Zj is considered (for example) to be a random sampling of space at the time both AIs are turned on. Most importantly, this is defined to be a theoretical process that the AI cannot affect physically.

So what will the AI do? It will produce a robot that will aim the y-coordinate of the laser correctly, given ¬X, and not expand dangerously. However, we've set things up so that X is virtually certain, so the robot will see another, x-aiming robot wandering around. The chances of this happening, given ¬X, are minuscule, so the swing of the probability from ¬X to X is huge. So, by the restrictions we've put on the Zj, the robot will not be motivated to change its behaviour just because it sees the other robot.

And if the laser is fired, the impact on humanity will, conversely, not be that large, so well within the parameters of the Zj. Thus the "general reduced impact" portion of the robot's motivation will be preserved.

And, of course, the x-robot will have the mirror image motivations, so they will both aim correctly, it seems. This should evade the problem described here.

 

Failure modes

There are three ways the approach could fail: one bad way, and two graceful ways. The bad way is if the "no decomposition" requirements (one of them is presented above, the other is in this post) fail. In this case, the niceness or reduced impact are not guaranteed to be extended to ¬X, and the AI's behaviour can become undefined. Note that this is not bad optimisation pressure, just an absence of optimisation pressure - but that could still be potentially dangerous.

The first graceful failure happens if the AI is not able to create a reduced impact sub-agent given the restrictions on the Zj above. In that case, the AI does nothing. The second graceful failure happens if the AI evades our attempts to increase its impact, given ¬X. In that case, it simply becomes a reduced impact AI that does little. Not ideal, but not deadly.

 

Overall status: I'm not sure the idea is sound, at least not yet. Critiques welcome.

The president didn't die: failures at extending AI behaviour

9 Stuart_Armstrong 10 June 2015 04:00PM

A putative new idea for AI control; index here.

In a previous post, I considered the issue of an AI that behaved "nicely" given some set of circumstances, and whether we could extend that behaviour to the general situation, without knowing what "nice" really meant.

The original inspiration for this idea came from the idea of extending the nice behaviour of "reduced impact AI" to situations where they didn't necessarily have a reduced impact. But it turned out to be connected with "spirit of the law" ideas, and to be of potentially general interest.

Essentially, the problem is this: if we have an AI that will behave "nicely" (since this could be a reduced impact AI, I don't use the term "friendly", which denotes a more proactive agent) given X, how can we extend its "niceness" to ¬X? Obviously if we can specify what "niceness" is, we could just require the AI to do so given ¬X. Therefore let us assume that we don't have a good definition of "niceness", we just know that the AI has that given X.

To make the problem clearer, I chose an X that would be undeniably public and have a large (but not overwhelming) impact: the death of the US president on a 1st of April. The public nature of this event prevents using approaches like thermodynamic miracles to define counterfactuals.

I'll be presenting a solution in a subsequent post. In the meantime, to help better understand the issue, here's a list of failed solutions:

 

First failure: maybe there's no problem

Initially, it wasn't clear there was a problem. Could we just expect niceness to extend naturally? But consider the following situation: assume the vice president is a warmonger, who will start a nuclear war if ever they get into power (but is otherwise harmless).

Now assume the nice AI has the conditional action criterion: "if the vice president ever becomes president, launch a coup". This is safe; it can be extended to the ¬X situation in the way we want.

However, conditioning on X, that criterion is equivalent to "launch a coup on the 2nd of April". And if the AI has that criterion, then extending it to ¬X is highly unsafe. This illustrates that there is a real problem here - the coup example is just one of a myriad of potential issues that could arise, and we can't predict them all.

 

Second failure: don't condition on X

Maybe the trick could be preventing the AI from conditioning on X (for anything)? If the AI itself can't tell the difference between X and ¬X, wouldn't its nice behaviour extend?

But this doesn't help. The AI could merely condition on things tightly coupled with X. Like news reports of the president's demise, or a conjunction of other events almost impossible under ¬X.

 

Third failure: disbelieve the truth

In this design, the AI simply assumes that X happens, whatever the evidence. Therefore it would stay nice, whatever happened.

But this results in an AI whose beliefs are strongly decoupled from reality. The AI must believe that there is a vast conspiracy concealing the president's death; the more it investigates this conspiracy, and fails to find it, the more vast and powerful it becomes. What if you show it undeniable, impossible-to-fake-by-the-laws-of-physics evidence of the president's demise? Then it starts to believe that the laws of physics are all wrong.

Though the AI is "nice", its niceness is appropriate for a world of invisible conspiracies and alternate physics - not for the real world. It's clear how such an AI could start doing crazy and dangerous things as its view of reality starts to diverge.

 

Fourth failure: hack the models

This example is due to Daniel Dewey. He imagined that the AI would maintain a full collection of models and update them normally. But before making a prediction, each model would get hacked to artificially set ¬X to X.

To see how this could fail, imagine all evidence about the world is stored in a giant vector of data. These vectors are calibrated correctly, then ¬X is flipped to X. The problem is that, while that is a small change to the vector, it could be a huge change to the world that it encodes.

For instance, let's go back to the warmongering vice president, and assume that nuclear war is inevitable, given X. Then the AI will walk around in a peaceful world, update its models - and act as if it was in a nuclear wasteland, because those are the only possibilities, given X. Essentially, the AI will move through our universe, harvesting information that would inform its actions in a parallel universe - and acting as if it existed there instead of here.

For instance, it could wander into a flower show where someone is talking about difficulties growing roses in southern Quebec. It adds this data to its vector, noting that the soil there must be a bit unsuitable to plant growth. It therefore concludes that it must write to the (non-existent) Third God-Emperor of America and advise it to give up on the Quebec Anglican Protectorate, which must be misreporting their agriculture output, given this data.

It's interesting to contrast this AI with the previous one. Suppose that the nuclear war further implies that Paris must be a smoking crater. And now both AIs must walk around a clearly bustling and intact Paris. The disbelieving AI must conclude that this is an elaborate ruse - someone has hidden the crater from its senses, put up some fake building, etc... The model-hacking AI, meanwhile, acts as if it's in a smouldering crater, with the genuine Paris giving it information as to what it should do: it sees an intact army barracks, and starts digging under the "rubble" to see if anything "remains" of that barracks.

It would be interesting to get Robin Hanson to try and reconcile these AIs' beliefs ^_^

 

Fifth failure: Bayes nets and decisions

It seems that a Bayes net would be our salvation. We could have dependent nodes like "warmongering president", "nuclear war", or "flower show". Then we could require that the AI makes its decision dependent only on the states of these dependent nodes. And never on the original X/¬X node.

This seems safe - after all, the AI is nice given X. And if we require the AI's decisions be dependent only on subordinate nodes, then it must be nice dependent on the subordinate nodes. Therefore X/¬X is irrelevant, and the AI is always nice.

Except... Consider what a "decision" is. A decision could be something simple, or it could be "construct a sub AI that will establish X versus ¬X, and do 'blah' if X, and 'shmer' if ¬X". That's a perfectly acceptable decision, and could be made conditional on any (or all) of the subordinate nodes. And if 'blah' is nice while 'shmer' isn't, we have the same problem.

 

Sixth failure: Bayes nets and unnatural categories

OK, if decisions are too general, how about values for worlds? We take a lot of nodes, subordinate to X/¬X, and require that the AI define its utility or value function purely in terms of the states of these subordinate nodes. Again, this seems safe. The AI's value function is safe given X, by assumption, and is defined in terms of subordinate nodes that "screen off" X/¬X.

And that AI is indeed safe... if the subordinate nodes are sensible. But they're only sensible because I've defined them using terms such as "nuclear war". But what if a node is "nuclear war if X and peace in our time if ¬X"? That's a perfectly fine definition. But such nodes mean that the value function given ¬X need not be safe in any way.
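A toy version of this failure (hypothetical names and payoffs): define one natural node and one grue-like node, calibrate a value function given X, and watch the valuation invert under ¬X.

```python
# Toy world state: (X, war). The "unnatural" node is defined relative to X itself.
def z_natural(x, war):
    return war                      # "nuclear war"

def z_unnatural(x, war):
    return war if x else not war    # "nuclear war if X, peace if notX"

# A value function calibrated given X only: penalise worlds where Z holds.
def value(z):
    return -100 if z else 0

# Given X, the two nodes agree on every world, so the value function looks safe:
for war in (False, True):
    assert value(z_natural(True, war)) == value(z_unnatural(True, war))

# Given notX, the unnatural node inverts: the AI now prefers nuclear war.
print(value(z_unnatural(False, war=True)))   # 0    (war looks fine)
print(value(z_unnatural(False, war=False)))  # -100 (peace looks bad)
```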

This is somewhat connected with the Grue and Bleen issue, and addressing that is how I'll be hoping to solve the general problem.

Help needed: nice AIs and presidential deaths

1 Stuart_Armstrong 08 June 2015 04:47PM

A putative new idea for AI control; index here.

This is a problem that developed from the "high impact from low impact" idea, but is a legitimate thought experiment in its own right (it also has connections with the "spirit of the law" idea).

Suppose that, next 1st of April, the US president may or may not die of natural causes. I chose this example because it's an event of potentially large magnitude, but not overwhelmingly so (neither a butterfly wing nor an asteroid impact).

Also assume that, for some reason, we are able to program an AI that will be nice, given that the president does die on that day. Its behaviour if the president doesn't die is undefined and potentially dangerous.

Is there a way (either at the initial stages of programming or at the later) to extend the "niceness" from the "presidential death world" into the "presidential survival world"?

To focus on how tricky the problem is, assume for argument's sake that the vice-president is a warmonger who will start a nuclear war if they become president. Then "launch a coup on the 2nd of April" is a "nice" thing for the AI to do, conditional on the president dying. However, if you naively import that requirement into the "presidential survival world", the AI will launch a pointless and counterproductive coup. This is illustrative of the kind of problems that could come up.

So the question is, can we transfer niceness in this way, without needing a solution to the full problem of niceness in general?

EDIT: Actually, this seems ideally setup for a Bayes network (or for the requirement that a Bayes network be used).

EDIT2: Now the problem of predicates like "Grue" and "Bleen" seems to be the relevant bit. If you can avoid concepts such as "X={nuclear war if president died, peace if president lived}", you can make the extension work.

Humans get different counterfactuals

2 Stuart_Armstrong 23 March 2015 02:54PM

A putative new idea for AI control; index here.

A lot of my ideas rely upon taking counterfactuals of events that have tiny ε probabilities of ever happening. This introduces some extra risks, mainly centred around human behaviour. For the counterfactual methods to work, we need the alternate world to be sensible, without things going crazy due to human action.

For instance, imagine that X defines an event where an AI would get turned on, with ¬X (of probability ε) corresponding to the AI failing to get turned on. There are two risks here: the first is that humans would react by saying "wow, an event of probability of ε actually happened; all our models must be wrong! Let's go crazy! Yay!". The second is that humans react by saying: "well, that was embarrassing - let's turn the AI on anyway."

To avoid this issue, imagine the following setup:

The "ON" signal first goes through an event Y, which has 99% chance of letting it through, then to the event X, which (as before) has a 1-ε chance of letting it through. The setup is designed so that humans cannot distinguish between ¬Y (the signal gets blocked at the first stage) and ¬X (the signal gets blocked at the second stage). This only needs to fool humans, not the AI itself.

The AI defines counterfactuals, as before, by looking at ¬X (possibly conditioning on Y versus ¬Y, if this is needed). Everything proceeds as previously from its perspective.

From the human perspective, however, the ¬X world is not distinguishable from the ¬Y one. Given (¬Y or ¬X), humans would conclude that ¬Y is the much more likely option:

P(¬Y|¬Y or ¬X)≈1-100ε.

So the ¬X counterfactual world (for the AI) is one where humans behave as if they were in the ¬Y world.

And ¬Y has one chance in a hundred of happening, which is unlikely, but not enough for humans to assume that their whole model of reality is wrong. Also, this is sufficiently likely that humans would give serious thought as to what to do in the ¬Y case, maybe arranging various pre-commitments or insurance options, making it possible to arrange (with high probability) that humans don't just ignore the result and try again immediately.
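The approximation P(¬Y|¬Y or ¬X) ≈ 1-100ε can be checked numerically (illustrative values):

```python
eps = 1e-9     # probability that X blocks the signal
p_notY = 0.01  # the first stage blocks with probability 1%

# P(notY | notY or notX): notX only matters when Y let the signal through.
numerator = p_notY
denominator = p_notY + (1 - p_notY) * eps
posterior = numerator / denominator

print(posterior)                   # approx 1 - 99*eps
print(1 - posterior < 100 * eps)   # True
```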

Note that this method can't be used (obviously) if ¬X is something hideously dangerous (like an unleashed UFAI), but in all other cases, it seems implementable.

A counterfactual and hypothetical note on AI safety design

9 Stuart_Armstrong 11 March 2015 04:20PM

A putative new idea for AI control; index here.

A lot of the new ideas I've been posting could be parodied as going something like this:

The AI A, which is utility indifferent to the existence of AI B, has utility u (later corriged to v', twice), and it will create a subagent C which believe via false thermodynamic miracles that D does not exist, while D' will hypothetically and counterfactually use two different definitions of counterfactual so that the information content of its own utility cannot be traded with a resource gathering agent E that doesn't exist (assumed separate from its unknown utility function)...

What is happening is that I'm attempting to define algorithms that accomplish a particular goal (such as obeying the spirit of a restriction, or creating a satisficer). Typically this algorithm has various underdefined components - such as inserting an intelligent agent at a particular point, controlling the motivation of an agent at a point, effectively defining a physical event, or having an agent believe (or act as if they believed) something that was incorrect.

The aim is to reduce the problem from stuff like "define human happiness" to stuff like "define counterfactuals" or "pinpoint an AI's motivation". These problems should be fundamentally easier - if not for general agents, then for some of the ones we can define ourselves (this may also allow us to prioritise research directions).

And I also have no doubt that once a design is available, that it will be improved upon and transformed and made easier to implement and generalise. Therefore I'm currently more interested in objections of the form "it won't work" than "it can't be done".

Counterfactual trade

9 owencb 09 March 2015 01:23PM

Counterfactual trade is a form of acausal trade, between counterfactual agents. Compared to a lot of acausal trade this makes it more practical to engage in with limited computational and predictive powers. In Section 1 I’ll argue that some human behaviour is at least interpretable as counterfactual trade, and explain how it could give rise to phenomena such as different moral circles. In Section 2 I’ll engage in wild speculation about whether you could bootstrap something in the vicinity of moral realism from this.

Epistemic status: these are rough notes on an idea that seems kind of promising but that I haven’t thoroughly explored. I don’t think my comparative advantage is in exploring it further, but I do think some people here may have interesting things to say about it, which is why I’m quickly writing this up. I expect at least part of it has issues, and it may be that it’s handicapped by my lack of deep familiarity with the philosophical literature, but perhaps there’s something useful in here too. The whole thing is predicated on the idea of acausal trade basically working.

0. Set-up
Acausal trade is trade between two agents that are not causally connected. In order for this to work they have to be able to predict the other’s existence and how they might act. This seems really hard in general, which inhibits the amount of this trade that happens.

If we had easier ways to make these predictions we’d expect to see more acausal trade. In fact I think counterfactuals give us such a method.

Suppose agents A and B are in scenario X, and A can see a salient counterfactual scenario Y containing agents A’ and B’ (where A is very similar to A’ and B is very similar to B’). Suppose also that from the perspective of B’ in scenario Y, X is a salient counterfactual scenario. Then A and B’ can engage in acausal trade (so long as A cares about A’ and B’ cares about B). Let’s call such trade counterfactual trade.

Agents might engage in counterfactual trade either because they do care about the agents in the counterfactuals (at least seems plausible for some beliefs about a large multiverse), or because it’s instrumentally useful as a tractable decision rule which works as a better approximation to what they’d ideally like to do than similarly tractable versions.

1. Observed counterfactual trade
In fact, some moral principles could arise from counterfactual trade. The rule that you should treat others as you would like to be treated is essentially what you’d expect to get by trading with the counterfactual in which your positions are reversed. Note I’m not claiming that this is the reason people have this rule, but that it could be. I don’t know whether the distinction is important.

It could also explain the fact that people have lessening feelings of obligation to people in widening circles around them. The counterfactual in which your position is swapped with that of someone else in your community is more salient than the counterfactual in which your position is swapped with someone from a very different community -- and you expect it to be more salient to their counterpart in the counterfactual, too. This means that you have a higher degree of confidence in the trade occurring properly with people in close counterfactuals, hence more reason to help them for selfish reasons.

Social shifts can change the salience of different counterfactuals and hence change the degree of counterfactual trade we should expect. (There is something like a testable prediction in this direction, of the theory that humans engage in counterfactual trade! But I haven’t worked through the details enough to get to that test.)

2. Towards moral realism?
Now I will get even more speculative. As people engage in more counterfactual trade, their interests align more closely. If we are willing to engage with a very large set of counterfactual people, then our interests could converge to some kind of average of the interests of these people. This could provide a mechanism for convergent morality.

This would bear some similarities to moral contractualism with a veil of ignorance. There seem to be some differences, though. We’d expect to weigh the interests of others only to the extent to which they too engage (or counterfactually engage?) in counterfactual trade.

It also has some similarities to preference utilitarianism, but again with some distinctions: we would care less about satisfying the preferences of agents who cannot or would not engage in such trade (except insofar as our trade partners may care about the preferences of such agents). We would also care more about the preferences of agents who could have more power to affect the world. Note that this sense of “care less” is as-we-act. If we start out for example with a utilitarian position before engaging in counterfactual trade, then although we will end up putting less effort into helping those who will not trade than before, this will be compensated by the fact that our counterfactual trade partners will put more effort into that.

If this works, I’m not sure whether the result is something you’d want to call moral realism or not. It would be a morality that many agents would converge to, but it would be ‘real’ only in the sense that it was a weighted average of so many agents that individual agents could only shift it infinitessimally.

False thermodynamic miracles

13 Stuart_Armstrong 05 March 2015 05:04PM

A putative new idea for AI control; index here. See also Utility vs Probability: idea synthesis.

Ok, here is the problem:

  • You have to create an AI that believes (or acts as if it believed) that event X is almost certain, while you believe that X is almost impossible. Furthermore, you have to be right. To make things more interesting, the AI is much smarter than you, knows everything that you do (and more), and has to react sensibly when event X doesn't happen.

Answers will be graded on mathematics, style, colours of ink, and compatibility with the laws of physics. Also, penmanship. How could you achieve this?


AI caught by a module that counterfactually doesn't exist

9 Stuart_Armstrong 17 November 2014 05:49PM

EDIT: Whenever I use colloquial phrases like "the AI believes a (false) X" I mean that we are using utility indifference to accomplish that goal, without actually giving the AI false beliefs.

A big problem with most AI control ideas is that the AI will route around them. Suppose you try to ensure that the AI stays friendly by - say - requiring it to ensure that the world's stock markets continue to function. Then a badly motivated AI will cause the destruction it was going to cause anyway - except it will also create some system that technically counts as "the world's stock markets" and ensure that it "continues to function".

I think we can use counterfactuals and corrigibility (AI-acceptable motivation changes) to partially get around the problem, in some circumstances. Suppose we had a candidate AI, with utility U, that we thought was friendly - but that we weren't sure. We have a subgoal, say SG="ensuring that the world's stock markets continue to function". We're pretty sure that if the AI is friendly, then SG will happen. We're pretty sure that if the AI were unfriendly, and the AI did not have SG added as an objective, then SG wouldn't happen. I deliberately chose a rather rough subgoal to illustrate this.

So we then add a module M to the AI. The task of M, which it has chosen to accept, is to look into the future, using the AI's predictive abilities, and see if SG is accomplished. Crucially, it will do so under the assumption that the module M does not exist (or wasn't activated, or will soon be overwritten, or will fail to do anything, or so on). If M finds that SG is not accomplished, it will turn the AI off (in a way compatible with the AI's motivations - see corrigibility).

How is this different from the usual setup? Well, a generic FAI would accomplish SG (by assumption - we may need to work a bit on this part). A generic UFAI without M wouldn't accomplish SG. A generic UFAI with M would accomplish SG, as this prevents it from being turned off.

But a generic UFAI with M which thought that M would have no impact, would behave exactly as a generic UFAI without M - and wouldn't aim to accomplish SG, and thus would be turned off by M.
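The module-M mechanism above can be sketched in code. This is a toy formalisation of my own, not from the post (all function names and arguments are hypothetical): M queries the AI's predictor under the counterfactual assumption that M itself has no effect, and triggers a corrigible shutdown if SG fails in that predicted future.

```python
# Toy sketch of module M (hypothetical names, not from the post): M asks the
# AI's own predictor what happens, but under the counterfactual assumption
# that M itself is inactive, and shuts the AI down if SG fails there.

def module_m(predict_future, subgoal_holds, shut_down):
    """predict_future(assume_m_inactive): the AI's prediction of the world.
    subgoal_holds(world): True iff SG (e.g. functioning stock markets) holds.
    shut_down(): corrigibly turns the AI off."""
    # Crucially, the prediction is made as if M does not exist.
    predicted_world = predict_future(assume_m_inactive=True)
    if not subgoal_holds(predicted_world):
        shut_down()

# A UFAI that believes M is inactive has no reason to pursue SG, so its
# predicted world has SG failing -- and M therefore triggers the shutdown.
events = []
module_m(
    predict_future=lambda assume_m_inactive: {"sg": False},
    subgoal_holds=lambda world: world["sg"],
    shut_down=lambda: events.append("shutdown"),
)
```

The point of the stubbed-out prediction is exactly the asymmetry in the text: a friendly AI (or a UFAI gaming M) would make SG hold, but a UFAI predicting a world where M does nothing has no incentive to fake SG, and so gets caught.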

Note that M can only be used as a last test for a putatively friendly AI. We cannot use it if we are continually refining AIs (either through evolution or simply redesign), as then we are optimising for SG, and SG is a poor goal to be aiming for (many, many UFAIs have SG as a goal - it's just that a generic one won't). Similarly, we can't use an unconstrained search to find such an AI.

I wonder if this idea can be extended. Suggestions?

[LINK] Counterfactual Strategies

2 Strilanc 17 June 2014 07:29PM

 

Link:

Counterintuitive Counterfactual Strategies

 

Overview:

Over the weekend, I was thinking about the variant of Newcomb's Paradox where both boxes are transparent. The one where, unless you precommit to taking a visibly empty box instead of both boxes, Omega can self-consistently give you less money.

I was wondering if I could make this kind of "sacrifice yourself for yourself" situation happen without involving a predictor guessing your choice before you made it. Turns out you can.

The innovation tree, overshadowed in the innovation forest

12 Stuart_Armstrong 25 February 2014 02:11PM

Cross-posted at Practical Ethics.

Many have pronounced the era of innovation dead, peace be to its soul. From Tyler Cowen's decree that we've picked all the low-hanging fruit of innovation, through Robert Gordon's idea that further innovation growth is threatened by "six headwinds", to Garry Kasparov's and Peter Thiel's theory that risk aversion has stifled innovation, there is no lack of predictions about the end of discovery.

I don't propose to address the issue with something as practical and useful as actual data. Instead, staying true to my philosophical environment, I propose a thought experiment that hopefully may shed some light. The core idea is that we might be underestimating the impact of innovation because we have so much of it.

Imagine that technological innovation had for some reason stopped around 1945 - with one exception: the CD and CD player/burner. Fast forward a few decades, and visualise society. We can imagine a society completely dominated by the CD. We'd have all the usual uses for the CD - music, songs and similar - of course, but also much more.

continue reading »

Counterfactual self-defense

0 MrMind 23 November 2012 10:15AM

Let's imagine the following dialogues between Omega and an agent implementing TDT. The usual standard assumptions about Omega apply: the agent knows Omega is real, trustworthy and reliable, and Omega knows that the agent knows that, and the agent knows that Omega knows that the agent knows, etc. (that is, Omega's trustworthiness is common knowledge, à la Aumann).

Dialogue 1.

Omega: "Would you accept a bet where I pay you 1000$ if a fair coin flip comes out tail and you pay me 100$ if it comes out head?"
TDT: "Sure I would."
Omega: "I flipped the coin. It came out head."
TDT: "Doh! Here's your 100$."

I hope there's no controversy here.

Dialogue 2.

Omega: "I flipped a fair coin and it came out head."
TDT: "Yes...?"
Omega: "Would you accept a bet where I pay you 1000$ if the coin flip came out tail and you pay me 100$ if it came out head?"
TDT: "No way!"

I also hope no controversy arises: if the agent would answer yes, then there's no reason it wouldn't accept all kinds of losing bets conditioned on information it already knows.

The two bets are equal, but the information is presented in a different order: in the second dialogue, the agent has had time to update its knowledge about the world, and should not accept bets that it already knows are losing.

But then...

Dialogue 3.

Omega: "I flipped a coin and it came out head. I offer you a bet where I pay you 1000$ if the coin flip comes out tail, but only if you agree to pay me 100$ if the coin flip comes out head."
TDT: "...?"

In the original counterfactual discussion, the answer of the TDT-implementing agent should apparently have been yes, but I'm not entirely clear on what the difference is between the second and the third case.

Thinking about it, it seems that the case is muddled because the outcome and the bet are presented at the same time. On one hand, it appears correct to think that an agent should act exactly as it would have if it had pre-committed; on the other hand, an agent should not ignore any information that is presented (this is a basic requirement of treating probability as extended logic).

So here's a principle I would like to call 'counterfactual self-defense': whenever information and bets are presented to the agent at the same time, it always first conditions its priors and only then examines whatever bets have been offered. This should prevent Omega from offering counterfactual losing bets, but not counterfactual winning ones.
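The principle can be written down as a small decision rule. This is my own toy formalisation, not from the post (the `decide` function and its arguments are hypothetical): condition the prior on all stated information first, then compute the bet's expected value, and accept only if it is positive.

```python
# Toy sketch of 'counterfactual self-defense' (hypothetical formalisation):
# condition on everything Omega has announced, then evaluate the bet.

def decide(prior, evidence, payoffs):
    """prior: dict mapping outcomes to probabilities.
    evidence: an announced outcome (e.g. "head"), or None if nothing is known.
    payoffs: dict mapping outcomes to the payoff of accepting the bet.
    Returns True iff the bet has positive expected value after conditioning."""
    if evidence is not None:
        # Conditioning on a directly announced outcome collapses the prior.
        posterior = {o: (1.0 if o == evidence else 0.0) for o in prior}
    else:
        posterior = prior
    expected_value = sum(posterior[o] * payoffs[o] for o in prior)
    return expected_value > 0

fair_coin = {"head": 0.5, "tail": 0.5}
bet = {"head": -100, "tail": 1000}

decide(fair_coin, None, bet)     # Dialogue 1: EV = 0.5*1000 - 0.5*100 = 450 -> accept
decide(fair_coin, "head", bet)   # Dialogues 2 and 3: EV = -100 -> decline
```

On this rule, dialogue 3 is treated exactly like dialogue 2: the announcement "it came out head" is conditioned on before the bet is examined, so the losing bet is declined regardless of the order in which Omega phrases it.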

Would this principle make an agent win more?

The lessons of a world without Hitler

-4 Stuart_Armstrong 16 January 2012 04:16PM

What would the world look like without Hitler? Fiction is generally unequivocal about this: the removal of Hitler makes no difference; the world will still lurch towards a world war through some other path. WWII and the Holocaust are such major, defining events of the twentieth century that we twist counterfactual events to ensure they still happen.

Against this, some have made the argument that Hitler was essentially solely responsible for WWII and especially for the Holocaust - no Hitler, no war, no extermination camps. The no-Holocaust argument is quite solid: the extermination system was expensive, militarily counter-productive, and could only have happened given a leader lacking checks and balances and with an idée fixe that overrode everything else (general European antisemitism allowed the Holocaust, but didn't cause it). The no-WWII argument points out that Hitler was both irrational and lucky: he often took great risks, on flimsy evidence, and got away with them. Certainly his decisions in the later, post-Barbarossa period of his reign belie political, military or organisational genius. And it was the height of stupidity to have gone to war, for half of Poland, with simultaneously the world's greatest empire and what appeared to be the overwhelmingly strong French army. Yes, Gamelin, the French commander-in-chief, did behave like a concussed duckling, and the German army outfought the French - but no-one could have predicted this, no-one sensible would have counted on it, and hence they wouldn't have risked the war. Hitler wasn't sensible, and lucked out.

continue reading »