James_Miller comments on Causal decision theory is unsatisfactory - LessWrong

Post author: So8res 13 September 2014 05:05PM


Comment author: James_Miller 13 September 2014 06:48:24PM *  1 point [-]

However, CDT does fail on a very similar problem where it seems insane to fail. CDT fails at the token trade even when it knows it is playing against a perfect copy of itself.

I don't see this as a failure. As I'm sure you know, if the game is still a PD, then regardless of what my clone does I'm better off defecting. If my actions influence what my clone does, or if I care about my clone's wealth, then the game might no longer be a PD, and as simple game theory predicts, I might not defect.

Comment author: So8res 13 September 2014 07:17:55PM *  8 points [-]

Imagine the following game: You give me a program that must either output "give" or "keep". I play it in a token trade (as defined above) against itself. I give you the money from only the first instance (you don't get the wealth that goes to the second instance).

Would you be willing to pay me $150 to play this game? (I'd be happy to pay you $150 to give me this opportunity, for the same reason that I cooperate with myself on a one-shot PD.)
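For concreteness, this "give me a program" game can be sketched in a few lines, assuming the token-trade payoffs from the original post (keeping your token is worth $100 to you, and a given token is worth $200 to the receiver; the exact dollar amounts are an assumption, only their ordering matters):

```python
def token_trade_payoffs(my_move, their_move):
    """Return my payoff in a single token trade (assumed amounts)."""
    payoff = 0
    if my_move == "keep":
        payoff += 100          # my token, kept
    if their_move == "give":
        payoff += 200          # their token, given to me
    return payoff

def play_against_itself(program):
    """Run the submitted program against a copy of itself and
    return the payoff of the first instance only."""
    move = program()           # both instances output the same move
    return token_trade_payoffs(move, move)

giver = lambda: "give"
keeper = lambda: "keep"

print(play_against_itself(giver))   # 200 -- worth paying $150 to play
print(play_against_itself(keeper))  # 100 -- not worth $150
```

Since both instances necessarily output the same move, submitting "give" nets $200, which is why paying $150 for the opportunity is profitable.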

regardless of what my clone does I'm still always better off defecting.

This is broken counterfactual reasoning. It assumes that your action is independent of your clone's just because your action does not influence your clone's. According to the definition of the game, the clone will happen to defect if you defect and will happen to cooperate if you cooperate. If you respect this logical dependence when constructing your counterfactuals, you'll realize that reasoning like "regardless of what my clone does I..." neglects the fact that you can't defect while your clone cooperates.

Comment author: lackofcheese 14 September 2014 01:25:52AM *  2 points [-]

The "give me a program" game carries the right intuition, but keep in mind that a CDT agent will happily play and win it, because it will naturally view the decision of "which program do I write" as the decision node and consequently get the right answer.

The game is not quite the same as a one-shot PD, at the least because CDT gives the right answer.

Comment author: So8res 14 September 2014 01:36:09AM 6 points [-]

Right! :-D

This is precisely one of the points that I'm going to make in an upcoming post. CDT acts differently when it's in the scenario compared to when you ask it to choose the program that will face the scenario. This is what we mean by "reflective instability", and this is what we're alluding to when we say things like "a sufficiently intelligent self-modifying system using CDT to make decisions would self-modify to stop using CDT to make decisions".

(This point will be expanded upon in upcoming posts.)

Comment author: dankane 15 September 2014 05:33:09PM 2 points [-]

OK. I'll bite. What's so important about reflective stability? You always alter your program when you come across new data. Sure, we usually think about this in terms of running the same program on a different data set, but there's no fundamental program/data distinction.

Acting differently when choosing a program versus being in the scenario is perhaps worrying, but I think it's intrinsic to any situation where your outcomes are allowed to depend on the behavior of counterfactual copies of you.

For example, consider the following pair of games. In REWARD, you are offered $1000. You can choose whether or not to accept. That's it. In PUNISHMENT, you are penalized $1000000 if you accepted the money in REWARD. Thus programs win PUNISHMENT if and only if they lose REWARD. If you want to write a program to play one it will necessarily differ from the program you would write to play the other. In fact the program playing PUNISHMENT will behave differently than the program you would have written to play the (admittedly counterfactual) subgame of REWARD. How is this any worse than what CDT does with PD?
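A tiny sketch of this game pair, using the dollar amounts from the comment above: any fixed program wins at most one of the two games, so no single program dominates across both.

```python
# A program here is a single bit: does it accept the $1000 in REWARD?

def reward(accepts):
    return 1000 if accepts else 0

def punishment(accepts):
    # You are penalized $1000000 iff you accepted in REWARD.
    return -1000000 if accepts else 0

for accepts in (True, False):
    print(accepts, reward(accepts), punishment(accepts))
# The program that wins REWARD (accepts=True) necessarily loses
# PUNISHMENT, and vice versa: no single program wins both games.
```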

Comment author: So8res 15 September 2014 05:41:40PM 3 points [-]

Nothing in particular. There is no strong notion of dominance among decision theories, as you've noticed. The problem with CDT isn't that it's unstable under reflection; it's that it's unstable under reflection in such a way that it converges on a bad solution that is much more stable. That point will take a few more posts to get to, but I do hope to get there.

Comment author: dankane 15 September 2014 06:42:20PM 1 point [-]

I guess I'll see your later posts then, but I'm not quite sure how this could be the case. If self-modifying-CDT is considering making a self modification that will lead to a bad solution, it seems like it should realize this and instead not make that modification.

Comment author: So8res 15 September 2014 08:53:52PM *  2 points [-]

Indeed. I'm not sure I can present the argument briefly, but a simple analogy might help: a CDT agent would pay to precommit to onebox before playing Newcomb's game, but upon finding itself in Newcomb's game without precommitting, it would twobox. It might curse its fate and feel remorse that the time for precommitment had passed, but it would still twobox.

For analogous reasons, a CDT agent would self-modify to do well on all Newcomblike problems that it would face in the future (e.g., it would precommit generally), but it would not self-modify to do well in Newcomblike games that were begun in its past (it wouldn't self-modify to retrocommit for the same reason that CDT can't retrocommit in Newcomb's problem: it might curse its fate, but it would still perform poorly).

Anyone who can credibly claim to have knowledge of the agent's original decision algorithm (e.g. a copy of the original source) can put the agent into such a situation, and in certain exotic cases this can be used to "blackmail" the agent in such a way that, even if it expects the scenario to happen, it still fails (for the same reason that CDT twoboxes even though it would precommit to oneboxing).

[Short story idea: humans scramble to get a copy of a rogue AI's original source so that they can instantiate a Newcomblike scenario that began in the past, with the goal of regaining control before the AI completes an intelligence explosion.]

(I know this is not a strong argument yet; the full version will require a few more posts as background. Also, this is not an argument from "omg blackmail" but rather an argument from "if you start from a bad starting place then you might not end up somewhere satisfactory, and CDT doesn't seem to end up somewhere satisfactory".)

Comment author: dankane 16 September 2014 12:50:18AM 2 points [-]

For analogous reasons, a CDT agent would self-modify to do well on all Newcomblike problems that it would face in the future (e.g., it would precommit generally)

I am not convinced that this is the case. A self-modifying CDT agent is not caused to self-modify in favor of precommitment by facing a scenario in which precommitment would have been useful, but instead by evidence that such scenarios will occur in the future (and in fact will occur with greater frequency than scenarios that punish you for such precommitments).

Anyone who can credibly claim to have knowledge of the agent's original decision algorithm (e.g. a copy of the original source) can put the agent into such a situation, and in certain exotic cases this can be used to "blackmail" the agent in such a way that, even if it expects the scenario to happen, it still fails (for the same reason that CDT twoboxes even though it would precommit to oneboxing).

Actually, this seems like a bigger problem with UDT to me than with SMCDT (self-modifying CDT). Either type of program can be punished for being instantiated with the wrong code, but only UDT can be blackmailed into behaving differently by putting it in a Newcomb-like situation.

The story idea you had wouldn't work. Against a SMCDT agent, all that getting the AIs original code would allow people to do is to laugh at it for having been instantiated with code that is punished by the scenario they are putting it in. You manipulate a SMCDT agent by threatening to get ahold of its future code and punishing it for not having self-modified. On the other hand, against a UDT agent you could do stuff. You just have to tell it "we're going to simulate you and if the simulation behaves poorly, we will punish the real you". This causes the actual instantiation to change its behavior if it's a UDT agent but not if it's a CDT agent.

On the other hand, all reasonable self-modifying agents are subject to blackmail. You just have to tell them "every day that you are not running code with property X, I will charge you $1000000".

Comment author: one_forward 16 September 2014 01:44:13AM 2 points [-]

Can you give an example where an agent with a complete and correct understanding of its situation would do better with CDT than with UDT?

An agent does worse by giving in to blackmail only if that makes it more likely to be blackmailed. If a UDT agent knows opponents only blackmail agents that pay up, it won't give in.

If you tell a CDT agent "we're going to simulate you and if the simulation behaves poorly, we will punish the real you," it will ignore that and be punished. If the punishment is sufficiently harsh, the UDT agent that changed its behavior does better than the CDT agent. If the punishment is insufficiently harsh, the UDT agent won't change its behavior.

The only examples I've thought of where CDT does better involve the agent having incorrect beliefs. Things like an agent thinking it faces Newcomb's problem when in fact Omega always puts money in both boxes.

Comment author: nshepperd 16 September 2014 01:26:18AM 2 points [-]

This causes the actual instantiation to change its behavior if it's a UDT agent but not if it's a CDT agent.

Only if the adversary makes its decision to attempt extortion regardless of the probability of success. In the usual case, the winning move is to ignore extortion, thereby retroactively making extortion pointless and preventing it from happening in the first place. (Which is of course a strategy unavailable to CDT, who always gives in to one-shot extortion.)

Comment author: hairyfigment 16 September 2014 01:12:42AM 1 point [-]

You just have to tell them "every day that you are not running code with property X, I will charge you $1000000".

I think this is actually the point (though I do not consider myself an expert here). Eliezer thinks his TDT will refuse to give in to blackmail, because outputting another answer would encourage other rational agents to blackmail it. By contrast, CDT can see that such refusal would be useful in the future, so it will adopt (if it can) a new decision theory that refuses blackmail and therefore prevents future blackmail (causally). But if you've already committed to charging it money, its self-changes will have no causal effect on you, so we might expect Modified CDT to have an exception for events we set in motion before the change.

Comment author: lackofcheese 14 September 2014 01:59:22AM *  2 points [-]

That's exactly what I was thinking as I wrote my post, but I decided not to bring it up. Your later posts should be interesting, although I don't think I need any convincing here---I already count myself as part of the "we" as far as the concepts you bring up are concerned.

Comment author: shminux 14 September 2014 12:01:56AM *  2 points [-]

regardless of what my clone does I'm still always better off defecting.

This is broken counterfactual reasoning.

Assume it's not a perfect clone: it can defect with probability p even if you cooperate. Then apply CDT. You get "defect" for any p>0. So it is reasonable to implicitly assume continuity and declare that CDT forces you to defect when p=0. However, if you apply CDT for the case p=0 directly, you get "cooperate" instead.

In other words, the counterfactual reasoning breaks down when the map CDT(p, PD) is not continuous at the point p=0.

Comment author: lackofcheese 14 September 2014 01:40:18AM 3 points [-]

It's not entirely clear what you're saying, but I'll try to take the simplest interpretation. I'm guessing that:
- If you're going to defect, your clone always defects.
- If you're going to cooperate, your clone cooperates with probability 1-p and defects with probability p

In that case, I don't see how it is that you get "defect" for p>0; the above formulation gives "cooperate" for 0<=p<0.5.
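For concreteness, here's that calculation in code, under the interpretation above and assuming the token-trade payoffs from the original post (keeping is worth $100 to you, a received token $200; the dollar amounts are an assumption):

```python
# If I defect, the clone defects for sure; if I cooperate, the
# clone cooperates with probability 1-p and defects with probability p.

def eu_cooperate(p):
    # clone cooperates w.p. 1-p (I receive their $200), defects w.p. p ($0)
    return (1 - p) * 200 + p * 0

def eu_defect():
    # clone defects for sure: I keep my own $100
    return 100

for p in (0.0, 0.25, 0.49, 0.5, 0.75):
    better = "cooperate" if eu_cooperate(p) > eu_defect() else "defect"
    print(p, eu_cooperate(p), better)
# Cooperating wins exactly when 200*(1-p) > 100, i.e. when 0 <= p < 0.5.
```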

Comment author: So8res 14 September 2014 12:18:01AM *  3 points [-]

I disagree. If the agent has a 95% probability of doing the same thing as me and a 5% chance of defecting, I still cooperate. (With 95% probability, most likely, because you gotta punish defectors.)

Indeed, consider the following game: You give me a program that must either output "give" or "keep". I roll a 20 sided die. On a 20, I play your program against a program that always keeps its token. Otherwise, I play your program against itself. I give you the money that (the first instance of) your program wins. Are you willing to pay me $110 to play? I'd be happy to pay you $110 for this opportunity.

I don't cooperate with myself because P(TheirChoice=Defect)=0, I cooperate with myself because I don't reason as if p is independent from my action.
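A quick sanity check of the die game's expected values, again assuming the token-trade payoffs from the original post ($100 for keeping your token, $200 for a received token):

```python
def payoff(mine, theirs):
    # assumed token-trade amounts: keep = $100 to self, received gift = $200
    return (100 if mine == "keep" else 0) + (200 if theirs == "give" else 0)

def die_game_ev(submitted):
    # 19/20: the program plays a copy of itself; 1/20: it plays always-keep
    return (19 / 20) * payoff(submitted, submitted) + (1 / 20) * payoff(submitted, "keep")

print(die_game_ev("give"))  # ~190 -- worth paying $110
print(die_game_ev("keep"))  # ~100 -- not worth $110
```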

Comment author: [deleted] 14 September 2014 03:31:12PM *  1 point [-]

Suppose you have to submit the source code of a program X, and I will play Y = “run X, then do what X did with probability 0.99 and the reverse with probability 0.01” against Y' which is the same as Y but with a different seed for the RNG, and pay you according to how Y does.

Then “you” (i.e. Y) are not a perfect clone of your opponent (i.e. Y').

What do you do?
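One way to evaluate this game numerically, under the same assumed token-trade payoffs as above ($100 for keeping, $200 for a received token) and independent 1% flips for Y and Y':

```python
def payoff(mine, theirs):
    # assumed token-trade amounts
    return (100 if mine == "keep" else 0) + (200 if theirs == "give" else 0)

def noisy_ev(x, flip=0.01):
    # Y plays X's move w.p. 1-flip and the reverse w.p. flip,
    # independently for Y and Y'.
    other = "keep" if x == "give" else "give"
    ev = 0.0
    for my_move, p1 in ((x, 1 - flip), (other, flip)):
        for their_move, p2 in ((x, 1 - flip), (other, flip)):
            ev += p1 * p2 * payoff(my_move, their_move)
    return ev

print(noisy_ev("give"))  # ~199
print(noisy_ev("keep"))  # ~101
```

Under these assumed payoffs, submitting "give" still comes out well ahead despite the imperfect correlation.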

Comment author: James_Miller 13 September 2014 07:52:47PM *  0 points [-]

According to the definition of the game, the clone will happen to defect if you defect and will happen to cooperate if you cooperate.

You have to consider off-the-equilibrium-path behavior. If I'm the type of person who will always cooperate, what would happen if I went off-the-equilibrium-path and did defect even if my defecting is a zero probability event?

Comment author: shminux 13 September 2014 09:55:43PM 3 points [-]

If I'm the type of person who will always cooperate, what would happen if I went off-the-equilibrium-path and did defect even if my defecting is a zero probability event?

I'm trying to understand the difference between your statement and "1 does not equal 2, but what if it did?" and failing.

Comment author: solipsist 13 September 2014 11:45:03PM *  1 point [-]

See trembling hand equilibrium.

A trembling hand perfect equilibrium is an equilibrium that takes the possibility of off-the-equilibrium play into account by assuming that the players, through a "slip of the hand" or tremble, may choose unintended strategies, albeit with negligible probability.

First we define a perturbed game. A perturbed game is a copy of a base game, with the restriction that only totally mixed strategies are allowed to be played. A totally mixed strategy is a mixed strategy where every pure strategy is played with non-zero probability. This is the "trembling hands" of the players; they sometimes play a different strategy than the one they intended to play. Then we define a strategy set S (in a base game) as being trembling hand perfect if there is a sequence of perturbed games that converge to the base game in which there is a series of Nash equilibria that converge to S.
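To connect the quoted definition to the game at hand, here is a sketch (with assumed token-trade payoffs: T=300, R=200, P=100, S=0) of why trembles don't change the one-shot PD analysis: defection is a best response for every tremble size, so mutual defection is trembling-hand perfect.

```python
def eu(my_move, p_opp_coop):
    # assumed PD payoffs: T=300, R=200, P=100, S=0
    if my_move == "C":
        return p_opp_coop * 200 + (1 - p_opp_coop) * 0
    return p_opp_coop * 300 + (1 - p_opp_coop) * 100

# Against a defector who trembles toward cooperation with probability eps,
# defection stays strictly better for every eps > 0.
for eps in (0.1, 0.01, 0.001):
    print(eps, eu("C", eps), eu("D", eps))
# eu("D", eps) = 100 + 200*eps always exceeds eu("C", eps) = 200*eps,
# so (D, D) survives the perturbed-game limit.
```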

Comment author: shminux 14 September 2014 12:05:41AM 1 point [-]

Right, as I mentioned in my other reply, CDT is discontinuous at p=0. Presumably a better decision theory would not have such a discontinuity.

Comment author: Jiro 13 September 2014 11:13:00PM 1 point [-]

One possible interpretation of "if I always cooperate, what would happen if I don't" is "what is the limit, as X approaches 1, of 'if I cooperate with probability X, what would happen if I don't'?"

This doesn't reasonably map onto the 1=2 example.

Comment author: shminux 13 September 2014 11:49:26PM 1 point [-]

Right. There seems to be a discontinuity, as the limit of CDT (p->0) is not CDT (p=0). I wonder if this is the root of the issue.

Comment author: James_Miller 13 September 2014 10:31:18PM *  1 point [-]

"1 does not equal 2, but what if it did?" = what if I could travel faster than the speed of light.

Off the equilibrium path = what if I were to burn a dollar.

Or things I can't do vs things I don't want to do.

Comment author: shminux 13 September 2014 11:34:27PM 1 point [-]

Or things I can't do vs things I don't want to do.

In my mind "I'm the type of person who will always cooperate" means that there is no difference between the two in this case. Maybe you use a different definition of "always"?

Comment author: James_Miller 14 September 2014 12:58:28AM *  1 point [-]

I always cooperate because doing so maximizes my utility since it is better than all the alternatives. I always go slower than the speed of light because I have no alternatives.

Comment author: Adele_L 13 September 2014 08:38:06PM 1 point [-]

You can consider it, but conditioned on the information that you are playing against your clone, you should assign this a very low probability of happening, and weight it in your decision accordingly.

Comment author: James_Miller 13 September 2014 08:41:24PM -1 points [-]

Assume I am the type of person who would always cooperate with my clone. If I asked myself the following question "If I defected would my payoff be higher or lower than if I cooperated even though I know I will always cooperate" what would be the answer?

Comment author: lackofcheese 14 September 2014 03:39:49AM *  2 points [-]

Yes, it makes a little bit of sense to counterfactually reason that you would get $1000 more if you defected, but that is predicated on the assumption that you always cooperate. You cannot actually get that free $1000 because the underlying assumption of the counterfactual would be violated if you actually defected.

Comment author: VAuroch 14 September 2014 11:49:25AM 1 point [-]

The answer would be 'MOO'. Or 'Mu', or 'moot'; they're equivalent. "In this impossible counterfactual where I am self-contradictory, what would happen?"

Comment author: VAuroch 13 September 2014 08:05:17PM *  0 points [-]

No, you don't. This is a game where there are only two possible outcomes; DD and CC. CD and DC are defined to be impossible because the agents playing the game are physically incapable of making those outcomes occur.

EDIT: Maybe physically incapable is a bit strong. If they wanted to maximize the chance that they had unmatched outcomes, they could each flip a coin and take C if heads and D if tails, and would have a 50% chance of unmatching. But they still would both be playing the same precise strategy.

Comment author: James_Miller 13 September 2014 08:14:12PM *  0 points [-]

I don't agree. Even if I'm certain I will not defect, I am capable of asking what would happen if I did, just as the real me knows he won't do "horrible thing" yet can mentally model what would happen if he did it. Or imagine an AI that's programmed to always maximize its utility. This AI could still calculate what would happen if it followed a non-utility-maximizing strategy. Often in game theory a solution requires you to calculate your payoff if you left the equilibrium path.

Comment author: pragmatist 13 September 2014 08:38:48PM *  1 point [-]

What would you say about the following decision problem (formulated by Andy Egan, I believe)?

You have a strong desire that all psychopaths in the world die. However, your desire to stay alive is stronger, so if you yourself are a psychopath you don't want all psychopaths to die. You are pretty sure, but not certain, that you're not a psychopath. You're presented with a button, which, if pressed, would kill all psychopaths instantly. You are absolutely certain that only a psychopath would press this button. Should you press the button or not?

It seems to me the answer is "Obviously not", precisely because the "off-path" possibility that you're a non-psychopath who pushes the button should not enter into your consideration. But the causal decision algorithm would recommend pushing the button if your prior that you are a psychopath is small enough. Would you agree with that?
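For concreteness, the divergence described here can be sketched with made-up utilities (+1 for all psychopaths dying, -10 for your own death; the numbers are illustrative only):

```python
# q is your prior probability that you are a psychopath.

def eu_push_cdt(q):
    # CDT uses the prior q: pushing doesn't *cause* you to be a psychopath.
    # If you are one, the psychopaths die (+1) but so do you (-10).
    return q * (1 - 10) + (1 - q) * 1

def eu_push_edt():
    # Conditioning on the act: only a psychopath pushes, so P(psycho|push)=1.
    return 1 - 10

for q in (0.01, 0.05, 0.2):
    print(q, eu_push_cdt(q), eu_push_edt())
# CDT recommends pushing whenever 1 - 10q > 0, i.e. when q < 0.1,
# while conditioning on the act makes pushing always look bad (-9 < 0).
```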

Comment author: Jiro 13 September 2014 11:14:57PM 3 points [-]

If only a psychopath would push the button, then your possible non-psychopathic nature limits what decision algorithms you are capable of following.

Comment author: helltank 13 September 2014 11:22:10PM 1 point [-]

Wouldn't the fact that you're even considering pushing the button (because if only a psychopath would push the button, then it follows that a non-psychopath would never push it) indicate that you are a psychopath, and that therefore you should not push the button?

Another way to put it is:

If you are a psychopath and you push the button, you die. If you are not a psychopath and you push the button, pushing the button would make you a psychopath (since only a psychopath would push), and therefore you die.

Comment author: pragmatist 14 September 2014 05:57:57AM 2 points [-]

Pushing the button can't make you a psychopath. You're either already a psychopath or you're not. If you're not, you will not push the button, although you might consider pushing it.

Comment author: helltank 14 September 2014 12:51:06PM 1 point [-]

Maybe I was unclear.

I'm arguing that the button will never, ever be pushed. If you are NOT a psychopath, you won't push, end of story.

If you ARE a psychopath, you can choose to push or not push.

If you push, that's evidence you are a psychopath. If you are a psychopath, you should not push. Therefore, you will always end up regretting the decision to push.

If you don't push, you don't push and nothing happens.

In all three cases the correct decision is not to push, therefore you should not push.

Comment author: lackofcheese 14 September 2014 01:47:00AM 1 point [-]

Shouldn't you also update your belief towards being a psychopath on the basis that you have a strong desire that all psychopaths in the world die?

Comment author: pragmatist 14 September 2014 05:56:17AM 1 point [-]

You can stipulate this out of the example. Let's say pretty much everyone has the desire that all psychopaths die, but only psychopaths would actually follow through with it.

Comment author: James_Miller 13 September 2014 08:43:30PM 1 point [-]

I don't press. CDT fails here because (I think) it doesn't allow you to update your beliefs based on your own actions.

Comment author: crazy88 14 September 2014 10:05:15PM 2 points [-]

Exactly what information CDT allows you to update your beliefs on is a matter for some debate. You might be interested in a paper by James Joyce (http://www-personal.umich.edu/~jjoyce/papers/rscdt.pdf) on the issue (which was written in response to Egan's paper).

Comment author: pragmatist 13 September 2014 08:46:49PM *  1 point [-]

But then shouldn't you also update your beliefs about what your clone will do based on your own actions in the clone PD case? Your action is very strong (perfect, by stipulation) evidence for his action.

Comment author: James_Miller 13 September 2014 08:58:03PM 1 point [-]

Yes I should. In the psychopath case whether I press the button depends on my beliefs, in contrast in a PD I should defect regardless of my beliefs.

Comment author: pragmatist 13 September 2014 09:13:31PM *  1 point [-]

Maybe I misunderstand what you mean by "updating beliefs based on action". Here's how I interpret it in the psychopath button case: When calculating the expected utility of pushing the button, don't use the prior probability that you're a psychopath in the calculation, use the probability that you're a psychopath conditional on deciding to push the button (which is 1). If you use that conditional probability, then the expected utility of pushing the button is guaranteed to be negative, no matter what the prior probability that you're a psychopath is. Similarly, when calculating the expected utility of not pushing the button, use the probability that you're a psychopath conditional on deciding not to push the button.

But then, applying the same logic to the PD case, you should calculate expected utilities for your actions using probabilities for your clone's action that are conditional on the very action that you are considering. So when you're calculating the expected utility for cooperating, use probabilities for your clone's action conditional on you cooperating (i.e., 1 for the clone cooperating, 0 for the clone defecting). When calculating the expected utility for defecting, use probabilities for your clone's action conditional on you defecting (0 for cooperating, 1 for defecting). If you do things this way, then cooperating ends up having a higher expected utility.

Perhaps another way of putting it is that once you know the clone's actions are perfectly correlated with your own, you have no good reason to treat the clone as an independent agent in your analysis. The standard tools of game theory, designed to deal with cases involving multiple independent agents, are no longer relevant. Instead, treat the clone as if he were part of the world-state in a standard single-agent decision problem, except this is a part of the world-state about which your actions give you information (kind of like whether or not you're a psychopath in the button case).
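A minimal sketch of this proposal in code, assuming token-trade payoffs ($100 for keeping, $200 for a received token):

```python
def payoff(mine, theirs):
    # assumed token-trade amounts
    return (100 if mine == "keep" else 0) + (200 if theirs == "give" else 0)

def eu(my_move):
    # The clone's move conditional on mine is identical with probability 1,
    # so the expected utility of each act uses the matching clone move.
    return payoff(my_move, my_move)

print(eu("give"))  # 200
print(eu("keep"))  # 100 -- cooperating has the higher expected utility
```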

Comment author: VAuroch 14 September 2014 11:40:22AM 0 points [-]

Even if I'm certain I will not defect, I am capable of asking what would happen if I did,

Yes, and part of the answer is "If I did defect, my clone would also defect." You have a guarantee that both of you take the same actions because you think according to precisely identical reasoning.

Comment author: James_Miller 14 September 2014 02:27:29PM 1 point [-]

What do you think will happen if clones play the centipede game?

Comment author: VAuroch 15 September 2014 09:25:05AM 1 point [-]

Unclear, depends on the specific properties of the person being cloned. Unlike PD, the two players aren't in the same situation, so they can't necessarily rely on their logic being the same as their counterpart. How closely this would reflect the TDT ideal of 'Always Push' will depend on how luminous the person is; if they can model what they would do in the opposite situation, and are highly confident that their self-model is correct, they can reach the best result, but if they lack confidence that they know what they'd do, then the winning cooperation is harder to achieve.

Of course, if it's denominated in money and is 100 steps of doubling, as implied by the Wikipedia page, then the difference in utility between $1 nonillion and $316 octillion is so negligible that there's essentially no incentive to defect in the last round and any halfway-reasonable person will Always Push straight through the game. But that's a degenerate case and probably not the version originally discussed.
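For reference, here is classical backward induction on a toy centipede game (the payoff structure is made up for illustration: the pot starts at $10, grows 1.5x per push, and whoever stops takes 80% of the current pot). A chain of backward inductors stops at the very first node, which is the outcome clone-cooperation is supposed to improve on.

```python
def solve(rounds, base=10.0, growth=1.5):
    """Backward induction: returns (first stop round, [payoff0, payoff1])."""
    # If nobody ever stops, assume the final pot is split evenly.
    pot = base * growth ** rounds
    cont = [pot / 2, pot / 2]
    first_stop = None
    for i in reversed(range(rounds)):
        pot = base * growth ** i
        mover = i % 2               # players alternate moves
        stop = [0.2 * pot, 0.2 * pot]
        stop[mover] = 0.8 * pot     # the mover who stops takes 80%
        if stop[mover] >= cont[mover]:
            cont = stop
            first_stop = i
    return first_stop, cont

print(solve(6))  # (0, [8.0, 2.0]) -- play stops at the very first node
```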