Vaniver comments on Timelessness as a Conservative Extension of Causal Decision Theory - Less Wrong

15 [deleted] 28 May 2014 02:57PM




Comment author: Vaniver 28 May 2014 07:59:53PM *  3 points [-]

I think Spohn also qualifies as an extension of CDT.

Disagreed. By CDT I mean calculating utilities using:

U(A) = Σ_j P(O_j | do(A)) · D(O_j)

(The only modification from the wikipedia article is that I'm using Pearl's clearer do() notation, P(O_j | do(A)), in place of its P(A > O_j).)

The naive CDT setup for Newcomb's problem has a causal graph which looks like B->M<-P, where B is your boxing decision, P is Omega's prediction, and M is the monetary reward you receive. This causal graph disagrees with the problem statement, as it necessarily implies that B and P are unconditionally independent, which we know is not the case from the assumption that Omega is a perfect predictor. The causal graph that agrees with the problem statement is B->P->M and B->M, in which case one-boxing is trivially the right action.
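The contrast between the two graphs can be made concrete with a toy calculation (a sketch of my own, using the standard $1M/$1k Newcomb payoffs; the function names are mine):

```python
# Expected monetary reward M under the two causal graphs discussed above.
# Opaque box holds $1M iff Omega predicted one-boxing; the transparent
# box always holds $1k.

def reward(boxes, prediction):
    """Monetary outcome M given the boxing decision and Omega's prediction."""
    payoff = 1_000_000 if prediction == "one" else 0
    if boxes == "two":
        payoff += 1_000
    return payoff

# Graph B -> M <- P: P is independent of B, so do(B) leaves P's
# distribution untouched (p_one is whatever prior you assign).
def eu_independent(boxes, p_one=0.5):
    return p_one * reward(boxes, "one") + (1 - p_one) * reward(boxes, "two")

# Graph B -> P -> M (perfect predictor): intervening on B sets P = B.
def eu_perfect(boxes):
    return reward(boxes, boxes)

assert eu_independent("two") > eu_independent("one")  # two-boxing dominates
assert eu_perfect("one") > eu_perfect("two")          # one-boxing wins trivially
```

On the first graph two-boxing dominates for every value of p_one; on the second, one-boxing wins, which is the "trivially the right action" point above.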

The bulk of Spohn's paper is all about how to get over the fear of backwards causation in hypothetical scenarios which explicitly allow backwards causation. You can call that an extension if you want, but it seems to me that's all in the counterfactual reasoning module, not in the decision-making module. (That is, CDT does not describe how you come up with P(Oj|do(A)), only what you do with it once you have it.)

Comment author: nshepperd 29 May 2014 12:56:36AM *  2 points [-]

Uh, doesn't the naive CDT setup for Newcomb's problem normally include a "my innards" node that has arrows going to both B and P? It's that that introduces the unconditional dependence between B and P. Obviously "B -> M <- P" by itself can't even express the problem because it can't represent Omega making any prediction at all.

Comment author: Vaniver 29 May 2014 04:26:07AM 1 point [-]

Uh, doesn't the naive CDT setup for Newcomb's problem normally include a "my innards" node that has arrows going to both B and P?

If you decide what your innards are, and not what your action is, then this matches the problem description. If you can somehow have dishonest innards (Omega thinks I'm a one-boxer, then I can two-box), then this again violates the perfect prediction assumption.

I believe that, as an empirical matter, the first explicitly CDT accounts of Newcomb's problem did not use graphs, but if you convert their arguments into graphs, they implicitly assume "B -> M <- P."

Comment author: nshepperd 29 May 2014 05:36:56AM *  1 point [-]

If you can somehow have dishonest innards (Omega thinks I'm a one-boxer, then I can two-box), then this again violates the perfect prediction assumption.

Isn't the whole point of CDT that you cut any arrows from ancestor nodes with do(A) where A is your "intervention"? Obviously you can't have your innards imply your action if you explicitly violate that connection by describing your decision as an intervention.

Here is how I understood typical CDT accounts of Newcomb's problem: You have a graph given by B <- Innards -> P and B -> M <- P. Innards starts with some arbitrary prior probability since you don't know your decision beforehand. You perturb the graph by deleting Innards -> B in order to calculate p(M | do(B)), and in doing so you end up with a graph "looking like" B -> M <- P. Then the usual "dominance" arguments determine the decision regardless of the prior probability on Innards.

Of course, after doing this analysis and coming up with a decision you now know (unconditionally) the value of B and therefore Innards, so arguably the probabilities for those should be set to 1 or 0 as appropriate in the original graph. This is generally interpreted by CDTists as a proof that this agent always two-boxes, and always gets the smaller reward.
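The severed-graph computation described above can be sketched as follows (my own toy encoding of the Innards graph, not anyone's formal model):

```python
# After deleting the Innards -> B arrow, Omega's prediction P depends
# only on the prior over Innards, so under do(B) the extra $1k from
# two-boxing dominates regardless of that prior.

def cdt_eu(boxes, p_innards_one):
    """Expected reward under do(B): the prediction keeps the Innards prior."""
    opaque_box = p_innards_one * 1_000_000   # expected contents of the opaque box
    return opaque_box + (1_000 if boxes == "two" else 0)

# Dominance: for every prior on Innards, two-boxing is exactly $1k better.
for prior in (0.0, 0.3, 0.99, 1.0):
    assert cdt_eu("two", prior) == cdt_eu("one", prior) + 1_000
```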

Comment author: Vaniver 29 May 2014 06:20:58PM 2 points [-]

Isn't the whole point of CDT that you cut any arrows from ancestor nodes with do(A) where A is your "intervention"?

Yes. My point is that when you have a supernatural Omega, then putting any of Omega's actions in ancestor nodes of your decisions, instead of descendant nodes of your decisions, is a mistake that violates the problem description.

Comment author: V_V 30 May 2014 02:32:25PM *  3 points [-]

But if you don't delete the incoming arcs on your decision nodes, then it isn't CDT anymore; it's just EDT.

Which raises the question of why we should bother with CDT in the first place.
Some people claim that EDT fails at "smoking lesion"-type problems, but I think that is due to incorrect modelling or underspecification of the problem. If you use the correct model, EDT produces the "right" answer.
It seems to me that EDT is superior to CDT.

(Ilya Shpitser will disagree, but I never understood his arguments)

Comment author: IlyaShpitser 01 June 2014 03:13:41PM *  3 points [-]

People have known how to deal with smoking lesion (under a different name) since the 18th century (hint: the solution is not the EDT solution):

http://www.e-publications.org/ims/submission/STS/user/submissionFile/12809?confirm=bbb928f0

The trick is to construct a system that deals with things 20 times more complicated than smoking lesion. That system is recent, and you will have to read e.g. my thesis, or Jin Tian's thesis, or elsewhere to see what it is.

I have yet to see anyone advocating EDT actually handle a complicated example correctly. Or even a simple tricky example, e.g. the front door case.

Comment author: Vaniver 30 May 2014 06:16:26PM *  2 points [-]

But if you don't delete the incoming arches on your decision nodes then it isn't CDT anymore, it's just EDT.

You still delete incoming arcs when you make a decision. The argument is that if Omega perfectly predicts your decision, then causally his prediction must be a descendant of your decision, rather than an ancestor, because if it were an ancestor you would sever the connection that is still solid (and thus violate the problem description).

(Ilya Shpitser will disagree, but I never understood his arguments)

This is a shame, because he's right. Here's my brief attempt at an explanation of the difference between the two:

EDT uses the joint probability distribution. If you want to express a joint probability distribution as a graphical Bayesian network, then the direction of the arrows doesn't matter (modulo some consistency concerns). If you utilize your human intelligence, you might be able to figure out "okay, for this particular action, we condition on X but not on Y," but you do this for intuitive reasons that may be hard to formalize and which you might get wrong. When you use the joint probability distribution, you inherently assume that all correlation is causation, unless you've specifically added a node or data to block causation for any particular correlation.

CDT uses the causal network, where the direction of the arrows is informative. You can tell the difference between altering and observing something, in that observations condition things both up and down the causal graph, whereas alterations only condition things down the causal graph. You only need to use your human intelligence to build the right graph, and then the math can take over from there. For example, consider price controls: there's a difference between observing that the price of an ounce of gold is $100 and altering the price of an ounce of gold to be $100. And causal networks allow you to answer questions like "given that the price of gold is observed to be $100, what will happen when we force the price of gold to be $120?"
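The observing-vs-altering distinction can be shown on a minimal two-arrow graph (a sketch of mine with invented numbers, loosely echoing the gold-price example via a hidden supply state):

```python
# Graph: Supply -> Price. Observing a high price updates beliefs about
# Supply *upstream*; intervening on Price (a price control) cuts the
# Supply -> Price arrow, so Supply keeps its prior.

P_supply = {"scarce": 0.2, "abundant": 0.8}                  # prior on the cause
P_price_given_supply = {"scarce":   {"high": 0.9, "low": 0.1},
                        "abundant": {"high": 0.1, "low": 0.9}}

def p_scarce_given_observed(price):
    """Observation: Bayes update flows up the graph from Price to Supply."""
    joint = {s: P_supply[s] * P_price_given_supply[s][price] for s in P_supply}
    return joint["scarce"] / sum(joint.values())

def p_scarce_given_do(price):
    """Intervention: the incoming arrow to Price is severed, so the
    upstream node is unaffected (the argument is deliberately unused)."""
    return P_supply["scarce"]

assert p_scarce_given_observed("high") > P_supply["scarce"]  # observation updates
assert p_scarce_given_do("high") == P_supply["scarce"]       # intervention doesn't
```

Conditioning moves both up and down the arrows; do() only moves down, which is exactly the difference the paragraph above describes.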

Now, if you look at the math, you can see a way to embed a causal network in a network without causation. So we could use more complicated networks and let conditioning on nodes do the graph severing for us. I think this is a terrible idea, both philosophically and computationally, because it entails more work and less clarity, both of which are changes in the wrong direction.

Comment author: V_V 30 May 2014 07:33:59PM 4 points [-]

You still delete incoming arcs when you make a decision. The argument is that if Omega perfectly predicts your decision, then causally his prediction must be a descendant of your decision, rather than an ancestor, because if it were an ancestor you would sever the connection that is still solid (and thus violate the problem description).

If I understand correctly, in causal networks the orientation of the arcs must respect "physical causality", which I roughly understand to mean consistency with the thermodynamic arrow of time.
There is no way for your action to cause Omega's prediction in this sense, unless time travel is involved.

EDT uses the joint probability distribution. If you want to express a joint probability distribution as a graphical Bayesian network, then the direction of the arrows doesn't matter (modulo some consistency concerns).

Yes, different Bayesian networks can represent the same probability distribution. And why would that be a problem? The probability distribution and your utility function are all that matters.

When you use the joint probability distribution, you inherently assume that all correlation is causation, unless you've specifically added a node or data to block causation for any particular correlation.

"Correlation vs causation" is an epistemic error. If you are making it then you are using the wrong probability distribution, not a "wrong" factorization of the correct probability distribution.

Comment author: Vaniver 30 May 2014 07:56:35PM *  2 points [-]

If I understand correctly, in causal networks the orientation of the arches must respect "physical causality", which I roughly understand to mean consistency with the thermodynamical arrow of time.

In the real world, this is correct, but it is not mathematically necessary. (To go up a meta level, this is about how you build causal networks in the first place, not about how you reason once you have a causal network; even if philosophers were right about CDT as the method to go from causal networks to decisions, they seem to have been confused about the method by which one goes from English problem statements to causal networks when it comes to Newcomb's problem.)

unless time travel is involved.

It is. How else can Omega be a perfect predictor? (I may be stretching the language, but I count Laplace's Demon as a time traveler, since it can 'see' the world at any time, even though it can only affect the world at the time that it's at.)

Yes, different Bayesian networks can represent the same probability distribution. And why would that be a problem?

The problem is that you can't put any meaning into the direction of the arrows because they're arbitrary.

"Correlation vs causation" is an epistemic error. If you are making it then you are using the wrong probability distribution, not a "wrong" factorization of the correct probability distribution.

If you give me a causal diagram and the embedded probabilities for the environment, and ask me to predict what would happen if you did action A (i.e. counterfactual reasoning), you've already given me all I need to calculate the probabilities of any of the other nodes you might be interested in, for any action included in the environment description.

If you give me a joint probability distribution for the environment, and ask me to predict what would happen if you did action A, I don't have enough information to calculate the probabilities of the other nodes. You need to give me a different joint probability distribution for every possible action you could take. This requires a painful amount of communication, but possibly worse is that there's no obvious type difference between the joint probability distribution for the environment and for the environment given a particular action--and if I calculate the consequences of an action given the whole environment's data, I can get it wrong.
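A concrete version of this asymmetry (my own toy numbers): two causal graphs can encode the exact same joint distribution over (X, Y) yet give different answers to "what happens under do(X)?", so the joint alone underdetermines the prediction.

```python
# One joint distribution P(X, Y), two causal readings of it.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

p_x1 = sum(v for (x, y), v in joint.items() if x == 1)   # marginal P(X=1)
p_y1 = sum(v for (x, y), v in joint.items() if y == 1)   # marginal P(Y=1)
p_y1_given_x1 = joint[(1, 1)] / p_x1                     # conditional P(Y=1|X=1)

# Graph X -> Y: intervening on X coincides with observing it.
p_do_forward = p_y1_given_x1

# Graph Y -> X: do(X) cuts the only arrow into X, leaving Y at its marginal.
p_do_backward = p_y1

assert p_do_forward != p_do_backward  # same joint, different predictions
```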

Comment author: V_V 09 June 2014 05:54:32PM 1 point [-]

In the real world, this is correct, but it is not mathematically necessary.

If you take physical causality out of the picture, then the orientation of the arcs is underspecified in the general case. But then, since you are only allowed to cut arcs that are incoming to the decision nodes, your decision model will be underspecified.

It is. How else can Omega be a perfect predictor?

If you are going to allow time travel, defined in a broad sense, then your causal network will have cycles.

The problem is that you can't put any meaning into the direction of the arrows because they're arbitrary.

But the point is that in EDT you don't care about the direction of the arrows.

If you give me a causal diagram and the embedded probabilities for the environment, and ask me to predict what would happen if you did action A (i.e. counterfactual reasoning), you've already given me all I need to calculate the probabilities of any of the other nodes you might be interested in, for any action included in the environment description.

If I give you a causal diagram for Newcomb's problem (or some variation thereof), you will make a wrong prediction, because causal diagrams can't properly represent it.

If you give me a joint probability distribution for the environment, and ask me to predict what would happen if you did action A, I don't have enough information to calculate the probabilities of the other nodes.

If the model includes me as well as the environment, you will be able to make the correct prediction.

Of course, if you give this prediction back to me, and it influences my decision, then the model has to include you as well. Which may, in principle, cause Gödelian self-reference issues. But that's a fundamental limit on the logical capabilities of any computable system; there are no easy ways around it.
But that's not as bad as it sounds: the fact that you can't precisely predict everything about yourself doesn't mean that you can't predict anything, or that you can't make approximate predictions.
(For instance, GCC can compile and optimize GCC.)

Causal decision models are one way to approximate hard decision problems, and they work well in many practical cases. Newcomb-like scenarios are specifically designed to make them fail.

Comment author: nshepperd 31 May 2014 01:30:03PM *  1 point [-]

"Correlation vs causation" is an epistemic error. If you are making it then you are using the wrong probability distribution, not a "wrong" factorization of the correct probability distribution.

The point here is that if you have the correct probability distribution, all its predictions will be correct (i.e. have minimum expected regret). It seems that the difference between epistemology and decision theory can't be emphasized enough. If it's possible for your "mixing up correlation and causation" to result in you making an incorrect prediction and being surprised (when a different prediction would have been systematically more accurate), then there must be an error in your probability distribution.

If you give me a joint probability distribution for the environment, and ask me to predict what would happen if you did action A, I don't have enough information to calculate the probabilities of the other nodes.

But an arbitrary joint probability distribution can assign P(stuff | action=A) to any values whatsoever. What stops you from just setting all conditional probabilities to the correct values (ie. those values such that they "predict what would happen if you did action A" correctly, which would be the output of P(stuff|do(A)) on the "correct" causal graph)?

And furthermore, if that joint distribution does make optimal predictions (assuming that this "counterfactual reasoning" results in optimal predictions, because I can't see any other reason you'd use a set of probabilities), then clearly it must be the probability distribution that is mandated by Cox's theorem, etc etc.

Note, there is a free variable in the above, which is the unconditional probabilities P(A). But as long as the optimal P(A) values are all nonzero (which is the case if you don't know the agent's algorithm, for example), the optimality of the joint distribution requires P(stuff|A) to be correct.

So it would seem like if you have the correct probablity distribution, you can predict what would happen if I did action A, by virtue of me giving you the answers. Unless I've made a fatal mistake in the above argument.

Comment author: EHeller 30 May 2014 11:45:19PM 2 points [-]

The argument is that if Omega perfectly predicts your decision, then causally his prediction must be a descendant of your decision

The problem is that this can lead to inconsistency when you have two omegas trying to predict each other.

Comment author: Vaniver 31 May 2014 12:54:39AM 1 point [-]

The problem is that this can lead to inconsistency when you have two omegas trying to predict each other.

This is one of the arguments against the possibility of Laplace's Demon, and I agree that a world with two Omegas is probably going to be inconsistent.

Comment author: EHeller 31 May 2014 01:32:32AM *  1 point [-]

It should be noted that this also makes the transparent-box variant of Newcomb's problem ill-posed, because the transparent boxes effectively make the box-picker an Omega as well.

Comment author: [deleted] 28 May 2014 08:56:55PM 1 point [-]

You say "disagreed" but then end up saying what I meant in the last paragraph.

Consider that I may have read Spohn before.

Comment author: Vaniver 28 May 2014 09:29:29PM 2 points [-]

You say "disagreed" but then end up saying what I meant in the last paragraph

I think that we're arguing about whether the label CDT refers to just the utility calculation or the combination of the utility calculation and the counterfactual module, not about any of the math. I can go into the reasons why I like to separate those two out, but I think I've already covered the basics.

Consider that I may have read Spohn before.

I generally aim to include the audience when I write comments, which sometimes has the side effect of being insultingly basic to the person I'm responding to. Normally I'm more careful about including disclaimers to that effect, and I apologize for missing that this time.