nshepperd comments on Timelessness as a Conservative Extension of Causal Decision Theory - LessWrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (65)
Isn't the whole point of CDT that you cut any arrows from ancestor nodes with do(A) where A is your "intervention"? Obviously you can't have your innards imply your action if you explicitly violate that connection by describing your decision as an intervention.
Here is how I understood typical CDT accounts of Newcomb's problem: You have a graph given by B <- Innards -> P and B -> M <- P. Innards starts with some arbitrary prior probability since you don't know your decision beforehand. You perturb the graph by deleting Innards -> B in order to calculate p(M | do(B)), and in doing so you end up with a graph "looking like" B -> M <- P. Then the usual "dominance" arguments determine the decision regardless of the prior probability on Innards.
Of course, after doing this analysis and coming up with a decision you now know (unconditionally) the value of B and therefore Innards, so arguably the probabilities for those should be set to 1 or 0 as appropriate in the original graph. This is generally interpreted by CDTists as a proof that this agent always two-boxes, and always gets the smaller reward.
Yes. My point is that when you have a supernatural Omega, then putting any of Omega's actions in ancestor nodes of your decisions, instead of descendant nodes of your decisions, is a mistake that violates the problem description.
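As a rough numerical sketch of the CDT account of Newcomb's problem described above (delete Innards -> B, then compare actions), the following Python computes E[M | do(B)] for several priors on Innards and contrasts it with plain conditioning. The payoff amounts, the 0.99 predictor accuracy, and every function name are illustrative assumptions, not something stated in the thread.

```python
# Payoff M as a function of (action B, prediction P). Amounts are assumed.
PAYOFF = {
    ("one", "one"): 1_000_000,
    ("one", "two"): 0,
    ("two", "one"): 1_001_000,
    ("two", "two"): 1_000,
}
ACCURACY = 0.99  # assumed P(prediction matches Innards)


def p_prediction(prior_one_boxer):
    """Marginal distribution of P once the Innards -> B arc is deleted."""
    p_one = prior_one_boxer * ACCURACY + (1 - prior_one_boxer) * (1 - ACCURACY)
    return {"one": p_one, "two": 1 - p_one}


def cdt_value(action, prior_one_boxer):
    """E[M | do(B = action)]: the prediction keeps its prior marginal."""
    p = p_prediction(prior_one_boxer)
    return sum(p[pred] * PAYOFF[(action, pred)] for pred in p)


def conditional_value(action):
    """E[M | B = action], assuming Innards fully determines B in the uncut graph."""
    other = "two" if action == "one" else "one"
    return (ACCURACY * PAYOFF[(action, action)]
            + (1 - ACCURACY) * PAYOFF[(action, other)])


if __name__ == "__main__":
    for prior in (0.1, 0.5, 0.9):
        gap = cdt_value("two", prior) - cdt_value("one", prior)
        print(f"prior={prior}: two-boxing advantage under do() = {gap:.2f}")
    print("conditioning:", conditional_value("one"), conditional_value("two"))
```

Under these assumptions the do() comparison favors two-boxing by the same margin for every prior, which is the dominance argument, while conditioning on the action favors one-boxing.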
But if you don't delete the incoming arcs on your decision nodes then it isn't CDT anymore, it's just EDT.
Which begs the question of why we should bother with CDT in the first place.
Some people claim that EDT fails at "smoking lesion" type of problems, but I think it is due to incorrect modelling or underspecification of the problem. If you use the correct model EDT produces the "right" answer.
It seems to me that EDT is superior to CDT.
(Ilya Shpitser will disagree, but I never understood his arguments)
People have known how to deal with smoking lesion (under a different name) since the 18th century (hint: the solution is not the EDT solution):
http://www.e-publications.org/ims/submission/STS/user/submissionFile/12809?confirm=bbb928f0
The trick is to construct a system that deals with things 20 times more complicated than smoking lesion. That system is recent, and you will have to read e.g. my thesis, or Jin Tian's thesis, or elsewhere to see what it is.
I have yet to see anyone advocating EDT actually handle a complicated example correctly. Or even a simple tricky example, e.g. the front door case.
You still delete incoming arcs when you make a decision. The argument is that if Omega perfectly predicts your decision, then causally his prediction must be a descendant of your decision, rather than an ancestor, because if it were an ancestor you would sever the connection that is still solid (and thus violate the problem description).
This is a shame, because he's right. Here's my brief attempt at an explanation of the difference between the two:
EDT uses the joint probability distribution. If you want to express a joint probability distribution as a graphical Bayesian network, then the direction of the arrows doesn't matter (modulo some consistency concerns). If you utilize your human intelligence, you might be able to figure out "okay, for this particular action, we condition on X but not on Y," but you do this for intuitive reasons that may be hard to formalize and which you might get wrong. When you use the joint probability distribution, you inherently assume that all correlation is causation, unless you've specifically added a node or data to block causation for any particular correlation.
CDT uses the causal network, where the direction of the arrows is informative. You can tell the difference between altering and observing something, in that observations condition things both up and down the causal graph, whereas alterations only condition things down the causal graph. You only need to use your human intelligence to build the right graph, and then the math can take over from there. For example, consider price controls: there's a difference between observing that the price of an ounce of gold is $100 and altering the price of an ounce of gold to be $100. And causal networks allow you to answer questions like "given that the price of gold is observed to be $100, what will happen when we force the price of gold to be $120?"
Now, if you look at the math, you can see a way to embed a causal network in a network without causation. So we could use more complicated networks and let conditioning on nodes do the graph severing for us. I think this is a terrible idea, both philosophically and computationally, because it entails more work and less clarity, both of which are changes in the wrong direction.
If I understand correctly, in causal networks the orientation of the arcs must respect "physical causality", which I roughly understand to mean consistency with the thermodynamic arrow of time.
There is no way for your action to cause Omega's prediction in this sense, unless time travel is involved.
Yes, different Bayesian networks can represent the same probability distribution. And why would that be a problem? The probability distribution and your utility function are all that matters.
"Correlation vs causation" is an epistemic error. If you are making it then you are using the wrong probability distribution, not a "wrong" factorization of the correct probability distribution.
In the real world, this is correct, but it is not mathematically necessary. (To go up a meta level, this is about how you build causal networks in the first place, not about how you reason once you have a causal network; even if philosophers were right about CDT as the method to go from causal networks to decisions, they seem to have been confused about the method by which one goes from English problem statements to causal networks when it comes to Newcomb's problem.)
It is. How else can Omega be a perfect predictor? (I may be stretching the language, but I count Laplace's Demon as a time traveler, since it can 'see' the world at any time, even though it can only affect the world at the time that it's at.)
The problem is that you can't put any meaning into the direction of the arrows because they're arbitrary.
If you give me a causal diagram and the embedded probabilities for the environment, and ask me to predict what would happen if you did action A (i.e. counterfactual reasoning), you've already given me all I need to calculate the probabilities of any of the other nodes you might be interested in, for any action included in the environment description.
If you give me a joint probability distribution for the environment, and ask me to predict what would happen if you did action A, I don't have enough information to calculate the probabilities of the other nodes. You need to give me a different joint probability distribution for every possible action you could take. This requires a painful amount of communication, but possibly worse is that there's no obvious type difference between the joint probability distribution for the environment and for the environment given a particular action--and if I calculate the consequences of an action given the whole environment's data, I can get it wrong.
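A minimal sketch of why a joint distribution alone does not fix the effect of an action: the two graphs X -> Y and Y -> X below share one joint table but disagree about P(Y | do(X)). The table values and function names are assumptions chosen for illustration.

```python
# One joint distribution P(X, Y) over two binary variables (assumed numbers).
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}


def marginal_y(y):
    """P(Y = y), summing X out of the joint."""
    return sum(p for (_, yy), p in JOINT.items() if yy == y)


def conditional_y_given_x(y, x):
    """P(Y = y | X = x), ordinary conditioning on an observation."""
    p_x = sum(p for (xx, _), p in JOINT.items() if xx == x)
    return JOINT[(x, y)] / p_x


def do_x_if_x_causes_y(y, x):
    """Graph X -> Y: X has no parents, so do(X = x) matches conditioning."""
    return conditional_y_given_x(y, x)


def do_x_if_y_causes_x(y, x):
    """Graph Y -> X: cutting the arc into X leaves Y at its marginal."""
    return marginal_y(y)


if __name__ == "__main__":
    print("P(Y=1 | do(X=1)) if X -> Y:", do_x_if_x_causes_y(1, 1))
    print("P(Y=1 | do(X=1)) if Y -> X:", do_x_if_y_causes_x(1, 1))
```

Both graphs reproduce the same observational table, so the difference between the two answers comes entirely from the arrow direction.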
If you take physical causality out of the picture, then the orientation of the arcs is underspecified in the general case. But then, since you are only allowed to cut arcs that are incoming to the decision nodes, your decision model will be underspecified.
If you are going to allow time travel, defined in a broad sense, then your causal network will have cycles.
But the point is that in EDT you don't care about the direction of the arrows.
If I give you a causal diagram for Newcomb's problem (or some variation thereof) you will make a wrong prediction, because causal diagrams can't properly represent it.
If the model includes myself as well as the environment, you will be able to make the correct prediction.
Of course, if you give this prediction back to me, and it influences my decision, then the model has to include you as well. Which may, in principle, cause Godelian self-reference issues. But that's a fundamental limit of the logic capabilities of any computable system, there are no easy ways around it.
But that's not as bad as it sounds: the fact that you can't precisely predict everything about yourself doesn't mean that you can't predict anything or that you can't make approximate predictions.
(for instance, GCC can compile and optimize GCC)
Causal decision models are one way to approximate hard decision problems, and they work well in many practical cases. Newcomb-like scenarios are specifically designed to make them fail.
Yes, and the fact that EDT does not assign meaning to the direction of the arrows is why it's a less powerful language for describing environments.
If you allow retrocausation, I don't see why you think this is the case.
I'm not convinced that this is the case.
Arrow orientation is an artifact of Bayesian networks, not a fundamental property of the world.
Causation going in one direction (if the nodes are properly defined) does appear to be a fundamental property of the real world.
The point here is that if you have the correct probability distribution, all its predictions will be correct (i.e. have minimum expected regret). It seems that the difference between epistemology and decision theory can't be emphasized enough. If it's possible for your "mixing up correlation and causation" to result in you making an incorrect prediction and being surprised (when a different prediction would have been systematically more accurate), then there must be an error in your probability distribution.
But an arbitrary joint probability distribution can assign P(stuff | action=A) to any values whatsoever. What stops you from just setting all conditional probabilities to the correct values (i.e. those values such that they "predict what would happen if you did action A" correctly, which would be the output of P(stuff|do(A)) on the "correct" causal graph)?
And furthermore, if that joint distribution does make optimal predictions (assuming that this "counterfactual reasoning" results in optimal predictions, because I can't see any other reason you'd use a set of probabilities), then clearly it must be the probability distribution that is mandated by Cox's theorem, etc etc.
Note, there is a free variable in the above, which is the unconditional probabilities P(A). But as long as the optimal P(A) values are all nonzero (which is the case if you don't know the agent's algorithm, for example), the optimality of the joint distribution requires P(stuff|A) to be correct.
So it would seem like if you have the correct probability distribution, you can predict what would happen if I did action A, by virtue of me giving you the answers. Unless I've made a fatal mistake in the above argument.
In the smoking lesion variant where smoking is actually protective against cancer, but not enough to overcome the damage done by the lesion (leading to a Simpson's Paradox), standard EDT recommends against smoking (because it increases your chance of having a lesion) and standard CDT recommends for smoking (because you sever the link to having a lesion, and so only the positive direct effect remains). They give different estimates of difference of probability of getting cancer given that you chose to start smoking and the probability of getting cancer given that you chose to not smoke, because EDT doesn't natively understand the difference between "are a smoker" and "chose to start smoking." If you understand the difference, you can fudge things so that EDT works while you're actively putting effort into it.
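To make the Simpson's Paradox variant concrete, here is a small sketch with made-up numbers in which smoking lowers cancer risk by 0.1 within each lesion stratum, yet smokers have more cancer overall. All probabilities and names below are assumptions picked for illustration; only the qualitative structure comes from the comment above.

```python
P_LESION = 0.5                                    # assumed population prevalence
P_SMOKE_GIVEN_LESION = {True: 0.8, False: 0.2}    # assumed: lesion -> smoking
P_CANCER = {                                      # keyed by (lesion, smokes)
    (True, True): 0.7, (True, False): 0.8,
    (False, True): 0.1, (False, False): 0.2,
}


def p_lesion(lesion):
    return P_LESION if lesion else 1 - P_LESION


def p_smoke(smokes, lesion):
    p = P_SMOKE_GIVEN_LESION[lesion]
    return p if smokes else 1 - p


def cancer_given_observed(smokes):
    """P(cancer | smokes): observing smoking also shifts belief about the lesion."""
    weights = {l: p_lesion(l) * p_smoke(smokes, l) for l in (True, False)}
    total = sum(weights.values())
    return sum(weights[l] / total * P_CANCER[(l, smokes)] for l in (True, False))


def cancer_given_do(smokes):
    """P(cancer | do(smokes)): the lesion -> smoking arc is cut, lesion keeps its prior."""
    return sum(p_lesion(l) * P_CANCER[(l, smokes)] for l in (True, False))


if __name__ == "__main__":
    print("conditioning:", cancer_given_observed(True), cancer_given_observed(False))
    print("intervening: ", cancer_given_do(True), cancer_given_do(False))
```

With these numbers conditioning gives roughly 0.58 versus 0.32 (smoking looks harmful), while intervening gives roughly 0.40 versus 0.50 (smoking looks protective), matching the disagreement between the "standard EDT" and "standard CDT" recommendations.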
This is correct. You can remove the causality from a causal network and just use EDT on a joint probability distribution at the cost of increasing the number of nodes and the fan-in for each node. Since the memory requirements are exponential in fan-in and linear in number of nodes, this is a bad idea.
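The scaling claim can be illustrated by counting conditional probability table entries for networks of binary nodes: each node needs one free parameter per configuration of its parents, so 2^fan_in per node, summed over nodes. The `wider` graph below is a hypothetical stand-in that simply adds one node and one extra parent everywhere; it is not the actual "decaused" construction, which the thread does not spell out.

```python
def cpt_entries(parents):
    """Free parameters for a network of binary nodes.

    `parents` maps each node name to the list of its parent names.
    A binary node with k parents needs 2**k entries.
    """
    return sum(2 ** len(ps) for ps in parents.values())


# Causal graph for the smoking lesion example.
causal = {
    "lesion": [],
    "smoke": ["lesion"],
    "cancer": ["lesion", "smoke"],
}

# Hypothetical wider graph: one extra node that every original node depends on.
wider = {
    "extra": [],
    "lesion": ["extra"],
    "smoke": ["lesion", "extra"],
    "cancer": ["lesion", "smoke", "extra"],
}

if __name__ == "__main__":
    print("causal graph entries:", cpt_entries(causal))
    print("wider graph entries: ", cpt_entries(wider))
```

Adding a node only adds one more table, but raising the fan-in of an existing node by one doubles that node's table, which is the asymmetry the comment points at.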
Besides the memory requirements, this adds another problem: in a causal network, we share parameters that are not shared in the 'decaused' network. This is necessary in order to be able to represent all possible mutilated graphs as marginals of the joint probability distribution, but means that if we're trying to learn the parameters from observational data instead of getting from another source, we need much more data to get estimates that are as good. We can apply equality constraints, but then we might as well use CDT because we're either using the equality constraints implied by CDT (and are thus correct) or we screwed something up.
There also seem to be numerous philosophical benefits to using the language of counterfactuals and conditionals, over just the language of conditionals. Causal networks really are more powerful, in the sense that Paul Graham describes here.
If you give me a joint probability distribution which I can marginalize over any possible action, yes, I can do those predictions because you gave me the answers.
But what use is an algorithm that, when you give it the answers, merely doesn't destroy them? We want something that takes environments as inputs and outputs decisions as outputs, because then it will do the work for us.
I tend to be sceptical of smoking lesion arguments on account of how the scenario always seems to be either underspecified or contradictory. For example, how can any agents in the smoking lesion problem be EDT agents at all?
If they always take the action recommended by EDT, and there is exactly one such action, then they must all take the same action. But in that case there can't possibly be the postulated connection between the lesion and smoking (conditional on being an EDT agent). So an EDT agent that knows it implements EDT can't believe that its decision to smoke affects the chances of having the lesion, on pain of making incorrect predictions.
On the other hand, if "EDT agents" in this problem only sometimes take the action recommended by EDT, and the rest of the time are somehow influenced by the presence or absence of the lesion, then the description of the problem that says that the node controlled by your decision theory is "decision to smoke" would seem to be wrong to begin with. (These EDT agents will predict that P(I smoke | I smoke) = 1 and be horribly surprised.)
This is something I can believe, though it is not a correctness argument. Certainly it's plausible that in many scenarios it is computationally more convenient to apply CDT directly than to use a fully general model that has been taught about the same structure that CDT assumes.
In the statement of the smoking lesion problem I prefer, you have lots of observational data on people whose decision theory is unknown, but whose bodies are similar enough to yours that you think the things that give or don't give them cancer will have the same effect on you. You also don't know whether or not you have the lesion; a sensible prior is the population prevalence of the lesion.
Now it looks like we have a few options.
Option 1 is unworkable. Option 2 is what I call 'standard EDT,' and it fails on the smoking lesion. Option 3 is generally the one EDTers use to rescue EDT from the smoking lesion. But the issue is that EDT gives you no guidance on which of the correlations to break; you have to figure it out from the problem description. One might expect that sitting down and working out whether or not to smoke using math breaks the correlation between smoking and having the lesion, as most people don't do that. But should we also break the negative correlation between smoking and cancer conditional on lesion status? From the English names, we can probably get those right. If they're unlabeled columns in a matrix or nodes in a graph, we'll have trouble.
That work still has to be done somewhere, obviously; in CDT it's done when one condenses the problem statement down to a causal network. (And CDTers historically being wrong on Newcomb's is an example of what doing this work wrong looks like.) But putting work where it belongs and having good interfaces between your modules is a good idea, and I think this is a place where CDT does solidly better than EDT.
I do think the linked Graham article is well worth reading; that all languages necessarily turn into machine code does not mean all languages are equally good for thinking in. Thinking in a more powerful language lets you have more powerful thoughts.
I don't understand most of your position on EDT/CDT, but I especially don't understand how
follows from the previous sentence.
I also thought P(A|A)=1 followed from the axioms of probability.
Smoking lesion problems are generally underspecified. If you can fill in additional detail, the "correct" decision changes. And I argue that a properly applied EDT outputs it.
Consider the scenario where the lesion affects your probability of smoking by affecting your conscious preferences.
The correct decision is to smoke, and EDT outputs it if you condition on the preferences.
In another scenario, an evil Omega probes you before you are born. If and only if it predicts that you will be a smoker, it puts a cancer lesion in your DNA (Omega is a good, though not necessarily perfect predictor).
The cancer lesion doesn't directly "cause" smoking, or, in the language of probability theory, it doesn't correlate with smoking conditioned on Omega's prediction.
The correct decision is not to smoke, and EDT outputs it since the problem is exactly isomorphic to Newcomb's standard problem. CDT gets it wrong.
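A small sketch of the evil-Omega scenario as described: Omega implants the lesion exactly when it predicts smoking, so conditioning on the choice shifts the lesion probability the same way conditioning on the action shifts the box contents in Newcomb's problem. The 0.9 predictor accuracy, the cancer rates, and the function names are assumptions for illustration only.

```python
OMEGA_ACCURACY = 0.9                               # assumed P(prediction is right)
P_CANCER_GIVEN_LESION = {True: 0.8, False: 0.1}    # assumed cancer rates


def p_lesion_given_choice(smokes):
    """Omega implants the lesion iff it predicts smoking."""
    return OMEGA_ACCURACY if smokes else 1 - OMEGA_ACCURACY


def edt_cancer_risk(smokes):
    """P(cancer | choice): the choice is evidence about Omega's prediction."""
    p = p_lesion_given_choice(smokes)
    return (p * P_CANCER_GIVEN_LESION[True]
            + (1 - p) * P_CANCER_GIVEN_LESION[False])


def cdt_cancer_risk(smokes, prior_lesion):
    """P(cancer | do(choice)) with the lesion modeled as an ancestor.

    The choice is deliberately unused: cutting the incoming arcs leaves the
    lesion at its prior, so both actions get the same risk.
    """
    return (prior_lesion * P_CANCER_GIVEN_LESION[True]
            + (1 - prior_lesion) * P_CANCER_GIVEN_LESION[False])


if __name__ == "__main__":
    print("EDT:", edt_cancer_risk(True), edt_cancer_risk(False))
    print("CDT:", cdt_cancer_risk(True, 0.5), cdt_cancer_risk(False, 0.5))
```

Under these assumptions conditioning makes smoking look far riskier, while the intervention calculation sees no difference in risk between the two actions, so any enjoyment from smoking would tip it toward smoking.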
The problem is that this can lead to inconsistency when you have two omegas trying to predict each other.
This is one of the arguments against the possibility of Laplace's Demon, and I agree that a world with two Omegas is probably going to be inconsistent.
It should be noted that this also makes transparent Newcomb ill-posed because the transparent boxes make the box-picker essentially an omega.