IlyaShpitser comments on Chocolate Ice Cream After All? - Less Wrong

Post author: pallas 09 December 2013 09:09PM


Comment author: IlyaShpitser 10 December 2013 05:24:35PM *  1 point [-]

I am just saying, fix CDT, not EDT. I claim EDT is irreparably broken on far less exotic problems than Parfit's hitchhiker. Problems like "should I give drugs to patients based on the results of this observational study?" The reason I think this is that I can construct arbitrarily complicated causal graphs where getting the right answer entails having a procedure that is "causal inference"-complete, and I don't think anyone who uses EDT is anywhere near there (and if they are .. they are just reinventing CDT in a different language, which seems silly).

I am not strawmanning EDT, I am happy to be proven wrong by any EDT adherent and update accordingly (hence my challenge). For example, I spent some time with Paul Christiano et al back at the workshop trying to get a satisfactory answer out of EDT, and we didn't really succeed (although to be fair, that was a tangent to the main thrust of that workshop, so we didn't really spend too much time on this).

Comment author: pallas 10 December 2013 06:42:21PM 1 point [-]

I claim EDT is irreparably broken on far less exotic problems than Parfit's hitchhiker. Problems like "should I give drugs to patients based on the results of this observational study?"

This seems to be a matter of screening off. Once we decide not to prescribe the drug because of evidential reasoning, we don't learn anything new about the health of the patient. I would only withhold the drug if a credible source with forecasting power (for instance, Omega) showed me that generally healthy patients (who show suspicious symptoms) go to doctors who endorse evidential reasoning, while unhealthy patients go to conventional causal doctors. This sounds counterintuitive, but structurally it is equivalent to Newcomb's Problem: the patient corresponds to the box; we know it already "has" a specific value, but we don't know that value yet. Choosing only box B (or withholding the drug) would be the option that is only compatible with the more desirable past, in which Omega has put the million into the box (or the patient has been healthy all along).

Comment author: IlyaShpitser 10 December 2013 09:59:05PM *  1 point [-]

Look, this is all too theoretical for me. Can you please go and actually read my example and tell me what your decision rule is for giving the drug?


There is more to this than d-separation. D-separation is just a visual rule for the way in which conditional independence works in certain kinds of graphical models. There is not enough conceptual weight in the d-separation idea alone to handle decision theory.

Comment author: pallas 12 December 2013 11:58:22AM 2 points [-]

Look, HIV patients who get HAART die more often (because people who get HAART are already very sick). We don't get to see the health status confounder because we don't get to observe everything we want. Given this, is HAART in fact killing people, or not?

It is not clear to me what we know about HAART in this game. For instance, if we know nothing about it and only observe logical equivalences (in fact, probabilistic tendencies) of the form "HAART" <--> "patient dies (within a specified time interval)" and "no HAART" <--> "patient survives", it wouldn't be irrational to reject the treatment.

Once we know more about HAART, for instance that the probabilistic tendencies were due to unknowingly comparing sick people to healthy people, we can then figure out that P(patient survives | sick, HAART) > P(patient survives | sick, no HAART) and that P(patient survives | healthy, HAART) < P(patient survives | healthy, no HAART). Knowing that much, choosing not to give the drug would be a foolish thing to do.
If we came to know that following a particular line of reasoning R which leads to not prescribing the drug (even after the update above) is very strongly correlated with having patients who are completely healthy but show false-positive clinical test results, then not prescribing the drug would be the better thing to do. This, of course, would require that this new piece of information yield true predictions about future cases (which makes the scenario quite unlikely, though it may still be relevant to the theoretical debate).
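A tiny numerical example of the stratified inequalities above (all counts are made up for illustration): HAART raises survival among the sick and lowers it among the healthy, yet it looks harmful in the aggregate simply because mostly sick patients receive it.

```python
from fractions import Fraction as F

# Hypothetical counts: (survived, total) per (health, treatment) cell.
counts = {
    ("sick",    "HAART"):    (54, 90),   # 60% survive
    ("sick",    "no HAART"): ( 4, 10),   # 40% survive
    ("healthy", "HAART"):    ( 9, 10),   # 90% survive
    ("healthy", "no HAART"): (87, 90),   # ~97% survive
}

def p_survive(treatment, group=None):
    """Survival rate, optionally restricted to one health stratum."""
    cells = [counts[(g, treatment)] for g in ("sick", "healthy") if group in (None, g)]
    return F(sum(s for s, _ in cells), sum(n for _, n in cells))
```

Within each stratum the drug helps the sick (3/5 vs 2/5) and hurts the healthy (9/10 vs 87/90), but the aggregate P(survive | HAART) = 63/100 falls far below P(survive | no HAART) = 91/100, purely because treatment assignment tracks sickness.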

Generally, I think that drawing causal diagrams is a very useful heuristic in "everyday science", since replacing the term causality with all the conditionals involved might be confusing. Maybe this is one reason why some people tend to think that evidential reasoning is defined to consider only plain conditionals (in this example, P(survival | HAART)) and no further background data. Because otherwise you could, in effortful ways, arrive at the same answer as causal reasoners, but then what would be the point of imitating CDT?

I think it is exactly the other way round. It's all about conditionals. It seems to me that a Bayesian writes down "causal connection" in his or her map after updating on sophisticated sets of correlations. It seems impossible to completely rule out confounding at any point. Since evidential reasoning would suggest not prescribing the drug in the false-positive scenario above, its output is not the same as the one conventional CDT produces. Differences between CDT and the non-naive evidential approach are also described here: http://lesswrong.com/lw/j5j/chocolate_ice_cream_after_all/a6lh

It seems that CDT supporters only do A if there is a causal mechanism connecting it with the desirable outcome B. An evidential reasoner would also do A if he knew that there were no causal mechanism connecting it to B, but a true (purely correlative) prediction stating the logical equivalences A <--> B and ~A <--> ~B.

Comment author: IlyaShpitser 12 December 2013 12:09:35PM *  1 point [-]

Ok. So what is your answer to this problem:

"A set of 100 HIV patients are randomized to receive HAART at time 0. Some time passes, and their vitals are measured at time 1. Based on this measurement some patients receive HAART at time 1 (some of these received HAART at time 0, and some did not). Some more time passes, and some patients die at time 2. Some of those that die at time 2 had HAART at both times, or at one time, or at no time. You have a set of records that show you, for each patient of 100, whether they got HAART at time 0 (call this variable A0), whether they got HAART at time 1 (call this variable A1), what their vitals were at time 1 (call this variable W), and whether they died or not at time 2 (call this variable Y). A new patient comes in, from the same population as the original 100. You want to determine how much HAART to give him. That is, should {A0,A1} be set to yes,yes; yes,no; no,yes; or no,no. Your utility function rewards you for keeping patients alive. What is your decision rule for prescribing HAART for this patient?"

From the point of view of EDT, the set of records containing values of A0,W,A1,Y for 100 patients is all you get to see. (Someone using CDT would get a bit more information than this, but this isn't relevant for EDT). I can tell you that based on the records you see, p(Y=death | A0=yes,A1=yes) is higher than p(Y=death | A0=no,A1=no). I am also happy to answer any additional questions you may have about p(A0,W,A1,Y). This is a concrete problem with a correct answer. What is it?
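One way to see why the records alone can mislead here is a toy simulation of this setup (the data-generating process and every number in it are assumptions for illustration, not part of the problem statement): HAART reduces each patient's death probability, yet treated patients die more often in the records, because under the observational regime sicker patients are the ones treated at time 1.

```python
import random

random.seed(0)

def run(n, forced=None):
    """Simulate n patients. forced=(a0, a1) overrides the observational
    treatment rule; forced=None follows it (HAART at time 1 iff bad vitals).
    All parameters are made up for illustration."""
    records = []
    for _ in range(n):
        sick = random.random() < 0.5                      # hidden confounder
        a0 = (random.random() < 0.5) if forced is None else forced[0]
        w = random.random() < (0.7 if sick else 0.1)      # bad vitals at time 1
        a1 = w if forced is None else forced[1]
        p_death = (0.7 if sick else 0.05) - 0.1 * a0 - 0.1 * a1
        records.append((a0, w, a1, random.random() < max(p_death, 0.0)))
    return records

def death_rate(recs, a0, a1):
    cell = [y for (x0, w, x1, y) in recs if x0 == a0 and x1 == a1]
    return sum(cell) / len(cell)

obs = run(100_000)
naive_yy = death_rate(obs, True, True)    # p(Y=death | A0=yes, A1=yes)
naive_nn = death_rate(obs, False, False)  # p(Y=death | A0=no,  A1=no)

# Interventional death rates: force the regime on everyone, ignoring vitals.
do_yy = sum(y for *_, y in run(100_000, forced=(True, True))) / 100_000
do_nn = sum(y for *_, y in run(100_000, forced=(False, False))) / 100_000
```

Here naive_yy comes out well above naive_nn (the treated die more often in the records), even though do_yy is below do_nn (forcing treatment on everyone saves lives): the conditional distribution p(Y | A0, A1) answers the wrong question.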

Comment author: nshepperd 12 December 2013 12:36:17PM *  1 point [-]

I don't understand why you persist in blindly converting historical records into subjective probabilities, as though there was no inference to be done. You can't just set p(Y=death | A0=yes,A1=yes) to the proportion of deaths in the data, because that throws away all the highly pertinent information you have about biology and the selection rule for "when was the treatment applied". (EDIT: ignoring the covariate W would cause Simpson's Paradox in this instance)

EDIT EDIT: Yes, P(Y = death in a randomly-selected line of the data | A0=yes,A1=yes in the same line of data) is equal to the proportion of deaths in the data, but that's not remotely the same thing as P(this patient dies | I set A0=yes,A1=yes for this patient).

Comment author: IlyaShpitser 12 December 2013 12:52:15PM *  1 point [-]

I was just pointing out that in the conditional distribution p(Y|A0,A1) derived from the empirical distribution some facts happen to hold that might be relevant. I never said what I am ignoring, I was merely posing a decision problem for EDT to solve.

The only information about biology you have is the 100 records for A0,W,A1,Y that I specified. You can't ask for more info, because there is no more info. You have to decide with what you have.

Comment author: nshepperd 12 December 2013 01:15:40PM *  1 point [-]

The information about biology I was thinking of is things like "vital signs tend to be correlated with internal health" and "people with bad internal health tend to die". Information it would be irresponsible to not use.

But anyway, the solution is to calculate P(this patient dies | I set A0=a0,A1=a1 for this patient, data) (I should have included the conditioning on data above but I forgot) by whatever statistical methods are relevant, then to pick whichever option for a0,a1 gives the lower probability of death. Straightforward.

You can approximate P(this patient dies | I set A0=a0,A1=a1 for this patient, data) with P_empirical(Y=death | do(A0=a0,A1=a1)) from the data, on the assumption that our decision process is independent of W (which is reasonable, since we don't measure W). There are other ways to calculate P(this patient dies | I set A0=a0,A1=a1 for this patient, data), like Solomonoff induction, presumably, but who would bother with that?
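A sketch of what that approximation can look like on synthetic records (the data-generating process and all numbers are assumptions for illustration): with A0 randomized and A1 depending only on the measured vitals W, the g-formula sum over w of p(death | a0, w, a1) * p(w | a0) estimates p(Y | do(A0=a0, A1=a1)), while the plain conditional p(Y | A0, A1) gets the sign of the effect wrong.

```python
import random

random.seed(1)

# Synthetic observational records (a0, w, a1, y). Doctors mostly, but not
# always, give HAART at time 1 when vitals are bad, so every (w, a1) cell
# is populated. Sickness is a hidden confounder; HAART lowers p(death).
records = []
for _ in range(200_000):
    sick = random.random() < 0.5
    a0 = random.random() < 0.5                        # randomized at time 0
    w = random.random() < (0.7 if sick else 0.1)      # bad vitals at time 1
    a1 = random.random() < (0.9 if w else 0.1)        # treatment follows vitals
    p_death = (0.7 if sick else 0.05) - 0.1 * a0 - 0.1 * a1
    records.append((a0, w, a1, random.random() < max(p_death, 0.0)))

def g_formula(recs, a0, a1):
    """Estimate p(Y=death | do(A0=a0, A1=a1)) as sum_w p(death|a0,w,a1) p(w|a0)."""
    given_a0 = [r for r in recs if r[0] == a0]
    est = 0.0
    for w in (True, False):
        stratum = [r for r in given_a0 if r[1] == w]
        cell = [r for r in stratum if r[2] == a1]
        est += (len(stratum) / len(given_a0)) * (sum(r[3] for r in cell) / len(cell))
    return est

def naive(recs, a0, a1):
    """Plain conditional p(Y=death | A0=a0, A1=a1) from the records."""
    cell = [r for r in recs if r[0] == a0 and r[2] == a1]
    return sum(r[3] for r in cell) / len(cell)
```

On this data naive(records, True, True) exceeds naive(records, False, False), so the raw conditionals say "don't treat"; g_formula(records, True, True) falls well below g_formula(records, False, False), so the adjusted estimate correctly says "treat at both times".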

Comment author: IlyaShpitser 12 December 2013 01:39:49PM *  1 point [-]

P_empirical(Y=death | do(A0=a0,A1=a1))

I agree with you broadly, but this is not the EDT solution, is it? Show me a definition of EDT in any textbook (or Wikipedia, or anywhere) that talks about do(.).

Yes, P(Y = death in a randomly-selected line of the data | A0=yes,A1=yes in the same line of data) is equal to the proportion of deaths in the data, but that's not remotely the same thing as P(this patient dies | I set A0=yes,A1=yes for this patient).

Yes, of course not. That is the point of this example! I was pointing out that facts about p(Y | A0,A1) aren't what we want here. Figuring out the distribution that is relevant is not so easy, and cannot be done merely from knowing p(A0,W,A1,Y).

Comment author: nshepperd 12 December 2013 01:50:30PM *  0 points [-]

No, this is the EDT solution.

EDT uses P(this patient dies | I set A0=a0,A1=a1 for this patient, data) while CDT uses P(this patient dies | do(I set A0=a0,A1=a1 for this patient), data).

EDT doesn't "talk about do" because P(this patient dies | I set A0=a0,A1=a1 for this patient, data) doesn't involve do. It just happens that you can usually approximate P(this patient dies | I set A0=a0,A1=a1 for this patient, data) by using do (because the conditions for your personal actions are independent of whatever the conditions for the treatment in the data were).

Let me be clear: the use of do I describe here is not part of the definition of EDT. It is simply an epistemic "trick" for calculating P(this patient dies | I set A0=a0,A1=a1 for this patient, data), and would be correct even if you just wanted to know the probability, without intending to apply any particular decision theory or take any action at all.

Also, CDT can seem a bit magical, because when you use P(this patient dies | do(I set A0=a0,A1=a1 for this patient), data), you can blindly set the causal graph for your personal decision to the empirical causal graph for your data set, because the do operator gets rid of all the (factually incorrect) correlations between your action and variables like W.

Comment author: pallas 10 December 2013 06:12:50PM *  1 point [-]

My comment above strongly called into question whether CDT gives the right answers. Therefore I wouldn't try to reinvent CDT with a different language. For instance, in the post I suggest that we should care about "all" the outcomes, not only the one happening in the future. I've first read about this idea in Paul Almond's paper on decision theory. An excerpt that might be of interest:

Suppose the universe is deterministic, so that the state of the universe at any time completely determines its state at some later time. Suppose at the present time, just before time t_now, you have a choice to make. There is a cup of coffee on a table in front of you and you have to decide whether to drink it. Before you decide, let us consider the state of the universe at some time, t_sooner, which is earlier than the present. The state of the universe at t_sooner should have been one from which your later decision, whatever it is going to be, can be determined: If you eventually end up drinking the coffee at t_now, this should be implied by the universe at t_sooner. Assume we do not know whether you are going to drink the coffee. We do not know whether the state of the universe at t_sooner was one that led to you drinking the coffee. Suppose that there were a number of conceivable states of the universe at t_sooner, each consistent with what you know in the present, which implied futures in which you drink the coffee at t_now. Let us call these states D1,D2,D3,…Dn. Suppose also that there were a number of conceivable states of the universe at t_sooner, each consistent with what you know in the present, which implied futures in which you do not drink the coffee at t_now. Let us call these states N1,N2,N3,…Nn. Suppose that you have just drunk the coffee at t_now. You would now know that the state of the universe at t_sooner was one of the states D1,D2,D3,…Dn. Suppose now that you did not drink the coffee at t_now. You would now know that the state of the universe at t_sooner was one of the states N1,N2,N3,…Nn. Consider now the situation in the present, just before t_now, when you are faced with deciding whether to drink the coffee. If you choose to drink the coffee then at t_sooner the universe will have been in one of the states D1,D2,D3,…Dn and if you choose not to drink the coffee then at t_sooner the universe will have been in one of the states N1,N2,N3,…Nn. 
From your perspective, your choice is determining the previous state of the universe, as if backward causality were operating. From your perspective, when you are faced with choosing whether or not to drink the coffee, you are able to choose whether you want to live in a universe which was in one of the states D1,D2,D3,…Dn or one of the states N1,N2,N3,…Nn in the past. Of course, there is no magical backward causality effect operating here: The reality is that it is your decision which is being determined by the earlier state of the universe. However, this does nothing to change how things appear from your perspective. Why is it that Newcomb’s paradox worries people so much, while the same issue arising with everyday decisions does not seem to cause the same concern? The main reason is probably that the issue is less obvious outside the scope of contrived situations like that in Newcomb’s paradox. With the example I have been discussing here, you get to choose the state of the universe in the past, but only in very general terms: You know that you can choose to live in a universe that, in the past, was in one of the states D1,D2,D3,…Dn, but you are not confronted with specific details about one of these states, such as knowing that the universe had a specific state in which some money was placed in a certain box (which is how the backward causality seems to operate in Newcomb’s paradox). It may make it seem more like an abstract, philosophical issue than a real problem. In reality, the lack of specific knowledge should not make us feel any better: In both situations you seem to be choosing the past as well as the future. 
You might say that you do not really get to choose the previous state of the universe, because it was in fact your decision that was determined by the previous state, but you could as well say the same about your decision to drink or not drink the coffee: You could say that whether you drink the coffee was determined by some earlier state of the universe, so you have only the appearance of a choice. When making choices we act as if we can decide, and this issue of the past being apparently dependent on our choices is no different from the normal consequences of our future being apparently dependent on our choices, even though our choices are themselves dependent on other things: We can act as if we choose it.

Comment author: CronoDAS 13 December 2013 07:40:28AM 0 points [-]

This quote seems to be endorsing the Mind Projection Fallacy; learning about the past doesn't seem to me to be the same thing as determining it...

Comment author: pallas 13 December 2013 05:04:21PM *  0 points [-]

It goes the other way round. An excerpt of my post (section Newcomb's Problem's problem of free will):

Perceiving time without an inherent “arrow” is not new to science and philosophy, but still, readers of this post will probably need a compelling reason why this view would be more goal-tracking. Considering Newcomb’s Problem, such a reason can be given: Intuitively, the past seems much more “settled” to us than the future. But it seems to me that this notion is confounded, as we often know more about the past than we know about the future. This could tempt us to project this imbalance of knowledge onto the universe, such that we perceive the past as settled and unswayable in contrast to a shapeable future. However, such a conventional set of intuitions conflicts strongly with picking only one box. These intuitions would tell us that we cannot affect the content of the box; it is already filled or empty, since it was prepared in the now-inaccessible past.