IlyaShpitser comments on Chocolate Ice Cream After All? - Less Wrong
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (77)
Look, this is all too theoretical for me. Can you please go and actually read my example and tell me what your decision rule is for giving the drug?
There is more to this than d-separation. D-separation is just a visual rule for the way in which conditional independence works in certain kinds of graphical models. There is not enough conceptual weight in the d-separation idea alone to handle decision theory.
It is not that clear to me what we know about HAART in this game. For instance, in case we know nothing about it and we only observe logical equivalences (in fact rather probabilistic tendencies) in the form "HAART" <--> "Patient dies (within a specified time interval)" and "no HAART" <--> "Patient survives" it wouldn't be irrational to reject the treatment.
Once we know more about HAART, for instance, that the probabilistic tendencies were due to unknowingly comparing sick people to healthy people, we then can figure out that P( patient survives | sick, HAART) > P (patient survives | sick, no HAART) and that P( patient survives | healthy, HAART)< P(patient survives | healthy, no HAART). Knowing that much, choosing not to give the drug would be a foolish thing to do.
If we come to know that a particular line of reasoning R that leads to not prescribing the drug (even after the update above) is very strongly correlated with having patients who are completely healthy but show false-positive clinical test results, then not prescribing the drug would be the better thing to do. This, of course, would require that this new piece of information yields true predictions about future cases (which makes the scenario quite unlikely, though it might be relevant to the theoretical debate).
Generally, I think that drawing causal diagrams is a very useful heuristic in "everyday science", since replacing the term causality with all the conditionals involved might be confusing. Maybe this is a reason why some people tend to think that evidential reasoning is defined to only consider plain conditionals (in this example P(survival| HAART)) but not more background data. Because otherwise, in effortful ways you could receive the same answer as causal reasoners do but what would be the point of imitating CDT?
I think it is exactly the other way round. It's all about conditionals. It seems to me that a Bayesian writes down "causal connection" in his/her map after updating on sophisticated sets of correlations. It seems impossible to completely rule out confounding at any place. Since evidential reasoning would suggest not prescribing the drug in the false-positive scenario above, its output is not similar to the one conventional CDT produces. Differences between CDT and the non-naive evidential approach are described here as well: http://lesswrong.com/lw/j5j/chocolate_ice_cream_after_all/a6lh
It seems that CDT-supporters only do A if there is a causal mechanism connecting it with the desirable outcome B. An evidential reasoner would also do A if he knew that there would be no causal mechanism connecting it to B, but a true (but purely correlative) prediction stating the logical equivalences A<-->B and ~A <--> ~B.
Ok. So what is your answer to this problem:
"A set of 100 HIV patients are randomized to receive HAART at time 0. Some time passes, and their vitals are measured at time 1. Based on this measurement some patients receive HAART at time 1 (some of these received HAART at time 0, and some did not). Some more time passes, and some patients die at time 2. Some of those that die at time 2 had HAART at both times, or at one time, or at no time. You have a set of records that show you, for each patient of 100, whether they got HAART at time 0 (call this variable A0), whether they got HAART at time 1 (call this variable A1), what their vitals were at time 1 (call this variable W), and whether they died or not at time 2 (call this variable Y). A new patient comes in, from the same population as the original 100. You want to determine how much HAART to give him. That is, should {A0,A1} be set to yes,yes; yes,no; no,yes; or no,no. Your utility function rewards you for keeping patients alive. What is your decision rule for prescribing HAART for this patient?"
From the point of view of EDT, the set of records containing values of A0,W,A1,Y for 100 patients is all you get to see. (Someone using CDT would get a bit more information than this, but this isn't relevant for EDT). I can tell you that based on the records you see, p(Y=death | A0=yes,A1=yes) is higher than p(Y=death | A0=no,A1=no). I am also happy to answer any additional questions you may have about p(A0,W,A1,Y). This is a concrete problem with a correct answer. What is it?
I don't understand why you persist in blindly converting historical records into subjective probabilities, as though there was no inference to be done. You can't just set p(Y=death | A0=yes,A1=yes) to the proportion of deaths in the data, because that throws away all the highly pertinent information you have about biology and the selection rule for "when was the treatment applied". (EDIT: ignoring the covariate W would cause Simpson's Paradox in this instance)
EDIT EDIT: Yes, P(Y = death in a randomly-selected line of the data | A0=yes,A1=yes in the same line of data) is equal to the proportion of deaths in the data, but that's not remotely the same thing as P(this patient dies | I set A0=yes,A1=yes for this patient).
I was just pointing out that in the conditional distribution p(Y|A0,A1) derived from the empirical distribution some facts happen to hold that might be relevant. I never said what I am ignoring, I was merely posing a decision problem for EDT to solve.
The only information about biology you have is the 100 records for A0,W,A1,Y that I specified. You can't ask for more info, because there is no more info. You have to decide with what you have.
The information about biology I was thinking of is things like "vital signs tend to be correlated with internal health" and "people with bad internal health tend to die". Information it would be irresponsible to not use.
But anyway, the solution is to calculate P(this patient dies | I set A0=a0,A1=a1 for this patient, data) (I should have included the conditioning on data above but I forgot) by whatever statistical methods are relevant, then to do whichever option of a0,a1 gives the lower number. Straightforward.
You can approximate P(this patient dies | I set A0=a0,A1=a1 for this patient, data) with P_empirical(Y=death | do(A0=a0,A1=a1)) from the data, on the assumption that our decision process is independent of W (which is reasonable, since we don't measure W). There are other ways to calculate P(this patient dies | I set A0=a0,A1=a1 for this patient, data), like Solomonoff induction, presumably, but who would bother with that?
I agree with you broadly, but this is not the EDT solution, is it? Show me a definition of EDT in any textbook (or Wikipedia, or anywhere) that talks about do(.).
Yes, of course not. That is the point of this example! I was pointing out that facts about p(Y | A0,A1) aren't what we want here. Figuring out the distribution that is relevant is not so easy, and cannot be done merely from knowing p(A0,W,A1,Y).
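To make the gap between p(Y | A0,A1) and the interventional quantity concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the counts in COUNTS are invented (they are not the records from the problem), W is reduced to a binary "bad vitals" flag, and the adjustment used is the standard g-formula, sum over w of p(Y | a0,w,a1) p(w | a0). That formula is only justified under extra causal assumptions (A0 randomized, A1 assigned using only A0 and W), which is exactly the information that p(A0,W,A1,Y) alone does not contain.

```python
from itertools import product

# Hypothetical counts, invented for illustration only.
# Key: (a0, w, a1) with w=1 meaning "bad vitals at time 1".
# Value: (number of patients, number of deaths at time 2).
COUNTS = {
    (1, 1, 1): (18, 9),  (1, 1, 0): (2, 2),
    (1, 0, 1): (2, 0),   (1, 0, 0): (28, 3),
    (0, 1, 1): (27, 18), (0, 1, 0): (3, 3),
    (0, 0, 1): (2, 0),   (0, 0, 0): (18, 2),
}


def naive_death_prob(a0, a1):
    """Empirical p(Y=death | A0=a0, A1=a1), ignoring W entirely."""
    n = sum(COUNTS[(a0, w, a1)][0] for w in (0, 1))
    d = sum(COUNTS[(a0, w, a1)][1] for w in (0, 1))
    return d / n


def g_formula_death_prob(a0, a1):
    """g-formula estimate: sum_w p(Y=death | a0, w, a1) * p(w | a0).

    Assumes A0 is randomized and A1 depends only on A0 and W.
    """
    n_a0 = sum(COUNTS[(a0, w, x)][0] for w in (0, 1) for x in (0, 1))
    total = 0.0
    for w in (0, 1):
        n_w = sum(COUNTS[(a0, w, x)][0] for x in (0, 1))
        n, d = COUNTS[(a0, w, a1)]
        total += (d / n) * (n_w / n_a0)
    return total


def best_regime():
    """Return the (a0, a1) pair with the lowest g-formula death probability."""
    return min(product((0, 1), repeat=2),
               key=lambda r: g_formula_death_prob(*r))


if __name__ == "__main__":
    for a0, a1 in product((1, 0), repeat=2):
        print((a0, a1),
              round(naive_death_prob(a0, a1), 3),
              round(g_formula_death_prob(a0, a1), 3))
    print("best regime under the g-formula:", best_regime())
```

With these made-up counts the naive conditional makes yes,yes look worse than no,no, while the adjusted estimate reverses the ranking, because in the invented data HAART at time 1 goes mostly to patients with bad vitals.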
No, this is the EDT solution.
EDT uses P(this patient dies | I set A0=a0,A1=a1 for this patient, data) while CDT uses P(this patient dies | do(I set A0=a0,A1=a1 for this patient), data).
EDT doesn't "talk about do" because P(this patient dies | I set A0=a0,A1=a1 for this patient, data) doesn't involve do. It just happens that you can usually approximate P(this patient dies | I set A0=a0,A1=a1 for this patient, data) by using do (because the conditions for your personal actions are independent of whatever the conditions for the treatment in the data were).
Let me be clear: the use of do I describe here is not part of the definition of EDT. It is simply an epistemic "trick" for calculating P(this patient dies | I set A0=a0,A1=a1 for this patient, data), and would be correct even if you just wanted to know the probability, without intending to apply any particular decision theory or take any action at all.
Also, CDT can seem a bit magical, because when you use P(this patient dies | do(I set A0=a0,A1=a1 for this patient), data), you can blindly set the causal graph for your personal decision to the empirical causal graph for your data set, because the do operator gets rid of all the (factually incorrect) correlations between your action and variables like W.
[ I did not downvote, btw. ]
Criticisms section in the Wikipedia article on EDT:
David Lewis has characterized evidential decision theory as promoting "an irrational policy of managing the news".[2] James M. Joyce asserted, "Rational agents choose acts on the basis of their causal efficacy, not their auspiciousness; they act to bring about good results even when doing so might betoken bad news."[3]
Where in the Wikipedia EDT article is the reference to "I set"? Or in any textbook? Where are you getting your EDT procedure from? Can you show me a reference? EDT is about conditional expectations, not about "I set."
One last question: what is P(this patient dies | I set A0=a0,A1=a1 for this patient, data) as a function of P(Y,A0,W,A1)? If you say "whatever p_empirical(Y | do(A0,A1)) is", then you are a causal decision theorist, by definition.
I don't strongly recall when I last read a textbook on decision theory, but I remember that it described agents using probabilities about the choices available in their own personal situation, not distributions describing historical data.
Pragmatically, when you build a robot to carry out actions according to some decision theory, the process is centered around the robot knowing where it is in the world, and making decisions with the awareness that it is making the decisions, not someone else. The only actions you have to choose are "I do this" or "I do that".
I would submit that a CDT robot makes decisions on the basis of P(outcome | do(I do this or that), sensor data) while a hypothetical EDT robot would make decisions based on P(outcome | I do this or that, sensor data). How P(outcome | I do this or that, sensor data) is computed is a matter of personal epistemic taste, and nothing for a decision theory to have any say about.
(It might be argued that I am steel-manning the normal description of EDT, since most people talking about it seem to make the error of blindly using distributions describing historical data as P(outcome | I do this or that, sensor data), to the point where that got incorporated into the definition. In which case maybe I should be writing about my "new" alternative to CDT in philosophy journals.)