IlyaShpitser comments on Evidential Decision Theory, Selection Bias, and Reference Classes - Less Wrong

19 points | Post author: Qiaochu_Yuan | 08 July 2013 05:16AM

Comment author: IlyaShpitser 09 July 2013 04:11:49AM *  5 points [-]

Look, HIV patients who get HAART die more often (because people who get HAART are already very sick). We don't get to see the health status confounder because we don't get to observe everything we want. Given this, is HAART in fact killing people, or not?

EDT does the wrong thing here. Any attempt to not handle the confounder properly does the wrong thing here. If something does handle the confounder properly, it's not EDT anymore (because it's not going to look at E[death|HAART]). If you are willing to call such a thing "EDT", then EDT can mean whatever you want it to mean.


Here's the specific example to work out using whatever version of EDT you want:

People get HAART over time (let's restrict to 2 time slices for simplicity). The first time HAART is given (A0), it is randomized. The second time HAART is given (A1), it is given by a doctor according to some (known) policy based on vitals after A0 was given and some time passed (L0). Then we see if the patient dies or not (Y). The graph is this:

A0 -> L0 -> A1 -> Y, with A0 -> A1 and A0 -> Y. There is also health status confounding between L0 and Y (a common cause we don't get to see). Based on this data, how do we determine whether giving people HAART at A0 and A1 is a good idea?
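As a concrete sketch, this setup can be simulated. The thread fixes only the graph, not the numbers, so every parameter below is invented; HAART is set up to genuinely help while being given mostly to the sickest patients:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Invented parameterization of A0 -> L0 -> A1 -> Y (plus A0 -> A1, A0 -> Y,
# and an unobserved common cause U of L0 and Y).
U  = rng.binomial(1, 0.5, n)                                  # unobserved health status (1 = frail)
A0 = rng.binomial(1, 0.5, n)                                  # first HAART dose, randomized
L0 = rng.binomial(1, 0.25 + 0.6 * U - 0.2 * A0)               # poor vitals after A0
A1 = rng.binomial(1, 0.05 + 0.75 * L0 + 0.1 * A0)             # doctor mostly treats the sick
Y  = rng.binomial(1, 0.15 + 0.5 * U - 0.07 * A0 - 0.07 * A1)  # death; each dose lowers risk

# Treated patients die more often, even though HAART helps everyone,
# because A1 tracks frailty through L0.
print(Y[A1 == 1].mean())  # ~0.43
print(Y[A1 == 0].mean())  # ~0.26
```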


It's true that you could formalize, say, fluid dynamics in set theory if you wanted. Does this then mean fluid dynamics is set-theoretic? One needs to pick the right level of abstraction.


I think discussions of AIXI, source-code aware agents, etc. in the context of decision theories are a bit sterile because they are very far from actual problems people want to solve (e.g. is this actual non-hypothetical drug killing actual non-hypothetical people?)

Comment author: twanvl 09 July 2013 07:31:58PM 2 points [-]

EDT does the wrong thing here. Any attempt to not handle the confounder properly does the wrong thing here. If something does handle the confounder properly, it's not EDT anymore (because it's not going to look at E[death|HAART])

According to the wikipedia page, EDT uses conditional probabilities. I.e.

V(HAART) = P(death|HAART)U(death) + P(!death|HAART)U(!death).

The problem is not with this EDT formula in general, but with how these probabilities are defined and estimated. In reality, they are based on a sample, and we are making a decision for a particular patient, i.e.

V(HAART-patient1) = P(death-patient1|HAART-patient1)U(death-patient1) + P(!death-patient1|HAART-patient1)U(!death-patient1).

We don't know any of these probabilities exactly, since you will not find out whether the patient dies until after you give or withhold the treatment. So instead, you estimate the probabilities based on other patients. A completely brain-dead model would use the reference class of all people, and conclude that HAART kills. But a more sophisticated model would include something like P(patient1 is similar to patient2) to define a better reference class, and it would also take into account confounders.

Comment author: IlyaShpitser 09 July 2013 07:39:28PM 0 points [-]

But a more sophisticated model would include something like P(patient1 is similar to patient2) to define a better reference class, and it would also take into account confounders.

Ok -- the data is as I describe above. You don't get any more data. What is your EDT solution to this example?

Comment author: twanvl 09 July 2013 10:24:12PM 2 points [-]

You didn't give any data, just a problem description. Am I to assume that a bunch of {A0, L0, A1, Y} records is available? And you say that the policy for giving A1 is known; is the information that this decision is based on (health status) also available?

In any case, you end up with the problem of estimating a causal structure from observational data, which is a challenging problem. But I don't see what this has to do with EDT vs another DT. Wouldn't this other decision theory face exactly the same problem?

Comment author: IlyaShpitser 09 July 2013 11:49:09PM *  1 point [-]

Am I to assume that a bunch of {A0, L0, A1, Y} records is available? And you say that the policy for giving A1 is known; is the information that this decision is based on (health status) also available?

You have (let's say infinitely many, to avoid dealing with stats issues) records for { A0, L0, A1, Y }. You know they come from the causal graph I specified (complete with an unobserved confounder for health status, on which no records exist). You don't need to learn the graph; you just need to tell me whether HAART is killing people or not, and why, using EDT.

Comment author: twanvl 10 July 2013 09:56:05AM 1 point [-]

There is no single 'right answer' in this case. The answer will depend on your prior for the confounder.

As others have noted, the question "is HAART killing people?" has nothing to do with EDT, or any other decision theory for that matter. The question that decision theories answer is "should I give HAART to person X?"

Comment author: IlyaShpitser 10 July 2013 03:40:33PM *  1 point [-]

There is no single 'right answer' in this case. The answer will depend on your prior for the confounder.

As others have noted, the question "is HAART killing people?" has nothing to do with EDT ...

I think I disagree with both of these assertions. First, there is the "right answer," and it has nothing to do with priors or Bayesian reasoning. In fact there is no model uncertainty in the problem -- I gave you "the truth" (the precise structure of the model and enough data to parameterize it precisely, so you don't have to pick or average among a set of alternatives). All you have to do is answer a question about a single parameter of the model I gave you. The only question is which parameter of the model I am asking you about. Second, it's easy enough to rephrase my question as a decision theory question (I do so here: http://lesswrong.com/lw/hwq/evidential_decision_theory_selection_bias_and/9cdk).

Comment author: twanvl 10 July 2013 04:41:12PM *  0 points [-]

To quote your other comment:

Ok -- a patient comes in (from the same reference class as the patients in your data). This patient has HIV. Do you put him on HAART or not?

You put the patient on HAART if and only if V(HAART) > V(!HAART), where

V(HAART) = P(death|HAART)U(death) + P(!death|HAART)U(!death).
V(!HAART) = P(death|!HAART)U(death) + P(!death|!HAART)U(!death).

In these formulas HAART means "(decide to) put this patient on HAART" and death means "this patient dies".

For concreteness, we can assume that the utility of death is low, say 0, while the utility of !death is positive. Then the decision reduces to

P(!death|HAART) > P(!death|!HAART)

So if you give me P(!death|HAART) and P(!death|!HAART) then I can give you a decision.

Comment author: IlyaShpitser 10 July 2013 04:49:22PM *  3 points [-]

Ok. This is wrong. The problem is that P(death|HAART) isn't telling you whether HAART is bad or not (due to unobserved confounding). I have already specified that there is confounding by health status (that is, HAART helps, but was only given to people who were very sick). What you need to compare is

\sum_{L0} E[death | A0, L0, A1] p(L0 | A0)

for various values of A1 and A0.

Comment author: twanvl 10 July 2013 06:22:11PM 0 points [-]

Note that I defined HAART as "put this patient on HAART", not as giving HAART in general (maybe I should have used a different notation).

If I understand your model correctly then

A0 = is HAART given at time t=0 (boolean)
L0 = time to wait (seconds, positive)
A1 = is HAART given (again) at time t=L0 (boolean)

with the confounding variable H1, the health at time t=L0, which influences the choice of A1. You didn't specify how L0 was determined; is it fixed, or does it also depend on the patient's health? Your formula above suggests that it depends only on the choice of A0.

Now a new patient comes in, and you want to know whether you should pick A0=true/false and A1=true/false. For the new patient x, you want to estimate P(death[x] | A0[x], A1[x]). If it were just about A0[x], it would be easy, since the assignment was randomized, so we know that A0 is independent of any confounders. But this is not true for A1; in fact, we have no good data with which to estimate the effect of A1[x], since we only have samples where A1 was chosen according to the health-status-based policy.

Comment author: Qiaochu_Yuan 09 July 2013 04:40:14AM 1 point [-]

Look, HIV patients who get HAART die more often (because people who get HAART are already very sick). We don't get to see the health status confounder because we don't get to observe everything we want. Given this, is HAART in fact killing people, or not?

Well, of course I can't give the right answer if the right answer depends on information you've just specified I don't have.

If something does handle the confounder properly, it's not EDT anymore (because it's not going to look at E[death|HAART]).

Again, I think there is a nontrivial selection bias / reference class issue here that needs to be addressed. The appropriate reference class for deciding whether to give HAART to an HIV patient is not just the set of all HIV patients who've been given HAART, precisely because of the possibility of confounders.

I think discussions of AIXI, source-code aware agents, etc. in the context of decision theories are a bit sterile because they are very far from actual problems people want to solve (e.g. is this actual non-hypothetical drug killing actual non-hypothetical people?)

In actual problems people want to solve, people have the option of acquiring more information and working from there. It's plausible that with enough information even relatively bad decision theories will still output a reasonable answer (my understanding is that this kind of phenomenon is common in machine learning, for example). But the general question of what to do given a fixed amount of information remains open and is still interesting.

Comment author: IlyaShpitser 09 July 2013 06:55:11AM *  7 points [-]

Well, of course I can't give the right answer if the right answer depends on information you've just specified I don't have.

I think there is "the right answer" here, and I think it does not rely on observing the confounder. If your decision theory does, then (a) your decision theory isn't as smart as it could be, and (b) you are needlessly restricting yourself to certain types of decision theories.

The appropriate reference class for deciding whether to give HAART to an HIV patient is not just the set of all HIV patients who've been given HAART, precisely because of the possibility of confounders.

People have been thinking about confounders for a long time (the earliest reference known to me to a "randomized" trial is the book of Daniel; see also this: http://ije.oxfordjournals.org/content/33/2/247.long). There is a lot of nice, clever math developed in the last 100 years or so that gets around unobserved confounders. Saying "well, we just need to observe confounders" is sort of silly. That's like saying "well, if you want to solve this tricky computational problem, forget about developing new algorithms and that whole computational complexity thing, and just buy more hardware."

In actual problems people want to solve, people have the option of acquiring more information and working from there.

I don't know what kind of actual problems you work on, but the reality of life in stats, medicine, etc. is that you have your dataset and you've got to draw conclusions from it. The dataset is crappy -- there is probably selection bias, all sorts of missing data, censoring, things we would really have liked to know but which were never collected, etc. This is just a fact of life for folks in the trenches in the empirical sciences/data analysis. The right answer here is not denial, but new methodology.

Comment author: William_Quixote 12 July 2013 09:34:48PM *  2 points [-]

There is a lot of nice clever math that gets around unobserved confounders developed in the last 100 years or so.

For non experts in the thread, what's the name of this area and is there a particular introductory text you would recommend?

Comment author: IlyaShpitser 13 July 2013 06:38:19AM *  3 points [-]

Thanks for your interest! The name of the area is "causal inference." Keywords: "standardization" (in epidemiology), "confounder or covariate adjustment," "propensity score", "instrumental variables", "back-door criterion," "front-door criterion," "g-formula", "potential outcomes", "ignorability," "inverse probability weighting," "mediation analysis," "interference", etc.

Pearl's Causality book (http://www.amazon.com/Causality-Reasoning-Inference-Judea-Pearl/dp/052189560X/ref=pd_sim_sbs_b_1) is a good overview (but doesn't talk a lot about statistics/estimation). Early references are Sewall Wright's path analysis paper from 1921 (http://naldc.nal.usda.gov/download/IND43966364/PDF) and Neyman's paper on potential outcomes from 1923 (http://www.ics.uci.edu/~sternh/courses/265/neyman_statsci1990.pdf). People say either Sewall Wright or his dad invented instrumental variables also.

Comment author: William_Quixote 13 July 2013 04:10:35PM 2 points [-]

Thanks

Comment author: endoself 09 July 2013 12:09:52PM 4 points [-]

Look, HIV patients who get HAART die more often (because people who get HAART are already very sick). We don't get to see the health status confounder because we don't get to observe everything we want. Given this, is HAART in fact killing people, or not?

Well, of course I can't give the right answer if the right answer depends on information you've just specified I don't have.

You're sort of missing what Ilya is trying to say. You might have to look at the actual details of the example he is referring to in order for this to make sense. The general idea is that even though we can't observe certain variables, we still have enough evidence to justify the causal model in which HAART leads to fewer people dying, so we can conclude that we should prescribe it.

I would object to Ilya's more general point though. Saying that EDT would use E(death|HAART) to determine whether to prescribe HAART is making the same sort of reference class error you discuss in the post. EDT agents use EDT, not the procedures used to assign A0 and A1 in the example, so we really need to calculate E(death|EDT agent prescribes HAART). I would expect this to produce essentially the same results as a Pearlian E(death | do(HAART)), and would probably regard it as a failure of EDT if it did not add up to the same thing, but I think that there is value in discovering how exactly this works out, if it does.

Comment author: IlyaShpitser 09 July 2013 03:19:18PM *  3 points [-]

A challenge (not in a bad sense, I hope): I would be interested in seeing an EDT derivation of the right answer in this example, if anyone wants to do it.

Comment author: [deleted] 11 July 2013 03:11:01PM 3 points [-]

Yeah, unfortunately everyone who responded to your question went all fuzzy in the brain and started taking philosophical evasive action.

Comment author: nshepperd 10 July 2013 01:02:05AM 0 points [-]

Um, since when were decision theories for answering epistemic questions? Are you trying to make some kind of point about how evidential decision theorists use incorrect math that ignores confounders?

Comment author: IlyaShpitser 10 July 2013 01:05:36AM *  3 points [-]

Um, since when were decision theories for answering epistemic questions?

???

How are you supposed to make good decisions?

Are you trying to make some kind of point about how evidential decision theorists use incorrect math that ignores confounders?

Well, I am trying to learn why people think EDT isn't terminally busted. I gave a simple example that usually breaks EDT as I understand it, and I hope someone will work out the right answer with EDT to show me where I am going wrong.

Comment author: nshepperd 10 July 2013 02:58:15AM -1 points [-]

How are you supposed to make good decisions?

Use decision theory. The point is that it's not decision theory that tells you your shoelaces are undone when you look at your feet. "Are my shoelaces undone?" is a purely epistemic question, that has nothing to do with making decisions. But upon finding out that your shoelaces are undone, a decision theory might decide to do X or Y, after discovering (by making a few queries to the epistemic-calculations module of your brain) that certain actions will result in the shoelaces being tied again, that that would be safer, etc etc.

You're complaining that EDT is somehow unable to solve the question of "is HAART bad" given some useless data set when that doesn't even sound like a question EDT should be trying to answer in the first place—but rather, a question you would try to answer with standard multivariate statistics.

Comment author: IlyaShpitser 10 July 2013 06:33:11AM *  1 point [-]

Ok -- a patient comes in (from the same reference class as the patients in your data). This patient has HIV. Do you put him on HAART or not? Your utility function is minimizing patient deaths. By the way, if you do the wrong thing, you go to jail for malpractice.

Comment author: nshepperd 10 July 2013 07:30:06AM 0 points [-]

How about we dispense with this and you tell us if you know how to extract information about the usefulness (or not) of HAART from a data set like this?

Comment author: IlyaShpitser 10 July 2013 03:37:31PM *  3 points [-]

Ok, first things first.

when that doesn't even sound like a question EDT should be trying to answer in the first place

Do you agree that "Do you put him on HAART or not? Your utility function is minimizing patient deaths." is in fact a kind of question EDT, or decision theories in general, should be trying to answer?

In fact, I already said elsewhere in this thread that I think there is the right answer to this question, and this right answer is to put the patient on HAART (whereas my understanding of EDT is that it will notice that E[death | HAART] > E[death | no HAART], and conclude that HAART is bad). The way you get the answer is no secret either, it's what is called 'the g-formula' or 'truncated factorization' in the literature. I have been trying to understand how my understanding of EDT is wrong. If people's attempt to fix this is to require that we observe all unobserved confounders for death, then to me this says EDT is not a very good decision theory (because other decision theories can get the right answer here without having to observe anything over what I specified). If people say that the right answer is to not give HAART then that's even worse (e.g. they will kill people and go to jail if they actually practice medicine like that).
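The g-formula mentioned here can be sketched on simulated data. The block below is self-contained and uses an invented parameterization of the thread's graph (each HAART dose is built to lower death probability by 0.07), then contrasts naive conditioning with the g-formula estimate \sum_{L0} E[death | A0, L0, A1] p(L0 | A0):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Invented parameterization of the thread's graph; U confounds L0 and Y.
U  = rng.binomial(1, 0.5, n)
A0 = rng.binomial(1, 0.5, n)
L0 = rng.binomial(1, 0.25 + 0.6 * U - 0.2 * A0)
A1 = rng.binomial(1, 0.05 + 0.75 * L0 + 0.1 * A0)
Y  = rng.binomial(1, 0.15 + 0.5 * U - 0.07 * A0 - 0.07 * A1)
# By construction, P(Y=1 | do(A0=a0, A1=a1)) = 0.40 - 0.07 * (a0 + a1).

def g_formula(a0, a1):
    """Estimate P(Y=1 | do(A0=a0, A1=a1)) = sum_{l0} P(Y=1|a0,l0,a1) P(l0|a0)."""
    total = 0.0
    for l0 in (0, 1):
        p_l0 = (L0[A0 == a0] == l0).mean()                    # P(L0=l0 | A0=a0)
        e_y = Y[(A0 == a0) & (L0 == l0) & (A1 == a1)].mean()  # E[Y | a0, l0, a1]
        total += e_y * p_l0
    return total

print(Y[A1 == 1].mean() - Y[A1 == 0].mean())  # positive: naive conditioning says HAART kills
print(g_formula(1, 1) - g_formula(0, 0))      # ~-0.14: the g-formula says HAART saves lives
```

Note that the g-formula never observes U; it only reweights the observed strata of L0, which is the point of the example.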

Comment author: nshepperd 11 July 2013 12:11:29AM *  -1 points [-]

Do you agree that "Do you put him on HAART or not? Your utility function is minimizing patient deaths." is in fact a kind of question EDT, or decision theories in general, should be trying to answer?

Yes. However a decision theory in general contains no specific prescriptions for obtaining probabilities from data, such as "oh, use the parametric g-formula". In general, they have lists of probabilistic information that they require.

E[death | HAART] > E[death | no HAART]

Setting that aside, I assume you mean the above to mean "count the proportion of samples without HAART with death, and compare to proportion of samples with HAART with death". Ignoring the fact that I thought there were no samples without HAART at t=0, what if half of the samples referred to hamsters, rather than humans?

No-one would ever have proposed EDT as a serious decision theory if they intended one to blindly count records while ignoring all other relevant "confounding" information (such as species, or health status). In reality, the purpose of the program of "count the number of people who smoke who have the lesion" or "count how many people who have HAART die" is to obtain estimates of P(I have the lesion | I smoke) or P(this patient dies | I give this patient HAART). That is why we discard hamster samples, because there are good a priori reasons to think that the survival of hamsters and humans is not highly correlated, and "this patient" is a human.

Comment author: IlyaShpitser 11 July 2013 03:41:28AM *  2 points [-]

Ignoring the fact that I thought there were no samples without HAART at t=0, what if half of the samples referred to hamsters, rather than humans?

Well, in reality there are both A0 and A1. I chose this example because in this example it is both the case that E[death | A0, A1] is wrong, and \sum_{L0} E[death | A0,A1,L0] p(L0) (usual covariate adjustment) is wrong, because L0 is a rather unusual type of confounder. This example was something naive causal inference used to get wrong for a long time.
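The claim that usual covariate adjustment also fails here can be checked numerically. This sketch again uses an invented parameterization of the graph; the only difference between the two estimators below is whether L0 is weighted by its marginal p(L0) (usual adjustment) or by p(L0 | A0) (g-formula):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Invented numbers; U is the unobserved confounder of L0 and Y.
U  = rng.binomial(1, 0.5, n)
A0 = rng.binomial(1, 0.5, n)
L0 = rng.binomial(1, 0.25 + 0.6 * U - 0.2 * A0)
A1 = rng.binomial(1, 0.05 + 0.75 * L0 + 0.1 * A0)
Y  = rng.binomial(1, 0.15 + 0.5 * U - 0.07 * A0 - 0.07 * A1)
# By construction, the true P(Y=1 | do(A0=1, A1=1)) is 0.26.

def adjusted(a0, a1, conditional):
    # conditional=False: usual adjustment, sum_{l0} E[Y|a0,l0,a1] p(l0)    -- biased here
    # conditional=True:  g-formula,        sum_{l0} E[Y|a0,l0,a1] p(l0|a0) -- consistent
    total = 0.0
    for l0 in (0, 1):
        p_l0 = (L0[A0 == a0] == l0).mean() if conditional else (L0 == l0).mean()
        total += Y[(A0 == a0) & (L0 == l0) & (A1 == a1)].mean() * p_l0
    return total

print(adjusted(1, 1, conditional=False))  # ~0.29, off the true 0.26
print(adjusted(1, 1, conditional=True))   # ~0.26
```

The bias from the marginal weighting is modest under these made-up numbers, but it is systematic: no amount of data makes it go away.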

More generally, you seem to be fighting the hypothetical. I gave a specific problem on only four variables, where everything is fully specified, there aren't hamsters, and which (I claim) breaks EDT. You aren't bringing up hamsters with Newcomb's problem, why bring them up here? This is just a standard longitudinal design: there is nothing exotic about it, no omnipotent Omegas or source-code reading AIs.

However a decision theory in general contains no specific prescriptions for obtaining probabilities from data.

I think you misunderstand decision theory. If you were right, there would be no difference between CDT and EDT. In fact, the entire point of decision theories is to give rules you would use to make decisions. EDT has a rule involving conditional probabilities of observed data (because EDT treats all observed data as evidence). CDT has a rule involving a causal connection between your action and the outcome. This rule implies, contrary to what you claimed, that a particular method must be used to get your answer from data (this method being given by the theory of identification of causal effects) on pain of getting garbage answers and going to jail.

Comment author: nshepperd 11 July 2013 05:47:54AM *  1 point [-]

You aren't bringing up hamsters with Newcomb's problem, why bring them up here?

I said why I was bringing them up: to make the point that blindly counting the number of events in a dataset satisfying (action = X, outcome = Y) is blatantly ridiculous, and this applies whether or not hamsters are involved. If you think EDT does that then either you are mistaken, or everyone studying EDT is a lot less sane than they look.

I think you misunderstand decision theory. If you were right, there would be no difference between CDT and EDT.

The difference is that CDT asks for P(utility | do(action), observations) and EDT asks for P(utility | action, observations). Neither CDT nor EDT specifies detailed rules for how to calculate these probabilities or update on observations, or what priors to use. Indeed, those rules are normally found in statistics textbooks, Pearl's Causality, or—in the case of the g-formula—random math papers.