Meetup : Prediction Markets and Futarchy

1 Anders_H 02 September 2014 02:13PM


WHEN: 07 September 2014 03:30:00PM (-0400)

WHERE: 98 Elm St Apt 1, Somerville, MA

I will give a talk about prediction markets and futarchy. The talk is intended as a basic introduction for people who are new to the concept. After my slides, I hope to have a discussion about whether futarchy is feasible.

Cambridge/Boston-area Less Wrong meetups start at 3:30pm, and have an alternating location:

1st Sunday meetups are at Citadel in Porter Sq, at 98 Elm St, apt 1, Somerville.

3rd Sunday meetups are in MIT's building 66 at 25 Ames St, room 156. Room number subject to change based on availability; signs will be posted with the actual room number.

(We also have last Wednesday meetups at Citadel at 7pm.)

Our default schedule is as follows:

—Phase 1: Arrival, greetings, unstructured conversation.

—Phase 2: The headline event. This starts promptly at 4pm, and lasts 30-60 minutes.

—Phase 3: Further discussion. We'll explore the ideas raised in phase 2, often in smaller groups.

—Phase 4: Dinner.


Meetup : Nick Bostrom Talk on Superintelligence

4 Anders_H 02 September 2014 02:09PM


WHEN: 04 September 2014 08:00:00PM (-0400)

WHERE: Emerson 105, Harvard University, Cambridge, MA

What happens when machines surpass humans in general intelligence? Will artificial agents save or destroy us? In his new book - Superintelligence: Paths, Dangers, Strategies - Professor Bostrom explores these questions, laying the foundation for understanding the future of humanity and intelligent life. Q&A will follow the talk.

http://harvardea.org/event/2014/09/04/bostrom/

(This event is organized by Harvard Effective Altruism. It is not technically a Less Wrong meetup, but the topic is highly relevant, and most of the Boston-area rationalist community will be there.)


Meetup : The Psychology of Video Games

1 Anders_H 11 August 2014 04:58AM


WHEN: 17 August 2014 03:30:00PM (-0400)

WHERE: Cambridge, MA

Kelly MacNeill will talk about the psychology of video games, starting at 4pm at MIT's Building 66.

Cambridge/Boston-area Less Wrong meetups start at 3:30pm, and have an alternating location:

1st Sunday meetups are at Citadel in Porter Sq, at 98 Elm St, apt 1, Somerville.

3rd Sunday meetups are in MIT's building 66 at 25 Ames St, room 156. Room number subject to change based on availability; signs will be posted with the actual room number.

(We also have last Wednesday meetups at Citadel at 7pm.)

Our default schedule is as follows:

—Phase 1: Arrival, greetings, unstructured conversation.

—Phase 2: The headline event. This starts promptly at 4pm, and lasts 30-60 minutes.

—Phase 3: Further discussion. We'll explore the ideas raised in phase 2, often in smaller groups.

—Phase 4: Dinner.


Ethical Choice under Uncertainty

3 Anders_H 10 August 2014 10:13PM


Most discussions of utilitarian ethics are attempts to determine the goodness of an outcome. For instance, discussions may focus on whether it would be ethical to increase total utility by increasing the total number of individuals while reducing their average utility. Or, one could argue about whether we should give more weight to those who are worst off when we aggregate utility over individuals.

These are all important questions. However, even if they were answered to everyone's satisfaction, the answers would not be sufficient to guide the choices of agents acting under uncertainty. To elaborate,  I believe textbook versions of utilitarianism are unsatisfactory for the following reasons: 

  1. Ethical theories that don't account for the agent's beliefs have absurd consequences, such as the claim that it is unethical to rescue a drowning child if the child goes on to become Hitler. Clearly, if we are interested in judging whether the agent is acting ethically, the only relevant consideration is his beliefs about the consequences at the time the choice is made. If we define "ethics" to require him to act on information from the future, it becomes impossible in principle to act ethically.
  2. In real life, there will be many situations where the agent makes a bad choice because he has incorrect beliefs about the consequences of his actions. Most people, if asked to judge the morality of a person who has pushed a fat man to his death, will consider it important to know whether the man believed he could save the lives of five children by doing so. Whether the belief is correct is not ethically relevant: there is a difference between stupidity and immorality.
  3. Real choices are never of the type "If you choose A, the fat man dies with probability 1, whereas if you choose B, the five children die with probability 1". Rather, they are of the type "If you choose A, the fat man dies with probability 0.5, the children die with probability 0.25, and they all die with probability 0.25". Choosing between such options requires a formalization of the concept of risk aversion as an integral component of the ethical theory.

I will attempt to fix this by providing the following definition of ethical choice, which is based on the same setup as von Neumann-Morgenstern expected utility theory:

An agent is making a decision, and can choose from a choice set A with elements (a1, a2, ..., an). The possible outcome states of the world are contained in the set W, with elements (w1, w2, ..., wm). The agent is uncertain about the consequences of his choice; he is not able to perfectly predict whether choosing a1 will lead to state w1, w2 or wm. In other words, for every element of the choice set, he has a separate subjective probability distribution ("prior") over W.

He also has a cardinal social welfare function f over possible states of the world.    The social welfare function may have properties such as risk aversion or risk neutrality over attributes of W.   Since the choice made by the agent is one aspect of the state of the world, the social welfare function may include terms for A. 

We define that the agent is acting "ethically" if he chooses the element of the choice set that maximizes the expected value of the social welfare function, under the agent's beliefs about the probability of each possible state of the world that could arise under that action:

max_a Σ_w Pr(W=w | a) * f(w, a)

Note here that "risk aversion" is defined as the second derivative of the social welfare function. For details, I will unfortunately have to refer the reader to a textbook on Decision Theory, such as Notes on the Theory of Choice.

The advantage of this setup is that it allows us to define the ethical choice precisely, in terms of the intentions and beliefs of the agent. For example, if an individual makes a bad choice because he honestly has a bad prior about the consequences of his choice, we interpret him as acting stupidly, but not unethically. However, ignorance is not a complete "get out of jail free" card: one element of the choice set is always "seek more information / update your prior". If your true prior says that you can maximize the expected social welfare function by updating your prior, the ethical choice is to seek more information (this is analogous to the decision-theoretic concept "value of information").

At this stage, the "social welfare function" is completely unspecified. Therefore, this definition places only minor constraints on what we mean by the word "ethics". Some ethical theories are special cases of this definition of ethical choice. For example, deontology is the special case where the social welfare function f(W,A) is independent of the state of the world and can be simplified to f(A). (If the social welfare function is constant over W, the summation over the prior cancels out.)

One thing that is ruled out by this definition is outcome-based consequentialism, where an agent is defined to act ethically if his actions lead to good realized outcomes. Note that under this type of consequentialism, it is impossible for an agent to know at the time of the decision what the correct choice is, because the ethical choice depends on random events that have not yet taken place. This definition of ethics excludes strategies that cannot be followed by a rational agent acting solely on information from the past. This is a feature, not a bug.

We now have a definition of acting ethically. However, it is not yet very useful: We have no way of knowing what the social welfare function looks like. The model simply rules out some pathological ethical theories that are not usable as decision theories, and gives us an appealing definition of ethical choice that allows us to distinguish "ignorance/stupidity" from "immorality". 

If nobody points out any errors that invalidate my reasoning, I will write another installment with some more speculative ideas about how we can attempt to determine what the social welfare function f(W,A) looks like.

 

--

I have no expertise in ethics, and most of my ideas will be obvious to anyone who has spent time thinking about decision theory. From my understanding of Cake or Death, it looks like similar ideas have been explored here previously, but with additional complications that are not necessary for my argument. I am puzzled by the fact that this line of thinking is not a central component of most ethical discussions, because I don't believe it is possible for a non-Omega agent to follow an ethical theory that does not explicitly account for uncertainty. My intuition is that unless there is a flaw in my reasoning, this is a neglected point that would be important to draw people's attention to, in a simple form with as few complications as possible. Hence this post.

This is a work in progress, and I would very much appreciate feedback on where it needs more work.

Some thoughts on where this idea needs more work:

  • While agents who have bad priors about the consequences of their actions are defined as acting stupidly rather than unethically, I am currently unclear about how to interpret the actions of agents who have incorrect beliefs about the social welfare function.
  • I am also unsure whether this setup excludes some reasonable forms of ethics, such as a scenario where we model the agent as simultaneously trying to optimize the social welfare function and his own utility function. In such a setup, we may want a definition of ethics that involves the rate of substitution between the two things he is optimizing. However, it is possible that this can be handled within my model by finding the right social welfare function.

 

Causal Inference Sequence Part II: Graphical Models

8 Anders_H 04 August 2014 11:10PM

(Part 2 of a Sequence on Applied Causal Inference. Follow-up to Part 1)

Saturated and Unsaturated Models

A model is a restriction on the possible states of the world: By specifying a model, you make a claim that you have knowledge about what the world does not look like.  

To illustrate this: if you have two binary predictors A and B, there are four groups defined by A and B, and four different values of E[Y|A,B]. Therefore, the regression E[Y|A,B] = β0 + β1*A + β2*B + β3*A*B is not a real model: there are four parameters and four values of E[Y|A,B], so the regression is saturated. In other words, the regression does not make any assumptions about the joint distribution of A, B and Y. Running this regression in statistical software will simply give you exactly the same estimates as you would have obtained by manually looking in each of the four groups defined by A and B and estimating the mean of Y.

If we instead fit the regression model E[Y|A,B] = β0 + β1*A + β2*B, we are making an assumption: we are assuming that there is no interaction between A and B on the average value of Y. In contrast to the previous regression, this is a true model: it makes the assumption that the value of β3 is 0. In other words, we are saying that the data did not come from a distribution where β3 is not equal to 0. If this assumption is not true, the model is wrong: we would have excluded the true state of the world.
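As a concrete illustration, here is a minimal sketch (assuming pandas and statsmodels are available; the data are made up) showing that the saturated regression simply reproduces the four group means, while the restricted model does not:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data with two binary predictors A and B.
df = pd.DataFrame({
    "A": [0, 0, 0, 0, 1, 1, 1, 1],
    "B": [0, 0, 1, 1, 0, 0, 1, 1],
    "Y": [1.0, 2.0, 3.0, 5.0, 2.0, 4.0, 9.0, 11.0],
})

# Saturated regression: four parameters for four values of E[Y|A,B].
saturated = smf.ols("Y ~ A + B + A:B", data=df).fit()

# Restricted model: assumes the interaction coefficient β3 is zero.
restricted = smf.ols("Y ~ A + B", data=df).fit()

# The saturated model's fitted values equal the four group means exactly;
# the restricted model's fitted values generally do not.
print(df.groupby(["A", "B"])["Y"].mean())
print(saturated.fittedvalues)
print(restricted.fittedvalues)
```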

In general, whenever you use models, think first about what the saturated model looks like, and then add assumptions by asking what parameters  you can reasonably assume are equal to a specific value (such as zero). The same type of logic applies to graphical models such as directed acyclic graphs (DAGs).  

We will talk about two types of DAGs:   Statistical DAGs are models for the joint distribution of the variables on the graph, whereas Causal DAGs are a special class of DAGs which can be used as models for the data generating mechanism.  

Statistical DAGs

A Statistical DAG is a graph that allows you to encode modelling assumptions about the joint distribution of the individual variables. These graphs do not necessarily have any causal interpretation.  

On a Statistical DAG, we represent modelling assumptions by missing arrows. Those missing arrows define the DAG in the same way that the missing term for β3 defines the regression model above. If there is a directed arrow between every pair of variables on the graph, the DAG is saturated, or complete. Complete DAGs make no modelling assumptions about the relationships between the variables, in the same way that a saturated regression model makes no modelling assumptions.

DAG Factorization

The arrows on DAGs are statements about how the joint distribution factorizes. To illustrate, consider the following complete DAG (where each individual patient in our study represents a realization of the joint distribution of the variables A, B, C and D):

[Figure: a complete DAG over A, B, C and D, with a directed arrow between every pair of variables]
Any joint distribution of A, B, C and D can be factorized algebraically according to the laws of probability as f(A,B,C,D) = f(D|C,B,A) * f(C|B,A) * f(B|A) * f(A). This factorization is always true; it does not require any assumptions about independence. By drawing a complete DAG, we are saying that we are not willing to make any further assumptions about how the distribution factorizes.

Assumptions are represented by missing arrows: every variable is assumed to be independent of the past, given its parents. Now, consider the following DAG with three missing arrows:

[Figure: a DAG with arrows A → B, A → D and C → D; the arrows from A to C, from B to C and from B to D are missing]
This DAG is defined by the assumption that C is independent of the joint distribution of A and B, and that D is independent of B given A and C. If this assumption is true, the distribution can be factorized as f(A,B,C,D) = f(D|C,A) * f(C) * f(B|A) * f(A). Unlike the factorization of the complete DAG, the above is not a tautology. It is the algebraic representation of the independence assumption that is represented by the missing arrows. The factorization is the modelling assumption: when arrows are missing, you are really saying that you have a priori knowledge about how the distribution factorizes.
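One way to see what this factorization claims is to write down a data generating process that respects it. Here is a minimal numpy sketch (all probabilities are arbitrary, chosen only for illustration): A is drawn first, B depends only on A, C depends on nothing, and D depends only on A and C:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Sample in an order consistent with the factorization
# f(A,B,C,D) = f(D|C,A) * f(C) * f(B|A) * f(A).
A = rng.binomial(1, 0.5, n)                    # f(A)
B = rng.binomial(1, 0.2 + 0.5 * A)             # f(B|A): depends on A only
C = rng.binomial(1, 0.3, n)                    # f(C): no parents
D = rng.binomial(1, 0.1 + 0.3 * A + 0.4 * C)   # f(D|C,A): no B term

# Check one implied independence: given A and C, knowing B carries no
# information about D (the two conditional means should agree up to
# sampling noise within every stratum of A and C).
for a in (0, 1):
    for c in (0, 1):
        m = (A == a) & (C == c)
        print(a, c, D[m & (B == 0)].mean(), D[m & (B == 1)].mean())
```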

 

D-Separation

When we make assumptions such as the ones that define a DAG, other independences may automatically follow as logical implications. The reason DAGs are useful is that you can use the graphs as a tool for reasoning about which independence statements are logical implications of the modelling assumptions. You could reason about this using algebra, but it is usually much harder. D-separation is a simple graphical criterion that gives you an immediate answer to whether a particular independence statement is a logical implication of the independences that define your model.

Two variables are independent (in all distributions that are consistent with the DAG) if there is no open path between them. This is called «d-separation». D-separation is useful because it allows us to determine whether a particular independence statement is true within our model. For example, if we want to know whether A is independent of B given C, we check whether A is d-separated from B on the graph where C is conditioned on.

A path between two variables is any set of consecutive edges connecting them. For determining whether a path exists, the direction of the arrows does not matter: A-->B-->C and A-->B<--C are both examples of paths between A and C. Using the rules of d-separation, you can determine whether paths are open or closed.

 

The Rules of D-Separation

Colliders:

If you are considering three variables, they can be connected in four different ways:

 A --> B --> C

 A <-- B <-- C

 A <-- B --> C

 A --> B <-- C

 

  • In the first three cases, B is a non-collider.
  • In the fourth case,  B is a collider: The arrows from A and C "collide" in B.
  • Non-colliders are (normally) open, whereas colliders are (normally) closed.
  • Colliders are defined relative to a specific pathway: B could be a collider on one pathway, and a non-collider on another pathway.

 

Conditioning:

If we compare individuals within levels of a covariate, that covariate is conditioned on. In an empirical study, this can happen either by design or by accident. On a graph, we represent "conditioning" by drawing a box around the variable. This is equivalent to introducing the variable behind the conditioning sign in the algebraic notation.

 

  • If a non-collider is conditioned on, it becomes closed.
  • If a collider is conditioned on, it is opened.
  • If a descendant of a collider is conditioned on, the collider is opened.

 

Open and Closed Paths:

 

  • A path is open if and only if all variables on the path are open. 
  • Two variables are d-separated if and only if there is no open path between them.
  • Two variables are d-separated conditional on a third variable if and only if there is no open path between them on a graph where the third variable has been conditioned on.
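If you want to check d-separation statements mechanically rather than by tracing paths, the networkx library (version 2.8 or later) implements the criterion. A sketch, using the three-missing-arrows DAG from the factorization section above:

```python
import networkx as nx

# The DAG from the factorization example: arrows A->B, A->D, C->D.
G = nx.DiGraph([("A", "B"), ("A", "D"), ("C", "D")])

# A and C are d-separated marginally: the only path between them,
# A -> D <- C, is closed because D is a collider.
print(nx.d_separated(G, {"A"}, {"C"}, set()))   # True

# Conditioning on the collider D opens that path.
print(nx.d_separated(G, {"A"}, {"C"}, {"D"}))   # False

# (In networkx 3.3+ the same check is spelled nx.is_d_separator.)
```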

 

Colliders:

Many students who first encounter D-separation are confused about why conditioning on a collider opens it.  Pearl uses the following thought experiment to illustrate what is going on:

Imagine you live in a world where there is a sprinkler that sometimes randomly turns on, regardless of the weather. In this world, whether the sprinkler is on is independent of rain:  If you notice that the sprinkler is on, this gives you no information about whether it rains. 

However, if the sprinkler is on, it will cause the grass to be wet. The same thing happens if it rains. Therefore, the grass being wet is a collider.   Now imagine that you have noticed that the grass is wet.  You also notice that the sprinkler is turned "off".  In this situation, because you have conditioned on the grass being wet,  the fact that the sprinkler is off allows you to conclude that it is probably raining.  

Faithfulness

D-separation says that if there is no open pathway between two variables, those variables are independent (in all distributions that factorize according to the DAG, i.e., in all distributions where the defining independences hold). This immediately raises the question of whether the logic also runs in the opposite direction: if there is an open pathway between two variables, does that mean they are correlated?

The quick answer is that this does not hold, at least not without additional assumptions. DAGs are defined by the assumptions that are represented by the missing arrows: any joint distribution where those independences hold can be represented by the DAG, even if there are additional independences that are not encoded. However, we usually think of two connected variables as correlated: this assumption is called faithfulness.

Causal DAGs

Causal DAGs are models for the data generating mechanism. The rules that apply to statistical DAGs - such as d-separation - are also valid for Causal DAGs.  If a DAG is «causal», we are simply making the following additional assumptions: 

  • The variables are in temporal (causal) order
  • If two variables on the DAG share a common cause, the common cause is also shown on the graph

If you are willing to make these assumptions, you can think of the Causal DAG as a map of the data generating mechanism. You can read the map as saying that all variables are generated by random processes with a deterministic component that depends only on the parents.  

For example, if variable Y has two parents A and U, the model says that Y = f(A, U, εY), where εY is a random error term. The shape of the function f is left completely unspecified, hence the name "non-parametric structural equations model". The primary assumption in the model is that the error terms of different variables are independent.

You can also think informally of the arrows as the causal effect of one variable on another:  If we change the value of A, this change would propagate to downstream variables, but not to variables that are not downstream.

Recall that DAGs are useful for reasoning about independences. Exchangeability assumptions are a special type of independence statement: they involve counterfactual variables. Counterfactual variables belong to the data generating mechanism; therefore, to reason about them, we will need Causal DAGs.

A simplified heuristic for thinking about Causal DAGs is as follows:   Correlation flows through any open pathway, but causation flows only in the forward direction.  If you are interested in estimating the causal effect of A on Y, you have to quantify the sum of all forward-going pathways from A to Y.   Any open pathway from A to Y which contains an arrow in the backwards direction will cause bias. 

In the next part of this sequence (which I hope to post next week), I will give a more detailed description of how we can use Causal DAGs to reason about bias in observational research, including confounding bias, selection bias and mismeasurement bias.

(Feedback is greatly appreciated: I invoke Crocker's rules. The most important type of feedback will be if you notice anything that is wrong or misleading. I also greatly appreciate feedback on whether the structure of the text works, whether the sentences are hard to parse, and whether there is any background information that needs to be included.)


Causal Inference Sequence Part 1: Basic Terminology and the Assumptions of Causal Inference

27 Anders_H 30 July 2014 08:56PM

(Part 1 of the Sequence on Applied Causal Inference)

 

In this sequence, I am going to present a theory of how we can learn about causal effects using observational data. As an example, we will imagine that you have collected information on a large number of Swedes - let us call them Sven, Olof, Göran, Gustaf, Annica, Lill-Babs, Elsa and Astrid. For every Swede, you have recorded data on their sex, whether they smoked, and whether they got cancer during the 10 years of follow-up. Your goal is to use this dataset to figure out whether smoking causes cancer.

We are going to use the letter A as a random variable representing whether they smoked; A can take the value 0 (did not smoke) or 1 (smoked). When we need to talk about the specific values A can take, we use lower case a as a placeholder for 0 or 1. We use the letter Y as a random variable representing whether they got cancer, and L to represent their sex.

The data-generating mechanism and the joint distribution of variables

Imagine you are looking at this data set:

ID (Name)    L (Sex)    A (Did they smoke?)    Y (Did they get cancer?)
Sven         Male       Yes                    Yes
Olof         Male       No                     Yes
Göran        Male       Yes                    Yes
Gustaf       Male       No                     No
Annica       Female     Yes                    Yes
Lill-Babs    Female     Yes                    No
Elsa         Female     Yes                    No
Astrid       Female     No                     No

 

 

This table records information about the joint distribution of the variables L, A and Y.  By looking at it, you can tell that 1/4 of the Swedes were men who smoked and got cancer, 1/8 were men who did not smoke and got cancer, 1/8 were men who did not smoke and did not get cancer etc.  

You can make all sorts of statistics that summarize aspects of the joint distribution.  One such statistic is the correlation between two variables.  If "sex" is correlated with "smoking", it means that if you know somebody's sex, this gives you information that makes it easier to predict whether they smoke.   If knowing about an individual's sex gives no information about whether they smoked, we say that sex and smoking are independent.  We use the symbol ∐ to mean independence. 
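As a concrete illustration, here is a minimal pandas sketch computing these joint-distribution summaries for the eight Swedes in the table above:

```python
import pandas as pd

# The eight Swedes from the table above.
df = pd.DataFrame({
    "L": ["Male"] * 4 + ["Female"] * 4,
    "A": ["Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No"],
    "Y": ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
})

# The joint distribution of L, A and Y, as proportions.
print(df.value_counts(normalize=True))

# Are sex and smoking independent?  Compare Pr(A | L) across levels of L:
# if the rows differ, knowing L helps predict A.
print(pd.crosstab(df["L"], df["A"], normalize="index"))
```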

When we are interested in causal effects, we are asking what would happen to the joint distribution if we intervened to change the value of a variable.  For example, how many Swedes would get cancer in a hypothetical world where you intervened to make sure they all quit smoking?  

In order to answer this, we have to ask questions about the data generating mechanism. The data generating mechanism is the algorithm that assigns value to the variables, and therefore creates the joint distribution. We will think of the data as being generated by three different algorithms: One for L, one for A and one for Y.    Each of these algorithms takes the previously assigned variables as input, and then outputs a value.    

Questions about the data generating mechanism include "Which variable has its value assigned first?", "Which variables from the past (observed or unobserved) are used as inputs?", and "If I change whether someone smokes, how will that change propagate to other variables that have their value assigned later?". The last of these questions can be rephrased as "What is the causal effect of smoking?".

The basic problem of causal inference is that the relationship between the set of possible data generating mechanisms, and the joint distribution of variables, is many-to-one:   For any correlation you observe in the dataset, there are many possible sets of algorithms for L, A and Y that could all account for the observed patterns. For example, if you are looking at a correlation between cancer and smoking, you can tell a story about cancer causing people to take up smoking, or a story about smoking causing people to get cancer, or a story about smoking and cancer sharing a common cause.  

An important thing to note is that even if you had data on absolutely everyone, you still would not be able to distinguish between the possible data generating mechanisms. The problem is not that you have a limited sample; this is therefore not a statistical problem. What you need to answer the question is not more people in your study, but a priori causal information. The purpose of this sequence is to show you how to reason about what prior causal information is necessary, and how to analyze the data if you have measured all the necessary variables.

Counterfactual Variables and "God's Table":

The first step of causal inference is to translate the English-language research question «What is the causal effect of smoking?» into a precise, mathematical language. One possible such language is based on counterfactual variables. These counterfactual variables allow us to encode the concept of "what would have happened if, possibly contrary to fact, the person had smoked".

We define one counterfactual variable called Ya=1 which represents the outcome in the person if he smoked, and another counterfactual variable called Ya=0 which represents the outcome if he did not smoke. Counterfactual variables such as Ya=0 are mathematical objects that represent part of the data generating mechanism:  The variable tells us what value the mechanism would assign to Y, if we intervened to make sure the person did not smoke. These variables are columns in an imagined dataset that we sometimes call “God’s Table”:

 

ID        A (Smoking)   Y (Cancer)   Ya=1   Ya=0
Sven      1             1            1      1
Olof      0             1            0      1
Göran     1             1            1      0
Gustaf    0             0            0      0

(Ya=1: whether they would have got cancer if they smoked; Ya=0: whether they would have got cancer if they didn't smoke.)

 

 

 

Let us start by making some points about this dataset.  First, note that the counterfactual variables are variables just like any other column in the spreadsheet.   Therefore, we can use the same type of logic that we use for any other variables.  Second, note that in our framework, counterfactual variables are pre-treatment variables:  They are determined long before treatment is assigned. The effect of treatment is simply to determine whether we see Ya=0 or Ya=1 in this individual.

If you had access to God's Table, you would immediately be able to look up the average causal effect, by comparing the column Ya=1 to the column Ya=0.  However, the most important point about God’s Table is that we cannot observe Ya=1 and Ya=0. We only observe the joint distribution of observed variables, which we can call the “Observed Table”:

 

ID        A   Y
Sven      1   1
Olof      0   1
Göran     1   1
Gustaf    0   0

 

 

The goal of causal inference is to learn about God’s Table using information from the observed table (in combination with a priori causal knowledge).  In particular, we are going to be interested in learning about the distributions of Ya=1 and Ya=0, and in how they relate to each other.  
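To make the distinction concrete, here is a small pandas sketch encoding the two tables above. The average causal effect can be read directly off God's Table; the observed table only supports the associational contrast E[Y|A=1] - E[Y|A=0], and in this toy dataset the two disagree:

```python
import pandas as pd

# God's Table for the four Swedes above (normally unobservable).
gods = pd.DataFrame({
    "ID":  ["Sven", "Olof", "Göran", "Gustaf"],
    "A":   [1, 0, 1, 0],
    "Y":   [1, 1, 1, 0],
    "Ya1": [1, 0, 1, 0],   # outcome under a=1
    "Ya0": [1, 1, 0, 0],   # outcome under a=0
})

# Average causal effect, straight from the counterfactual columns.
ace = gods["Ya1"].mean() - gods["Ya0"].mean()

# What the Observed Table supports: an associational contrast.
obs = gods[["ID", "A", "Y"]]
assoc = obs.loc[obs["A"] == 1, "Y"].mean() - obs.loc[obs["A"] == 0, "Y"].mean()

print(ace)    # 0.0: smoking has no average causal effect here
print(assoc)  # 0.5: yet smokers get cancer more often
```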

 

Randomized Trials

The "gold standard" for estimating a causal effect is to run a randomized controlled trial where we randomly assign the value of A. This study design works because you select one random subset of the study population where you observe Ya=0, and another random subset where you observe Ya=1. You therefore have unbiased information about the distributions of both Ya=0 and Ya=1.

An important thing to point out at this stage is that it is not necessary to use an unbiased coin to assign treatment, as long as you use the same coin for everyone. For instance, the probability of being randomized to A=1 can be 2/3. You will still see randomly selected subsets of the distributions of both Ya=0 and Ya=1; you will just have a larger number of people where you see Ya=1. Usually, randomized trials use unbiased coins, but this is simply because it increases the statistical power.

Also note that it is possible to run two different randomized controlled trials:  One in men, and another in women.  The first trial will give you an unbiased estimate of the effect in men, and the second trial will give you an unbiased estimate of the effect in women.  If both trials used the same coin, you could think of them as really being one trial. However, if the two trials used different coins, and you pooled them into the same database, your analysis would have to account for the fact that in reality, there were two trials. If you don’t account for this, the results will be biased.  This is called “confounding”. As long as you account for the fact that there really were two trials, you can still recover an estimate of the population average causal effect. This is called “Controlling for Confounding”.

In general, causal inference works by specifying a model that says the data came from a complex trial, i.e., one where nature assigned treatment using a biased coin that depends on the observed past. For such a trial, there will exist a valid way to recover the overall causal results, but it will require us to think carefully about what the correct analysis is.
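Here is a simulation sketch of that point (numpy assumed; the baseline risks and the effect size are invented): nature runs two trials with different coins, the naive pooled contrast is confounded, and the per-trial contrasts recover the true effect:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Two pooled trials: men are randomized with Pr(A=1) = 2/3, women with 1/3.
male = rng.binomial(1, 0.5, n)
A = rng.binomial(1, np.where(male == 1, 2 / 3, 1 / 3))

# True effect: treatment adds 0.1 to the risk in both sexes;
# men have a higher baseline risk (0.3 versus 0.1).
Y = rng.binomial(1, 0.1 + 0.2 * male + 0.1 * A)

# Naive pooled contrast: biased, because sex is associated with
# both the coin and the baseline risk (confounding).
print(Y[A == 1].mean() - Y[A == 0].mean())       # about 0.17, not 0.1

# Analyzing each trial separately recovers the true effect of 0.1.
for m in (0, 1):
    s = male == m
    print(m, Y[s & (A == 1)].mean() - Y[s & (A == 0)].mean())
```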

Assumptions of Causal Inference

We will now go through, in some more detail, why it is that randomized trials work, i.e., the important aspects of this study design that allow us to infer causal relationships, or facts about God's Table, using information about the joint distribution of observed variables.

We will start with an "observed table" and build towards "reconstructing" parts of God's Table. To do this, we will need three assumptions: positivity, consistency, and (conditional) exchangeability:

ID        A   Y
Sven      1   1
Olof      0   1
Göran     1   1
Gustaf    0   0

 

 

 

Positivity

Positivity is the assumption that every individual has a positive probability of receiving each value of the treatment variable: Pr(A=a) > 0 for all values of a. In other words, you need to have both people who smoke and people who don't smoke. If positivity does not hold, you will not have any information about the distribution of Ya for that value of a, and you will therefore not be able to make inferences about it.

We can check whether this assumption holds in the sample by checking whether there are both treated and untreated people. If you observe treated and untreated individuals in every stratum, you know that positivity holds in the sample.

If we observe a stratum where no individuals are treated (or none are untreated), this can be either for statistical reasons (you randomly did not sample them) or for structural reasons (individuals with these covariates are deterministically never treated). As we will see later, our models can handle random violations, but not structural violations.

In a randomized controlled trial, positivity holds because you will use a coin that has a positive probability of assigning people to either arm of the trial.
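In practice, a quick first check is to tabulate treatment within every stratum of the measured covariates. A minimal pandas sketch (the data are hypothetical):

```python
import pandas as pd

# Hypothetical covariate L and treatment A.
df = pd.DataFrame({
    "L": ["Male", "Male", "Female", "Female", "Female"],
    "A": [1, 0, 1, 1, 1],
})

# Count treated and untreated individuals within each stratum of L.
table = pd.crosstab(df["L"], df["A"])
print(table)

# Strata with a zero cell are possible positivity violations.  Whether
# a violation is random or structural cannot be read off the data.
print(table[table.min(axis=1) == 0])
```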

Consistency

The next assumption we are going to make is that if an individual happens to have treatment (A=1), we will observe the counterfactual variable Ya=1 in this individual. This is the observed table after we make the consistency assumption:

ID        A   Y   Ya=1   Ya=0
Sven      1   1   1      *
Olof      0   1   *      1
Göran     1   1   1      *
Gustaf    0   0   *      0

 

 

 

 

Making the consistency assumption got us halfway to our goal. We now have a lot of information about Ya=1 and Ya=0. However, half of the data is still missing.

Although consistency seems obvious, it is an assumption, not something that is true by definition. We can expect the consistency assumption to hold if we have a well-defined intervention (i.e., the intervention is a well-defined choice, not an attribute of the individual) and there is no causal interference (one individual's outcome is not affected by whether another individual was treated).

Consistency may not hold if the intervention is not well-defined. For example, there may be multiple types of cigarettes. When you measure Ya=1 in people who smoked, it will actually be a composite of multiple counterfactual variables: one for people who smoked regular cigarettes (call it Ya=1*) and another for people who smoked e-cigarettes (call it Ya=1#). Since you failed to specify whether you are interested in the effect of regular cigarettes or of e-cigarettes, the construct Ya=1 is a composite without any meaning, and people will be unable to use your results to predict the consequences of their actions.

Exchangeability

To complete the table, we require an additional assumption on the nature of the data. We call this assumption “Exchangeability”.  One possible exchangeability assumption is “Ya=0 ∐ A and Ya=1 ∐ A”.   This is the assumption that says “The data came from a randomized controlled trial”. If this assumption is true, you will observe a random subset of the distribution of Ya=0 in the group where A=0, and a random subset of the distribution of Ya=1 in the group where A=1.

Exchangeability is a statement about two variables being independent of each other: having information about either one of them will not help you predict the value of the other. Sometimes, variables which are not independent are "conditionally independent". For example, it is possible that knowing somebody's race helps you predict whether they enjoy eating Hakarl, an Icelandic form of fermented fish. However, it is also possible that this is just a marker for whether they were born in the ethnically homogeneous Iceland. In such a situation, it is possible that once you already know whether somebody is from Iceland, also knowing their race gives you no additional clues as to whether they will enjoy Hakarl. In this case, the variables "race" and "enjoying Hakarl" are conditionally independent given nationality.

The reason we care about conditional independence is that sometimes you may be unwilling to assume that marginal exchangeability Ya=1 ∐ A holds, but you are willing to assume conditional exchangeability Ya=1 ∐ A | L. In this example, let L be sex. The assumption then says that you can interpret the data as if they came from two different randomized controlled trials: one in men, and one in women. If that is the case, sex is a "confounder". (We will give a definition of confounding in Part 2 of this sequence.)

If the data came from two different randomized controlled trials, one possible approach is to analyze these trials separately. This is called “stratification”.  Stratification gives you effect measures that are conditional on the confounders:  You get one measure of the effect in men, and another in women.  Unfortunately, in more complicated settings, stratification-based methods (including regression) are always biased. In those situations, it is necessary to focus the inference on the marginal distribution of Ya.

Identification

If marginal exchangeability holds (i.e., if the data came from a marginally randomized trial), making inferences about the marginal distribution of Ya is easy: you can just estimate E[Ya] as E[Y|A=a].

However, if the data came from a conditionally randomized trial, we will need to think a little bit harder about how to say anything meaningful about E[Ya]. This process is the central idea of causal inference. We call it “identification”:  The idea is to write an expression for the distribution of a counterfactual variable, purely in terms of observed variables.  If we are able to do this, we have sufficient information to estimate causal effects just by looking at the relevant parts of the joint distribution of observed variables.

The simplest example of identification is standardization.  As an example, we will show a simple proof:

Begin by using the law of total probability to factor out the confounder, in this case L:

  • E(Ya) = Σl E(Ya | L=l) * Pr(L=l)    (the summation is over the values l of L)

We do this because we know we need to introduce L behind the conditioning sign, in order to be able to use our exchangeability assumption in the next step. Then, because Ya ∐ A | L, we are allowed to introduce A=a behind the conditioning sign:

  • E(Ya) = Σl E(Ya | A=a, L=l) * Pr(L=l)

Finally, use the consistency assumption: because we are in the stratum where A=a in all individuals, we can replace Ya with Y:

  • E(Ya) = Σl E(Y | A=a, L=l) * Pr(L=l)

 

We now have an expression for the counterfactual in terms of quantities that can be observed in the real world, i.e., in terms of the joint distribution of A, Y and L. In other words, we have linked the data generating mechanism to the joint distribution: we have "identified" E(Ya). We can therefore estimate E(Ya).
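The three steps above translate directly into a plug-in estimator. A minimal pandas sketch (the data frame is hypothetical; the columns follow the notation of this post):

```python
import pandas as pd

# Hypothetical observed data: confounder L, treatment A, outcome Y.
df = pd.DataFrame({
    "L": [0, 0, 0, 0, 1, 1, 1, 1],
    "A": [0, 0, 1, 1, 0, 1, 1, 1],
    "Y": [0, 1, 1, 1, 0, 0, 1, 1],
})

def standardized_mean(df, a):
    """Plug-in estimate of E(Ya) = sum_l E[Y | A=a, L=l] * Pr(L=l).

    Requires positivity: every stratum of L must contain people with A=a.
    """
    pr_l = df["L"].value_counts(normalize=True)
    mean_y = df[df["A"] == a].groupby("L")["Y"].mean()
    return sum(mean_y[l] * pr_l[l] for l in pr_l.index)

# Standardized estimates of E(Ya=1), E(Ya=0), and their difference.
print(standardized_mean(df, 1), standardized_mean(df, 0))
print(standardized_mean(df, 1) - standardized_mean(df, 0))
```

This is exactly the standardization formula: an average of the stratum-specific means of Y under A=a, weighted by the distribution of the confounder L.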

This identifying expression is valid if and only if L is the only confounder. If we had not observed sufficient variables to obtain conditional exchangeability, it would not be possible to identify the distribution of Ya: there would be intractable confounding.

Identification is the core concept of causal inference: It is what allows us to link the data generating mechanism to the joint distribution, to something that can be observed in the real world. 

 

The difference between epidemiology and biostatistics

Many people see Epidemiology as «Applied Biostatistics». This is a misconception. In reality, epidemiology and biostatistics are completely different parts of the problem. To illustrate what is going on, consider this figure:

[Figure: the data generating mechanism creates the joint distribution of observed variables; sampling from the joint distribution creates the data]
The data generating mechanism first creates a joint distribution of observed variables. Then, we sample from the joint distribution to obtain data. Biostatistics asks: if we have a sample, what can we learn about the joint distribution? Epidemiology asks: if we have all the information about the joint distribution, what can we learn about the data generating mechanism? This is a much harder problem, but it can still be analyzed with some rigor.

Epidemiology without Biostatistics is always impossible:  It would not be possible to learn about the data generating mechanism without asking questions about the joint distribution. This usually involves sampling.  Therefore, we will need good statistical estimators of the joint distribution.

Biostatistics without Epidemiology is usually pointless: the joint distribution of observed variables is simply not interesting in itself. You could claim that randomized trials are an example of biostatistics without epidemiology. However, the epidemiology is still there; it is just not necessary to think about it, because the epidemiologic part of the analysis is trivial.

Note that the word "bias" means different things in Epidemiology and Biostatistics. In Biostatistics, "bias" is a property of a statistical estimator: we talk about whether ŷ is a biased estimator of E(Y|A). If an estimator is biased, it means that when you use data from a sample to make inferences about the joint distribution in the population the sample came from, there will be a systematic source of error.

In Epidemiology, “bias” means that you are estimating the wrong thing:  Epidemiological bias is a question about whether E(Y|A) is a valid identification of E(Ya).   If there is epidemiologic bias, it means that you estimated something in the joint distribution, but that this something does not answer the question you were interested in.    

These are completely different concepts. Both are important and can lead to your estimates being wrong. It is possible for a statistically valid estimator to be biased in the epidemiologic sense, and vice versa.   For your results to be valid, your estimator must be unbiased in both senses.

 


Sequence Announcement: Applied Causal Inference

24 Anders_H 30 July 2014 08:55PM

Applied Causal Inference for Observational Research

This sequence is an introduction to basic causal inference.  It was originally written as auxiliary notes for a course in Epidemiology, but it is relevant to almost any kind of applied statistical research, including econometrics, sociology, psychology, political science etc.  I would not be surprised if you guys find a lot of errors, and I would be very grateful if you point them out in the comments. This will help me improve my course notes and potentially help me improve my understanding of the material. 

For mathematically inclined readers, I recommend skipping this sequence and instead reading Pearl's book on Causality. There is also a lot of good material on causal graphs on Less Wrong itself. Also, note that my thesis advisor is writing a book that covers the same material in more detail; the first two parts are available for free at his website.

Pearl's book, Miguel's book and Eliezer's writings are all more rigorous and precise than my sequence.  This is partly because I have a different goal:  Pearl and Eliezer are writing for mathematicians and theorists who may be interested in contributing to the theory.  Instead,  I am writing for consumers of science who want to understand correlation studies from the perspective of a more rigorous epistemology.  

I will use epidemiological/counterfactual notation rather than Pearl's notation. I apologize if this is confusing. The two approaches refer to the same mathematical objects; it is just a different notation. Whereas Pearl would use the "do-operator" E[Y|do(a)], I use counterfactual variables E[Ya]. Instead of using Pearl's "do-calculus" for identification, I use Robins' G-Formula, which gives the same results.

For all applications, I will use the letter "A" to represent "treatment" or "exposure" (the thing we want to estimate the effect of),  Y to represent the outcome, L to represent any measured confounders, and U to represent any unmeasured confounders. 

Outline of Sequence:

I hope to publish one post every week.  I have rough drafts for the following eight sections, and will keep updating this outline with links as the sequence develops:


Part 0:  Sequence Announcement / Introduction (This post)

Part 1:  Basic Terminology and the Assumptions of Causal Inference

Part 2:  Graphical Models

Part 3:  Using Causal Graphs to Understand Bias

Part 4:  Time-Dependent Exposures

Part 5:  The G-Formula

Part 6:  Inverse Probability Weighting

Part 7:  G-Estimation of Structural Nested Models and Instrumental Variables

Part 8:  Single World Intervention Graphs, Cross-World Counterfactuals and Mediation Analysis

 

 Introduction: Why Causal Inference?

The goal of applied statistical research is almost always to learn about causal effects. However, causal inference from observational data is hard, to the extent that it is usually not even possible without strong, almost heroic assumptions. Because of the inherent difficulty of the task, many old-school investigators were trained to avoid making causal claims. Words like "cause" and "effect" were banished from polite company, and the slogan "correlation does not imply causation" became an article of faith which, when said loudly enough, seemingly absolved the investigators of the sin of making causal claims.

However, readers were not fooled:  They always understood that epidemiologic papers were making causal claims.  Of course they were making causal claims; why else would anybody be interested in a paper about the correlation between two variables?   For example, why would anybody want to know about the correlation between eating nuts and longevity, unless they were wondering if eating nuts would cause them to live longer?

When readers interpreted these papers causally, were they simply ignoring the caveats, drawing conclusions that were not intended by the authors?   Of course they weren’t.  The discussion sections of epidemiologic articles are full of “policy implications” and speculations about biological pathways that are completely contingent on interpreting the findings causally. Quite clearly, no matter how hard the investigators tried to deny it, they were making causal claims. However, they were using methodology that was not designed for causal questions, and did not have a clear language for reasoning about where the uncertainty about causal claims comes from. 

This was not sustainable, and inevitably led to a crisis of confidence, which culminated when some high-profile randomized trials showed completely different results from the preceding observational studies.  In one particular case, when the Women’s Health Initiative trial showed that post-menopausal hormone replacement therapy increases the risk of cardiovascular disease, the difference was so dramatic that many thought-leaders in clinical medicine completely abandoned the idea of inferring causal relationships from observational data.

It is important to recognize that the problem was not that the results were wrong. The problem was that there was uncertainty that was not taken seriously by the investigators. A rational person who wants to learn about the world will be willing to accept that studies have margins of error, but only as long as the investigators make a good-faith effort to examine the sources of error and communicate clearly about this uncertainty to their readers. Old-school epidemiology failed at this. We are not going to make the same mistake. Instead, we are going to develop a clear, precise language for reasoning about uncertainty and bias.

In this context, we are going to talk about two sources of uncertainty – “statistical” uncertainty and “epidemiological” uncertainty. 

We are going to use the word “Statistics” to refer to the theory of how we can learn about correlations from limited samples.  For statisticians, the primary source of uncertainty is sampling variability. Statisticians are very good at accounting for this type of uncertainty: Concepts such as “standard errors”, “p-values” and “confidence intervals” are all attempts at quantifying and communicating the extent of uncertainty that results from sampling variability.

The old school of epidemiology would tell you to stop after you had found the correlations and accounted for the sampling variability. They believed going further was impossible. However, correlations are simply not interesting. If you truly believed that correlations tell you nothing about causation, there would be no point in doing the study.

Therefore, we are going to use the terms “Epidemiology” or “Causal Inference” to refer to the next stage in the process:  Learning about causation from correlations.  This is a much harder problem, with many additional sources of uncertainty, including confounding and selection bias. However, recognizing that the problem is hard does not mean that you shouldn't try, it just means that you have to be careful. As we will see, it is possible to reason rigorously about whether correlation really does imply causation in your particular study: You will just need a precise language. The goal of this sequence is simply to give you such a language.

In order to teach you the logic of this language, we are going to make several controversial statements, such as «The only way to estimate a causal effect is to run a randomized controlled trial». You may not be willing to believe this at first, but in order to understand the logic of causal inference, you must at least be willing to suspend your disbelief and accept it as true within the course.

It is important to note that we are not just saying this to try to convince you to give up on observational studies in favor of randomized controlled trials.   We are making this point because understanding it is necessary in order to appreciate what it means to control for confounding: It is not possible to give a coherent meaning to the word “confounding” unless one is trying to determine whether it is reasonable to model the data as if it came from a complex randomized trial run by nature. 

 

--

When we say that causal inference is hard, we do not mean that it is difficult to learn the basic concepts of the theory. What we mean is that even if you fully understand everything that has ever been written about causal inference, it is going to be very hard to infer a causal relationship from observational data, and there will always be uncertainty about the results. This is why this sequence is not going to be a workshop that teaches you how to apply magic causal methodology. What we are interested in is developing your ability to reason honestly about where uncertainty and bias come from, so that you can communicate this to the readers of your studies. What we want to teach you about is the epistemology that underlies epidemiological and statistical research with observational data.

Insisting on only using randomized trials may seem attractive to a purist, but it does not take much imagination to see that there are situations where it is important to predict the consequences of an action, yet impossible to run a trial. In such situations, there may be Bayesian evidence to be found in nature. This evidence comes in the form of correlations in observational data. When we are stuck with this type of evidence, it is important that we have a clear framework for assessing its strength.

 

--

 

I am publishing Part 1 of the sequence at the same time as this introduction. I would be very interested in hearing feedback, particularly about whether people feel this has already been covered in sufficient detail on Less Wrong.  If there is no demand, there won't really be any point in transforming the rest of my course notes to a Less Wrong format. 

Thanks to everyone who had a look at this before I published, including paper-machine and Vika, Janos, Eloise and Sam from the Boston Meetup group. 
