## Connection Theory Has Less Than No Evidence

I’m a member of the Bay Area Effective Altruist movement. I wanted to make my first post here to share some concerns I have about Leverage Research.

At parties, I often hear Leverage folks claiming they've pretty much solved psychology. They assign credit to their central research project: Connection Theory.

Amazingly, I have never found Connection Theory endorsed by even a single conventionally educated person with knowledge of psychology. Yet some of my most intelligent friends end up deciding that Connection Theory seems promising enough to be given the benefit of the doubt. They usually give black-box reasons for supporting it, like, “I don’t feel confident assigning less than a 1% chance that it’s correct, and if it works, it would be super valuable. Therefore it’s very high EV!” They hedge like this as though psychology were a field that couldn’t be probed by science or understood in any level of detail. I would argue that this approach is too forgiving and charitable when you can instead just analyze the theory using standard scientific reasoning. You could also assess its credibility based on standard quality markers or even the perceived quality of the work going into developing the theory.

To start, here are some warning signs for Connection Theory:

- Invented by amateurs without knowledge of psychology
- Never published for scrutiny in any peer-reviewed venue, conference, open access journal, or even a non-peer-reviewed venue of any type
- Unknown outside of the research community that created it
- Vaguely specified
- Cites no references
- Created in a vacuum from first principles
- Contains disproven Cartesian assumptions about mental processes
- Unaware of the frontier of current psychology research
- Consists entirely of poorly conducted, unpublished case studies
- Unusually lax methodology... even for psychology experiments
- Data from early studies shows a "100% success rate" -- the way only a grade-schooler would forge their results
- In a 2013 talk at Leverage Research, the creator of Connection Theory refused to acknowledge the possibility that his techniques could ever fail to produce correct answers.
- In that same talk, when someone pointed out a hypothetical way that an incorrect answer could be produced by Connection Theory, the creator countered that if that case occurred, Connection Theory would still be right by relying on a redefinition of the word “true”.
- The creator of Connection Theory brags about how he intentionally targets high net worth individuals for “mind charting” sessions so he can gather information about their motivation that he later uses to solicit large amounts of money from them.

I don't know about you, but most people get off this crazy train somewhere around stop #1. And given the rest, can you really blame them? The average person who sets themselves up to consider (and possibly believe) ideas this insane doesn't have long before they end up pumping all their money into get-rich-quick schemes or drinking bleach to try to improve their health.

But maybe you think you’re different? Maybe you’re sufficiently epistemically advanced that you don't have to disregard theories with this many red flags. In that case, there's now an even more fundamental reason to reject Connection Theory: As Alyssa Vance points out, the supposed "advance predictions" attributed to Connection Theory (the predictions being claimed as evidence in its favor in the only publicly available manuscript about it), are just ad hoc predictions made up by the researchers themselves on a case by case basis -- with little to no input from Connection Theory itself. This kind of error is why there has been a distinct field called "Philosophy of Science" for the past 50 years. And it's why people attempting to do science need to learn a little about it before proposing theories with so little content that they can't even be wrong.

I mention all this because I find that people from outside the Bay Area or those with very little contact with Leverage often think that Connection Theory is part of a bold and noble research program that’s attacking a valuable problem with reports of steady progress and even some plausible hope of success. Instead, I would counsel newcomers to the effective altruist movement to be careful how much you trust Leverage and not to put too much faith in Connection Theory.

## [link] Why Psychologists' Food Fight Matters

Why Psychologists’ Food Fight Matters: “Important findings” haven’t been replicated, and science may have to change its ways. By Michelle N. Meyer and Christopher Chabris. *Slate*, July 31, 2014. [Via Steven Pinker's Twitter account, who adds: "Lesson for sci journalists: Stop reporting single studies, no matter how sexy (these are probably false). Report lit reviews, meta-analyses."] Some excerpts:

Psychologists are up in arms over, of all things, the editorial process that led to the recent publication of a special issue of the journal Social Psychology. This may seem like a classic case of ivory tower navel gazing, but its impact extends far beyond academia. The issue attempts to replicate 27 “important findings in social psychology.” Replication—repeating an experiment as closely as possible to see whether you get the same results—is a cornerstone of the scientific method. Replication of experiments is vital not only because it can detect the rare cases of outright fraud, but also because it guards against uncritical acceptance of findings that were actually inadvertent false positives, helps researchers refine experimental techniques, and affirms the existence of new facts that scientific theories must be able to explain.

One of the articles in the special issue reported a failure to replicate a widely publicized 2008 study by Simone Schnall, now tenured at Cambridge University, and her colleagues. In the original study, two experiments measured the effects of people’s thoughts or feelings of cleanliness on the harshness of their moral judgments. In the first experiment, 40 undergraduates were asked to unscramble sentences, with one-half assigned words related to cleanliness (like pure or pristine) and one-half assigned neutral words. In the second experiment, 43 undergraduates watched the truly revolting bathroom scene from the movie Trainspotting, after which one-half were told to wash their hands while the other one-half were not. All subjects in both experiments were then asked to rate the moral wrongness of six hypothetical scenarios, such as falsifying one’s résumé and keeping money from a lost wallet. The researchers found that priming subjects to think about cleanliness had a “substantial” effect on moral judgment: The hand washers and those who unscrambled sentences related to cleanliness judged the scenarios to be less morally wrong than did the other subjects. The implication was that people who feel relatively pure themselves are—without realizing it—less troubled by others’ impurities. The paper was covered by ABC News, the Economist, and the Huffington Post, among other outlets, and has been cited nearly 200 times in the scientific literature.

However, the replicators—David Johnson, Felix Cheung, and Brent Donnellan (two graduate students and their adviser) of Michigan State University—found no such difference, despite testing about four times more subjects than the original studies. [...]

The editor in chief of Social Psychology later agreed to devote a follow-up print issue to responses by the original authors and rejoinders by the replicators, but as Schnall told Science, the entire process made her feel “like a criminal suspect who has no right to a defense and there is no way to win.” The Science article covering the special issue was titled “Replication Effort Provokes Praise—and ‘Bullying’ Charges.” Both there and in her blog post, Schnall said that her work had been “defamed,” endangering both her reputation and her ability to win grants. She feared that by the time her formal response was published, the conversation might have moved on, and her comments would get little attention.

How wrong she was. In countless tweets, Facebook comments, and blog posts, several social psychologists seized upon Schnall’s blog post as a cri de coeur against the rising influence of “replication bullies,” “false positive police,” and “data detectives.” For “speaking truth to power,” Schnall was compared to Rosa Parks. The “replication police” were described as “shameless little bullies,” “self-righteous, self-appointed sheriffs” engaged in a process “clearly not designed to find truth,” “second stringers” who were incapable of making novel contributions of their own to the literature, and—most succinctly—“assholes.” Meanwhile, other commenters stated or strongly implied that Schnall and other original authors whose work fails to replicate had used questionable research practices to achieve sexy, publishable findings. At one point, these insinuations were met with threats of legal action. [...]

Unfortunately, published replications have been distressingly rare in psychology. A 2012 survey of the top 100 psychology journals found that barely 1 percent of papers published since 1900 were purely attempts to reproduce previous findings. Some of the most prestigious journals have maintained explicit policies against replication efforts; for example, the Journal of Personality and Social Psychology published a paper purporting to support the existence of ESP-like “precognition,” but would not publish papers that failed to replicate that (or any other) discovery. Science publishes “technical comments” on its own articles, but only if they are submitted within three months of the original publication, which leaves little time to conduct and document a replication attempt.

The “replication crisis” is not at all unique to social psychology, to psychological science, or even to the social sciences.
As Stanford epidemiologist John Ioannidis famously argued almost a decade ago, “Most research findings are false for most research designs and for most fields.” Failures to replicate and other major flaws in published research have since been noted throughout science, including in cancer research, research into the genetics of complex diseases like obesity and heart disease, stem cell research, and studies of the origins of the universe. Earlier this year, the National Institutes of Health stated “The complex system for ensuring the reproducibility of biomedical research is failing and is in need of restructuring.”

Given the stakes involved and its centrality to the scientific method, it may seem perplexing that replication is the exception rather than the rule. The reasons why are varied, but most come down to the perverse incentives driving research. Scientific journals typically view “positive” findings that announce a novel relationship or support a theoretical claim as more interesting than “negative” findings that say that things are unrelated or that a theory is not supported. The more surprising the positive finding, the better, even though surprising findings are statistically less likely to be accurate. Since journal publications are valuable academic currency, researchers—especially those early in their careers—have strong incentives to conduct original work rather than to replicate the findings of others. Replication efforts that do happen but fail to find the expected effect are usually filed away rather than published. That makes the scientific record look more robust and complete than it is—a phenomenon known as the “file drawer problem.”

The emphasis on positive findings may also partly explain the fact that when original studies are subjected to replication, so many turn out to be false positives.
The near-universal preference for counterintuitive, positive findings gives researchers an incentive to manipulate their methods or poke around in their data until a positive finding crops up, a common practice known as “p-hacking” because it can result in p-values, or measures of statistical significance, that make the results look stronger, and therefore more believable, than they really are. [...]

The recent special issue of Social Psychology was an unprecedented collective effort by social psychologists to [rectify this situation]—by altering researchers’ and journal editors’ incentives in order to check the robustness of some of the most talked-about findings in their own field. Any researcher who wanted to conduct a replication was invited to preregister: Before collecting any data from subjects, they would submit a proposal detailing precisely how they would repeat the original study and how they would analyze the data. Proposals would be reviewed by other researchers, including the authors of the original studies, and once approved, the study’s results would be published no matter what. Preregistration of the study and analysis procedures should deter p-hacking, guaranteed publication should counteract the file drawer effect, and a requirement of large sample sizes should make it easier to detect small but statistically meaningful effects.

The results were sobering. At least 10 of the 27 “important findings” in social psychology were not replicated at all. In the social priming area, only one of seven replications succeeded. [...]

One way to keep things in perspective is to remember that scientific truth is created by the accretion of results over time, not by the splash of a single study. A single failure-to-replicate doesn’t necessarily invalidate a previously reported effect, much less imply fraud on the part of the original researcher—or the replicator.
Researchers are most likely to fail to reproduce an effect for mundane reasons, such as insufficiently large sample sizes, innocent errors in procedure or data analysis, and subtle factors about the experimental setting or the subjects tested that alter the effect in question in ways not previously realized.

Caution about single studies should go both ways, though. Too often, a single original study is treated—by the media and even by many in the scientific community—as if it definitively establishes an effect. Publications like Harvard Business Review and idea conferences like TED, both major sources of “thought leadership” for managers and policymakers all over the world, emit a steady stream of these “stats and curiosities.” Presumably, the HBR editors and TED organizers believe this information to be true and actionable. But most novel results should be initially regarded with some skepticism, because they too may have resulted from unreported or unnoticed methodological quirks or errors. Everyone involved should focus their attention on developing a shared evidence base that consists of robust empirical regularities—findings that replicate not just once but routinely—rather than of clever one-off curiosities. [...]

Scholars, especially scientists, are supposed to be skeptical about received wisdom, develop their views based solely on evidence, and remain open to updating those views in light of changing evidence. But as psychologists know better than anyone, scientists are hardly free of human motives that can influence their work, consciously or unconsciously. It’s easy for scholars to become professionally or even personally invested in a hypothesis or conclusion.
These biases are addressed partly through the peer review process, and partly through the marketplace of ideas—by letting researchers go where their interest or skepticism takes them, encouraging their methods, data, and results to be made as transparent as possible, and promoting discussion of differing views. The clashes between researchers of different theoretical persuasions that result from these exchanges should of course remain civil; but the exchanges themselves are a perfectly healthy part of the scientific enterprise.

This is part of the reason why we cannot agree with a more recent proposal by Kahneman, who had previously urged social priming researchers to put their house in order. He contributed an essay to the special issue of Social Psychology in which he proposed a rule—to be enforced by reviewers of replication proposals and manuscripts—that authors “be guaranteed a significant role in replications of their work.” Kahneman proposed a specific process by which replicators should consult with original authors, and told Science that in the special issue, “the consultations did not reach the level of author involvement that I recommend.”

Collaboration between opposing sides would probably avoid some ruffled feathers, and in some cases it could be productive in resolving disputes. With respect to the current controversy, given the potential impact of an entire journal issue on the robustness of “important findings,” and the clear desirability of buy-in by a large portion of psychology researchers, it would have been better for everyone if the original authors’ comments had been published alongside the replication papers, rather than left to appear afterward. But consultation or collaboration is not something replicators owe to original researchers, and a rule to require it would not be particularly good science policy.

Replicators have no obligation to routinely involve original authors because those authors are not the owners of their methods or results.
By publishing their results, original authors state that they have sufficient confidence in them that they should be included in the scientific record. That record belongs to everyone. Anyone should be free to run any experiment, regardless of who ran it first, and to publish the results, whatever they are.

[...] some critics of replication drives have been too quick to suggest that replicators lack the subtle expertise to reproduce the original experiments. One prominent social psychologist has even argued that tacit methodological skill is such a large factor in getting experiments to work that failed replications have no value at all (since one can never know if the replicators really knew what they were doing, or knew all the tricks of the trade that the original researchers did), a surprising claim that drew sarcastic responses. [See LW discussion.] [...]

Psychology has long been a punching bag for critics of “soft science,” but the field is actually leading the way in tackling a problem that is endemic throughout science. The replication issue of Social Psychology is just one example. The Association for Psychological Science is pushing for better reporting standards and more study of research practices, and at its annual meeting in May in San Francisco, several sessions on replication were filled to overflowing. International collaborations of psychologists working on replications, such as the Reproducibility Project and the Many Labs Replication Project (which was responsible for 13 of the 27 replications published in the special issue of Social Psychology) are springing up.

Even the most tradition-bound journals are starting to change. The Journal of Personality and Social Psychology—the same journal that, in 2011, refused to even consider replication studies—recently announced that although replications are “not a central part of its mission,” it’s reversing this policy.
We wish that JPSP would see replications as part of its central mission and not relegate them, as it has, to an online-only ghetto, but this is a remarkably nimble change for a 50-year-old publication. Other top journals, most notable among them Perspectives in Psychological Science, are devoting space to systematic replications and other confirmatory research. The leading journal in behavior genetics, a field that has been plagued by unreplicable claims that particular genes are associated with particular behaviors, has gone even further: It now refuses to publish original findings that do not include evidence of replication.

A final salutary change is an overdue shift of emphasis among psychologists toward establishing the size of effects, as opposed to disputing whether or not they exist. The very notion of “failure” and “success” in empirical research is urgently in need of refinement. When applied thoughtfully, this dichotomy can be useful shorthand (and we’ve used it here). But there are degrees of replication between success and failure, and these degrees matter.

For example, suppose an initial study of an experimental drug for cardiovascular disease suggests that it reduces the risk of heart attack by 50 percent compared to a placebo pill. The most meaningful question for follow-up studies is not the binary one of whether the drug’s effect is 50 percent or not (did the first study replicate?), but the continuous one of precisely how much the drug reduces heart attack risk. In larger subsequent studies, this number will almost inevitably drop below 50 percent, but if it remains above 0 percent for study after study, then the best message should be that the drug is in fact effective, not that the initial results “failed to replicate.”

## August 2014 Media Thread

This is the monthly thread for posting media of various types that you've found that you enjoy. Post what you're reading, listening to, watching, and your opinion of it. Post recommendations to blogs. Post whatever media you feel like discussing! To see previous recommendations, check out the older threads.

Rules:

- Please avoid downvoting recommendations just because you don't personally like the recommended material; remember that liking is a two-place word. If you can point out a specific flaw in a person's recommendation, consider posting a comment to that effect.
- If you want to post something that (you know) has been recommended before, but have another recommendation to add, please link to the original, so that the reader has both recommendations.
- Please use the comment trees for genres. There is a meta thread for comments about future threads.
- If you think there should be a thread for a particular genre of media, please post it to the Other Media thread for now, and add a poll to the Meta thread asking if it should be a thread every month.

## Prediction of the Internet

The internet is quite a special invention. It has had a huge impact on the way we live in a number of ways. And yet, it is mainly a conceptual breakthrough. For most of history, humanity didn't have the technology needed to create it, but the internet wasn't even imagined until shortly before its emergence. Many other technologies had no such problem: flight, audiovisual long-distance communication, audiovisual recording, death rays (a.k.a. lasers), or teleportation, the last being an example of a phenomenon we can easily conceive of even though we're nowhere near the technology needed for it.

My question is: when was the Internet first predicted? When was the idea of the Internet first put into words?

Let me clarify what I mean by the Internet. I don't mean global communication. That is achieved by telephone networks, (formerly) telegrams, e-mail, or Skype. But those are not the Internet; they are just part of it. Nor do I mean the global library. While this is closer to what I find essential about the Internet, it doesn't encompass the whole idea. A global library means one-way communication, analogous to a real library. It also emphasises the special status of authors and professionals, not ordinary folk. And even before the advent of Web 2.0, the internet was in great part created by amateurs. The internet is more like a huge advertising column, to which access is easy and which is not a curated repository of knowledge, but rather a mix of ads, telephone book, practical information about shops and services, information published by amateurs in a subject, personal diaries, a medium of exchange of goods, etc. - and on top of that, a tool for interpersonal communication. Ultimately, it is clearly a new form of communication, encompassing both the two-way communication of casual chat and the one-way communication between author and reader.

This is the idea I mean by the Internet. When did this specific idea first come into existence? When did people realise that a global network of PCs and servers would result in this form of communication, creating its own space, cyberspace, rather than just a tool for communication and a form of library?

## New LW Meetup: Perth

**This summary was posted to LW main on July 25th. The following week's summary is here.**

The remaining meetups take place in cities with regular scheduling, but involve a change in time or location, special meeting content, or simply a helpful reminder about the meetup:

- Brussels - August (topic TBD): 09 August 2014 01:00PM
- Houston, TX: 09 August 2014 12:16AM
- London social meetup - possibly in a park: 27 July 2014 02:00PM
- Washington, D.C.: Fun & Games: 27 July 2014 03:00PM

Locations with regularly scheduled meetups: **Austin**, **Berkeley**, **Berlin**, **Boston**, **Brussels**, **Buffalo**, **Cambridge UK**, **Canberra**, **Columbus**, **London**, **Madison WI**, **Melbourne**, **Mountain View**, **New York**, **Philadelphia**, **Research Triangle NC**, **Salt Lake City**, **Seattle**, **Sydney**, **Toronto**, **Vienna**, **Washington DC**, **Waterloo**, and **West Los Angeles**. There's also a 24/7 online study hall for coworking LWers.

## Meetup : London - Index Funds and Other Fun Stuff

## Discussion article for the meetup : London - Index Funds and Other Fun Stuff

The next London meetup will be on Sunday August 10th, from 2pm at the Shakespeare's Head in Holborn.

Several regulars have expressed an interest in setting up passive investment in an index fund. Towards the beginning of the meetup, I'll be presenting a short how-to on index fund investments and answering questions on the subject. (I have no broader finance and investment expertise, but did a lot of research into UK index tracker products a few years ago, which I'm happy to share).

This obviously isn't everybody's discussion topic of choice, but there's room for simultaneous discussion on another table, so come along anyway even if you're not interested in index funds, and talk about unicorns / sci-fi plausibility / infinite torture scenarios, etc. instead. The whole thing will probably revert to unstructured discussion after the first hour or so. We normally have some sort of sign identifying us as a LessWrong meetup, and typically try to get one or more of the large round tables near the back of the pub. If you can't find us, call or text 07887 718458.

**About London LessWrong:**

We're currently running meetups every other Sunday, and tend to get between 5 and 15 attendees. Most of our meetups default to unstructured social discussion on LessWrongy subjects, though we occasionally have special topics, events or activities. Sometimes we play games.

We have a Google Group and a Facebook group where we plan and discuss stuff. We would be extraordinarily pleased if you joined them.

## Announcing the 2014 program equilibrium iterated PD tournament

Last year, AlexMennen ran a prisoner's dilemma tournament with bots that could see each other's source code, which was dubbed a "program equilibrium" tournament. This year, I will be running a similar tournament. Here's how it's going to work: Anyone can submit a bot that plays the iterated PD against other bots. Bots can not only remember previous rounds, as in the standard iterated PD, but also run perfect simulations of their opponent before making a move. Please see the GitHub repo for the full list of rules and a brief tutorial.

There are a few key differences this year:

1) The tournament is in Haskell rather than Scheme.

2) The time limit for each round is shorter (5 seconds rather than 10) but the penalty for not outputting Cooperate or Defect within the time limit has been reduced.

3) Bots cannot directly see each other's source code, but they can run their opponent, specifying the initial conditions of the simulation, and then observe the output.
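The simulation rule in point 3 can be pictured with a toy model. The sketch below is in Python rather than Haskell and does not use the tournament's actual interface (see the repo for that). Every name in it (`Bot`, `probe_bot`, `play`, and so on) is hypothetical, and it ignores time limits and scoring. It only shows the idea of a bot running its opponent under initial conditions it chooses and observing the output.

```python
from typing import Callable, List, Tuple

Move = str                           # "C" (cooperate) or "D" (defect)
History = List[Tuple[Move, Move]]    # (my move, their move) for each past round
Bot = Callable[..., Move]            # hypothetical signature: bot(opponent, history)


def cooperate_bot(opponent: Bot, history: History) -> Move:
    return "C"


def defect_bot(opponent: Bot, history: History) -> Move:
    return "D"


def tit_for_tat(opponent: Bot, history: History) -> Move:
    # Standard iterated-PD strategy: relies on memory only, never simulates.
    return "C" if not history else history[-1][1]


def probe_bot(opponent: Bot, history: History) -> Move:
    # Run the opponent in a simulation whose initial conditions we pick:
    # an empty history, facing an unconditional cooperator.
    # Probing against cooperate_bot (not against ourselves) avoids
    # infinite mutual simulation.
    simulated = opponent(cooperate_bot, [])
    return "C" if simulated == "C" else "D"


def play(bot_a: Bot, bot_b: Bot, rounds: int) -> History:
    """Play the iterated PD and return the history from bot_a's point of view."""
    history_a: History = []
    history_b: History = []
    for _ in range(rounds):
        move_a = bot_a(bot_b, history_a)
        move_b = bot_b(bot_a, history_b)
        history_a.append((move_a, move_b))
        history_b.append((move_b, move_a))
    return history_a
```

In this toy setup, `play(probe_bot, defect_bot, 3)` has both sides defecting every round, while `play(probe_bot, tit_for_tat, 2)` settles into mutual cooperation. A real entry would also have to handle the time limit, which this sketch leaves out.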

All submissions should be emailed to pdtournament@gmail.com or PM'd to me here on LessWrong by September 1st, 2014. LW users with 50+ karma who want to participate but do not know Haskell can PM me with an algorithm/pseudocode, and I will translate it into a bot for them. (If there is a flood of such requests, I would appreciate some volunteers to help me out.)

## Causal Inference Sequence Part 1: Basic Terminology and the Assumptions of Causal Inference

**The data-generating mechanism and the joint distribution of variables**

It is possible to create arbitrarily complicated mathematical structures to describe empirical research. If the logic is done correctly, these structures are all completely *valid*, but they are only *useful* if the mathematical objects correctly represent the things in the real world that we want to learn about. Whenever someone tells you about a new framework which has been found to be mathematically valid, the first question you should ask yourself is whether the new framework allows you to correctly represent the important aspects of the phenomena you are studying.

When we are interested in causal questions, the phenomenon we are studying is called "the data generating mechanism". The data generating mechanism is the causal force of nature that assigns values to variables. Questions about the data generating mechanism include “Which variable has its value assigned first?”, “What variables from the past are taken into consideration when nature assigns the value of a variable?” and “What is the causal effect of treatment?”.

We can never observe the data generating mechanism. Instead, we observe something different, which we call “the joint distribution of observed variables”. The joint distribution is created when the data generating mechanism assigns values to variables in individuals. All questions about whether observed variables are correlated or independent, and about how strongly they are correlated, are questions about the joint distribution.

The basic problem of causal inference is that the relationship between the set of possible data generating mechanisms, and the joint distribution of variables, is many-to-one.

Imagine you have data on all observable variables for all individuals in the world. You can just look at your data and know everything there is to know about the joint distribution. You don’t need estimators, and you don’t need to worry about limited samples. Anything you need to know about the joint distribution can just be looked up. Now ask yourself: Can you learn anything about causal effects from this data?

Consider the two graphs below. We haven't introduced causal graphs yet, but for the moment, it is sufficient to understand these graphs as intuitive maps of the data generating mechanism. In reality, they are causal DAGs, which we will introduce in the next chapter:

In Graph 1, A is assigned first, then L is assigned by some random function with a deterministic component that depends only on A, then Y is assigned by some random function that depends only on L. In Graph 2, L is assigned first, then A and Y are assigned by two different random functions that each depend only on L.

No matter how many people you sample, you cannot tell the graphs apart, because any joint distribution of L, A and Y that is consistent with Graph 1 could also have been generated by Graph 2. Distinguishing between the two possible data generating mechanisms is therefore not a statistical problem. This is one reason why model selection algorithms (which rely only on the joint distribution of observed variables for input) are not valid for causal inference.
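To make this concrete, here is a minimal sketch for binary variables. The probabilities are made up for illustration and are not from any real data. It builds the joint distribution implied by a Graph 1 mechanism (A, then L, then Y), derives Graph 2 parameters (L first, then A and Y) with Bayes' rule, and checks that the two joint distributions agree exactly.

```python
from fractions import Fraction
from itertools import product

# Hypothetical parameters for Graph 1 (A -> L -> Y); chosen only for illustration.
P_A1 = Fraction(1, 2)                                     # P(A=1)
P_L1_GIVEN_A = {0: Fraction(1, 5), 1: Fraction(4, 5)}     # P(L=1 | A=a)
P_Y1_GIVEN_L = {0: Fraction(1, 10), 1: Fraction(7, 10)}   # P(Y=1 | L=l)


def bern(p_one, value):
    """P(X=value) for a binary X with P(X=1) = p_one."""
    return p_one if value == 1 else 1 - p_one


def joint_graph1():
    """Joint P(A, L, Y) factorised as P(A) P(L|A) P(Y|L)."""
    return {
        (a, l, y): bern(P_A1, a) * bern(P_L1_GIVEN_A[a], l) * bern(P_Y1_GIVEN_L[l], y)
        for a, l, y in product((0, 1), repeat=3)
    }


# Parameters for Graph 2 (A <- L -> Y), derived from Graph 1 via Bayes' rule.
P_L1 = sum(bern(P_A1, a) * P_L1_GIVEN_A[a] for a in (0, 1))   # P(L=1)
P_A1_GIVEN_L = {                                              # P(A=1 | L=l)
    l: P_A1 * bern(P_L1_GIVEN_A[1], l) / bern(P_L1, l) for l in (0, 1)
}


def joint_graph2():
    """Joint P(A, L, Y) factorised as P(L) P(A|L) P(Y|L)."""
    return {
        (a, l, y): bern(P_L1, l) * bern(P_A1_GIVEN_L[l], a) * bern(P_Y1_GIVEN_L[l], y)
        for a, l, y in product((0, 1), repeat=3)
    }


print(joint_graph1() == joint_graph2())  # prints True
```

The same construction works for any choice of Graph 1 parameters, provided L takes each of its values with nonzero probability. Even an infinite sample, which gives you the joint distribution exactly, cannot separate the two mechanisms.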

Because even a complete sample is insufficient to learn about causal effects, you need a priori causal information in order to do this. This prior causal information comes in the form of the assumption “the data came from a complicated randomized trial run by nature”. If you have reason to doubt this assumption, you should also doubt the conclusions.

**What do we mean by Causality?**

The first step of causal inference is to translate the English language research question «What is the causal effect of treatment» into a precise, mathematical language. One possible such language is based on counterfactual variables. These counterfactual variables allow us to encode the concept of “what would have happened if, possibly contrary to fact, treatment had been given”.

We define one counterfactual variable called Y^{a=1} which represents the outcome in the person if he has treatment, and another counterfactual variable called Y^{a=0} which represents the outcome if he does not have treatment. Counterfactual variables such as Y^{a=0} are mathematical objects that represent part of the data generating mechanism: The variable tells us what value the mechanism would assign to Y, if A is set to 0. These variables are columns in an imagined dataset that we sometimes call “God’s Table”:

| ID | A | Y | Y^{a=1} | Y^{a=0} |
|----|---|---|---------|---------|
| 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 1 | 0 | 1 |
| 3 | 1 | 1 | 1 | 0 |
| 4 | 0 | 0 | 0 | 0 |

Let us start by making some points about this spreadsheet. First, note that the counterfactual variables are variables just like any other column in the spreadsheet. Therefore, we can use the same type of logic that we use for any other variables. Second, note that in our framework, counterfactual variables are pre-treatment variables: They are determined long before treatment is assigned. The effect of treatment is simply to determine whether we see Y^{a=0} or Y^{a=1} in this individual.

The most important point about God’s Table is that we cannot observe Y^{a=1} and Y^{a=0}. We only observe the joint distribution of observed variables, which we can call the “Observed Table”:

| ID | A | Y |
|----|---|---|
| 1 | 1 | 1 |
| 2 | 0 | 1 |
| 3 | 1 | 1 |
| 4 | 0 | 0 |

The goal of causal inference is to learn about God’s Table using information from the observed table (in combination with a priori causal knowledge). In particular, we are going to be interested in learning about the distributions of Y^{a=1} and Y^{a=0}, and in how they relate to each other.
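As a small illustration of why this is a real problem, the sketch below encodes the four-person God’s Table from above (reading its last two columns as Y^{a=1} and Y^{a=0}), derives the Observed Table from it, and compares the true average effect with the contrast we would compute from observed data alone. The function names are my own and purely illustrative.

```python
# God's Table from the text, one row per person: (ID, A, Y^{a=1}, Y^{a=0}).
gods_table = [
    (1, 1, 1, 1),
    (2, 0, 0, 1),
    (3, 1, 1, 0),
    (4, 0, 0, 0),
]

def observe(row):
    """Return the (ID, A, Y) row we actually get to see.

    Treatment only decides which of the two counterfactuals is revealed.
    """
    person_id, a, y_a1, y_a0 = row
    return (person_id, a, y_a1 if a == 1 else y_a0)

observed_table = [observe(row) for row in gods_table]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Only computable if we had access to God's Table:
true_effect = mean(r[2] for r in gods_table) - mean(r[3] for r in gods_table)

# Computable from the Observed Table, but not guaranteed to equal the above:
naive_contrast = (mean(y for _, a, y in observed_table if a == 1)
                  - mean(y for _, a, y in observed_table if a == 0))
```

In this toy table the treatment has no average effect, yet the treated and untreated groups differ in their observed outcomes. Nothing in the Observed Table by itself tells you which of the two numbers you are looking at.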


**Randomized Trials**

The “Gold Standard” for estimating the causal effect is to run a randomized controlled trial where we randomly assign the value of A. This study design works because you select one random subset of the study population where you observe Y^{a=0}, and another random subset where you observe Y^{a=1}. You therefore have unbiased information about the distribution of both Y^{a=0} and of Y^{a=1}.

An important thing to point out at this stage is that it is not necessary to use an unbiased coin to assign treatment, as long as you use the same coin for everyone. For instance, the probability of being randomized to A=1 can be 2/3. You will still see randomly selected subsets of the distribution of both Y^{a=0} and Y^{a=1}; you will just have a larger number of people where you see Y^{a=1}. Usually, randomized trials use unbiased coins, but this is simply done because it increases the statistical power.
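If you want to convince yourself of this, a quick simulation is enough. The sketch below is only an illustration: the outcome probabilities (0.3 without treatment, 0.5 with treatment) are assumptions I picked so that the true average effect is 0.2, and `simulate_trial` is a hypothetical helper, not part of any library.

```python
import random

def simulate_trial(n, p_treat, seed=0):
    """Randomize n people with a coin that lands A=1 with probability p_treat.

    Returns (difference in observed mean outcomes, number treated, number untreated).
    """
    rng = random.Random(seed)
    treated, untreated = [], []
    for _ in range(n):
        # Counterfactual outcomes are fixed before the coin is tossed.
        # Pr(Y^{a=0}=1)=0.3 and Pr(Y^{a=1}=1)=0.5 are illustrative assumptions.
        y_a0 = 1 if rng.random() < 0.3 else 0
        y_a1 = 1 if rng.random() < 0.5 else 0
        if rng.random() < p_treat:
            treated.append(y_a1)     # in the treated we only see Y^{a=1}
        else:
            untreated.append(y_a0)   # in the untreated we only see Y^{a=0}
    effect = sum(treated) / len(treated) - sum(untreated) / len(untreated)
    return effect, len(treated), len(untreated)

fair = simulate_trial(100_000, 1 / 2, seed=1)
biased = simulate_trial(100_000, 2 / 3, seed=1)
```

Both estimates should land close to 0.2. The biased coin simply puts more people in the treated arm, which leaves fewer people to estimate the untreated mean from. That is the loss of statistical power mentioned above.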

Also note that it is possible to run two different randomized controlled trials: One in men, and another in women. The first trial will give you an unbiased estimate of the effect in men, and the second trial will give you an unbiased estimate of the effect in women. If both trials used the same coin, you could think of them as really being one trial. However, if the two trials used different coins, and you pooled them into the same database, your analysis would have to account for the fact that in reality, there were two trials. If you don’t account for this, the results will be biased. This is called “confounding”. As long as you account for the fact that there really were two trials, you can still recover an estimate of the population average causal effect. This is called “Controlling for Confounding”.
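Here is a sketch of what goes wrong when the two trials are pooled. The numbers are hypothetical: men and women each make up half the population, the men's trial used a coin with Pr(A=1)=4/5, the women's trial used Pr(A=1)=1/5, and treatment has no effect in either sex (the outcome risk is 3/5 in men and 1/5 in women regardless of treatment).

```python
from fractions import Fraction as F

# Hypothetical numbers chosen for illustration; none of them come from the text.
trials = {
    "men":   {"share": F(1, 2), "coin": F(4, 5), "risk_a1": F(3, 5), "risk_a0": F(3, 5)},
    "women": {"share": F(1, 2), "coin": F(1, 5), "risk_a1": F(1, 5), "risk_a0": F(1, 5)},
}

def pooled_mean_outcome(arm):
    """E[Y | A=arm] in the pooled database, ignoring which trial a person was in."""
    numerator = denominator = F(0)
    for t in trials.values():
        p_arm = t["coin"] if arm == 1 else 1 - t["coin"]
        risk = t["risk_a1"] if arm == 1 else t["risk_a0"]
        numerator += t["share"] * p_arm * risk
        denominator += t["share"] * p_arm
    return numerator / denominator

# Analysis that forgets there were two trials:
naive_effect = pooled_mean_outcome(1) - pooled_mean_outcome(0)

# Analysis that accounts for the two trials: average the within-trial
# effects, weighting each trial by its share of the population.
adjusted_effect = sum(t["share"] * (t["risk_a1"] - t["risk_a0"])
                      for t in trials.values())
```

The pooled treated group is mostly men and the pooled untreated group is mostly women, so the naive comparison picks up the difference between the sexes and reports an effect that does not exist. The adjusted analysis recovers the correct answer of zero.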

In general, causal inference works by specifying a model that says the data came from a complex trial, ie, one where nature assigned a biased coin depending on the observed past. For such a trial, there will exist a valid way to recover the overall causal results, but it will require us to think carefully about what the correct analysis is.

**Assumptions of Causal Inference**

We will now go through in more detail why randomized trials work, ie, the important aspects of this study design that allow us to infer causal relationships, or facts about God’s Table, using information about the joint distribution of observed variables.

We will start with an “observed table” and build towards “reconstructing” parts of God’s Table. To do this, we will need three assumptions: These are positivity, consistency and (conditional) exchangeability:

| ID | A | Y |
|----|---|---|
| 1 | 1 | 1 |
| 2 | 0 | 1 |
| 3 | 1 | 1 |
| 4 | 0 | 0 |

*Positivity*

Positivity is the assumption that any individual has a positive probability of receiving all levels of treatment: Pr(A=a) > 0 for all levels of a. If positivity does not hold, you will not have any information about the distribution of Y^{a} for that level of a, and will therefore not be able to make inferences about it.

We can check whether this assumption holds in the sample, by checking whether there are people who are treated and people who are untreated. If you observe that in every stratum, there are individuals who are treated and individuals who are untreated, you know that positivity holds.

If we observe a stratum where no individuals are treated (or no individuals are untreated), this can be either for statistical reasons (by chance, you did not sample them) or for structural reasons (individuals with these covariates are deterministically never treated). As we will see later, our models can handle random violations, but not structural violations.

In a randomized controlled trial, positivity holds because you will use a coin that has a positive probability of assigning people to either arm of the trial.
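The sample check described above is easy to automate. The following is a minimal sketch with a hypothetical helper, `positivity_violations`, that lists the strata in which some treatment level was never observed.

```python
from collections import defaultdict

def positivity_violations(rows, levels=(0, 1)):
    """Find strata where some treatment level never appears in the sample.

    rows: iterable of (stratum, treatment) pairs.
    Returns a dict mapping each offending stratum to its missing levels.
    """
    seen = defaultdict(set)
    for stratum, a in rows:
        seen[stratum].add(a)
    return {
        stratum: sorted(set(levels) - observed)
        for stratum, observed in seen.items()
        if set(levels) - observed
    }

# Toy sample (made up): nobody in the "women" stratum was treated.
sample = [("men", 1), ("men", 0), ("women", 0), ("women", 0)]
violations = positivity_violations(sample)
```

Note that a check like this cannot tell you whether an empty cell is a random or a structural violation. That distinction requires subject-matter knowledge about how treatment was assigned.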

*Consistency*

The next assumption we are going to make is that if an individual happens to have treatment (A=1), we will observe the counterfactual variable Y^{a=1} in this individual. This is the observed table after we make the consistency assumption:

| ID | A | Y | Y^{a=1} | Y^{a=0} |
|----|---|---|---------|---------|
| 1 | 1 | 1 | 1 | * |
| 2 | 0 | 1 | * | 1 |
| 3 | 1 | 1 | 1 | * |
| 4 | 0 | 0 | * | 0 |

Making the consistency assumption got us half the way to our goal. We now have a lot of information about Y^{a=1} and Y^{a=0}. However, half of the data is still missing.
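In code, the consistency step amounts to copying Y into whichever counterfactual column matches the treatment actually received, and leaving the other column missing. A minimal sketch, using `None` for the missing entries (the function name is my own):

```python
# Observed Table from the text: (ID, A, Y).
observed_table = [(1, 1, 1), (2, 0, 1), (3, 1, 1), (4, 0, 0)]

def apply_consistency(row):
    """Fill in the one counterfactual that consistency lets us read off.

    Returns (ID, A, Y, Y^{a=1}, Y^{a=0}), with None where the value is unknown.
    """
    person_id, a, y = row
    y_a1 = y if a == 1 else None
    y_a0 = y if a == 0 else None
    return (person_id, a, y, y_a1, y_a0)

partial_table = [apply_consistency(row) for row in observed_table]
missing = sum(r[3] is None for r in partial_table) + sum(r[4] is None for r in partial_table)
```

Exactly one counterfactual per person stays missing, which is the half of the table that the next assumption has to deal with.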

Although consistency seems obvious, it is an assumption, not something that is true by definition. We can expect the consistency assumption to hold if we have a well-defined intervention (ie, the intervention is a well-defined choice, not an attribute of the individual), and there is no causal interference (one individual’s outcome is not affected by whether another individual was treated).

Consistency may not hold if you have an intervention that is not well-defined: For example, imagine you are interested in the effect of obesity, but there are several ways to gain weight. When you measure Y^{a=1} in people who gained weight, it will actually be a composite of multiple counterfactual variables: One for people who decided to stop exercising (let us call that Y^{a=1*}) and another for people who decided that they really like cake (let us call that Y^{a=1#}). Since you failed to specify whether you are interested in the effect of cake or the effect of lack of exercise, the construct Y^{a=1} is a composite without any meaning, and people will be unable to use your results to predict the consequences of their actions.

*Exchangeability*

To complete the table, we require an additional assumption on the nature of the data. We call this assumption “Exchangeability”. One possible exchangeability assumption is “Y^{a=0} ∐ A and Y^{a=1} ∐ A”. This is the assumption that says “The data came from a randomized controlled trial”. If this assumption is true, you will observe a random subset of the distribution of Y^{a=0} in the group where A=0, and a random subset of the distribution of Y^{a=1} in the group where A=1.

Exchangeability is a statement about two variables being independent from each other. This means that having information about either one of the variables will not help you predict the value of the other. Sometimes, variables which are not independent are "conditionally independent". For example, it is possible that knowing somebody's race helps you predict whether they enjoy eating Hakarl, an Icelandic form of rotting fish. However, it is also possible that this is just a marker for whether they were born in the ethnically homogenous Iceland. In such a situation, it is possible that once you already know whether somebody is from Iceland, also knowing their race gives you no additional clues as to whether they will enjoy Hakarl. In this case, the variables "race" and "enjoying hakarl" are conditionally independent, given nationality.

The reason we care about conditional independence is that sometimes you may be unwilling to assume that marginal exchangeability Y^{a=1} ∐ A holds, but you are willing to assume conditional exchangeability Y^{a=1} ∐ A | L. In this example, let L be sex. The assumption then says that you can interpret the data as if it came from two different randomized controlled trials: One in men, and one in women. If that is the case, sex is a "confounder". (We will give a definition of confounding in Part 2 of this sequence.)

If the data came from two different randomized controlled trials, one possible approach is to analyze these trials separately. This is called “stratification”. Stratification gives you effect measures that are conditional on the confounders: You get one measure of the effect in men, and another in women. Unfortunately, in more complicated settings, stratification-based methods (including regression) are always biased. In those situations, it is necessary to focus the inference on the marginal distribution of Y^{a}.

**Identification**

If marginal exchangeability holds (ie, if the data came from a marginally randomized trial), making inferences about the marginal distribution of Y^{a} is easy: You can just estimate E[Y^{a}] as E[Y|A=a].

However, if the data came from a conditionally randomized trial, we will need to think a little bit harder about how to say anything meaningful about E[Y^{a}]. This process is the central idea of causal inference. We call it “identification”: The idea is to write an expression for the distribution of a counterfactual variable, purely in terms of observed variables. If we are able to do this, we have sufficient information to estimate causal effects just by looking at the relevant parts of the joint distribution of observed variables.

The simplest example of identification is standardization. As an example, we will show a simple proof:

Begin by using the law of total expectation (the expectation version of the law of total probability) to factor out the confounder, in this case L:

· E(Y^{a}) = Σ_l E(Y^{a}|L=l) * Pr(L=l) (the summation is over all values l of L)

We do this because we know we need to introduce L behind the conditioning sign, in order to be able to use our exchangeability assumption in the next step. Then, because Y^{a} ∐ A | L, we are allowed to introduce A=a behind the conditioning sign:

· E(Y^{a}) = Σ E(Y^{a}|A=a, L=l) * Pr(L=l)

Finally, use the consistency assumption: Because we are in the stratum where A=a in all individuals, we can replace Y^{a} by Y:

· E(Y^{a}) = Σ E(Y|A=a, L=l) * Pr(L=l)

We now have an expression for the counterfactual in terms of quantities that can be observed in the real world, ie, in terms of the joint distribution of A, Y and L. In other words, we have linked the data generating mechanism with the joint distribution: we have “identified” E(Y^{a}). We can therefore estimate E(Y^{a}).

This identifying expression is valid if and only if L was the only confounder. If we had not observed sufficient variables to obtain conditional exchangeability, it would not be possible to identify the distribution of Y^{a}: there would be intractable confounding.
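The identifying expression translates almost word for word into code. The sketch below is one possible plug-in estimator, assuming a single discrete confounder L and records stored as (L, A, Y) tuples. The function name and the toy data are mine, and the data are constructed so that treatment has no effect within either stratum.

```python
from collections import Counter

def standardized_mean(records, a):
    """Plug-in estimate of E(Y^a) = sum over l of E(Y | A=a, L=l) * Pr(L=l).

    records: list of (L, A, Y) tuples. Only valid if L is the only confounder.
    """
    n = len(records)
    stratum_sizes = Counter(l for l, _, _ in records)
    total = 0.0
    for l, n_l in stratum_sizes.items():
        ys = [y for l_i, a_i, y in records if l_i == l and a_i == a]
        if not ys:
            # Positivity fails in the sample: nothing to standardize with.
            raise ValueError(f"no individuals with A={a} in stratum L={l!r}")
        total += (sum(ys) / len(ys)) * (n_l / n)
    return total

# Made-up data: men are mostly treated and have risk 1/2 either way,
# women are mostly untreated and have risk 0 either way.
records = (
    [("men", 1, 1)] * 4 + [("men", 1, 0)] * 4
    + [("men", 0, 1)] * 1 + [("men", 0, 0)] * 1
    + [("women", 1, 0)] * 2 + [("women", 0, 0)] * 8
)

standardized_effect = standardized_mean(records, 1) - standardized_mean(records, 0)

treated = [y for _, a, y in records if a == 1]
untreated = [y for _, a, y in records if a == 0]
crude_effect = sum(treated) / len(treated) - sum(untreated) / len(untreated)
```

The crude contrast E[Y|A=1] - E[Y|A=0] is far from zero in this data, while the standardized contrast is zero. The code only does the arithmetic, though. Whether the result deserves a causal interpretation still rests entirely on the exchangeability assumption.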

Identification is the core concept of causal inference: It is what allows us to link the data generating mechanism to the joint distribution, to something that can be observed in the real world.

**The difference between epidemiology and biostatistics**

Many people see Epidemiology as «Applied Biostatistics». This is a misconception. In reality, epidemiology and biostatistics are completely different parts of the problem. To illustrate what is going on, consider this figure:

The data generating mechanism first creates a joint distribution of observed variables. Then, we sample from the joint distribution to obtain data. Biostatistics asks: If we have a sample, what can we learn about the joint distribution? Epidemiology asks: If we have all the information about the joint distribution, what can we learn about the data generating mechanism? This is a much harder problem, but it can still be analyzed with some rigor.

Epidemiology without Biostatistics is always impossible: It would not be possible to learn about the data generating mechanism without asking questions about the joint distribution. This usually involves sampling. Therefore, we will need good statistical estimators of the joint distribution.

Biostatistics without Epidemiology is usually pointless: The joint distribution of observed variables is simply not interesting in itself. You can make the claim that randomized trials are an example of biostatistics without epidemiology. However, the epidemiology is still there. It is just not necessary to think about it, because the epidemiologic part of the analysis is trivial.

Note that the word “bias” means different things in Epidemiology and Biostatistics. In Biostatistics, “bias” is a property of a statistical estimator: We talk about whether ŷ is a biased estimator of E(Y|A). If an estimator is biased, it means that when you use data from a sample to make inferences about the joint distribution in the population the sample came from, there will be a systematic source of error.

In Epidemiology, “bias” means that you are estimating the wrong thing: Epidemiological bias is a question about whether E(Y|A) is a valid identification of E(Y^{a}). If there is epidemiologic bias, it means that you estimated something in the joint distribution, but that this something does not answer the question you were interested in.

These are completely different concepts. Both are important and can lead to your estimates being wrong. It is possible for a statistically valid estimator to be biased in the epidemiologic sense, and vice versa. For your results to be valid, your estimator must be unbiased in both senses.

## Sequence Announcement: Applied Causal Inference for Empirical Research

**Applied Causal Inference for Empirical Research**

This sequence is an introduction to basic causal inference. It was originally written as auxiliary notes for a course in Epidemiology, but it is relevant to almost any kind of applied statistical and empirical research, including econometrics, sociology, psychology, political science etc. I would not be surprised if you guys find a lot of errors, and I would be very grateful if you point them out in the comments. This will help me improve my course notes and potentially help me improve my understanding of the material.

For mathematically inclined readers, I recommend skipping this sequence and instead reading Pearl's book on Causality. There is also a lot of good material on causal graphs on Less Wrong itself. Also, note that my thesis advisor is writing a book that covers the same material in more detail; the first two parts are available for free at his website.

Pearl's book, Miguel's book and Eliezer's writings are all more rigorous and precise than my sequence. This is partly because I have a different goal: Pearl and Eliezer are writing for mathematicians and theorists who may be interested in contributing to the theory. Instead, I am writing for consumers of science who want to understand correlation studies from the perspective of a more rigorous epistemology.

I will use Epidemiological/Counterfactual notation rather than Pearl's notation. I apologize if this is confusing. These two approaches refer to the same mathematical objects, it is just a different notation. Whereas Pearl would use the "Do-Operator" E[Y|do(a)], I use counterfactual variables E[Y^{a}]. Instead of using Pearl's "Do-Calculus" for identification, I use Robins' G-Formula, which will give the same results.

For all applications, I will use the letter "A" to represent "treatment" or "exposure" (the thing we want to estimate the effect of), Y to represent the outcome, L to represent any measured confounders, and U to represent any unmeasured confounders.

**Outline of Sequence:**

I hope to publish one post every week. I have rough drafts for the following eight sections, and will keep updating this outline with links as the sequence develops:

Part 0: Sequence Announcement / Introduction (This post)

Part 1: Basic Terminology and the Assumptions of Causal Inference

Part 2: Graphical Models

Part 3: Using Causal Graphs to Understand Bias

Part 4: Time-Dependent Exposures

Part 5: The G-Formula

Part 6: Inverse Probability Weighting

Part 7: G-Estimation of Structural Nested Models and Instrumental Variables

Part 8: Single World Intervention Graphs, Cross-World Counterfactuals and Mediation Analysis

**Introduction: Why Causal Inference?**

The goal of applied statistical research is almost always to learn about causal effects. However, causal inference from observational data is hard, to the extent that it is usually not even possible without strong, almost heroic assumptions. Because of the inherent difficulty of the task, many old-school investigators were trained to avoid making causal claims. Words like “cause” and “effect” were banished from polite company, and the slogan “correlation does not imply causation” became an article of faith which, when said loudly enough, seemingly absolved the investigators from the sin of making causal claims.

However, readers were not fooled: They always understood that epidemiologic papers were making causal claims. Of course they were making causal claims; why else would anybody be interested in a paper about the correlation between two variables? For example, why would anybody want to know about the correlation between eating nuts and longevity, unless they were wondering if eating nuts would cause them to live longer?

When readers interpreted these papers causally, were they simply ignoring the caveats, drawing conclusions that were not intended by the authors? Of course they weren’t. The discussion sections of epidemiologic articles are full of “policy implications” and speculations about biological pathways that are completely contingent on interpreting the findings causally. Quite clearly, no matter how hard the investigators tried to deny it, they were making causal claims. However, they were using methodology that was not designed for causal questions, and did not have a clear language for reasoning about where the uncertainty about causal claims comes from.

This was not sustainable, and inevitably led to a crisis of confidence, which culminated when some high-profile randomized trials showed completely different results from the preceding observational studies. In one particular case, when the Women’s Health Initiative trial showed that post-menopausal hormone replacement therapy increases the risk of cardiovascular disease, the difference was so dramatic that many thought-leaders in clinical medicine completely abandoned the idea of inferring causal relationships from observational data.

It is important to recognize that the problem was not that the results were wrong. The problem was that there was uncertainty that was not taken seriously by the investigators. A rational person who wants to learn about the world will be willing to accept that studies have margins of error, but only as long as the investigators make a good-faith effort to examine what the sources of error are, and as long as they communicate clearly about this uncertainty to their readers. Old-school epidemiology failed at this. We are not going to make the same mistake. Instead, we are going to develop a clear, precise language for reasoning about uncertainty and bias.

In this context, we are going to talk about two sources of uncertainty – “statistical” uncertainty and “epidemiological” uncertainty.

We are going to use the word “Statistics” to refer to the theory of how we can learn about correlations from limited samples. For statisticians, the primary source of uncertainty is sampling variability. Statisticians are very good at accounting for this type of uncertainty: Concepts such as “standard errors”, “p-values” and “confidence intervals” are all attempts at quantifying and communicating the extent of uncertainty that results from sampling variability.

The old school of epidemiology would tell you to stop after you had found the correlations and accounted for the sampling variability. They believed going further was impossible. However, correlations are simply not interesting. If you truly believed that correlations tell you nothing about causation, there would be no point in doing the study.

Therefore, we are going to use the terms “Epidemiology” or “Causal Inference” to refer to the next stage in the process: Learning about causation from correlations. This is a much harder problem, with many additional sources of uncertainty, including confounding and selection bias. However, recognizing that the problem is hard does not mean that you shouldn't try, it just means that you have to be careful. As we will see, it is possible to reason rigorously about whether correlation really does imply causation *in your particular study*: You will just need a precise language. The goal of this sequence is simply to give you such a language.

In order to teach you the logic of this language, we are going to make several controversial statements such as «The only way to estimate a causal effect is to run a randomized controlled trial». You may not be willing to believe this at first, but in order to understand the logic of causal inference, it is necessary that you are at least willing to suspend your disbelief and accept it as true within the course.

It is important to note that we are not just saying this to try to convince you to give up on observational studies in favor of randomized controlled trials. We are making this point because understanding it is necessary in order to appreciate what it means to control for confounding: It is not possible to give a coherent meaning to the word “confounding” unless one is trying to determine whether it is reasonable to model the data as if it came from a complex randomized trial run by nature.

--

When we say that causal inference is hard, what we mean by this is not that it is difficult to learn the basic concepts of the theory. What we mean is that even if you fully understand everything that has ever been written about causal inference, it is going to be very hard to infer a causal relationship from observational data, and that there will always be uncertainty about the results. This is why this sequence is not going to be a workshop that teaches you how to apply magic causal methodology. What we are interested in is developing your ability to reason honestly about where uncertainty and bias come from, so that you can communicate this to the readers of your studies. What we want to teach you about is the epistemology that underlies epidemiological and statistical research with observational data.

Insisting on only using randomized trials may seem attractive to a purist, but it does not take much imagination to see that there are situations where it is important to predict the consequences of an action, but where it is not possible to run a trial. In such situations, there may be Bayesian evidence to be found in nature. This evidence comes in the form of correlations in observational data. When we are stuck with this type of evidence, it is important that we have a clear framework for assessing the strength of the evidence.

--

I am publishing Part 1 of the sequence at the same time as this introduction. I would be very interested in hearing feedback, particularly about whether people feel this has already been covered in sufficient detail on Less Wrong. If there is no demand, there won't really be any point in transforming the rest of my course notes to a Less Wrong format.

Thanks to everyone who had a look at this before I published, including paper-machine and Vika, Janos, Eloise and Sam from the Boston Meetup group.

## How to treat problems of unknown difficulty

*Crossposted from the Global Priorities Project*

This is the first in a series of posts which take aim at the question: how should we prioritise work on problems where we have very little idea of our chances of success? In this post we’ll see some simple models-from-ignorance which allow us to produce some estimates of the chances of success from extra work. In later posts we’ll examine the counterfactuals to estimate the value of the work. For those who prefer a different medium, I gave a talk on this topic at the Good Done Right conference in Oxford this July.

## Introduction

How hard is it to build an economically efficient fusion reactor? How hard is it to prove or disprove the Goldbach conjecture? How hard is it to produce a machine superintelligence? How hard is it to write down a concrete description of our values?

These are all hard problems, but we don’t even have a good idea of just how hard they are, even to an order of magnitude. This is in contrast to a problem like giving a laptop to every child, where we know that it’s hard but we could produce a fairly good estimate of how much resources it would take.

Since we need to make choices about how to prioritise between work on different problems, this is clearly an important issue. We can prioritise using benefit-cost analysis, choosing the projects with the highest ratio of future benefits to present costs. When we don’t know how hard a problem is, though, our ignorance makes the size of the costs unclear, and so the analysis is harder to perform. Since we make decisions anyway, we are implicitly making some judgements about when work on these projects is worthwhile, but we may be making mistakes.

In this article, we’ll explore practical epistemology for dealing with these problems of unknown difficulty.

### Definition

We will use a simplifying model for problems: that they have a critical threshold *D* such that the problem will be completely solved when *D* resources are expended, and not at all before that. We refer to this as the *difficulty* of the problem. After the fact the graph of success with resources will look something like this:

Of course the assumption is that we don’t know *D*. So our uncertainty about where the threshold is will smooth out the curve in expectation. Our expectation beforehand for success with resources will end up looking something like this:

Assuming a fixed difficulty is a simplification, since of course resources are not all homogenous, and we may get lucky or unlucky. I believe that this is a reasonable simplification, and that taking these considerations into account would not change our expectations by much, but I plan to explore this more carefully in a future post.

## What kind of problems are we looking at?

We’re interested in one-off problems where we have a lot of uncertainty about the difficulty. That is, the kind of problem we only need to solve once (answering a question a first time can be Herculean; answering it a second time is trivial), and which may not easily be placed in a reference class with other tasks of similar difficulty. Knowledge problems, as in research, are a central example: they boil down to finding the answer to a question. The category might also include trying to effect some systemic change (for example by political lobbying).

This is in contrast to engineering problems which can be reduced down, roughly, to performing a known task many times. Then we get a fairly good picture of how the problem scales. Note that this includes some knowledge work: the “known task” may actually be different each time. For example, proofreading two different pages of text is not quite the same task, but we have a fairly good reference class so we can estimate moderately well the difficulty of proofreading a page of text, and quite well the difficulty of proofreading a 100,000-word book (where the length helps to smooth out the variance in estimates of individual pages).

Some knowledge questions can naturally be broken up into smaller sub-questions. However these typically won’t be a tight enough class that we can use this to estimate the difficulty of the overall problem from the difficulty of the first few sub-questions. It may well be that one of the sub-questions carries essentially all of the difficulty, so making progress on the others is only a very small help.

## Model from extreme ignorance

One approach to estimating the difficulty of a problem is to assume that we understand essentially nothing about it. If we are completely ignorant, we have no information about the scale of the difficulty, so we want a scale-free prior. This determines that the prior obeys a power law. Then, we update on the amount of resources we have already expended on the problem without success. Our posterior probability distribution for how many resources are required to solve the problem will then be a Pareto distribution. (Fallenstein and Mennen proposed this model for the difficulty of the problem of making a general-purpose artificial intelligence.)
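To see what this model implies in practice, here is a minimal sketch. It assumes the posterior for *D* is a Pareto distribution whose scale is the amount already spent without success, so that Pr(D > x | D > spent) = (spent / x)^shape for x ≥ spent. The shape parameter is an input you have to supply yourself, and the function name is hypothetical.

```python
def pareto_success_probability(spent, extra, shape):
    """Pr(D <= spent + extra | D > spent) under a Pareto posterior.

    spent: resources already expended without success (the Pareto scale).
    extra: additional resources we are considering putting in.
    shape: Pareto shape parameter (an assumption; larger means a thinner tail).
    """
    if spent <= 0 or extra < 0 or shape <= 0:
        raise ValueError("need spent > 0, extra >= 0 and shape > 0")
    return 1.0 - (spent / (spent + extra)) ** shape

# Doubling total spending gives the same chance of success at any scale.
p_small = pareto_success_probability(spent=10, extra=10, shape=1.0)
p_large = pareto_success_probability(spent=1_000_000, extra=1_000_000, shape=1.0)
```

The answer depends only on the ratio of extra resources to resources already spent, which is the scale-free property the model was built to have.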

There is still a question about the shape parameter of the Pareto distribution, which governs how thick the tail is. It is hard to see how to infer this from a priori reasons, but we might hope to estimate it by generalising from a very broad class of problems people have successfully solved in the past.

This idealised case is a good starting point, but in actual cases, our estimate may be wider or narrower than this. Narrower if either we have some idea of a reasonable (if very approximate) reference class for the problem, or we have some idea of the rate of progress made towards the solution. For example, assuming a Pareto distribution implies that there’s always a nontrivial chance of solving the problem at any minute, and we may be confident that we are not that close to solving it. Broader because a Pareto distribution implies that the problem is certainly solvable, and some problems will turn out to be impossible.

This might lead people to criticise the idea of using a Pareto distribution. If they have enough extra information that they don’t think their beliefs represent a Pareto distribution, can we still say anything sensible?

## Reasoning about broader classes of model

In the previous section, we looked at a very specific and explicit model. Now we take a step back. We assume that people will have complicated enough priors and enough minor sources of evidence that it will in practice be impossible to write down a true distribution for their beliefs. Instead we will reason about some properties that this true distribution should have.

The cases we are interested in are cases where we do not have a good idea of the order of magnitude of the difficulty of a task. This is an imprecise condition, but we might think of it as meaning something like:

There is no difficulty X such that we believe the probability of D lying between X and 10X is more than 30%.

Here the “30%” figure can be adjusted up for a less stringent requirement of uncertainty, or down for a more stringent one.
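As a rough way to get a feel for this condition, suppose (purely as an illustrative assumption, not something the argument depends on) that your beliefs about log10(D) are normally distributed. Then the window [X, 10X] holding the most probability is the one centred on the median, and its mass has a closed form. The sketch below computes it and checks it against the threshold.

```python
import math

def max_decade_mass_lognormal(sigma_decades):
    """Largest probability of any window [X, 10X] when log10(D) is normal.

    sigma_decades: standard deviation of log10(D), in orders of magnitude.
    For a symmetric unimodal density the best window is centred on the
    median, giving mass 2 * Phi(0.5 / sigma) - 1 = erf(0.5 / (sigma * sqrt(2))).
    """
    if sigma_decades <= 0:
        raise ValueError("sigma_decades must be positive")
    return math.erf(0.5 / (sigma_decades * math.sqrt(2)))

def is_unknown_difficulty(sigma_decades, threshold=0.30):
    """True if no factor-of-10 window holds more than `threshold` probability."""
    return max_decade_mass_lognormal(sigma_decades) <= threshold

one_decade_spread = is_unknown_difficulty(1.0)   # spread of one order of magnitude
two_decade_spread = is_unknown_difficulty(2.0)   # spread of two orders of magnitude
```

Under this assumption, a standard deviation of one order of magnitude is not quite uncertain enough to meet the 30% version of the condition, while two orders of magnitude comfortably is.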

Now consider what our subjective probability distribution might look like, where difficulty lies on a logarithmic scale. Our high level of uncertainty will smooth things out, so it is likely to be a reasonably smooth curve. Unless we have specific distinct ideas for how the task is likely to be completed, this curve will probably be unimodal. Finally, since we are unsure even of the order of magnitude, the curve cannot be too tight on the log scale.

Note that this should be our *prior* subjective probability distribution: we are gauging how hard we would have thought it was before embarking on the project. We’ll discuss below how to update this in the light of information gained by working on it.

The distribution might look something like this:

In some cases it is probably worth trying to construct an explicit approximation of this curve. However, this could be quite labour-intensive, and we usually have uncertainty even about our uncertainty, so we will not be entirely confident with what we end up with.

Instead, we could ask what properties tend to hold for this kind of probability distribution. For example, one well-known phenomenon which is roughly true of these distributions but not all probability distributions is Benford’s law.

### Approximating as locally log-uniform

It would sometimes be useful to be able to make a simple analytically tractable approximation to the curve. This could be faster to produce, and easily used in a wider range of further analyses than an explicit attempt to model the curve exactly.

As a candidate for this role, we propose working with the assumption that the distribution is locally flat. This corresponds to being log-uniform. The smoothness assumptions we made should mean that our curve is nowhere too far from flat. Moreover, it is a very easy assumption to work with, since it means that the expected returns scale logarithmically with the resources put in: in expectation, a doubling of the resources is equally good regardless of the starting point.

It is, unfortunately, never exactly true. Although our curves may be approximately flat, they cannot be everywhere flat -- this can’t even give a probability distribution! But it may work reasonably as a model of local behaviour. If we want to turn it into a probability distribution, we can do this by estimating the plausible ranges of D and assuming it is uniform across this scale. In our example we would be approximating the blue curve by something like this red box:

Obviously in the example the red box is not a fantastic approximation. But nor is it a terrible one. Over the central range, it is never out from the true value by much more than a factor of 2. While crude, this could still represent a substantial improvement on the current state of some of our estimates. A big advantage is that it is easily analytically tractable, so it will be quick to work with. In the rest of this post we’ll explore the consequences of this assumption.

### Places this might fail

In some circumstances, we might expect high uncertainty over difficulty without everywhere having local log-returns. A key example is if we have bounds on the difficulty at one or both ends.

For example, if we are interested in X, which comprises a task of radically unknown difficulty plus a repetitive and predictable part of difficulty 1000, then our distribution of beliefs about the difficulty of X will only include values above 1000, and may be quite clustered there (so not even approximately logarithmic returns). The behaviour in the positive tail might still be roughly logarithmic.

In the other direction, we may know that there is a slow and repetitive way to achieve X, with difficulty 100,000. We are unsure whether there could be a quicker way. In this case our distribution will be uncertain over difficulties up to around 100,000, then have a spike. This will give the reverse behaviour, with roughly logarithmic expected returns in the negative tail, and a different behaviour around the spike at the upper end of the distribution.

In some sense each of these is diverging from the idea that we are very ignorant about the difficulty of the problem, but it may be useful to see how the conclusions vary with the assumptions.
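The first example above (a predictable part of difficulty 1000 plus a part of unknown difficulty) can be sketched numerically. The code below assumes the unknown part is log-uniform between 1 and 1,000,000. Those bounds and the function name are illustrative assumptions, not values from this post.

```python
import math


def success_with_fixed_cost(z: float, fixed: float = 1000.0,
                            x: float = 1.0, y: float = 1e6) -> float:
    """P(fixed + D <= z), where D is log-uniform on [x, y].

    `fixed` is the predictable part of the task; x and y are hypothetical
    bounds on the unknown part D.
    """
    remaining = z - fixed  # resources left over for the unknown part
    if remaining <= x:
        return 0.0
    if remaining >= y:
        return 1.0
    return math.log(remaining / x) / math.log(y / x)


# Gain in success probability from doubling total resources, at three scales.
gains = [
    success_with_fixed_cost(2 * z) - success_with_fixed_cost(z)
    for z in (1_000, 10_000, 100_000)
]

# What a doubling would be worth under a pure log-uniform prior on [x, y].
pure_log_uniform_gain = math.log(2) / math.log(1e6)

if __name__ == "__main__":
    print(gains, pure_log_uniform_gain)
```

Near the lower bound a doubling is worth far more than in the tail, where the gain per doubling settles close to the pure log-uniform value. This matches the claim that returns are not logarithmic near the bound but are roughly so in the positive tail.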

## Implications for expected returns

What does this model tell us about the expected returns from putting resources into trying to solve the problem?

Under the assumption that the prior is locally log-uniform, the full value is realised over the width of the box in the diagram. This is *w* = log(*y*) - log(*x*), where *x* is the value at the start of the box (where the problem could first be plausibly solved), *y* is the value at the end of the box, and our logarithms are natural. Since it’s a probability distribution, the height of the box is 1/*w*.

For any *z* between *x* and *y*, the modelled chance of success from investing *z* resources is equal to the fraction of the box which has been covered by that point. That is:

(1) Chance of success before reaching *z* resources = log(*z*/*x*)/log(*y*/*x*).

So while we are in the relevant range, the chance of success is equal for any doubling of the total resources. We could say that we expect *logarithmic returns* on investing resources.
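As a quick check of Equation (1), here is a minimal Python sketch. The function name and the example bounds (*x* = 1, *y* = 1024) are illustrative assumptions chosen so that the range spans exactly ten doublings.

```python
import math


def chance_of_success(z: float, x: float, y: float) -> float:
    """Equation (1): fraction of the log-uniform box covered by z resources."""
    if not 0 < x < y:
        raise ValueError("need 0 < x < y")
    if z <= x:
        return 0.0  # below the point where success first becomes plausible
    if z >= y:
        return 1.0  # the whole box has been covered
    return math.log(z / x) / math.log(y / x)


# Each doubling inside the range adds the same probability of success.
doubling_gains = [
    chance_of_success(2 * z, 1, 1024) - chance_of_success(z, 1, 1024)
    for z in (1, 8, 64, 512)
]

if __name__ == "__main__":
    print(doubling_gains)
```

With ten doublings between *x* and *y*, each doubling contributes one tenth of the total probability, whether it takes us from 1 to 2 or from 512 to 1024.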

### Marginal returns

Sometimes of greater relevance to our decisions is the marginal chance of success from adding an extra unit of resources at *z*. This is given by the derivative of Equation (1):

(2) Chance of success from a marginal unit of resource at *z* = 1/(*zw*).

So far, we’ve just been looking at estimating the prior probabilities -- before we start work on the problem. Of course when we start work we generally get more information. In particular, if we would have been able to recognise success, and we have invested *z* resources without observing success, then we learn that the difficulty is at least *z*. We must update our probability distribution to account for this. In some cases we will have relatively little information beyond the fact that we haven’t succeeded yet. In that case the update will just be to curtail the distribution to the left of *z* and renormalise, looking roughly like this:

Again the blue curve represents our true subjective probability distribution, and the red box represents a simple model approximating this. Now the simple model gives a slightly higher estimated chance of success from an extra marginal unit of resources:

(3) Chance of success from an extra unit of resources after *z* = 1/(*z*(log(*y*) - log(*z*))).
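To see how much the update changes the estimate, the following Python sketch compares the prior marginal chance, 1/(*z*(log(*y*) - log(*x*))), with the updated one in Equation (3). The function names and the example numbers (*x* = 1, *y* = 1,000,000, *z* = 10) are hypothetical.

```python
import math


def marginal_prior(z: float, x: float, y: float) -> float:
    """Prior marginal chance at z: 1 / (z * w), with w = log(y) - log(x)."""
    if not 0 < x <= z < y:
        raise ValueError("need 0 < x <= z < y")
    return 1.0 / (z * (math.log(y) - math.log(x)))


def marginal_after_failure(z: float, y: float) -> float:
    """Equation (3): marginal chance at z, given no success with z resources."""
    if not 0 < z < y:
        raise ValueError("need 0 < z < y")
    return 1.0 / (z * (math.log(y) - math.log(z)))


# Illustrative numbers: box from 1 to 1,000,000, with 10 units already spent.
x, y, z = 1.0, 1e6, 10.0
ratio = marginal_after_failure(z, y) / marginal_prior(z, x, y)

if __name__ == "__main__":
    print(marginal_prior(z, x, y), marginal_after_failure(z, y), ratio)
```

The ratio of the two is log(*y*/*x*)/log(*y*/*z*): truncating the box on the left narrows it, so the same marginal unit covers a larger share of what remains. The increase is small while *z* is close to *x* and grows as *z* approaches *y*.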

Of course in practice we often will update more. Even if we don’t have a good idea of how hard fusion is, we can reasonably assign close to zero probability that an extra $100 today will solve the problem today, because we can see enough to know that the solution won’t be found imminently. This looks like it might present problems for this approach. However, the truly decision-relevant question is about the counterfactual impact of extra resource investment. The region where we can see little chance of success has a much smaller effect on that calculation, which we discuss below.

### Comparison with returns from a Pareto distribution

We mentioned that one natural model of such a process is as a Pareto distribution. If we have a Pareto distribution with shape parameter *α*, and we have so far invested *z* resources without success, then we get:

(4) Chance of success from an extra unit of resources = *α*/*z*.

This is broadly in line with equation (3). In both cases the key term is a factor of 1/*z*. In each case there is also an additional factor, representing roughly how hard the problem is. In the case of the log-uniform box, this depends on estimating an upper bound for the difficulty of the problem; in the case of the Pareto distribution it is handled by the shape parameter. It may be easier to introspect and extract a sensible estimate for the width of the box than for the shape parameter, since it is couched more in terms that we naturally understand.
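The correspondence between the two models can be made concrete. The sketch below checks Equation (4) against the exact conditional probability for a Pareto distribution, and computes the shape parameter at which (4) and (3) agree at a given *z*. Function names and example values are illustrative assumptions.

```python
import math


def pareto_exact_marginal(z: float, alpha: float, step: float = 1.0) -> float:
    """P(z < D <= z + step | D > z) for a Pareto distribution.

    The survival function is (z_min / z) ** alpha, so the scale parameter
    z_min cancels in the conditional probability (valid for z >= z_min).
    """
    return 1.0 - (z / (z + step)) ** alpha


def pareto_approx_marginal(z: float, alpha: float) -> float:
    """Equation (4): first-order approximation alpha / z."""
    return alpha / z


def matching_alpha(z: float, y: float) -> float:
    """Shape parameter for which Equation (4) equals Equation (3) at z."""
    return 1.0 / (math.log(y) - math.log(z))


if __name__ == "__main__":
    z, alpha = 1000.0, 0.5
    print(pareto_exact_marginal(z, alpha), pareto_approx_marginal(z, alpha))
    print(matching_alpha(1000.0, 1e6))
```

Read this way, a shape parameter *α* plays the same role as one over the remaining log-width of the box, log(*y*) - log(*z*). One difference is that the Pareto factor stays fixed as *z* grows, while the box factor rises as *z* approaches *y*.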

## Further work

In this post, we’ve just explored a simple model for the basic question of how likely success is at various stages. Of course it should not be used blindly, as you may often have more information than is incorporated into the model, but it represents a starting point if you don't know where to begin, and it gives us something explicit which we can discuss, critique, and refine.

In future posts, I plan to:

- Explore what happens in a field of related problems (such as a research field), and explain why we might expect to see logarithmic returns *ex post* as well as *ex ante*.
- Look at some examples of this behaviour in the real world.
- Examine the counterfactual impact of investing resources working on these problems, since this is the standard we should be using to prioritise.
- Apply the framework to some questions of interest, with worked proof-of-concept calculations.
- Consider what happens if we relax some of the assumptions or take different models.
