# Original Research on Less Wrong

29 October 2012 10:50PM

Hundreds of Less Wrong posts summarize or repackage work previously published in professional books and journals, but Less Wrong also hosts lots of original research in philosophy, decision theory, mathematical logic, and other fields. This post serves as a curated index of Less Wrong posts containing significant original research.

Obviously, there is much fuzziness about what counts as "significant" or "original." I'll be making lots of subjective judgment calls about which suggestions to add to this post. One clear rule is: I won't be linking anything that merely summarizes previous work (e.g. Stuart's summary of his earlier work on utility indifference).

Update 09/20/2013: Added Notes on logical priors from the MIRI workshop, Cooperating with agents with different ideas of fairness, while resisting exploitation, Do Earths with slower economic growth have a better chance at FAI?

Update 11/03/2013: Added Bayesian probability as an approximate theory of uncertainty?, On the importance of taking limits: Infinite Spheres of Utility, Of all the SIA-doomsdays in the all the worlds...

Update 01/22/2014: Added Change the labels, undo infinitely good, Reduced impact AI: no back channels, International cooperation vs. AI arms race, Naturalistic trust among AIs: The parable of the thesis advisor’s theorem

#### AI Risk Strategy

Comment author: 30 October 2012 10:04:03PM 5 points [-]

Um... perhaps Wei Dai's analysis of the absent-minded driver problem (with its subsequent resolution in the comments) and paulfchristiano's AIXI and existential despair would qualify?

Comment author: 03 November 2012 01:54:42PM 2 points [-]

Privileging the Hypothesis. I don't think this bias has been described anywhere else.

Comment author: 03 November 2012 10:24:04PM 2 points [-]

I don't think it has been named elsewhere, but there is related research. Kahneman described something very similar to it in Thinking, Fast and Slow:

“The probability of a rare event will (often, not always) be overestimated, because of the confirmatory bias of memory. Thinking about that event, you try to make it true in your mind. A rare event will be overweighted if it specifically attracts attention. …. And when there is no overweighting, there will be neglect. When it comes to rare probabilities, our mind is not designed to get things quite right. For the residents of a planet that may be exposed to events no one has yet experienced, this is not good news.”

Comment author: 30 October 2012 12:29:20AM *  1 point [-]

According to an article in PLOS Medicine, Most Published Research Findings Are False. Feynman offers a similarly unsettling perspective on research in Cargo Cult Science.

Have we done any better?

Comment author: 30 October 2012 01:07:14AM 13 points [-]

That's for experimental statistical reports. Trying to do math runs into a different set of dangers.

You can easily beat "Most published research findings are false" by reporting Bayesian likelihood ratios instead of "statistical significance", or even just keeping statistical significance and demanding p < .001 instead of the ludicrous p < .05. It should only take <2.5 times as many subjects to detect a real effect at p < .001 instead of p < .05 and the proportion of false findings would go way down immediately. That's what current grantmakers and journals would ask for if they cared.

Comment author: 30 October 2012 02:34:04AM *  11 points [-]

It should only take <2.5 times as many subjects to detect a real effect at p < .001 instead of p < .05 and the proportion of false findings would go way down immediately.

If anyone is curious about the details here - you can derive from the basic t-test statistics (the z-test is easier to understand) an equation for the sample size required at significance 0.05 and 90% power, which goes

$\frac{\sqrt{n}}{\hat{\sigma}_D} > 1.64 - z_{0.10}$

The "1.64" here is a magic value derived from a big table for the normal distribution: it indicates a result which for the normal distribution of the null is 1.64 standard deviations out towards the tails, which turns out to happen 0.05 or 5% of the time when you generate random draws from the null hypothesis's normal distribution. But we need to know how many standard deviations out we have to go in order to have a result which appears only 0.001 or 0.1% of the time when you randomly draw. This magic value turns out to be 3.09. So we plug that in and now the equation looks like:

$\frac{\sqrt{n}}{\hat{\sigma}_D} > 3.09 - z_{0.10}$

$z_{0.10}$ can be given a value from the table too: it's the lower-tail quantile, $-1.28$ (10% of the distribution lies that far out toward the other tail), so subtracting it adds 1.28. Substituting in for both:

$\frac{\sqrt{n}}{\hat{\sigma}_D} > 1.64 + 1.28 \text{ and } \frac{\sqrt{n'}}{\hat{\sigma}_D} > 3.09 + 1.28$

Simplify:

$\frac{\sqrt{n}}{\hat{\sigma}_D} > 2.92 \text{ and } \frac{\sqrt{n'}}{\hat{\sigma}_D} > 4.37$

Multiply the standard deviation by both sides to start to get at n/n':

$\sqrt{n} > 2.92 \times \hat{\sigma}_D \text{ and } \sqrt{n'} > 4.37 \times \hat{\sigma}_D$

Square to expose the naked n/n':

$n > (2.92 \times \hat{\sigma}_D)^2 \text{ and } n' > (4.37 \times \hat{\sigma}_D)^2$

Distribute:

$n > 2.92^2 \times \hat{\sigma}_D^2 \text{ and } n' > 4.37^2 \times \hat{\sigma}_D^2$

Simplify:

$n > 8.5264 \times \hat{\sigma}_D^2 \text{ and } n' > 19.09 \times \hat{\sigma}_D^2$

Obviously we can divide both n and n' by $\hat{\sigma}_D^2$ to get rid of that, leaving us with:

$n > 8.5264 \text{ and } n' > 19.09$

So, how much bigger is n' than n? This requires an advanced operation known as division:

$\frac{19.09}{8.5264} = 2.239$

And 2.24 is indeed < 2.5.

We can double check by firing up a power calculator and messing around with various sample sizes and effect sizes and powers to see how n changes with more stringent significances:

$ R

library(pwr)
pwr.t.test(d=0.1,sig.level=0.05,power=0.90)

Two-sample t test power calculation

n = 2102.445
d = 0.1
sig.level = 0.05
power = 0.9
alternative = two.sided

NOTE: n is number in each group

pwr.t.test(d=0.1,sig.level=0.001,power=0.90)

Two-sample t test power calculation

n = 4183.487
d = 0.1
sig.level = 0.001
power = 0.9
alternative = two.sided

NOTE: n is number in each group

4183 / 2102
[1] 1.99001
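The hand-derived one-sided ratio can also be reproduced without R, using only the Python standard library's normal quantile function (a sketch; the 2.23 here differs from the 1.99 above because pwr defaults to a two-sided t-test):

```python
from statistics import NormalDist

z = NormalDist().inv_cdf  # standard-normal quantile function

# One-sided z-test, 90% power: required n scales as (z_{1-alpha} + z_{0.90})^2
n_05  = (z(1 - 0.05)  + z(0.90)) ** 2   # ~8.56 (8.53 above, from rounded z values)
n_001 = (z(1 - 0.001) + z(0.90)) ** 2   # ~19.11

print(round(n_05, 2), round(n_001, 2), round(n_001 / n_05, 2))
```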

Comment author: 30 October 2012 04:45:05AM 15 points [-]

I was just mentally approximating log(.001)/log(.05) = 2.3.

Comment author: 30 October 2012 03:34:28PM *  10 points [-]

Thanks. Sometimes I learn a lot from people saying fairly-obvious (in retrospect) things.

In case anyone is curious about this, I guess that Eliezer knew it instantly because each additional data point brings with it a constant amount of information. The log of a probability is the information it contains, so an event with probability .001 has 2.3 times the information of an event of probability .05.

If that's not intuitive, consider that p=.05 means that you have a .05 chance of seeing the effect by statistical fluke (assuming there's no real effect present). If your sample size is n times as large, the probability becomes (.05)^n. (Edit: see comments below) To solve

(.05)^n = .001

take logs of both sides and divide to get

n = log(.001)/log(.05)
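The arithmetic here is a two-liner, and the information-content framing gives the same answer:

```python
import math

# Information content in bits: -log2(p)
bits_05  = -math.log2(0.05)   # ~4.32 bits
bits_001 = -math.log2(0.001)  # ~9.97 bits

n = math.log(0.001) / math.log(0.05)
print(round(n, 2), round(bits_001 / bits_05, 2))  # both ~2.31
```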

Comment author: 02 November 2012 08:52:51PM 0 points [-]

The log of a probability is the information it contains

Why?

Comment author: 02 November 2012 09:08:38PM 1 point [-]

You mean why isn't the information of a particular number just its length, or its size, and is its log of all things?

Because you can think of each bit of the number as telling you how to navigate a binary tree to the target node's meaning, and the inverse of that branching is the logarithm; at least, that's how I think of it when I use it in my essays like Death Note anonymity.

Comment author: 31 October 2012 09:12:24AM 0 points [-]

If your sample size is n times as large, the probability becomes (.05)^n

I'm not sure that follows.

Comment author: 31 October 2012 09:37:32AM 2 points [-]

If a given piece of evidence E1 provides Bayesian likelihood for theory T1 over T2, and E2 was generated by an isomorphic process, then we get the likelihood ratio squared, providing that T1 and T2 are single possible worlds and have no parameters being updated by E1 or E2 so that the probability of the evidence is conditionally independent.

Thus sayeth Bayes, so far as I can tell.

As for the frequentists...

Well, logically, we're allegedly rejecting a null hypothesis. If the "null hypothesis" contains no parameters to be updated and the probability that E1 was generated by the null hypothesis is .05, and E2 was generated by a causally conditionally independent process, the probability that E1+E2 was generated by the null hypothesis ought to be 0.0025.

But of course gwern's calculation came out differently in the decimals. This could be because some approximation truncated a decimal or two. But it could also be because frequentism actually calculates the probability that E1 is in some amazing class [E] of other data we could've observed but didn't, to be p < 0.05. Who knows what strange class of other data we could've seen but didn't, a given frequentist method will put E1 + E2 into? I mean, you can make up whatever the hell [E] you want, so who says you've got to make up one that makes [E+E] have the probability of [E] squared? So if E1 and E2 are exactly equally likely given the null hypothesis, a frequentist method could say that their combined "significance" is the square of E1, less than the square, more than the square, who knows, what the hell, if we obeyed probability theory we'd be Bayesians so let's just make stuff up. Sorry if I sound a bit polemical here.

Comment author: 01 November 2012 04:00:29AM 4 points [-]

You can't just multiply p-values together to get the combined p-value for multiple experiments.

A p-value is a statistic that has a uniform(0,1) distribution if the null hypothesis is true. If you take two independent uniform(0,1) variables and multiply them together, the product is not a uniform(0,1) variable - it has more of its distribution near 0 and less near 1. So multiplying two p-values together does not give you a p-value; it gives you a number that is smaller than the p-value that you would get if you went through the appropriate frequentist procedure.
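A quick sketch of this point, assuming the null is true so each p-value is Uniform(0,1): the product of two such p-values falls below .0025 far more often than 0.25% of the time.

```python
import math, random

random.seed(0)
t = 0.05 * 0.05  # the naive "combined p" from multiplying two p = .05 results

# Exact: for U1, U2 iid Uniform(0,1), P(U1 * U2 <= t) = t * (1 - ln t)
exact = t * (1 - math.log(t))

# Monte Carlo check under the null
trials = 200_000
hits = sum(random.random() * random.random() <= t for _ in range(trials))
print(round(exact, 4), round(hits / trials, 4))  # both ~0.0175, not 0.0025
```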

Comment author: 01 November 2012 02:42:20PM 3 points [-]

In the course of figuring out what the hell the parent comment was talking about and how one was supposed to do the calculation, I found this. p-values are much clearer for me now, thanks for bringing this up.

Comment author: 01 November 2012 03:42:17PM 2 points [-]

Don't get me wrong, this is a good paper, well-written to be clearly understandable and not to be deliberately obtuse like far too many math papers these days, and the author's heart is clearly in the right place, but I still screamed while reading it.

How can anyone read this, and not bang their head against the wall at how horribly arbitrary this all is... no wonder more than half of published findings are false.

Comment author: 01 November 2012 03:25:45PM *  0 points [-]

That is a really interesting paper.

Also, I found that the function R_k in Section 2 has the slightly-more-closed formula $\rho \cdot P_k(\log(1/\rho))$, where P_k(x) is the first k terms of the Taylor series for e^x (with the factorials and everything). Just in case anyone wants to try this at home.
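For anyone who does want to try it at home, here is a sketch of that closed form: Fisher's combined p-value for k independent p-values is rho times the first k Taylor terms of e^x evaluated at x = ln(1/rho), where rho is the product of the p-values. The function name below is mine.

```python
import math

def fisher_combined(pvalues):
    """Combined p-value: rho times the first k terms of the
    Taylor series of e^x, evaluated at x = ln(1/rho)."""
    rho = math.prod(pvalues)
    x = math.log(1 / rho)
    return rho * sum(x**i / math.factorial(i) for i in range(len(pvalues)))

# Two p = .05 results do not combine to p = .0025:
print(round(fisher_combined([0.05, 0.05]), 4))  # ~0.0175
```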

Comment author: 31 October 2012 01:58:22PM 2 points [-]

A more generous way to think about frequentism (which can be justified by some conditional probability sleight-of-hand) is that the significance of some evidence E is actually the probability that the null hypothesis is true, given E and also some prior distribution that is swept under the rug and (mostly) not under the experimenter's control. Which is bad, yes, but in many cases the prior distribution is at least close to something reasonable. And there are some cases in which we can somewhat change the prior distribution to reflect our real priors: for example, when choosing to conduct a 1-tailed test rather than a 2-tailed one.

Under this interpretation, it is silly to expect significances to multiply. You'd really be saying something like Pr[H|E1+E2] = Pr[H|E1] Pr[H|E2]. And that's simply not true: you are double-counting the prior probability Pr[H] when you do this. The frequentist approach is a correct way to combine these probabilities, although this isn't obvious because nobody actually knows what the frequentist Pr[H] is.

But if you read about two experiments with a p-value of 0.05, and think of them as one experiment with a p-value of 0.0025, you are very very very wrong; not just frequentist-wrong but Bayesian-wrong as well.

Comment author: 31 October 2012 02:39:46PM 2 points [-]

the significance of some evidence E is actually the probability that the null hypothesis is true, given E

No frequentist says this. They don't believe in P(H|E). That's the explicit basis of the whole philosophy. People who talk about the probability of a hypothesis given the evidence are Bayesians, full stop.

Statistical significance is, albeit in a strange and distorted way, supposed to be about P(E|null hypothesis), and so, yes, two experiments with a p-value of 0.05 should add up to somewhere in the vicinity of p < 0.0025, because it's about likelihoods, which do multiply, and not posteriors.

Comment author: 31 October 2012 04:31:10PM 0 points [-]

While some frequentist methods do use likelihoods, the mapping from likelihood to p-value is non-linear, so multiplying them would still be a mistake, at least as far as I can tell.

Comment author: 31 October 2012 02:50:39PM *  0 points [-]

I'm not saying that frequentists believe this. I'm saying that the frequentist math (which computes Pr[E|H0]) is equivalent to computing Pr[H0|E] with respect to a prior distribution under which Pr[H0]=Pr[E]. Furthermore, this is a reasonable thing to look at, because from that point of view the way statistical significances combine actually makes sense.

Comment author: 31 October 2012 04:29:00PM -1 points [-]

Suppose that our data are coin flips, and consider three hypotheses: H0 = always heads, H1 = fair coin, H2 = heads with probability 25%. Now suppose that the two hypotheses we actually want to test between are H0 and H' = 0.5(H1+H2). After seeing a single heads, the likelihood of H0 is 1 and the likelihood of H' is 0.5(0.5+0.25). After seeing two heads, the likelihood of H0 is 1 and the likelihood of H' is 0.5(0.5^2+0.25^2). In general, the likelihood of H' after n heads is 0.5(0.5^n+0.25^n), i.e. a mixture of multiple geometric functions. In general if H' is a mixture of many hypotheses, the likelihood will be a mixture of many geometric functions, and therefore could be more or less arbitrary.
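A sketch of the arithmetic here, with the hypothesis names as defined in the comment:

```python
def lik_H0(n):
    # Always-heads: n heads have probability 1
    return 1.0

def lik_Hmix(n):
    # H' = 0.5 * (fair coin) + 0.5 * (25%-heads coin)
    return 0.5 * (0.5**n + 0.25**n)

# The likelihood ratio after 2 heads is not the square of the ratio after 1:
r1 = lik_H0(1) / lik_Hmix(1)   # 1 / 0.375   ~2.667
r2 = lik_H0(2) / lik_Hmix(2)   # 1 / 0.15625 = 6.4, while r1**2 ~7.111
print(round(r1, 3), round(r2, 3), round(r1**2, 3))
```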

Comment author: 01 November 2012 04:43:15AM 1 point [-]

That's why I specified single possible worlds / hypotheses with no internal parameters that are being learned.

Comment author: 01 November 2012 05:30:46AM 1 point [-]

Oops, missed that; but that specification doesn't hold in the situation we care about, since rejecting the null hypotheses typically requires us to consider the result of marginalizing over a space of alternative hypotheses (well, assuming we're being Bayesians, but I know you prefer that anyways =P).

Comment author: 31 October 2012 12:42:26PM *  1 point [-]

You're right. That would be true if we did n independent tests, not one test with n-times the subjects.

e.g. probability of 60 or more heads in 100 tosses = .028

probability of 120 or more heads in 200 tosses = .0028

but .028^2 = .00081
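These tail probabilities can be reproduced exactly from binomial sums with stdlib Python:

```python
from math import comb

def tail(n, k):
    """P(at least k heads in n tosses of a fair coin)."""
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n

p1 = tail(100, 60)   # ~.028
p2 = tail(200, 120)  # ~.0028
print(round(p1, 3), round(p2, 4), round(p1**2, 5))  # p1**2 ~.00081, well below p2
```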

Comment author: 31 October 2012 02:47:21PM 4 points [-]

Amazing, innit? Meanwhile in the land of the sane people, the likelihood function from any given propensity to come up heads, to the observed data, is exactly squared for 120 in 200 vs. 60 in 100.
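And that squaring is easy to verify: for any two candidate heads-propensities, the binomial likelihood ratio for 120-in-200 is exactly the square of the ratio for 60-in-100 (the binomial coefficients cancel; the theta values below are arbitrary examples):

```python
def lik_ratio(n, k, t1, t2):
    """Likelihood ratio of heads-propensity t1 over t2, given k heads
    in n tosses (the C(n, k) factor cancels out of the ratio)."""
    return (t1**k * (1 - t1)**(n - k)) / (t2**k * (1 - t2)**(n - k))

r1 = lik_ratio(100, 60, 0.6, 0.5)
r2 = lik_ratio(200, 120, 0.6, 0.5)
print(round(r1, 3), round(r2, 3))  # r2 equals r1**2 up to float rounding
```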

Comment author: 30 October 2012 04:14:32PM 4 points [-]

It should only take <2.5 times as many subjects to detect a real effect at p < .001 instead of p < .05 and the proportion of false findings would go way down immediately.

But then people could only publish 1/50 as many papers!

Comment author: 30 October 2012 05:13:15PM 2 points [-]

I had to do a double-take before I realized that this probably wasn't a serious attempt at a counterargument. I'm still not quite convinced that it isn't. Poe's Law and related things.

Yes, it does seem like some people's true rejections might turn out to be fewer opportunities to appeal to the public and gain popularity / funding.

Comment author: 03 November 2012 03:37:27PM 1 point [-]

it should only take <2.5 times as many subjects to detect a real effect at p < .001 instead of p < .05

Not to disagree with the overarching point, but the use of "only" here is inappropriate under some circumstances. Eg, a neuropsychological study requiring participants with a particular kind of brain injury is going to find more than doubling its n extremely difficult and time-consuming. For this kind of study (presuming insistence on working with p-values) it seems better to roll with the "ludicrous" p < .05 and rely on replication elsewhere for improved reliability. "Ludicrous" is too strong in fields with small effect sizes and small subject pools; they just need a much higher rate of replication.

Comment author: 30 October 2012 01:10:10PM 1 point [-]

I have made a habit of ignoring p<.05 values when they are reported, unless it's a special case where getting more subjects is too difficult or impossible.* I normally go with p<0.01 results unless it's very easy to gather more subjects, in which case going with p<0.001 or lower is good.

• For those cases, one can rely on repeated measurements of the same subjects over time. For instance, when comparing cross-country scores where the number of subjects is capped at 100-200, e.g. in The Spirit Level (book).

Comment author: 30 October 2012 01:01:06AM 5 points [-]

Dunno.

I bet math and logic papers have a higher frequency of valid results than medicine papers have of true results, and that LW math and logic results are more often valid than not.

Mainstream philosophy, however, is vastly less truth-tracking than medicine. I bet LW has a better philosophical track record than mainstream philosophy (merely by being naturalistic, reductionistic, relatively precise, cogsci-informed, etc.), but I'm not sure by how much.

Comment author: 30 October 2012 01:08:34AM 0 points [-]

Those criticisms are largely irrelevant to 3 of the 4 sections, and the 4th is more history or applied statistics than science.

Comment author: 30 October 2012 05:54:25PM 1 point [-]

No help from LW whatsoever? I was at least expecting people to mention the obvious stuff, like Eliezer's free will sequence. :(

Comment author: [deleted] 30 October 2012 06:04:23PM 5 points [-]

Could you say something about your methods of deciding originality and significance? One of the problems with figuring something like this out is that LW and mainstream academia often use significantly different jargon. It may be that LW will be unhelpful here because in order to work out what's original and significant, you'd have to be an expert both in LW stuff and in mainstream academic discussions.

Comment author: 31 October 2012 07:06:05PM *  0 points [-]

Are you going to put on the Free Will sequence? And other important contributions from the sequences, like the reductionism post and dissolving the question (and Zombies too!)? (They're pretty important to philosophy.)

Comment author: 31 October 2012 07:45:07PM 0 points [-]

Eventually; I'll probably get a remote researcher to help me fill out the list, later.

Comment author: 30 October 2012 06:11:29PM *  0 points [-]

"I would help, but I suck at researching papers to compare LW stuff with mainstream science, so I can't really do much."

Unfortunately, I might also want to post that (mostly for status reasons) even if I wouldn't help, which makes the information content near zero and qualifies as noise.

Unless it's a meta post explaining why I wasn't posting and that I think many other users might judge themselves unable to help.

Comment author: [deleted] 02 November 2012 09:49:56PM 0 points [-]

You forgot to escape the underscores in cousin_it's username.

Comment author: 02 November 2012 10:41:53PM *  0 points [-]

It's an error in LW's rendering. Not sure how to fix it.

Comment author: 03 November 2012 04:57:15AM *  3 points [-]

Fixed. (If you edit HTML source, LW software typically isn't involved in deciding how the post gets rendered.)