
summerstay comments on Against NHST - Less Wrong Discussion

57 Post author: gwern 21 December 2012 04:45AM




Comment author: summerstay 21 December 2012 04:07:37PM 11 points [-]

Can you give me a concrete course of action to take when I am writing a paper reporting my results? Suppose I have created two versions of a website, and timed 30 people completing a task on each version. The people on the second website were faster. I want my readers to believe that this wasn't merely a statistical coincidence. Normally, I would do a t-test to show this. What are you proposing I do instead? I don't want a generalization like "use Bayesian statistics," but a concrete example of how one would test the data and report it in a paper.
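For reference, the standard approach the commenter describes is a two-sample t-test. A minimal sketch with simulated data (the completion times, means, and variances here are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical task completion times in seconds, 30 users per site.
site_a = rng.normal(loc=60, scale=12, size=30)
site_b = rng.normal(loc=52, scale=12, size=30)

# Welch's t-test: does not assume the two groups have equal variances.
t_stat, p_value = stats.ttest_ind(site_a, site_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

This is the NHST baseline the post argues against; the replies below sketch alternatives.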

Comment author: XFrequentist 21 December 2012 09:28:48PM 4 points [-]

You could use Bayesian estimation to compute credible differences in mean task completion time between your groups.

Described in excruciating detail in this pdf.
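A full Bayesian estimation model of the kind described in that pdf is usually fit with MCMC, but the core idea can be sketched more simply: under a normal likelihood with the standard noninformative prior, the posterior for each group mean is a Student-t centered at the sample mean and scaled by the standard error, so one can sample the posterior of the difference directly. A minimal sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical completion times (seconds), 30 users per site.
site_a = rng.normal(60, 12, 30)
site_b = rng.normal(52, 12, 30)

def posterior_mean_samples(data, n_draws=100_000):
    """Draws from the posterior of the mean under a normal likelihood
    with the standard noninformative prior: a Student-t with n-1 df,
    centered at the sample mean, scaled by the standard error."""
    n = len(data)
    scale = data.std(ddof=1) / np.sqrt(n)
    return data.mean() + scale * rng.standard_t(n - 1, size=n_draws)

# Posterior of the difference in mean completion time (A minus B).
diff = posterior_mean_samples(site_a) - posterior_mean_samples(site_b)
ci_low, ci_high = np.percentile(diff, [2.5, 97.5])
print(f"95% credible interval for (A - B): [{ci_low:.1f}, {ci_high:.1f}] s")
print(f"P(site B faster) = {(diff > 0).mean():.3f}")
```

Unlike a p-value, this yields a direct probability statement about the size of the difference, which is what a reader of the paper actually wants to know.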

Comment author: summerstay 21 December 2012 04:17:29PM *  1 point [-]

Perhaps you would suggest showing the histograms of completion times on each site, along with the 95% confidence error bars?

Comment author: jsteinhardt 21 December 2012 05:06:28PM 1 point [-]

Presumably not actually 95%, but, as gwern said, a threshold based on the cost of false positives.

Comment author: gwern 21 December 2012 05:34:11PM *  4 points [-]

Yes, in this case you could keep using p-values (if you really wanted to...), but with reference to the value of, say, each customer. (This is what I meant by setting the threshold with respect to decision theory.) If the goal is to use it on a site making millions of dollars*, 0.01 may be too loose a threshold, but if he's just messing with his personal site to help readers, a p-value like 0.10 may be perfectly acceptable.

* If the results were that important, I think there'd be better approaches than a one-off A/B test. Adaptive multi-armed bandit algorithms sound really cool from what I've read of them.
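To illustrate the adaptive idea (this is a generic sketch, not a specific proposal from the thread): Thompson sampling on a two-variant site, with made-up conversion rates and Beta-Bernoulli posteriors, shifts traffic toward the better variant as evidence accumulates instead of splitting it 50/50 for the whole experiment.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical true conversion rates, unknown to the algorithm.
true_rates = [0.10, 0.12]

# Beta(1, 1) priors on each arm: posterior is Beta(successes+1, failures+1).
successes = np.zeros(2)
failures = np.zeros(2)

for _ in range(10_000):
    # Thompson sampling: draw a rate from each arm's posterior, play the best.
    draws = rng.beta(successes + 1, failures + 1)
    arm = int(np.argmax(draws))
    reward = rng.random() < true_rates[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward

pulls = successes + failures
print(f"pulls per arm: {pulls}")
print(f"estimated rates: {successes / pulls}")
```

Over time the worse arm gets sampled less and less, so the experiment pays a smaller opportunity cost than a fixed-allocation test.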

Comment author: gwern 21 December 2012 04:56:16PM 1 point [-]

I'd suggest more of a scattergram than a histogram; superimposing 95% CIs would then cover the exploratory data/visualization & confidence intervals. Combine that with an effect size and one has made a good start.
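The effect size and confidence intervals mentioned here are easy to compute and report; a minimal sketch using simulated data (Cohen's d for the standardized mean difference, t-based 95% CIs per group):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Hypothetical completion times (seconds), 30 users per site.
site_a = rng.normal(60, 12, 30)
site_b = rng.normal(52, 12, 30)

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

def mean_ci(data, level=0.95):
    """Confidence interval for the mean, based on the t distribution."""
    sem = stats.sem(data)
    return stats.t.interval(level, len(data) - 1, loc=data.mean(), scale=sem)

print(f"Cohen's d = {cohens_d(site_a, site_b):.2f}")
print(f"Site A 95% CI: {mean_ci(site_a)}")
print(f"Site B 95% CI: {mean_ci(site_b)}")
```

Reporting the raw points (scattergram), the interval estimates, and the effect size gives readers much more than a bare p-value would.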