summerstay comments on Against NHST - Less Wrong

57 points · Post author: gwern 21 December 2012 04:45AM

Comment author: summerstay 21 December 2012 04:07:37PM 11 points [-]

Can you give me a concrete course of action to take when I am writing a paper reporting my results? Suppose I have created two versions of a website and timed 30 people completing a task on each website. The people on the second website were faster. I want my readers to believe that this wasn't merely a statistical coincidence. Normally, I would do a t-test to show this. What are you proposing I do instead? I don't want a generalization like "use Bayesian statistics," but a concrete example of how one would test the data and report it in a paper.
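For concreteness, the usual t-test route looks roughly like this in Python (a sketch with made-up timing data; `site_a` and `site_b` are hypothetical arrays of completion times in seconds, not real measurements):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical completion times in seconds for 30 users per site.
site_a = rng.normal(60, 10, 30)
site_b = rng.normal(52, 10, 30)

# Welch's t-test (does not assume equal variances between groups).
t_stat, p_value = stats.ttest_ind(site_a, site_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

This is the procedure being questioned here, included only as the baseline for comparison.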

Comment author: XFrequentist 21 December 2012 09:28:48PM 4 points [-]

You could use Bayesian estimation to compute credible differences in mean task completion time between your groups.

Described in excruciating detail in this pdf.
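To sketch the core idea without reproducing the full model from the pdf — this assumes, for simplicity, a flat prior and a normal approximation to each group's posterior mean, and the timing data are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical completion times (seconds), 30 users per site.
site_a = rng.normal(60, 10, 30)
site_b = rng.normal(52, 10, 30)

def posterior_mean_samples(data, n_draws=20000, rng=rng):
    """Draw posterior samples of the group mean under a vague normal model.
    With a flat prior, the posterior of the mean is approximately
    Normal(sample mean, sample sd / sqrt(n))."""
    n = len(data)
    return rng.normal(data.mean(), data.std(ddof=1) / np.sqrt(n), n_draws)

# Posterior of the difference in mean completion time (A minus B).
diff = posterior_mean_samples(site_a) - posterior_mean_samples(site_b)
lo, hi = np.percentile(diff, [2.5, 97.5])
print(f"95% credible interval for (A - B): [{lo:.1f}, {hi:.1f}] seconds")
print(f"P(site B faster) = {(diff > 0).mean():.3f}")
```

You then report the credible interval and the posterior probability directly, instead of a p-value.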

Comment author: summerstay 21 December 2012 04:17:29PM *  1 point [-]

Perhaps you would suggest showing the histograms of completion times on each site, along with the 95% confidence error bars?

Comment author: jsteinhardt 21 December 2012 05:06:28PM 1 point [-]

Presumably not actually 95%, but, as gwern said, a threshold based on the cost of false positives.
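As a toy illustration of what "based on the cost of false positives" could mean — the two-outcome model and all the numbers here are invented for illustration, not anyone's actual procedure:

```python
# Switch sites only if the expected gain from a true improvement
# outweighs the expected cost of acting on a false positive.

def max_acceptable_alpha(gain_if_real, cost_if_false, prior_real=0.5):
    """Largest false-positive rate at which acting still has
    non-negative expected value, under a crude two-outcome model."""
    # EV of switching ~ prior_real * gain - (1 - prior_real) * alpha * cost.
    # Set EV to zero and solve for alpha:
    return min(1.0, prior_real * gain_if_real / ((1 - prior_real) * cost_if_false))

# High-stakes site: a false switch is very costly, so demand strong evidence.
print(max_acceptable_alpha(gain_if_real=10_000, cost_if_false=1_000_000))  # → 0.01
# Personal blog: little downside, so a loose threshold is fine.
print(max_acceptable_alpha(gain_if_real=10, cost_if_false=20))  # → 0.5
```

The point is only that the threshold falls out of the stakes rather than being fixed at 0.05 by convention.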

Comment author: gwern 21 December 2012 05:34:11PM *  4 points [-]

Yes, in this case you could keep using p-values (if you really wanted to...), but with reference to the value of, say, each customer. (This is what I meant by setting the threshold with respect to decision theory.) If the goal is to use it on a site making millions of dollars*, 0.01 may be too loose a threshold, but if he's just messing with his personal site to help readers, a p-value like 0.10 may be perfectly acceptable.

* If the results were that important, I think there'd be better approaches than a one-off A/B test. Adaptive multi-armed bandit algorithms sound really cool from what I've read of them.
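A minimal sketch of one such algorithm, Thompson sampling on a two-armed Bernoulli bandit (the conversion rates below are invented; a real deployment would track actual user outcomes):

```python
import numpy as np

rng = np.random.default_rng(2)
true_conversion = [0.05, 0.07]  # hypothetical true rates for sites A and B
successes = [0, 0]
failures = [0, 0]

for _ in range(10000):
    # Thompson sampling: draw from each arm's Beta posterior, play the best.
    draws = [rng.beta(s + 1, f + 1) for s, f in zip(successes, failures)]
    arm = int(np.argmax(draws))
    if rng.random() < true_conversion[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1

pulls = [successes[i] + failures[i] for i in range(2)]
print(f"Pulls per site: {pulls}")
```

The appeal is that traffic shifts toward the better-performing site during the experiment itself, rather than being split 50/50 until a one-off test concludes.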

Comment author: gwern 21 December 2012 04:56:16PM 1 point [-]

I'd suggest more of a scattergram than a histogram; superimposing 95% CIs would then cover the exploratory data/visualization & confidence intervals. Combine that with an effect size and one has made a good start.
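For the effect-size piece, something like Cohen's d alongside per-group 95% CIs (a sketch with hypothetical timing data; the normal-approximation interval is a simplification suitable for n = 30):

```python
import numpy as np

rng = np.random.default_rng(3)
site_a = rng.normal(60, 10, 30)  # hypothetical completion times (seconds)
site_b = rng.normal(52, 10, 30)

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

def mean_ci95(x):
    """Normal-approximation 95% CI for the mean."""
    half = 1.96 * x.std(ddof=1) / np.sqrt(len(x))
    return x.mean() - half, x.mean() + half

print(f"Cohen's d = {cohens_d(site_a, site_b):.2f}")
print(f"Site A mean 95% CI: {mean_ci95(site_a)}")
print(f"Site B mean 95% CI: {mean_ci95(site_b)}")
```

These are the numbers you'd superimpose on the scattergram and report in the paper.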