IlyaShpitser comments on Open Thread, Apr. 27 - May 3, 2015 - Less Wrong Discussion

3 Post author: Gondolinian 27 April 2015 12:18AM

Comments (352)

Comment author: Kaj_Sotala 27 April 2015 05:07:51PM * 12 points

I managed to get my Bayes RPG into such a state that, although it still isn't that interesting as a game, it's moderately entertaining for a brief while until you master it, and seems like it should produce some actual learning.

I had this game as my MSc thesis topic as a way to force myself to work on the game, but I'm now finally starting to get to the point where a) working on it is fun enough that I don't need an external motivator and b) I'd like to actually graduate. So I'll take what I have so far, run it on a bunch of test subjects, see if they learn anything, and write up the results in my thesis. Then I'll continue working on the game in my spare time.

But I'd like to do the empirical part of the thesis properly. Since LW has a bunch of people who know a lot about statistics, I'd like to ask LW: what kinds of statistical tests would be most appropriate for measuring the results?

To elaborate on the test setup: I expect to go with the standard approach. Have some task that measures understanding of something that we want the game to teach, and split people into an intervention group and a control group. Have them complete the task first, dropping anyone who does too well in this pre-test, and then carry out the intervention (i.e. either have them play the game or do some "placebo" task, depending on their group). Then have them do a new version of the original task and see whether the intervention group has improved more than the controls have.
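A minimal sketch of that comparison, assuming each subject ends up with a numeric pre-test and post-test score: compute each subject's gain, then run a permutation test on the difference in mean gain between the two groups. The scores below are made up for illustration, and the permutation test is only one reasonable choice of test, not something settled in this thread.

```python
import random
from statistics import mean


def gain_scores(pre, post):
    """Per-subject improvement: post-test score minus pre-test score."""
    return [after - before for before, after in zip(pre, post)]


def permutation_p_value(treated, control, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in mean gains."""
    rng = random.Random(seed)
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random and recompute the difference.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treated]) - mean(pooled[n_treated:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (extreme + 1) / (n_permutations + 1)


# Made-up scores (0-10), purely illustrative; not real data.
game_pre, game_post = [2, 3, 1, 4, 2, 3], [5, 6, 3, 7, 4, 6]
ctrl_pre, ctrl_post = [3, 2, 2, 4, 1, 3], [3, 3, 2, 5, 1, 4]

game_gain = gain_scores(game_pre, game_post)
ctrl_gain = gain_scores(ctrl_pre, ctrl_post)
observed_diff, p_value = permutation_p_value(game_gain, ctrl_gain)
print(f"difference in mean gain: {observed_diff:.2f}, p = {p_value:.4f}")
```

Working with gains rather than raw post-test scores is itself an assumption; adjusting for the pre-test score in a regression is a common alternative.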

I don't want to elaborate too much on what tasks we'll give to the subjects, in case I recruit someone reading this as one of my test subjects. But you can expect the standard mammography/cancer problem to be there, since it's such a classic in the literature, though it's not what I'd expect the game in its current state to be most successful at teaching. There will also be a task on a subject I do expect the game to currently be good at teaching. Finally, there will be one task where I'd expect a bimodal distribution in whether or not the game improves performance, since the game doesn't force you to pay attention to it; I'd expect some types of players to pay attention to it and others to ignore it.

Additionally I'd like to test things like:

  • give the players a relatively challenging in-game goal and see whether completing that challenge correlates with learning results
  • ask all the players to play for at least X minutes but optionally allow them to play for longer, and see whether the amount of time spent playing has any connection to the learning results
  • after they play the game, have the players rate it on some Likert-like scales, with questions like whether they enjoyed the game, whether it was too easy or too hard, whether they'd like to play it again, etc., and again look to see whether the correlations are as expected.

So, what statistical tests to use here? I don't actually have much experience with statistics. I guess that the naive approach would be to use some (which?) form of ANOVA to test whether the means of pre-test, control intervention, and game intervention populations are the same. And then just do Spearman's correlation between every numerical item that I've collected and see whether any statistically significant items pop up. Is that fine? Neither of those tests is going to pick up on the hypothesized bimodal distribution in the improvement in one of the tasks, but I might not bother with digging too deeply into that.
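For concreteness, here is a rough sketch of what "Spearman's correlation between every numerical item" involves, written with only the standard library. The item names and values are hypothetical placeholders. The sketch computes the coefficients only (no p-values), and the Bonferroni threshold at the end is one assumed way of accounting for the number of comparisons, since testing every pair makes some spurious "significant" results likely.

```python
from itertools import combinations
from statistics import mean


def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson's r; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho is Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-subject measurements; names and numbers are placeholders.
data = {
    "gain": [3, 3, 2, 3, 0, 1],
    "minutes_played": [25, 40, 15, 30, 10, 12],
    "enjoyment": [4, 5, 3, 4, 2, 2],
}
pairs = list(combinations(data, 2))
alpha = 0.05
bonferroni_alpha = alpha / len(pairs)  # stricter per-test threshold
for a, b in pairs:
    print(f"{a} vs {b}: rho = {spearman(data[a], data[b]):.3f}")
print(f"{len(pairs)} tests, per-test threshold {bonferroni_alpha:.4f}")
```

With k items there are k(k-1)/2 pairs, so the per-test threshold shrinks quickly as more items are collected.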

Also, how do I set the threshold for how good of a performance in the pre-test indicates that the subject already knows this too well to learn anything, and should thus be ignored in the analysis? Or should I even do that in the first place?

Comment author: IlyaShpitser 29 April 2015 09:35:58AM * 2 points

what kinds of statistical tests would be most appropriate for measuring the results?

What question about your game and learning math/probability are you trying to answer?


If you want "an effect" you want a comparison of two arms. But you can have just one arm receive an intervention, and have the other be the baseline arm with no treatment at all (or just the 'background treatment' of being a college undergraduate). For example, you can take a set of undergrads, and advertise that you are testing probability aptitude or something, and then the control arm just gets the test, while the test arm gets your game and the test afterwards.

I don't know about your advisor, but I would accept a study like that.


I always found it slightly puzzling that LW folks who get into practical data analysis start with F (frequentist) methods, and not B (Bayesian). Isn't B kind of a LW "thing"?


Starting to think about measuring results via ANOVA et al. is, to me, starting at the wrong level of abstraction (I realize I may differ on this from a lot of statisticians). For example, ANOVA can test for the null. What does that null mean? Well, you are interested in some causal effect. Maybe this: E[test result | assigned to game] - E[test result | baseline undergrad].
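As a rough sketch, the sample version of that quantity is just a difference in group means, and a bootstrap gives a crude interval around it. The scores, the function names, and the choice of a percentile bootstrap are all assumptions made for illustration.

```python
import random
from statistics import mean


def effect_estimate(treated, control):
    """Sample analogue of E[score | game] - E[score | baseline]."""
    return mean(treated) - mean(control)


def bootstrap_interval(treated, control, n_resamples=5_000, level=0.95, seed=0):
    """Percentile bootstrap interval for the difference in means."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # Resample each arm separately, with replacement.
        t = rng.choices(treated, k=len(treated))
        c = rng.choices(control, k=len(control))
        estimates.append(effect_estimate(t, c))
    estimates.sort()
    lower_index = int((1 - level) / 2 * n_resamples)
    upper_index = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lower_index], estimates[upper_index]


# Made-up test scores for the two arms; not real data.
game_scores = [5, 6, 3, 7, 4, 6, 5, 7]
baseline_scores = [3, 3, 2, 5, 1, 4, 3, 2]

estimate = effect_estimate(game_scores, baseline_scores)
low, high = bootstrap_interval(game_scores, baseline_scores)
print(f"estimated effect: {estimate:.2f}, interval: [{low:.2f}, {high:.2f}]")
```

The difference in means only has the causal reading if assignment to the game arm was randomized.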

Or maybe you give them a questionnaire first, and learn how much math they have had (or even what particular classes). Maybe you want to actually look at an effect conditional on math preparation level. Does your game possibly have an 'interaction' with background math sophistication level? Then you need to model that. Then maybe, once you decide on the model, you decide how to test for the null. Or maybe you don't want the null, but the size of the effect itself. Etc., etc.
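To make 'interaction' concrete, here is a small sketch that estimates the effect separately within each math-background level and then compares the two. The two-level "low"/"high" coding, the records, and the helper names are hypothetical; a real analysis would more likely fit a regression with an interaction term.

```python
from statistics import mean

# Hypothetical records: (arm, math_background, test_score). Not real data.
records = [
    ("game", "low", 6), ("game", "low", 5), ("game", "low", 7),
    ("game", "high", 8), ("game", "high", 7), ("game", "high", 8),
    ("baseline", "low", 2), ("baseline", "low", 3), ("baseline", "low", 3),
    ("baseline", "high", 7), ("baseline", "high", 6), ("baseline", "high", 7),
]


def cell_mean(arm, background):
    """Mean score for one arm within one background level."""
    return mean(s for a, b, s in records if a == arm and b == background)


def conditional_effect(background):
    """Game-minus-baseline difference within one background level."""
    return cell_mean("game", background) - cell_mean("baseline", background)


effect_low = conditional_effect("low")
effect_high = conditional_effect("high")
# A nonzero difference between the conditional effects is the interaction.
interaction = effect_high - effect_low
print(f"effect (low): {effect_low:.2f}, effect (high): {effect_high:.2f}")
print(f"interaction: {interaction:.2f}")
```

In this made-up data the game helps the low-background group more, which is the kind of pattern a single pooled comparison would hide.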

You think about what you want first, the stats technique afterwards.

Comment author: Kaj_Sotala 08 May 2015 03:38:32PM * 0 points

What question about your game and learning math/probability are you trying to answer?

Mostly 1) do the players actually learn anything that would transfer outside the immediate game, and 2) how much (if at all) things like their enjoyment affect whether they learn.

If you want "an effect" you want a comparison of two arms. But you can have just one arm receive an intervention, and have the other be the baseline arm with no treatment at all (or just the 'background treatment' of being a college undergraduate). For example, you can take a set of undergrads, and advertise that you are testing probability aptitude or something, and then the control arm just gets the test, while the test arm gets your game and the test afterwards.

Thanks! Isn't "undergrads with only the test vs. undergrads with the game and then the test" kinda the same as "undergrads with only test vs. undergrads after the pretest and the game", though?

I always found it slightly puzzling that LW folks who get into practical data analysis start with F (frequentist) methods, and not B (Bayesian). Isn't B kind of a LW "thing"?

F is what we've been taught, and what most of our supervisors understand. I'm not really familiar with B stats.