IlyaShpitser comments on Open Thread, Apr. 27 - May 3, 2015 - Less Wrong Discussion
You are viewing a comment permalink. View the original post to see all comments and the full post content.
I managed to get my Bayes RPG into such a state that, although it still isn't that interesting as a game, it's moderately entertaining for a brief while until you master it, and seems like it should produce some actual learning.
I had this game as my MSc thesis topic as a way to force myself to work on the game, but I'm now finally starting to get to the point where a) working on it is fun enough that I don't need an external motivator and b) I'd like to actually graduate. So I'll take what I have so far, run it on a bunch of test subjects, see if they learn anything, and write up the results in my thesis. Then I'll continue working on the game in my spare time.
But I'd like to do the empirical part of the thesis properly. Since LW has a bunch of people who know a lot about statistics, I'd like to ask LW: what kinds of statistical tests would be most appropriate for measuring the results?
To elaborate on the test setup: I expect to go with the standard approach. Have some task that measures understanding of something that we want the game to teach, and split people into an intervention group and a control group. Have them complete the task first, dropping anyone who does too well in this pre-test, and then carry out the intervention (i.e. either have them play the game or do some "placebo" task, depending on their group). Then have them do a new version of the original task and see whether the intervention group has improved more than the controls have.
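For concreteness, the "improved more than the controls" comparison can be sketched as a permutation test on gain scores (post-test minus pre-test). This is only an illustration, not a recommendation of this particular test: the scores below are made up, and the function name `permutation_p_value` is hypothetical.

```python
import random
import statistics

def permutation_p_value(gains_a, gains_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in mean gain.

    Repeatedly shuffles the group labels and counts how often the shuffled
    difference is at least as large (in absolute value) as the observed one.
    """
    rng = random.Random(seed)
    observed = statistics.mean(gains_a) - statistics.mean(gains_b)
    pooled = list(gains_a) + list(gains_b)
    n_a = len(gains_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    # The +1 terms count the observed labelling itself as one permutation.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Made-up (pre-test, post-test) scores, purely for illustration.
game = [(3, 6), (4, 7), (2, 5), (5, 6), (3, 7), (4, 6)]
control = [(3, 4), (4, 4), (2, 3), (5, 5), (3, 3), (4, 5)]

game_gains = [post - pre for pre, post in game]
control_gains = [post - pre for pre, post in control]

observed, p_value = permutation_p_value(game_gains, control_gains)
print(f"difference in mean gain: {observed:.2f}, p = {p_value:.4f}")
```

A permutation test makes no normality assumption, which may matter if the improvement in one of the tasks really is bimodal.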
I don't want to elaborate too much on what tasks we'll give to the subjects, in case I recruit someone reading this to be one of my test subjects. But you can expect the standard mammography/cancer thing to be there, since it's such a classic in the literature, though it's not the thing that I'd expect the game's current state to be the most successful at teaching. There will also be a task on a subject I do expect the game to currently be good at teaching. Then there will be one task where I'd expect a bimodal distribution in whether or not the game improves it, since the game doesn't force you to pay attention to it: I'd expect some types of players to pay attention to it and others to ignore it.
Additionally I'd like to test things like:
So, what statistical tests to use here? I don't actually have much experience with statistics. I guess that the naive approach would be to use some (which?) form of ANOVA to test whether the means of pre-test, control intervention, and game intervention populations are the same. And then just do Spearman's correlation between every numerical item that I've collected and see whether any statistically significant items pop up. Is that fine? Neither of those tests is going to pick up on the hypothesized bimodal distribution in the improvement in one of the tasks, but I might not bother with digging too deeply into that.
Also, how do I set the threshold for how good a performance in the pre-test indicates that the subject already knows this too well to learn anything, and should thus be ignored in the analysis? Or should I even do that in the first place?
What question about your game and learning math/probability are you trying to answer?
If you want "an effect" you want a comparison of two arms. But only one arm needs to have an intervention; the other can just be the baseline arm with no treatment at all (or just the 'background treatment' of being a college undergraduate). For example, you can take a set of undergrads, advertise that you are testing probability aptitude or something, and then the control arm just gets the test, while the test arm gets your game and the test afterwards.
I don't know about your advisor, but I would accept a study like that.
I always found it slightly puzzling that LW folks who get into practical data analysis start with F methods, and not B. Isn't B kind of a LW "thing?"
Starting to think about measuring results via ANOVA et al is, to me, starting at the wrong level of abstraction (I realize I may differ on this from a lot of statisticians). For example, ANOVA can test for the null. What does that null mean? Well, you are interested in some causal effect. Maybe this: E[test result | assigned to game] - E[test result | baseline undergrad].
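As a rough sketch of what estimating that quantity directly might look like (rather than starting from a named test): the difference in sample means between the two arms, with a percentile bootstrap interval for its size. The scores are invented for illustration, and `mean_difference` and `bootstrap_ci` are hypothetical helper names, not anything from the study.

```python
import random
import statistics

def mean_difference(treated, control):
    """Sample analogue of E[score | game] - E[score | baseline]."""
    return statistics.mean(treated) - statistics.mean(control)

def bootstrap_ci(treated, control, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the difference in means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each arm with replacement, keeping the arm sizes fixed.
        t = rng.choices(treated, k=len(treated))
        c = rng.choices(control, k=len(control))
        diffs.append(mean_difference(t, c))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Made-up test scores, purely for illustration.
game_scores = [7, 8, 6, 9, 7, 8, 5, 9]
baseline_scores = [5, 6, 4, 7, 5, 6, 6, 4]

effect = mean_difference(game_scores, baseline_scores)
lower, upper = bootstrap_ci(game_scores, baseline_scores)
print(f"estimated effect: {effect:.2f}, 95% interval: ({lower:.2f}, {upper:.2f})")
```

Because assignment to arms is randomized, the plain difference in means is a reasonable estimate of the causal effect; the interval describes how large the effect plausibly is, which is a different question from whether the null can be rejected.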
Or maybe you give them a questionnaire first, and learn how much math they have had (or even what particular classes). Maybe you want to actually look at an effect conditional on math preparation level. Does your game possibly have an 'interaction' with background math sophistication level? Then you need to model that. Then maybe once you decide on the model, you decide how to test for the null. Or maybe you don't want the null, but the size of the effect itself. etc. etc.
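A minimal sketch of "effect conditional on math preparation level", assuming the questionnaire sorts people into just two hypothetical levels ("low" and "high") and using invented scores: compute the game-vs-baseline difference within each level, then compare those differences.

```python
import statistics
from collections import defaultdict

def conditional_effects(records):
    """Game-minus-baseline difference in mean score within each math level.

    records: iterable of (math_level, arm, score), arm is "game" or "baseline".
    """
    scores = defaultdict(list)
    for level, arm, score in records:
        scores[(level, arm)].append(score)
    levels = sorted({level for level, _ in scores})
    return {
        level: statistics.mean(scores[(level, "game")])
        - statistics.mean(scores[(level, "baseline")])
        for level in levels
    }

# Made-up data, purely for illustration.
records = [
    ("low", "game", 6), ("low", "game", 7),
    ("low", "baseline", 3), ("low", "baseline", 4),
    ("high", "game", 8), ("high", "game", 9),
    ("high", "baseline", 8), ("high", "baseline", 7),
]

effects = conditional_effects(records)
# A nonzero difference between the conditional effects is the 'interaction'.
interaction = effects["low"] - effects["high"]
print(effects, interaction)
```

With real data and more than two levels, this would more naturally be a regression with an arm-by-preparation interaction term; the stratified version is only meant to show what that term measures.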
You think about what you want first, the stats technique afterwards.
Mostly: 1) do the players actually learn anything that would transfer outside the immediate game, and 2) how much (if at all) do things like their enjoyment affect whether they learn.
Thanks! Isn't "undergrads with only the test vs. undergrads with the game and then the test" kinda the same as "undergrads with only test vs. undergrads after the pretest and the game", though?
F is what we've been taught, and what most of our supervisors understand. I'm not really familiar with B stats.