qsz comments on Open Thread, Apr. 27 - May 3, 2015 - Less Wrong Discussion
I managed to get my Bayes RPG into such a state that, although it still isn't that interesting as a game, it's moderately entertaining for a brief while until you master it, and seems like it should produce some actual learning.
I chose this game as my MSc thesis topic as a way to force myself to work on it, but I'm now finally getting to the point where a) working on it is fun enough that I don't need an external motivator, and b) I'd like to actually graduate. So I'll take what I have so far, run it by a bunch of test subjects, see if they learn anything, and write up the results in my thesis. Then I'll continue working on the game in my spare time.
But I'd like to do the empirical part of the thesis properly. Since LW has a bunch of people who know a lot about statistics, I'd like to ask LW: what kinds of statistical tests would be most appropriate for measuring the results?
To elaborate more on the test setup. I expect to go with the standard approach: have some task that measures understanding of something that we want the game to teach, and split people into an intervention group and control group. Have them complete the task first, dropping anyone who does too well in this pre-test, and then carry out the intervention (i.e. either have them play the game or do some "placebo" task, depending on their group). Then have them re-do a new version of the original task and see whether the intervention group has improved more than the controls have.
I don't want to elaborate too much on what tasks we'll give the subjects, in case I end up recruiting someone reading this as a test subject. But you can expect the standard mammography/cancer problem to be there, since it's such a classic in the literature, though it's not the thing I'd expect the game in its current state to be most successful at teaching. There will also be a task on a subject I do expect the game to currently teach well. Finally, there will be one task where I'd expect a bimodal distribution in whether or not the game improves performance, since the game doesn't force you to pay attention to the relevant mechanic: I'd expect some types of players to pay attention to it, while others ignore it.
Additionally I'd like to test things like:
So, what statistical tests should I use here? I don't actually have much experience with statistics. I guess the naive approach would be to use some form of ANOVA (which?) to test whether the means of the pre-test, control-intervention, and game-intervention populations are the same, and then compute Spearman's correlation between every pair of numerical items I've collected to see whether anything statistically significant pops up. Is that fine? Neither of those tests will pick up the hypothesized bimodal distribution in the improvement on one of the tasks, but I might not bother digging too deeply into that.
Also, how do I set the threshold for pre-test performance above which a subject is judged to already know the material too well to learn anything, and should thus be excluded from the analysis? Or should I even do that in the first place?
The standard analysis for the design you described is a mixed 2×2 factorial: test (pre-/post-test, within subjects) × intervention (yes/no, between subjects), with the interaction term providing the evidence for an intervention effect (a greater increase from pre- to post-test in the intervention condition). This is often analysed with ANOVA (participants as a random effect), though nonparametric equivalents may be more appropriate depending on your data.
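A minimal sketch of that interaction test, using the fact that in a 2×2 mixed design the test × intervention interaction is equivalent to comparing gain scores (post minus pre) between groups. The data here are simulated and all effect sizes are invented assumptions, not predictions about the actual study:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated scores on a 0-1 scale, 30 participants per group.
# All effect sizes here are invented for illustration.
n = 30
pre_control = rng.normal(0.50, 0.10, n)
pre_game = rng.normal(0.50, 0.10, n)
post_control = pre_control + rng.normal(0.02, 0.08, n)  # little change
post_game = pre_game + rng.normal(0.15, 0.08, n)        # intervention effect

# In a 2x2 mixed design, testing the test x intervention interaction
# is equivalent to comparing gain scores (post - pre) between groups.
gain_control = post_control - pre_control
gain_game = post_game - pre_game

t, p = stats.ttest_ind(gain_game, gain_control)
print(f"t = {t:.2f}, p = {p:.4g}")

# Nonparametric equivalent if the gains look non-normal:
u, p_mw = stats.mannwhitneyu(gain_game, gain_control, alternative="two-sided")
print(f"U = {u:.0f}, p = {p_mw:.4g}")
```

The gain-score t-test and the interaction F-test give the same answer here; the former is just easier to sanity-check by hand.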
More complex models are also very appropriate, e.g., adding question type as a factor/predictor rather than treating the different questions as separate dependent variables: this would provide indications of whether improvement after intervention differs for the question types, as you've predicted. This doesn't give you clues about bimodality but at least allows you to more directly test your predictions about relative degree of improvement (if the intervention works).
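One way to test whether improvement differs by question type without committing to a full mixed-model package is a permutation test on the difference-in-differences of gain scores. The question types, sample sizes and effect sizes below are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated gain scores (post - pre) for two hypothetical question types;
# the assumption (invented for illustration) is that the game helps more
# on "causal" questions than on mammography-style ones.
n = 30
gains = {
    ("control", "mammography"): rng.normal(0.02, 0.10, n),
    ("game", "mammography"): rng.normal(0.05, 0.10, n),
    ("control", "causal"): rng.normal(0.02, 0.10, n),
    ("game", "causal"): rng.normal(0.25, 0.10, n),
}

def interaction_stat(g):
    """Difference-in-differences: the game's advantage on causal
    questions minus its advantage on mammography questions."""
    return ((g[("game", "causal")].mean() - g[("control", "causal")].mean())
            - (g[("game", "mammography")].mean()
               - g[("control", "mammography")].mean()))

observed = interaction_stat(gains)

# Permutation test: shuffle group labels within each question type and
# see how often a label-shuffled interaction is as extreme as observed.
n_extreme = 0
n_perm = 2000
for _ in range(n_perm):
    perm = {}
    for qtype in ("mammography", "causal"):
        pooled = np.concatenate(
            [gains[("control", qtype)], gains[("game", qtype)]])
        rng.shuffle(pooled)
        perm[("control", qtype)] = pooled[:n]
        perm[("game", qtype)] = pooled[n:]
    if abs(interaction_stat(perm)) >= abs(observed):
        n_extreme += 1

p_value = (n_extreme + 1) / (n_perm + 1)
print(f"observed interaction = {observed:.3f}, p = {p_value:.4f}")
```

A small p-value here says the game's benefit genuinely differs between question types, which is the directional prediction described above.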
Correlations between your different dependent measures: by all means - but make sure you examine the characteristics of the distributions rather than just zooming ahead with a matrix of correlation coefficients. And be aware of the multiple comparisons problem: inflated Type I error is very likely.
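A sketch of what that correlation pass might look like with a guard against the multiple comparisons problem: Spearman's rho over every pair of measures, with a simple Bonferroni correction. The measure names are made up, and the variables are independent by construction, so little should survive correction:

```python
import itertools
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical per-participant measures (names invented for illustration).
n = 40
data = {
    "pretest_score": rng.normal(0.5, 0.1, n),
    "gain": rng.normal(0.1, 0.1, n),
    "play_time_min": rng.normal(30, 10, n),
    "age": rng.normal(28, 6, n),
}

pairs = list(itertools.combinations(data, 2))
results = []
for a, b in pairs:
    rho, p = stats.spearmanr(data[a], data[b])
    results.append((a, b, rho, p))

# Bonferroni correction: multiply each raw p-value by the number of tests.
m = len(pairs)
for a, b, rho, p in results:
    p_corr = min(p * m, 1.0)
    flag = "significant" if p_corr < 0.05 else "n.s."
    print(f"{a} vs {b}: rho = {rho:+.2f}, corrected p = {p_corr:.3f} ({flag})")
```

Bonferroni is conservative; with many measures a Holm or false-discovery-rate correction is a common less punishing alternative.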
Excluding participants on the basis of overly high pre-test performance is appropriate. If possible, set this criterion before formal testing (even an educated guess is fine, as it doesn't harm the conclusions you can draw: it can be justified as leaving room for improvement if the intervention works) - or at the very least, set it before analysing anything else about the participants' performance, to avoid biasing your decision about where the threshold sits.
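In practice the preregistered rule can be as simple as a fixed ceiling written down before any analysis. A minimal sketch - the 0.8 cutoff is purely an illustrative guess, not a recommendation:

```python
# Preregistered exclusion rule, fixed before looking at any other data.
# The 0.8 ceiling is an illustrative guess, not a recommendation.
CEILING = 0.8

def eligible(pre_scores, ceiling=CEILING):
    """Indices of participants kept for analysis: those whose pre-test
    score leaves room for improvement."""
    return [i for i, score in enumerate(pre_scores) if score <= ceiling]

scores = [0.45, 0.90, 0.60, 0.85, 0.30]
print(eligible(scores))  # -> [0, 2, 4]
```

Writing the rule down as code (or in a preregistration document) before data collection makes it easy to show the threshold wasn't chosen after peeking at the results.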
I'm afraid you've said too much already - and if you're looking for people who are naive about the principles involved, LW is probably not a great place for recruiting anyway.
Please feel free to private message me if you'd like clarification of anything I've posted - this sort of thing is very much part of my day job.
Thanks a lot!
Could you elaborate on that? Something like "so we're going to test the impact of traditional instruction versus this prototype educational game on your ability to do these tasks" is what I'd have expected to say to the test subjects anyway, and that's mostly the content of what I said here. (Though I do admit that the bit about expecting a bimodal distribution depending on whether or not the subjects pay attention to something was a bit of an unnecessary tipoff here.)
In particular, I expect to have a tradeoff - I can tell people even less than that, and get a much smaller group of testers. Or I can tell people that I've gotten the game I've been working on to a very early prototype stage and am now looking for testers, and advertise that on e.g. LW, and get a much bigger group of test subjects.
It's true that LW-people are much more likely to be able to e.g. solve the mammography example already, but I'd still expect most users to be relatively unfamiliar with the technicalities of causal networks - I was too, until embarking on this project.
I was thinking more about your previous posts on the subject (your development of the game and some of the ideas behind it) - for the same general reason I'd avoid testing people from my extended lab network, who may not know any details of a current study but have a sufficiently clear impression of what I'm interested in to potentially influence the outcomes (whether intentionally, by "helping me out", or implicitly).
When rolling it out for testing, you could always include a post-test questionnaire that probes people's previous experience (e.g. what they knew in advance about your work and the ideas behind it) and exclude people who report knowing "too much" about the motivations of the study. You could even prompt for some info about LW participation, which could also be used to mitigate this issue (especially if you end up with decent samples both inside and outside LW).
Ah, that's a good point. And a good suggestion, too.