Open thread, Jul. 11 - Jul. 17, 2016

MrMind

If it's worth saying, but not worth its own post (even in Discussion), then it goes here.

Notes for future OT posters:

1. Please add the 'open_thread' tag.

2. Check if there is an active Open Thread before posting a new one. (Immediately before; refresh the list-of-threads page before posting.)

3. Open Threads should be posted in Discussion, and not Main.

4. Open Threads should start on Monday, and end on Sunday.

I also think that a QS dataset might be more interesting than compressing the GSS. Promotion-wise, I think it could be promoted via the QS website (I might still have posting privileges, or I could simply ask; I doubt people would have a problem).

I dunno if it would be all that interesting. If someone wants to work on predictive modeling of datasets every week or month in a tournament format, they can just use Kaggle (and win with XGBoost or a residual network, likely). I have fat/muscle/weight data on myself from an Omron scale going back 2 years with multiple measurements on most days; this is a reasonably interesting dataset because one can quantify measurement error, the variables are interrelated with one or two latent variables, there are definite nontrivial time trends, and it's easy to generate holdout data (if the tournament runs 1 month, then there's an additional 1 month of data which no one, including the organizer, had access to, and which can be used to score contributions at the end) - but I doubt anyone would bother participating. I have an even bigger QS dataset incorporating all my recorded data of all kinds at a daily granularity, somewhere around 100+ summary variables, but the missingness is so high that it would be unpleasant to work with (I've been having a great deal of difficulty just getting lavaan/blavaan to run on it) and likewise I doubt there would be much interest in a competition. There needs to be some sort of incentive: either prizes, inherently interesting data, or some important intellectual/scientific point to it. Kaggle competitions with a lot of participants have big prizes or sexy datasets like the Higgs boson or whales.
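For anyone curious what "quantify measurement error" and "generate holdout data" could look like in practice, here is a rough Python sketch. It is not the analysis described above: the function names, the `(day, value)` layout, and the numbers are all made up for illustration. It also assumes that same-day repeat readings can be treated as replicates, which ignores real within-day changes in weight.

```python
from collections import defaultdict
from datetime import date
from math import sqrt


def within_day_sd(readings):
    """Pooled within-day standard deviation of repeated measurements.

    readings: iterable of (day, value) pairs. Days with a single reading
    are skipped, since they say nothing about repeat-measurement error.
    Assumption: same-day readings are replicates of one true value.
    """
    by_day = defaultdict(list)
    for day, value in readings:
        by_day[day].append(value)

    sum_sq = 0.0  # squared deviations from each day's own mean
    dof = 0       # pooled degrees of freedom: sum of (n_day - 1)
    for values in by_day.values():
        if len(values) < 2:
            continue
        mean = sum(values) / len(values)
        sum_sq += sum((v - mean) ** 2 for v in values)
        dof += len(values) - 1
    if dof == 0:
        raise ValueError("need at least one day with repeated readings")
    return sqrt(sum_sq / dof)


def split_by_date(readings, cutoff):
    """Time-based holdout: rows before `cutoff` train, the rest score entries."""
    train = [r for r in readings if r[0] < cutoff]
    holdout = [r for r in readings if r[0] >= cutoff]
    return train, holdout


# Made-up illustrative numbers, not real scale data.
example = [
    (date(2016, 6, 30), 80.0), (date(2016, 6, 30), 80.4),
    (date(2016, 7, 1), 79.8), (date(2016, 7, 1), 80.2),
    (date(2016, 7, 2), 80.1),
]

noise_sd = within_day_sd(example)
train, holdout = split_by_date(example, date(2016, 7, 1))
```

In a tournament, the cutoff would simply be the start date, so the holdout rows do not exist yet when entries are submitted.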

There needs to be some sort of incentive: either prizes, inherently interesting data, or some important intellectual/scientific point to it

I think there is a scientific point for those QS datasets that can be measured automatically at a high level of granularity. Very frequently people record less data because they don't want to store everything that a single sensor can produce.

Currently accelerometer data gets compressed into the variable of "steps". That variable has the advantage of an intuitive meaning, but it's likely not the best possible variable to gather when doing scientific work on how Pokemon Go leads people to exercise more.
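To make the information loss concrete, here is a toy Python sketch. The threshold-crossing step counter is a deliberately naive stand-in, not how any real tracker works, and the function names, the 1.2 g threshold, and the sample values are all invented for illustration.

```python
from math import sqrt


def magnitudes(samples):
    """Acceleration magnitude (in g) for each (x, y, z) sample."""
    return [sqrt(x * x + y * y + z * z) for x, y, z in samples]


def count_steps(mags, threshold=1.2):
    """Naive step count: number of upward crossings of `threshold`."""
    steps = 0
    above = False
    for m in mags:
        if m > threshold and not above:
            steps += 1
        above = m > threshold
    return steps


def window_summary(mags):
    """A few richer per-window features than a bare step count."""
    n = len(mags)
    mean = sum(mags) / n
    variance = sum((m - mean) ** 2 for m in mags) / n
    return {"mean": mean, "variance": variance, "peak": max(mags)}


# Two invented windows: a gentle walk and a much more vigorous movement.
walk = [(0, 0, 1.0), (0, 0, 1.3), (0, 0, 1.0), (0, 0, 1.3), (0, 0, 1.0)]
run = [(0, 0, 1.0), (0, 0, 2.5), (0, 0, 1.0), (0, 0, 2.5), (0, 0, 1.0)]

walk_steps = count_steps(magnitudes(walk))
run_steps = count_steps(magnitudes(run))
walk_stats = window_summary(magnitudes(walk))
run_stats = window_summary(magnitudes(run))
```

Both windows produce the same step count, while the peak and variance differ a lot, which is the kind of intensity information that a "steps" variable throws away.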