Daniel_Burfoot comments on Request for suggestions: ageing and data-mining - Less Wrong

Post author: bokov 24 November 2014 11:38PM




Comment author: Daniel_Burfoot 25 November 2014 04:09:50AM 4 points

Research access to large amounts of anonymized patient data.

Take all the data you have, come up with some theory to describe it, build the scheme into a lossless data compressor, and invoke it on the data set. Write down the compression rate you achieve, and then try to do better. And better. And better. This goal will force you to systematically improve your understanding of the data.

(Note that transforming a sufficiently well specified statistical model into a lossless data compressor is a solved problem, and the solution is called arithmetic encoding - I can give you my implementation, or you can find one on the web. So what I'm really suggesting is just that you build statistical models of the raw data, and try systematically to improve those models).
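A minimal sketch of the model-to-compressor link described above: a model that assigns probability p(x) to symbol x yields an ideal code length of -log2 p(x) bits, which is what arithmetic coding achieves in practice. The data string and the two toy models here are illustrative, not from the thread.

```python
import math
from collections import Counter

def code_length_bits(data, model_probs):
    # Ideal lossless code length under a model: sum of -log2 p(x).
    # An arithmetic coder attains this total to within about 2 bits.
    return sum(-math.log2(model_probs[x]) for x in data)

data = "aababaaabbaaaaab"

# Model 1: uniform over the two symbols (knows nothing about the data).
uniform = {"a": 0.5, "b": 0.5}

# Model 2: empirical symbol frequencies (a slightly better "theory").
counts = Counter(data)
empirical = {s: c / len(data) for s, c in counts.items()}

print(code_length_bits(data, uniform))    # 16.0 bits
print(code_length_bits(data, empirical))  # ≈14.34 bits: better model, shorter code
```

Improving the model (e.g. conditioning each symbol on its predecessor) lowers the bit count further, which is exactly the "do better, and better" loop.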

Comment author: Punoxysm 26 November 2014 03:26:41PM 2 points

Would anyone want to literally do this on something as complex as patient data?

If not, why not just say: try to come up with models as good as you can?

Pick a couple of quantities of interest and try to model them as accurately as you can.

Comment author: Daniel_Burfoot 26 November 2014 04:55:57PM 3 points

There is a problem that some data may really fundamentally be a distraction, and so modeling it is just a waste of time.

But it is very hard to tell ahead of time whether or not a piece of data is going to be relevant to a downstream analysis. As an example, in my work on text analysis, the issue of capitalization takes an amount of effort out of all proportion to how interesting it seems. It is tempting to just throw away caps information by lowercasing everything. But capitalization actually carries clues that are relevant to parsing and other analysis - in particular, it lets you identify acronyms, which usually stand for proper nouns.
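A minimal sketch of the capitalization point: even a crude pattern over uppercase tokens recovers acronym candidates that lowercasing would destroy. The regex and the example sentence are illustrative, not from any particular parser.

```python
import re

def find_acronym_candidates(text):
    # All-caps tokens of length >= 2 are likely acronyms (proper nouns).
    # This signal is irrecoverable if the text is lowercased first.
    return re.findall(r"\b[A-Z]{2,}\b", text)

print(find_acronym_candidates("The NASA probe passed the ESA station."))
# ['NASA', 'ESA']
```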

Comment author: bokov 03 December 2014 04:24:50PM 1 point

(Note that transforming a sufficiently well specified statistical model into a lossless data compressor is a solved problem, and the solution is called arithmetic encoding - I can give you my implementation, or you can find one on the web.)

The unsolved problems are the ones hiding behind the token "sufficiently well specified statistical model".

That said, thanks for the pointer to arithmetic encoding, that may be useful in the future.

Comment author: Username 25 November 2014 10:12:45PM 1 point

This sounds a lot like advice that was given in response to a question in the open thread about how to go about a master's thesis. I can't find it, but I endorse the recommendation. Immerse yourself in the data. Attack it from different angles and try to compress it down as much as possible. The idea behind the advice is that if you understand the mechanics of the process that generated the data, you can represent the data far more compactly (imagine an image of a circle encoded as SVG instead of a bitmap or PNG).

There are two caveats: 1) You can't eliminate noise, of course. 2) You are limited by your data set(s). For the former, there are enough tools to separate the noise from the signal and quantify it. For the latter, you could join in external data sets; your modelling might suggest which ones would improve your compression. E.g. try to link in SNP databases.
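As a sketch of the first caveat, one standard way to separate and quantify noise is to fit a model and measure the residual spread: whatever the model explains is signal, and the residual standard deviation estimates the noise floor. The linear trend and noise level here are made up for illustration.

```python
import random
import statistics

random.seed(0)

# Toy data: a deterministic trend ("signal") plus Gaussian noise (sd = 3).
xs = list(range(100))
ys = [2.0 * x + random.gauss(0, 3.0) for x in xs]

# Fit the trend (least-squares slope through the origin)...
slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
residuals = [y - slope * x for x, y in zip(xs, ys)]

# ...then quantify what the model cannot explain: the noise floor.
print(round(slope, 2), round(statistics.stdev(residuals), 2))
```

The recovered slope is close to the true 2.0, and the residual spread is close to the injected noise level; no model improvement can compress past that floor.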