Daniel_Burfoot comments on Request for suggestions: ageing and data-mining - Less Wrong Discussion

14 Post author: bokov 24 November 2014 11:38PM

Comment author: Daniel_Burfoot 26 November 2014 04:55:57PM 3 points

One problem is that some data may be fundamentally a distraction, in which case modeling it is just a waste of time.

But it is very hard to tell ahead of time whether a piece of data is going to be relevant to a downstream analysis. As an example, in my work on text analysis, handling capitalization takes a disproportionate amount of effort relative to how interesting it seems. It is tempting to just throw away caps information by lowercasing everything. But capitalization carries clues that are relevant to parsing and other analysis. In particular, it lets you identify acronyms, which usually stand for proper nouns.
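
To make the acronym point concrete, here is a minimal Python sketch. The regex heuristic (a run of two or more uppercase ASCII letters counts as an acronym candidate), the function name `find_acronyms`, and the sample sentence are all assumptions made for illustration, not the method used in the work described above.

```python
import re

# Assumed heuristic: a standalone token of 2+ uppercase ASCII letters
# is treated as an acronym candidate. Real text needs more care
# (e.g. headlines in all caps, acronyms with digits or periods).
ACRONYM_RE = re.compile(r"\b[A-Z]{2,}\b")


def find_acronyms(text):
    """Return acronym candidates in order of appearance."""
    return ACRONYM_RE.findall(text)


sample = "The WHO and NASA published a joint report on who funds research."

print(find_acronyms(sample))          # ['WHO', 'NASA']
print(find_acronyms(sample.lower()))  # []
```

After lowercasing, the organization "WHO" becomes indistinguishable from the pronoun "who", which is exactly the kind of signal a downstream parser loses.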