Daniel_Burfoot comments on Request for suggestions: ageing and data-mining - Less Wrong Discussion
You are viewing a comment permalink. View the original post to see all comments and the full post content.
Comments (48)
One problem is that some data may be fundamentally a distraction, in which case modeling it is just a waste of time.
But it is very hard to tell ahead of time whether a piece of data is going to be relevant to a downstream analysis. As an example, in my work on text analysis, handling capitalization takes a lot of effort relative to how interesting it seems. It is tempting to just throw away the caps information by lowercasing everything. But capitalization actually carries clues that are relevant to parsing and other analysis - in particular, it allows you to identify acronyms, which usually stand for proper nouns.
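To make the acronym point concrete, here is a minimal Python sketch. The heuristic (any run of two or more uppercase ASCII letters counts as a candidate acronym) and the names `ACRONYM_RE` and `find_acronyms` are my own illustrative assumptions, not a description of any particular text-analysis system. It only shows that the signal is present in the original text and gone after lowercasing.

```python
import re
from typing import List

# Assumed heuristic: a standalone token of 2+ uppercase ASCII letters
# is treated as a candidate acronym. Real text needs a more careful rule.
ACRONYM_RE = re.compile(r"\b[A-Z]{2,}\b")


def find_acronyms(text: str) -> List[str]:
    """Return candidate acronyms in the order they appear in the text."""
    return ACRONYM_RE.findall(text)


sentence = "The WHO and NASA asked who would fund the study."

# With the original casing, "WHO" is distinguishable from the pronoun "who".
print(find_acronyms(sentence))

# After lowercasing, the same heuristic has nothing left to work with.
print(find_acronyms(sentence.lower()))
```

In this toy sentence the lowercased version makes "WHO" indistinguishable from "who", which is the kind of downstream cost that is easy to miss when the caps information is discarded up front.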