gwern comments on Open Thread, September, 2010-- part 2 - Less Wrong

3 Post author: NancyLebovitz 17 September 2010 01:44AM

Comment author: gwern 26 September 2010 02:33:54AM 2 points

My general thought is that so little data is needed to identify you that the dataset can be enormously noisy and still identify you. And if your fake data is just randomly generated, isn't noise all that it is?

(I saw a paper, about medical datasets I think, showing that you can't successfully anonymize the data and still have a useful dataset. I don't have it handy, but it's not hard to find people saying the same thing about the Netflix dataset: http://33bits.org/2010/03/15/open-letter-to-netflix/ )
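
To make the "noisy data still identifies you" point concrete, here is a toy simulation. It is only a sketch under invented assumptions: each person is a string of independent random bits, the noise flips each bit independently, and the attacker matches by nearest Hamming distance. None of the names or parameters come from a real dataset or from the linked post.

```python
import random


def make_population(n_users, n_attrs, rng):
    """Each hypothetical user is a list of random bits (a toy 'record')."""
    return [[rng.randint(0, 1) for _ in range(n_attrs)] for _ in range(n_users)]


def add_noise(record, flip_prob, rng):
    """Flip each bit independently with probability flip_prob."""
    return [bit ^ 1 if rng.random() < flip_prob else bit for bit in record]


def hamming(a, b):
    """Number of positions where two equal-length records differ."""
    return sum(x != y for x, y in zip(a, b))


def best_match(noisy, population):
    """Index of the record closest to `noisy` in Hamming distance."""
    return min(range(len(population)), key=lambda i: hamming(noisy, population[i]))


def reidentification_rate(n_users=1000, n_attrs=200, flip_prob=0.2,
                          trials=20, seed=0):
    """Fraction of trials in which the noisy record maps back to its owner."""
    rng = random.Random(seed)
    population = make_population(n_users, n_attrs, rng)
    hits = 0
    for _ in range(trials):
        target = rng.randrange(n_users)
        noisy = add_noise(population[target], flip_prob, rng)
        if best_match(noisy, population) == target:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(reidentification_rate())
```

With a fifth of the bits flipped, the noisy record still sits much closer to its owner (about 20% of bits differ) than to an unrelated person (about 50% differ), so in this toy setup the nearest match tends to be the right one. Real records are not independent coin flips, so treat this as an illustration of the intuition, not as evidence about any actual dataset.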

Comment author: [deleted] 26 September 2010 02:41:41AM 2 points

I've heard about the medical datasets.

Noise is a pretty interesting thing, and the possibility of "denoising" depends a lot on the kind of noise. White noise is the easiest to get rid of; malicious noise, which isn't random but targeted to be "worst-case," can thwart denoising methods that were designed for white noise.
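
A minimal sketch of that difference, under assumptions of my own: the "signal" is a single constant measured many times, the denoiser is plain averaging (a method that suits zero-mean white noise), and the "malicious" noise is modeled as an adversary who pushes every sample in the same direction by the same amount. The function names and parameters are hypothetical.

```python
import random


def mean_estimate(samples):
    """Denoise by averaging, which suits zero-mean random noise."""
    return sum(samples) / len(samples)


def white_noise_samples(true_value, sigma, n, rng):
    """Samples corrupted by independent zero-mean Gaussian noise."""
    return [true_value + rng.gauss(0.0, sigma) for _ in range(n)]


def worst_case_samples(true_value, sigma, n):
    """Samples corrupted by a targeted shift: every error is +sigma."""
    return [true_value + sigma for _ in range(n)]


def compare(true_value=5.0, sigma=1.0, n=10_000, seed=0):
    """Return the averaging error under white noise and under targeted noise."""
    rng = random.Random(seed)
    white = white_noise_samples(true_value, sigma, n, rng)
    worst = worst_case_samples(true_value, sigma, n)
    white_err = abs(mean_estimate(white) - true_value)
    worst_err = abs(mean_estimate(worst) - true_value)
    return white_err, worst_err


if __name__ == "__main__":
    white_err, worst_err = compare()
    print("error after averaging, white noise:   ", white_err)
    print("error after averaging, targeted noise:", worst_err)
```

The per-sample noise has the same typical size in both cases, but averaging shrinks the white-noise error roughly like sigma divided by the square root of the sample count, while the targeted shift survives averaging untouched.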