Open Thread, September, 2010-- part 2

NancyLebovitz

This thread is for the discussion of Less Wrong topics that have not appeared in recent posts. If a discussion gets unwieldy, celebrate by turning it into a top-level post.

My general thought is that so little data is needed to identify you that the dataset can be enormously noisy and still identify you. And if your fake data is just randomly generated, isn't noise all it is?

(I saw a paper, about medical datasets I think, showing that you can't successfully anonymize the data and still have a useful dataset. I don't have it handy, but it's not hard to find people saying the same thing about the Netflix dataset, namely that it can't be done: http://33bits.org/2010/03/15/open-letter-to-netflix/ )
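To make the intuition concrete, here is a toy sketch of my own (not the method from any paper, and not how the Netflix work was done). Everything in it is a made-up assumption: the population size, the number of attributes, the 30% noise rate, and the simple "count the agreements" matcher. It only illustrates that a record can stay linkable after a lot of random noise is mixed in.

```python
import random

rng = random.Random(42)

# All of these sizes are hypothetical, chosen only for illustration.
N_PEOPLE, N_ATTRS, N_VALUES = 2000, 30, 10
NOISE_RATE = 0.3  # chance that each released attribute is replaced at random

# "True" records: one list of attribute values per person.
true_records = [[rng.randrange(N_VALUES) for _ in range(N_ATTRS)]
                for _ in range(N_PEOPLE)]

def add_noise(record):
    """Replace each attribute with a random value with probability NOISE_RATE."""
    return [rng.randrange(N_VALUES) if rng.random() < NOISE_RATE else value
            for value in record]

# The "anonymized" release: same people, same order, noisy attributes.
released = [add_noise(record) for record in true_records]

def agreements(a, b):
    """Count positions where two records hold the same value."""
    return sum(1 for x, y in zip(a, b) if x == y)

def reidentify(known_record):
    """Return the index of the released record that best matches what we know."""
    scores = [agreements(known_record, record) for record in released]
    return max(range(N_PEOPLE), key=scores.__getitem__)

target = 123  # arbitrary person whose true attributes the attacker knows
guess = reidentify(true_records[target])
print("target:", target, "best match:", guess)
```

In this toy setup the target's noisy record still agrees with the truth on most attributes, while an unrelated record agrees on only about one in ten, so the gap survives the noise. Real datasets are sparser and messier, so treat this as an intuition pump rather than evidence.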

I've heard about the medical datasets.

Noise is a pretty interesting thing, and the possibility of "denoising" depends a lot on the kind of noise. White noise is the easiest to get rid of; malicious noise, which isn't random but targeted to be "worst-case," can thwart denoising methods that were designed for white noise.
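A minimal sketch of that difference, under assumptions I'm making up for illustration: the clean signal is a constant, the denoiser is a plain moving average, and the "malicious" noise is just the worst case for that particular filter (every point pushed the same way, within the same amplitude bound as the white noise). None of this comes from a specific paper.

```python
import random

def moving_average(xs, window):
    """Simple denoiser: replace each point by the mean of its neighbourhood."""
    half = window // 2
    out = []
    for i in range(len(xs)):
        lo, hi = max(0, i - half), min(len(xs), i + half + 1)
        out.append(sum(xs[lo:hi]) / (hi - lo))
    return out

def mean_abs_error(xs, truth):
    """Average absolute distance between an estimate and the clean signal."""
    return sum(abs(x - t) for x, t in zip(xs, truth)) / len(xs)

rng = random.Random(0)
n = 1000
amplitude = 0.5          # both kinds of noise stay within +/- amplitude
truth = [1.0] * n        # hypothetical clean signal

# White noise: independent, zero-mean draws, so averaging cancels most of it.
white = [t + rng.uniform(-amplitude, amplitude) for t in truth]

# Targeted noise: same bound, but every point is shifted in the same direction,
# which is exactly what a moving average cannot remove.
targeted = [t + amplitude for t in truth]

white_err = mean_abs_error(moving_average(white, 51), truth)
targeted_err = mean_abs_error(moving_average(targeted, 51), truth)

print("error after denoising white noise:   ", white_err)
print("error after denoising targeted noise:", targeted_err)
```

The averaging filter shrinks the white-noise error to a small fraction of the amplitude, but leaves the targeted shift completely intact. A different denoiser would have a different worst case; the point is only that noise built against the method behaves nothing like random noise.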