Comment author:[deleted]
26 September 2010 02:02:19AM
7 points
So, given that we've got a high concentration of technical people around here, maybe someone can answer this for me:
Could it ever be possible to do some kind of counter-data mining?
Everybody has some publicly-available info on the internet -- information that, in general, we actually want to be publicly available. I have an online presence, sometimes under my real name and sometimes under aliases, and I wouldn't want to change that.
But data mining is, of course, a potential privacy nightmare. There are algorithms that can tell if you're gay from your Facebook page, and reassemble your address and social security number by aggregating apparently innocuous web content. There's even a tool (www.recordedfuture.com) that purportedly helps clients like the CIA predict subjects' future movements. But so far, I've never heard of attempts to make data mining harder for the snoops. I'm not talking about advice like "Don't put anything online you wouldn't want in the newspaper." I'm interested in technical solutions -- the equivalent of cryptography.
It's a pipe dream, but it might not be impossible. Here's Wikipedia background, with good additional references, on nonlinear dimensionality reduction, which is one of my academic interests. (http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction)
These techniques involve taking a cloud of points in a high-dimensional space, and deciphering the low-dimensional manifold on which they lie. In other words, extracting salient information from data. And there are standard manifolds where various techniques are known to fail -- it's hard for algorithms to recognize the "swiss roll," for instance.
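To make the "swiss roll" concrete, here is a minimal sketch in plain Python (my own illustration, not taken from the linked article). It assumes a simple spiral of radius t at angle t, and the function names are made up for the example. It compares the straight-line distance between two points on adjacent layers of the roll with the distance measured along the roll itself. Any method that trusts straight-line distance will treat those two points as neighbours even though they are far apart on the manifold.

```python
import math

def swiss_roll_point(t, height=0.0):
    """Map an angle t (radians) to a point on a 3-D spiral of radius t."""
    return (t * math.cos(t), height, t * math.sin(t))

def arc_length(t0, t1, steps=10_000):
    """Approximate the distance along the spiral between angles t0 and t1
    by summing many short straight segments."""
    total = 0.0
    prev = swiss_roll_point(t0)
    for i in range(1, steps + 1):
        t = t0 + (t1 - t0) * i / steps
        cur = swiss_roll_point(t)
        total += math.dist(prev, cur)
        prev = cur
    return total

# Two points exactly one full turn apart: they sit on adjacent layers.
t_a = 2.0 * math.pi
t_b = 4.0 * math.pi
a = swiss_roll_point(t_a)
b = swiss_roll_point(t_b)

straight = math.dist(a, b)    # distance through the surrounding 3-D space
along = arc_length(t_a, t_b)  # distance measured on the roll itself

print(f"straight-line distance: {straight:.2f}")
print(f"distance along the roll: {along:.2f}")
```

Under these assumptions the path along the roll comes out several times longer than the straight-line gap, which is the kind of mismatch a counter-data miner would want to create on purpose.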
These hard cases are disappointments for the data miner, but they ought to be opportunities for the counter-data miner, right? Could it be possible to exploit the hard cases to make it more difficult for the snoops? One practical example of something like this already exists: the distorted letters in a CAPTCHA are "hard cases" for automated image recognition software.
Does anybody have thoughts on this?
I write data mining software professionally, and one weakness that comes to mind is the deduplication process. In order to combine data from different sources, the software has to determine which entries correspond to the same person. It does this by looking for common elements with a low false positive rate. If two records share a phone number, email address, site plus account name, social security number, or name-address pair, they almost certainly belong to the same person, so they will be combined. The matching is applied transitively, so if A has the same phone number as B and B has the same email address as C, then A, B, and C will all be assumed to be the same person.
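As a rough illustration of that transitive merging, here is a small sketch using a union-find structure. This is only one common way to do it, not a description of any particular product. The field names (`phone`, `email`), the record ids, and the `deduplicate` function are all made up for the example.

```python
from collections import defaultdict

class UnionFind:
    """Minimal disjoint-set structure keyed by arbitrary hashable ids."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

def deduplicate(records, key_fields=("phone", "email")):
    """Merge records that share any key field value, transitively."""
    uf = UnionFind()
    first_seen = {}  # (field, value) -> id of the first record carrying it
    for rec_id, rec in records.items():
        uf.find(rec_id)  # register the record even if it matches nothing
        for field in key_fields:
            value = rec.get(field)
            if value is None:
                continue
            key = (field, value)
            if key in first_seen:
                uf.union(first_seen[key], rec_id)
            else:
                first_seen[key] = rec_id
    clusters = defaultdict(set)
    for rec_id in records:
        clusters[uf.find(rec_id)].add(rec_id)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical records: A and B share a phone number, B and C share an email.
records = {
    "A": {"phone": "555-0100", "email": None},
    "B": {"phone": "555-0100", "email": "b@example.com"},
    "C": {"phone": None, "email": "b@example.com"},
    "D": {"phone": "555-0199", "email": "d@example.com"},
}
clusters = deduplicate(records)
print(clusters)
```

A and C have no field in common, yet they end up in the same cluster because B links them. D shares nothing with anyone and stays on its own.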
You can subvert this by creating records that match two different people at once, for example a record with one person's phone number and another person's email address. If a data source contains too many entries like this, it's useless unless there's an easy way to filter them out. If it contains just a few, data miners are likely to get confused. Note that this is not necessarily a good idea, since having a computerized bureaucracy be confused about your identity can have very inconvenient consequences. It is also possible to detect and defeat this strategy by looking for deduplications with strange results, but this is tricky in practice, since people often really do have multiple names (maiden names, alternate spellings), phone numbers, email addresses, etc.
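Here is a toy sketch of both halves of that: a single decoy record fusing two people, and a crude check for "strange results". Everything in it is invented for illustration, including the names, the records, and the `max_distinct=2` threshold. A real pipeline would need something much less naive.

```python
def link_clusters(records, key_fields=("phone", "email")):
    """Group record ids that share any key field value, transitively."""
    cluster_of = {rec_id: {rec_id} for rec_id in records}
    owner = {}  # (field, value) -> a record id already holding that value
    for rec_id, rec in records.items():
        for field in key_fields:
            value = rec.get(field)
            if value is None:
                continue
            other = owner.setdefault((field, value), rec_id)
            if cluster_of[other] is not cluster_of[rec_id]:
                merged = cluster_of[other] | cluster_of[rec_id]
                for member in merged:
                    cluster_of[member] = merged
    unique = {frozenset(c) for c in cluster_of.values()}
    return sorted(sorted(c) for c in unique)

def looks_suspicious(cluster, records, field="name", max_distinct=2):
    """Flag a cluster that carries more distinct values than one person
    plausibly has. The threshold is an arbitrary assumption."""
    values = {records[r][field] for r in cluster
              if records[r].get(field) is not None}
    return len(values) > max_distinct

# Hypothetical data: Alice appears twice (maiden and married name), Bob once.
genuine = {
    "alice1": {"name": "Alice Smith", "phone": "555-0100",
               "email": "alice@example.com"},
    "alice2": {"name": "Alice Jones", "phone": "555-0100", "email": None},
    "bob1": {"name": "Bob Lee", "phone": "555-0199",
             "email": "bob@example.com"},
}
# The decoy borrows Alice's phone number and Bob's email address.
poisoned = dict(genuine)
poisoned["decoy"] = {"name": "Carol Doe", "phone": "555-0100",
                     "email": "bob@example.com"}

before = link_clusters(genuine)
after = link_clusters(poisoned)
print(before)
print(after)
print(looks_suspicious(after[0], poisoned))
```

The name-count check catches the fused cluster here while leaving Alice's two legitimate names alone, but only because the threshold happens to fit this toy data. Someone with three genuine name variants would be flagged too, which is the practical difficulty described above.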