wedrifid comments on Open Thread, September, 2010-- part 2 - Less Wrong

3 Post author: NancyLebovitz 17 September 2010 01:44AM

Comments (858)

Comment author: [deleted] 26 September 2010 02:02:19AM 7 points [-]

So, given that we've got a high concentration of technical people around here, maybe someone can answer this for me:

Could it ever be possible to do some kind of counter-data mining?

Everybody has some publicly-available info on the internet -- information that, in general, we actually want to be publicly available. I have an online presence, sometimes under my real name and sometimes under aliases, and I wouldn't want to change that.

But data mining is, of course, a potential privacy nightmare. There are algorithms that can tell if you're gay from your Facebook page, and reassemble your address and social security number by aggregating apparently innocuous web content. There's even a tool (www.recordedfuture.com) that purportedly helps clients like the CIA predict subjects' future movements. But so far, I've never heard of attempts to make data mining harder for the snoops. I'm not talking about advice like "Don't put anything online you wouldn't want in the newspaper." I'm interested in technical solutions -- the equivalent of cryptography.

It's a pipe dream, but it might not be impossible. Here's Wikipedia background, with good additional references, on nonlinear dimensionality reduction techniques, which are one of my academic interests (http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction). These techniques involve taking a cloud of points in a high-dimensional space and deciphering the low-dimensional manifold on which they lie. In other words, extracting salient information from data. And there are standard manifolds where various techniques are known to fail -- it's hard for algorithms to recognize the "Swiss roll," for instance.
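To make the Swiss roll point concrete, here is a minimal sketch using only the Python standard library. It does not run any dimensionality reduction algorithm; it only shows the geometric trap, assuming a simple spiral parameterisation (the function names and the choice of angles are mine, not from any library). Two points on adjacent layers of the roll are close in a straight line but far apart when you walk along the surface, which is the kind of structure that misleads a method relying on straight-line distances.

```python
import math

def swiss_roll_point(t, height=0.0):
    """Map the intrinsic coordinate t (angle along the spiral) to 3-D."""
    return (t * math.cos(t), height, t * math.sin(t))

def euclidean(p, q):
    """Ordinary straight-line distance between two 3-D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distance_along_roll(t0, t1, steps=10_000):
    """Approximate the arc length along the spiral by summing short chords."""
    total = 0.0
    prev = swiss_roll_point(t0)
    for i in range(1, steps + 1):
        cur = swiss_roll_point(t0 + (t1 - t0) * i / steps)
        total += euclidean(prev, cur)
        prev = cur
    return total

# Two points exactly one full turn apart, so they sit on adjacent layers.
t_inner = 2 * math.pi
t_outer = 4 * math.pi

straight = euclidean(swiss_roll_point(t_inner), swiss_roll_point(t_outer))
along = distance_along_roll(t_inner, t_outer)

print(f"straight-line distance:  {straight:.2f}")
print(f"distance along the roll: {along:.2f}")
```

The distance along the roll comes out several times longer than the straight-line distance, so anything that treats these two points as neighbours has the manifold wrong.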

These hard cases are disappointments for the data miner, but they ought to be opportunities for the counter-data miner, right? Could it be possible to exploit the hard cases to make it more difficult for the snoops? One practical example of something like this already exists: the distorted letters in a CAPTCHA are "hard cases" for automated image recognition software.

Does anybody have thoughts on this?

Comment author: wedrifid 26 September 2010 04:45:54AM 2 points [-]

But data mining is, of course, a potential privacy nightmare. There are algorithms that can tell if you're gay from your Facebook page, and reassemble your address and social security number by aggregating apparently innocuous web content.

Really? Where can I find said algorithms? Knowing how they work would obviously be a useful way of thwarting them.

Comment author: [deleted] 26 September 2010 05:11:55AM 2 points [-]
Comment author: sketerpot 26 September 2010 05:22:02AM *  1 point [-]

Apparently, it looks at the self-reported gender and sexual orientation of your Facebook friends, and uses that information to guess your own sexual orientation. Here's how I would do that:

  1. Gather three variables: your gender, the male/female ratio of your friends, and the ratio of gay-or-bisexual to straight people among those of your friends who state their own sexual orientation. If I wanted to be extra-fancy, I might also include a sparse array of events and clubs that the person was signed up for.

  2. Apply some standard machine learning tools to this, discretizing variables if necessary. Use people who report their sexual orientation as training and testing data.

  3. Practice my evil villain laugh.
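Steps 1 and 2 can be sketched roughly as follows. This is only an illustration under loud assumptions: the data is entirely synthetic and invented here, the feature names are hypothetical, and logistic regression trained by plain gradient descent is my stand-in for "some standard machine learning tools" (the comment does not say which tool to use). The optional events-and-clubs feature is left out.

```python
import math
import random

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))

def make_synthetic_users(n, rng):
    """Invented data: users with label 1 tend to have a higher ratio of
    friends who report label 1. None of this comes from real profiles."""
    users = []
    for _ in range(n):
        label = 1 if rng.random() < 0.3 else 0
        gender = rng.choice([0.0, 1.0])
        female_friend_ratio = rng.random()
        centre = 0.4 if label else 0.1
        flagged_friend_ratio = min(1.0, max(0.0, rng.gauss(centre, 0.08)))
        users.append(([gender, female_friend_ratio, flagged_friend_ratio], label))
    return users

def predict(weights, bias, features):
    """Estimated probability that the label is 1."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

def train(data, epochs=200, learning_rate=0.2):
    """Logistic regression fitted by stochastic gradient descent."""
    weights, bias = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, label in data:
            error = predict(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

rng = random.Random(0)
users = make_synthetic_users(800, rng)

# Users who state their label serve as both training and testing data.
train_set, test_set = users[:600], users[600:]
weights, bias = train(train_set)

correct = sum((predict(weights, bias, f) >= 0.5) == bool(y) for f, y in test_set)
accuracy = correct / len(test_set)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

Any accuracy figure this prints says something about the made-up data generator only, not about how well such a classifier would do on real profiles.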

In order to defend against this, you could apply steps 1 and 2, then look at what the machine learning program tells you. Try to match its profile of a straight person. Then you can remain Facebook-closeted even in the face of the all-seeing electronic gaydar.
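One way to read "try to match its profile" is sketched below. Everything in it is an assumption for illustration: the weights and bias are made-up numbers standing in for whatever a trained model would produce, and the only knob the user turns is hiding some friends from the visible friend list. It also simplifies by leaving the male/female friend ratio unchanged when friends are hidden.

```python
import math

# Hypothetical model parameters, invented for this example. A real
# defender would first train their own model as in steps 1 and 2.
WEIGHTS = {"female_friend_ratio": 0.1, "flagged_friend_ratio": 40.0}
BIAS = -8.0

def score(female_friend_ratio, flagged_friend_ratio):
    """The hypothetical model's probability estimate for a profile."""
    z = (BIAS
         + WEIGHTS["female_friend_ratio"] * female_friend_ratio
         + WEIGHTS["flagged_friend_ratio"] * flagged_friend_ratio)
    return 1.0 / (1.0 + math.exp(-z))

def friends_to_hide(stated, flagged, female_friend_ratio, threshold=0.5):
    """Smallest number of flagged friends to hide so the score drops
    below the threshold. `stated` counts friends who state an
    orientation; `flagged` counts those stating gay or bisexual."""
    for hidden in range(flagged + 1):
        remaining = stated - hidden
        ratio = (flagged - hidden) / remaining if remaining else 0.0
        if score(female_friend_ratio, ratio) < threshold:
            return hidden
    return None  # no amount of hiding gets under the threshold

needed = friends_to_hide(stated=50, flagged=15, female_friend_ratio=0.5)
print(f"flagged friends to hide: {needed}")
```

Whether this would work against a real system is exactly the open question: it only defeats the model you trained yourself, and the real one may use different features.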

Comment author: Vladimir_Nesov 26 September 2010 06:34:51AM *  1 point [-]

It's theoretically obvious that you can try to do it this way with a nontrivial chance of success, but not at all obvious that given enough skill and work, success is assured (which was the claim). The latter would require (knowledge of) actual experiments.

Comment author: wedrifid 26 September 2010 06:03:09AM 0 points [-]

Try to match its profile of a straight person. Then you can remain Facebook-closeted even in the face of the all-seeing electronic gaydar.

I have no problem with people knowing that I'm gay. Come to think of it, I have no problem with people knowing my social security number. (We don't even have a commonly used equivalent here, although driver's licence numbers and birth certificate IDs are sometimes useful.)

Comment author: arundelo 26 September 2010 05:16:55AM 1 point [-]

I've heard of one for determining your sexual orientation (if you don't reveal it on your info page), but it's based on the revealed sexual orientations of your friends (if a lot are gay, you probably are too), so it's harder to thwart than, say, something based on your favorite songs.