
Kawoomba comments on PSA: Eugine_Nier evading ban? - Less Wrong Discussion

17 Post author: Dahlen 07 December 2014 11:23PM




Comment author: Kawoomba 09 December 2014 12:40:43PM *  15 points [-]

A friend of mine did it via computational complexity: using gzip (as an approximation of KC) to attribute classical Latin texts to their respective authors, by checking which strings add the least additional complexity (due to shared writing style, word choice, etc.) when compressed together, and then clustering. Worked like a charm.

ETA: These were large bodies of text, however. Probably not gonna work for a bundle of comments, except for me, due to my overuse of "obviously", obviously.
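The "compress together, then cluster" trick is usually formalized as the normalized compression distance; here is a minimal sketch with a made-up toy corpus (not the actual Latin data):

```python
import gzip

def c(data: bytes) -> int:
    # Length of the gzip-compressed data: a crude stand-in for
    # Kolmogorov complexity.
    return len(gzip.compress(data))

def ncd(x: bytes, y: bytes) -> float:
    # Normalized compression distance: texts sharing style and
    # vocabulary compress better together, giving a smaller distance.
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy "authors" with distinct, repetitive vocabularies:
author_a = b"arma virumque cano troiae qui primus ab oris " * 50
author_b = b"gallia est omnis divisa in partes tres " * 50
unknown = b"arma virumque cano troiae qui primus ab oris " * 10

# The unknown sample sits closer to the corpus it was drawn from:
assert ncd(unknown, author_a) < ncd(unknown, author_b)
```

A real attribution run would compute pairwise NCDs over all texts and feed the resulting distance matrix to a clustering algorithm.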

Comment author: Kaj_Sotala 09 December 2014 08:43:24PM *  7 points [-]

I thought this was pretty impressive:

We study techniques for identifying an anonymous author via linguistic stylometry, i.e., comparing the writing style against a corpus of texts of known authorship. We experimentally demonstrate the effectiveness of our techniques with as many as 100,000 candidate authors.

[...]

In experiments where we match a sample of just 3 blog posts against the rest of the posts from that blog (mixed in with 100,000 other blogs), the nearest-neighbor/RLSC combination is able to identify the correct blog in about 20% of cases; in about 35% of cases, the correct blog is one of the top 20 guesses. Via confidence estimation, we can increase precision from 20% to over 80% with a recall of 50%, which means that we identify 50% of the blogs overall compared to what we would have if we always made a guess.

The efficacy of the attack varies based on the number of labeled and anonymous posts available. Even with just a single post in the anonymous sample, we can identify the correct author about 7.5% of the time (without any confidence estimation). When the number of available posts in the sample increases to 10, we are able to achieve a 25% accuracy. Authors with relatively large amounts of content online (about 40 blog posts) fare worse: they are identified in over 30% of cases (with only 3 posts in the anonymous sample).

[...]

Further, we confirmed that our techniques work in a cross-context setting: in experiments where we match an anonymous blog against a set of 100,000 blogs, one of which is a different blog by the same author, the nearest neighbor classifier can correctly identify the blog by the same author in about 12% of cases. Finally, we also manually verified that in cross-context matching we find pairs of blogs that are hard for humans to match based on topic or writing style; we describe three such pairs in Appendix A.

The strength of the deanonymization attack we have presented is only likely to improve over time as better techniques are developed. Our results thus call into question the viability of anonymous online speech. Even if the adversary is unable to identify the author using our methods in a fully automated fashion, he might be able to identify a few tens of candidates for manual inspection as we detail in Section III.
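The nearest-neighbor half of the pipeline can be sketched with a toy feature set; the paper itself uses a far richer set of lexical and syntactic features, so the character n-grams and two-author "corpus" below are purely illustrative:

```python
import math
from collections import Counter

def ngram_profile(text: str, n: int = 3) -> Counter:
    # Character n-gram counts as a crude writing-style fingerprint.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p: Counter, q: Counter) -> float:
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm = math.sqrt(sum(v * v for v in p.values()))
    norm *= math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def attribute(anonymous: str, candidates: dict) -> str:
    # Nearest neighbor: the candidate whose profile is most similar
    # to the anonymous sample wins.
    profile = ngram_profile(anonymous)
    return max(candidates,
               key=lambda a: cosine(profile, ngram_profile(candidates[a])))

candidates = {
    "alice": "Obviously the result follows, obviously, from the obvious lemma. " * 5,
    "bob": "We shall now proceed to rigorously demonstrate the proposition. " * 5,
}
best = attribute("Obviously this is obvious.", candidates)
```

The confidence-estimation step the paper describes would then abstain whenever the best match is not sufficiently far ahead of the runner-up.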

Comment author: Kawoomba 09 December 2014 08:51:10PM 2 points [-]

The difference was one of scale. Much easier when taking just three dozen or so pieces of classical Latin literature, some of which were different parts of the same opus magnum, and then watching them cluster to their respective authors and to the other parts of the same work. More of a "put the pieces into the box" puzzle, as opposed to a 100,000-piece one. In the latter case, you just know most of the puzzle pieces will show either the blue sky or the blue sea, both a similar shade of blue.

Comment author: gattsuru 09 December 2014 08:33:50PM 2 points [-]

Commenting to 'save' this comment. That's a really clever way to handle that.

Comment author: [deleted] 09 December 2014 08:22:23PM 0 points [-]

KC = Kolmogorov Complexity?

Comment author: Kawoomba 09 December 2014 08:25:11PM *  0 points [-]

Yeap, the paper linked in my other comment explains how it works.

Comment author: [deleted] 09 December 2014 07:36:37PM *  0 points [-]

gzip is a "crude upper bound" on KC, and afaik there is no known bound on the error.

EDIT: I'm pretty sure the following result isn't even true: If KC(x) < KC(y), then gzip(x) < gzip(y). Or even quantitatively weaker variants of the same.

Comment author: Kawoomba 09 December 2014 07:43:51PM *  0 points [-]

There can't be, since KC isn't computable (though I could be mistaken about that in itself precluding a lower error bound). It's still kinda nice and works on a similar principle (and also, somewhat, in practice). Let's plug the Hutter Prize while we're at it, for the 5 people reading this. Also, just saw a cool paper which kinda describes the same principle, here.

Comment author: gwern 09 December 2014 08:00:48PM *  3 points [-]

There can't be, since KC isn't computable (could be mistaken on that in itself precluding a bounded error).

KC may be uncomputable in general, but I'm pretty sure that doesn't preclude all possible proofs or constructions*, and it seems odd to say that there are no upper bounds when we have what sure look like such things.

* I have just invented a Scheme-like programming language in which all tokens except numbers are 2 alphanumeric characters long; what is the Kolmogorov complexity of the bitstring '1'? 'Who knows, KC is uncomputable -' Bzzt! The answer is that the KC is exactly 1, since the shortest program which emits the bitstring '1' is a program consisting of the constant '1' which evaluates to '1', which as you can see, is indeed of length 1, and all other programs emitting the bitstring '1' are longer by definition.

Or if you don't like that example, consider taking your favorite language and enumerating all possible programs in length order; consider the outputs of those programs when evaluated for up to the highest known Busy Beaver number of steps (to avoid nontermination issues), and the respective lengths of the outputs - have you not found the KC for those outputs? And if you run gzip over the outputs to get estimated KCs, are those not upper bounds on the actual KCs you just computed?

Comment author: V_V 09 December 2014 08:30:50PM *  3 points [-]

Computing upper bounds on Kolmogorov complexity is not very difficult: gzip and all the other compression algorithms do it. The difficulty is computing non-trivial lower bounds:
For all programming languages (with a self-terminating encoding), there is a trivial lower bound that doesn't depend on the string. This bound is at least one token.

But there is also a language-dependent constant L_max, which is the maximum lower bound on KC that you can prove for any string: L_max is the length of the shortest program for which the halting property is uncomputable ( * ) (which makes L_max uncomputable as well).
This implies that you can compute the KC of only finitely many strings.

( * And doesn't provably emit any token)

Comment author: Anatoly_Vorobey 11 December 2014 04:01:41PM 3 points [-]

There is no such thing as "the shortest program for which the halting property is uncomputable". That property is computable trivially for every possible program. What's uncomputable is always the halting problem for an infinity of programs using one common algorithm.

It is also easy to make up artificial languages in which Kolmogorov complexity is computable for an infinite subset of all possible strings.

You were probably thinking of something else: that there exists a constant L, which depends on the language and a proof system T, such that it's not possible to prove in T that any string has Kolmogorov complexity larger than L. That is true. In particular, this means that there's a limit to lower bounds we can establish, although we don't know what that limit is.

Comment author: V_V 11 December 2014 07:11:34PM *  4 points [-]

There is no such thing as "the shortest program for which the halting property is uncomputable". That property is computable trivially for every possible program.

The halting property is semi-decidable: if a program halts, then you can always trivially prove that it halts; you just need to run it. If a program does not halt, then sometimes you can prove that it doesn't halt, and sometimes you can't prove anything.
For any programming language, there exists a length such that you can compute the halting property for all programs shorter than it. At that length, L_max, there will be a program Bad_0, which:

  • Does not halt.
  • Can't be proven not to halt.
  • Doesn't emit any output symbol.
  • It can't be proven that there exists any string of output symbols that it doesn't emit.

You can never prove that any string S has Kolmogorov complexity K with K > L_max, as that would imply you had proven that Bad_0 doesn't halt, or at least doesn't emit any symbol which is not a prefix of S.

Since there are only finitely many strings with complexity up to L_max, we can compute Kolmogorov complexity for only finitely many strings.

It is also easy to make up artificial languages in which Kolmogorov complexity is computable for an infinite subset of all possible strings.

If the language is Turing-complete I don't think this is possible. If you think that an arbitrary string S has complexity K, how can you prove that there exists no program shorter than K that computes S?
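The semi-decidability point can be sketched by modeling programs as Python generators (one yield per step) and running them with a step budget; halting is witnessed by termination, while exhausting the budget proves nothing:

```python
def halts_within(prog, fuel: int):
    # Semi-decision procedure: run `prog` for up to `fuel` steps.
    # True means we witnessed halting; None means "no verdict",
    # since this procedure can never conclude non-halting.
    it = prog()
    for _ in range(fuel):
        try:
            next(it)
        except StopIteration:
            return True
    return None

def halting_prog():
    for _ in range(10):
        yield

def looping_prog():
    while True:
        yield

assert halts_within(halting_prog, 100) is True
assert halts_within(looping_prog, 100) is None  # not a proof of non-halting
```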

Comment author: Anatoly_Vorobey 11 December 2014 08:10:02PM *  0 points [-]

[retracted]

Comment author: Kawoomba 09 December 2014 08:14:29PM *  0 points [-]

Yea, only talking about the general case. Even the Halting Problem is usually computable in daily life, just by running the program (ETA: and seeing it terminate).

Watch the program run

See it die before your eyes

Halting Problem? Solved!

Comment author: [deleted] 09 December 2014 09:01:00PM 0 points [-]

There can't be, since KC isn't computable

Exactly. So why bother saying gzip is an approximation of KC? (I assume: because KC is a well-known theoretical object with good properties, and one shouldn't let the fact that none of these properties carry over to gzip ruin one's chance of getting published cheaply.)

Comment author: lmm 16 December 2014 11:22:14PM 0 points [-]

Because gzip is being used to measure complexity. That's literally the reason they used gzip and not, I don't know, rot13. It's an explanation of the causal role that gzip is playing in the whole process.

Comment author: [deleted] 17 December 2014 12:57:37AM *  1 point [-]

No.

"gzip is being used to measure complexity" is an explanation of the causal role that gzip is playing in this study.

"gzip is an approximation of KC" is either 1) flatly not true, see edit to grandparent or 2) not relevant to the study at all.

Comment author: Kawoomba 09 December 2014 09:07:57PM 0 points [-]

And while we're at it, why do we even care about Turing Machines, it's not like we could ever build one anyways. ;-)

goes back to tending his potato garden