
sixes_and_sevens comments on Open Thread, June 2-15, 2013 - Less Wrong Discussion

Post author: TimS 02 June 2013 02:22AM


Comments (433)


Comment author: sixes_and_sevens 04 June 2013 04:00:50PM *  9 points [-]

I scraped the last few hundred pages of comments on Main and Discussion, and made a simple application for pulling the highest TF-IDF-scoring words for any given user.

I'll provide these values for the first ten respondents who want them. [Edit: that's ten]

EDIT: some meta-information - the corpus comprises 23.8 MB, and spans the past 400 comment pages on Main and Discussion (around six months and two and a half months respectively). The most prolific contributor is gwern with ~780kB. Eliezer clocks in at ~280kB.
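The per-user scoring described above can be sketched generically; the author's actual implementation (in Java) isn't shown in the thread, so this is a minimal Python version of standard TF-IDF, treating each user's collected comments as one document:

```python
import math
from collections import Counter

def tf_idf_top_words(documents, user, top_n=11):
    """Return the top TF-IDF-scoring words for one user's document.
    `documents` maps user -> list of word tokens; each user's
    collected comments count as a single document."""
    n_docs = len(documents)
    # Document frequency: in how many users' documents each word appears.
    df = Counter()
    for words in documents.values():
        df.update(set(words))
    # Term frequency within this user's document, weighted by
    # inverse document frequency across the corpus.
    tf = Counter(documents[user])
    scores = {
        word: count * math.log(n_docs / df[word])
        for word, count in tf.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Words that every user writes (like "the") get a near-zero IDF and drop out, which is why the per-user lists below surface topical words instead of common ones.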

Comment author: jkaufman 04 June 2013 06:00:16PM 3 points [-]

What about for the site overall?

Comment author: sixes_and_sevens 04 June 2013 10:12:41PM 3 points [-]

This was my eventual plan, but I haven't settled on a general corpus to compare it to yet.

Comment author: Kawoomba 06 June 2013 08:16:32PM 2 points [-]

Can you comment on your methodology - tools, wget scripts or what?

Comment author: sixes_and_sevens 06 June 2013 11:12:34PM 1 point [-]

Scraping is done with Python and lxml, and the scoring is done in Java. It came about because I needed to brush up on my Java for work and was looking for an extensible project.

I also didn't push it to my personal repo, so all requests will have to wait until I'm back at work.
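A minimal sketch of the scraping step, using Python's stdlib html.parser in place of lxml so it stays self-contained; the target class name and markup here are illustrative, not LW's actual comment markup:

```python
from html.parser import HTMLParser

class CommentScraper(HTMLParser):
    """Collect the text content of elements whose class attribute
    contains a target class (e.g. the div wrapping each comment).
    Note: void elements like <br> are not handled in this sketch."""
    def __init__(self, target_class="comment-body"):
        super().__init__()
        self.target_class = target_class
        self.depth = 0          # nesting depth inside a matching element
        self.comments = []      # accumulated comment texts
        self._buf = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.target_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.comments.append("".join(self._buf).strip())
                self._buf = []

    def handle_data(self, data):
        if self.depth:
            self._buf.append(data)
```

With lxml the same extraction would be a one-line XPath or CSS-selector query over the fetched page; the parser above just shows the shape of the task.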

Comment author: RichardKennaway 06 June 2013 10:57:50AM 1 point [-]

I'll provide these values for the first ten respondents who want them.

Yes please. I have no idea what they will look like.

Comment author: sixes_and_sevens 06 June 2013 11:26:04AM 1 point [-]

suffering -> 25.000
god -> 24.508
does -> 24.383
causal -> 21.584
np -> 21.259
utility -> 20.470
agi -> 20.470
who -> 20.169
pill -> 19.353
bayesian -> 18.965
u1 -> 17.567

The word 'who' seems to come up a lot for the contributors at the more prolific end of the scale. I don't have a satisfactory answer as to why this should be the case. Your contribution comprises ~170kB of plain text.

Comment author: [deleted] 11 June 2013 05:52:40PM *  0 points [-]

If I'm counting the replies correctly, nine respondents have requested them so far. I'd like my word values. Thank you!

Comment author: sixes_and_sevens 12 June 2013 10:09:14AM 1 point [-]

political -> 28.733
power -> 27.093
moldbug -> 26.135
structural -> 24.192
he -> 24.082
reactionary -> 23.480
blog -> 21.973
good -> 21.373
social -> 20.470
his -> 20.470
very -> 20.169

Your contribution is ~167kB.

Comment author: ArisKatsaris 10 June 2013 10:48:42AM 0 points [-]

May I have mine? Thanks.

Comment author: sixes_and_sevens 10 June 2013 11:46:40AM 0 points [-]

moral -> 35.017
thread -> 34.250
bob -> 25.163
preferences -> 24.383
eu -> 23.739
column -> 23.537
matrix -> 23.419
mugging -> 22.367
pascals -> 21.479
lord -> 19.515
eg -> 19.266

Your contribution to the corpus is ~100kB.

Comment author: FiftyTwo 09 June 2013 08:41:09PM 0 points [-]

An alternative would be to ask people for donations to Against Malaria Foundation or your preferred charity.

Comment author: Dorikka 09 June 2013 07:10:45PM 0 points [-]

I'll provide these values for the first ten respondents who want them.

I'd like mine, please.

Comment author: sixes_and_sevens 10 June 2013 09:08:52AM 2 points [-]

gvrq -> 9.457
puppies -> 8.784
cute -> 7.141
creprag -> 7.119
gb -> 6.901
rewind -> 6.305
fvatyr -> 5.100
deck -> 4.838
stuff -> 4.816
vf -> 4.739
boom -> 4.221

As mentioned to other respondents, rot13 really messes with TF-IDF. I'm still not sure of the best way to deal with this.

Comment author: Douglas_Knight 11 June 2013 05:55:04AM 0 points [-]

If someone uses rot13, that is highly informative. Is there any principled reason to like quoted words showing up but not rot13? Anyhow, I think the drawback of rot13 for TF-IDF is that it seems like a lower-level feature than words. In particular, it is wasteful for it to show up more than once if you're only doing a top 11.

In some sense, I think the reason that the low-level feature of rot13 is mixing with the high-level feature of words is that you've jumped to the high level by fiat. Before looking at word frequency, you should look at letter frequency. With a sufficiently large corpus, rot13 should show up already there. I doubt that the corpus is big enough to detect the small usage by people here, but I think it might show up in bigrams or trigrams. I don't have a concrete suggestion, but when you look at bigrams, you should use both corpus bigrams and document letter frequencies to decide which document bigrams are surprising.
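One way to operationalise the letter-frequency idea without assuming knowledge of rot13's existence is harder, but if you do allow that assumption, a simple sketch is to score a text's letters against typical English frequencies both as-is and after a rot13 shift; the frequency table below is a standard approximation, not from the thread:

```python
import codecs
from collections import Counter

# Approximate relative letter frequencies in English text.
ENGLISH_FREQ = {
    'e': .127, 't': .091, 'a': .082, 'o': .075, 'i': .070, 'n': .067,
    's': .063, 'h': .061, 'r': .060, 'd': .043, 'l': .040, 'c': .028,
    'u': .028, 'm': .024, 'w': .024, 'f': .022, 'g': .020, 'y': .020,
    'p': .019, 'b': .015, 'v': .010, 'k': .008, 'j': .002, 'x': .002,
    'q': .001, 'z': .001,
}

def english_score(text):
    """Mean English frequency of the letters in `text`: higher means
    the letter distribution looks more like plain English."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    return sum(ENGLISH_FREQ.get(c, 0) for c in letters) / len(letters)

def looks_like_rot13(text):
    """True if the text scores better as English after rot13-decoding."""
    return english_score(codecs.encode(text, 'rot13')) > english_score(text)
```

Plain English passages score higher as-is; rot13'd passages score higher after decoding, so runs of rot13 in a comment stand out at the letter level before any word-level analysis.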

Comment author: sixes_and_sevens 11 June 2013 09:52:18AM 0 points [-]

You've already surmised why rot13 words are undesirable. Just to check, are you suggesting I use n-gram frequency to identify rot13 words, or replace TF-IDF with some sort of n-gram frequency metric instead?

Comment author: Douglas_Knight 12 June 2013 09:43:41PM 0 points [-]

You could use TF-IDF on n-grams. That's what I was thinking. But when I said to combine the local n-gram frequencies and the global (n+1)-gram frequencies to get a prediction of local (n+1)-gram frequencies to compare against, you might say it's too complicated to continue calling it TF-IDF.

If all you want to do is recognize rot13 words, then a dictionary and/or bigram frequencies sound pretty reasonable. But don't just eliminate rot13 words from the top-11 list; also include some kind of score of how much people use rot13. For example, you could turn every word into a 0 or 1, depending on whether it's rot13, and run TF-IDF on that. But it would be better to score each word and aggregate the scores, rather than thresholding.
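The dictionary route just mentioned can be sketched directly: a token whose rot13 decoding is a known English word (while the token itself is not) is a strong rot13 candidate. The tiny word list here is illustrative; a real run would use a full dictionary. (Note that "fvatyr" from one of the lists above does decode to "single".)

```python
import codecs

# Illustrative stand-in for a real English word list.
DICTIONARY = {"single", "is", "to", "puppies", "cute", "stuff", "deck"}

def is_rot13_word(word, dictionary=DICTIONARY):
    """Flag a token as likely rot13: its decoding is in the
    dictionary but the token itself is not."""
    decoded = codecs.encode(word.lower(), 'rot13')
    return decoded in dictionary and word.lower() not in dictionary

def rot13_fraction(tokens, dictionary=DICTIONARY):
    """Per-document rot13 score: the fraction of tokens flagged as
    rot13 -- the 0/1-per-word signal suggested above, aggregated."""
    if not tokens:
        return 0.0
    return sum(is_rot13_word(t, dictionary) for t in tokens) / len(tokens)
```

The per-document fraction gives the aggregate "how much does this person use rot13" score, which could then be reported alongside the word list rather than silently filtering the words out.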

What I was suggesting was a complicated (and unspecified) approach that does not assume knowledge of rot13 ahead of time. The point is to identify strange letter frequencies and bigrams as signs of a different language and then not take as significant words that are rare just because they are part of the other language. I think this would work if someone wrote 50/50 rot13, but if the individual used just a little rot13 that happened to repeat the same word a lot, it probably wouldn't work. (cf. "phyg")

There are two problems here: distinguishing individuals, and communicating to a human how the computer distinguishes them. Even if you accept that my suggestion would be a good thing for the computer to do, there's the second step of conveying to the human the claim that it has identified another language the individual is using. The computer could report unusual letter frequencies or bigrams, but that wouldn't mean much to the human. It could use the unusual frequencies to generate text, but that would be gibberish. It could find words in the corpus that score highly by the individual's bigrams and low by the corpus bigrams.

Comment author: Douglas_Knight 07 June 2013 03:04:36AM 0 points [-]

mine, please.

Comment author: sixes_and_sevens 07 June 2013 09:17:45AM 0 points [-]

sats -> 22.952
htt -> 22.810
sat -> 22.157
princeton -> 21.356
mathematicians -> 17.903
crack -> 16.812
harvard -> 16.661
delete -> 16.563
proofs -> 15.745
graph -> 15.565
regressions -> 15.301

Your corpus comprises ~77kB of plain text.

Comment author: Vaniver 06 June 2013 10:47:06PM 0 points [-]

I'd like mine, please!

Comment author: sixes_and_sevens 07 June 2013 09:16:09AM 2 points [-]

because -> 41.241
p -> 38.129
should -> 34.016
sat -> 33.974
much -> 33.113
cholesterol -> 33.056
evidence -> 32.444
iq -> 32.092
comments -> 31.454
scores -> 30.690
clear -> 28.899

Your contribution comprises ~284kB of plain text, and is the thirteenth-largest in the corpus.

Comment author: Vaniver 07 June 2013 05:04:54PM 1 point [-]

Thanks!

Interestingly, the only one of those that I recognize as clearly one of my verbal quirks is "clear," which I use a lot in "it's not clear to me that ...", but it barely made it onto the list. I participate in most of the discussions on intelligence testing, so it's no surprise that "sat," "iq," and "scores" are high. "Cholesterol" seems likely to be an artifact from a single detailed conversation about it, and then apparently I like words like "because," "should," and "much" more than normal, which is not that surprising given my general verbosity. I know I use the word "evidence" more than the general population, but am surprised I use it that much more than LW, and "comments" is unclear. Probably meta-discussion?

Comment author: sixes_and_sevens 07 June 2013 05:23:02PM 2 points [-]

Most occurrences of "comments" seem to be in the context of moderator actions. There are 44 occurrences in your contribution to the corpus, which is around 50,000 words.

As for "evidence", there are 70 occurrences in 50,000 words. So on average, roughly every 714th word you say in comments is "evidence".

Comment author: satt 06 June 2013 09:36:21PM 0 points [-]

Ooh, go on then.

Comment author: sixes_and_sevens 07 June 2013 09:13:02AM 1 point [-]

phd -> 34.505
teleology -> 25.661
maitzens -> 20.402
neutron -> 19.191
fusion -> 17.502
causal -> 17.267
argument -> 16.222
turtle -> 16.137
greenhouse -> 15.736
p1 -> 15.353
might -> 15.353

Your contribution comprises ~116kB.

Comment author: satt 07 June 2013 09:07:23PM 0 points [-]

Haha, I should've foreseen "maitzens", "causal", "argument" & "turtle" showing up there. (I'm lucky your corpus didn't go back far enough to capture this never-ending back-and-forth, otherwise my top 10 would probably be nothing but "HIV", "AIDS", "cases", "CDC", "Duesberg", "CD4", and such.) Thanks for running the numbers.

Comment author: TheOtherDave 06 June 2013 06:53:07PM 0 points [-]

Sure, why not? Thanks!

Comment author: sixes_and_sevens 07 June 2013 09:10:59AM 0 points [-]

x -> 98.136
confidence -> 87.600
value -> 66.797
agree -> 65.843
endorse -> 63.750
ok -> 60.507
said -> 59.640
evidence -> 54.869
say -> 54.185
bamboozled -> 53.497
values -> 53.122

Your contribution comprises ~420kB of plain text, and is the fifth largest in the corpus.

Comment author: arundelo 06 June 2013 04:53:16PM 0 points [-]

Cool! This (judging the relevance of words in documents in a corpus and analogous problems) is a subject I muse about sometimes. Thanks for introducing me to TF-IDF.

I'd like my top scoring words please.

Comment author: sixes_and_sevens 06 June 2013 04:56:02PM *  1 point [-]

comte -> 17.852
m1 -> 12.664
grumble -> 9.813
altruism -> 8.787
rotating -> 8.442
olive -> 8.150
comtes -> 8.025
m -> 7.383
workshop -> 7.157
egoistic -> 6.916
happiness -> 6.475

Your contribution comprises ~21kB of plain text.

Comment author: Kaj_Sotala 06 June 2013 09:04:56AM 0 points [-]

Curious to hear mine.

Comment author: sixes_and_sevens 06 June 2013 09:54:37AM *  1 point [-]

intelligence -> 17.119
machine -> 15.353
environments -> 15.052
reference -> 13.546
machines -> 12.304
views -> 12.253
legg -> 12.252
friedman -> 11.417
papers -> 10.792
we -> 10.536
exercises -> 9.532

Your contribution to the corpus amounts to ~47kB of plain text. For reference, Eliezer is ~190kB and gwern is ~515kB. The scores are unadjusted for document size and not amazingly meaningful outside of this specific context.

Comment author: Kaj_Sotala 06 June 2013 10:43:04AM 0 points [-]

Huh, that seems different from what I'd have expected - but then again, I'm not sure of what I would have expected. Thanks.

Comment author: sixes_and_sevens 06 June 2013 10:59:06AM 2 points [-]

I've just fixed a bug in my scraper that was causing it to abandon 25% of the corpus. This has ended up tripling your contribution. Some new values for you:

agi -> 37.328
intelligence -> 22.367
moral -> 21.010
agis -> 20.087
eea -> 18.647
takeoff -> 17.500
credences -> 17.108
machine -> 16.902
our -> 16.222
environments -> 15.919
deer -> 15.761

This retains a similar "flavour" to the previous set (AGI and ev-psych). The best way I've found to interpret it is "what sort of words describe what I use Less Wrong to talk about?"

As an interesting side-note, rot13 really messes with TF-IDF.

Comment author: Kaj_Sotala 06 June 2013 12:24:04PM 2 points [-]

Okay, that feels like it makes more sense. I'm a little confused about the "deer", though.

Comment author: sixes_and_sevens 06 June 2013 12:28:44PM 3 points [-]
Comment author: Kaj_Sotala 06 June 2013 02:35:04PM 1 point [-]

Hah, okay.

Comment author: RichardKennaway 06 June 2013 02:35:13PM 0 points [-]

You're not distinguishing original from quoted text, then?

Comment author: sixes_and_sevens 06 June 2013 03:37:43PM 0 points [-]

It's not obvious to me that I should. TF-IDF is about identifying key terms in a document. Quoted text counts towards that.

Comment author: RichardKennaway 06 June 2013 04:14:30PM 0 points [-]

TF-IDF is about identifying key terms in a document. Quoted text counts towards that.

That depends on what "the document" is. Everything appearing in a posting by a given author, or all of the text written by a given author?