Kawoomba comments on Open Thread, June 2-15, 2013 - Less Wrong Discussion
I scraped the last few hundred pages of comments on Main and Discussion, and made a simple application for pulling the highest TF-IDF-scoring words for any given user.
I'll provide these values for the first ten respondents who want them. [Edit: that's ten]
EDIT: some meta-information - the corpus comprises 23.8 MB, and spans the past 400 comment pages on Main and Discussion (around six months and two and a half months respectively). The most prolific contributor is gwern with ~780kB. Eliezer clocks in at ~280kB.
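For anyone curious what "highest TF-IDF-scoring words for a given user" might look like in practice, here is a minimal sketch. It is not the application described above (that one does its scoring in Java); it is a Python illustration that assumes each user's pooled comments count as one document, uses relative term frequency, and uses the plain log(N / df) form of IDF. The names `tokenize`, `top_tfidf_words` and `comments_by_user` are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_tfidf_words(comments_by_user, user, k=10):
    """Return the k highest TF-IDF-scoring (word, score) pairs for `user`.

    Assumption: all of a user's comments are pooled into one document,
    so the document frequency of a word is the number of users who used it.
    """
    # Term counts per user (one pooled document per user).
    term_counts = {
        name: Counter(tokenize(" ".join(comments)))
        for name, comments in comments_by_user.items()
    }
    n_docs = len(term_counts)

    # Document frequency: in how many users' documents each word appears.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    counts = term_counts[user]
    total = sum(counts.values())
    scores = {
        word: (count / total) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


if __name__ == "__main__":
    sample = {
        "alice": ["bayes bayes prior", "the prior"],
        "bob": ["the cat sat"],
        "carol": ["the dog"],
    }
    for word, score in top_tfidf_words(sample, "alice", k=3):
        print(word, round(score, 3))
```

Words that every user says (like "the" in the toy data) get an IDF of zero, which is why the top of the list tends to be dominated by each user's pet vocabulary.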
Can you comment on your methodology - tools, wget scripts or what?
Scraping is done with Python and lxml, and the scoring is done in Java. It came about because I needed to brush up on my Java for work and was looking for an extensible project.
I also didn't push it to my personal repo, so all requests will have to wait until I'm back at work.
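Until the real code is available, here is a rough sketch of what the lxml side of such a scraper could look like. The markup and class names (`comment`, `author`, `body`) are assumptions invented for this example, not the site's actual HTML, and the page is an inline string rather than a fetched URL so the snippet runs offline.

```python
from lxml import html

# Hypothetical markup: the real comment pages almost certainly differ.
SAMPLE_PAGE = """
<html><body>
<div class="comment">
  <span class="author">alice</span>
  <div class="body"><p>First comment.</p></div>
</div>
<div class="comment">
  <span class="author">bob</span>
  <div class="body"><p>Second <em>comment</em>.</p></div>
</div>
</body></html>
"""


def extract_comments(page_source):
    """Return a list of (author, text) pairs found in one page of HTML."""
    tree = html.fromstring(page_source)
    results = []
    for node in tree.xpath('//div[@class="comment"]'):
        # XPath string() concatenates all text beneath the selected element.
        author = node.xpath('string(.//span[@class="author"])').strip()
        body = node.xpath('string(.//div[@class="body"])').strip()
        results.append((author, body))
    return results


if __name__ == "__main__":
    for author, body in extract_comments(SAMPLE_PAGE):
        print(author, "->", body)
```

Looping a function like this over each comment page and grouping the text by author would give the per-user corpus that the scoring step needs.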