DanielVarga comments on Statistical Prediction Rules Out-Perform Expert Human Judgments - Less Wrong

68 Post author: lukeprog 18 January 2011 03:19AM

Comments (195)

Comment author: DanielVarga 21 January 2011 06:43:53PM * 13 points

I second the advice.

Let me brag a bit. Once, in a friendly discussion, the following question came up: how do you predict whether an unfamiliar first name is male or female? This was in the context of Hungarian names, as all of us were Hungarians. I had a list of Hungarian first names in digital format. The discussion turned into a bet: I said I could write a program in half an hour that tells the sex of a first name it has never seen before with at least 70% accuracy. I am quite fast at writing small scripts. It wasn't even close: it took me 9 minutes to

  • split my sets of 1000 male and 1000 female names into a random 1000-1000 train-test split,
  • split each name into character 1-, 2- and 3-grams, e.g. Luca was turned into ^L u c a$ ^Lu uc ca$ ^Luc uca$,
  • feed the training data into a command line tool to train a maxent model,
  • test the accuracy of the model on the unseen test data.
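The comment does not say which scripting language was used, so the following is only a minimal Python sketch of the first two steps. The function names, the fixed seed, and the `(name, label)` tuple layout are assumptions made for illustration. The n-gram output is matched to the Luca example above, where ^ and $ are glued to the first and last characters.

```python
import random


def char_ngrams(name, max_n=3):
    """Return character 1- to max_n-grams of a name.

    The start marker ^ is attached to the first character and the end
    marker $ to the last one, so "Luca" becomes the tokens ^L, u, c, a$
    before the n-grams are built.
    """
    chars = list(name)
    if not chars:
        return []
    chars[0] = "^" + chars[0]
    chars[-1] = chars[-1] + "$"
    feats = []
    for n in range(1, max_n + 1):
        for i in range(len(chars) - n + 1):
            feats.append("".join(chars[i:i + n]))
    return feats


def train_test_split(labeled_names, train_fraction=0.5, seed=0):
    """Shuffle (name, label) pairs and cut them into train and test parts.

    The seed is an arbitrary choice that makes the split reproducible.
    """
    items = list(labeled_names)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


print(char_ngrams("Luca"))
# ['^L', 'u', 'c', 'a$', '^Lu', 'uc', 'ca$', '^Luc', 'uca$']
```

With 2000 labeled names and the default `train_fraction` of 0.5, this gives the 1000-1000 split described above.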

The model reached an accuracy of 90%. In retrospect, this is not surprising at all. Looking into the linear model, the most important feature it identified was whether the name ends with an 'a'. This trivial rule alone reaches some 80% accuracy for Hungarian names, so if I had known this in advance, I could have won the bet in 30 seconds instead of 9 minutes, with the sed command s/a$/a FEMALE/.
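The one-rule baseline can also be written out and scored directly. This is a rough Python sketch, not the sed one-liner itself: the sed command only tags names ending in 'a', so defaulting everything else to MALE is an assumption here, and the six names below are a tiny illustrative sample, not the 1000-name lists from the bet.

```python
def predict_gender(name):
    """Baseline rule: a name ending in 'a' is called FEMALE, anything else MALE.

    The MALE default is an assumption; the sed command only marks the
    FEMALE case.
    """
    return "FEMALE" if name.lower().endswith("a") else "MALE"


def accuracy(labeled_names):
    """Fraction of (name, label) pairs the baseline rule gets right."""
    correct = sum(predict_gender(name) == label for name, label in labeled_names)
    return correct / len(labeled_names)


# Illustrative sample only; far too small to estimate the real accuracy.
sample = [
    ("Luca", "FEMALE"),
    ("Anna", "FEMALE"),
    ("Eszter", "FEMALE"),
    ("Gábor", "MALE"),
    ("László", "MALE"),
    ("Attila", "MALE"),
]

print(accuracy(sample))
```

Names like Eszter and Attila show where the rule fails, which is the gap the n-gram features are left to close.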

Comment author: matt 27 January 2011 07:03:46AM * 2 points

These sound like powers I should acquire. Could you drop some further hints on:

  • "a command line tool to train a maxent model"
  • how you tested the accuracy of the model (tools that let you do that in the remaining minutes, rather than general principles)

Comment author: DanielVarga 27 January 2011 08:34:31AM * 3 points

I used Zhang Le's tool. Note that it is a rather obscure thing, not an industry standard like, say, the huge Weka and Mallet packages. It made the tasks you ask about very easy. Once I had the training and test data featurized,

maxent -m gender.model train.data

built the model and

maxent -p -m gender.model test.data

told me its accuracy on the test data.
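The comment does not show what train.data and test.data looked like. Here is a hedged Python sketch of producing such files, assuming the tool reads one example per line with the label first, followed by whitespace-separated features. That layout is an assumption to check against the tool's own documentation, and the helper names are made up for illustration.

```python
def format_example(label, features):
    """One example per line: the label, then its features, space-separated.

    This layout is an assumed input format, not something stated in the
    comment above.
    """
    return " ".join([label] + list(features))


def to_data_lines(examples):
    """Turn (label, features) pairs into the text lines of a data file."""
    return [format_example(label, feats) for label, feats in examples]


# Hypothetical featurized examples, using the n-gram style shown for Luca.
examples = [
    ("FEMALE", ["^L", "u", "c", "a$", "^Lu", "uc", "ca$", "^Luc", "uca$"]),
    ("MALE", ["^P", "á", "l$", "^Pá", "ál$", "^Pál$"]),
]

for line in to_data_lines(examples):
    print(line)
```

Writing these lines to train.data and test.data would give files to pass to the two commands above, if the assumed format holds.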