sixes_and_sevens comments on Open Thread, June 2-15, 2013 - Less Wrong Discussion

5 Post author: TimS 02 June 2013 02:22AM

Comments (433)

Comment author: sixes_and_sevens 06 June 2013 12:28:44PM 3 points
Comment author: Kaj_Sotala 06 June 2013 02:35:04PM 1 point

Hah, okay.

Comment author: RichardKennaway 06 June 2013 02:35:13PM 0 points

You're not distinguishing original from quoted text, then?

Comment author: sixes_and_sevens 06 June 2013 03:37:43PM 0 points

It's not obvious to me that I should. TF-IDF is about identifying key terms in a document. Quoted text counts towards that.

Comment author: RichardKennaway 06 June 2013 04:14:30PM 0 points

> TF-IDF is about identifying key terms in a document. Quoted text counts towards that.

That depends on what "the document" is. Everything appearing in a posting by a given author, or all of the text written by a given author?

Comment author: sixes_and_sevens 06 June 2013 04:48:47PM 0 points

"The document" is my wild sample that I've gone out and caught. TF-IDF tells me what it's broadly about. For this purpose, quoted text provides useful information.

If I want to infer personal facts about the author (beyond "what are the key terms in the posts they write"), it would make sense to weight original text higher than quoted text, but it would also make sense to use something other than TF-IDF for that purpose.
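
For readers unfamiliar with the scoring being discussed, here is a minimal sketch of TF-IDF in plain Python. It is not the code used in this thread, which is not shown. The function names, the `quote_weight` parameter, and the convention that quoted lines start with ">" are all assumptions made for illustration. With the default `quote_weight=1.0`, quoted text counts towards a document's key terms exactly as original text does. Lowering it shows one hypothetical way to weight original text higher, although, as noted above, a different method may suit that purpose better.

```python
import math
from collections import Counter

PUNCT = '.,;:!?"()'

def tokenize(text):
    """Lowercase, whitespace-split tokens with surrounding punctuation removed."""
    stripped = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in stripped if w]

def term_counts(doc, quote_weight=1.0):
    """Weighted term counts for one document.

    Assumption: a line starting with '>' is quoted text. Each token on such
    a line contributes quote_weight instead of 1.0.
    """
    counts = Counter()
    for line in doc.splitlines():
        quoted = line.lstrip().startswith(">")
        weight = quote_weight if quoted else 1.0
        for tok in tokenize(line.lstrip("> ")):
            counts[tok] += weight
    return counts

def tf_idf(docs, quote_weight=1.0):
    """Return one {term: score} dict per document.

    tf  = weighted count of the term / total weighted count in the document
    idf = log(number of documents / number of documents containing the term)
    """
    counts = [term_counts(d, quote_weight) for d in docs]
    n = len(docs)
    df = Counter()
    for c in counts:
        # A term counts towards document frequency only if its weight is positive.
        df.update(t for t, v in c.items() if v > 0)
    scores = []
    for c in counts:
        total = sum(c.values())
        if total > 0:
            scores.append({t: (v / total) * math.log(n / df[t])
                           for t, v in c.items() if v > 0})
        else:
            scores.append({})
    return scores

# Hypothetical three-document corpus; the first document quotes another post.
docs = [
    "> cats purr loudly\nI think cats are fine",
    "dogs bark loudly",
    "birds sing",
]
full = tf_idf(docs)                     # quoted text counts fully
orig = tf_idf(docs, quote_weight=0.0)   # quoted text ignored
```

In this toy corpus, "purr" appears only inside the quoted line, so it is scored in `full[0]` but absent from `orig[0]`.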