philh comments on Open Thread March 21 - March 27, 2016 - Less Wrong Discussion

3 Post author: Gunnar_Zarncke 20 March 2016 07:54PM

Comment author: Stefan_Schubert 22 March 2016 02:14:30PM 2 points

I have a maths question. Suppose that we are scoring n individuals on their performance in an area where there is significant uncertainty. We are sorting them into a small number of categories, say 4. Effectively we're thereby saying that, for the purposes of our scoring, everyone with the same score performs equally well. Suppose that we say that this means that all individuals with a given score get assigned the mean actual performance of the individuals with that score. For instance, if there were three people who got the highest score, and their performances are 8, 12 and 13 units, the assigned performance is 11 units.
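The assignment rule just described can be sketched in a few lines of Python. This is only an illustration: the function name `assigned_performance` and the category label 4 are made up here and are not part of the question.

```python
from collections import defaultdict

def assigned_performance(scores, performances):
    """Map each score category to the mean actual performance of its members."""
    groups = defaultdict(list)
    for score, perf in zip(scores, performances):
        groups[score].append(perf)
    return {score: sum(perfs) / len(perfs) for score, perfs in groups.items()}

# The three top scorers from the example (the label 4 is hypothetical).
print(assigned_performance([4, 4, 4], [8, 12, 13]))  # {4: 11.0}
```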

Now suppose that we want our scoring system to minimise information loss, so that the assigned performance is on average as close as possible to the actual performance. The question is: how do we achieve this? Specifically, how large a proportion of all individuals should fall into each category, and how does that depend on the performance distribution?
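To make "as close as possible on average" concrete, one has to pick a distance. The sketch below assumes mean squared error, which is only one possible reading (mean absolute error would be another and can give different category sizes). The name `mean_squared_loss` and the category labels are hypothetical.

```python
def mean_squared_loss(performances, categories):
    """Average squared gap between actual and assigned (category-mean) performance.

    Assumption: information loss is measured as mean squared error.
    """
    groups = {}
    for perf, cat in zip(performances, categories):
        groups.setdefault(cat, []).append(perf)
    assigned = {cat: sum(v) / len(v) for cat, v in groups.items()}
    total = sum((p - assigned[c]) ** 2 for p, c in zip(performances, categories))
    return total / len(performances)

# All three people in one category: assigned performance is 11 for each.
print(mean_squared_loss([8, 12, 13], ["A", "A", "A"]))  # about 4.67
# Moving the person who scored 8 into a separate category lowers the loss.
print(mean_squared_loss([8, 12, 13], ["B", "A", "A"]))  # about 0.17
```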

It would seem that if performance is linearly increasing as we go from low to high performers, then all categories should have the same number of individuals, whereas if the increase is exponential, then the higher categories should have a smaller number of individuals. Is there a theorem that proves this, and that specifies exactly how large the categories should be for a given shape of the curve? Thanks.

Comment author: philh 22 March 2016 04:00:21PM *  4 points

If I'm understanding this correctly, it sounds like you're performing k-means clustering.
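As a rough illustration of that reading, here is a minimal one-dimensional k-means sketch using Lloyd's algorithm. It assumes that "closeness" means squared error, which is the objective k-means minimises. The function name `kmeans_1d` and the equal-sized-chunk initialisation are choices made for this sketch, and Lloyd's algorithm only guarantees a local optimum, not the best possible split.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's algorithm on scalar data; returns (centers, clusters).

    Assumes len(values) >= k. Centers start as the means of k
    equal-sized chunks of the sorted data (an arbitrary choice).
    """
    xs = sorted(values)
    n = len(xs)
    centers = []
    for j in range(k):
        chunk = xs[j * n // k:(j + 1) * n // k]
        centers.append(sum(chunk) / len(chunk))
    for _ in range(max_iter):
        # Assignment step: each value joins its nearest center.
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Linearly increasing performance: four equally sized categories.
_, linear = kmeans_1d(range(100), 4)
print([len(c) for c in linear])  # [25, 25, 25, 25]

# Exponentially increasing performance: the top category ends up smaller.
_, expo = kmeans_1d([1.1 ** i for i in range(100)], 4)
print([len(c) for c in expo])
```

On these two made-up datasets the result matches the intuition in the question: equal-sized categories for linear growth, and fewer individuals in the higher categories for exponential growth.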