I have a maths question. Suppose that we are scoring n individuals on their performance in an area where there is significant uncertainty. We are categorizing them into a low number of categories, say 4. Effectively we're thereby saying that for the purposes of our scoring, everyone with the same score performs equally well. Suppose that we say that this means that all individuals with that score get assigned the mean actual performance of the individuals with that that score. For instance, if there were three people who got the highest score, and their perfomance equals 8, 12 and 13 units, the assigned performance is 11 units.
Now suppose that we want our scoring system to minimise information loss, so that the assigned performance is on average as close as possible to the actual performance. The question is: how do we achieve this? Specifically, how large a proportion of all individuals should fall into each category, and how does that depend on the performance distribution?
It would seem that if performance is linearly increasing as we go from low to high performers, then all categories should have the same number of individuals, whereas if the increase is exponential, then the higher categories should have a smaller number of individuals. Is there a theorem that proves this, and which exacty specifies how large the categories should be for a given shape of the curve? Thanks.
Your problem is called a clustering problem. First of all, you need to answer how you measure your error (information loss, as you call it). Typical error norms used are l1 (sum of individual errors), l2 (sum of squares of errors, penalizes larger errors more) and l-infinity (maximum error).
Once you select a norm, there always exists a partition that minimizes your error, and to find it there are a bunch of heuristic algorithms, e.g. k-means clustering. Luckily, since your data is one-dimensional and you have very few categories, you can just brute force it (for 4 categories you need to correctly place 3 boundaries, and naively trying all possible positions takes only n^3 runtime)
Hope this helps.
If it's worth saying, but not worth its own post (even in Discussion), then it goes here.
Notes for future OT posters:
1. Please add the 'open_thread' tag.
2. Check if there is an active Open Thread before posting a new one. (Immediately before; refresh the list-of-threads page before posting.)
3. Open Threads should be posted in Discussion, and not Main.
4. Open Threads should start on Monday, and end on Sunday.