
gjm comments on Open Thread March 21 - March 27, 2016 - Less Wrong Discussion

3 Post author: Gunnar_Zarncke 20 March 2016 07:54PM



Comment author: Vitor 22 March 2016 04:18:04PM 10 points [-]

Your problem is called a clustering problem. First, you need to decide how to measure your error (the information loss, as you call it). Typical error norms are l1 (sum of absolute errors), l2 (sum of squared errors, which penalizes larger errors more), and l-infinity (maximum error).

Once you select a norm, there always exists a partition that minimizes your error, and there are a number of heuristic algorithms for finding it, e.g. k-means clustering. Luckily, since your data is one-dimensional and you have very few categories, you can just brute-force it: for 4 categories you need to correctly place 3 boundaries, and naively trying all possible boundary positions takes only O(n^3) time.
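To make the brute-force idea concrete, here is a minimal sketch (my own illustration, not from the comment) that sorts the data, tries every placement of the k-1 boundaries, and keeps the partition with the smallest l2 error (sum of squared deviations from each cluster's mean). The function name and interface are made up for the example.

```python
from itertools import combinations

def brute_force_partition(data, k=4):
    """Partition 1-D data into k contiguous clusters (after sorting),
    minimizing the l2 error: the total sum of squared deviations of
    each point from its cluster's mean. Naive O(n^(k-1)) search over
    all placements of the k-1 boundaries."""
    xs = sorted(data)
    n = len(xs)

    def sse(segment):
        # sum of squared errors of a segment about its mean
        m = sum(segment) / len(segment)
        return sum((x - m) ** 2 for x in segment)

    best_cost, best_parts = float("inf"), None
    # each combination of k-1 cut positions splits xs into k segments
    for cuts in combinations(range(1, n), k - 1):
        bounds = (0,) + cuts + (n,)
        parts = [xs[bounds[i]:bounds[i + 1]] for i in range(k)]
        cost = sum(sse(p) for p in parts)
        if cost < best_cost:
            best_cost, best_parts = cost, parts
    return best_parts, best_cost
```

For example, `brute_force_partition([1, 2, 10, 11, 20, 21], k=3)` recovers the three obvious pairs. For a different norm, swap out `sse` (e.g. sum of absolute deviations from the median for l1).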

Hope this helps.

Comment author: gjm 22 March 2016 04:41:08PM 2 points [-]

A possibly relevant paper for anyone wanting to do this in one dimension to a dataset large enough that they care about efficiency.