Epistemic status: Unsure. I had this in drafts from a year ago, and am posting it for Goodhart day. (Though it's April 1, all the arguments and statements in this post are things I think are true, with no jokes.) I'm interested in arguments against this thesis, and especially interested in thoughts on the question at the end - do the distribution-summarizers corresponding to the $L_3$ or $L_4$ minimizers use more information than the mean (the $L_2$ minimizer)?


The mean, median, and mode are the "big 3" location parameters that most people have heard of. But they can have very different properties, and those differences are related to the fact that the mean uses more information than the median, and the median uses more information than the mode.

Refresher

The mean, median, and mode measure the location of a probability distribution. For the Gaussian distribution, they are all the same, but this isn't the case in general. Here's an example of a Gamma distribution where the three differ:

[Plot: gamma(shape = 2, rate = 1)]

The mean corresponds to the middle of the distribution when weighted by frequency. 

The median corresponds to the middle of the distribution, without using the weights. The median is the vertical line that splits a distribution such that 50% of the probability mass is on the left and 50% on the right. 

The mode is the location of the distribution's highest point, i.e. its peak.
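For the gamma(shape = 2, rate = 1) example above, the three come out in the order mode < median < mean. A minimal sketch (in Python with scipy, which is an assumption; the parameterization matches the plot caption):

```python
from scipy import stats

shape, rate = 2, 1
dist = stats.gamma(a=shape, scale=1 / rate)

print(dist.mean())         # 2.0  (shape / rate)
print(dist.median())       # ~1.68 (no simple closed form for the gamma median)
print((shape - 1) / rate)  # 1.0, the mode: where the density peaks (valid for shape >= 1)
```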

Different amounts of information usage

The median is preserved under a larger set of changes to the data than the mean is. Really, this is often why people use the median: outliers don't knock it around as much as they do the mean. But that ability to resist being knocked around - "robustness" - is the same as the ability to ignore information. The mean's sensitivity is sometimes seen as a liability (and sometimes is a liability), but being sensitive here is the same thing as reacting more to the data. It's good to react to the data: the mean can distinguish between the three different datasets below, but the median can't. The plots below are an example: if all you had was the median, you wouldn't know these three datasets were different; but if all you had was the mean, you would know:

[Plots.
Top: some data.
Middle: the data from the top plot, but with the values right of the median shifted 4 to the left.
Bottom: the data from the middle plot, but with the values left of the median shifted and scrambled with noise.
Since no data crossed the green median line, the median didn't move.]
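The same demonstration numerically, as a minimal sketch (the data and the shrink/scramble factors are made up; the only constraint is that nothing crosses the median):

```python
import numpy as np

rng = np.random.default_rng(0)
top = rng.normal(loc=10, scale=1, size=1001)   # "some data"
med = np.median(top)

# Middle plot: pull the values right of the median most of the way toward it (nothing crosses).
middle = top.copy()
above = middle > med
middle[above] = med + 0.1 * (middle[above] - med)

# Bottom plot: scramble the values left of the median, keeping them left of it.
bottom = middle.copy()
below = bottom < med
bottom[below] = med - (med - bottom[below]) * rng.uniform(0.1, 2.0, size=below.sum())

for data in (top, middle, bottom):
    print(np.median(data), np.mean(data))   # same median every time; different means
```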

Similar arguments show why the mode uses even less information than the median: you can shift and scramble around any data that isn't the peak of the distribution, and you'll still have the same mode. You can even pass data across the mode without changing it, unlike with the median.
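A tiny sketch of that last point, with made-up counts: move a point all the way across the mode, and the mode doesn't notice, but the median (and mean) do.

```python
from collections import Counter
import numpy as np

data = [1, 1, 1, 1, 5, 6, 7, 8, 9]          # mode 1, median 5, mean ~4.33
moved = [0 if x == 9 else x for x in data]   # pass the 9 all the way across the mode

mode = lambda xs: Counter(xs).most_common(1)[0][0]
print(mode(data), mode(moved))               # 1 and 1: the mode is unchanged
print(np.median(data), np.median(moved))     # 5.0 and 1.0: the median noticed
print(np.mean(data), np.mean(moved))         # ~4.33 and ~3.33: so did the mean
```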

So there's a trade-off between robustness and information.

Using p-distances to reason about distribution summarizers

We can talk about location parameters in terms of different distance metrics, called "$L_p$ distances". An $L_p$ distance measures the distance between two data vectors. If we call these vectors $x$ and $y$, and each has $n$ elements, the general form is $\left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}$. For different $p$, this can easily produce different distances, even for the same $x$ and $y$. So each different $p$ represents a different way of measuring distance.

(These things show up all over the place in math, under different names: I first saw them in linear algebra as the $\ell_1$, $\ell_2$, and $\ell_\infty$ norms. Then later I saw them in optimization, where the $L_1$ norm was being called the Manhattan/Taxicab distance, and $L_2$ the Euclidean distance - the form $\sqrt{\sum_i (x_i - y_i)^2}$ will look familiar to some. If any of those look familiar, you've heard of $L_p$ norms.)

The reason $L_p$ distances matter here is: the different location parameters minimize different $L_p$ distances. The mean is the single number $m$ that minimizes the $L_2$ distance $\left(\sum_i |x_i - m|^2\right)^{1/2}$. The median is the number $m$ that minimizes the $L_1$ distance $\sum_i |x_i - m|$, and the mode is the $m$ that minimizes the $L_\infty$ distance $\max_i |x_i - m|$.

(Proof for the $L_2$ case: Say the scalar $m$ minimizes $\left(\sum_i (x_i - m)^2\right)^{1/2}$. Since square root is a monotonic function, that's the same as $m$ minimizing $\sum_i (x_i - m)^2$. One way to minimize something is to take the derivative and set it to 0, so let's do that:

$\frac{d}{dm} \sum_i (x_i - m)^2 = \sum_i -2(x_i - m) = 0$ [taking the derivative and setting it to 0]

$\sum_i x_i - \sum_i m = 0$ [dropping the irrelevant $-2$, and separating the sum]

Rearranging, $\sum_i x_i = nm$, which means $m = \frac{1}{n} \sum_i x_i$. The right-hand side is the form of the mean, so if $m$ minimizes the $L_2$ distance, then $m = \bar{x}$.)
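As a sanity check on the $L_2$ and $L_1$ claims, here's a minimal numerical sketch (made-up skewed data, and a brute-force grid search over candidate $m$ rather than anything clever):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1.0, size=500)   # skewed data, so mean != median
m = np.linspace(x.min(), x.max(), 20_001)       # candidate location values

diffs = np.abs(m[:, None] - x[None, :])         # |x_i - m| for every candidate m
l2 = (diffs ** 2).sum(axis=1) ** 0.5
l1 = diffs.sum(axis=1)

print(m[l2.argmin()], x.mean())       # L2 minimizer is (approximately) the mean
print(m[l1.argmin()], np.median(x))   # L1 minimizer is (approximately) the median
```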

 

So... if the $L_2$ minimizer uses more information from $x$ than the $L_1$ minimizer does, can we do better than the mean? Maybe the $L_4$ minimizer uses yet more information? (I'm skipping $L_3$ for reasons. [Edit, a year later: I've forgotten these reasons]). To investigate, we'll first need the form of the $m$ that minimizes $\left(\sum_i |x_i - m|^4\right)^{1/4}$, just like we found for $L_2$ in the proof above. We're minimizing $\sum_i (x_i - m)^4$; setting the derivative to 0 means finding the $m$ such that $\sum_i (x_i - m)^3 = 0$. [At this point, writing this a year ago, I got stuck, and now I am posting this anyway.]
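Numerically, though, the condition is easy to handle. A minimal sketch (made-up data; scipy's root finder standing in for the closed form I couldn't find): $\sum_i (x_i - m)^3$ decreases in $m$, so its root can be bracketed between $\min(x)$ and $\max(x)$.

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(2)
x = rng.gamma(shape=2.0, scale=1.0, size=500)   # the same kind of skewed data as before

# d/dm sum (x_i - m)^4 = 0  <=>  sum (x_i - m)^3 = 0, which is decreasing in m.
g = lambda m: np.sum((x - m) ** 3)
l4_minimizer = brentq(g, x.min(), x.max())

# Typically median < mean < L4 minimizer here: L4 is pulled harder by the right tail.
print(np.median(x), np.mean(x), l4_minimizer)
```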

 

Another approach to the same question: We compute how much Shannon information is in each parameter. For a fixed parameter estimate (mean / median / mode), we can compare how many different data vectors $x$ would give that same estimate. For example, we can ask: given a fixed mean estimate $m$, for how many different $x$ is it true that $m$ minimizes $\left(\sum_i |x_i - m|^2\right)^{1/2}$? It turns out there are way more such $x$ for the median ($L_1$) than there are for the mean ($L_2$). To see this, think of a distribution of some count data. Since the median is the point for which 50% of data is on either side, moving around any value located to the right of the median, as long as it doesn't cross the median, doesn't change the median! You could take a distribution of medical costs with median $40,000, and move a bunch of points that were around $50,000 up to $100,000, or 1 million, or really as far as you want, for as many points as you want. Compare to the mean, where moving a bunch of points into the millions will strongly tug the mean to the right. In terms of the domain and range of these averaging functions: the mean is a function from some set $X$ (data) to some set $M$ (possible means), the median is a function from the same $X$ to another set $M_{\text{med}}$, and $M$ is bigger than $M_{\text{med}}$. Is the set for the $L_4$ average - call it $M_4$ - bigger than $M$? I'm interested in thoughts on this!
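One crude way to probe the last question numerically is a counting sketch (small made-up universe of datasets, distinct outputs counted after rounding, so this is only suggestive): enumerate every multiset of five values from $\{0, \dots, 5\}$ and count how many distinct medians, means, and $L_4$ minimizers appear.

```python
import itertools
import numpy as np
from scipy.optimize import brentq

values = range(6)   # data values drawn from {0, ..., 5}
n = 5               # dataset size

def l4_minimizer(x):
    """The m with sum((x_i - m)^3) = 0, i.e. the minimizer of the L4 distance."""
    x = np.asarray(x, dtype=float)
    g = lambda m: np.sum((x - m) ** 3)
    return brentq(g, x.min() - 1e-9, x.max() + 1e-9)

means, medians, l4s = set(), set(), set()
for x in itertools.combinations_with_replacement(values, n):
    means.add(round(float(np.mean(x)), 9))
    medians.add(round(float(np.median(x)), 9))
    l4s.add(round(l4_minimizer(x), 9))

# The median's image is smallest; is the L4 image bigger than the mean's?
print(len(medians), len(means), len(l4s))
```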

Comments
TLW:

The plots below are an example: if all you had was the median, you wouldn't know these three datasets were different; but if all you had was the mean, you would know:

Take those three datasets, and shift each by (-mean) to generate 3 new datasets. Then suddenly the median is distinct on all 3 new datasets, but the mean is the same.

It seems for any set of two datasets with identical (statistical parameter X) and different (statistical parameter Y), you can come up with a different set of two datasets with different (statistical parameter X) and identical (statistical parameter Y)[1], which seems rather symmetric?

There are differences between the median and mean; this does not appear to be a sound justification.

(Now, moving one datapoint on the other hand...)

  1. ^

    At least ones with a symmetry (in this case translation symmetry).

It’s a good observation that I hadn’t considered - thanks for sharing it. After thinking it over, I don’t think it’s important.

I don’t know quite how to say this succinctly, so I’ll say it in a rambling way: subtracting the mean is an operation precisely targeted to make the means the same. Of course mean-standardized distributions have the same means! In the family of all distributions, there will be a lot more with the same median than with the same mean (I can’t prove this properly, but it feels true). You’re right that, in the family of mean-standardized distributions, this isn’t true, but that’s a family of distributions specifically constructed to make it untrue. The behavior in that specific family isn’t very informative about the information content of these location parameters in general.

TLW:

subtracting the mean is an operation precisely targeted to make the means the same.

Fair, but on the flipside moving all the data below the median a noisy but always-negative amount is an operation precisely targeted to make the medians the same.

(To be perfectly clear: there are clear differences between the median and mean in terms of robustness; I just don't think this particular example thereof is a good one.)

I’d argue that moving the data like that is not as precise: “choose any data point right of the median, and add any amount to it” is a larger set of operations than “subtract the mean from the distribution”.

(Although: is there a larger class of operations than just subtracting the mean that result in identical means but different medians? If there were, that would damage my conception of robustness here, but I haven’t tried to think of how to find such a class of operations, if they exist.)

TLW:

(Although: is there a larger class of operations than just subtracting the mean that result in identical means but different medians? If there were, that would damage my conception of robustness here, but I haven’t tried to think of how to find such a class of operations, if they exist.)

Subtracting the mean and then scaling the resulting distribution by any nonzero constant works.

Alternatively, if you have a distribution and want to turn it into a different distribution with the same mean but different median:

  • You can move two data points, one by X, the other by -X, so long as this results in a non-zero net number of crossings over the former median.
    • This is guaranteed to be the case with either any sufficiently large positive X or any sufficiently large negative X.
    • This is admittedly a one-dimensional subset of a 2-dimensional random space.
  • You can move three data points, one by X, one by Y, the last by -(X+Y), so long as this results in a non-zero net number of crossings over the former median.
    • This is guaranteed to be the case for large enough (absolute) values. (Unlike in the even-N case this always works for large X and/or Y, regardless of sign.)
    • This is admittedly a 2-dimensional subset of a 3-dimensional random space.
  • etc.
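A quick numerical check of the two-data-point move above, as a minimal sketch with made-up numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # made-up data: mean 3.5, median 3.5
y = x.copy()
y[0] += 5.0   # move one point up by X = 5: it crosses the old median
y[1] -= 5.0   # move another down by -X: it stays below, so the net crossing count is +1

print(np.mean(x), np.mean(y))       # 3.5 and 3.5: the mean can't tell
print(np.median(x), np.median(y))   # 3.5 and 4.5: the median moved
```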

I’d argue that moving the data like that is not as precise: “choose any data point right of the median, and add any amount to it” is a larger set of operations than “subtract the mean from the distribution”.

Aha. Now you are getting closer to the typical notion of robustness!

Imagine taking a sample of N elements (...N odd, to keep things simple) from a distribution and applying a random infinitesimal perturbation. Say chosen uniformly from $\{-\epsilon, 0, +\epsilon\}$ for every element in said distribution.

In order for the median to stay the same the median element must not change[1]. So we have a probability of 1/3rd that this doesn't change the median. This scales as $O(1)$.

In order for the mean to stay the same the resulting perturbation must have mean of 0 (and hence, must have a sum of zero). How likely is this? Well, this is just a lazy random walk. The resulting probability (in the large-N limit) is just[2]

$\approx \sqrt{\frac{3}{4 \pi N}}$[3]

This scales as $O(1/\sqrt{N})$.

  1. ^

    Because this is an infinitesimal perturbation the probability that this changes which element is the median is ~zero.

  2. ^
  3. ^

    A wild $\pi$ appeared![4]

  4. ^

    I don't know why I am so amused by $\pi$ turning up in 'random' places.
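A quick Monte Carlo sketch of these two probabilities (the choice of N and the trial count are arbitrary; the analytic comparison uses the large-N estimate above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 101, 200_000                        # arbitrary odd N and trial count

steps = rng.integers(-1, 2, size=(trials, N))   # each perturbation is -eps, 0, or +eps

# Median unchanged: the median element's own perturbation must be 0
# (any fixed index works by symmetry).
p_median_same = np.mean(steps[:, N // 2] == 0)

# Mean unchanged: the perturbations must sum to zero.
p_mean_same = np.mean(steps.sum(axis=1) == 0)

print(p_median_same)                               # ~ 1/3, independent of N
print(p_mean_same, np.sqrt(3 / (4 * np.pi * N)))   # ~ 0.049 for N = 101, shrinking like 1/sqrt(N)
```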

I like that notion of robustness! I’m having trouble understanding the big-O behavior here because of the 1/N^2 term - does the decreasing nature of this function as N goes up mean the mean becomes more robust than the median for large N, or does the median always win for any N?

TLW:

Ah, to be clear:

There's a 1/3rd chance that the median does not change under an infinitesimal perturbation as I've defined it.
There's a $\approx \sqrt{3/(4\pi N)}$ chance that the mean does not change under an infinitesimal perturbation as I've defined it.

Or, to flip it around:

There's a 2/3rds chance that the median does change under an infinitesimal perturbation as I've defined it.
There's a $\approx 1 - \sqrt{3/(4\pi N)}$ chance that the mean does change under an infinitesimal perturbation as I've defined it.

As you increase the number of data points, the mean asymptotes towards 'almost always' changing under an infinitesimal perturbation, whereas the median stays at a 2/3rds[1] chance.

  1. ^

    Minor self-nit: this was assuming an odd number of data points. That being said, the probability assuming an even number of data points (and hence mean-of-center-two-elements) actually works out to the same - the perturbation pairs $(-,-), (-,0), (0,-), (0,+), (+,0), (+,+)$ all change the median, or 6/9 possibilities.

Gotcha - thanks.

Similar arguments show why the mode uses even less information than the median: you can shift and scramble around any data that isn't the peak of the distribution, and you'll still have the same mode. You can even pass data across the mode without changing it, unlike with the median.

This characterization doesn't seem to quantify information entirely.

If the data has, say, a mode of 3 (as in the number 3), and the next runner-up is 4, coming in at 2 4s and 3 3s, then adding a couple of 4s makes 4 the mode, even if the median and mean haven't changed a lot (maybe there's a few dozen items on the list total).


The probabilistic mode: (that I just made up)

Pick an element at random, call that the mode. Now the expected distribution is such that its max is the mode. By ditching the non-continuous nature of the mode, the question shifts to:

  • how informative is this (now)
  • or the old question
  • what is this useful for?

So... if the $L_2$ minimizer uses more information from $x$ than the $L_1$ minimizer does, can we do better than the mean? Maybe the $L_4$ minimizer uses yet more information? (I'm skipping $L_3$ for reasons. [Edit, a year later: I've forgotten these reasons]).

What's $L_3$?


Another approach to the same question: We compute how much Shannon information is in each parameter.

Good.

The reason $L_p$ distances matter here is: the different location parameters minimize different $L_p$ distances. The mean is the single number $m$ that minimizes the $L_2$ distance $\left(\sum_i |x_i - m|^2\right)^{1/2}$. The median is the number $m$ that minimizes the $L_1$ distance $\sum_i |x_i - m|$, and the mode is the $m$ that minimizes the $L_\infty$ distance $\max_i |x_i - m|$.

I believe the mode minimizes the $L_0$ distance rather than the $L_\infty$ distance? Since $L_0$ measures whether two values are different, and the mode maximizes the probability that a random point is the same as the mode?

I think the number minimizing $L_\infty$ would be the mean of the minimum and the maximum value of the distribution, which we might call the "middle" of the distribution. (Unfortunately "middle" also sounds like it could describe the median, but I think it better describes the $L_\infty$ minimizer than it describes the median.) This probably also gives a hint as to what minimizing $L_4$ would look like; it would tend to be somewhere in between the mean and the middle, like a location statistic that's extra sensitive to outliers.
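A quick numeric check of that midrange claim, as a minimal sketch (made-up data, brute-force search over candidate $m$):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.gamma(shape=2.0, scale=1.0, size=200)   # skewed made-up data

m = np.linspace(x.min(), x.max(), 20_001)
linf = np.max(np.abs(m[:, None] - x[None, :]), axis=1)   # L_inf distance for each candidate m

# Both prints give roughly the midrange, well to the right of the mean and median here.
print(m[linf.argmin()], (x.min() + x.max()) / 2)
```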

(I'm skipping $L_3$ for reasons. [Edit, a year later: I've forgotten these reasons])

Probably to avoid absolute values?