The plots below are an example: if all you had was the median, you wouldn't know these three datasets were different; but if all you had was the mean, you would know:
Take those three datasets, and shift each of them by (−mean) to generate 3 new datasets. Then suddenly the median is distinct on all 3 new datasets, but the mean is the same.
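A minimal numerical sketch of this construction (the three datasets here are toy examples of my own, not the plotted ones):

```python
import numpy as np

# Three toy datasets with the same median (2) but different means.
datasets = [np.array([1.0, 2.0, 3.0]),
            np.array([1.0, 2.0, 6.0]),
            np.array([0.0, 2.0, 3.0])]

for d in datasets:
    shifted = d - d.mean()  # shift by (-mean)
    print(f"before: median={np.median(d):.2f}, mean={d.mean():.2f}   "
          f"after: median={np.median(shifted):.2f}, mean={shifted.mean():.2f}")

# Before the shift, the mean distinguishes the datasets and the median doesn't;
# after the shift, it's the other way around.
```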
It seems for any set of two datasets with identical (statistical parameter X) and different (statistical parameter Y), you can come up with a different set of two datasets with different (statistical parameter X) and identical (statistical parameter Y)[1], which seems rather symmetric?
There are differences between the median and mean; this does not appear to be a sound justification.
(Now, moving one datapoint on the other hand...)
At least ones with a symmetry (in this case translation symmetry).
It’s a good observation that I hadn’t considered - thanks for sharing it. After thinking it over, I don’t think it’s important.
I don’t know quite how to say this succinctly, so I’ll say it in a rambling way: subtracting the mean is an operation precisely targeted to make the means the same. Of course mean-standardized distributions have the same means! In the family of all distributions, there will be a lot more with the same median than with the same mean (I can’t prove this properly, but it feels true). You’re right that, in the family of mean-standardized distributions, this isn’t true, but that’s a family of distributions specifically constructed to make it untrue. The behavior in that specific family isn’t very informative about the information content of these location parameters in general.
subtracting the mean is an operation precisely targeted to make the means the same.
Fair, but on the flip side, moving all the data below the median by a noisy but always-negative amount is an operation precisely targeted to make the medians the same.
(To be perfectly clear: there are clear differences between the median and mean in terms of robustness; I just don't think this particular example thereof is a good one.)
I’d argue that moving the data like that is not as precise: “choose any data point right of the median, and add any amount to it” is a larger set of operations than “subtract the mean from the distribution”.
(Although: is there a larger class of operations than just subtracting the mean that result in identical means but different medians? If there were, that would damage my conception of robustness here, but I haven’t tried to think of how to find such a class of operations, if they exist.)
(Although: is there a larger class of operations than just subtracting the mean that result in identical means but different medians? If there were, that would damage my conception of robustness here, but I haven’t tried to think of how to find such a class of operations, if they exist.)
Subtracting the mean and then scaling the resulting distribution by any nonzero constant works.
Alternatively, if you have a distribution and want to turn it into a different distribution with the same mean but different median:
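One construction that works (my own illustration): move one point up and another point down by the same amount, choosing them so that one of the two crosses the median. A minimal sketch:

```python
import numpy as np

# Toy data (my own numbers): mean 4.0, median 3.0.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

y = x.copy()
y[1] += 2.5  # 2.0 -> 4.5: this point crosses the old median of 3.0
y[4] -= 2.5  # 10.0 -> 7.5: moved by the opposite amount, stays above the median

print(np.mean(x), np.mean(y))      # 4.0 4.0  (mean preserved)
print(np.median(x), np.median(y))  # 3.0 4.0  (median changed)
```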
I’d argue that moving the data like that is not as precise: “choose any data point right of the median, and add any amount to it” is a larger set of operations than “subtract the mean from the distribution”.
Aha. Now you are getting closer to the typical notion of robustness!
Imagine taking a sample of N elements (...N odd, to keep things simple) from a distribution and applying a random infinitesimal perturbation. Say chosen uniformly from $\{-\epsilon, 0, +\epsilon\}$ for every element in said sample.
In order for the median to stay the same, the median element must not change[1]. So we have a probability of 1/3rd that this doesn't change the median. This scales as $O(1)$.
In order for the mean to stay the same, the resulting perturbation must have a mean of 0 (and hence, must have a sum of zero). How likely is this? Well, this is just a lazy random walk. The resulting probability (in the large-N limit) is just[2] $\sqrt{\frac{3}{4\pi N}}$.
This scales as $O\!\left(\frac{1}{\sqrt{N}}\right)$.
I like that notion of robustness! I’m having trouble understanding the big-O behavior here because of the 1/N^2 term - does the decreasing nature of this function as N goes up mean the mean becomes more robust than the median for large N, or does the median always win for any N?
Ah, to be clear:
There's a 1/3rd chance that the median does not change under an infinitesimal perturbation as I've defined it.
There's a roughly $\sqrt{\frac{3}{4\pi N}}$ chance that the mean does not change under an infinitesimal perturbation as I've defined it.
Or, to flip it around:
There's a 2/3rds chance that the median does change under an infinitesimal perturbation as I've defined it.
There's a roughly $1 - \sqrt{\frac{3}{4\pi N}}$ chance that the mean does change under an infinitesimal perturbation as I've defined it.
As you increase the number of data points, the mean asymptotes towards 'almost always' changing under an infinitesimal perturbation, whereas the median stays at a 2/3rds[1] chance.
Minor self-nit: this was assuming an odd number of data points. That being said, the probability assuming an even number of data points (and hence mean-of-center-two-elements) actually works out to the same - all combinations except $(-\epsilon, +\epsilon)$, $(0, 0)$, and $(+\epsilon, -\epsilon)$ for the two center elements change the median, or 6/9 possibilities.
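A quick Monte Carlo sketch of this notion of robustness (the sample size, spacing, and perturbation scale are my own choices, following the uniform-on-{-ε, 0, +ε} setup above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials, eps = 11, 200_000, 1  # odd N; eps is tiny relative to the data spacing

# Widely spaced, distinct data, so a +/-eps perturbation never reorders points.
data = np.arange(N) * 1000

deltas = rng.integers(-1, 2, size=(trials, N)) * eps  # uniform on {-eps, 0, +eps}
perturbed = data + deltas

p_median_same = np.mean(np.median(perturbed, axis=1) == np.median(data))
p_mean_same = np.mean(perturbed.sum(axis=1) == data.sum())  # exact integer check

print(f"P(median unchanged) ~ {p_median_same:.3f}  (claimed: 1/3)")
print(f"P(mean unchanged)   ~ {p_mean_same:.3f}  "
      f"(large-N approximation sqrt(3/(4*pi*N)) = {np.sqrt(3 / (4 * np.pi * N)):.3f})")
```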
Similar arguments show why the mode uses even less information than the median: you can shift and scramble around any data that isn't the peak of the distribution, and you'll still have the same mode. You can even pass data across the mode without changing it, unlike with the median.
This characterization doesn't seem to quantify information entirely.
If the data has, say, a mode of 3 (as in the number 3), and the next runner-up is 4, coming in at two 4s versus three 3s, then adding a couple of 4s makes 4 the mode, even if the median and mean haven't changed much (maybe there are a few dozen items on the list total).
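A toy version of this scenario (the exact counts are my own invention):

```python
from statistics import mean, median, mode

data = [3] * 3 + [4] * 2 + [1, 2, 5, 6, 7, 8, 9, 10]  # mode 3, median 4, mean 5.0
more = data + [4, 4]                                   # add a couple of 4s

for d in (data, more):
    print(f"mode={mode(d)}  median={median(d)}  mean={mean(d):.2f}")

# The mode flips from 3 to 4, while the median is unchanged and the mean
# barely moves.
```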
The probabilistic mode: (that I just made up)
Pick an element at random, call that the mode. Now the expected distribution is such that its max is the mode. By ditching the non-continuous nature of the mode, the question shifts to:
So... if the L2 minimizer uses more information from x than the L1 minimizer does, can we do better than the mean? Maybe the L4 minimizer uses yet more information? (I'm skipping L3 for reasons. [Edit, a year later: I've forgotten these reasons]).
What's L3?
Another approach to the same question: We compute how much Shannon information is in each parameter.
Good.
The reason Lp distances matter here is: the different location parameters minimize different Lp distances. The mean is the single number $\mu$ that minimizes the L2 distance $L_2(x,\mu)=\sqrt{\sum_i^n (x_i-\mu)^2}$. The median is the number $\tilde{m}$ that minimizes $L_1(x,\tilde{m})$, and the mode is the $\hat{m}$ that minimizes $L_\infty(x,\hat{m})$.
I believe the mode minimizes the $L_0$ distance rather than the $L_\infty$ distance? Since $L_0$ measures whether two values are different, and the mode maximizes the probability that a random point is the same as the mode?
I think the number minimizing $L_\infty$ would be the mean of the minimum and the maximum value of the distribution, which we might call the "middle of the distribution". (Unfortunately "middle" also sounds like it could describe the median, but I think it better describes the $L_\infty$ minimizer than it describes the median.) This probably also gives a hint as to what minimizing $L_4$ would look like; it would tend to be somewhere in between the mean and the middle, like a location statistic that's extra sensitive to outliers.
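A quick numeric check of the midrange claim (the dataset is arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])
candidates = np.linspace(x.min(), x.max(), 100_001)

# L-infinity distance from each candidate location to the data.
linf = np.max(np.abs(x[:, None] - candidates[None, :]), axis=0)

print("L-infinity minimizer:", candidates[linf.argmin()])  # ~5.5
print("midrange (min+max)/2:", (x.min() + x.max()) / 2)    # 5.5
print("median:              ", np.median(x))               # 2.0
```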
(I'm skipping L3 for reasons. [Edit, a year later: I've forgotten these reasons])
Probably to avoid absolute values?
Epistemic status: Unsure. I had this in drafts from a year ago, and am posting it for Goodhart day. (Though it's April 1, all the arguments and statements in this post are things I think are true, with no jokes.) I'm interested in arguments against this thesis, and especially interested in thoughts on the question at the end - do the distribution summarizers corresponding to the L3 or L4 minimizers use more information than the mean (the L2 minimizer)?
The mean, median, and mode are the "big 3" location parameters that most people have heard of. But they can have very different properties, and these differences are related to the fact that the mean uses more information than the median, and the median uses more information than the mode.
Refresher
The mean, median, and mode measure the location of a probability distribution. For the Gaussian distribution, they are all the same, but this isn't the case in general. Here's an example of a Gamma distribution where the three differ:
The mean corresponds to the middle of the distribution when weighted by frequency: it's the distribution's center of mass.
The median corresponds to the middle of the distribution without using the weights: it's the vertical line that splits a distribution such that 50% of the probability mass is on the left and 50% on the right.
The mode is the location of the highest point of the distribution.
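For a concrete example with SciPy (the shape parameter here is my own pick, not necessarily the one used for the figure):

```python
from scipy.stats import gamma

# A right-skewed Gamma distribution; shape a > 1 so the mode is positive.
a, scale = 2.0, 1.0
dist = gamma(a, scale=scale)

print("mean:  ", dist.mean())       # a * scale = 2.0
print("median:", dist.median())     # ~1.68 (no simple closed form)
print("mode:  ", (a - 1) * scale)   # (a - 1) * scale = 1.0, the peak of the pdf
```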
Different amounts of information usage
The median is preserved under a larger set of changes to the data than the mean is. Really, this is often why people use the median: outliers don't knock it around as much as they do the mean. But that ability to resist being knocked around - "robustness" - is the same as the ability to ignore information. The mean's sensitivity is sometimes seen as a liability (and sometimes is a liability), but being sensitive here is the same thing as reacting more to the data. It's good to react to the data: the mean can distinguish between the three different datasets below, but the median can't. The plots below are an example: if all you had was the median, you wouldn't know these three datasets were different; but if all you had was the mean, you would know:
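Here's a tiny numerical illustration of the same sensitivity-versus-robustness point (the numbers are my own, not the plotted datasets):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x.copy()
y[-1] = 500.0  # drag the largest point far to the right

print("median:", np.median(x), "->", np.median(y))  # 3.0 -> 3.0   (ignores the move)
print("mean:  ", x.mean(), "->", y.mean())          # 3.0 -> 102.0 (reacts to the move)
```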
Similar arguments show why the mode uses even less information than the median: you can shift and scramble around any data that isn't the peak of the distribution, and you'll still have the same mode. You can even pass data across the mode without changing it, unlike with the median.
So there's a trade-off between robustness and information.
Using p-distances to reason about distribution summarizers
We can talk about location parameters in terms of different distance metrics, called "Lp distances". An Lp distance measures the distance between two data vectors. If we call these vectors x and y, and each has n elements, the general form is $L_p(x,y)=\left(\sum_i^n |x_i-y_i|^p\right)^{\frac{1}{p}}$. For different p, this can easily produce different distances, even for the same x and y. So each different p represents a different way of measuring distance.
(These things show up all over the place in math, under different names: I first saw them in linear algebra as $\|x-y\|$, $\|x-y\|_2$, and $\|x-y\|_\infty$. Then later I saw them in optimization, where the L1 norm was being called the Manhattan/Taxicab distance, and L2 the Euclidean distance - the form $\sqrt{\sum_i^n (x_i-y_i)^2}$ will look familiar to some. If any of those look familiar, you've heard of Lp norms.)
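As a concrete sketch of the definition (the two vectors are arbitrary):

```python
import numpy as np

def lp_distance(x, y, p):
    """General Lp distance between two equal-length data vectors."""
    if np.isinf(p):
        return np.max(np.abs(x - y))  # the p -> infinity limit
    return (np.abs(x - y) ** p).sum() ** (1 / p)

x = np.array([0.0, 1.0, 2.0])
y = np.array([1.0, 1.0, 4.0])
for p in (1, 2, 4, np.inf):
    print(p, lp_distance(x, y, p))  # same x and y, different distance for each p
```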
The reason Lp distances matter here is: the different location parameters minimize different Lp distances. The mean is the single number $\mu$ that minimizes the L2 distance $L_2(x,\mu)=\sqrt{\sum_i^n (x_i-\mu)^2}$. The median is the number $\tilde{m}$ that minimizes $L_1(x,\tilde{m})$, and the mode is the $\hat{m}$ that minimizes $L_\infty(x,\hat{m})$.
(Proof for the L2 case: Say the scalar $\beta$ minimizes $\sqrt{\sum_i^n (x_i-\beta)^2}$. Since square root is a monotonic function, that's the same as $\beta$ minimizing $\sum_i^n (x_i-\beta)^2$. One way to minimize something is to take the derivative and set it to 0, so let's do that:
$\frac{d}{d\beta}\sum_i^n (x_i-\beta)^2 = -2\sum_i^n (x_i-\beta) = 0$ [taking the derivative and setting it to 0]
$\sum_i^n x_i - \sum_i^n \beta = \sum_i^n x_i - n\beta = 0$ [dropping the irrelevant $-2$, and separating $\beta$]
Rearranging, $\sum_i^n x_i = n\beta$, which means $\beta = \frac{1}{n}\sum_i^n x_i$. The right-hand side is the form of the mean, so if $\beta$ minimizes the L2 distance, then $\beta=\mu$.)
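A brute-force sanity check of these minimizers on a made-up dataset (grid search, so the results match only up to the grid resolution):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])
candidates = np.linspace(x.min(), x.max(), 100_001)
diffs = np.abs(x[:, None] - candidates[None, :])

l2 = np.sqrt((diffs ** 2).sum(axis=0))
l1 = diffs.sum(axis=0)

print("L2 minimizer:", candidates[l2.argmin()], " mean:  ", x.mean())      # ~3.6
print("L1 minimizer:", candidates[l1.argmin()], " median:", np.median(x))  # ~2.0
```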
So... if the L2 minimizer uses more information from x than the L1 minimizer does, can we do better than the mean? Maybe the L4 minimizer uses yet more information? (I'm skipping L3 for reasons. [Edit, a year later: I've forgotten these reasons]). To investigate, we'll first need the form of $\eta$ that minimizes $L_4(x,\eta)$, just like we found for $\beta$ in the proof above. We're minimizing $\sum_i^n (x_i-\eta)^4$; setting the derivative to 0 means finding $\eta$ such that $-4\sum_i^n (x_i-\eta)^3=0$. [At this point, writing this a year ago, I got stuck, and now I am posting this anyway.]
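Even without a closed form, the L4 minimizer can be found numerically; here is a sketch on a made-up dataset:

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0, 3.0, 10.0])
candidates = np.linspace(x.min(), x.max(), 100_001)

l4 = (np.abs(x[:, None] - candidates[None, :]) ** 4).sum(axis=0)

print("L4 minimizer:", candidates[l4.argmin()])  # ~4.99, dragged toward the outlier
print("mean:", x.mean(), " median:", np.median(x), " midrange:", (x.min() + x.max()) / 2)
```

On this toy dataset the L4 minimizer lands between the mean and the midrange, reacting even more strongly to the outlier at 10 than the mean does.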
Another approach to the same question: We compute how much Shannon information is in each parameter. For a fixed parameter estimate $\mu$ / $\tilde{m}$ / $\hat{m}$, we can compare how many different data vectors x would give that same estimate. For example, we can ask: Given a fixed mean estimate $\mu$, for how many different x is it true that $\mu$ minimizes $L_2(x,\mu)$? It turns out there are way more such x for $(\mu, L_2)$ than there are for $(\tilde{m}, L_1)$.

To see this, think of a distribution of some count data. Since the median is the point for which 50% of data is on either side, moving around any value $x_i$ located to the right of the median, as long as it doesn't pass $\tilde{m}$, doesn't change the median! You could take a distribution of medical costs with median $40,000, and move a bunch of points that were around $50,000 up to $100,000, or 1 million or really as far as you want, for as many points as you want. Compare to the mean, where moving a bunch of points into the millions will strongly tug the mean to the right.

In terms of the domain and range of these averaging functions: the mean is a function from some set $X$ (data) to some set $M$ (possible means), the median is a function from the same $X$ to another set $\tilde{M}$, and $M$ is bigger than $\tilde{M}$. Is the set for the L4 average - call it $M'$ - bigger than $M$? I'm interested in thoughts on this!
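One way to get a feel for this counting argument is a brute-force enumeration over a tiny domain (the domain and dataset size here are toy choices of my own):

```python
from itertools import product
from collections import Counter

# All length-5 datasets with values in {0, 1, 2, 3, 4}, grouped by mean and by median.
N, values = 5, range(5)
means, medians = Counter(), Counter()
for x in product(values, repeat=N):
    means[sum(x) / N] += 1
    medians[sorted(x)[N // 2]] += 1

total = len(values) ** N  # 3125 datasets
print("distinct means:  ", len(means), "-> avg datasets per mean value:  ", total / len(means))
print("distinct medians:", len(medians), "-> avg datasets per median value:", total / len(medians))

# The median takes far fewer distinct values here (5 vs 21), so on average many
# more datasets collapse onto the same median than onto the same mean.
```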