Hold On To The Curiosity

Ben Pace

Recently, an excited friend was telling me the story behind why we care about the mean, median and mode.

They explained that a straightforward idea for what you might want in an ‘average’ number is something that minimises how far it is from all the numbers in the dataset. So if your numbers are 1, 2 and 3, you want a number $x$ such that the sum of its distances to the datapoints is as small as possible. It turns out this number is 2.

However, if your numbers are 1, 2, and 4, the number that minimises the total distance to all of them is also 2.

Huh?

When my friend told me this, the two other people I was with sort of said “Okay”. I said “What? No! I don’t believe you! It has to change when the data does - it’s a linear sum, so it has to change! It’s like you’re saying the sum of 1, 2 and 3 is the same as the sum of 1, 2 and 4. This is just wrong.” Suffice it to say, my friend’s claim wasn’t predicted by my understanding of math.

Now, did I really not believe my friend? The other two people with us were certainly fine with it. Isn’t this just Bayesianism? That’s how the old joke goes:

Math teacher: Now I’m going to prove to you that X is true.

Bayesian: You just did.

Actually, no. You taught me a detail to memorise, but my models didn’t improve. I won’t be able to improve how I use averages, because I don’t understand how it fits in with everything else I understand - it doesn’t fit with the models I use everywhere else in math.

I mean, I could’ve nodded along. It’s only one fact, after all. But if I’m going to remember it in the long term, it should connect to my other models and be reinforced. The alternative is to be stored in the brain with all those other memorised facts that students learn for exams and forget immediately after.

If you’re trying to build new models of a domain, it’s important to choose to speak from the confusion, not from the rest of yourself. Don’t have conversations about whether you believe a thing. Instead talk about whether you understand it.

(The problem above was the definition of the median, and an explanation of the math for the curious can be found in this comment.)

II.

It can be really hard to feel your models. Qiaochu Yuan’s method of learning involves ramping feeling-his-models up to 11. I recall him telling me about trying to learn what fire was once, where his first step was to just really feel his confusion:

What the hell is this orange stuff? How on earth does it get here? Why is it flickering? WHAT IS FIRE?!

After feeling the confusion, Qiaochu holds onto his frustration (which he finds easier to hold), and tries throwing ideas and possible explanations at it until all the parts finally fit together - that feeling when you say “Ohhhhhhh” and the models finally compute, and your beliefs predict the experience you have. Be frustrated with reality.

Tim Urban (of WaitButWhy) tells a similar story, where he can only write essays about things he doesn’t currently understand - and as he’s digging through all the facts and piecing things together, he writes down the things that made sense to him, that would successfully get the models across to an earlier version of Tim Urban.

I used to think this made no sense and he must just be bad at introspecting - shouldn’t you have to build an excellent model of other people to write so compellingly for so many tens of thousands of them?

Yet it’s actually really rare for authors to be strongly connected to their own models - when a teacher explains something for the hundredth time, they likely can't remember what it was like to learn it for the first. And so Tim’s explanations can be clearer than most.

In the opening example where I was surprised by the definition of the median, if you had offered me a bet I would’ve bet on the side that this was the definition of a median. But it was not a useful thought for me in that moment, to set aside my confusion and say “On reflection I believe you”. It can be correct in conversation, when your goal is understanding, to hold onto the confusion, the frustration, and let your models do the speaking.

III.

I often feel people try to move a conversation toward whether I believe the claim, rather than discussing and sharing what we each understand.

“Do you believe me when I say picking an average by minimising the distance to all the points is the same as the median?”

“Hmm, can you tell me why that’s the case? I have a model of arithmetic that says it shouldn’t be…”

A phrase I often use: “You may have changed my betting odds but you haven’t changed my models!”

We’re all in the game of trying to build models. Whether you’re trying to understand the field of science you’re attempting to add knowledge to, the product your startup is building, or the architecture of the AGI you’re trying to align, you need good models to leverage reality for whatever you care about.

One of the most important skills in life is the ability to hold onto your confusion and let your models do the talking, so they can interface with reality more directly. Choosing to notice and hold on to your confusion is hard, and it’s so easy to lose sight of it.

To put it another way, here are some perfectly acceptable noises to make when your goal is understanding:

What? No! I don’t believe you! That can't be true!

I expect that some but not all of this post is surprisingly Ben-specific. My thanks to Alex Zhu (zhukeepa) and Jacob Lagerros (jacobjacob) for reading drafts.

[Explanation of the math confusion] To solve the problem, think locally. At each possible number, you can move up or down, and see whether this increases or decreases the total cost. For example, with the data 1, 2 and 3, if you’re at 3 (which has cost 3) and you move up 1 to 4, you get a cost of 6, and if you move down to 2 you get a cost of 2.

One way you can think of the cost of 4 relative to 3 is that when you move up to 4 you move away 1 step from each of the three data points - so you increase the cost by three. You can think of moving from 3 to 2 as moving toward 2 data points and away from 1, which is a net benefit.

Overall, the goal is to find a state where movement in either direction causes you to move away from more points than you move toward. And that will always be the central data point, which has $n$ data points on either side, and yet as it moves toward $n$ of them it moves away from $n + 1$ (it moves off the data point it was standing on). And yes, if you have an even number of datapoints, then every point between the two central datapoints is a median.
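As a minimal sketch of the "think locally" argument (Python, not part of the original comment; the helper names `total_cost` and `step_changes` are made up for illustration), you can compute the cost at a point and see what a single step in either direction does to it:

```python
def total_cost(m, data):
    """Sum of absolute distances from m to every data point."""
    return sum(abs(x - m) for x in data)


def step_changes(m, data, step=1):
    """Change in total cost from moving one step down, and one step up, from m."""
    here = total_cost(m, data)
    return total_cost(m - step, data) - here, total_cost(m + step, data) - here


# Costs quoted in the explanation, for the data 1, 2, 3.
costs = {m: total_cost(m, [1, 2, 3]) for m in (2, 3, 4)}
print(costs)

# Standing on the central data point, a step in either direction makes things worse,
# and that stays true when the largest point moves from 3 to 4.
print(step_changes(2, [1, 2, 3]))
print(step_changes(2, [1, 2, 4]))
```

The costs come out as 2, 3 and 6 at the points 2, 3 and 4, matching the numbers in the explanation, and both step checks report an increase of 1 in each direction.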

Thanks to Buck Shlegeris for teaching me this.

It's possibly worth fitting this into a broader framework. The median minimizes the sum of $|x - m|$. (So it's the max-likelihood estimator if your tails are like $\exp(-|t|)$.) The mean minimizes the sum of $|x - m|^{2}$. (So it's the max-likelihood estimator if your tails are like $\exp(-t^{2})$, a normal distribution.)
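To spell out the max-likelihood remark, here is a short derivation. It assumes independent data points $x_1, \dots, x_n$ and unit-scale densities, neither of which the comment specifies. Taking the negative log of the likelihood in each case gives

$$-\log \prod_{i=1}^{n} \tfrac{1}{2} e^{-|x_i - m|} = \sum_{i=1}^{n} |x_i - m| + n \log 2,$$

$$-\log \prod_{i=1}^{n} \tfrac{1}{\sqrt{\pi}} e^{-(x_i - m)^{2}} = \sum_{i=1}^{n} |x_i - m|^{2} + \tfrac{n}{2} \log \pi.$$

The constants do not depend on $m$, so maximizing the likelihood is the same as minimizing the sum. In the second case, setting the derivative to zero gives $-2 \sum_{i} (x_i - m) = 0$, that is, $m = \tfrac{1}{n} \sum_{i} x_i$, the ordinary mean.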

What about other exponents? 1 and 2 are kinda standard cases; what about 0 or infinity? Or negative numbers? Let's consider 0 first of all. $|x - m|^{0}$ is zero if $m = x$ and 1 otherwise. Minimizing the sum of these means maximizing the number of things equal to $m$. This is the mode! (We'll continue to get the mode if we use negative exponents. In that case we'd better maximize the sum instead of minimizing it, of course.) As $p$ increases without limit, minimizing the sum of $|x - m|^{p}$ gets closer and closer to minimizing $\max(|x - m|)$. If the 0-mean is the mode, the 1-mean is the median and the 2-mean is the ordinary mean, then the infinity-mean is midway between the max and min of your data. This one doesn't get quite so much attention in stats class :-).
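A rough numerical sketch of this family, in Python. The names `p_cost` and `p_mean`, the grid search, and the example data are all assumptions made for illustration, not anything from the comment. The sketch follows the comment's convention that $|x - m|^{0}$ is zero when $m = x$, treats $p = \infty$ as minimizing the largest distance, and expects integer data so the candidate grid contains every data point:

```python
import math


def p_cost(m, data, p):
    """Cost of summarising data by m under exponent p."""
    if p == 0:
        # Convention from the text: |x - m|**0 counts as 0 when x == m, else 1.
        return sum(1 for x in data if x != m)
    if p == math.inf:
        # Limiting case: only the farthest data point matters.
        return max(abs(x - m) for x in data)
    return sum(abs(x - m) ** p for x in data)


def p_mean(data, p, steps_per_unit=10):
    """Grid-search minimiser of p_cost between min(data) and max(data).

    Assumes integer data, so the grid k / steps_per_unit hits every data point.
    """
    lo, hi = min(data), max(data)
    candidates = [k / steps_per_unit
                  for k in range(lo * steps_per_unit, hi * steps_per_unit + 1)]
    return min(candidates, key=lambda m: p_cost(m, data, p))


data = [1, 1, 2, 4, 10]
for p in (0, 1, 2, math.inf):
    print(p, p_mean(data, p))
```

For this data the four minimisers are 1.0 (the mode), 2.0 (the median), 3.6 (the mean) and 5.5 (halfway between 1 and 10). A grid search is only a crude stand-in for exact minimisation, and it returns a single point even when several points tie.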

The median is famously more robust than the mean: it's affected less by outliers. (This goes along with the fatter tails it assumes: if you assume a very thin-tailed distribution, then an outlying point is super-unlikely and you're going to have to try very hard to make it less outlying.) The mode is more robust still, in that sense. The "infinity-mean" (note: these are my names, and so far as I know no one else uses them) is kinda the least robust average you can imagine, being affected *only* by the most outlying data points.

The standard name for the "infinity-mean" is the midrange.

Yeah, thanks for this comment, I sorta skipped it because I didn't want to write too much... or something. In retrospect I'm not sure I modelled curious readers well enough, I should've just left it in.

One thing I noticed that I'm not so sure about: A motivation you might have for $|x - m|^{2}$ over $|x - m|^{1}$ (i.e. mean over median) is that you want a summary statistic that always changes when the data points do. As you move from $1, 2, 3$ to $1, 2, 4$, the median doesn't change but the mean does.
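A quick check of that last sentence using Python's standard `statistics` module (the wrapper `summaries` is a made-up helper for illustration):

```python
from statistics import mean, median


def summaries(data):
    """Return (median, mean) for a list of numbers."""
    return median(data), mean(data)


for data in ([1, 2, 3], [1, 2, 4]):
    med, avg = summaries(data)
    print(data, "median:", med, "mean:", avg)
```

The median is 2 for both datasets, while the mean moves from 2 to 7/3.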

And yet, given that in $| x - m |^{p}$ with $p$ rising it approaches the centre of the max and min, it's curious to see that we've chosen $p = 2$ . We wanted a summary statistic that changed as the data did, but of all possible ones, changed the least with the data. We could've settled on any integer greater than 1, and we picked 2.

From a purely mathematical point of view I don't see why the exponent should be an integer. But p=2 is preferred over all other real values because of the Central Limit Theorem.

A longer explanation with pictures can be found here - Mean, median, mode, a unifying perspective

I don't think maximizing the sum of the negative exponents gets you the mode. If you use $0^{- p} = 0$ then the supremum (infinity) is not attained, while if you use $0^{- p} = \infty$ then the maximum (infinity) is attained at any data point. If you do it with a continuous distribution you get more sensible answers but the solution (which is intuitively the "point of greatest concentration") is not necessarily unique.

It's worth mentioning that when $p > 1$ the $p$-mean is unique: this is because $m \mapsto |x - m|^{p}$ is a strictly convex function, a sum of strictly convex functions is strictly convex, and a strictly convex function has at most one minimum point.

I'm using $0^{- p} = \infty$ and using the cheaty convention that e.g. $3 \cdot \infty > 2 \cdot \infty$ . I think this is what you get if you regard a discrete distribution as a limit of continuous ones. If this is too cheaty, of course it's fine just to stick with non-negative values of $p$ .

Yeah, OK. It works but you need to make sure to take the limit in a particular way, e.g. convolution with a sequence of approximations to the identity. Also you need to assume that $p > - 1$ since otherwise the statistic diverges even for the continuous distributions.

A geometric intuition I came up with while reading:

Take a number line, and put 1, 2, and 4 on it.

-1-2---4-

You're moving a pointer along this line, and trying to minimize its total distance to the data points:

-1-2---4-
....^

Intuitively, throwing it somewhere near the middle of the line makes sense. But drop 2 out, and look what happens as you move it:

-1-----4-
....^
.|--|
....|--|

vs.

-1-----4-
......^
.|----|
......||

The distance is the same either way! (Specifically, it's the same for any point in between 1 and 4.)

This means we're free to move our pointer only with respect to 2, so the best answer is to get a distance-from-2 of 0 by putting it directly on 2.

To generalize this to medians of larger data sets, imagine adding more pairs of points on the outside of the range - the total distance to those points will be the same, just as it was for 1 and 4.
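The pairing argument can be checked numerically. This is a small Python sketch, with `pair_distance` as a made-up helper name: for any pointer position between the two outer points, the combined distance to the pair is just the gap between them.

```python
def pair_distance(m, a, b):
    """Total distance from m to the two points a and b."""
    return abs(m - a) + abs(m - b)


# Anywhere between 1 and 4 the pair contributes the same total: 4 - 1 = 3.
for m in (1, 1.5, 2, 3, 4):
    print(m, pair_distance(m, 1, 4))

# Outside the pair the total grows.
print(5, pair_distance(5, 1, 4))

# With 2 added back in, only the distance to 2 varies, so 2 is the best spot.
best = min((1, 1.5, 2, 3, 4), key=lambda m: pair_distance(m, 1, 4) + abs(m - 2))
print(best)
```

Every position from 1 to 4 gives a pair distance of 3, position 5 gives 5, and once the point 2 is included the best of the sampled positions is 2.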

[edit: formatting came out a bit ugly - monospace sections seem to eat multiple spaces when displayed but not in the editor for some reason?]

I'd like to take a point that I think is implicitly being made, and state it explicitly.

Holding on to the curiosity and pursuing a deeper understanding of things takes time, and the benefit isn't always worth that cost. There are some situations where it is worth the cost, and there are others where it isn't.

Furthermore, there are some particular types of situations where people tend to give up too easily when they should be taking the time to ask questions and pursue a deeper understanding. The domain of education tends to be one.

I find this post valuable as a reminder to not get lazy and consider holding on to the curiosity more often, but I'd find it more valuable if it explicitly stated some of the common places where people get it wrong by giving up too early. Without those examples, my takeaway is "remember that holding on to the curiosity is often a good idea", but with some more concrete examples my takeaway would be, "X, Y and Z are contexts where people often give up too early when they should be holding on to the curiosity, be sure to keep an eye out for those contexts".

Section 2 is very important and there's more there. I would highly recommend performing the section 2 heuristic on the section 2 phenomena.

Do you have more words to say on what more is there?

Less valuable than trying is my guess. I don't mean some heroic effort, I mean like a Yoda timer. If you do try it and it *doesn't* work that would be an interesting data point for me.
