The Geometric Expectation

Scott Garrabrant

A Suspicious Pattern

There is a pattern that shows up in many of the toys we like to play with around here: the pattern of maximizing the expected logarithm.

Nash bargaining is a method for aggregating preferences without a means to directly compare them. When Nash bargaining, you are maximizing the expected logarithm of utility, where the expectation is over uncertainty about which person you are.

Kelly betting is an extremely useful tool for not putting all your future wealth in one basket. When Kelly betting, you are maximizing the expected logarithm of your wealth.

The log scoring rule is a very natural way to extract beliefs. When maximizing your log score, you are maximizing the expectation of the logarithm of the probability you assign to the right answer. This is one example of a general pattern. Maximizations of expected logarithms show up all over information theory, often phrased as minimizing the negative of the expected logarithm.

Why does maximization of the expected logarithm keep showing up?

One answer is that all of the instances of it showing up are actually related. In my previous two posts, I made some connections between Nash bargaining and Kelly betting. The fact that Kelly betting can be used to model Bayesian updating illustrates its relationship with the information theory applications. To a certain extent, there is really only one instance of this pattern.

However, I think that there is another argument for why you should expect this pattern to show up a lot, which is that the pattern is very simple. More simple than it looks on the surface. It only looks complicated because mathematicians have failed us.

The Geometric Integral

One of the most underrated concepts in mathematics is the geometric integral, given by $\prod_{a}^{b} f(x)^{dx} = e^{\int_{a}^{b} \ln f(x) \, dx}$ . (The fact that I couldn't easily get a latex symbol that looks like an elongated P is a testament to its underratedness.) The geometric integral is just like the standard integral, but everywhere you would add, you multiply instead. Defining it in terms of the standard (arithmetic) integral with logs and exponents is insulting to its nature, and I don't recommend thinking of it that way. (You wouldn't define $x \times y$ as $e^{\ln (x) + \ln (y)}$ .) Instead, you should just think of it as the multiplicative version of the integral. However, using logs and exponentiation is the fastest way to get the definition across.

I think people don't practice thinking multiplicatively enough, which causes them to throw inherently multiplicative things into logarithms, so they can think about them additively.

I will use the phrase geometric expectation when I take a geometric integral over a probability distribution, and I will use the symbol $G$ . Thus, we will write $G_{x \sim P} f(x) = e^{E_{x \sim P} \ln f(x)}$ .

Discrete Geometric Expectations

Luckily, most of the time, we will want to talk about discrete geometric expectations, where we can use (possibly infinite) sums rather than integrals and (possibly infinite) products rather than geometric integrals.

Let us gain some intuition for discrete geometric expectations by going through some simple cases. We will start with a uniform distribution on a finite set.

Let $X = \{x_{1}, \dots, x_{n}\}$ be a finite set with $n$ elements. Let $f : X \to \mathbb{R}^{\geq 0}$ be a function that assigns a nonnegative value to each $x_{i}$ . Let $P$ be the uniform probability distribution on $X$ that assigns probability $\frac{1}{n}$ to each element of $X$ .

We have that $E_{x \sim P} f (x) = \sum_{x \in X} P (x) f (x) = \sum_{i = 1}^{n} \frac{f (x_{i})}{n} = \frac{f (x_{1}) + \dots + f (x_{n})}{n}$ . This is just the average, or arithmetic mean of the $f$ values.

We can compute $G_{x \sim P} f(x)$ using the above formula $G_{x \sim P} f(x) = e^{E_{x \sim P} \ln f(x)}$ . Here, we get

$G_{x \sim P} f(x) = e^{E_{x \sim P} \ln f(x)} = e^{\frac{\ln f(x_{1}) + \dots + \ln f(x_{n})}{n}} = \sqrt[n]{e^{\ln f(x_{1})} \cdots e^{\ln f(x_{n})}} = \sqrt[n]{f(x_{1}) \cdots f(x_{n})}$ .

Thus, the geometric expectation of the uniform distribution is just the geometric mean of the $f$ values. Hence the name.

The infinite non-uniform discrete case is not much more difficult. If $X$ is a finite or countably infinite set, $f : X \to \mathbb{R}^{\geq 0}$ assigns a nonnegative value to each $x \in X$ , and $P$ is a probability distribution on $X$ , then $E_{x \sim P} f(x) = \sum_{x \in X} P(x) f(x)$ , and

$G_{x \sim P} f(x) = e^{E_{x \sim P} \ln f(x)} = e^{\sum_{x \in X} P(x) \ln f(x)} = \prod_{x \in X} e^{P(x) \ln f(x)} = \prod_{x \in X} f(x)^{P(x)}$ .

These two values can be thought of as a weighted arithmetic mean and weighted geometric mean respectively.

When taking the geometric expectation of $f$ with respect to $P$ , you just take the product over all $x \in X$ of $f (x)^{P (x)}$ . You are multiplying together all the $f$ values, but the exponent $P (x)$ is saying that values with less probability get less weight (or less "power").
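To make the discrete formulas concrete, here is a minimal Python sketch. The function names and the example distributions are my own illustrative choices, not anything from a library; the code simply computes $\prod_{x} f(x)^{P(x)}$ directly, and again via $e^{E \ln f}$, so the two can be compared.

```python
import math

def arithmetic_expectation(f, dist):
    """Weighted arithmetic mean: sum over x of P(x) * f(x)."""
    return sum(p * f(x) for x, p in dist.items())

def geometric_expectation(f, dist):
    """Weighted geometric mean: product over x of f(x) ** P(x)."""
    return math.prod(f(x) ** p for x, p in dist.items())

def geometric_expectation_via_logs(f, dist):
    """The same quantity written as exp(E[ln f]); needs f(x) > 0."""
    return math.exp(sum(p * math.log(f(x)) for x, p in dist.items()))

def identity(x):
    return x

# Uniform distribution on {1, 4, 16}: the geometric expectation is the
# ordinary geometric mean, the cube root of 1 * 4 * 16 = 64.
uniform = {1: 1 / 3, 4: 1 / 3, 16: 1 / 3}
uniform_arithmetic = arithmetic_expectation(identity, uniform)  # about 7
uniform_geometric = geometric_expectation(identity, uniform)    # about 4

# Non-uniform distribution: values with less probability get less "power".
skewed = {1: 0.5, 4: 0.25, 16: 0.25}
skewed_geometric = geometric_expectation(identity, skewed)      # 4**0.25 * 16**0.25
skewed_via_logs = geometric_expectation_via_logs(identity, skewed)

print(uniform_arithmetic, uniform_geometric, skewed_geometric, skewed_via_logs)
```

The direct product form also handles $f(x) = 0$ without trouble (the result is just 0), whereas the log form would need a special case for $\ln 0$.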

Maximizing the Geometric Expectation

Maximization is invariant under applying a monotonically increasing function. Thus $\operatorname{argmax}_{y \in Y} E_{x \sim P} \ln (f(x, y)) = \operatorname{argmax}_{y \in Y} e^{E_{x \sim P} \ln (f(x, y))} = \operatorname{argmax}_{y \in Y} G_{x \sim P} f(x, y)$ .

So every time we maximize an expectation of a logarithm, this is equivalent to just maximizing the geometric expectation.
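As a small illustration of this equivalence, the sketch below uses a Kelly-style bet. The setup is hypothetical: an even-money bet won with probability 0.6, a bettor who stakes a fixed fraction of wealth, and a simple grid search standing in for the argmax. It checks that maximizing the expected log of wealth and maximizing the geometric expectation of wealth pick the same fraction.

```python
import math

P_WIN = 0.6  # hypothetical win probability for an even-money bet

def expected_log_wealth(fraction):
    """E[ln(wealth multiplier)] when staking `fraction` of current wealth."""
    return P_WIN * math.log(1 + fraction) + (1 - P_WIN) * math.log(1 - fraction)

def geometric_expected_wealth(fraction):
    """G[wealth multiplier]: product of outcomes raised to their probabilities."""
    return (1 + fraction) ** P_WIN * (1 - fraction) ** (1 - P_WIN)

# Grid over 0.000 .. 0.999; stopping short of 1 avoids log(0).
fractions = [i / 1000 for i in range(1000)]

best_by_log = max(fractions, key=expected_log_wealth)
best_by_geometric = max(fractions, key=geometric_expected_wealth)

# For an even-money bet the Kelly fraction is 2 * P_WIN - 1 = 0.2.
print(best_by_log, best_by_geometric)
```

Both searches land on the same fraction because the exponential is strictly increasing, so it cannot reorder the candidates.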

Rather than saying "maximize the geometric expectation", I will just say "geometrically maximize". For example, when Kelly betting, we are just geometrically maximizing wealth. Note that the unit on the geometric expectation of wealth is dollars. The unit on the expected logarithm of dollars is... confusing? It is log dollars, but like, you add it instead of multiplying? I don't know how it works. What even is a log dollar?

The geometric expectation just makes more sense than the expected logarithm. It is a real thing with a real meaning. However, when we put the geometric expectation inside of a maximization, and we don't naturally think in terms of geometric expectations, we are tempted to take a logarithm of the whole thing, (which we can do because the maximization eats the monotonic function), and end up with maximizing the expected logarithm.

Geometric Rationality

When Kelly betting, you are really just geometrically maximizing wealth.

When Nash bargaining, you are really just geometrically maximizing expected utility with respect to your uncertainty about your identity. In defense of Nash bargaining, it is normally presented as maximizing the product of the utilities. However, if you don't already have the concept of geometric expectation, it is tempting to convert it to an expected logarithm so you can handle the weighted case and think of it as being about uncertainty behind the veil of ignorance. (Also, it is more like the square root of the product of the utilities rather than the product of the utilities.)

When maximizing log score, you are really just geometrically maximizing the probability you assign your observation.
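Here is a rough sketch of the log-score case, under assumptions of my own: a binary event with a hypothetical true probability of 0.7, and a grid of candidate reports. The quantity being maximized is the geometric expectation of the probability you assigned to what actually happened, and the maximizer turns out to be the honest report.

```python
TRUE_P = 0.7  # hypothetical true probability of the event

def geometric_expected_assigned_probability(report):
    """G over outcomes of the probability assigned to the observed outcome.

    With probability TRUE_P the event happens and we had assigned `report`;
    otherwise it does not happen and we had assigned 1 - report.
    """
    return report ** TRUE_P * (1 - report) ** (1 - TRUE_P)

# Candidate reports 0.01 .. 0.99.
candidates = [i / 100 for i in range(1, 100)]
best_report = max(candidates, key=geometric_expected_assigned_probability)

print(best_report)  # the grid point at the true probability
```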

I will informally use the phrase "geometric rationality" to refer to techniques that tend to geometrically maximize natural features (of the world or the self). I want to raise to attention the hypothesis that humans are evolved to be naturally inclined towards geometric rationality over arithmetic rationality, and that around here, the local memes have moved us too far off this path.

A video on the geometric derivative by the ever excellent Michael Penn:

Edit:
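The geometric derivative is the instantaneous exponential growth rate, i.e. $f^{*}(x) = \lim_{h \to 0} \left( \frac{f(x + h)}{f(x)} \right)^{1 / h}$ , where $f^{*}(x)$ is the geometric derivative.

Which is equivalent to $f^{*}(x) = e^{\frac{d}{dx} \ln f(x)} = e^{f'(x) / f(x)}$ .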

And if I pushed around symbols correctly, the geometric derivative can be pulled inside of a geometric expectation ( $\nabla_{θ}^{*} G_{x \sim P (x)} [f_{θ} (x)] = G_{x \sim P (x)} [\nabla_{θ}^{*} f_{θ} (x)]$ ) similarly to how an additive derivative can be pulled inside an additive expectation ( $\nabla_{θ} E_{x \sim P (x)} [f_{θ} (x)] = E_{x \sim P (x)} [\nabla_{θ} f_{θ} (x)]$ ). Also, just as additive expectation distributes over addition ( $E [f (x) + g (x)] = E [f (x)] + E [g (x)]$ ), geometric expectation distributes over multiplication ( $G [f (x) g (x)] = G [f (x)] G [g (x)]$ ).
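The multiplicative identity is easy to check numerically. A minimal sketch follows, with an arbitrary three-point distribution and two positive functions chosen purely for illustration:

```python
import math

def arithmetic_expectation(f, dist):
    return sum(p * f(x) for x, p in dist.items())

def geometric_expectation(f, dist):
    return math.prod(f(x) ** p for x, p in dist.items())

# Hypothetical distribution and positive functions, chosen only for illustration.
dist = {1: 0.2, 2: 0.5, 3: 0.3}

def f(x):
    return x + 1

def g(x):
    return x * x

# E distributes over addition.
sum_lhs = arithmetic_expectation(lambda x: f(x) + g(x), dist)
sum_rhs = arithmetic_expectation(f, dist) + arithmetic_expectation(g, dist)

# G distributes over multiplication.
product_lhs = geometric_expectation(lambda x: f(x) * g(x), dist)
product_rhs = geometric_expectation(f, dist) * geometric_expectation(g, dist)

print(sum_lhs, sum_rhs, product_lhs, product_rhs)
```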

I think what is going on here is that both $\nabla^{*}$ and $G$ are of the form $(e^{\land}) \circ g \circ ln$ with $g = \nabla$ and $g = E$ , respectively. Let's define the star operator as $g^{*} = (e^{\land}) \circ g \circ ln$ . Then $(f \circ g)^{*} = (e^{\land}) \circ (f \circ g) \circ ln = (e^{\land}) \circ f \circ ln \circ (e^{\land}) \circ g \circ ln = f^{*} \circ g^{*}$ , by associativity of function composition and the fact that $ln \circ (e^{\land})$ is the identity. Further, if $f$ and $g$ commute, then so do $f^{*}$ and $g^{*}$ : $g^{*} \circ f^{*} = (g \circ f)^{*} = (f \circ g)^{*} = f^{*} \circ g^{*} .$

So the commutativity of the geometric expectation and derivative falls directly out of their representation as $E^{*}$ and $\nabla^{*}$ , respectively, by commutativity of $E$ and $\nabla$ , as long as they are over different variables.
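A numerical sanity check of this commutation, as a sketch only: the family $f_{θ}(x) = 1 + θ x^{2}$, the distribution, and the central-difference estimate of the geometric derivative are all assumptions I picked for illustration. The distribution does not depend on $θ$, which matches the "different variables" case.

```python
import math

def geometric_expectation(f, dist):
    return math.prod(f(x) ** p for x, p in dist.items())

def geometric_derivative(h, theta, eps=1e-5):
    """Central-difference estimate of exp(d/dtheta ln h(theta))."""
    return (h(theta + eps) / h(theta - eps)) ** (1 / (2 * eps))

def f(theta, x):
    # Hypothetical positive family f_theta(x) = 1 + theta * x^2.
    return 1 + theta * x * x

dist = {1: 0.2, 2: 0.5, 3: 0.3}  # does not depend on theta
theta = 0.7

# Geometric derivative taken outside the geometric expectation.
derivative_outside = geometric_derivative(
    lambda t: geometric_expectation(lambda x: f(t, x), dist), theta
)

# Geometric derivative taken inside the geometric expectation.
derivative_inside = geometric_expectation(
    lambda x: geometric_derivative(lambda t: f(t, x), theta), dist
)

# Closed form for this family: exp(E[x^2 / (1 + theta * x^2)]).
analytic = math.exp(sum(p * x * x / (1 + theta * x * x) for x, p in dist.items()))

print(derivative_outside, derivative_inside, analytic)
```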

We can also derive what happens when the expectation and gradient are over the same variables: $(\nabla_{θ} \circ E_{x \sim P_{θ} (x)})^{*}$ . First, notice that $(* k)^{*} (x) = e^{k * ln x} = e^{ln x * k} = x^{k}$ , so $(* k)^{*} = (^{\land} k)$ . Also $(+ k)^{*} (x) = e^{k + ln (x)} = e^{k} e^{ln (x)} = x e^{k} ⟹ (+ k)^{*} = (* e^{k})$ .
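The star operator is simple enough to write as a higher-order function, which makes these two identities (and the composition rule) easy to spot-check. This is only a sketch for scalar functions on positive inputs; the helper names are mine.

```python
import math

def star(g):
    """Return g* = exp . g . ln, defined for positive inputs."""
    return lambda x: math.exp(g(math.log(x)))

def compose(outer, inner):
    return lambda y: outer(inner(y))

k = 3.0
x = 2.5

def times_k(y):
    return k * y

def plus_k(y):
    return y + k

# (*k)* should be "raise to the power k".
star_times = star(times_k)(x)
power = x ** k

# (+k)* should be "multiply by e^k".
star_plus = star(plus_k)(x)
scaled = x * math.exp(k)

# (f . g)* should equal f* . g*.
star_of_composition = star(compose(times_k, plus_k))(x)
composition_of_stars = compose(star(times_k), star(plus_k))(x)

print(star_times, power, star_plus, scaled)
```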

Now let's expand the composition of the gradient and expectation. $(\nabla_{θ} \circ E_{x \sim P_{θ} (x)}) (f (x)) = \nabla_{θ} \int P_{θ} (x) f (x) d x = E_{x \sim P_{θ} (x)} [\nabla_{θ} (f (x) ln P_{θ} (x))]$ , using the log-derivative trick. So $\nabla_{θ} \circ E_{x \sim P_{θ} (x)} = E_{x \sim P_{θ} (x)} \circ \nabla_{θ} \circ (* ln P_{θ} (x))$ .

Therefore, $\nabla_{θ}^{*} \circ G_{x \sim P_{θ} (x)} = (\nabla_{θ} \circ E_{x \sim P_{θ} (x)})^{*}$ $= E_{x \sim P_{θ} (x)}^{*} \circ \nabla_{θ}^{*} \circ (* ln P_{θ} (x))^{*}$ $= G_{x \sim P_{θ}} \circ \nabla_{θ}^{*} \circ (^{\land} ln P_{θ})$ .

Writing it out, we have $\nabla_{θ}^{*} G_{x \sim P_{θ} (x)} [f (x)] = G_{x \sim P_{θ} (x)} [\nabla_{θ}^{*} (f (x)^{ln P_{θ} (x)})]$ .

This entire series and especially this post are excellent, thanks :)

Thanks for the post -- I've been having thoughts in this general direction and found this post helpful. I'm somewhat drawn to geometric rationality because it gives more intuitive answers in thought experiments involving low probabilities of extreme outcomes, such as Pascal's mugging. I also agree with your claim that "humans are evolved to be naturally inclined towards geometric rationality over arithmetic rationality."

On the other hand, it seems like geometric rationality only makes sense in the context of natural features that cannot take on negative values. Most of the things I might want to maximize (e.g. utility) can be negative. Do you have thoughts on the extent to which we can salvage geometric rationality from this problem?

But if your utility function is bounded, as it apparently should be, then you're one affine transform away from being able to use geometric rationality, no?

How much should you shift things by? The geometric argmax will depend on the additive constant.
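A tiny example of that dependence, with made-up numbers: a safe option worth 3 for certain and a risky 50/50 option paying 1 or 6. Adding a constant to every payoff never changes the arithmetic ranking, but it can flip the geometric one.

```python
import math

def arithmetic_expectation(outcomes):
    return sum(p * v for v, p in outcomes)

def geometric_expectation(outcomes):
    return math.prod(v ** p for v, p in outcomes)

# Hypothetical options, as (value, probability) pairs.
safe = [(3.0, 1.0)]
risky = [(1.0, 0.5), (6.0, 0.5)]

def shift(outcomes, constant):
    """Add the same constant to every payoff."""
    return [(v + constant, p) for v, p in outcomes]

def preferred(expectation, constant):
    safe_value = expectation(shift(safe, constant))
    risky_value = expectation(shift(risky, constant))
    return "risky" if risky_value > safe_value else "safe"

arithmetic_choices = [preferred(arithmetic_expectation, c) for c in (0.0, 100.0)]
geometric_choices = [preferred(geometric_expectation, c) for c in (0.0, 100.0)]

print(arithmetic_choices)  # same choice at both shifts
print(geometric_choices)   # choice changes with the shift
```

With no shift the geometric expectation of the risky option is $\sqrt{6} < 3$, so the safe option wins; after adding 100 it is $\sqrt{101 \cdot 106} > 103$, so the risky option wins.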

If arithmetic and geometric means are so good, why not the harmonic mean? https://en.wikipedia.org/wiki/Pythagorean_means. What would a "harmonic rationality" look like?

I can answer this now!

Expected Utility, Geometric Utility, and Other Equivalent Representations

It turns out there is a large family of expectations we can use to build utility functions, including the arithmetic expectation $E$ , the geometric expectation $G$ , and the harmonic expectation $H$ , and they're all equivalent models of VNM rationality! And we need something beyond that family like Scott's $G [E [U]]$ to formalize geometric rationality.

Thank you for linking to these different families of means! The quasi-arithmetic mean turned out to be exactly what I needed for this result.

Very interesting! I'm excited to read your post.

Also here is a nice family that parametrizes these different kinds of average (https://m.youtube.com/watch?v=3r1t9Pf1Ffk)

Actually maybe this family is more relevant:
https://en.wikipedia.org/wiki/Generalized_mean, where the geometric mean is the limit as we approach zero.

The "harmonic integral" would be the inverse of the integral of the inverse of a function -- https://math.stackexchange.com/questions/2408012/harmonic-integral
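For reference, the weighted generalized (power) mean mentioned in these comments can be sketched in a few lines. The function below is an illustration with my own naming; it treats exponent 0 as the geometric mean and shows numerically that small exponents approach it, with exponent 1 giving the arithmetic mean and exponent -1 the harmonic mean.

```python
import math

def power_mean(values, weights, exponent):
    """Weighted power mean of positive values; exponent 0 means geometric mean."""
    if exponent == 0:
        return math.prod(v ** w for v, w in zip(values, weights))
    total = sum(w * v ** exponent for v, w in zip(values, weights))
    return total ** (1 / exponent)

values = [1.0, 4.0, 16.0]
weights = [1 / 3, 1 / 3, 1 / 3]

arithmetic = power_mean(values, weights, 1)    # about 7
geometric = power_mean(values, weights, 0)     # about 4
harmonic = power_mean(values, weights, -1)     # 3 / (1 + 1/4 + 1/16)
near_zero = power_mean(values, weights, 1e-6)  # close to the geometric mean

print(harmonic, geometric, near_zero, arithmetic)
```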

Some results related to logarithmic utility and stock market leverage (I derived these after reading your previous post, but I think it fits better here):

Tl;dr: We can derive the optimal stock market leverage for an agent with utility logarithmic in money. We can also back-derive a utility function from any constant leverage^[1], giving us a nice class of utility functions with different levels of risk-aversion. Logarithmic utility is recovered as a special case, and has additional nice properties which the others may or may not have.

For an agent investing in a stock whose "instantaneous" price movements are i.i.d. with finite moments:

Suppose, for simplicity, that the agent's utility function is over the amount of money they have in the next timestep. (As opposed to more realistic cases like "amount they have 20 years from now".)
- If $U (x) = \ln (x)$ , then:
  - The optimal leverage for the agent to take is given by the formula $L = m / (2 s^{2})$ , where $m = E [\text{returnPerTimestep} - \text{riskFreeReturnPerTimestep}]$ and $s$ is the standard deviation of the same. Derivation here. By my calculations, this implies a leverage of about 1.8 on the S&P 500.
- What if we instead suppose the agent prefers some constant leverage $L = m / (2 c s^{2})$ , and try to infer its utility function?
  - The relevant differential equation is $x U^{''} (x) = - c U^{'} (x)$
  - This is solved by $U (x) = 1 - x^{1 - c}$ for $c \neq 1$ and $U (x) = \ln (x)$ for $c = 1$ . You can play with the solutions here.
Now suppose instead that the agent's utility function is "logarithmic withdrawals, time-discounted exponentially" -- $U = \int_{t = 0}^{\infty} \ln (w (t)) e^{γ t}$ , where $w (t)$ is the absolute^[2] rate of withdrawal at time $t$ . It turns out that optimal leverage is still constant, and is still given by the same formula $L = m / (2 s^{2})$ . Furthermore, the optimal rate of withdrawal is a constant $w (t) = 1 - γ$ , regardless of what happens.
- Things probably don't work out as cleanly for the non-logarithmic case.
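The claimed solutions of the differential equation can be spot-checked numerically. The sketch below only verifies that $U(x) = 1 - x^{1 - c}$ and $U(x) = \ln(x)$ satisfy $x U''(x) = -c U'(x)$, using central finite differences at a test point and a few values of $c$ that I picked arbitrarily; it does not check the leverage formula itself.

```python
import math

def utility(x, c):
    """Candidate solution: ln(x) when c == 1, else 1 - x**(1 - c)."""
    if c == 1:
        return math.log(x)
    return 1 - x ** (1 - c)

def ode_residual(x, c, h=1e-3):
    """Finite-difference estimate of x * U''(x) + c * U'(x); should be near 0."""
    u_minus = utility(x - h, c)
    u_mid = utility(x, c)
    u_plus = utility(x + h, c)
    first_derivative = (u_plus - u_minus) / (2 * h)
    second_derivative = (u_plus - 2 * u_mid + u_minus) / (h * h)
    return x * second_derivative + c * first_derivative

# Arbitrary test point and risk-aversion parameters.
residuals = {c: ode_residual(2.0, c) for c in (0.5, 1, 2.0, 3.0)}

for c, residual in residuals.items():
    print(c, residual)
```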

[Disclaimer: This is not investment advice.]

^[1] Caveats:
1. This assumption of constant leverage is pretty arbitrary, so there's no normative or descriptive force to the class of utility functions we derive from it
2. We have to make an unrealistic assumption that the utility function is over dollars at the next timestep, rather than further in the future. In the log case, these kinds of assumptions tend to not change anything, but I'm not sure whether the general case is as clean.

^[2] i.e. in dollars, not percents

X

extreme nit, you probably meant for this to be lowercase. I love this series!

I was wondering if is anything. I don't recognize $\frac{1}{Π_{k} p_{k}^{p_{k}}}$ , though.
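One observation that may help here, offered as a side note rather than as what the commenter had in mind: $\frac{1}{\prod_{k} p_{k}^{p_{k}}} = e^{- \sum_{k} p_{k} \ln p_{k}}$ , which is the exponential of the Shannon entropy (in nats), a quantity often called perplexity. Equivalently, it is the geometric expectation of $1 / P(x)$ under $P$ . A quick numerical check, with illustrative function names:

```python
import math

def inverse_self_weighted_product(probs):
    """1 / prod_k p_k ** p_k, skipping zero-probability outcomes."""
    return 1 / math.prod(p ** p for p in probs if p > 0)

def shannon_entropy_nats(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

probs = [0.5, 0.25, 0.25]
value = inverse_self_weighted_product(probs)
perplexity = math.exp(shannon_entropy_nats(probs))

# For a uniform distribution on n outcomes this is just n.
uniform_value = inverse_self_weighted_product([0.25] * 4)

print(value, perplexity, uniform_value)
```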

it's not intuitive to me when it's reasonable to apply geometric rationality in an arbitrary context.

e.g. if i offered you a coin flip where i give you $0.01 with p=50%, and $100 with q=50%, i get G = $\sqrt{0.01} \sqrt{100}$ = $1, which like, obviously you would go bankrupt really fast valuing things this way.

in kelly logic, i'm instead supposed to take the geometric average of my entire wealth in each scenario, so if i start with $1000, I'm supposed to take $\sqrt{1000.01} \sqrt{1100}$ = $1048.81, which does the nice, intuitive thing of penalizing me a little vs. linear expectation for the added volatility.

but... what's the actual rule for knowing the first approach is wrong?

Another way of looking at this question: Arithmetic rationality is shift invariant, so you don't have to know your total balance to calculate expected values of bets. Whereas for geometric rationality, you need to know where the zero point is, since it's not shift invariant.

I think the rule is "you maximize your bank account, not the addition to it". I.e. your value of deals depends on how much you already have.
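The two calculations from this exchange can be put side by side. This sketch uses the numbers from the comments above (a coin flip paying $0.01 or $100, and a $1000 starting balance); the variable names are mine.

```python
import math

def geometric_expectation(outcomes):
    return math.prod(v ** p for v, p in outcomes)

payouts = [(0.01, 0.5), (100.0, 0.5)]  # (payout in dollars, probability)
bankroll = 1000.0                      # starting balance from the example

# Geometric expectation of the payout alone (ignores existing wealth).
g_of_payout = geometric_expectation(payouts)

# Geometric expectation of total wealth after the flip.
g_of_wealth = geometric_expectation([(bankroll + v, p) for v, p in payouts])

# What the flip is worth to this bettor, compared with the arithmetic value.
geometric_gain = g_of_wealth - bankroll
arithmetic_gain = sum(v * p for v, p in payouts)

print(g_of_payout, g_of_wealth, geometric_gain, arithmetic_gain)
```

The gain implied by the wealth-based calculation sits a little below the arithmetic value of the flip, and the gap shrinks as the bankroll grows relative to the stakes.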

Way late to the game, but just arrived here through Richard Ngo's recent Substack post and figured I might just mention that this is identical to the Ergodicity Economics program being developed by Ole Peters and Alex Adamou since about 2011. Probably worth checking out for some cross-pollination!

The infinite non-uniform discrete case is not much more difficult. If $X$ is a finite or countably infinite set, $f : X \to \mathbb{R}^{\geq 0}$ assigns a nonnegative value to each $x \in X$ , and $P$ is a probability distribution on $Y$ , then $E_{x \sim P} f (x) = \sum_{x \in X} P (x) f (x)$

Very minor, but shouldn't this read " $P$ is a probability distribution on $X$ " not $Y$ ?