
That's a very confused post.

Let's start with the obvious -- the Sharpe ratio is not a test statistic (and does not lead to a p-value). To see this, note that the calculation of a test statistic involves a very important input -- the number of observations. Your denominator is not volatility, it's volatility divided by the square root of n. The Sharpe value does not care about the number of observations at all. So statements like

Each (arbitrary and useless) p-value cutoff corresponds to a test statistic cutoff which translates to a Sharpe Ratio cutoff above which an investment strategy is “significant” enough to brag about in front of a finance professor

are just figments of your misinformed imagination. In reality, having a strategy with a Sharpe of, say, 1 is very, very good (the Sharpe ratio of the S&P 500 this century is somewhere around 0.25).

In general, people in finance don't care about p < 0.05 (even when it's statistically "valid") because that threshold assumes a stable and unchanging underlying process, which in the financial markets is very much not the case.

A test statistic involves dividing some measurement by the standard deviation of that same measurement.

When you calculate a p-value for a number of observations, you commonly take the average result and the standard deviation of the average, which equals the standard deviation of a single measurement divided by the square root of n. You can also calculate a test statistic and a p-value for a single observation, in which case you have just one result and one standard deviation.

This is what you do when calculating the Sharpe Ratio: you divide excess return over some period (usually a year) by the standard deviation of returns for that exact period. You can also interpret it in p-value language: if the S&P 500 has a (yearly) Sharpe Ratio of 0.25, that corresponds to a p-value of 0.4 if we assume that everything is distributed normally: 1 - NORM.DIST(0.25, 0, 1, 1) = 0.4. This can be interpreted to mean that a leveraged portfolio of the risk-free asset (which has a Sharpe Ratio of 0) has a 40% chance of outperforming the S&P 500 in a given year. The Sharpe Ratios in finance are low (because the markets are pretty efficient), but the math is the same.
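For anyone who wants to check the arithmetic, here is the same calculation in Python (a minimal sketch; scipy's standard normal CDF stands in for the spreadsheet's NORM.DIST):

```python
# Reproduces the spreadsheet calculation above. The 0.25 is the
# S&P 500 Sharpe figure quoted upthread.
from scipy.stats import norm

sharpe = 0.25
p = 1 - norm.cdf(sharpe)  # same as 1 - NORM.DIST(0.25, 0, 1, 1)
print(round(p, 2))        # 0.4
```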

And if you think something is wrong with the math, I encourage you to discuss the math without talking about my "misinformed imagination".

Sigh. Advice: first think, then type.

Emphasis yours:

When you calculate a p-value for a number of observations, you commonly take the average result and the standard deviation of the average

Emphasis mine:

This is what you do when calculating the Sharpe Ratio: you divide excess return over some period (usually a year) by the standard deviation of returns for that exact period.

Note: not the standard deviation of the AVERAGE, but the standard deviation of the RETURNS. These are different things.

You are also confusing annualization with estimating the volatility of the mean.

This can be interpreted to mean that a leveraged portfolio of the risk-free asset (which has a Sharpe Ratio of 0) has a 40% chance of outperforming the S&P 500 in a given year.

That's not what test statistics measure.

[Executive summary: Jacob is getting right some things that at first sight Lumifer accuses him of getting wrong. But I think the actual point Lumifer is making is correct and Jacob is confused. Even so, the main point of Jacob's article isn't altogether invalidated by this, though considerations like these weaken it.]

Part One, in which Jacob is Right*

Let's have a concrete example. Consider an asset that, over a period of one year, has expected return 0.1 (in excess of the risk-free rate) and s.d. of return 0.01. Its (ex ante) Sharpe ratio will be 0.1/0.01 = 10.

Digression that's actually relevant: The Sharpe ratio depends on the time period. If you define returns as log(new/old) rather than new/old - 1, or if the returns are small enough for linear approximations to be good, then (e.g.) over half a year the return will be half as much and its variance will be half as much, so its s.d. will be 1/sqrt(2) as much, and the Sharpe ratio will be smaller by a factor of sqrt(2). I think it's usual to do the obvious adjustment and report Sharpe ratios for a notional period of one year, even if they are actually derived from measurements over a different period.
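A toy check of that scaling argument, under the same everything-is-Gaussian assumption (the numbers are made up for illustration):

```python
import numpy as np

mu_annual, sd_annual = 0.10, 0.20      # hypothetical annual log-return moments
sharpe_annual = mu_annual / sd_annual   # 0.5

# Over half a year the mean halves and the variance halves,
# so the s.d. shrinks by sqrt(2):
sharpe_half_year = (mu_annual / 2) / (sd_annual / np.sqrt(2))
print(sharpe_annual / sharpe_half_year)  # sqrt(2) ~= 1.414
```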

If you know the standard deviation of the returns, make the usual everything-is-Gaussian assumptions, and do classical hypothesis testing with null hypothesis "this asset grows on average at the risk-free rate" then the z-score you compute will in fact be (return - risk-free return) / stddev(return) which is exactly the Sharpe ratio.

So far, it seems, so good for Jacob. What about Lumifer's point about observations?

To see this, note that the calculation of a test statistic involves a very important input -- the number of observations. Your denominator is not volatility, it's volatility divided by the square root of n. The Sharpe value does not care about the number of observations at all.

You can compute a z-score from one observation. (In the case above, an "observation" of your asset's return.) If you have lots of observations then indeed you end up dividing by sqrt(n). Suppose you are computing an asset's annualized Sharpe ratio from 365 daily returns. Then you want mean annual return / stddev of annual return = (365 * mean daily return) / (sqrt(365) * stddev of daily return) = mean daily return / [stddev of daily return / sqrt(365)], which is exactly the "divided by the square root of n" calculation Lumifer mentioned.
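In code, the two ways of writing that calculation give literally the same number (a sketch with synthetic data; the return parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
daily = rng.normal(0.0004, 0.01, size=365)  # one year of daily excess returns

annualized_sharpe = (365 * daily.mean()) / (np.sqrt(365) * daily.std(ddof=1))
divide_by_sqrt_n = daily.mean() / (daily.std(ddof=1) / np.sqrt(365))
print(np.isclose(annualized_sharpe, divide_by_sqrt_n))  # True -- identical
```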

Part Two, in which Lumifer turns the Tables

HOWEVER, here is where I think Jacob goes wrong and I agree with Lumifer. Everything above supposes that the amount of data you are basing your Sharpe calculation on agrees with the time period you're normalizing it for. But there's no reason why it should.

Suppose we have two years' data for that asset, and suppose that over each year it happens to give an average return of 0.1 and an s.d. of returns of 0.01. The Sharpe ratio of the asset hasn't changed from what I described before. But now your p-value is completely different: you have twice as much evidence for an effect of the same size, and your actual z-score will be sqrt(2) times bigger and the p-value correspondingly smaller.
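Numerically, using the toy asset from Part One (a sketch; norm.sf is the one-sided tail probability):

```python
import numpy as np
from scipy.stats import norm

mu, sd = 0.1, 0.01  # annual excess return and s.d. of annual returns

for years in (1, 2):
    sharpe = mu / sd                 # 10 either way: blind to the sample size
    z = mu / (sd / np.sqrt(years))   # grows by sqrt(2) with the second year
    print(years, sharpe, round(z, 2), norm.sf(z))
```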

The point here is that there are two separate kinds of normalization you might want to do. If you are interested in the best estimate you can get of the size of your effect (in this case, your excess returns) then you want to normalize according to the time period you care about. If you are interested in how reliable your estimate of that effect is, then you want to normalize according to the amount of data you've got. The first kind of normalization gives you an annualized Sharpe ratio. The second kind gives you a p-value.

Part Three, in which I offer Jacob some qualified support

What's actually going on in Jacob's article? He tells a story of how someone -- presumably working at an investment bank or hedge fund or whatever -- found something he was confident was beating the market; but on seeing what it was, it became apparent that he was engaged in data-mining and his result was probably less significant than he thought it was.

Nothing in that paragraph changes merely because Sharpe ratios are not really z-scores. Our analyst presumably thought he was presenting not mere numerological coincidences but evidence for a pattern that one might hope would continue to apply in the future; setting aside the fact that unfortunately markets tend to destroy patterns simple enough to be found (since after all the way they destroy them is by people making money out of them for a while), our confidence that the analyst's result isn't just coincidence has to be reduced by the fact that the analyst was presumably computing lots of other comparable statistics and picking the best one. The multiplicity problem is real, even if the details of Jacob's description of it could be improved.

But ... suppose you are an analyst at, let's say, a hedge fund. You consider a hundred models that might or might not do better than chance. The best of them produces results that would be clearly significant if it weren't for the data-mining problem, but that in the light of that problem aren't significant. Do you sigh and stop trading? Nope, you probably use that model (or something like it) and hope -- because your actual task is to make money, not only to know in advance how much money you're going to make. You shouldn't tell your investors or your boss "we've got this strategy that will definitely perform 20% above the market"; you haven't the evidence for that, and probably nothing you could possibly have done could have given you the evidence for it. But if this strategy maximizes expected utility (which probably means something not entirely unlike Sharpe ratio) then you should pursue it even though the chance that it's not really the best is bigger than 0.05.

(I think Jacob understands this; he's taking aim not at people who use necessarily-unproven strategies in investment, but at people who misrepresent the evidence for their strategies to their clients.)

[EDITED to add: In the above, I pass rather briefly over the fact that, as I put it, "markets tend to destroy patterns simple enough to be found", but for the avoidance of doubt this is a big deal and makes a big difference to how you should interpret any quantity of evidence you think you have for a pattern in what a market is doing.]

A few comments and nitpicks.

The Sharpe ratio depends on the time period.

Yes. It's common to annualize the Sharpe, but already here you need to make assumptions about the properties of your time series, because you need to specify how the volatility scales. In the usual assume-it's-all-Gaussian case the volatility scales with the square root of time.

So, for example, if you have monthly log returns with mean mu and standard deviation sd, your annualized Sharpe would be (12 mu) / (sd sqrt(12)).

You can compute a z-score from one observation.

Nope, you can't, because you cannot get a standard deviation estimate.

there are two separate kinds of normalization you might want to do

I think the confusion is at a deeper level.

The Sharpe ratio is a description of a random variable. No estimation need be involved; it's just a straightforward ratio of two moments of a distribution. If we have full information about that random variable -- we know the population (not sample) values, the error/noise is zero, etc. -- we can produce a meaningful Sharpe ratio easily enough.

Test statistics (like the t-stat and the associated p-value) are not descriptions of random variables. They are metrics of how well the data match your hypothesis. If there is no hypothesis, there is no test statistic. If we have full information, and so know which hypothesis is true and which is not, the concept of a test statistic is meaningless.

Nope, you probably use that model (or something like it) and hope

Heh, nope. Finance people (other than marketers) are very interested in empirical truth because for them the match between the map and the territory directly translates into money. Hope is not a virtue in finance. If your hypothetical analyst suggests that hope is sufficient to replace significance, he'll get laughed at and, if he persists in that belief, fired in short order.

Nope, you can't, because you cannot get a standard deviation estimate.

You might have one from another source. (And have one observation of the mean.) If that actually isn't your situation -- if you're estimating the standard deviation from the same data as the mean -- then a z-score isn't the right thing to be computing anyway.

The Sharpe ratio is a description of a random variable. No estimation need be involved.

My impression is that there is (1) the Sharpe ratio of the underlying random variable, which you do not know, and (2) various estimates of it on the basis of empirical data, which you do know but might be wrong. Actual Sharpe ratios that people report are pretty much always in the second class, no?

Test statistics [...]

Oh yes, I meant to comment on that. One thing that really grates in Jacob's article is his use of the phrase "the test statistic" as if it meant the same thing as the z-score. Jacob, if nothing else you should fix that.

Test statistics [...] are not descriptions of random variables. They are metrics of how well the data match your hypothesis.

Yes. But from a given batch of data you can compute (1) an estimate of the underlying process's Sharpe ratio and (2) a measure of how confidently you can say that that process isn't just noise on top of something behaving like (an idealization of) a Treasury bond. And in the case where the quantity of data you have and the time period you're interested in match up, the fact that you're doing the same calculation in both cases isn't just coincidence. There is a relationship between Sharpe ratio and z-score, just as there is between (say) weight[1] and density; in some circumstances the two consistently vary together.

[1] I do mean weight and not mass, as it happens.

(I agree that the conceptual distinction between these two things is super-important, and that Jacob may be fuzzy about it.)

Finance people [...] are very interested in empirical truth

Oh, yes. But none the less, you have to decide what trades to make somehow, and there is no law (physical or otherwise) that says you get the best results by only ever deciding according to models whose backtesting is significant at the p=0.05 level.

But perhaps in practice operating according to such a rule gives better results than not, by reducing the danger of catastrophe. (Or, for that matter, by making it easier to convince your investors that you're doing something reasonable.) I don't have enough information to know what fraction of trading entities do operate according to such a rule. It seems to me like it doesn't make the right explore/exploit tradeoffs, but I could be wrong.

(For the avoidance of doubt, I was certainly not suggesting that hope is a virtue in finance.)

if you're estimating the standard deviation from the same data as the mean -- then a z-score isn't the right thing to be computing anyway.

Why not? Assuming you're estimating things correctly, what's wrong with taking the ratio? Do you think it will mislead you in some way?

Actual Sharpe ratios that people report are pretty much always in the second class, no?

Yes, of course. And people, as usual, tend to forget about the difference between the sample Sharpe, which you have just calculated from the available data, and the population Sharpe, which you don't actually know and which you need to estimate. The sample Sharpe is an adequate estimator, but you need to realize that it's a mere estimate and there is uncertainty involved. (Calculating the p-value of a Sharpe estimate is left as an exercise for the reader. Keep in mind that it's more complicated than it looks, because the errors in estimating the mean and the errors in estimating the standard deviation are not independent.)
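A small simulation makes the sample-vs-population distinction vivid (a sketch: the daily moments are invented, so the population Sharpe is known by construction):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sd, n = 0.0004, 0.01, 252  # invented true daily moments
true_sharpe = (252 * mu) / (np.sqrt(252) * sd)  # ~0.63, known by construction

samples = []
for _ in range(10_000):
    r = rng.normal(mu, sd, size=n)  # one year of daily data
    samples.append((252 * r.mean()) / (np.sqrt(252) * r.std(ddof=1)))

print(round(true_sharpe, 2),
      round(float(np.mean(samples)), 2),
      round(float(np.std(samples)), 2))  # the estimate's s.d. is roughly 1
```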

And in the case where the quantity of data you have and the time period you're interested in match up

Well, that basically never happens. There is a reason why everyone reports annualized Sharpe ratios -- for the same process the Sharpe ratio is a function of your time period.

For example, upthread there is the example of calculating the annualized Sharpe from monthly log returns, which is (12 mu) / (sd sqrt(12)). What happens if we calculate a biannual (two-year) Sharpe? It's (24 mu) / (sd sqrt(24)) and hey! it looks just like the annual Sharpe, except that it's bigger. There is a multiplier of 2 * sqrt(0.5), or just sqrt(2). Exactly the same process, but because we picked a different time period -- two years instead of one -- the Sharpe changed.
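The multiplier is easy to verify with toy numbers:

```python
import numpy as np

mu, sd = 0.01, 0.05  # hypothetical monthly log-return mean and s.d.
annual = (12 * mu) / (sd * np.sqrt(12))
biannual = (24 * mu) / (sd * np.sqrt(24))
print(biannual / annual)  # sqrt(2) ~= 1.414
```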

you have to decide what trades to make somehow

Yep, but it's a complicated process which involves things like risk management and portfolio optimization, both of which require estimates of risk, which depend on your confidence in your models... it gets quite hairy quite fast :-)

by only ever deciding according to models whose backtesting is significant at the p=0.05 level

That's a horrible idea. No one has ever shown a backtest to investors (or management) which didn't look excellent. All backtests are highly significant and promise untold riches. That's why competent people pay little attention to them unless there's very good visible evidence that the selection bias was adequately controlled (no, the Bonferroni correction is not sufficient) or the significance is absolutely humongous -- and that happens only in high-frequency trading, pretty much.

Why not [compute a z-score]?

Well, obviously you can compute it. But you shouldn't call it "z", you should call it "t" :-).

Well, that basically never happens.

Indeed. (That was the point of about 1/3 of what I wrote.)

it's a complicated process

No shit :-). But the point is, it's a more complicated process than the one Jacob seems to have been envisaging, where you do a statistical test, find that your results aren't significant at whatever level you chose, and give up. (And of course more complicated again than the one that on the face of it he's criticizing, where you do a statistical test, get it wrong, think your results are significant, and immediately decide to use them.) So when Jacob says (I paraphrase) "you need to correct your results for multiple comparisons, so they won't be significant any more, so they're useless", he's not necessarily correct. What you have may be hundreds of results that all fall well short of statistical significance (when you test them right) because you've got such noisy and complicated data, and the task is to make the best you can of them.

That's a horrible idea.

That was kinda my point :-).

But you shouldn't call it "z", you should call it "t" :-)

A fair point.

gjm and Lumifer, thanks for the detailed discussion.

I want to clarify a few points, mainly regarding the context of what I'm writing. My goal was to give an intuition about multiplicity cropping up in different contexts in 400 words, not to explain the details of financial engineering or the difference between t and Z scores.

There are advanced approaches to annualizing Sharpe Ratios, but the first method that people are taught is to:

  1. Take the average daily return over a number of days, and multiply that by 252 (number of trading days in a year) to get the yearly return.
  2. Take the daily standard deviation and multiply by sqrt(252) to get the yearly standard deviation.
  3. Subtract the risk-free return from #1 and divide by #2.
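In code, that recipe looks something like this (a minimal sketch; the daily returns are synthetic stand-ins and the risk-free rate is a made-up number):

```python
import numpy as np

rng = np.random.default_rng(42)
daily_returns = rng.normal(0.0005, 0.01, size=252)  # stand-in for real data
risk_free = 0.02                                     # hypothetical annual rate

yearly_return = daily_returns.mean() * 252             # step 1
yearly_sd = daily_returns.std(ddof=1) * np.sqrt(252)   # step 2
sharpe = (yearly_return - risk_free) / yearly_sd       # step 3
print(round(sharpe, 2))
```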

Similarly, the basic way that people are taught to test for statistical significance is:

  1. Calculate the average measurement.
  2. Calculate the standard deviation of the average by dividing the sd of the sample by sqrt(sample size).
  3. Subtract the null (usually 0) from #1 and divide by #2.
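And the same three steps for the basic significance test, to make the parallel explicit (again a sketch with made-up data):

```python
import numpy as np
from scipy.stats import norm

measurements = np.random.default_rng(7).normal(0.3, 1.0, size=100)  # toy sample

avg = measurements.mean()                                          # step 1
sd_of_avg = measurements.std(ddof=1) / np.sqrt(len(measurements))  # step 2
z = (avg - 0) / sd_of_avg                                          # step 3, null = 0
print(round(z, 2), norm.sf(z))                                     # one-sided p-value
```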

As a convention, Sharpe Ratios are standardized to a single year and statistics in science (e.g. a drug's effectiveness) are standardized to a single average person. If everyone measured daily Sharpe Ratios (instead of yearly) and the effects of drugs on groups of 252 people at once, whether we multiply or divide would switch. But at the core, we're doing the same thing: making a lot of assumptions about how something is distributed and then dividing the "excess" result for a standard unit by the SD of that unit.

I agree that in practice people look at those things very differently. In social psychology you get p < 0.05 (t > 2) and go publish, while in stock-picking Sharpe Ratios rarely get anywhere close to 2, and you compare the ratios directly instead of thinking of them as p-values. Still, both measurements are equally affected by testing multiple hypotheses and reporting the best one. If someone tells me of a stock picking strategy (17th lag!) that has a Sharpe Ratio of 0.4 (as compared to the S&P's 0.25) but they tried 20 strategies to get there, it's worth as much as green jelly beans. That's all I was trying to get at in the first half of the post, and I don't think anyone disagrees on this point.
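A quick simulation of the jelly-bean effect in Sharpe terms (a sketch: twenty strategies with zero true edge, keeping only the best of each batch):

```python
import numpy as np

rng = np.random.default_rng(3)
best = []
for _ in range(1_000):
    r = rng.normal(0.0, 0.01, size=(20, 252))  # 20 junk strategies, 1 year each
    sharpes = (252 * r.mean(axis=1)) / (np.sqrt(252) * r.std(axis=1, ddof=1))
    best.append(sharpes.max())

print(round(float(np.median(best)), 2))  # typically well above 0.4, despite zero true edge
```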

Heh, nope. Finance people (other than marketers) are very interested in empirical truth because for them the match between the map and the territory directly translates into money. Hope is not a virtue in finance.

And that's exactly what I was trying to get at in the second half of the post.

Glad to be of service :-)

My goal was to give an intuition about multiplicity

In which case you don't need the digression into Sharpe ratios at all. It just distracts from the main point.

the first method that people are taught is to: 1. Take the average daily return over a number of days, and multiply that by 252

Err... If I may offer more advice, don't breezily barge into subjects which are more complicated than they look.

The "average daily return" for people who are taught their first method usually means the arithmetic return (P1/P0 - 1). If so, you do NOT multiply that number by 252 because arithmetic returns are not additive across time. Log returns (log(P1/P0)) are, but people who are using log returns are usually already aware of how Sharpe ratios work.

the basic way that people are taught to test for statistical significance

This is testing the significance of the mean. I would probably argue that the most common context where people encounter statistical significance is a regression and the statistical significance in question is that of the regression coefficients. And for these, of course, it's a bit more complicated.

Still, both measurements are equally affected by testing multiple hypotheses

I don't understand what this means. If you do multiple tests and pick the best, any measurement is affected.

The "average daily return" for people who are taught their first method usually means the arithmetic return (P1/P0 - 1). If so, you do NOT multiply that number by 252 because arithmetic returns are not additive across time. Log returns (log(P1/P0)) are, but people who are using log returns are usually already aware of how Sharpe ratios work.

If your daily returns are so big that ln(P1/P0) is non-negligibly different from P1/P0 - 1, I'm interested in knowing what your investment strategy is. ;-)

Once you upscale your daily returns by more than two orders of magnitude (that is, multiply them by 250), the difference becomes quite noticeable.

In which case you don't need the digression into Sharpe ratios at all. It just distracts from the main point.

I wrote a long post specifically about adjusting for multiplicity; this is just a follow-up to demonstrate that multiplicity is a problem even if you don't call what you measure "p-values". I think you read that one, and commented that I shouldn't use p-values at all. I agreed.

If you do multiple tests and pick the best, any measurement is affected.

You know this, I know this, but a lot of people from investment bankers to social psychologists don't know that, or at least don't understand it on a deep level. That's probably also true of many of my blog readers.

I feel like we're not really disagreeing. My point was "there are similarities between p-values and Sharpe Ratios, especially in the way they're affected by multiplicity". Your point was that they're not exactly the same thing. OK.

If I may offer more advice, don't breezily barge into subjects which are more complicated than they look.

Thanks, but I plan to follow the opposite advice. It's a popular blog, not a textbook, and part of my goal in writing it is to learn stuff. For example, since I wrote the last post I have learned a lot about Sharpe Ratios. I also think that the thrust of my post stands regardless of the exact parameters and assumptions for calculating Sharpe Ratios (which are the subject of textbooks).

but a lot of people from investment bankers to social psychologists don't know that, or at least don't understand it on a deep level

You'll excuse me if I find myself a bit sceptical with respect to your opinion about what investment bankers understand on a deep level and what they don't...

Your point was that they're not exactly the same thing

Well, actually my point was that they are not the same thing at all and confusing them is a category error. Perhaps I didn't express my point strongly enough :-P

It's a popular blog, not a textbook, and part of my goal in writing it is to learn stuff.

Sure, it's your blog. I just think it would be best not to mislead your readers.

I'm not versed enough in finance to assess the accuracy of your information, but I'll add that if what you say is true then Jacob should either rewrite that post substantially or take it down.

My post has all the links you need to understand it, really just the Wikipedia definitions of Sharpe Ratio and various test statistics like the standard score. Would you feel better about the post if Lumifer just talked about math and not about me being "very confused" and "misinformed"?

When a commenter on my blog actually catches a mistake in the math, I not only fix it immediately, I pay them $5 for the trouble.

My post has all the links you need to understand it, really just the Wikipedia definitions of Sharpe Ratio and various test statistics like the standard score.

I spent three hours studying all the various concepts, and I've reached the conclusion that there's a bit of confusion everywhere, due to nobody defining their terms.
Wikipedia gives four different definitions of the Sharpe ratio: original vs. revised and ex-ante vs. ex-post. Plus, they are defined using the moments of a distribution, but the examples use the sample mean and sample standard deviation, which are the true moments only of a uniform distribution over the observed sample.
Lumifer says that the Sharpe ratio is a descriptive statistic because it's the excess return over volatility, which I presume he takes to mean the instantaneous standard deviation.
You say that the Sharpe ratio is a test statistic, because it is a measurement divided by its standard deviation.

Neither of those assertions is true per se, though. The Sharpe ratio is a statistic, full stop. How you want to use it determines whether it's a description or a test.
Indeed: an ex-ante Sharpe over a prior distribution uses Lumifer's definition and still can be used to test a null hypothesis, and your definition can still be used as a descriptive statistic if your sample is the entire population (an easy feat for the price of a stock or index). Besides, multiple hypothesis testing = parameter estimation, so the guy who found the 17th lag is best described as a case of overfitting.

I agree, though, that Lumifer used an unnecessarily rude tone.

Wikipedia gives four different definitions of the Sharpe ratio

Sure, but this discussion has a specific context: finance. In finance the words "Sharpe ratio" are well defined: they mean the annualized ratio of the sample mean of excess returns to the sample standard deviation of the same returns.

and still can be used to test a null hypothesis

Can it? Let's try. I have a series of excess returns for which I know the Sharpe ratio (defined as above); let's say it is 0.5. Given the null hypothesis that the true mean of these returns is zero, what is the probability that my sample mean is what it is, conditional on the null hypothesis being true?

Sounds to me like a parallel in which volatility is being used as the Standard Deviation of the Mean. Which, as you point out, it isn't.

Some of those financial assumptions are overly rosy. I did my finance homework and learned that a 5% vacancy rate implies about one vacant month every two years -- nice if you can get it, but not necessarily realistic. And 10% seems to be the typical management fee, so the management + maintenance figure would need to be higher than that.

Also worth considering: if you can make a sound investment in a house for little money down in a neighborhood with good prospects, why don't your tenants do the same? Perhaps one has overestimated the prospects of the neighborhood or the creditworthiness of one's tenants.