
Comment author: MrMind 15 March 2017 01:55:10PM 2 points [-]

I love your articles! What tool did you use to calculate the linear regression?

Comment author: Jacobian 15 March 2017 04:55:27PM 2 points [-]

I didn't actually do linear regression for the girlfriend matrix, but in general I use the glm function in R. It allows you to quickly generate regression models with any combination of variables by changing a couple of words, and saves the full model for analysis and manipulation.
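
For example, here's a minimal sketch of that workflow; the data frame and column names are made up for illustration, not my actual spreadsheet:

    # Hypothetical data: one row per candidate, an overall rating plus two sub-scores
    candidates <- data.frame(
      overall    = c(8, 5, 7, 6, 9, 4),
      looks      = c(7, 4, 8, 6, 9, 3),
      shared_fun = c(9, 5, 6, 7, 8, 4)
    )

    # glm() with the default gaussian family is ordinary linear regression;
    # changing the model is just a matter of editing the formula
    fit <- glm(overall ~ looks + shared_fun, data = candidates)

    summary(fit)   # coefficients, standard errors, p-values

Swapping a predictor in or out is a one-word change to the formula, and `fit` keeps the full model object around for later analysis and comparison.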

If you don't need to compare different models (e.g. to pick out which variables are useful and which are just noise), you can even run a regression in Excel (Data -> Data Analysis). If you don't want to install any software at all, Wolfram Alpha has you covered.

[Link] Putanumonit: A spreadsheet helps debias decisions, like picking a girl to date

10 Jacobian 15 March 2017 03:19AM
Comment author: Lumifer 03 February 2017 05:41:37PM *  0 points [-]

Glad to be of service :-)

My goal was to give an intuition about multiplicity

In which case you don't need the digression into Sharpe ratios at all. It just distracts from the main point.

the first method that people are taught is to: 1. Take the average daily return over a number of days, and multiply that by 252

Err... If I may offer more advice, don't breezily barge into subjects which are more complicated than they look.

The "average daily return" for people who are taught their first method usually means the arithmetic return (P1/P0 - 1). If so, you do NOT multiply that number by 252 because arithmetic returns are not additive across time. Log returns (log(P1/P0)) are, but people who are using log returns are usually already aware of how Sharpe ratios work.

the basic way that people are taught to test for statistical significance

This is testing the significance of the mean. I would probably argue that the most common context where people encounter statistical significance is a regression and the statistical significance in question is that of the regression coefficients. And for these, of course, it's a bit more complicated.

Still, both measurements are equally affected by testing multiple hypotheses

I don't understand what this means. If you do multiple tests and pick the best, any measurement is affected.

Comment author: Jacobian 03 February 2017 10:10:25PM *  0 points [-]

In which case you don't need the digression into Sharpe ratios at all. It just distracts from the main point.

I wrote a long post specifically about adjusting for multiplicity; this is just a follow-up to demonstrate that multiplicity is a problem even if you don't call what you measure "p-values". I think you read that one and commented that I shouldn't use p-values at all. I agreed.

If you do multiple tests and pick the best, any measurement is affected.

You know this, I know this, but a lot of people from investment bankers to social psychologists don't know that, or at least don't understand it on a deep level. That's probably also true of many of my blog readers.

I feel like we're not really disagreeing. My point was "there are similarities between p-values and Sharpe Ratios, especially in the way they're affected by multiplicity". Your point was that they're not exactly the same thing. OK.

If I may offer more advice, don't breezily barge into subjects which are more complicated than they look.

Thanks, but I plan to follow the opposite advice. It's a popular blog, not a textbook, and part of my goal in writing it is to learn stuff. For example, since I wrote the last post I learned a lot about Sharpe Ratios. I also think that the thrust of my post stands regardless of the exact parameters and assumptions for calculating Sharpe Ratios (which are a subject for textbooks).

Comment author: Lumifer 03 February 2017 03:31:07PM 0 points [-]

But you shouldn't call it "z", you should call it "t" :-)

A fair point.

Comment author: Jacobian 03 February 2017 04:53:24PM 0 points [-]

gjm and Lumifer, thanks for the detailed discussion.

I want to clarify a few points, mainly regarding the context of what I'm writing. My goal was to give an intuition about multiplicity cropping up in different contexts in 400 words, not to explain the details of financial engineering or the difference between t and Z scores.

There are advanced approaches to annualizing Sharpe Ratios, but the first method that people are taught is to:

1. Take the average daily return over a number of days, and multiply that by 252 (number of trading days in a year) to get the yearly return.
2. Take the daily standard deviation and multiply by sqrt(252) to get the yearly standard deviation.
3. Subtract the risk-free return from #1 and divide by #2.
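
In R that looks roughly like this (the daily returns and the risk-free rate are made up):

    set.seed(1)
    daily <- rnorm(252, mean = 0.0004, sd = 0.01)      # made-up daily returns
    rf    <- 0.02                                       # made-up annual risk-free rate

    annual_return <- mean(daily) * 252                  # step 1
    annual_sd     <- sd(daily) * sqrt(252)              # step 2
    sharpe        <- (annual_return - rf) / annual_sd   # step 3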

Similarly, the basic way that people are taught to test for statistical significance is to:

1. Calculate the average measurement.
2. Calculate the standard deviation of the average by dividing the sd of the sample by sqrt(sample size).
3. Subtract the null (usually 0) from #1 and divide by #2.
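
And the same three steps for a generic measurement (again with made-up numbers):

    x <- c(1.2, -0.4, 0.8, 2.1, 0.3, 1.5, -0.2, 0.9)   # made-up measurements

    avg <- mean(x)                     # step 1
    se  <- sd(x) / sqrt(length(x))     # step 2
    z   <- (avg - 0) / se              # step 3: "excess" over the null, in SE units

    t.test(x, mu = 0)   # the built-in version, using the t rather than the normal distribution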

As a convention, Sharpe Ratios are standardized to a single year, and statistics in science (e.g. a drug's effectiveness) are standardized to a single average person. If everyone measured daily Sharpe Ratios (instead of yearly ones) and the effects of drugs on groups of 252 people at once, whether we multiply or divide by the square root of n would switch. But at the core, we're doing the same thing: making a lot of assumptions about how something is distributed and then dividing the "excess" result for a standard unit by the SD of that unit.

I agree that in practice people look at those things very differently. In social psychology you get p<0.05 (t-score > 2) and go publish, while in stock-picking Sharpe Ratios rarely get anywhere close to 2, and you compare the ratios directly instead of thinking of them as p-values. Still, both measurements are equally affected by testing multiple hypotheses and reporting the best one. If someone tells me of a stock picking strategy (17th lag!) that has a Sharpe Ratio of 0.4 (as compared to the S&P's 0.25) but they tried 20 strategies to get there, it's worth as much as green jelly beans. That's all I was trying to get at in the first half of the post, and I don't think anyone disagrees on this point.
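
Here's a rough simulation of that effect, with everything made up (pure-noise returns, risk-free rate taken as zero): generate 20 junk strategies, keep only the best one, and check how often it clears a Sharpe of 0.4.

    set.seed(42)
    best_of_20 <- replicate(1000, {
      sharpes <- replicate(20, {
        daily <- rnorm(252, mean = 0, sd = 0.01)         # a year of pure-noise daily returns
        (mean(daily) * 252) / (sd(daily) * sqrt(252))    # annualized Sharpe, risk-free rate taken as 0
      })
      max(sharpes)                                       # report only the best strategy
    })

    mean(best_of_20 > 0.4)   # the best of 20 junk strategies clears 0.4 almost every time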

Heh, nope. Finance people (other than marketers) are very interested in empirical truth because for them the match between the map and the territory directly translates into money. Hope is not a virtue in finance.

And that's exactly what I was trying to get at in the second half of the post.

Comment author: MrMind 02 February 2017 08:37:42AM 0 points [-]

I'm not versed enough in finance to assess the accuracy of your information, but I'll add that if what you say is true then Jacob should either rewrite that post substantially or take it down.

Comment author: Jacobian 02 February 2017 05:17:42PM 0 points [-]

My post has all the links you need to understand it: really just the Wikipedia definitions of the Sharpe Ratio and of various test statistics like the standard score. Would you feel better about the post if Lumifer just talked about the math and not about me being "very confused" and "misinformed"?

When a commenter on my blog actually catches a mistake in the math, I not only fix it immediately, I pay them $5 for the trouble.

Comment author: Lumifer 31 January 2017 06:37:28PM *  1 point [-]

That's a very confused post.

Let's start with the obvious -- the Sharpe ratio is not a test statistic (and does not lead to a p-value). To see this, note that the calculation of a test statistic involves a very important input -- the number of observations. A test statistic's denominator is not volatility, it's volatility divided by the square root of n. The Sharpe ratio does not care about the number of observations at all. So statements like

Each (arbitrary and useless) p-value cutoff corresponds to a test statistic cutoff which translates to a Sharpe Ratio cutoff above which an investment strategy is “significant” enough to brag about in front of a finance professor

are just figments of your misinformed imagination. In reality, having a strategy with a Sharpe of, say, 1 is very very good (the Sharpe ratio of the S&P500 this century is somewhere around 0.25).

In general, people in finance don't care about p<0.05 (even when it's statistically "valid") because that threshold assumes a stable and unchanging underlying process, which in the financial markets is very much not the case.

Comment author: Jacobian 02 February 2017 05:10:20PM 1 point [-]

A test statistic involves dividing some measurement by the standard deviation of that same measurement.

When you calculate a p-value for a number of observations, you commonly take the average result and the standard deviation of the average, which equals the standard deviation of a single measurement divided by the square root of n. You can also calculate a test statistic and a p-value for a single observation, in which case you have just one result and one standard deviation.

This is what you do when calculating the Sharpe Ratio: you divide the excess return over some period (usually a year) by the standard deviation of returns for that exact period. You can also interpret it in p-value language: if the S&P 500 has a (yearly) Sharpe Ratio of 0.25, that corresponds to a p-value of 0.4 under the assumption that everything is normally distributed (1 - NORM.DIST(0.25,0,1,1) = 0.4). This can be interpreted to mean that a leveraged portfolio of the risk-free asset (which has a Sharpe Ratio of 0) has a 40% chance of outperforming the S&P 500 in a given year. The Sharpe Ratios in finance are low (because the markets are pretty efficient), so the corresponding p-values are high, but the math is the same.
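
The same calculation in R, under the same normality assumption:

    1 - pnorm(0.25)   # ~0.40: the chance a zero-Sharpe portfolio beats a 0.25-Sharpe one in a year
    1 - pnorm(1)      # ~0.16: the same interpretation applied to a strategy with a Sharpe of 1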

And if you think something is wrong with the math, I encourage you to discuss the math without talking about my "misinformed imagination".

[Link] Putanumonit - A "statistical significance" story in finance

1 Jacobian 31 January 2017 05:54PM
In response to First impressions...
Comment author: Oscar_Cunningham 24 January 2017 04:32:54PM 0 points [-]

You might be right that we tend to focus on details too much, but I don't think your example shows this.

when I asked for metrics to evaluate a presidency, few people actually provided any - most started debating the validity of metrics, and one subthread went off to discuss the appropriateness of the term "gender equality".

All this shows is that we're bad at solving the problem you asked us to solve. But it's not like you're paying us to solve it. We can choose to talk about whatever we find most interesting. That doesn't mean we couldn't solve the problem if we wanted to.

Comment author: Jacobian 25 January 2017 05:13:48PM 3 points [-]

I completely agree. Almost all of us here have jobs/pursuits/studies that we are good at, and that require a lot of object-level problem solving. LW is a quiet corner where we come in our free time to discuss meta-level philosophical questions of rationality and have a good time. For these two goals, LW has also acquired a norm of not talking about object-level politics too much, because politics makes it hard to stay meta-level rational and isn't always a good time.

Now with that said, you're of course welcome to post an object-level political question on the forum. It's an open community. But if people here don't want to play along, you should take it as a sign that you missed something about LW, not that we're missing something about answering questions practically.

Comment author: gjm 23 January 2017 12:41:16AM *  2 points [-]

I think there's something wrong with your analysis of the longer/shorter survey data.

[EDITED to add:] ... and, having written this and gone back to read the comments on your post, I see that someone there has already said almost exactly the same as I'm saying here. Oh well.

You start out by saying that you should write longer posts if 25% more readers prefer long than prefer short (and similarly for writing shorter posts).

Then you consider three hypotheses: that (as near as possible to) exactly 25% more prefer long than prefer short, that (as near as possible to) exactly 25% more prefer short, and that the numbers preferring long and preferring short are equal.

And you establish that your posterior probability for the first of those is much bigger than for either of the others, and say

Our simple analysis led us to an actionable conclusion: there’s a 97% chance that the preference gap in favor of longer posts is closer to 25% than to 0%, so I shouldn’t hesitate to write longer posts.

Everything before the last step is fine (though, as you do remark explicitly, it would be better to consider a continuous range of hypotheses about the preference gap). But surely the last step is just wrong in at least two ways.

  • You can't get from "preference gap of exactly 25% is much more likely than preference gap of exactly 0%" to "preference gap of at least 12.5% is much more likely than preference gap of at most 12.5%".
  • The original question wasn't whether the preference gap is at least 12.5%, it was whether it's at least 25%.

With any reasonable prior, I think the data you have make it extremely unlikely that the preference gap is at least 25%.

[EDITED to add:] Oh, one other thing I meant to say but forgot (which, unlike the above, hasn't already been said in comments on your blog). The assumption being made here is, roughly, that people responding to the survey are a uniform random sample from all your readers. But I bet they aren't. In particular, I bet more "engaged" readers are (1) more likely to respond to the survey and (2) more likely to prefer longer meatier posts. So I bet the real preference gap among your whole readership is smaller than the one found in the survey. Of course you may actually prefer to optimize for the experience of your more engaged readers, but again that isn't what you said you wanted to do :-).

Comment author: Jacobian 23 January 2017 04:11:32PM 1 point [-]

Since at 4,000 words the post was running up against the limits of my stamina regardless of readers' preferences, I trust my smart and engaged readers to make all the necessary nitpicks and caveats for me :)

First of all, according to the site stats more than 80% of the people who read the survey filled it out, so it makes sense to treat it as a representative sample. I forgot to mention that.

To your first point: you're correct that "the real gap is almost certainly above 12.5%" isn't exactly what my posterior is. Again, my goal was to make a decision, so I had to assign decisions to the possible results the data could show me. I don't need a precise interpretation of the results to make a sensible decision based on them, as long as I'm not horribly mistaken about what the results mean.

And what the results mean is, in fact, pretty close to "the real gap is almost certainly above 12.5%" under some reasonable assumptions. Whatever the "real" gap is (i.e. the gap I would get if I got an answer from every single one of my current and future readers), the possible gaps I could measure on the survey are almost certainly distributed in some unimodal and pretty symmetric way around it. This means that the measured results are about as likely to overshoot the "real gap" by x% as they are to undershoot it, at least to a first approximation (i.e. ignoring things like how the question was worded and the phase of the moon). This in turn means that a measured gap of 15% on a large sample of readers does imply that the "real gap" is very likely to be close to 15% and above 12.5%.
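
Here's a toy simulation of that argument; the "true" preferences and the number of respondents are made up for illustration:

    set.seed(7)
    true_probs <- c(longer = 0.45, shorter = 0.30, no_pref = 0.25)  # made-up "real" preferences
    n <- 300                                                        # made-up number of respondents

    measured_gap <- replicate(10000, {
      ans <- sample(names(true_probs), n, replace = TRUE, prob = true_probs)
      mean(ans == "longer") - mean(ans == "shorter")
    })

    hist(measured_gap)                        # unimodal and roughly symmetric around the true 15% gap
    mean(measured_gap)                        # ~0.15
    quantile(measured_gap, c(0.025, 0.975))   # the spread of measured gaps around the true one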

Thanks for taking the time to dig into the math, this is what it's about.

[Link] Putanumonit - Bayesian inference vs. null hypothesis testing

5 Jacobian 22 January 2017 02:31PM
