Comment author: benkuhn 01 April 2015 01:56:15AM 3 points [-]

Is my general line of reasoning correct here, and is the style of reasoning a good style in the general case? I am aware that Eliezer raises points against "small probability multiplied by high impact" reasoning, but the fact is that a rational agent has to have a belief about the probability of any event, and inaction is itself a form of action that could be costly due to missing out on everything; privileging inaction is a good heuristic but only a moderately strong one.

Sometimes, especially in markets and other adversarial situations, inaction is secretly a way to avoid adverse selection.

Even if you're a well-calibrated agent--so that if you randomly pick 20 events to which you assign 5% subjective probability, on average one of them will happen--the set "all events where someone else is willing to trade on odds more favorable than 5%" is not a random selection of events.

Whether the Bitcoin markets are efficient enough to worry about this is an open question, but it should at least be a signal for you to make your case more robust than pulling a 5% number out of thin air, before you invest. I think the Reddit commenters were reasonable (a sentence I did not expect to type) for pointing this out, albeit uncharitably.
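To make the adverse-selection point concrete, here's a toy simulation (my own construction; the distributions and noise level are invented, not from any market data). An informed counterparty only takes the other side when your 5% is an overestimate, so the events you actually get to bet on happen less often than your calibration suggests:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
# True event probabilities, averaging about 5%.
p_true = rng.beta(1, 19, size=n)
# Your subjective probability is the truth plus estimation noise.
p_you = np.clip(p_true + rng.normal(0, 0.02, size=n), 0.001, 0.999)
happened = rng.random(n) < p_true

# Among all events you call roughly 5%, you're close to calibrated...
band = (p_you > 0.04) & (p_you < 0.06)
freq_all = happened[band].mean()

# ...but an informed counterparty only bets against you when your
# number is an overestimate, and those events happen less often.
offered = band & (p_you > p_true)
freq_offered = happened[offered].mean()
```

In this setup `freq_offered` comes out noticeably below `freq_all`: conditioning on "someone wants to trade with you" selects exactly the events you misjudged.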

Is "take the inverse of the size of the best-fitting reference class" a decent way of getting a first-order approximation? If not, why not? If yes, what are some heuristics for optimizing it?

In my experience, this simply shifts the debate to which reference class is the best-fitting one, aka reference-class tennis. For instance, a bitcoin detractor could argue that the reference class should also include Beanie Babies, Dutch tulips, and other similar stores of value.

Comment author: RyanCarey 18 December 2014 07:00:40AM 0 points [-]

Model accuracy actually isn't a great measure of predictive power, because it's sensitive to base rates.

I was told that you only run into severe problems with model accuracy if the base rates are far from 50%. Accuracy feels pretty interpretable and meaningful here as the base rates are 30%-50%.

As far as I know, if you don't have a utility function, scoring classifiers in an interpretable way is still kind of an open problem, but you could look at ROC AUC as a still-interpretable but somewhat nicer summary statistic of model performance.

Although ROC area under curve seems to have an awkward downside in that it penalises you for having poor prediction even when you set the sensitivity (the threshold) to a bad parameter. The F Score is pretty simple, and doesn't have this drawback - it's just a combination of some fixed sensitivity and specificity.

As you point out, there is ongoing research and discussion of this, which is confusing because as far as math goes, it doesn't seem like that hard of a problem.

Comment author: benkuhn 18 December 2014 08:37:04AM *  0 points [-]

I was told that you only run into severe problems with model accuracy if the base rates are far from 50%. Accuracy feels pretty interpretable and meaningful here as the base rates are 30%-50%.

It depends on how much signal there is in your data. If the base rate is 60%, but there's so little signal in the data that the Bayes-optimal predictions only vary between 55% and 65%, then even a perfect model isn't going to do any better than chance on accuracy. Meanwhile the perfect model will have a poor AUC but at least one that is significantly different from baseline.
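A quick sketch of this scenario (the numbers are invented to match the example, and AUC is computed via its Mann-Whitney rank formulation): a perfect model whose predictions only range over 0.55-0.65 has accuracy identical to the always-predict-the-majority baseline, yet an AUC measurably above 0.5.

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
n = 100_000
# Bayes-optimal probabilities only vary between 0.55 and 0.65.
p_true = rng.uniform(0.55, 0.65, size=n)
y = (rng.random(n) < p_true).astype(int)

# A perfect model (predicting p_true exactly), thresholded at 0.5,
# says "1" everywhere, exactly like the always-majority baseline.
acc_model = np.mean((p_true > 0.5) == y)
acc_baseline = max(y.mean(), 1 - y.mean())

# Its ROC AUC is poor but still distinguishable from the 0.5 of chance
# (Mann-Whitney formulation: probability a random positive outscores
# a random negative).
n_pos, n_neg = y.sum(), n - y.sum()
ranks = rankdata(p_true)
auc = (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```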

[ROC AUC] penalises you for having poor prediction even when you set the sensitivity (the threshold) to a bad parameter. The F Score is pretty simple, and doesn't have this drawback - it's just a combination of some fixed sensitivity and specificity.

I'm not really sure what you mean by this. There's no such thing as an objectively "bad parameter" for sensitivity (well, unless your ROC curve is non-convex); it depends on the relative cost of type I and type II errors.

The F score isn't comparable to AUC since the F score is defined for binary classifiers and the ROC AUC is only really meaningful for probabilistic classifiers (or I guess non-probabilistic score-based ones like uncalibrated SVMs). To get an F score for a probabilistic classifier you have to pick a single threshold, which seems even worse to me than any supposed penalization for picking "bad sensitivities."

there is ongoing research and discussion of this, which is confusing because as far as math goes, it doesn't seem like that hard of a problem.

Because different utility functions can rank models differently, the problem "find a utility-function-independent model statistic that is good at ranking classifiers" is ill-posed. A lot of debates over model scoring statistics seem to cash out to debates over which statistics seem to produce model selection that works well robustly over common real-world utility functions.
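A minimal illustration of that ill-posedness (the confusion-matrix counts are made up for the example): two classifiers on the same data get ranked in opposite orders depending only on the relative cost of false positives and false negatives.

```python
# Confusion counts for two hypothetical classifiers on the same data.
model_a = dict(tp=80, fp=30, fn=20, tn=70)   # aggressive: catches more positives
model_b = dict(tp=60, fp=10, fn=40, tn=90)   # conservative: fewer false alarms

def cost(m, c_fp, c_fn):
    """Total expected cost given per-error costs for FP and FN."""
    return c_fp * m["fp"] + c_fn * m["fn"]

# If false negatives are expensive, A wins; if false positives are, B wins.
a_cheaper_when_fn_costly = cost(model_a, 1, 5) < cost(model_b, 1, 5)
b_cheaper_when_fp_costly = cost(model_b, 5, 1) < cost(model_a, 5, 1)
```

No single utility-free statistic can agree with both rankings at once, which is the sense in which the problem is ill-posed.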

Comment author: RyanCarey 18 December 2014 12:36:02AM 0 points [-]

A lecturer suggested that it's good to do leave-one-out cross-validation if you have 20-ish data points, that 20-200 is leave-out-10% territory, and that with > 200 it's good to just use 30% for a test set. Though he didn't justify this, and it doesn't seem that important to stick to.

Comment author: benkuhn 18 December 2014 06:11:07AM 0 points [-]

I would beware the opinions of individual people on this, as I don't believe it's a very settled question. For instance, my favorite textbook author, Prof. Frank Harrell, thinks 22k is "just barely large enough to do split-sample validation." The adequacy of leave-one-out versus 10-fold depends on your available computational power as well as your sample size. 200 seems certainly not enough to hold out 30% as a test set; there's way too much variance.
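To put a number on the variance concern (0.7 is an arbitrary illustrative accuracy, not from any dataset): with n = 200 and a 30% holdout, the test set has only 60 points, and the binomial noise on the accuracy estimate is large.

```python
import numpy as np

# Suppose the classifier's true accuracy is 0.7. A 30% holdout from
# n = 200 leaves 60 test points, so the estimate's standard error is:
se = np.sqrt(0.7 * 0.3 / 60)
# se is about 0.06, so a 95% interval spans roughly +/- 12 accuracy points.
```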

Comment author: JonahSinick 18 December 2014 04:11:30AM *  0 points [-]

Thanks Ben!

In this case, with only selectivity ratings, you can't do better than 50% accuracy (any person wants to date any other person with 50% probability). But with eye-color ratings, you can get it perfect.

Edit: I initially misread your remark. I tried to clarify the setup with:

In this blog post I’m restricting consideration to signals of the partners’ general selectivity and general desirability, without considering how their traits interact.

Is this ambiguous?

My impression is that there are significant structural correlations in your data that I don't really understand the impact of. (For instance, at least if everyone rates everyone, I think the correlation of attr with attrAvg is guaranteed to be positive, even if attr is completely random.)

I may not fully parse what you have in mind, but I excluded the rater and ratee from the averages. This turns out not to be enough to avoid contamination for subtle reasons, so I made a further modification. I'll be discussing this later, but if you're wondering about this particular point, I'd be happy to now.

The relevant code is here. Your remark prompted me to check my code by replacing the ratings with random numbers drawn from a normal distribution. Using 7 ratings and 7 averages, the mean correlation is 0.003, with 23 negative and 26 positive.
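For readers who want to reproduce this kind of sanity check, here is a minimal version (the dimensions and seed are arbitrary and mine, not the study's): with fully random ratings, each rater's ratings are uncorrelated with the leave-one-out averages, so any structural correlation has been removed.

```python
import numpy as np

rng = np.random.default_rng(3)
n_raters, n_ratees = 100, 100
r = rng.normal(size=(n_raters, n_ratees))   # completely random "ratings"

# Average rating of each ratee, excluding the current rater's own rating.
loo_avg = (r.sum(axis=0) - r) / (n_raters - 1)

# Each rater's ratings should be uncorrelated with the leave-one-out
# averages; the per-rater correlations should scatter around zero.
corrs = [np.corrcoef(r[i], loo_avg[i])[0, 1] for i in range(n_raters)]
```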

Nitpick: There are only 10 distinct such correlations that are not 1 by definition.

Thanks, that was an oversight on my part. I've edited the text.

Model accuracy actually isn't a great measure of predictive power, because it's sensitive to base rates. (You at least mentioned the base rates, but it's still hard to know how much to correct for the base rates when you're interpreting the goodness of a classifier.)

I suppressed technical detail in this first post to make it more easily accessible to a general audience. I'm not sure whether this answers your question, but I used log loss as a measure of accuracy. The differentials were (approximately, the actual final figures are lower):

For Men: ~0.690 to ~0.500. For Women: ~0.635 to ~0.567. For Matches: ~0.432 to ~0.349.
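For concreteness, log loss here is presumably the standard mean negative log-likelihood (this sketch is mine, not Jonah's code). Note that the ~0.690 starting figure for men is close to ln 2 ≈ 0.693, the loss of always predicting 50%, which is consistent with a near-50% base rate:

```python
import numpy as np

def log_loss(y, p):
    """Mean negative log-likelihood of binary outcomes y under predictions p."""
    p = np.clip(p, 1e-15, 1 - 1e-15)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Always predicting 50% gives ln 2, near the men's starting figure of ~0.690.
baseline = log_loss(np.array([0, 1, 0, 1]), np.full(4, 0.5))
```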

I'll also be giving figures within the framework of recommendation systems in a later post.

As far as I know, if you don't have a utility function, scoring classifiers in an interpretable way is still kind of an open problem, but you could look at ROC AUC as a still-interpretable but somewhat nicer summary statistic of model performance.

Thanks, I've been meaning to look into this.

Comment author: benkuhn 18 December 2014 06:04:56AM 0 points [-]

Is this ambiguous?

It wasn't clear that this applied to the statement "we couldn't improve on using these" (mainly because I forgot you weren't considering interactions).

I excluded the rater and ratee from the averages.

Okay, that gets rid of most of my worries. I'm not sure it accounts for covariance between correlation estimates of different averages, so I'd be interested in seeing some bootstrapped confidence intervals. But perhaps I'm preempting future posts.

Also, thinking about it more, you point out a number of differences between correlations, and it's not clear to me that those differences are significant as opposed to just noise.

I'm not sure whether this answers your question, but I used log loss as a measure of accuracy.

I was using "accuracy" in the technical sense, i.e., one minus what you call "Total Error" in your table. (It's unfortunate that Wikipedia says scoring rules like log-loss are a measure of the "accuracy" of predictions! I believe the technical usage, that is, percentage properly classified for a binary classifier, is a more common usage in machine learning.)

The total error of a model is in general not super informative because it depends on the base rate of each class in your data, as well as the threshold that you choose to convert your probabilistic classifier into a binary one. That's why I generally prefer to see likelihood ratios, as you just reported, or ROC AUC scores (which integrates over a range of thresholds).

(Although apparently using AUC for model comparison is questionable too, because it's noisy and incoherent in some circumstances and doesn't penalize miscalibration, so you should use the H measure instead. I mostly like it as a relatively interpretable, utility-function-independent rough index of a model's usefulness/discriminative ability, not a model comparison criterion.)

Comment author: benkuhn 18 December 2014 01:46:40AM *  5 points [-]

Nice writeup! A couple comments:

If the dataset contained information on a sufficiently large number of dates for each participant, we could not improve on using [frequency with which members of the opposite sex expressed to see them again, and the frequency with which the participant expressed interest in seeing members of the opposite sex again].

I don't think this is true. Consider the following model:

  • There is only one feature, eye color. The population is split 50-50 between brown and blue eyes. People want to date other people iff they are of the same eye color. Everyone's ratings of eye color are perfect.

In this case, with only selectivity ratings, you can't do better than 50% accuracy (any person wants to date any other person with 50% probability). But with eye-color ratings, you can get it perfect.
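The eye-color model can be checked directly with a throwaway simulation (names and sizes are mine): selectivity carries no information because everyone likes exactly half the pool, while the single trait classifies every pair perfectly.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
eyes = rng.integers(0, 2, size=n)   # 0 = brown, 1 = blue

# Sample random ordered pairs and ask whether person i wants to date person j.
i = rng.integers(0, n, size=5000)
j = rng.integers(0, n, size=5000)
wants = eyes[i] == eyes[j]          # want iff same eye color

# Selectivity is useless: everyone likes about half the pool, so a
# selectivity-only model can't beat the ~50% base rate.
base_rate = wants.mean()

# But the eye-color feature classifies every pair perfectly.
acc_eye = np.mean((eyes[i] == eyes[j]) == wants)
```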

[correlation heatmap]

My impression is that there are significant structural correlations in your data that I don't really understand the impact of. (For instance, at least if everyone rates everyone, I think the correlation of attr with attrAvg is guaranteed to be positive, even if attr is completely random.)

As a result, I'm having a hard time interpreting things like the fact that likeAvg is more strongly correlated with attr than with like. I'm also having a hard time verifying your interpretations of the observations that you make about this heatmap, because I'm not sure to what extent they are confounded by the structural correlations.

It seems implausible to me that each of the 25 correlations between the five traits of attractiveness, fun, ambition, intelligence and sincerity is positive.

Nitpick: There are only 10 distinct such correlations that are not 1 by definition.

The predictive power that we obtain

Model accuracy actually isn't a great measure of predictive power, because it's sensitive to base rates. (You at least mentioned the base rates, but it's still hard to know how much to correct for the base rates when you're interpreting the goodness of a classifier.)

As far as I know, if you don't have a utility function, scoring classifiers in an interpretable way is still kind of an open problem, but you could look at ROC AUC as a still-interpretable but somewhat nicer summary statistic of model performance.

In response to Effective Writing
Comment author: pewpewlasergun 18 July 2014 09:51:15PM 4 points [-]

What's the best way to improve at writing? I was surprised at how much harder it is a few years out of school.

Comment author: benkuhn 19 July 2014 06:17:43AM 4 points [-]

I can't speak to "best," but I suggest reading Style: Lessons in Clarity and Grace by Joseph M. Williams, which crystallizes lots of non-trivial components of "good writing." (The link is to an older, less expensive edition which I used.)

I'll also second "write a lot" and "read a lot." Reading closely and with purpose in mind will speed up the latter (as opposed to the default of throwing books at your brain and hoping to pick up good writing by osmosis). Also, read good writers.

In response to Too good to be true
Comment author: benkuhn 12 July 2014 01:14:35AM *  7 points [-]

In your "critiquing bias" section you allege that 3/43 studies supporting a link is "still surprisingly low". This is wrong; it is slightly above what chance alone predicts: if B ~ Binom(43, 0.05), then P(B > 2) ~= 0.36, so three or more supporting studies is not surprising at all.*

*As calculated by the following Python code:

from scipy.stats import binom

b = binom(43, 0.05)
# P(B < 3) summed from the pmf; b.cdf(2) gives the same value
p_less_than_3 = sum(b.pmf(i) for i in [0, 1, 2])
print(1 - p_less_than_3)  # ~= 0.36
Comment author: benkuhn 19 June 2014 02:55:30AM 4 points [-]

I think you're being a little uncharitable to people who promote interventions that seem positional (e.g. greater educational attainment). It may be true that college degrees are purely for signalling and hence positional goods, but:

(a) it improves aggregate welfare for people to be able to send costly signals, so we shouldn't just get rid of college degrees;

(b) if an intervention improves college graduation rate, it (hopefully) is not doing this by handing out free diplomas, but rather by effecting some change in the subjects that makes them more capable of sending the costly signal of graduating from college, which is an absolute improvement.

Similarly, while height increase has no plausible mechanism for improving absolute wellbeing, some mechanisms for improving absolute wellbeing are measured using height as a proxy (most prominently nutritional status in developing countries).

It should definitely be a warning sign if an intervention seems only to promote a positional good, but it's more complex than it seems to determine what's actually positional.

Comment author: benkuhn 28 April 2014 02:03:19AM *  4 points [-]

Fun question.

The takeover vector that leaps to mind is remote code execution vulnerabilities on websites connected to important/sensitive systems. This lets you bootstrap from ability to make HTTP GET requests, to (partial) control over any number of fun targets, like banks or Amazon's shipping.

The things that are one degree away from those (via e.g. an infected thumb drive) are even more exciting:

  • Iranian nuclear centrifuges
  • US nuclear centrifuges
  • the electrical grid
  • hopefully not actual US nuclear weapons, but this should be investigated...

Plausible first attempt: get into a defense contractor's computers and install Thompson's compiler backdoor. Now the AI can stick whatever code it wants on various weapons and blackmail anyone it wants or cause havoc in any number of other ways.

Comment author: shminux 17 April 2014 04:28:09PM 1 point [-]

I'd beware conflating "interpersonal skills" with "playing politics."

Certainly there is a spectrum there.

The subtext of your comment is that the companies you mention were somehow duped into promoting these bad engineers to executive roles

I did not mean it that way in general, but in one particular case both ran the company into the ground, one by picking a wrong (dying) market, the other by picking a poor acquisition target (the code base hiding behind a flashy facade sucked). I am not claiming that if the company promoted someone else they would have done a better job.

Second, I think that the "playing politics" part is true at some organizations but not at others.

If we define "playing politics" as "using interpersonal relationships to one's own advantage and others' detriment", then I have yet to see a company with more than a dozen employees where this wasn't commonplace.

If we define "interpersonal skills" as "the art of presenting oneself in the best possible light", then some people are naturally more skilled at it than others and techies rarely top the list.

As for trusting the management to accurately figure out who actually deserves credit, I am not as optimistic. Dilbert workplaces are contagious and so very common. I'm glad that you managed to avoid getting stuck in one.

Comment author: benkuhn 17 April 2014 04:47:00PM 1 point [-]

Yes, definitely agree that politicians can dupe people into hiring them. Just wanted to raise the point that it's very workplace-dependent. The takeaway is probably "investigate your own corporate environment and figure out whether doing your job well is actually rewarded, because it may not be".
