lessdazed comments on Minicamps on Rationality and Awesomeness: May 11-13, June 22-24, and July 21-28 - Less Wrong

24 Post author: AnnaSalamon 29 March 2012 08:48PM


Comment author: lessdazed 29 March 2012 10:23:36PM 3 points [-]

The median is almost always around 7, for almost anything.

I tried to take that into account when reading.

treating the indexes as utilities

Please explain.

Comment author: [deleted] 29 March 2012 10:55:18PM 11 points [-]

I tried to take that into account when reading.

I know, I did too, but that is really the sort of calculation that should be done by a large-scale study documenting a control distribution for 0-10 ratings, against which such ratings can be calibrated.

treating the indexes as utilities

Please explain.

In my engineering school, we had some project planning classes where we would attempt to calculate the best design based on the strength of our preference for performance on a variety of criteria (aesthetics, weight, strength, cost, etc.). Looking back, I recognize what we were doing as coming up with a utility function to compute the utilities of the different designs.

Unfortunately, none of us (including the people who had designed the procedure) knew anything about utility functions or decision theory, so they would do things like rank the different criteria, and the strength of each design on each criterion, and then use those rank indices directly as utility weights and partial utilities.

(So, for example, strength might be most important (10), then cost (9), then weight (8), and so on; and then maybe design A would be best (10) in weight, worst (1) in strength, etc.)

I didn't know any decision theory or anything, but I have a strong sense for noticing errors in mathematical models, and this thing set off alarm bells like crazy. We should have been giving a lot of thought to calibrating our weights and utilities, to make sure the arbitrariness of the rankings couldn't sneak through and change the answer, but no one gave a shit. I raised a fuss and tried to rederive the whole thing from first principles. I don't think I got anywhere, though; it was only one assignment, so I might have given up because of low value (it's a hard problem). Don't remember.

Moral:

With this sort of thing, or anything really, you either use bulletproof mathematical models derived from first principles (or empirically) with calibrated real quantities, or you wing it intuitively using your built-in hardware. You do not use "math" on uncalibrated pseudo-quantities; that just tricks you into overriding your intuition for something with no correct basis.

This is why you never use explicit probabilities that aren't either empirically determined or calculated theoretically.

Comment author: Nick_Tarleton 30 March 2012 06:46:11PM *  5 points [-]

With this sort of thing, or anything really, you either use bulletproof mathematical models derived from first principles (or empirically) with calibrated real quantities, or you wing it intuitively using your built-in hardware. You do not use "math" on uncalibrated pseudo-quantities; that just tricks you into overriding your intuition for something with no correct basis.

Despite anti-arbitrariness intuitions, there is empirical evidence that this is wrong.

The Robust Beauty of Improper Linear Models

Proper linear models are those in which predictor variables are given weights in such a way that the resulting linear composite optimally predicts some criterion of interest; examples of proper linear models are standard regression analysis, discriminant function analysis, and ridge regression analysis. Research summarized in Paul Meehl's book on clinical versus statistical prediction—and a plethora of research stimulated in part by that book—all indicates that when a numerical criterion variable (e.g., graduate grade point average) is to be predicted from numerical predictor variables, proper linear models outperform clinical intuition. Improper linear models are those in which the weights of the predictor variables are obtained by some nonoptimal method; for example, they may be obtained on the basis of intuition, derived from simulating a clinical judge's predictions, or set to be equal. This article presents evidence that even such improper linear models are superior to clinical intuition when predicting a numerical criterion from numerical predictors. In fact, unit (i.e., equal) weighting is quite robust for making such predictions. The article discusses, in some detail, the application of unit weights to decide what bullet the Denver Police Department should use. Finally, the article considers commonly raised technical, psychological, and ethical resistances to using linear models to make important social decisions and presents arguments that could weaken these resistances.

(this is about something somewhat less arbitrary than using ranks as scores, but it seems like evidence in favor of that approach as well)
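Dawes's unit-weighting result is easy to reproduce on synthetic data. A toy sketch in Python (the data, weights, and noise level here are all invented for illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "graduate admissions" data: the criterion is a noisy
# linear combination of three standardized predictors.
n = 1000
X = rng.standard_normal((n, 3))
true_w = np.array([0.5, 0.3, 0.2])
y = X @ true_w + 0.5 * rng.standard_normal(n)

train, test = slice(0, 500), slice(500, None)

# Proper linear model: least-squares weights fit on the training half.
proper_w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)

# Improper linear model: unit (equal) weights, no fitting at all.
unit_w = np.ones(3)

def holdout_corr(w):
    pred = X[test] @ w
    return np.corrcoef(pred, y[test])[0, 1]

# The unit-weight model predicts nearly as well as the fitted one.
print(holdout_corr(proper_w), holdout_corr(unit_w))
```

As long as the predictors all point the right way, equal weights give up surprisingly little predictive power compared to the optimally fitted weights.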

Comment author: Will_Newsome 30 March 2012 11:55:58PM *  -1 points [-]

Dawes is not a reliable researcher; I have very little confidence in his studies. Check it.

(ETA: I also have other reasons to mistrust Dawes, but shouldn't go into those here. In general you just shouldn't trust heuristics and biases results any more than you should trust parapsychology results. (Actually, parapsychology results tend to be significantly better supported.) Almost all psychology is diseased science; the hypotheses are often interesting, the statistical evidence given for them is often anti-informative.)

Comment author: Blueberry 29 March 2012 11:09:42PM 1 point [-]

You do not use "math" on uncalibrated pseudo-quantities; that just tricks you into overriding your intuition for something with no correct basis.

I'm not sure I understand what you mean by pseudo-quantities.

strength might be most important (10), then cost (9) then wieght (8) and so on.

So the problem is that these attributes were given rankings from 10 down to 1, rather than their weights that corresponded to their actual importance?

Comment author: [deleted] 29 March 2012 11:29:47PM 1 point [-]

"pseudo-quantity" is a term I just made up for things that look like quantities (they may even have units), but are fake in some way. Unlike real quantities, for which correct math is always valid, you cannot use math on pseudo-quantities without calibration (which is not always possible).

Example: uncalibrated probability ratings ("I'm 95% sure") are not probabilities, and you cannot use them in probability calculations, even though they seem to be numbers with the right units. You can turn them into real probabilities by doing calibration (assuming they correlate well enough with outcomes).
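A toy sketch of that calibration step (Python, with a made-up prediction log): bucket your stated confidences and replace each one with its empirical hit rate.

```python
from collections import defaultdict

# Hypothetical log of (stated confidence, whether the claim came true).
log = [(0.95, True), (0.95, True), (0.95, False), (0.95, True),
       (0.70, True), (0.70, False), (0.70, True), (0.70, False)]

outcomes = defaultdict(list)
for stated, correct in log:
    outcomes[stated].append(correct)

# Calibrated probability: the empirical hit rate at each stated level.
calibrated = {p: sum(v) / len(v) for p, v in outcomes.items()}
print(calibrated)  # {0.95: 0.75, 0.7: 0.5}
```

Here "95% sure" turns out to mean 75% in practice, and that 0.75 is the number you could actually feed into a probability calculation.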

So the problem is that these attributes were given rankings from 10 down to 1, rather than their weights that corresponded to their actual importance?

More or less. Other rating systems could be calibrated to get actual utility coefficients, but rank indices lose information and cannot even be calibrated.

Comment author: Blueberry 29 March 2012 11:37:04PM *  0 points [-]

So the pseudo-quantities in your example are strength ratings on a 1-10 scale?

I actually think that's acceptable, assuming the ratings on the scale are equally spaced, and the weights correspond to the spacing. For instance, space strengths out from 1 to 10 evenly, space weights out from 1 to 10 evenly (where 10 is the best, i.e., lightest), where each interval corresponds to roughly the same level of improvement in the prototype. Then assign weights to go along with how important an improvement is along one axis compared to the other. For instance, if improving strength one point on the scale is twice as valuable as improving weight, we can give strength a weight of 2, and computations like:

  • Option A, strength 3, weight 6, total score 2(3) + 6 = 12
  • Option B, strength 5, weight 3, total score 2(5) + 3 = 13

make sense.
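A minimal sketch of that computation in Python (the designs, scores, and weights are the hypothetical numbers from above; the attribute is called "mass" here just to avoid a name clash with the weight coefficients):

```python
# Hypothetical designs, each rated 1-10 on evenly spaced scales.
designs = {"A": {"strength": 3, "mass": 6},
           "B": {"strength": 5, "mass": 3}}

# Weight = value of a one-point improvement on that scale.
weights = {"strength": 2, "mass": 1}

def score(design):
    return sum(weights[k] * v for k, v in design.items())

scores = {name: score(d) for name, d in designs.items()}
print(scores)                       # {'A': 12, 'B': 13}
print(max(scores, key=scores.get))  # B
```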

Comment author: [deleted] 29 March 2012 11:52:11PM 0 points [-]

You still have a free degree of freedom: the offset. What if you rated from 10 to 20, or from -5 to 5? As a limiting case, consider ratings from 100 to 110: the thing with the highest preference (strength) would totally swamp the calculation, becoming the only concern.

Once you have scale and offset correctly calibrated, you still need to worry about nonlinearity. In this case (using rank indexes), the problem is even worse. Like I said, rank indexes lose information. What if the designs are all about the same weight, but one is drastically lighter? The rankings are identical no matter how much difference there is. That's not right. Using something approximating a real-valued rating (rate from 1-10) instead of rank indices reduces the problem to mere nonlinearity.
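A toy illustration of how rank indices erase magnitude (Python, with invented masses):

```python
def ranks(values):
    # Rank index: 1 for the smallest value, n for the largest.
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

close   = [5.0, 5.1, 5.2]  # three designs of nearly equal mass (kg)
extreme = [0.5, 5.1, 5.2]  # first design is 10x lighter

print(ranks(close))    # [1, 2, 3]
print(ranks(extreme))  # [1, 2, 3] -- identical; the 10x gap vanished
```

Any downstream "utility" computed from those ranks treats the two situations as the same, which is exactly the lost information.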

This is not as hard as FAI, but it's harder than pulling random numbers out of your butt, multiplying them, and calling it a decision procedure.

Comment author: Blueberry 30 March 2012 12:09:57AM *  2 points [-]

I agree that ranking the weights from 1 to N is idiotic because it doesn't respect the relative importance of each characteristic. However, changing the ratings from 101-110 for every scale will just add a constant to each option's value:

  • Option A, strength 103, mass 106, total score 2(103) + 106 = 312
  • Option B, strength 105, mass 103, total score 2(105) + 103 = 313

(I changed 'weight' to 'mass' to avoid confusion with the other meaning of 'weight'.)
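The cancellation is easy to check directly (Python, same hypothetical designs and weights as above):

```python
designs = {"A": {"strength": 3, "mass": 6},
           "B": {"strength": 5, "mass": 3}}
weights = {"strength": 2, "mass": 1}

def best(offset):
    # Shift every rating by a constant offset before scoring.
    scores = {name: sum(weights[k] * (v + offset) for k, v in d.items())
              for name, d in designs.items()}
    return max(scores, key=scores.get), scores

print(best(0))    # ('B', {'A': 12, 'B': 13})
print(best(100))  # ('B', {'A': 312, 'B': 313}) -- same winner
```

Since every option's score shifts by the same constant (here, the offset times the sum of the weights), the ranking of options is unchanged.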

Using something approximating a real-valued rating (rate from 1-10) instead of rank indices reduces the problem to mere nonlinearity.

I assume you mean using values for the weights that correspond to importance, which isn't necessarily 1-10. For instance, if strength is 100 times more important than mass, we'd need to have weights of 100 and 1.

You're right that this assumes that the final quality is a linear function of the component attributes: we could have a situation where strength becomes less important when mass passes a certain threshold, for instance. But using a linear approximation is often a good first step at the very least.

Comment author: [deleted] 30 March 2012 12:22:45AM *  0 points [-]

  • Option A, strength 103, mass 106, total score 2(103) + 106 = 312
  • Option B, strength 105, mass 103, total score 2(105) + 103 = 313

Oops, I might have to look at that more closely. I think you are right. The shared offset cancels out.

I assume you mean using values for the weights that correspond to importance, which isn't necessarily 1-10. For instance, if strength is 100 times more important than mass, we'd need to have weights of 100 and 1.

Using 100 and 1 for something that is 100 times more important is correct (assuming you are able to estimate the weights; 100x is awfully suspicious). The idiotic procedure was using rank indices, not real-valued weights.

But using a linear approximation is often a good first step at the very least.

Agreed. Linearity is a valid first assumption.

The error is using uncalibrated ratings from 0-10, or worse, rank indices. A linear-valued rating from 0-10 has the potential to carry the information properly, but that does not mean people can produce calibrated estimates there.

Comment author: orthonormal 29 March 2012 11:20:00PM 1 point [-]

So the problem is that these attributes were given rankings from 10 down to 1, rather than their weights that corresponded to their actual importance?

Right; that can cause this problem. (Not quite the same dynamic, but you get the idea.)

Comment author: Vaniver 30 March 2012 04:53:31AM 0 points [-]

Multicriteria objective functions are really hard to get right. Weighting features from 10 to 1 is actually a decent first approach: it should separate good solutions from bad ones. But if you're down to narrow differences of the weighted objective function, it's typically time to hand off to a human decision-maker, or to spend a lot of time considering tradeoffs in order to elicit the weights. (Thankfully, a first pass should show you which features you need to value carefully and which you can ignore.)

Comment author: Grognor 30 March 2012 04:44:44AM *  0 points [-]

With this sort of thing, or anything really, you either use bulletproof mathematical models derived from first principles (or empirically) with calibrated real quantities, or you wing it intuitively using your built-in hardware. You do not use "math" on uncalibrated pseudo-quantities; that just tricks you into overriding your intuition for something with no correct basis.

This is why you never use explicit probabilities that aren't either empirically determined or calculated theoretically.

This is a very good general point, one that I natively seem to grasp, but even so I'd appreciate it if you wrote a top-level post about it.