RichardKennaway comments on Self-Congratulatory Rationalism - Less Wrong

51 Post author: ChrisHallquist 01 March 2014 08:52AM




Comment author: RichardKennaway 25 April 2014 04:39:00PM 3 points [-]

What actually happens is that the reasons for the summary judgements are examined.

Three for, one against. Is the dissenter the only one who has not understood the paper, or the only one who knows that, although the work is good, almost the same paper has just been accepted to another conference? The set of summary judgements is the same in both cases, but the right final judgement is different. Therefore there is no way to get the latter from the former alone.

Aumann agreement requires common knowledge of each other's priors. When does this ever obtain? I believe Robin Hanson's argument about pre-priors just stands the turtle on top of another turtle.
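As a toy illustration of why the common-prior condition does real work in the theorem (a hypothetical sketch, not anything from the thread): with a shared prior over four equally likely states, two agents who observe different things can still announce identical posteriors, whereas a skewed private prior produces a disagreement that no exchange of posteriors removes.

```python
from fractions import Fraction

# Four states of the world; the agents want P(target) after learning.
states = [1, 2, 3, 4]
target = {1, 4}

def posterior(prior, info_cell, event):
    """P(event | info_cell) under the given prior."""
    mass = sum(prior[s] for s in info_cell)
    return sum(prior[s] for s in info_cell & event) / mass

# Common uniform prior. True state is 1; agent A learns {1, 2},
# agent B learns {1, 3}.
common = {s: Fraction(1, 4) for s in states}
pA = posterior(common, {1, 2}, target)  # 1/2
pB = posterior(common, {1, 3}, target)  # 1/2 -- they already agree

# Same evidence, but B holds a different (skewed) prior: the announced
# posteriors now differ, and repeating the announcements cannot fix it.
skewed = {1: Fraction(1, 2), 2: Fraction(1, 6),
          3: Fraction(1, 6), 4: Fraction(1, 6)}
qB = posterior(skewed, {1, 3}, target)  # 3/4 -- persistent disagreement
```

The numbers and partitions here are invented for illustration; the point is only that the agreement result leans entirely on the prior being shared, which is the premise being questioned above.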

Comment author: TheAncientGeek 25 April 2014 04:49:11PM *  2 points [-]

People don't coincide in their priors, don't have access to the same evidence, aren't running off the same epistemology, and can't settle epistemological debates non-circularly…

There's a lot wrong with Aumann, or at least with the way some people use it.

Comment author: gwern 25 April 2014 06:28:02PM *  -2 points [-]

What actually happens is that the reasons for the summary judgements are examined.

Really? My understanding was that

Between each iteration of the questionnaire, the facilitator or monitor team (i.e., the person or persons administering the procedure) informs group members of the opinions of their anonymous colleagues. Often this “feedback” is presented as a simple statistical summary of the group response, usually a mean or median value, such as the average group estimate of the date before which an event will occur. As such, the feedback comprises the opinions and judgments of all group members and not just the most vocal. At the end of the polling of participants (after several rounds of questionnaire iteration), the facilitator takes the group judgment as the statistical average (mean or median) of the panelists’ estimates on the final round.

(From Rowe & Wright's "Expert opinions in forecasting: the role of the Delphi technique", in the usual Armstrong anthology.) From the sound of it, the feedback is often purely statistical in nature, and if it wasn't commonly such restricted feedback, it's hard to see why Rowe & Wright would criticize Delphi studies for this:

The use of feedback in the Delphi procedure is an important feature of the technique. However, research that has compared Delphi groups to control groups in which no feedback is given to panelists (i.e., non-interacting individuals are simply asked to re-estimate their judgments or forecasts on successive rounds prior to the aggregation of their estimates) suggests that feedback is either superfluous or, worse, that it may harm judgmental performance relative to the control groups (Boje and Murnighan 1982; Parenté, et al. 1984). The feedback used in empirical studies, however, has tended to be simplistic, generally comprising means or medians alone with no arguments from panelists whose estimates fall outside the quartile ranges (the latter being recommended by the classical definition of Delphi, e.g., Rowe et al. 1991). Although Boje and Murnighan (1982) supplied some written arguments as feedback, the nature of the panelists and the experimental task probably interacted to create a difficult experimental situation in which no feedback format would have been effective.
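The purely statistical feedback loop the quoted passages describe can be sketched as follows. This is a hypothetical simulation, not an implementation from the paper: the `pull` constant modeling how far panelists move toward the group median is invented for illustration.

```python
import statistics

def delphi(estimates, rounds=3, pull=0.5):
    """Run a Delphi-style poll in which the only feedback between
    rounds is the group median; no reasons or arguments are exchanged."""
    for _ in range(rounds):
        feedback = statistics.median(estimates)  # statistical summary only
        # Each panelist re-estimates after seeing the group median,
        # moving a fixed fraction of the way toward it.
        estimates = [e + pull * (feedback - e) for e in estimates]
    # Final group judgment: the statistical average of the last round.
    return statistics.median(estimates)

delphi([2.0, 4.0, 5.0, 9.0])  # -> 4.5
```

Note what the sketch makes vivid: the estimates contract toward the median, but the median itself carries no information about *why* any panelist chose their number, which is exactly the contrast with a committee meeting that the thread turns on.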

Comment author: RichardKennaway 25 April 2014 06:50:58PM 3 points [-]

What actually happens is that the reasons for the summary judgements are examined.

Really? My understanding was that

I was referring to what actually happens in a programme committee meeting, not the Delphi method.

Comment author: gwern 25 April 2014 06:57:06PM *  0 points [-]

I was referring to what actually happens in a programme committee meeting, not the Delphi method.

Fine. Then consider it an example of 'loony' behavior in the real world: Delphi pools have, as a matter of fact, operated for many decades by exchanging probabilities and updating repeatedly, and in a number of cases they have performed well (justifying their continued usage). You don't like Delphi pools? That's cool too, I'll just switch my example to prediction markets.

Comment author: RichardKennaway 25 April 2014 07:02:16PM *  3 points [-]

It would be interesting to conduct an experiment to compare the two methods for this problem. However, it is not clear how to obtain a ground truth with which to judge the correctness of the results. BTW, my further elaboration, with the example of one referee knowing that the paper under discussion was already published, was also non-fictional. It is not clear to me how any decision method that does not allow for sharing of evidence can yield the right answer for this example.

What have Delphi methods been found to perform well relative to, and for what sorts of problems?

Comment author: ChristianKl 25 April 2014 07:55:06PM *  -1 points [-]

However, it is not clear how to obtain a ground truth with which to judge the correctness of the results.

That assumes we don't have any criteria on which to judge good versus bad scientific papers.

You could train your model to predict the number of citations that a paper will get. You can also look at variables such as reproduced papers or withdrawn papers.

Define a utility function that collapses such variables into a single one. Run a real-world experiment in a journal, handling 50% of the paper submissions with one mechanism and 50% with the other. Let a few years go by, then evaluate the techniques with your utility function.
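The proposed experiment could be sketched like this. All names, weights, and outcome fields are hypothetical placeholders invented for illustration, not calibrated values:

```python
import random

def utility(paper):
    """Collapse the observable outcome variables into a single number.
    The weights are arbitrary placeholders."""
    return (paper["citations"]
            + 10 * paper["replicated"]
            - 50 * paper["retracted"])

def run_trial(submissions, mechanism_a, mechanism_b):
    """Randomly route half the submissions through each refereeing
    mechanism; after the follow-up period, compare the mean utility
    of what each mechanism accepted."""
    random.shuffle(submissions)
    half = len(submissions) // 2
    accepted_a = [p for p in submissions[:half] if mechanism_a(p)]
    accepted_b = [p for p in submissions[half:] if mechanism_b(p)]

    def score(batch):
        return sum(utility(p) for p in batch) / len(batch) if batch else 0.0

    return score(accepted_a), score(accepted_b)
```

The hard part, as the reply below this comment notes, is not the bookkeeping but obtaining the outcome variables: citations, replications, and retractions only accumulate years after the acceptance decision.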

Comment author: RichardKennaway 30 April 2014 08:14:54AM *  1 point [-]

You could train your model to predict the number of citations that a paper will get. You can also look at variables such as reproduced papers or withdrawn papers.

Define a utility function that collapses such variables into a single one. Run a real-world experiment in a journal, handling 50% of the paper submissions with one mechanism and 50% with the other. Let a few years go by, then evaluate the techniques with your utility function.

Something along those lines might be done, but an interventional experiment (creating journals just to test a hypothesis about refereeing) would be impractical. That leaves observational data-collecting, where one might compare the differing practices of existing journals. But the confounding problems would be substantial.

Or, more promisingly, you could do an experiment with papers that are already published and have a citation record, and have experimental groups of referees assess them, and test different methods of resolving disagreements. That might actually be worth doing, although it has the flaw that it would only be assessing accepted papers and not the full range of submissions.

Comment author: ChristianKl 30 April 2014 09:06:37AM 0 points [-]

Then there's no reason why you can't test different procedures in an existing journal.