Partial review / thoughts / summaries of "Psychology of Intelligence Analysis" (Work in progress)
This book generally reads as if a CIA analyst wrote "Thinking Fast and Slow" with CIA analysts as the target audience (although it was written in 1999, a decade earlier). Mostly it argues that the CIA should take cognitive biases and other intelligence failure modes seriously, and implement study and training to improve the situation. He has some overall suggestions on how to go about that, which I didn't find very surprising.
This web page is a review/summary (about 1.5 pages long) which I recommend reading if you want an overall sense of the book.
In subsequent comments here, I'll be jumping around to more concrete empirical claims that I could find.
Chapter 5: Do you really need more information?
Once an experienced analyst has the minimum information necessary to make an informed judgment, obtaining additional information generally does not improve the accuracy of his or her estimates. Additional information does, however, lead the analyst to become more confident in the judgment, to the point of overconfidence.
Experienced analysts have an imperfect understanding of what information they actually use in making judgments. They are unaware of the extent to which their judgments are determined by a few dominant factors, rather than by the systematic integration of all available information. Analysts actually use much less of the available information than they think they do.
Example Experiment: How many variables are relevant to betting on horses?
Eight experienced horserace handicappers were shown a list of 88 variables found on a typical horse-past-performance chart. Each handicapper identified the 5 most important items of information—those he would wish to use to handicap a race if he were limited to only five items of information per horse. Each was then asked to select the 10, 20, and 40 most important variables he would use if limited to those levels of information.
At this point, the handicappers were given true data (sterilized so that horses and actual races could not be identified) for 40 past races and were asked to rank the top five horses in each race in order of expected finish. Each handicapper was given the data in increments of the 5, 10, 20 and 40 variables he had judged to be most useful. Thus, he predicted each race four times—once with each of the four different levels of information. For each prediction, each handicapper assigned a value from 0 to 100 percent to indicate degree of confidence in the accuracy of his prediction.
When the handicappers’ predictions were compared with the actual outcomes of these 40 races, it was clear that average accuracy of predictions remained the same regardless of how much information the handicappers had available.
3 of the handicappers showed less accuracy as the amount of information increased, 2 improved their accuracy, and 3 were unchanged.
All, however, expressed steadily increasing confidence in their judgments as more information was received. This relationship between amount of information, accuracy, and the handicappers' confidence is shown in the graph below.
Paul Slovic, “Behavioral Problems of Adhering to a Decision Policy,” unpublished manuscript, 1973.
Graph of their confidence and accuracy:
The same relationships among amount of information, accuracy, and analyst confidence have been confirmed by similar experiments in other fields. (A footnote claims a list of references is available in Lewis R. Goldberg's "Simple Models or Simple Processes? Some Research on Clinical Judgments.")
Some relevant quotes from Scott’s review of Superforecasting are bolded below:
Scott notably reports that IQ, being well-informed, and math ability correlate only somewhat with forecasting ability, and that these traits don't do as good a job of distinguishing superforecasters from ordinary forecasters as one might expect.
On the other hand, AI Impacts did a review of data from the Good Judgment Project, the project behind Tetlock's conclusions, which suggests that some of these traits might actually be important -- particularly intelligence. Might be worth looking into the GJP data specifically with this question in mind.
I had a look over Uncertain Judgements: Eliciting Experts' Probabilities, mostly reading through the table of contents and jumping around to the bits which seemed relevant.
The book is pretty much exactly what the title says: it's all about how to accurately elicit experts' opinions, whatever those opinions might be (as opposed to trying to get the experts themselves to be accurate). A good deal of probability/statistics theory is explained (especially Bayesianism), along with heuristics-and-biases material like anchoring and adjustment, the affect heuristic, and the inside/outside view.
Some points:
On the whole, it's kinda interesting stuff on how to get at experts' actual beliefs, but nothing really specifically on the topic of getting consistent estimates. The closest thing to that seems to be the parts on eliciting coherent probability estimates from people, though generally the book mixes "accurately elicit experts' beliefs" with "get experts to have accurate, unbiased beliefs."
I feel like the simple Kahneman algorithms are amazing. Based on what I read in the Harvard Business Review article, this isn't six to eight complex variables; it's more like six cells in a spreadsheet. This has several implications:
Even in cases where such an algorithm is exceeded by expertise, its simplicity makes it easy to make the algorithm ubiquitous, which means we can use these algorithms to set a new floor on the quality of decisions in the relevant domain. That really seems like raising the sanity waterline.
Decisions: fast, cheap, good. Sometimes we can have all three.
“What do other people think?” is also a well known trick for getting people to be honest about opinions over which they expect to receive censure.
I'm somewhat newly against this way of asking questions, because I've answered questions framed this way by supplying responses based on my model of others rather than on my own preferences, revealing nothing about any opinion I might have been trying to hide. To be specific: the City of Berkeley had a survey asking people about housing policy, and they framed many of the questions like "How comfortable would your neighbors be with X in your neighborhood?" I answered based on how I think most of my neighbors would respond, not based on what I would prefer, and in fact I believe those two things are in opposition. When I realized they were using this technique, it felt like getting duped into betraying myself.
I view this technique as relying on reversed stupidity: people may not vote their preferences because it is psychically painful to admit those preferences to themselves (noticing them would cause cognitive dissonance), but they do project those preferences onto others, so you can find them out by surreptitiously asking about their projections. Like all reversed stupidity, though, it turns into plain stupidity if the original stupidity is not there for it to act against.
This is not to say these kinds of approaches can't work if you honestly care about the question they are asking. If you want to know what people think other people think, this is the right kind of question to ask (I sometimes intentionally ask questions like these to find out about people's models rather than what those models say about reality). But using them to try to "trick" people into telling you their honest opinions seems like a dishonest strategy that will not even always work.
Overview
Most existing forecasting or evaluation platform questions are for very clearly verifiable questions:
But many of the questions we care about do not look like this. They might…
An entity who could answer these questions well would be a very valuable asset. But what does well even mean here? We want people to be accurate, of course, but in many cases we also need their predictions/evaluations to be consistent to be actionable. This is especially true when fairness norms are in play, such as pricing[1] and prison sentencing.
There is a lot of research showing that people make inconsistent evaluations (with each other and themselves across time) across a wide variety of fields, even those that more closely resemble the “easy” questions above (valuing stocks, appraising real estate, sentencing criminals, evaluating job performance, auditing financial statements)[2]. It is even more difficult to consistently evaluate or predict novel questions or low-frequency events, like “Will India use a nuclear weapon on Pakistan by 1/1/20” or “How much counterfactual value has this organization created?”.
This paper is a shallow review of the literature around how to get entities to make consistent judgements. I want to note up front that a major limitation of this write-up and of shallow reviews in general is that I mostly relied on authors’ descriptions of their work and conclusions, rather than verifying their experimental design and conclusions for myself, or looking up others’ opinions of papers. As such, this post should be taken as a description of the state of the literature, not the state of the world.
Speaking of underspecified questions, “how to get consistent answers to complicated questions?” sure is one. I started this research project with a vague sense of an area from Ozzie Gooen; as I iterated, we came up with more specific questions. The following is a list of questions or hooks that came up as we discussed the research:
Some of these questions were answered in more detail than others, some were not answerable at all in the time available. Here is what I found.
Methods to Improve Consistency in Evaluations
Creating Composite Models
One idea that came up repeatedly in business literature was forcing predictors to build (potentially very crude) mathematical models.
Kahneman recommends the following procedure, which he calls creating a “reasoned rule” (summary from Jason Collins):
Richard H. Moss recommends a similar procedure in his paper on estimating climate change:
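To make the composite-model idea concrete, here is a minimal sketch of the kind of crude equal-weight rule these procedures point at: pick a handful of distinct, relevant variables, standardize each one across the pool of cases, and average the standard scores. The variable names and data below are hypothetical, and this is an illustration of the general approach rather than either author's exact procedure.

```python
from statistics import mean, stdev

def equal_weight_scores(cases, variables):
    """Crude composite model: standardize each variable across all cases, then average.

    `cases` is a list of dicts mapping variable name -> raw value (higher = better).
    Returns one summary score per case; rank cases by this score.
    """
    # Mean and standard deviation of each variable across the whole pool of cases.
    stats = {v: (mean(c[v] for c in cases), stdev(c[v] for c in cases)) for v in variables}

    scores = []
    for case in cases:
        # Standard score (z-score) for each variable, combined with equal weights.
        z = [(case[v] - stats[v][0]) / stats[v][1] for v in variables]
        scores.append(mean(z))
    return scores

# Hypothetical hiring example with a few simple, distinct predictors.
variables = ["years_experience", "work_sample_score", "structured_interview", "reference_rating"]
candidates = [
    {"years_experience": 2, "work_sample_score": 80, "structured_interview": 7, "reference_rating": 4},
    {"years_experience": 6, "work_sample_score": 65, "structured_interview": 8, "reference_rating": 3},
    {"years_experience": 4, "work_sample_score": 90, "structured_interview": 6, "reference_rating": 5},
]
for candidate, score in zip(candidates, equal_weight_scores(candidates, variables)):
    print(round(score, 2), candidate)
```

Even something this simple forces every evaluator to use the same inputs with the same weights, which is where the consistency gains come from.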
Costs of Consistency
It is trivial to get 100% consistency: just have everyone guess 0 every time. If you're feeling fancy, they could guess the base rate. Obviously this would be pointless, because you would learn nothing.
If two individuals are to come up with the same answer to a problem, they can only use information both of them have. This should on average damage the accuracy of the work (if it doesn't, you have more problems). This can be okay in certain circumstances: the penal system sometimes values predictability over getting exactly the right answer, and customers get irate if quoted widely varying prices. But often it is not okay, and false precision is harmful.
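As a toy illustration of the trade-off (hypothetical data, and only a sketch): two raters who always guess the base rate agree with each other perfectly, but their predictions carry no information about which individual questions will resolve positively.

```python
import random

random.seed(0)

# Hypothetical data: 200 binary questions, roughly 30% of which resolve "yes".
outcomes = [1 if random.random() < 0.3 else 0 for _ in range(200)]
base_rate = sum(outcomes) / len(outcomes)

def brier(predictions, actuals):
    """Mean squared error between probabilistic predictions and 0/1 outcomes (lower is better)."""
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals)

# Two perfectly "consistent" raters who always answer the base rate.
rater_a = [base_rate] * len(outcomes)
rater_b = [base_rate] * len(outcomes)
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(outcomes)

print(f"agreement between the two raters: {agreement:.0%}")
print(f"Brier score of the constant guess: {brier(rater_a, outcomes):.3f}")
print(f"base_rate * (1 - base_rate): {base_rate * (1 - base_rate):.3f}")
```

The constant guess scores exactly base_rate * (1 - base_rate); a forecaster who actually discriminates between questions can beat that, at the cost of some disagreement with other forecasters.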
Measures of Noise in Answers
There’s a robust field of inter-rater reliability statistics, of which The Handbook of Inter-Rater Reliability appears to be the best single source. Due to time constraints and the density of the subject I did not follow up on this further.
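For a flavor of the simplest of these statistics, here is a minimal from-scratch sketch (with made-up ratings) of Cohen's kappa, which corrects the raw agreement rate between two raters for the agreement they would reach by chance; the handbook above covers many more sophisticated measures.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa: observed agreement between two raters, corrected for chance agreement."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: probability both raters independently pick the same category.
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a | freq_b)
    return (observed - expected) / (1 - expected)

# Hypothetical example: two graders scoring the same ten items on a 1-3 scale.
grader_1 = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1]
grader_2 = [1, 2, 3, 1, 2, 1, 1, 3, 3, 1]
print(f"Cohen's kappa: {cohen_kappa(grader_1, grader_2):.2f}")  # raw agreement is 0.80; kappa ~0.69
```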
Measures of Ambiguity in Questions
I found no data on ambiguity in predictions or statement evaluation. The richest source of related data was on ambiguity in product requirement specifications. There are several systems for measuring ambiguity in natural language, the most prominent of which is LOLITA. Other systems include:
I found no data on the costs that ambiguous requirements exact, or on how much of that cost could be avoided with NLP systems. These systems had major types of ambiguity they could not detect and were not a substitute for human evaluation.
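As a very crude illustration of the lexical end of this kind of checking (far simpler than the systems listed above, which I have not examined in detail), here is a toy sketch that just flags common vagueness markers in requirement sentences; the word list is hypothetical.

```python
import re

# Hypothetical list of "weak phrases" of the kind often flagged in requirements documents.
VAGUE_MARKERS = [
    "as appropriate", "if possible", "etc", "and/or", "user-friendly",
    "fast", "some", "several", "flexible", "easy to use",
]

def flag_ambiguity(requirement):
    """Return the vagueness markers found in a single requirement sentence."""
    lowered = requirement.lower()
    return [m for m in VAGUE_MARKERS if re.search(r"\b" + re.escape(m) + r"\b", lowered)]

requirements = [
    "The system shall respond to search queries within 200 ms.",
    "The interface should be user-friendly and fast, supporting several file formats as appropriate.",
]
for req in requirements:
    print(flag_ambiguity(req), "<-", req)
```

Real tools go well beyond keyword lists to structural and semantic ambiguity, which is exactly the part that, per the above, still requires human evaluation.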
Subjective Judgements
I found very mixed results on whether subjective judgements could replace objective composite measurements, and no obvious trends in which areas were robust to subjective predictions: negative, positive, negative, positive, negative.
Papers tended to assume the objective measurements were more accurate, without considering how they could be tampered with. E.g., in this study of the Denver police, crime rates were not found to be heavily correlated with resident satisfaction. The paper seemed to think this was a deficit in the residents' understanding, as opposed to the police department interfering with crime statistics. So perhaps one area where subjective measurements are preferable is where nominally objective measurements are controlled by the institution being measured.
Limitations of This Paper and Future Work
Edit 9/16: This review paper, found by DanielFilan, looks even better.
Edit 9/16: Raemon describes this as "Thinking Fast and Slow" for CIA agents.
Edit 9/16: on skimming, Ruby did not find anything specifically related to consistency.
This report was funded by a forecasting infrastructure project managed by Ozzie Gooen, which is itself funded by a grant from the Effective Altruism Long Term Future Fund.
My raw notes are available here.
[1] While companies are typically trying to maximize profits, customers are often extremely sensitive to perceived injustices in pricing, and inconsistencies are perceived as injustices.
[2] List courtesy https://hbr.org/2016/10/noise.
9/16/2019: Made various updates based on other people's research, seen in the comments of this post, related questions, and privately shared write ups. Thanks to everyone for coming out.