Partial review / thoughts / summaries of "Psychology of Intelligence Analysis" (Work in progress)
This book generally reads as if a CIA analyst wrote "Thinking Fast and Slow" with CIA analysts as the target audience (although it was written in 1999, a decade earlier). Mostly it argues that the CIA should take cognitive biases and other intelligence failure modes seriously, and implement study and training to improve the situation. He has some overall suggestions on how to go about that, which I didn't find very surprising.
This web page is a review/summary (about 1.5 pages long) which I recommend reading if you want an overall sense of the book.
In subsequent comments here, I'll be jumping around to more concrete empirical claims that I could find.
Chapter 5: Do you really need more information?
Once an experienced analyst has the minimum information necessary to make an informed judgment, obtaining additional information generally does not improve the accuracy of his or her estimates. Additional information does, however, lead the analyst to become more confident in the judgment, to the point of overconfidence.
Experienced analysts have an imperfect understanding of what information they actually use in making judgments. They are unaware of the extent to which their judgments are determined by a few dominant factors, rather than by the systematic integration of all available information. Analysts actually use much less of the available information than they think they do.
Example Experiment: How many variables are relevant to betting on horses?
Eight experienced horserace handicappers were shown a list of 88 variables found on a typical horse-past-performance chart. Each handicapper identified the 5 most important items of information—those he would wish to use to handicap a race if he were limited to only five items of information per horse. Each was then asked to select the 10, 20, and 40 most important variables he would use if limited to those levels of information.
At this point, the handicappers were given true data (sterilized so that horses and actual races could not be identified) for 40 past races and were asked to rank the top five horses in each race in order of expected finish. Each handicapper was given the data in increments of the 5, 10, 20 and 40 variables he had judged to be most useful. Thus, he predicted each race four times—once with each of the four different levels of information. For each prediction, each handicapper assigned a value from 0 to 100 percent to indicate degree of confidence in the accuracy of his prediction.
When the handicappers’ predictions were compared with the actual outcomes of these 40 races, it was clear that average accuracy of predictions remained the same regardless of how much information the handicappers had available.
3 of the handicappers showed less accuracy as the amount of information increased, 2 improved their accuracy, and 3 were unchanged.
All, however, expressed steadily increasing confidence in their judgments as more information was received. This relationship between amount of information, accuracy, and the handicappers' confidence is shown in the graph below.
Paul Slovic, “Behavioral Problems of Adhering to a Decision Policy,” unpublished manuscript, 1973.
Graph of their confidence and accuracy:
The same relationships among amount of information, accuracy, and analyst confidence have been confirmed by similar experiments in other fields. (A footnote claims a list of references is available in Lewis R. Goldberg's "Simple Models or Simple Processes? Some Research on Clinical Judgments.")
Some relevant quotes from Scott’s review of Superforecasting are bolded below:
Scott notably reports that IQ, being well-informed, and math ability correlate only somewhat with forecasting ability, and that these traits don't do as good a job of distinguishing superforecasters from ordinary forecasters as one might expect.
On the other hand, AI Impacts did a review of data from the Good Judgment Project, the project behind Tetlock's conclusions, which suggests that some of these traits might actually be important -- particularly intelligence. Might be worth looking into the GJP data specifically with this question in mind.
I had a look over Uncertain Judgements: Eliciting Experts' Probabilities, mostly reading through the table of contents and jumping around to the bits which seemed relevant.
The book is pretty much exactly what the title says: it's all about how to accurately elicit experts' opinions, whatever those opinions might be (as opposed to trying to get the experts themselves to be accurate). A good deal of probability/statistics theory is explained (especially Bayesianism), along with heuristics-and-biases material like anchoring and adjustment, the affect heuristic, and the inside/outside view.
Some points:
On the whole, it's kinda interesting stuff on how to get at experts' actual beliefs, but nothing really specifically on the topic of getting consistent estimates. The closest thing to that seems to be the parts on eliciting coherent probability estimates from people, though generally the book mixes "accurately elicit experts' beliefs" with "get experts to have accurate, unbiased beliefs."
I feel like the simple Kahneman algorithms are amazing. Based on what I read in the Harvard Business Review article, this isn't six to eight complex variables; it's more like six cells in a spreadsheet. This has several implications:
Even in cases where such an algorithm is exceeded by expertise, its simplicity makes it easy to make the algorithm ubiquitous, which means we can use these algorithms to set a new floor on the quality of decisions in the relevant domain. That really seems like raising the sanity waterline.
Decisions: fast, cheap, good. Sometimes we can have all three.
“What do other people think?” is also a well known trick for getting people to be honest about opinions over which they expect to receive censure.
I'm somewhat newly against this way of asking questions, because I've answered questions framed this way by supplying responses based on my model of others rather than on my own preferences, revealing nothing about any opinion I might have been trying to hide. To be specific: the City of Berkeley had a survey asking people about housing policy, and they framed many of the questions like "How comfortable would your neighbors be with X in your neighborhood?" I answered based on how I think most of my neighbors would respond, not based on what I would prefer, and in fact I believe those two things are in opposition. When I realized they were using this technique, it felt like getting duped into betraying myself.
I view this technique as relying on reversed stupidity: people may not vote their preferences because it is psychically painful to admit those preferences to themselves (noticing them would cause cognitive dissonance), but they do project those preferences onto others, so you can find them out by surreptitiously asking about their projections. Like all reversed stupidity, though, it turns into plain stupidity if the original stupidity is not there for it to act against.
This is not to say these kinds of approaches can't work if you honestly care about the question they are asking. If you want to know what people think other people think, this is the right kind of question to ask (I sometimes intentionally ask questions like these to find out about people's models rather than what those models say about reality). But using them to try to "trick" people into telling you their honest opinions seems like a dishonest strategy that will not even always work.
Overview
Most existing forecasting or evaluation platform questions are for very clearly verifiable questions:
But many of the questions we care about do not look like this. They might…
An entity who could answer these questions well would be a very valuable asset. But what does well even mean here? We want people to be accurate, of course, but in many cases we also need their predictions/evaluations to be consistent to be actionable. This is especially true when fairness norms are in play, such as pricing[1] and prison sentencing.
There is a lot of research showing that people make inconsistent evaluations (with each other and themselves across time) across a wide variety of fields, even those that more closely resemble the “easy” questions above (valuing stocks, appraising real estate, sentencing criminals, evaluating job performance, auditing financial statements)[2]. It is even more difficult to consistently evaluate or predict novel questions or low-frequency events, like “Will India use a nuclear weapon on Pakistan by 1/1/20” or “How much counterfactual value has this organization created?”.
This paper is a shallow review of the literature around how to get entities to make consistent judgements. I want to note up front that a major limitation of this write-up and of shallow reviews in general is that I mostly relied on authors’ descriptions of their work and conclusions, rather than verifying their experimental design and conclusions for myself, or looking up others’ opinions of papers. As such, this post should be taken as a description of the state of the literature, not the state of the world.
Speaking of underspecified questions, “how to get consistent answers to complicated questions?” sure is one. I started this research project with a vague sense of an area from Ozzie Gooen; as I iterated, we came up with more specific questions. The following is a list of questions or hooks that came up as we discussed the research:
Some of these questions were answered in more detail than others, some were not answerable at all in the time available. Here is what I found.
Methods to Improve Consistency in Evaluations
Creating Composite Models
One idea that came up repeatedly in business literature was forcing predictors to build (potentially very crude) mathematical models.
Kahneman recommends the following procedure, which he calls creating a “reasoned rule” (summary from Jason Collins):
Richard H. Moss recommends a similar procedure in his paper on estimating climate change:
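To make the composite-model idea concrete, here is a minimal sketch of the kind of crude equal-weight rule these procedures point at: pick a handful of distinct, relevant variables, standardize each one across the pool of cases, and average the standard scores. The variable names and data below are hypothetical, and this is an illustration of the general approach rather than either author's exact procedure.

```python
from statistics import mean, stdev

def equal_weight_scores(cases, variables):
    """Crude composite model: standardize each variable across all cases, then average.

    `cases` is a list of dicts mapping variable name -> raw value (higher = better).
    Returns one summary score per case; rank cases by this score.
    """
    # Mean and standard deviation of each variable across the whole pool of cases.
    stats = {v: (mean(c[v] for c in cases), stdev(c[v] for c in cases)) for v in variables}

    scores = []
    for case in cases:
        # Standard score (z-score) for each variable, combined with equal weights.
        z = [(case[v] - stats[v][0]) / stats[v][1] for v in variables]
        scores.append(mean(z))
    return scores

# Hypothetical hiring example with a few simple, distinct predictors.
variables = ["years_experience", "work_sample_score", "structured_interview", "reference_rating"]
candidates = [
    {"years_experience": 2, "work_sample_score": 80, "structured_interview": 7, "reference_rating": 4},
    {"years_experience": 6, "work_sample_score": 65, "structured_interview": 8, "reference_rating": 3},
    {"years_experience": 4, "work_sample_score": 90, "structured_interview": 6, "reference_rating": 5},
]
for candidate, score in zip(candidates, equal_weight_scores(candidates, variables)):
    print(round(score, 2), candidate)
```

Even something this simple forces every evaluator to use the same inputs with the same weights, which is where the consistency gains come from.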
Costs of Consistency
It is trivial to get 100% consistency: just have everyone guess 0 every time. If you're feeling fancy, they could guess the base rate. Obviously this would be pointless, because you would learn nothing.
If two individuals are to come up with the same answer to a problem, they can only use information both of them have. This should on average damage the accuracy of the work (if it doesn't, you have more problems). This can be okay in certain circumstances: the penal system sometimes values predictability over getting exactly the right answer, and customers get irate if quoted widely varying prices. But often it is not okay, and false precision is harmful.
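As a toy illustration of the trade-off (hypothetical data, and only a sketch): two raters who always guess the base rate agree with each other perfectly, but their predictions carry no information about which individual questions will resolve positively.

```python
import random

random.seed(0)

# Hypothetical data: 200 binary questions, roughly 30% of which resolve "yes".
outcomes = [1 if random.random() < 0.3 else 0 for _ in range(200)]
base_rate = sum(outcomes) / len(outcomes)

def brier(predictions, actuals):
    """Mean squared error between probabilistic predictions and 0/1 outcomes (lower is better)."""
    return sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals)

# Two perfectly "consistent" raters who always answer the base rate.
rater_a = [base_rate] * len(outcomes)
rater_b = [base_rate] * len(outcomes)
agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(outcomes)

print(f"agreement between the two raters: {agreement:.0%}")
print(f"Brier score of the constant guess: {brier(rater_a, outcomes):.3f}")
print(f"base_rate * (1 - base_rate): {base_rate * (1 - base_rate):.3f}")
```

The constant guess scores exactly base_rate * (1 - base_rate); a forecaster who actually discriminates between questions can beat that, at the cost of some disagreement with other forecasters.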
Measures of Noise in Answers
There’s a robust field of inter-rater reliability statistics, of which The Handbook of Inter-Rater Reliability appears to be the best single source. Due to time constraints and the density of the subject I did not follow up on this further.
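For a flavor of the simplest of these statistics, here is a minimal from-scratch sketch (with made-up ratings) of Cohen's kappa, which corrects the raw agreement rate between two raters for the agreement they would reach by chance; the handbook above covers many more sophisticated measures.

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa: observed agreement between two raters, corrected for chance agreement."""
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    # Chance agreement: probability both raters independently pick the same category.
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a | freq_b)
    return (observed - expected) / (1 - expected)

# Hypothetical example: two graders scoring the same ten items on a 1-3 scale.
grader_1 = [1, 2, 3, 1, 2, 3, 1, 2, 3, 1]
grader_2 = [1, 2, 3, 1, 2, 1, 1, 3, 3, 1]
print(f"Cohen's kappa: {cohen_kappa(grader_1, grader_2):.2f}")  # raw agreement is 0.80; kappa ~0.69
```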
Measures of Ambiguity in Questions
I found no data on ambiguity in predictions or statement evaluation. The richest source of related data was on ambiguity in product requirement specifications. There are several systems for measuring ambiguity in natural language, the most prominent of which is LOLITA. Other systems include:
I found no data on the costs that ambiguous requirements exact, or on how much of that cost could be avoided with NLP systems. These systems had major types of ambiguity they could not detect and were not a substitute for human evaluation.
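As a very crude illustration of the lexical end of this kind of checking (far simpler than the systems listed above, which I have not examined in detail), here is a toy sketch that just flags common vagueness markers in requirement sentences; the word list is hypothetical.

```python
import re

# Hypothetical list of "weak phrases" of the kind often flagged in requirements documents.
VAGUE_MARKERS = [
    "as appropriate", "if possible", "etc", "and/or", "user-friendly",
    "fast", "some", "several", "flexible", "easy to use",
]

def flag_ambiguity(requirement):
    """Return the vagueness markers found in a single requirement sentence."""
    lowered = requirement.lower()
    return [m for m in VAGUE_MARKERS if re.search(r"\b" + re.escape(m) + r"\b", lowered)]

requirements = [
    "The system shall respond to search queries within 200 ms.",
    "The interface should be user-friendly and fast, supporting several file formats as appropriate.",
]
for req in requirements:
    print(flag_ambiguity(req), "<-", req)
```

Real tools go well beyond keyword lists to structural and semantic ambiguity, which is exactly the part that, per the above, still requires human evaluation.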
Subjective Judgements
I found very mixed results on whether subjective judgements could replace objective composite measurements, and no obvious trends in which areas were robust to subjective predictions: negative, positive, negative, positive, negative.
Papers tended to assume the objective measurements were more accurate, without considering how they could be tampered with. E.g., in this study of the Denver police, crime rates were not found to be heavily correlated with resident satisfaction. The paper seemed to think this was a deficit in the residents' understanding, as opposed to the police department interfering with crime statistics. So perhaps one area where subjective measurements are preferable is where nominally objective measurements are controlled by the institution being measured.
Limitations of This Paper and Future Work
Edit 9/16: This review paper, found by DanielFilan, looks even better.
Edit 9/16: Raemon describes this as "Thinking Fast and Slow" for CIA agents.
Edit 9/16: on skimming, Ruby did not find anything specifically related to consistency.
This report was funded by a forecasting infrastructure project managed by Ozzie Gooen, which is itself funded by a grant from the Effective Altruism Long Term Future Fund.
My raw notes are available here.
[1] While companies are typically trying to maximize profits, customers are often extremely sensitive to perceived injustices in pricing, and inconsistencies are perceived as injustices.
[2] List courtesy https://hbr.org/2016/10/noise.
9/16/2019: Made various updates based on other people's research, seen in the comments of this post, related questions, and privately shared write ups. Thanks to everyone for coming out.