Help improve reasoning evaluation in intelligence organisations

LThorburn

Cross-posted from the EA Forum.

TL;DR - My research group at the University of Melbourne is working to improve methods for evaluating quality of reasoning, particularly for use within government intelligence organisations. We’re conducting a study to compare a new evaluation method with the method currently used by intelligence agencies in the US. By participating, you get access to training materials in both the existing and proposed methods. You can sign up here.

Study Motivation

It is important that conclusions reached by analysts working in professional intelligence organisations are accurate so that resulting decisions made by governments and other decision-makers are grounded in reality. Historically, failures of intelligence have contributed to decisions or oversights that wasted resources and often caused significant harm. Prominent examples from US history include the attack on Pearl Harbor, the 1961 Bay of Pigs invasion, 9/11, and the Iraq War.

Such events are at least partly the result of institutional decisions made based on poor reasoning. To reduce the risk of such events, it is important that the analysis informing those decisions is well reasoned. We use the phrase well reasoned to mean that the arguments articulated establish the stated conclusion. (If the arguments fail to establish the stated conclusion, we say the analysis is poorly reasoned.)

The ‘industry standard’ method for evaluating quality of reasoning (QoR) amongst intelligence organisations in the US is the IC Rating Scale, a rubric based on a set of Analytic Standards issued by the US Office of the Director of National Intelligence (ODNI) in 2015. It remains unclear to what extent the IC Rating Scale is (or can be) operationalised to improve QoR in intelligence organisations. See here for a detailed summary, but in brief:

Inter-rater reliability of the Rating Scale between individual raters is poor. (Though reliability between aggregated ratings - constructed by averaging the ratings of multiple raters - is better.)
Information is lacking on whether or not the Rating Scale is valid (whether it in fact measures QoR, as intended).
Ambiguities in the specification of the Rating Scale can make it difficult for raters to apply.
The Rating Scale can be overly prescriptive and detailed, making it difficult to quickly distinguish well reasoned from poorly reasoned analytic products.
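
To illustrate the first point above (why averaged ratings can be more reliable than individual ones), here is a minimal sketch using the Spearman-Brown formula from classical test theory. It is not the analysis used in our study or in the work summarised above. It assumes raters are interchangeable, and the single-rater reliability value is hypothetical, chosen only for illustration.

```python
def spearman_brown(single_rater_reliability: float, n_raters: int) -> float:
    """Predicted reliability of the mean of n_raters ratings.

    Assumes raters are interchangeable (parallel measurements), which
    real raters only approximate.
    """
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


# Hypothetical single-rater reliability, NOT a figure from any study.
ILLUSTRATIVE_R = 0.3

for k in (1, 2, 5, 10):
    # Reliability of the averaged rating rises as more raters are pooled.
    print(k, round(spearman_brown(ILLUSTRATIVE_R, k), 2))
```

Under these assumptions, pooling more raters raises the reliability of the averaged rating with diminishing returns, which is consistent with aggregated ratings looking better than individual ones.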

Our research group has been developing an alternative method for evaluating QoR, provisionally called the Reasoning Stress Test (RST), which focuses on detecting particular types of reasoning flaws in written reasoning. The RST is designed to be easy to apply and efficient, but this approach comes at a cost: raters neither consider the degree to which the reasoning displays other reasoning virtues nor work through a checklist of the necessary and sufficient conditions of good reasoning.

We are conducting a study to compare the ability of participants trained in each method to discriminate between well and poorly reasoned intelligence-style products (among other research questions).

We are offering training in both the current and novel methods for evaluating QoR in return for participation in the study. The training has been designed primarily for intelligence analysis, so it will give you insight into how reasoning is evaluated in such institutions. However, the principles of reasoning quality taught are much more broadly applicable. They apply to all types of reasoning, and can be used to assess QoR in any institution with intelligence or analytical roles.

Methodological Note

We are aware that by publicly describing the potential limitations of the two methods—as we have done above—we risk prejudicing participants’ responses to either method in the study. The alternative, not to provide such information, would make it harder for you to decide whether the training is of interest. We decided to provide the information because:

we will be modelling the effect of existing familiarity with either method, rather than excluding participants on that basis;
in the context of our study design, it is difficult to articulate a plausible mechanism through which such prejudice could influence good faith participation in the study; and
at the current stage of research into methods for evaluating QoR, we believe that the value of additional data that may be gained by explaining the study motivation outweighs the potential limitations of that data as a result of this potential prejudicing effect.

Significant work has been done to develop polished, insightful training in both methods, and we are confident that learning the principles behind both methods, and how to apply them, will help you evaluate the reasoning of others.

What does participation involve?

Participating in the study involves:

Random allocation to one of the two reasoning evaluation methods
Training on how to use the method, including some simple review questions
A series of challenging fictional intelligence products (i.e. reports or assessments) to evaluate. In previous testing, we have found that many of these are very difficult to evaluate.
Expert responses to each question to compare to your own.
After you have completed all the training on the first method, you will be given access to the training material for the other method. You can choose to complete the training in the second method or not as you prefer.

Sign Up Link

You can sign up here.
