Cross-posted from the EA Forum.

TL;DR - My research group at the University of Melbourne is working to improve methods for evaluating quality of reasoning, particularly for use within government intelligence organisations. We’re conducting a study to compare a new evaluation method with the method currently used by intelligence agencies in the US. By participating, you get access to training materials in both the existing and proposed methods. You can sign up here.

Study Motivation

It is important that conclusions reached by analysts working in professional intelligence organisations are accurate so that resulting decisions made by governments and other decision-makers are grounded in reality. Historically, failures of intelligence have contributed to decisions or oversights that wasted resources and often caused significant harm. Prominent examples from US history include the attack on Pearl Harbor, the 1961 Bay of Pigs invasion, 9/11, and the Iraq War.

Such events are at least partly the result of institutional decisions based on poor reasoning. To reduce the risk of such events, it is important that the analysis informing those decisions is well reasoned. We use the phrase ‘well reasoned’ to mean that the arguments articulated establish the stated conclusion. (If the arguments fail to establish the stated conclusion, we say the analysis is ‘poorly reasoned’.)

The ‘industry standard’ method for evaluating quality of reasoning (QoR) amongst intelligence organisations in the US is the IC Rating Scale, a rubric based on a set of Analytic Standards issued by the US Office of the Director of National Intelligence (ODNI) in 2015. There are significant question marks over the extent to which the IC Rating Scale is (and can be) operationalised to improve the QoR in intelligence organisations. See here for a detailed summary, but in brief:

  • Inter-rater reliability of the Rating Scale is poor when comparing individual raters. (Reliability between aggregated ratings, constructed by averaging the ratings of multiple raters, is better.)
  • Information is lacking on whether or not the Rating Scale is valid (whether it in fact measures QoR, as intended).
  • Ambiguities in the specification of the Rating Scale can make it difficult for raters to apply.
  • The Rating Scale can be overly prescriptive and detailed, making it difficult to quickly distinguish well reasoned from poorly reasoned analytic products.
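As an aside on the first point, one standard way to see why averaged ratings tend to be more reliable than individual ones is the Spearman-Brown formula from classical test theory. The sketch below is only an illustration: it assumes raters are interchangeable ("parallel"), and the single-rater reliability of 0.3 is a made-up value, not a figure reported for the IC Rating Scale.

```python
def spearman_brown(single_rater_reliability: float, n_raters: int) -> float:
    """Predicted reliability of the mean rating of n_raters parallel raters.

    Assumes all raters have equal error variance and equal pairwise
    correlation (the classical 'parallel raters' assumption).
    """
    r = single_rater_reliability
    if not 0.0 <= r <= 1.0:
        raise ValueError("single_rater_reliability must be in [0, 1]")
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    return n_raters * r / (1 + (n_raters - 1) * r)


if __name__ == "__main__":
    # Hypothetical single-rater reliability, chosen purely for illustration.
    illustrative_r = 0.3
    for k in (1, 3, 5, 10):
        print(f"{k:>2} raters: predicted reliability {spearman_brown(illustrative_r, k):.2f}")
```

Under these assumptions, predicted reliability rises with the number of raters averaged but with diminishing returns, which is consistent with aggregated ratings faring better than individual ones.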

Our research group has been developing an alternative method for evaluating QoR, notionally called the Reasoning Stress Test (RST), which focuses on detecting the presence of particular types of reasoning flaws in written reasoning. The RST is designed to be easy to apply and efficient, but this approach comes at a cost: raters do not consider the degree to which the reasoning displays other reasoning virtues, nor do they go through a checklist of the necessary and sufficient conditions of good reasoning.

We are conducting a study to compare the ability of participants trained in each method to discriminate between well and poorly reasoned intelligence-style products (among other research questions).

We are offering training in both the current and novel methods for evaluating QoR in return for participation in the study. The training has been primarily designed for intelligence analysis, so will give you insight into how reasoning is evaluated in such institutions. However, the principles of reasoning quality taught are much more broadly applicable. They apply to all types of reasoning, and can be used to assess QoR in any institution with intelligence or analytical roles.

Methodological Note

We are aware that by publicly describing the potential limitations of the two methods—as we have done above—we risk prejudicing participants’ responses to either method in the study. The alternative, not to provide such information, would make it harder for you to decide whether the training is of interest. We decided to provide the information because:

  • we will be modelling the effect of existing familiarity with either method, rather than excluding participants on that basis;
  • in the context of our study design, it is difficult to articulate a plausible mechanism through which such prejudice could influence good faith participation in the study; and
  • at the current stage of research into methods for evaluating QoR, we believe that the value of additional data that may be gained by explaining the study motivation outweighs the potential limitations of that data as a result of this potential prejudicing effect.

Significant work has been done to develop polished, insightful training in both methods, and we are confident that learning the principles behind both methods, and how to apply them, will help you evaluate the reasoning of others.

What does participation involve?

Participating in the study involves:

  • Random allocation to one of the two reasoning evaluation methods
  • Training on how to use the method, including some simple review questions
  • A series of challenging fictional intelligence products (i.e. reports or assessments) to evaluate. In previous testing, we have found that many of these are very difficult to evaluate.
  • Expert responses to each question to compare to your own.
  • Access to the training material for the other method once you have completed all the training on the first. You can choose whether or not to complete the training in the second method.

You can sign up here.

---

Any questions, comments or suggestions welcome.

Comments

Hi, judging from your post history, you seem to be new to LessWrong. It might help if you add some info on why you think the study would be of interest to our specific community.

Thanks for your comment. This is my first post, but I have been reading LessWrong and adjacent sites for 6+ years, so I'm not unfamiliar with the rationalist community.

I don't think I have much to add beyond the pitch in the original post. This is an opportunity to help improve reasoning and decision-making in real-world institutions with significant influence. By participating you get access to training materials used to teach reasoning evaluation in such institutions, which may be of intrinsic methodological interest. Further, by completing the training you may learn new skills that you can apply to improve your own reasoning.

It is also an opportunity to benchmark your own reasoning evaluation skills against others: we will be sending out such feedback once the study is complete, and are currently looking at ways to incorporate benchmarking into the training itself.

This post seems both interesting and like a way to get a very unrepresentative sample.

Appreciate your comment - we are aware of that and this is just one of several recruitment avenues we are pursuing.