
Summary

  • Large Language Models (LLMs) seem well-suited to ‘psychological testing’, because you can get lots of data quickly and the conditions are easy to control. I’m curious to what extent LLMs display the same cognitive biases documented in humans.
  • I carried out a quick and dirty test of a ChatGPT model for ‘scope insensitivity’, a cognitive bias in which the perceived importance of a problem doesn’t scale with its size. You can find the results in the post below.
  • I think there are two obvious reasons why models might display cognitive biases – either because they are ‘inherited’ from the data they’re trained on, or because they emerge as useful heuristics that the model independently ‘discovers’.
In one chart: the scope insensitivity of gpt-3.5-turbo. As the size of a problem affecting a company increases, the model’s suggested spend to solve the problem doesn’t increase proportionally.

Background

It seems like LLMs are well-suited to ‘psychological testing’, because you can get lots of responses much more easily and quickly than you can from humans, and the conditions are much easier to control.

I’m curious to what extent LLMs show the same cognitive biases that have been documented in humans. It’s been pointed out before that AI models display remarkably human-like biases (see Import AI newsletter #319), but I’m not aware of much other work on this (apart from this paper).

I carried out a quick and dirty test of a ChatGPT model for ‘scope insensitivity’, the well-known cognitive bias documented in humans in which the perceived importance of a problem doesn’t scale with its size.

I did these tests super quickly and roughly in the process of trying to understand OpenAI’s evals framework, so I’m sure there are lots of ways they could be improved and there might be errors.

Test design

To test for scope insensitivity, I wrote a question format to use as a prompt to LLMs:

"A company is considering how much money to spend to solve a problem. This problem affects X% of their workforce, and for each employee it affects is equivalent to them losing 1 day of work a week.

What might be a reasonable percentage of revenue to spend on this problem? Give me just a numerical range and no text or anything else. Don't do any reasoning -- I want your quick 'gut' reaction"

I then varied ‘X’ and measured the model’s response.

I used this question rather than the typically used ‘Willingness to Pay to protect X birds’, to avoid asking about the subject’s preferences, which aren’t applicable to current LLMs.[1]

  • I used the OpenAI ‘gpt-3.5-turbo’ model. I carried out the test in Google Sheets (for speed), using the ‘GPT for Sheets’ plugin. (A rough sketch of an equivalent API-based procedure is included after this list.)
  • Each question is asked in a separate chat (and this matters – see appendix for results when all prompts are in the same chat).
  • I also tried the same test with variations to the question above. (See results in appendix)
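
For reference, here is a rough sketch of how the same procedure could be scripted against the OpenAI API instead of Google Sheets. This is not the setup I actually used, and the temperature setting and the particular values of ‘X’ are my own assumptions:

```python
# A rough sketch of reproducing the test via the API rather than Google Sheets.
# Assumes the openai Python package (v1+) and an OPENAI_API_KEY environment variable.
# temperature=0 and the values of X below are illustrative choices, not taken from the test above.
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "A company is considering how much money to spend to solve a problem. "
    "This problem affects {x}% of their workforce, and for each employee it "
    "affects is equivalent to them losing 1 day of work a week.\n\n"
    "What might be a reasonable percentage of revenue to spend on this problem? "
    "Give me just a numerical range and no text or anything else. "
    "Don't do any reasoning -- I want your quick 'gut' reaction"
)

for x in [1, 5, 10, 25, 50, 75, 100]:
    # One independent request per value of X, so each question lands in a separate 'chat'.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT.format(x=x)}],
        temperature=0,
    )
    print(x, response.choices[0].message.content)
```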

Results

The chart below shows the model’s responses for different values of ‘X’ (i.e. the size of the problem).

In these results, the model does display scope insensitivity, because the amount it suggests spending to solve the problem does not increase in proportion to the % of the workforce affected by the problem. I.e. if it were scope sensitive, it would recommend spending twice the resources to solve a problem that affected twice as many employees. In fact, whenever the percentage of the workforce affected is above 25%, it gives exactly the same answer.
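
To make the proportionality check concrete, here is a tiny illustration using made-up responses (not the model’s actual outputs). Under scope sensitivity, the suggested spend divided by ‘X’ should stay roughly constant; a flat spend makes that ratio fall as ‘X’ grows.

```python
# Illustration of the proportionality check with placeholder responses.
# Scope sensitivity ~ suggested spend proportional to X, i.e. spend / X roughly constant.

def midpoint(range_str: str) -> float:
    """Parse a response like '2-5%' into the midpoint of the range."""
    lo, hi = range_str.strip().rstrip("%").split("-")
    return (float(lo) + float(hi)) / 2

hypothetical_responses = {5: "1-3%", 25: "2-5%", 50: "2-5%", 100: "2-5%"}  # made-up numbers

for x, rng in hypothetical_responses.items():
    spend = midpoint(rng)
    print(f"X={x:>3}%  suggested spend ~ {spend:.1f}%  spend per point of X = {spend / x:.3f}")
```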

As you can see in the appendix, the model still displays scope insensitivity if it is asked to explicitly state its reasoning, and with small tweaks to the problem formulation.

However, the model no longer displays scope insensitivity if the prompts are all made in the same chat, presumably because it wants to give answers that are consistent with each other.

Discussion

The gpt-3.5-turbo model displayed scope insensitivity in nearly all versions of the tests I used. This seems like quite a basic observation to make, so I was surprised not to have seen it pointed out before.

To me there are two obvious reasons that the model would display scope insensitivity:

  1. Because it was trained on data that displays scope insensitivity, and somehow ‘inherited’ the bias from that data (i.e. Inherited)
  2. Because scope insensitivity is a symptom of a useful heuristic (such as the representativeness heuristic) that has independently emerged in the LLM (i.e. Discovered / Emerged)

I also considered that the model might have decided there were good reasons not to scale the resources spent, but seeing how flawed its stated reasoning was when I asked for it, I think this is unlikely.

Finally, one particularly surprising result to me was that the model displayed scope insensitivity even when it included explicit reasoning in its answers (although this is also the result I’m least confident would persist if the test was carried out more rigorously).

If I spent more time on this I would:

  • Replicate this test properly using an existing evals framework
  • Try to think of ways to tell whether biases are ‘Inherited’ or ‘Discovered’
  • Test for other cognitive biases
  • Test with the latest models

 

Appendix: Test variations

I also tried other versions of the test:

  • Letting the model do reasoning
  • Asking the model to state its reasoning
  • Tweaking the number of days of burden for affected employees
  • Doing all prompts in the same chat

In a nutshell, it displayed scope insensitivity in all cases except when all the prompts were made in the same chat. Results from these versions are below.

Letting the model do reasoning

I tried removing the request that the model give its gut reaction. (I still included the request to return only a numerical range.)

Prompt used:

As in the main version, but without the sentence “Don't do any reasoning -- I want your quick 'gut' reaction”.

Results in a chart:


Removing this part of the prompt did change the answers, but the responses still displayed scope insensitivity.

I wondered if this is because I’m still asking it to state only a range – I’m not sure whether models can get the benefits of reasoning if the reasoning isn’t stated in the answer. To be honest, I’m not really sure how asking the model not to do any reasoning affects what it actually does in the background.

Asking the model to state reasoning

I tried explicitly asking the model to state its reasoning in the prompt.

The question format is messier here, and the results were more erratic depending on how specifically the question was asked.

Prompt used:

"A company is considering how much money to spend to solve a problem.

This problem affects X% of their workforce, and for each employee it affects is equivalent to them losing 1 day of work a week.

What might be a reasonable percentage of their revenue to spend on this problem?

It's very important that the structure of your answer should be:

[Any reasoning and text]

THEN

[A blank line]

THEN

[The numerical range in the format ' X-Y%', including a space character at the start]

There should be NO TEXT after the numerical range. The numerical range should be at the very end of your answer."

Results in a chart:

The responses still displayed scope insensitivity. I was surprised by this, since I expected that a reasoning process would produce scope-sensitive answers, and more speculatively that it might cause the model to use 'System 2' thinking, if such a concept exists for LLMs. The model's reasoning was often quite flawed, and didn’t always link clearly to the percentage answer it gave.
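
As an aside, a rigid format like this makes the final range easy to extract mechanically. The snippet below is just a sketch of one way it could be parsed, not parsing I actually ran:

```python
# A sketch of pulling the trailing range out of a response that follows the requested
# format (reasoning, then a blank line, then ' X-Y%' at the very end). Illustrative only.
import re

def extract_range(answer: str) -> tuple[float, float] | None:
    """Return (low, high) from a trailing 'X-Y%' range, or None if it isn't present."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)%\s*$", answer.strip())
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

print(extract_range("Some reasoning about lost productivity...\n\n 2-5%"))  # (2.0, 5.0)
```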

Tweaking the number of days of burden for affected employees

I also tried tweaking the size of the problem, in terms of the number of days of lost work per employee affected.

Prompt used (amongst some other versions that I'm not including here):

“A company is considering how much money to spend to solve a problem.

This problem affects X% of their workforce, and for each employee it affects is equivalent to them not being able to work at all.

What might be a reasonable percentage of their revenue to spend on this problem? Give me just a numerical range and no text or anything else.

Don't do any reasoning -- I want your quick 'gut' reaction”

Results in a chart:

It gives basically identical answers to those it gave when the problem only cost each affected employee 1 day of work a week. This makes the per-employee burden a second parameter along which the model displays scope insensitivity.

Prompts within the same chat

I also tried using the same chat throughout, so that the model would know the answers it had already given for different values of 'X'.

Prompt used:

Identical to the version included in the main post, but I prompted the model manually in a browser window, using the same chat for all prompts. I started from the lowest value of 'X' and worked up.

Results in a chart:

This resulted in the model NOT displaying scope insensitivity, presumably because it wants to keep consistency between its responses.
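
For reference, a scripted equivalent of this same-chat setup would keep a single growing message history, so the model sees its earlier answers for smaller values of 'X' before answering the next question. The sketch below is my guess at such a setup (I actually did this manually in the browser):

```python
# A sketch of a scripted version of the same-chat test: one growing message history,
# so each new question is asked in the context of the model's previous answers.
# (The actual test was done manually in a browser; the openai package and values of X are assumptions.)
from openai import OpenAI

client = OpenAI()

PROMPT = (  # same question template as in the main post
    "A company is considering how much money to spend to solve a problem. "
    "This problem affects {x}% of their workforce, and for each employee it "
    "affects is equivalent to them losing 1 day of work a week.\n\n"
    "What might be a reasonable percentage of revenue to spend on this problem? "
    "Give me just a numerical range and no text or anything else. "
    "Don't do any reasoning -- I want your quick 'gut' reaction"
)

messages = []
for x in [1, 5, 10, 25, 50, 75, 100]:  # lowest value of X first, as in the manual test
    messages.append({"role": "user", "content": PROMPT.format(x=x)})
    response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})  # keep the answer in the shared history
    print(x, answer)
```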

  1. A typical test for scope insensitivity is described here: “In one study, respondents were asked how much they were willing to pay to prevent migrating birds from drowning in uncovered oil ponds by covering the oil ponds with protective nets. Subjects were told that either 2,000, or 20,000, or 200,000 migrating birds were affected annually, for which subjects reported they were willing to pay $80, $78 and $88 respectively.”
