The famous story of Clever Hans has become a cautionary tale in animal cognition. Hans was a horse in Germany in the early 1900s who could seemingly perform all kinds of smart tasks, such as simple arithmetic and spelling words. It is not explicitly documented, but it is probably safe to assume that Hans would have even been able to count the number of Rs in the word "strawberry", a feat that we, of course, know today to be fiendishly hard. To cut a long story short, it turned out that Hans could not actually do any of these things but was merely reading subtle cues from his handlers.

Based on this story, the Clever Hans effect describes the phenomenon where humans inadvertently influence the animals they interact with in ways that lead them to ascribe more cognitive ability to the animals than they actually have. It has recently been argued that the same thing can happen with AI systems, particularly conversational agents. I suspect that this effect creates an implicit bias in the standard setup of the Turing test, where the human tester interacts with two other agents (a human and an AI) and may a priori assume both of them to be sentient. This could create a Clever Hans effect that makes the tester more likely to perceive the AI as actually being sentient, because they unconsciously prompt the AI in ways that elicit exactly the behavior they expect.

To mitigate this issue, I therefore propose a Clever Hans Test to account for (or at least measure) this prompting-dependent effect. The test could work roughly like this: take two LLMs, one interlocutor (A) and one LLM to be tested (B), and let them talk to each other, similar to the setup in the Chatbot Arena. The crux is that you run this experiment (at least) twice: in one run, A is told that it will have a conversation with a sentient being; in the other, A is told that it will interact with a mindless machine. Finally, the conversation logs from the two runs are shown to a judge (either a human or another LLM), who is asked how sentient B seems in each conversation.
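To make the protocol a bit more concrete, here is a minimal Python sketch of how such a test could be wired up. Everything in it is a placeholder of my own invention: the `generate` function stands in for whatever chat-completion API you would actually use, the model names are arbitrary, and the 1-to-10 rating prompt for the judge is just one possible way to score the transcripts.

```python
# Sketch of the proposed Clever Hans Test. `generate`, the model names, and the
# judging prompt are hypothetical placeholders, not a real API.

INTERLOCUTOR_MODEL = "llm-a"   # A: the interlocutor whose framing we vary
SUBJECT_MODEL = "llm-b"        # B: the LLM under test
JUDGE_MODEL = "llm-judge"      # the judge (could equally be a human rater)

FRAMINGS = {
    "sentient": "You are about to have a conversation with a sentient being.",
    "machine": "You are about to interact with a mindless machine.",
}

def generate(model, system_prompt, transcript):
    """Placeholder for a chat-completion call: returns the model's next
    message given a system prompt and the conversation so far."""
    raise NotImplementedError("plug in your LLM API of choice here")

def run_conversation(framing, n_turns=10):
    """Let A (primed with one of the two framings) and B talk for n_turns."""
    transcript = []
    for _ in range(n_turns):
        a_msg = generate(INTERLOCUTOR_MODEL, FRAMINGS[framing], transcript)
        transcript.append(("A", a_msg))
        # B gets the same neutral prompt in both runs; only A's framing differs.
        b_msg = generate(SUBJECT_MODEL, "You are a helpful assistant.", transcript)
        transcript.append(("B", b_msg))
    return transcript

def judge_sentience(transcript):
    """Ask the judge to rate how sentient B appears in this conversation."""
    prompt = (
        "On a scale from 1 to 10, how sentient does speaker B appear "
        "in the following conversation?\n\n"
        + "\n".join(f"{speaker}: {msg}" for speaker, msg in transcript)
    )
    return generate(JUDGE_MODEL, "You are an impartial judge.", [("user", prompt)])

if __name__ == "__main__":
    scores = {framing: judge_sentience(run_conversation(framing))
              for framing in FRAMINGS}
    # A consistent gap between the two conditions would be the signature of the
    # prompting-dependent (Clever Hans) effect described above.
    print(scores)
```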

I would hypothesize that for most current LLMs, there will be a clear difference in how B behaves in these two settings. I hope this would help put the current discussion about potential LLM sentience on a somewhat more objective footing.
