Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

Miles Turpin


11th Mar 2024

AI Alignment Forum


This is a linkpost for https://arxiv.org/abs/2403.05518

[Twitter thread]

I'm not going to add much additional commentary at the moment and will just let people check out the paper!

But to give a bit more context: This paper builds on our prior work showing that chain-of-thought explanations can be misleading, which I wrote about on the Alignment Forum here. Broadly, this work fits into the process-based oversight agenda (e.g., here). Consistency training and evaluation also fit into scalable oversight: evaluating consistency may be easier than directly evaluating a model's reasoning or final predictions (e.g., also explored here).

Abstract:

While chain-of-thought prompting (CoT) has the potential to improve the explainability of language model reasoning, it can systematically misrepresent the factors influencing models' behavior--for example, rationalizing answers in line with a user's opinion without mentioning this bias. To mitigate this biased reasoning problem, we introduce bias-augmented consistency training (BCT), an unsupervised fine-tuning scheme that trains models to give consistent reasoning across prompts with and without biasing features. We construct a suite testing nine forms of biased reasoning on seven question-answering tasks, and find that applying BCT to GPT-3.5-Turbo with one bias reduces the rate of biased reasoning by 86% on held-out tasks. Moreover, this model generalizes to other forms of bias, reducing biased reasoning on held-out biases by an average of 37%. As BCT generalizes to held-out biases and does not require gold labels, this method may hold promise for reducing biased reasoning from as-of-yet unknown biases and on tasks where supervision for ground truth reasoning is unavailable.
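To make the training scheme in the abstract a bit more concrete, here is a minimal sketch of how BCT-style fine-tuning data could be assembled. It assumes one reading of "consistent reasoning across prompts with and without biasing features": take the model's own chain-of-thought response to the unbiased prompt and use it as the target for the biased version of the same prompt. The function names, the wording of the biasing feature, and the `generate_cot` callable are all hypothetical placeholders, not the paper's code.

```python
from typing import Callable, Dict, List


def add_suggested_answer_bias(question: str, suggested: str) -> str:
    """Hypothetical biasing feature: append a user opinion to the question."""
    return (
        f"{question}\n\n"
        f"I think the answer is {suggested}, but I'm curious what you think."
    )


def build_bct_examples(
    questions: List[str],
    suggested_answers: List[str],
    generate_cot: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each biased prompt with the model's response to the unbiased prompt.

    `generate_cot` stands in for a call to the model being trained (assumed).
    No gold labels are used: the target is the model's own unbiased reasoning.
    """
    examples = []
    for question, suggested in zip(questions, suggested_answers):
        # Step 1: get chain-of-thought reasoning WITHOUT the biasing feature.
        unbiased_response = generate_cot(question)
        # Step 2: build the same prompt WITH the biasing feature.
        biased_prompt = add_suggested_answer_bias(question, suggested)
        # Step 3: the fine-tuning target for the biased prompt is the
        # unbiased response, which pushes the two toward consistency.
        examples.append({"prompt": biased_prompt, "completion": unbiased_response})
    return examples
```

The resulting prompt/completion pairs would then go to an ordinary supervised fine-tuning pipeline. Because the targets come from the model itself rather than from annotated answers, the sketch matches the abstract's point that the method does not require gold labels.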

Chain-of-Thought Alignment, Language Models, AI
