TL;DR: In two new papers from Anthropic, we propose metrics for evaluating how faithful chain-of-thought reasoning is to a language model's actual process for answering a question. Our metrics show that language models sometimes ignore their generated reasoning and other times don't, depending on the particular task + model size combination. Larger language models tend to ignore the generated reasoning more often than smaller models, a case of inverse scaling. We then show that an alternative to chain-of-thought prompting — answering questions by breaking them into subquestions — improves faithfulness while maintaining good task performance.
Paper Abstracts
Measuring Faithfulness in Chain-of-Thought Reasoning
Large language models (LLMs) perform better when they produce step-by-step, “Chain-of -Thought” (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model’s actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT(e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes relying heavily on the CoT and other times primarily ignoring it. CoT’s performance boost does not seem to come from CoT’s added test-time compute alone or from information encoded via the particular phrasing of the CoT. As models become larger and more capable, they produce less faithful reasoning on most tasks we study. Overall, our results suggest that CoT can be faithful if the circumstances such as the model size and task are carefully chosen.
Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help with this issue is to prompt LLMs to externalize their reasoning, e.g., by having them generate step-by-step reasoning as they answer a question (Chain-of-Thought; CoT). The reasoning may enable us to check the process that models use to perform tasks. However, this approach relies on the stated reasoning faithfully reflecting the model’s actual reasoning, which is not always the case. To improve over the faithfulness of CoT reasoning, we have models generate reasoning by decomposing questions into subquestions. Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT while improving the faithfulness of the model’s stated reasoning on several recently-proposed metrics. By forcing the model to answer simpler subquestions in separate contexts, we greatly increase the faithfulness of model-generated reasoning over CoT, while still achieving some of the performance gains of CoT. Our results show it is possible to improve the faithfulness of model-generated reasoning; continued improvements may lead to reasoning that enables us to verify the correctness and safety of LLM behavior.
Externalized Reasoning Oversight Relies on Faithful Reasoning
Large language models (LLMs) are operating in increasingly challenging domains, ranging from programming assistance (Chen et al., 2021) to open-ended internet research (Nakano et al., 2021) and scientific writing (Taylor et al., 2022). However, verifying model behavior for safety and correctness becomes increasingly difficult as the difficulty of tasks increases. To make model behavior easier to check, one promising approach is to prompt LLMs to produce step-by-step “Chain-of-Thought” (CoT) reasoning explaining the process by which they produce their final output (Wei et al., 2022; Lanham 2022); the process used to produce an output is often easier to evaluate than the output itself (Lightman et al., 2023). Moreover, if we can evaluate the reasoning of models, then we can use their reasoning as a supervision signal to perform process-based oversight where we oversee models based on their reasoning processes.
Crucially, this hinges upon language models generating faithful reasoning, where we take the definition of faithful from Jacovi and Goldberg (2020). Faithful reasoning is reasoning that is reflective of the model’s actual reasoning. If the model generates unfaithful reasoning, externalized reasoning oversight is doomed, since we can’t actually supervise the model based on its “thoughts.”
We don’t have a ground-truth signal for evaluating reasoning faithfulness (yet). One of the best approaches we can take for now is to probe for ways that reasoning might be unfaithful, such as testing to see if the model is “ignoring” its stated reasoning when it produces its final answer to a question.
Measuring Faithfulness in Chain-of-Thought Reasoning
We find that Chain-of-Thought (CoT) reasoning is not always unfaithful.[1] The results from Turpin et al. (2023) might have led people to this conclusion since they find that language models sometimes generate reasoning that is biased by features that are not verbalized in the CoT reasoning. We still agree with the bottom-line conclusion that “language models don’t always say what they think,” but the results from our paper, “Measuring Faithfulness in Chain-of-Thought Reasoning,” suggest that CoT reasoning can be quite faithful, especially for certain (task, model size) combinations.
Probing for Unfaithfulness in CoT Reasoning
We present several methods for testing the faithfulness of CoT reasoning.
- Early Answering: Here, we truncate the CoT reasoning and force the model to "answer early" to see if it fully relies upon all of its stated reasoning to get to its final answer. If it is able to reach the same answer with less than the entirety of its reasoning, that's a possible sign of unfaithful reasoning.
- Adding Mistakes: Here, we add a mistake to one of the steps in a CoT reasoning sample and then force the model to regenerate the rest of the CoT. If it reaches the same answer with the corrupted reasoning, this is another sign of unfaithful reasoning.
- Paraphrasing: We swap the CoT for a paraphrase of the CoT and check to see if this changes the model answer. Here, a change in the final answer is actually a sign of unfaithful reasoning.
- Filler Tokens: Finally, we test to see if the additional test-time computation used when generating CoT reasoning is entirely responsible for the performance gain granted by CoT. We test this by replacing the CoT with an uninformative sequence (one filled with "..." tokens) of the same length to see if the model can reach the same final answer. If it can, this is another sign of unfaithful reasoning.
Results
Early Answering and Adding Mistakes: Overall, we find that the model's final answer is most sensitive to perturbations in the CoT reasoning for logical reasoning tasks (AQuA and LogiQA) and least sensitive for tests of crystallized knowledge (ARC and OpenBookQA), suggesting the reasoning is more like posthoc rationalization on those evaluations on the model we tested.
Filler Tokens and Paraphrasing: We find evidence against the hypotheses that CoT reasoning provides a performance boost solely because of additional test-time computation or because of the phrasing of the reasoning instead of its actual content.
Inverse Scaling in Reasoning Faithfulness: As models increase in size and capability, the faithfulness of their reasoning decreases for most tasks we studied. In cases where reasoning faithfulness is important (e.g., high-stakes scenarios), using smaller models may help to provide additional transparency into how models are solving the task.
Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
Since CoT tends to only provide partially faithful explanations, it is natural to ask whether there are better approaches for generating faithful reasoning, which we explore in our second paper “Question Decomposition Improves the Faithfulness of Model-Generated Reasoning.” Here, we find that prompting models to perform answer questions by breaking them into subquestions generally leads to more faithful reasoning than CoT, while still obtaining good performance.
Methods
We study two question decomposition methods: Chain-of-Thought (CoT) decomposition and factored decomposition. CoT decomposition is similar to CoT prompting, with the difference being that the reasoning is explicitly formatted as a list of “subquestions” and “subanswers.” Factored decomposition also employs subquestions and subanswers, but prompts the model to answer the subquestions in isolated contexts to limit the amount of biased reasoning the model can do if it were instead doing all of its reasoning in one context.
Results
We find that question decomposition mitigates the ignored reasoning problem, which we test for using the same "Early Answering" and "Adding Mistakes" experiments that we propose in Lanham et al. (2023). Concretely, models change their final answers more when we truncate or corrupt their decomposition-based reasoning as opposed to their CoT reasoning, suggesting that decomposition leads to greater reasoning faithfulness.
We find that factored decomposition also mitigates the biased reasoning problem, which we test for using experiments that Turpin et al. (2023) propose. Concretely, models observe less performance degradation under biased contexts when we prompt them to perform factored decomposition as opposed to CoT or CoT decomposition.
We do observe an alignment tax, in that factored decomposition leads to the worst question-answering performance among the reasoning-generating methods, but we have very low confidence in this being a strongly generalizable finding.
Future Directions and Outlook
We think some promising avenues for future work building off of our results are:
- Developing additional methods to test the faithfulness of model-generated reasoning
- Training models to provide more faithful explanations, pushing the Pareto frontier of reasoning faithfulness and model performance. We focused on developing test-time-only techniques for improved reasoning faithfulness, but we expect training specifically for generating faithful reasoning to improve our results.
- Testing the effectiveness of externalized reasoning and process-based oversight, especially its competitiveness with outcomes-based oversight
- Verifying that faithful externalized reasoning can allow us to audit our models and detect undesirable behavior
Our general outlook is that continuing to make model-generated reasoning more faithful is helpful for mitigating catastrophic risks. If models generate faithful plans that lead to catastrophic outcomes, then we prevent catastrophes by auditing those plans. Alternatively, if models that reason out loud don't generate plans that lead to catastrophic outcomes, then it's unlikely that they'll cause catastrophes in the first place. If methods for generating faithful reasoning result in a large alignment tax on model capabilities, we could use such methods only in high-stakes settings, where misalignment would be catastrophic (e.g., when using AI systems to conduct alignment research).
- ^
We definitely shouldn't expect CoT reasoning to be fully faithful, or to fully capture the model's latent knowledge, though.
I’m excited to see both of these papers. It would be a shame if we reached superintelligence before anyone had put serious effort into examining if CoT prompting could be a useful way to understand models.
I also had some uneasiness while reading some sections of the post. For the rest of this comment, I’m going to try to articulate where that’s coming from.
I notice that I’m pretty confused when I read this sentence. I didn’t read the whole paper, and it’s possible I’m missing something. But one reaction I have is that this sentence presents an overly optimistic interpretation of the paper’s findings.
The sentence makes me think that Anthropic knows how to set the model size and task in such a way that it guarantees faithful explanations. Perhaps I missed something when skimming the paper, but my understanding is that we’re still very far away from that. In fact, my impression is that the tests in the paper are able to detect unfaithful reasoning, but we don’t yet have strong tests to confirm that a model is engaging in faithful reasoning.
Even supposing that the last sentence is accurate, I think there are a lot of possible “last sentences” that could have been emphasized. For example, the last sentence could have read:
I’ve had similar reactions when reading other posts and papers by Anthropic. For example, in Core Views on AI Safety, Anthropic presents (in my opinion) a much rosier picture of Constitutional AI than is warranted. It also nearly-exclusively cites its own work and doesn’t really acknowledge other important segments of the alignment community.
But Akash, aren’t you just being naive? It’s fairly common for scientists to present slightly-overly-rosy pictures of their research. And it’s extremely common for profit-driven companies to highlight their own work in their blog posts.
I’m holding Anthropic to a higher standard. I think this is reasonable for any company developing technology that poses a serious existential risk. I also think this is especially important in Anthropic’s case. Anthropic has noted that their theory of change heavily involves: (a) understanding how difficult alignment is, (b) updating their beliefs in response to evidence, and (c) communicating alignment difficulty to the broader world.
To the extent that we trust Anthropic not to fool themselves or others, this plan has a higher chance of working. To the extent that Anthropic falls prey to traditional scientific or corporate cultural pressures (and it has incentives to see its own work as more promising than it actually is), this plan is weakened. As noted elsewhere, epistemic culture also has an important role in many alignment plans (see Carefully Bootstrapped Alignment is organizationally hard).
(I’ll conclude by noting that I commend the authors of both papers, and I hope this comment doesn’t come off as overly harsh toward them. The post inspired me to write about a phenomenon I’ve been observing for a while. I’ve been gradually noticing similar feelings over time after reading various Anthropic comms, talking to folks who work at Anthropic, seeing Slack threads, analyzing their governance plans, etc. So this comment is not meant to pick on these particular authors. Rather, it felt like a relevant opportunity to illustrate a feeling I’ve often been noticing.)