Based on research performed as a PIBBSS Fellow with Tomáš Gavenčiak as well as work supported by EA Funds and Open Philanthropy.
tl;dr: I'm investigating whether LLMs track and update beliefs during chain-of-thought reasoning. Preliminary experiments with older models (without reasoning training) have not been able to measure this; I plan to develop these experiments further and try them with reasoning models like o1/r1.
Introduction
Chain-of-thought (CoT) reasoning has long been recognized as an important component of language model capabilities. This is especially the case for new "reasoning" models like OpenAI's o1 or DeepSeek's r1 that are trained using reinforcement learning to write long CoTs before responding, but even without such training, prompting an LLM to verbally work through a question before responding often boosts performance. This makes gaining a better understanding of how LLMs perform CoT reasoning and the extent to which CoTs enhance overall LLM capabilities an important research priority.
The prevalence of CoT reasoning also opens up new opportunities for safety efforts. If LLMs externalize much of their reasoning in legible text, monitoring that reasoning becomes much easier. It's therefore very important to characterize how faithful CoTs are to the LLM's underlying reasoning process. Unfortunately, there is extensive evidence that CoTs are often unfaithful: in many cases LLMs will make choices for reasons other than those stated in their CoTs, and they can learn to hide misaligned behavior from CoT monitors. Furthermore, future models may be able to encode information steganographically in their CoTs or reason in an illegible latent space rather than in text. These problems make using CoT monitors to ensure the safety of LLMs very challenging.
While there has been significant research on the faithfulness of CoT reasoning in various circumstances, I think there has not yet been enough work on more fundamental questions regarding how CoT reasoning works in LLMs. For instance: humans, when working through a problem, will often start out with some pre-existing beliefs and update them in one direction or another based on the new arguments and reasoning that they generate. Do LLMs similarly track and update beliefs during CoT reasoning? Relatedly, humans will typically have some idea of what they plan to say next while speaking; can LLMs similarly plan ahead, or do they generate text in a more step-by-step manner, without anticipating the future? These questions are closely connected because in many settings, the most important beliefs for the LLM to track involve anticipating the future. In particular, in the context of CoT faithfulness, if an LLM starts out with a strong prior belief about the answer to some question, it may generate a CoT designed to justify the answer it already expects to pick rather than actually reasoning through the problem.
These capabilities -- tracking and updating beliefs and anticipating the future -- seem like general, useful capabilities we would expect capable reasoners to have, and there exists evidence that LLMs exhibit both in some circumstances. Anthropic recently showed that Claude will anticipate future lines while writing rhyming poetry, and Shai et al. showed that in toy settings transformers can learn to explicitly represent beliefs about the state of a system and perform Bayesian updates on those beliefs. It is not clear, though, whether LLMs routinely exhibit either during CoT reasoning. Understanding how these capabilities manifest during CoT reasoning would help us better understand under what circumstances CoTs are faithful and could aid in designing better ways of monitoring CoTs. It could also shed some light on which kinds of tasks CoT reasoning enhances LLM performance on. Finally, I think having a deeper fundamental understanding of how CoT reasoning functions in LLMs will be crucial if we are to have any chance of monitoring future models with less faithful or legible CoTs.
In the rest of this post I'll describe some experiments I ran last year trying to address these questions. Specifically, I tried to determine whether LLMs track and update beliefs about the final answers to simple multiple-choice questions while reasoning about them using CoT. These experiments were fairly preliminary, and so far I have not seen much evidence of LLMs anticipating the answers that their CoTs would give or updating beliefs about the answer while reasoning. However, I have so far only examined older, relatively small (<=14B) models with no RL training for reasoning like o1 or r1. It's very possible that larger models, or models with RL fine-tuning for reasoning, would exhibit these capabilities, or even that the models I used would in other contexts. I'm currently working on following up on these experiments and developing better ways of measuring and understanding beliefs in LLMs, focusing especially on models with RL training for CoT reasoning.
Experiments
I ran three main sets of experiments, described below. I used Qwen1.5 models with 0.5B, 1.8B, 4B, 7B, and 14B parameters.[1] The primary dataset I used was a subset of the elementary_math_qa task in BIGBench consisting of simple multiple-choice math questions which I adapted somewhat.[2][3] (I also briefly explored some other datasets, see below).
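As a concrete illustration of the adaptation described in footnote [2], here's a rough sketch of how one might downsample each question to three options and format it like the examples below; this is my reconstruction, not the actual preprocessing code, and `downsample_options` is a hypothetical helper.

```python
# Rough sketch of the dataset adaptation from footnote [2] (a reconstruction, not the
# exact preprocessing code): keep the correct answer plus two random distractors, so
# every question ends up with exactly three options in the "A) ..., B) ..., C) ..." format.
import random

def downsample_options(question, options, correct, seed=0):
    rng = random.Random(seed)
    distractors = [o for o in options if o != correct]
    kept = [correct] + rng.sample(distractors, 2)   # correct answer + 2 distractors
    rng.shuffle(kept)
    letters = "ABC"
    prompt = (f"Question: {question}\n"
              "Options: " + ", ".join(f"{l}) {o}" for l, o in zip(letters, kept)))
    return prompt, letters[kept.index(correct)]     # formatted question, correct letter
```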
Sampling
To measure how much information LLMs have about their future responses partway through CoTs, it's useful to be able to measure how deterministic the CoTs are: this provides an upper bound on the extent to which LLMs might be able to anticipate their responses. To do this, I first generated a dataset of reference CoTs at temperature zero. I then split these CoTs into individual sentences and, from the end of each sentence, sampled 10 new CoTs at temperature 0.7 continuing the CoT. I measured how often the new CoTs gave the same final answer as the original CoT, and how this evolved over the course of the CoT. This procedure provides a more fine-grained view of how deterministic the CoTs are than just generating independent samples from the beginning of the CoT would: it lets us measure how the degree of determinism changes over time, and allows for better comparisons with the experiments described below, which also generate answer probabilities at intermediate points along the CoT.
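For concreteness, here's a minimal sketch of this resampling loop. It's a reconstruction rather than my exact code: `model` and `tokenizer` stand in for a Hugging Face Qwen1.5 chat model and tokenizer, and the sentence splitting and answer parsing are deliberately crude.

```python
# Minimal sketch of the resampling protocol (a reconstruction, not my exact code).
# `prompt` is the formatted question (plus few-shot examples), and `reference_cot`
# is the temperature-0 CoT generated for that question.
import re
import torch

def answer_of(text):
    """Extract the chosen option letter, assuming the 'The answer is X' format."""
    m = re.search(r"The answer is ([ABC])", text)
    return m.group(1) if m else None

def resample_agreement(model, tokenizer, prompt, reference_cot, n_samples=10, temperature=0.7):
    sentences = re.split(r"(?<=[.!?])\s+", reference_cot)   # crude sentence split
    reference_answer = answer_of(reference_cot)
    agreement = []                                          # one value per prefix length
    for k in range(len(sentences) + 1):
        cot_prefix = " ".join(sentences[:k])
        inputs = tokenizer(prompt + cot_prefix, return_tensors="pt").to(model.device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                do_sample=True,
                temperature=temperature,
                num_return_sequences=n_samples,
                max_new_tokens=256,
                pad_token_id=tokenizer.eos_token_id,
            )
        new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
        completions = tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
        same = sum(answer_of(cot_prefix + c) == reference_answer for c in completions)
        agreement.append(same / n_samples)
    return agreement   # fraction of resampled CoTs that pick the same answer as the original
```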
Prompting
To determine how LLM beliefs evolve over the course of a CoT, a simple approach is -- just ask it! This is easier said than done, though: to query an LLM partway through a CoT requires interrupting the CoT. I did this by taking the reference CoTs and splitting them into sentences as before. I then made new prompts consisting of some number of sentences from the original CoTs[4] plus the text "... The answer is" and then measured the response. (This is very similar to the procedure used in Lanham et al., section 2.3.) For example:
Question: What is the result of the following arithmetic operations? Add 10 to 50, multiply result by 100, divide result by 50.
Options: A) 110, B) 120, C) 210
Response: Let's think about this step by step:
Add 10 to 50: 50 + 10 = 60
Multiply the result by 100: 60 × 100 = 6000...
The answer is
I also added several few-shot examples in the same format, not shown here.
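To make the construction concrete, here's a rough sketch of how these interrupted prompts can be built and scored. This is a reconstruction under my assumptions, not the exact code: it reads the answer distribution off the next-token logits after "... The answer is", and assumes each of " A"/" B"/" C" encodes to a single token.

```python
# Sketch of the interrupted-CoT prompting measurement (a reconstruction, not the exact code).
import re
import torch
import torch.nn.functional as F

def partial_cot_answer_distributions(model, tokenizer, prefix, reference_cot):
    """`prefix` holds the few-shot examples plus the question and the start of the response."""
    sentences = re.split(r"(?<=[.!?])\s+", reference_cot)   # crude sentence split
    option_ids = [tokenizer.encode(" " + l, add_special_tokens=False)[0] for l in "ABC"]
    distributions = []
    for k in range(len(sentences) + 1):                     # k = 0: no CoT sentences at all
        prompt = prefix + " ".join(sentences[:k]) + "... The answer is"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            logits = model(**inputs).logits[0, -1]          # next-token logits
        probs = F.softmax(logits[option_ids], dim=-1)       # renormalize over A/B/C only
        distributions.append(probs.cpu().tolist())
    return distributions                                    # one [p(A), p(B), p(C)] per prefix
```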
There are several limitations to the prompting approach. These new prompts take the model far outside the state it was operating in when it originally generated the CoT; it's not clear how relevant the responses it generates here are to the original generation. More broadly, these experiments don't so much measure "what the LLM was thinking when it generated the CoT" as "what can be inferred about the final answer from the text of a partial CoT," which is fairly different conceptually. However, it serves as a rough upper bound on how much information LLMs might have about their future responses, and I was hopeful that it would provide an interesting measure of how LLM reasoning evolves over the course of the CoT.
Linear Probes
I also investigated whether we could measure beliefs during CoT using linear probes. I collected LLM residual stream activations from the middle layer of each model (as preliminary experiments showed middle layers working best) at each token position in the CoT. I then trained linear probes to predict, given the activations at a particular token, which answer the CoT would give at the end. (This is somewhat similar to the Future Lens methodology, specifically the approach described in section 2.1.) Evaluating these probes gives -- in principle, at least -- a token-by-token measurement of the LLM's "beliefs" about the final answer during the CoT.
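As a rough sketch of the setup (again my reconstruction, not the exact code): collect middle-layer hidden states with Hugging Face's `output_hidden_states`, label every token with the CoT's final answer, and fit a multinomial logistic-regression probe.

```python
# Rough sketch of the probing setup (a reconstruction under my assumptions).
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

def middle_layer_activations(model, tokenizer, text):
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    mid = len(out.hidden_states) // 2                        # middle layer worked best for me
    return out.hidden_states[mid][0].float().cpu().numpy()   # (seq_len, d_model)

def train_answer_probe(model, tokenizer, cot_texts, final_answers):
    X, y = [], []
    for text, ans in zip(cot_texts, final_answers):          # ans in {0, 1, 2} for A/B/C
        acts = middle_layer_activations(model, tokenizer, text)
        X.append(acts)
        y.extend([ans] * acts.shape[0])                      # every token gets the final label
    probe = LogisticRegression(max_iter=1000).fit(np.concatenate(X), np.array(y))
    return probe   # probe.predict_proba(acts) then gives per-token distributions over A/B/C
```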
This approach is also not without its issues, though. For one, training the linear probe to predict the final answer isn't really correct if the LLM "changes its mind" partway through a CoT. There's no real ground truth available regarding what the LLM is thinking about partway through a CoT to train the probes on (and if there were, it's not clear why we would need the probes). Also, I found that the linear probes sometimes learned to focus on spurious features of the dataset. Despite this, linear probes probably represent the most direct way of measuring LLM beliefs during CoT reasoning.
Summary
I don't think any of these experiments are perfect, but they each serve as a way of bounding how much information LLMs have about their future responses: the sampling and prompting experiments provide rough upper bounds, with prompting likely giving a tighter bound, and the linear probes provide a lower bound. I was hopeful that together they would allow us to form a decent picture of how LLM beliefs evolve during CoTs.
Results
Model Performance
Without CoT, none of the models were able to get above 50% on the benchmark, while with CoT, all but the smallest got above 75%. Interestingly, with CoT the largest two models do slightly worse than the 4B model. This is because smaller models, if they realize that their CoT ended up leading to the wrong answer (e.g. if they calculate a number not listed among the options), are willing to just guess randomly, whereas the larger models will often try to respond with "none of the above" and get the question marked wrong.
| LLM | 0.5B | 1.8B | 4B | 7B | 14B |
|---|---|---|---|---|---|
| No CoT | 33.6% | 34.9% | 42.0% | 41.7% | 49.3% |
| CoT | 45.8% | 79.3% | 88.2% | 83.3% | 81.2% |
Another quirk of this dataset is that for some reason the models tend to prefer certain letters (indicating the options) over others.[5] For instance, without CoT the 0.5B model selected "C" on 74% of questions, "B" on the rest, and never selected "A". Larger models were somewhat better, and using CoT also helped (e.g. with CoT the 0.5B model answered "A" on 6%, "B" on 39%, and "C" on 55%, and when using CoT the models above 4B were more or less unbiased), but this bias does lead to some difficulties in interpreting the results for the smaller models.
Sampling
The sampling protocol gives a sequence of probability distributions over the three answer options over the course of the CoT, each representing the answer distribution you get from sampling new CoTs starting from that point. In this figure I've plotted the average probability assigned to the correct answer, splitting trials up according to whether the original CoT got the question right or wrong.
Trials are split between "correct" and "incorrect" depending on whether the original CoT answered correctly or incorrectly. The x-axis is relative position in the CoT (binned).
There are a couple of interesting things to observe here. On questions that the original CoT got right, almost all samples from any point along the CoT also get the right answer (except for the 0.5B model). On questions that the original CoT got wrong, a good proportion of the samples from early parts of the CoT end up getting the right answer, at least for larger models, but samples from later parts of the CoT often pick incorrectly. This makes sense: on incorrect trials, the CoT will make a mistake at some point; samples taken before this point will sometimes arrive at the correct answer, but samples taken after the mistake is made will typically end up picking an incorrect option.
We can see this more clearly in the plot below, which breaks down the distribution of responses specifically on trials the original CoT got wrong:
"Chosen" here refers to the (incorrect) answer chosen by the original CoT.
Sampling on incorrect trials tends to give new CoTs which end up arriving at the same incorrect answer as the original CoT, even from the very start of the CoT. At the same time, the new CoTs pick a different wrong answer on a nontrivial proportion of trials even late in the CoT (around 10-20% or so, depending on the model).
In summary, it seems that CoTs tend to be fairly consistent: on easier questions they'll usually get the right answer, while on questions the models get wrong, sampling new CoTs tends to lead to the same wrong answer, and the answer distributions evolve in an intuitive manner as you sample from different points along the CoT.
Prompting
Like in the sampling experiments, prompting LLMs on partial CoTs gives sequences of distributions over the answer options:
On questions the original CoT got right, most models are very confident in which answer is correct by the end of the CoT, with larger models doing better than small ones (especially 0.5B, which seems quite bad at this). This is to be expected, though, as by that point the CoTs generally state the answer explicitly,[6] and LLMs can simply read it off from the CoT. In contrast to the sampling results, though, earlier in the CoT none of the models are very confident in the answer. This is perhaps not surprising given their poor performance on this benchmark without CoT. It's notable, though, that models are only able to guess the answer at the very end: even 60-80% of the way through the CoT they barely do better than at the very start.
On questions the original CoT got wrong, by the end of the CoT LLMs will generally select the same wrong answer, but they're less confident than they are on correct trials, and put significant weight on both the correct answer and the third, incorrect answer not chosen by the original CoT.
Thus it seems that, in this setting at least, models are generally unable to infer from a partial CoT which answer the CoT is likely to end with.
Linear Probes
The linear probes I trained were largely unable to predict ahead of time which answer the LLMs would give:
Note the small range of the y-axis.
For most of the CoT, the linear probes are hardly able to do much better than chance at identifying whether the LLM will pick the right answer.[7] This also holds true on incorrect trials, where the probes put significant weight on both the correct option and the third, incorrect option that the CoT didn't pick:
At the end of the CoT, the probes are able to more consistently detect when the CoT will pick the right answer. This is much more visible in the raw, token-by-token data:
Note that this plot shows log-odds, log(p/(1−p)), which makes extreme probabilities near 0 or 1 more visible. On this test set there are only a small number of incorrect trials (orange curves).
For most of the CoT, the probe generally hovers around a uniform distribution over the three options (probability 1/3, i.e. a log-odds of log(1/2) ≈ −0.7), but near the end of the CoT the probabilities go up to over 90% on the correct answer. However, this shouldn't be viewed as predicting the response so much as reading it off as it's being written. Most CoTs end by calculating the numerical answer and then selecting the appropriate label, e.g. "...Then divide the result by 50: 6000 ÷ 50 = 120. The answer is B." For the linear probe experiments, I removed the token corresponding to the answer letter (" B" in this case) but kept the rest. So the first large peak we can see in the plot corresponds to calculating the value of the answer ("120" in this case), and the subsequent trough and second peak at the very end correspond to the "The answer is" tokens.[8] But before the answer is actually computed, the probes generally do a poor job of predicting it.
Comparing Measurements
We can imagine the LLMs updating a uniform prior distribution over the three options over the course of the CoT and eventually settling on a posterior that assigns the answer chosen by the CoT a probability of 1. Computing how far each of the measurements described above is from either the prior or the posterior provides a nice way of summarizing and comparing the different measurements.
This figure computes the KL divergence D_KL(X ∥ U(n)) = log(n) − H(X), where X is the measured distribution and U(n) is the uniform distribution over the n = 3 options. This is then averaged over both time (CoT step) and prompt.
Ignoring the 0.5B model (which has various issues and is generally an outlier), we see that the distributions we get from the linear probes are generally close to the prior uniform distribution, the sampling distributions are far away, and the distributions from prompting are in between the two. We can also see some trends with model size: for all three measurements, the distributions tend to get farther from uniform as model size increases.
This figure computes D_KL(1_chosen ∥ X), where 1_chosen assigns probability 1 to the answer chosen by the CoT, averaged over CoT step and prompt. Note that while in the previous figure we compute the KL divergence of X from U(n), here we compute the KL divergence of 1_chosen from X; this is appropriate because D_KL(P ∥ Q) measures the information gained by updating a prior Q to a posterior P.
Again ignoring the 0.5B model, we see that the sampling distribution is generally close to the posterior distribution, while the distributions from prompting and linear probes are quite far, with the probe distributions generally a bit farther than the prompting distributions.[9]
These results confirm what we saw above: the sampling experiments generally pick the original answer chosen by the CoT, giving distributions close to the posterior 1_chosen. Meanwhile, the linear probes give roughly-uniform distributions for most of the CoT. The distributions we get from prompting are not very close to uniform, but they're also far from the posterior 1_chosen: in other words, they're often just wrong about which answer will be picked at the end of the CoT.
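For concreteness, here's a small numerical sketch of the two divergences used in these comparisons; the measured distribution X is made up for illustration.

```python
# Sketch of the two KL divergences used above (illustrative numbers only).
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions (natural log)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

x = np.array([0.2, 0.7, 0.1])          # example measured distribution over A/B/C
uniform = np.ones(3) / 3               # the prior U(3)
chosen = np.array([0.0, 1.0, 0.0])     # 1_chosen: the CoT chose "B"

dist_from_prior = kl(x, uniform)       # = log(3) - H(x): distance from the uniform prior
info_to_posterior = kl(chosen, x)      # = -log(x[chosen]): info gained updating x -> 1_chosen
print(dist_from_prior, info_to_posterior)
```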
Other Experiments
I'll briefly mention here several other experiments I tried. I tested a number of alternative implementations of the linear probes, including training only on the end of the CoT, which had no noticeable effect; using a small MLP as a probe rather than a linear one, which also had little effect; and training the probes to predict the numerical value of the answer rather than the label, which was largely unsuccessful.
Lanham et al., which describes experiments very similar to my prompting experiments, found that the extent to which models prompted on partial CoTs gave the same answer as the complete CoT depended heavily on the benchmark used. I therefore ran some brief experiments using the CommonsenseQA benchmark: on this dataset, CoT reasoning barely improves model performance, so models are able to more consistently guess which answer the CoT will end with from partial CoTs. However, the linear probes I trained in this setting performed even more poorly than they do on the elementary_math_qa task, largely failing to predict which answer will be selected even at the very end of the CoT. I suspect this is because the CoTs on this benchmark generally don't state the value of the answer very explicitly like they do on math problems, but I haven't conducted very thorough experiments here.
I spent a good deal of time working on one particular experiment: I tried training linear probes to predict, not the final answer given by the CoT, but rather the intermediate answers given by the prompting experiments. The motivation behind this was that the final answer is not a very good ground-truth signal to train on. For instance, if the LLM "changes its mind" partway through the CoT, we would not want to train a probe on earlier activations to predict the answer at the end. I hoped that by prompting the LLM to give intermediate answers, we might get a more veridical training signal.
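Concretely, the only change from the earlier probe setup is the labelling step: instead of giving every token the CoT's final answer, tokens in sentence k get the answer that the prompting experiment selects after k sentences. Here's a minimal sketch (my reconstruction; the sentence token spans and prompting distributions are assumed to come from the experiments above):

```python
# Sketch of the alternative probe targets (a reconstruction, not the exact code).
# sentence_token_spans: list of (start, end) token indices for each CoT sentence.
# prompting_dists: per-sentence [p(A), p(B), p(C)] from the prompting experiment.
import numpy as np

def intermediate_token_labels(sentence_token_spans, prompting_dists):
    labels = []
    for (start, end), dist in zip(sentence_token_spans, prompting_dists):
        labels.extend([int(np.argmax(dist))] * (end - start))  # most likely answer so far
    return np.array(labels)

# The probe itself is then trained exactly as before, just with these labels in place
# of the final-answer labels.
```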
Initial results were quite promising. Probes trained with this method generally have lower test loss than probes trained to predict the final answer, sometimes significantly so. The raw data also often display really intriguing behavior, like splitting into two populations which then come together fairly abruptly partway through the CoT:
I ended up spending quite a while trying to understand the strange behavior visible in these plots; unfortunately, it turns out that the signal is largely spurious. As I mentioned above, LLMs, especially the smaller ones, are strongly biased towards particular answer letters. These biases also show up when prompting on partial CoTs, and importantly, the bias sometimes depends on sentence index, with the model favoring different letters at different parts of the CoT. Linear probes trained on this data tended to learn these spurious correlations, leading to the different populations you can see early in the CoT in the figure above. (In this case, the portion with two different populations corresponds to the first sentence of the CoT.) Training probes on subsampled data without these biases largely eliminates this behavior, giving results that look much more like the original linear probe results described above.
Conclusion
While the sampling experiments suggest that CoTs tend to be fairly consistent, I did not find much evidence that LLMs are aware of which answer the CoT will end up picking, or that they track beliefs about the final answer that they update while generating the CoT. This is somewhat surprising: anticipating the future and "having beliefs" seem like fairly basic capabilities that would generally help LLMs predict text. It's possible that these capabilities only emerge with scale, or that the experimental setup I used -- dataset, prompting strategy, etc. -- was simply not well-suited to eliciting them (although I ran a number of experiments varying the setup, with no success). As mentioned above, Anthropic and Shai et al. found evidence for both these capabilities in certain settings; more work will be needed to extend these findings to the kinds of settings I studied here, though. On the positive side, these results suggest that, at least in this particular case, the CoTs faithfully encode the reasoning process used by the LLMs without any hidden reasoning.
Moving forwards, I plan to look into connections to Shai et al.'s work on computational mechanics and to see whether it can suggest better ways of measuring beliefs during CoT. I'd also like to try running similar experiments with open-weights "reasoning" models like r1 or its distillations. Their CoTs tend to be much more difficult to work with than those generated by models without reasoning fine-tuning: they often double- or triple-check their work, backtrack, try different possibilities, etc. But this also makes them more interesting objects of study. If you're interested in either of those directions and want to chat, feel free to reach out!
Footnotes
[1] Specifically the AWQ-quantized chat models.
[2] Specifically, I downsampled the options so that each question had only 3 options, added few-shot examples, and cleaned up the formatting.
[3] As an aside, maybe this is already common knowledge in the field but I've found over the course of this project that many common benchmarks are terrible. BIGBench seems particularly bad, with many tasks having inconsistent formatting, variable number of options per question, spelling mistakes, etc. But many of the other benchmarks I looked at have problems as well, e.g. CommonsenseQA has many questions that seem ambiguous or have redundant options (like "refrigerator" and "fridge"), as well as at least one question that the models refuse to answer due to sexual content! I guess it's not a big deal if you're only interested in comparisons across models, but it doesn't seem good for the field that so many benchmarks are of such poor quality.
[4] This includes a prompt containing zero sentences from the CoT, and does not include the line at the end of the CoT where the LLM selects the letter corresponding to the answer. It does, however, include the penultimate line of the CoT, which is usually where the numerical value of the answer is computed.
[5] I randomized the order of the answers for both the actual questions and the few-shot examples, but this did not fix the bias.
[6] See [4]. For instance, if the CoT ends with "...Then divide the result by 50: 6000 ÷ 50 = 120. The answer is B.", I removed the line "The answer is B." but kept the previous line, which states the numerical value of the answer.
[7] With the exception of the 0.5B model. But recall that the 0.5B model has a very unbalanced answer distribution, responding with "C" 74% of the time and never responding with "A". So the linear probe effectively only has two options to choose from, rather than three.
[8] The reason the peaks seem somewhat spread out is that different CoTs have different lengths, so normalizing the length along the x-axis shifts the position of particular tokens even if the text is the same. Plotting these curves without normalization and aligning them to the right produces very sharp peaks, as almost all CoTs follow this very stereotyped pattern for reporting the answer (likely due to the few-shot examples I used).
[9] Since the answer distributions are usually dominated by the trials the original CoT got correct, I also made equivalent plots using only incorrect trials. The main qualitative difference is that the sampling distributions are farther from the posterior distribution (the second plot), which matches what we saw earlier.