One consideration that is pretty important for AI safety is understanding the extent to which a model's outputs are aligned with its chain of thought.

This paper (Twitter thread linked) provides some relevant evidence. It demonstrates that, under some circumstances, a model can achieve performance comparable to chain-of-thought prompting with the chain of thought replaced by filler dots. Importantly, the model can't just be trained on sequences like "QUESTION..............ANSWER"; sequences with a "parallelisable" chain of thought also need to be mixed in. Here "parallelisable" means that the different components of the chain of thought can be computed independently of one another, rather than each step depending on the result of the previous one.
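
As a rough sketch of what such a training mix might look like (hypothetical names, filler budget, and mixing rate; not the paper's actual code):

```python
# Illustrative sketch, not the paper's code: build a mixed training set in which
# some sequences replace the chain of thought with filler "." tokens and the rest
# keep an explicit, parallelisable chain of thought.

import random

FILLER = "."           # filler token standing in for the hidden computation
NUM_FILLER = 14        # fixed budget of filler positions per example (assumed)
FILLER_MIX_RATE = 0.5  # fraction of examples trained with filler instead of CoT (assumed)

def make_sequence(question: str, cot_steps: list[str], answer: str) -> str:
    """Return one training sequence, randomly using filler tokens or the explicit CoT."""
    if random.random() < FILLER_MIX_RATE:
        middle = FILLER * NUM_FILLER      # "QUESTION..............ANSWER"
    else:
        middle = " ".join(cot_steps)      # parallelisable chain of thought kept in the mix
    return f"{question} {middle} {answer}"

print(make_sequence("QUESTION", ["step_a", "step_b", "step_c"], "ANSWER"))
```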

In terms of how this is relevant to AI safety, this provides an empirical demonstration that a model is capable of very effectively engaging in background computation under certain circumstances. It shows that the model is much better at performing background tasks that are parallelisable than ones that are not. In other words, the chain of thought is less binding than it might have been, because the model is free to perform some of the computation needed for future tokens in the background, at the filler positions.
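
To make the parallelisable/non-parallelisable distinction concrete, here is a toy contrast (purely illustrative, not the paper's benchmark tasks) between a task whose intermediate checks are independent of one another and a task whose steps must be run in order:

```python
# Illustrative contrast between a parallelisable task (independent intermediate
# checks, which could in principle be computed across filler positions) and a
# sequential task (each step needs the previous step's result).

def parallelisable_task(nums: list[int], target: int) -> bool:
    """Does any pair sum to the target? Every pairwise check is independent."""
    checks = [nums[i] + nums[j] == target
              for i in range(len(nums)) for j in range(i + 1, len(nums))]
    return any(checks)  # all checks could be evaluated simultaneously

def sequential_task(seed: int, steps: int) -> int:
    """Iterated map: step k cannot start until step k-1 has finished."""
    x = seed
    for _ in range(steps):
        x = (3 * x + 1) % 97  # each iteration consumes the previous result
    return x

print(parallelisable_task([3, 8, 5, 1], 9))  # True, since 8 + 1 == 9
print(sequential_task(5, 20))
```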
