Architectures for Increased Externalisation of Reasoning
TL;DR: We propose a new post-training method for making LLMs more verbose reasoners by teaching a model to truncate its forward passes early. We expect this technique to improve monitorability by decreasing the amount of computation available within the hidden layers for easy-to-predict tokens. (A minimal code sketch of what "truncating a forward pass early" might look like is included at the end of this section.) We're looking for collaborators to help continue this project. If you're interested, reach out to us!

Karthik, Liza, and Mariia are equal-contribution first authors; order was determined by coin flips. Puria, Cameron, and Edward are equal-contribution mentors. This work was done in Mentorship for Alignment Research Students (MARS) 3.0, in a Geodesic Research stream.

Architectures for Externalisation of Cognition

Many AI safety researchers are converging on the idea that a model's chain of thought (CoT) may be a valuable tool for monitoring the actions it may take before it takes them. Models struggle to perform complex tasks without using their CoT to carry out serial reasoning steps. The hope is therefore that, in order to scheme sufficiently effectively against us, a model would need to reason about its plans within the CoT, giving us an opportunity to catch it.

There are a number of imminent threats to the use of CoT monitoring for control (see here for a more comprehensive breakdown):

* Unfaithfulness: The CoT may fail to reflect the actual cognition performed by the model. In this case, the CoT may not accurately convey the scheming that the model is doing.
* Obfuscation from training pressures: If training pressure is placed on a model's CoT, it may push scheming reasoning "internal". In particular, pressure on CoTs to "look nice" may remove overt traces of scheming without preventing the model from using the CoT to scheme.
* Neuralese: Finally, it is possible that developers will soon switch to an architecture which forgoes the use of CoT at all, with reasoning instead occurring purely within the activation space of the model. Such reasoning
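To make the "truncate forward passes early" idea from the TL;DR concrete, here is a minimal sketch of reading out a next-token prediction from an intermediate layer rather than the full stack, so that fewer hidden layers of computation contribute to the token. This is illustrative only: GPT-2 via Hugging Face transformers is used as a stand-in model, the exit layer is an arbitrary choice, and this is not our post-training method itself, just the early-exit readout it builds on.

```python
# Minimal sketch: produce next-token logits from a truncated forward pass of GPT-2
# by applying the final layer norm and unembedding to an intermediate layer's
# residual stream (a logit-lens-style early exit). Model and exit layer are
# illustrative assumptions, not the actual training setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

exit_layer = 6  # truncate after 6 of GPT-2's 12 blocks (illustrative choice)

with torch.no_grad():
    outputs = model(input_ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output; hidden_states[i] is the output of block i.
    truncated_hidden = outputs.hidden_states[exit_layer]
    # Apply the final layer norm and unembedding to the intermediate residual stream.
    truncated_logits = model.lm_head(model.transformer.ln_f(truncated_hidden))
    full_logits = outputs.logits

print("early-exit prediction:", tokenizer.decode(truncated_logits[0, -1].argmax().item()))
print("full-depth prediction:", tokenizer.decode(full_logits[0, -1].argmax().item()))
```

A per-token exit rule along these lines, with easy-to-predict tokens exiting earliest, is the kind of mechanism that would reduce the hidden-layer computation available for those tokens, as described in the TL;DR.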