jan_bauer

Linear steerability in continuous chain-of-thought reasoning

(This project was done as a ~20h application project for Neel Nanda's MATS stream, and is posted here with only minimal edits. The results seem strange; I'd be curious whether there are any insights.) Summary. Motivation: Continuous-valued chain-of-thought (CCoT) is a likely prospective paradigm for reasoning models due to computational advantages,...

Jan 30 · 10

Towards a Unified Interpretability of Artificial and Biological Neural Networks

Neuroscience and mechanistic interpretability share a common goal: understanding neural networks, either biological or artificial. This is reflected in the convergent evolution of these domains – from interpreting single neurons to abstract features, and more recently, to functional representations. Yet, a significant number of approaches in either field remain unknown...

Dec 21, 2024 · 2
