Thanks to Jasmina Urdshals, Xavier Poncini, and Justis Mills for comments.
Introduction
At Simplex, our mission is to develop a principled science of the representations and emergent behaviors of AI systems. Our initial work showed that transformers linearly represent belief state geometries in their residual streams. We think of that work as a first step toward understanding what, fundamentally, we are training AI systems to do, and what representations we are training them to have.
Since then, we have used that framework to make progress in a number of directions, which we present in the sections below. The projects ask, and provide answers to, the following questions:
- How, mechanistically, do transformers use