Toward A Mathematical Framework for Computation in Superposition
Author order randomized. Authors contributed roughly equally — see attribution section for details. Update as of July 2024: we have collaborated with @LawrenceC to expand section 1 of this post into an arXiv paper, which culminates in a formal proof that computation in superposition can be leveraged to emulate sparse boolean circuits of arbitrary depth in small neural networks. What kind of document is this? What you have in front of you is so far a rough writeup rather than a clean text. As we realized that our work is currently highly relevant to recent questions posed by interpretability researchers, we put together a lightly edited version of private notes we've written over the last ~4 months. If you'd be interested in writing up a cleaner version, get in touch, or just do it. We're making these notes public before we're done with the project because of some combination of (1) seeing others think along similar lines and wanting to make it less likely that people (including us) spend time duplicating work, (2) providing a frame which we think provides plenty of concrete immediate problems for people to independently work on[1] (3) seeking feedback to decrease the chance we spend a bunch of time on nonsense. 1 minute summary Superposition is a mechanism that might allow neural networks to represent the values of many more features than they have neurons, provided that those features are present sparsely in the dataset. However, until now, an understanding of how computation can be done in a compressed way directly on these stored features has been limited to a few very specific tasks (for example here). The goal of this post is to lay the groundwork for a picture of how computation in superposition can be done in general. We hope this will enable future research to build interpretability techniques for reverse engineering circuits that are manifestly in superposition. Our main contributions are: 1. Formalisation of some tasks performed by MLPs and attenti
Not sure, but I have definitely noticed that llms have subtle "nuance sycophancy" for me. If I feel like there's some crucial nuance missing I'll sometimes ask and LLM in a way that tracks as first-order unbiased and get confirmation of my nuanced position. But at some point I noticed this in a situation where there were two opposing nuanced interpretations and tried modeling myself as asking "first-order-unbiased" questions having opposite views. And I got both views confirmed as expected. I've since been paranoid about this.
Generally I recommend this move of trying two opposing instances of "directional nuance" a few times. Basically I ask something like "the conventional view is X. Is the conventional view considered correct by modern historians?" Where X was formulated in a way that can naturally lead to a rebuttal Y. And then for sufficiently ambiguous and interpretation-dependent pairs of X and X', with fully opposing "nuanced corrections" Y and ¬Y. I've been pretty successful at this several times I think