The following work was done between January and March 2024 as part of my PhD rotation with Prof Surya Ganguli and Prof Noah Goodman. One aspect of sparse autoencoders that has put them at the center of attention in mechanistic interpretability is the notion of monosemanticity. In this post, we...
Communication is a fascinating subject. It’s a way of transferring information from one place to the next (or, to invert a quote by Chris Fields, communication is physics). Now you may have noticed that I used the word “communication” in two slightly different ways. In the first sentence you might...
Epistemic status: exploratory, speculative, half-baked thought. It’s a worldwide optimization problem: what content should one consume, and under what conditions, to reach a particular goal? Taking a step back, looking only at academia, and restricting ourselves to academic papers (and, for simplicity, preprints), the question boils down to: What papers...
I recently began exploring the field of mechanistic interpretability. It’s fascinating in much the same way the field of active inference was ~1-2 years ago (let’s hope it stays that way, as actinf has), in that it seems to attract a lot of researchers from multiple disciplines, and...
Imagine looking at a flower. Depending on the lighting conditions and other factors, you see the flower with different levels of detail. If you take a picture and project it onto a low-resolution screen, instead of a flower you might just see colored boxes that barely capture the flower’s features....