Searching for a model's concepts by their shape – a theoretical framework
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort

Introduction

I think that Discovering Latent Knowledge in Language Models Without Supervision (DLK; Burns, Ye, Klein, & Steinhardt, 2022) is a very cool paper – it proposes a way to do unsupervised mind reading[1] –...
Feb 23, 2023