Searching for a model's concepts by their shape – a theoretical framework
Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort

Introduction

I think that Discovering Latent Knowledge in Language Models Without Supervision (DLK; Burns, Ye, Klein, & Steinhardt, 2022) is a very cool paper – it proposes a way to do unsupervised mind reading[1] –...
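To make "unsupervised mind reading" a bit more concrete, here is a minimal sketch of the kind of probe DLK trains (Contrast-Consistent Search): a linear probe on hidden states whose loss rewards assigning a statement and its negation probabilities that sum to 1, while discouraging the degenerate 0.5/0.5 solution. This is only an illustration, not the paper's implementation – the tensor names are placeholders, the hidden states here are random stand-ins for real model activations, and details from the paper such as normalizing the hidden states and retraining from multiple random initializations are omitted.

```python
import torch
import torch.nn as nn

class CCSProbe(nn.Module):
    """Linear probe mapping a hidden state to a probability of 'true'."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(h))

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: a statement and its negation should get probabilities summing to 1.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# Toy usage: random tensors standing in for hidden states extracted from a model
# on contrast pairs ("x? Yes" vs. "x? No").
hidden_dim = 768
h_pos = torch.randn(128, hidden_dim)  # placeholder activations for the "Yes" half of each pair
h_neg = torch.randn(128, hidden_dim)  # placeholder activations for the "No" half of each pair

probe = CCSProbe(hidden_dim)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = ccs_loss(probe(h_pos), probe(h_neg))
    loss.backward()
    opt.step()
```

The point of the sketch is just that no truth labels appear anywhere: the probe is shaped entirely by the logical consistency constraints between the two halves of each contrast pair.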