After talking with a lot of people in Alignment, I think there is still a lot of good to be done in diffusing ideas at the object/technical level. We seem to have done a lot of outreach on the philosophical arguments, but less on the technical side.
Since the field of Alignment is quite diverse and nuanced, we think it would be good to present how different people approach the problem on different fronts. For example, Anthropic's empirical approach might be very different from, say, Christiano's theoretical thinking on ELK. Navigating the landscape of alignment would therefore be essential for building an inside view. I suppose having a good grasp/inside...
This is an interesting point. When we ran our causality studies across layers, we also found that the board state features are mostly used causally in the middle layers, not the deep layers. However, probe accuracy does increase with depth.
I don't know how this relates to the fact that SAEs also find more of these features in the middle layers. Perhaps the "natural features" that the SAEs find in the last few layers do not, in some sense, have to contain much information about the board state, just enough partial information to make the decision.
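For anyone who wants to poke at the "probe accuracy vs. depth" part, here is a minimal sketch of a per-layer linear probe. Everything in it is an assumption for illustration: the activations are synthetic, `make_synthetic_layers` and `probe_accuracy` are hypothetical helper names, the three classes stand in for empty/black/white on a single square, and the probe is a closed-form ridge regression rather than whatever probe we actually trained. The signal strength per layer is hard-coded to grow with depth, so the output only exercises the code and is not evidence about the real model. Note also that this only measures decodability; it says nothing about which layers use the features causally.

```python
import numpy as np


def probe_accuracy(acts, labels, n_classes, train_frac=0.8, ridge=1e-3):
    """Fit a ridge-regularised linear probe and return held-out accuracy.

    acts:   (n_samples, d_model) array of activations from one layer.
    labels: (n_samples,) integer class labels.
    """
    n = acts.shape[0]
    n_train = int(n * train_frac)
    # Append a constant column so the probe can learn a bias term.
    X = np.hstack([acts, np.ones((n, 1))])
    Y = np.eye(n_classes)[labels]  # one-hot regression targets
    X_tr, Y_tr = X[:n_train], Y[:n_train]
    X_te, y_te = X[n_train:], labels[n_train:]
    # Closed-form ridge solution: W = (X^T X + ridge * I)^{-1} X^T Y
    A = X_tr.T @ X_tr + ridge * np.eye(X.shape[1])
    W = np.linalg.solve(A, X_tr.T @ Y_tr)
    preds = (X_te @ W).argmax(axis=1)
    return float((preds == y_te).mean())


def make_synthetic_layers(n_samples=2000, d_model=32, n_classes=3,
                          signal_by_layer=(0.2, 0.6, 1.5), seed=0):
    """Hypothetical stand-in for real activations.

    Each "layer" is Gaussian noise plus a class-dependent direction whose
    strength is set by signal_by_layer (an assumption, not a measurement).
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_classes, size=n_samples)
    directions = rng.normal(size=(n_classes, d_model))
    layers = []
    for strength in signal_by_layer:
        noise = rng.normal(size=(n_samples, d_model))
        layers.append(strength * directions[labels] + noise)
    return layers, labels


if __name__ == "__main__":
    layers, labels = make_synthetic_layers()
    for depth, acts in enumerate(layers):
        acc = probe_accuracy(acts, labels, n_classes=3)
        print(f"layer {depth}: held-out probe accuracy = {acc:.3f}")
```

With real activations you would replace `make_synthetic_layers` with cached residual-stream activations per layer and keep the probe code unchanged; comparing that curve against a causal intervention at each layer is what would actually show the gap between "decodable" and "used".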