Executive Overview
As models grow increasingly sophisticated, they will surpass human expertise in many domains, and ensuring that such models are robustly aligned is a fundamentally difficult challenge (Bowman et al., 2022). For example, we might hope to reliably detect whether a model is being deceptive in order to achieve an instrumental goal. Importantly, deceptive alignment and robust alignment can be behaviorally indistinguishable (Hubinger et al., 2024). Current predominant alignment methods control only the model's outputs while leaving its internals unexamined (black-box access; Casper et al., 2024). However, recent literature has begun to demonstrate that examining the internals of models provides additional predictive power that behavioral observation alone cannot afford.