Concrete empirical research projects in mechanistic anomaly detection
Thanks to Jordan Taylor, Mark Xu, Alex Mallen, and Lawrence Chan for feedback on a draft! This post was mostly written by Erik, but we're all currently collaborating on this research direction. Mechanistic anomaly detection (MAD) aims to flag when an AI produces outputs for “unusual reasons.” It is similar...