Thanks to Jordan Taylor, Mark Xu, Alex Mallen, and Lawrence Chan for feedback on a draft! This post was mostly written by Erik, but we're all currently collaborating on this research direction.
Mechanistic anomaly detection (MAD) aims to flag when an AI produces outputs for “unusual reasons.” It is similar to mechanistic interpretability but doesn’t demand human understanding. The Alignment Research Center (ARC) is trying to formalize “reasons” for an AI’s output using heuristic arguments, aiming for an indefinitely scalable solution to MAD.
As a complement to ARC’s theoretical approach, we are excited about empirical research on MAD. Rather than looking for a principled definition of “reasons,” this means creating incrementally harder MAD benchmarks and better MAD...
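To give a concrete flavor of what an empirical MAD method can look like, here is a minimal sketch of one simple baseline (not ARC's heuristic-arguments approach, and not necessarily what we'd use): fit a Gaussian to a model's hidden activations on trusted inputs, then flag test inputs whose activations are unusually far away in Mahalanobis distance. The class name and the random arrays standing in for activations are purely illustrative.

```python
import numpy as np

class MahalanobisDetector:
    """Toy MAD baseline: flag inputs whose hidden activations are far
    (in Mahalanobis distance) from activations seen on trusted data."""

    def fit(self, trusted_acts: np.ndarray) -> None:
        # trusted_acts: (n_samples, hidden_dim) activations from trusted inputs
        self.mean = trusted_acts.mean(axis=0)
        cov = np.cov(trusted_acts, rowvar=False)
        # Regularize so the covariance is invertible even with few samples
        self.inv_cov = np.linalg.inv(cov + 1e-3 * np.eye(cov.shape[0]))

    def score(self, acts: np.ndarray) -> np.ndarray:
        # Higher score = more anomalous by this crude proxy for "unusual reasons"
        diff = acts - self.mean
        return np.einsum("ij,jk,ik->i", diff, self.inv_cov, diff)

# Usage: fit on activations from known-normal behavior, then threshold the
# scores on new inputs to decide which outputs to flag for review.
detector = MahalanobisDetector()
detector.fit(np.random.randn(500, 64))            # stand-in for trusted activations
scores = detector.score(np.random.randn(10, 64))  # stand-in for test activations
print(scores)
```

Benchmarks in this vein would then be graded by whether detectors like this (and more sophisticated ones) reliably separate "normal reasons" from "unusual reasons" on increasingly hard tasks.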