This work was supported by the Effective Ventures Foundation USA through their EA Funds program. It was started as part of the MATS (ML Alignment and Theory Scholars) program, with mentorship from Julian Michael and research management by McKenna Fitzgerald.
Code for this project is available on GitHub. Explore samples from the different training runs at our interactive website.
Introduction
Scalable oversight research paradigm
Scalable oversight is the challenge of overseeing or training AI systems to achieve goals that are hard for humans to specify. By definition, studying scalable oversight directly on the tasks we care about is infeasible, because we would not know if the goals were achieved correctly. So instead we use sandwiching — using... (read 4426 more words →)