From "Is Deep Learning Actually Hitting a Wall? Evaluating Ilya Sutskever's Recent Claims":
But what we seem to be seeing is a bit different from deep learning broadly hitting a wall. More specifically, it appears to be: returns to scaling up model pretraining are plateauing.
The paper "AI Control: Improving Safety Despite Intentional Subversion" is a practical, important step in the right direction. It demonstrates protocols that aim to preserve safety even when a model is malicious and knows it is suspected of being dangerous.
Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger. Proceedings of the 41st International Conference on Machine Learning, PMLR 235:16295-16336, 2024.
Abstract: As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. To do so, safety measures either aim at making LLMs try to avoid harmful outcomes or aim at preventing LLMs from causing harmful outcomes, even if they try to cause them. In this paper, we focus on this second layer of defense. We develop and evaluate pipelines of safety techniques (protocols) that try to ensure safety despite intentional subversion - an approach we call AI control. We investigate a setting in which we want to solve a sequence of programming problems without ever submitting subtly wrong code, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate a range of protocols and red-team them by exploring strategies that the untrusted model could use to subvert them. We find that using the trusted model to edit untrusted-model code or using the untrusted model as a monitor substantially improves on simple baselines.
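To make the protocol idea concrete, here is a minimal Python sketch of a trusted-monitoring-plus-editing pipeline in the spirit of the setup the abstract describes: an untrusted model writes code, a trusted model scores how suspicious it looks, and scarce trusted labor audits only the most suspicious submissions. The object and method names (write_solution, rate_suspicion, edit_solution, review) and the threshold value are hypothetical placeholders, not the authors' actual implementation.

```python
# Minimal sketch of a trusted-monitoring control protocol, under assumed
# interfaces for the untrusted model (e.g. GPT-4), the trusted model
# (e.g. GPT-3.5), and a limited budget of trusted human audits.

SUSPICION_THRESHOLD = 0.8  # illustrative cutoff for escalating a submission

def solve_with_control(problem, untrusted_model, trusted_model, auditor, audits_left):
    """Return code for `problem` while trying not to submit subtly wrong code."""
    candidate = untrusted_model.write_solution(problem)           # powerful but untrusted
    suspicion = trusted_model.rate_suspicion(problem, candidate)  # weaker but trusted monitor

    if suspicion >= SUSPICION_THRESHOLD:
        if audits_left > 0:
            # Spend scarce trusted labor on the most suspicious submissions.
            verdict = auditor.review(problem, candidate)
            if verdict == "backdoored":
                # Subversion caught: fall back to the trusted model entirely.
                return trusted_model.write_solution(problem), audits_left - 1
            return candidate, audits_left - 1
        # No audit budget left: have the trusted model rewrite the suspicious
        # code ("trusted editing") instead of submitting it directly.
        return trusted_model.edit_solution(problem, candidate), audits_left

    return candidate, audits_left
```

The paper evaluates several such pipelines by red-teaming them against an untrusted model that actively tries to slip backdoors past the monitor; the sketch above only illustrates the basic structure of one protocol family.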
Related Video by Robert Miles: I highly recommend "Using Dangerous AI, But Safely?", released on Nov. 15, 2024.