From "Is Deep Learning Actually Hitting a Wall? Evaluating Ilya Sutskever's Recent Claims":
But what we seem to be seeing is a bit different from deep learning broadly hitting a wall. More specifically, it appears to be: returns to scaling up model pretraining are plateauing.
The paper "AI Control: Improving Safety Despite Intentional Subversion" is a practical, important step in the right direction. It demonstrates protocols that aim to preserve safety even when a model is malicious and knows it is suspected of being dangerous.
Ryan Greenblatt, Buck Shlegeris, Kshitij Sachan, Fabien Roger. Proceedings of the 41st International Conference on Machine Learning, PMLR 235:16295-16336, 2024.
Abstract: As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. To do so, safety measures either aim at making LLMs try to avoid harmful outcomes or aim at preventing LLMs from causing harmful outcomes, even if they try to cause them. In this paper, we focus on this second layer of defense. We develop and evaluate pipelines of safety techniques (protocols) that try to ensure safety despite intentional subversion - an approach we call AI control. We investigate a setting in which we want to solve a sequence of programming problems without ever submitting subtly wrong code, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to high-quality trusted labor. We investigate a range of protocols and red-team them by exploring strategies that the untrusted model could use to subvert them. We find that using the trusted model to edit untrusted-model code or using the untrusted model as a monitor substantially improves on simple baselines.
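To make the protocol idea concrete, here is a minimal Python sketch of a trusted-monitoring-plus-editing pipeline in the spirit of the setup the abstract describes: an untrusted model writes code, a trusted model scores how suspicious it looks, and scarce trusted labor audits only the most suspicious submissions. The object and method names (write_solution, rate_suspicion, edit_solution, review) and the threshold value are hypothetical placeholders, not the authors' actual implementation.

```python
# Minimal sketch of a trusted-monitoring control protocol, under assumed
# interfaces for the untrusted model (e.g. GPT-4), the trusted model
# (e.g. GPT-3.5), and a limited budget of trusted human audits.

SUSPICION_THRESHOLD = 0.8  # illustrative cutoff for escalating a submission

def solve_with_control(problem, untrusted_model, trusted_model, auditor, audits_left):
    """Return code for `problem` while trying not to submit subtly wrong code."""
    candidate = untrusted_model.write_solution(problem)           # powerful but untrusted
    suspicion = trusted_model.rate_suspicion(problem, candidate)  # weaker but trusted monitor

    if suspicion >= SUSPICION_THRESHOLD:
        if audits_left > 0:
            # Spend scarce trusted labor on the most suspicious submissions.
            verdict = auditor.review(problem, candidate)
            if verdict == "backdoored":
                # Subversion caught: fall back to the trusted model entirely.
                return trusted_model.write_solution(problem), audits_left - 1
            return candidate, audits_left - 1
        # No audit budget left: have the trusted model rewrite the suspicious
        # code ("trusted editing") instead of submitting it directly.
        return trusted_model.edit_solution(problem, candidate), audits_left

    return candidate, audits_left
```

The paper evaluates several such pipelines by red-teaming them against an untrusted model that actively tries to slip backdoors past the monitor; the sketch above only illustrates the basic structure of one protocol family.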
Related Video by Robert Miles: I highly recommend "Using Dangerous AI, But Safely?", released on Nov. 15, 2024.