Review

Background

In November 2022, I was thinking a lot about the possible paths to getting powerful and dangerous AI models. My intuition was that an AI that can generate new science, i.e., write down a series of hypotheses, select one or more of them, design experiments to test them, and evaluate the results, would be 1) situationally aware, 2) goal-directed, 3) equipped with long-ish memory (or able to use external tools that give it memory), and 4) human-level smart or beyond.

I didn’t have good mechanistic explanations for how these behaviors would arise in AI systems; I expected something along the lines of: the training algorithm would attempt to minimize the loss function and, as a result of instrumental convergence, would end up discovering a consequentialist to improve performance across these tasks within a reinforcement learning environment. So, I started a project to study science models, hoping to get a better idea of what these models are, how they work, and how likely it is that my speculations were true. What my team and I studied can be found here.

In a nutshell, I thought there was something special about doing science, something that would lead to agentic behavior, something that did not exist when training models to generate poetry or help you plan a summer vacation. 

Confusion and Update 1: 

There probably isn’t anything computationally special about doing science, or at least, I overestimated how likely that would be. I’d still be curious to know how much credence I should assign to that speculation. Nevertheless, I think the gist of The Bitter Lesson transfers to the case of AI science models. I thought there would be something deep about the discovery process, beyond “scaling computation via search and learning,” that would make a STEM AI more likely to develop dangerous properties, e.g., situational awareness. At the same time, I underestimated how dangerous outputting science is on its own.

To explain my speculation a bit more: I thought there was a higher probability of the model somehow developing agency because I assumed that science is hard, or at least harder than many other activities. After thinking about it for quite some time, I ended up without a satisfactory reply to questions such as “but why wouldn’t you expect an AI that has to execute a cooking recipe to develop agentic properties, since cooking seems to require a sense of continuity and robustness similar to that needed to execute a scientific experiment?” So, having no good reason for my assumption, I had to admit that the absence of evidence must imply evidence of absence.

Confusion and Update 2: 

For quite some time, I was stuck thinking about Yudkowsky and Ngo’s debate in the MIRI conversations about which capability comes first: doing science or taking over the world. Specifically, I thought the following was revealing something very valuable about the development of cognition:

Human brains have been optimised very little for doing science.

This suggests that building an AI which is Einstein-level at doing science is significantly easier than building an AI which is Einstein-level at taking over the world (or other things which humans evolved to do).

But capabilities don’t come in distinct packages the way this statement implies, such that we first get the skillset that enables an agent to take over the world and only later a separate skillset that lets it generate special and general relativity. Capability seems to be more of a spectrum that generalizes across tasks, much as driving red cars and driving blue cars draw on the same skill. The capabilities required for complex tasks like doing science and taking over the world both seem to come down to general problem-solving, which makes the two tasks look very similar in what they require cognitively, analogously to driving red and blue cars.

Confusion and Update 3: 

Could we live in a world where we get exceptionally helpful AI scientists that are not dangerous? I was more optimistic about that scenario until I understood what the STEM AI proposal entails and became pretty confident that we don’t live in a world where we just train AIs on formal-language datasets and make sure the model remains inside its STEM sandbox. It’s clear to me now that we live in the exact opposite world: current LLMs have read the internet at least once and plausibly simulate the physical world and human reality. So the scenario where AIs become helpful science assistants in the box no longer seems possible, as far as I understand, unless there’s a dramatic change in the current AI paradigm.