David Krueger on AI Alignment in Academia, Coordination and Testing Intuitions

Michaël Trazzi

David Krueger is an assistant professor at the University of Cambridge and got his PhD from Mila. His research group focuses on aligning deep learning systems, but he is also interested in governance and global coordination. He does not have an AI alignment research agenda per se, and instead tries to enable his seven PhD students to drive their own research.

Below are some highlighted quotes from our conversation (available on YouTube, Spotify, Google Podcasts, Apple Podcasts). For the full context of each quote, see the accompanying transcript.

On Academia and Coordination

Building A Research Team, Not Following An Agenda

"I think agenda is a very grandiose term to me. It's oftentimes, I think people who are at my level of seniority or even more senior in machine learning would say, "oh, I'm pursuing a few research directions." And they wouldn't say, "I have this big agenda." And so I think my philosophy or mentality, I should say, when I set up this group and started hiring people was like, let's get talented people. Let's get people who understand and care about the problem. Let's get people who understand machine learning. Let's put them all together and just see what happens and try and find people who I want to work with, who I think are going to be nice people to have in the group who have good personalities, pro-social, who seem to really understand and care and all that stuff." (full context)

On Coordination Between Academia And The Broader World

"There's a lack of understanding and appreciation of the perspective of people in machine learning within the existential safety community and vice versa. And I think that's really important to address, especially because I'm pretty pessimistic about the technical approaches. I don't think alignment is a problem that can be solved. I think we can do better and better. But to have it be existentially safe, the bar seems really, really high and I don't think we're going to get there. So we're going to need to have some ability to coordinate and say let's not pursue this development path or let's not deploy these kinds of systems right now. And for that, I think to have a high level of coordination around that, we're going to need to have a lot of people on board with that in academia and in the broader world. So I don't think this is a problem that we can solve just with the die hard people who are already out there convinced of it and trying to do it." (full context)

Most Of The Risk Comes From Safety-Performance Trade-Offs In The Development And Deployment Process

"A lot of people are worried about us under-investing in research and that's where the safety-performance trade-offs are most salient for them. But I'm worried about the development and deployment process. I think where most of the risk actually comes from is from safety-performance trade-offs in the development and the deployment process. For whatever level of research we have developed on alignment and safety, I think it's not going to be the case that those trade-offs just go away." (full context)

On AI Rapidly Acquiring World Models

Testing Our Intuitions About Reverse Engineering The World

"This is something that's a really interesting research question and is really important for safety because people have very different intuitions about this. Some people have these stories where just through this carefully controlled text interaction, maybe we just ask this thing one yes or no question a day and that's it. And that's the only interaction it has with the world. But it's going to look at the floating point errors on the hardware it's running on. And it's somehow going to become aware of that. And from that it's going to reverse engineer the entire outside world and figure out some plan to trick everybody and get out. And this is the thing that people talk about on LessWrong classically.
We don't know how smart the superintelligence is going to be, so let's just assume it's arbitrarily smart, basically. And obviously, a lot of people take issue with that. It's not clear how representative that is of anybody's actual beliefs but there are definitely people who have beliefs more towards that end where they think that AI systems are going to be able to understand a lot about the world, even from very limited information and maybe in very limited modality. My intuition is not that way. The important thing is to test the intuitions and actually try and figure out at what point can your AI system reverse engineer the world or at least reverse engineer a distribution of worlds or a set of worlds that includes the real world based on this really limited kind of data interaction." (full context)

An AI Could Unleash Its Potential To Rapidly Learn By Disambiguating Between Worlds By Going For A Walk

"You could have something that actually is very intelligent in some way, has a lot of potential to rapidly learn from new data or maybe has a lot of concepts, a lot of different possible models for how the world could work. It's not able to disambiguate because it hasn't had access to the data that can disambiguate those. And in some sense, you can say it's in a box. It's maybe in the text interface box or maybe it's in the household robot box and it's never been outside the house and it doesn't know what's outside the house and these sorts of things. [...]
It doesn't have to be an Oracle but just some AI system that is doing something narrow and really only understands that domain. And then it can get out of the box either because it decides it wants to go out and explore or because somebody makes a mistake or somebody deliberately releases it. You can suddenly go from it didn't really know anything about anything outside of the domain that it's working in, to all of a sudden starts to get a bunch more information about that and then it could become much more intelligent very quickly for that reason." (full context)
