A Defense of Work on Mathematical AI Safety

Davidmanheim

This is a linkpost for https://forum.effectivealtruism.org/posts/NKMxC2nA47uuhFm8x/a-defense-of-work-on-mathematical-ai-safety

AI Safety was, a decade ago, nearly synonymous with obscure mathematical investigations of hypothetical agentic systems. Fortunately or unfortunately, this has largely been overtaken by events; the successes of machine learning and the promise, or threat, of large language models have pushed thoughts of mathematics aside for many in the “AI Safety” community. The once pre-eminent advocate of this class of “agent foundations” research for AI safety, Eliezer Yudkowsky, has more recently said that timelines are too short to allow this agenda to have a significant impact. This conclusion seems at best premature.

Foundational research is useful for prosaic alignment

First, the value of foundational and mathematical research can be synergistic both with technical progress on safety and with insight into how and where safety is critical. Many machine learning research agendas for safety are investigating issues identified years earlier by foundational research, and are at least partly informed by that research. Current mathematical research could play a similar role in the coming years, as funding and research capacity for safety continue to grow. We have also repeatedly seen the importance of foundational research arguments in discussions of policy, from Bostrom’s book to policy discussions at OpenAI, Anthropic, and DeepMind. These connections may be more conceptual than direct, but they are still relevant.

Long timelines are possible

Second, timelines are uncertain. If timelines based on technical progress are short, many claim that we have years, not decades, until safety must be solved. But this assumes that policy and governance approaches fail, and that we therefore need a full technical solution in the short term. It also seems likely that short timelines make all approaches less likely to succeed. On the other hand, if timelines for technical progress are longer, fundamental advances in understanding, such as those provided by more foundational research, are even more likely to assist in finding or building more technical routes toward safer systems.

Aligning AGI ≠ aligning ASI

Third, even if safety research is successful at “aligning” AGI systems, both via policy and technical solutions, the challenges of ASI (Artificial SuperIntelligence) still loom large. One critical claim of AI-risk skeptics is that recursive self-improvement is speculative, so we do not need to worry about ASI, at least yet. They also often assume that policy and prosaic alignment are sufficient, or that approximate alignment of near-AGI systems will allow them to approximately align more powerful systems. Given any of those assumptions, they imagine a world where humans and AGI will coexist, so that even if AGI captures an increasing fraction of economic value, it won’t be fundamentally uncontrollable. And even according to so-called Doomers, in that scenario, for some period of time it is likely that policy changes, governance, limited AGI deployment, and human-in-the-loop and similar oversight methods for limiting or detecting misalignment will be enough to keep AGI in check. This provides a stop-gap solution, optimistically for a decade or even two - a critical period - but is insufficient later. And despite OpenAI’s recent announcement that they plan to solve Superalignment, there are strong arguments that control of strongly superhuman AI systems will not be amenable to prosaic alignment, and policy-centric approaches will not allow control.

Resource Allocation

Given the above claims, a final objection is based on resource allocation, in two parts. First, if language model safety were still strongly funding-constrained, those areas would be higher leverage, and avenues of foundational and mathematical research would be less marginally beneficial routes for spending. Similarly, if the individuals likely to contribute to mathematical AI safety were all just as well suited to computational deep learning safety research, their skills might be better directed towards machine learning safety. Neither of these is the case.

Of course, investments in agent foundations research are unlikely to directly lead to safety within a few years, and it would be foolish to abandon or short-change efforts that are critical to the coming decade. But even in the short term, these approaches may continue to have important indirect effects, including both deconfusion, and informing other approaches.

As a final point, pessimistically, these types of research are among the least capabilities-relevant AI safety work being considered, so they are low risk. Optimistically, this type of research is very useful in the intermediate term future, and is invaluable should we manage to partially align language models, and need to consider what is next for alignment.

Thank you to Vanessa Kosoy and Edo Arad for helpful suggestions and feedback. All errors are, of course, my own.

I think it would help a lot to provide people with examples. For example, here

Many machine learning research agendas for safety are investigating issues identified years earlier by foundational research, and are at least partly informed by that research.

You say that, but then don't provide any examples. I imagine readers just not thinking of any, and then moving on without feeling any more convinced.

Overall, I think that it's hard for people to believe agent foundations will be useful because they're not visualizing any compelling concrete path where it makes a big difference.

You say that, but then don't provide any examples. I imagine readers just not thinking of any, and then moving on without feeling any more convinced.

I don't imagine readers doing that if they are familiar with the field, but I'm happy to give a couple of simple examples for this, at least. Large parts of mechanistic interpretability aim to address concerns about future deceptive behaviors. Goal misgeneralization was predicted and investigated as Goodhart's law.

Overall, I think that it's hard for people to believe agent foundations will be useful because they're not visualizing any compelling concrete path where it makes a big difference.

Who are the people you're referring to? I'm happy to point to this paper for an explanation, but I didn't think anyone on LessWrong would really not understand why people thought agent foundations were potentially important as a research agenda - I was defending the further point that we should invest in them even if we think other approaches are critical near term.

I'm still not sure I buy the examples. In the early parts of the post you seem to contrast 'machine learning research agendas' with 'foundational and mathematical'/'agent foundations' type stuff. Mechanistic interpretability can be quite mathematical but surely it falls into the former category? i.e. it is essentially ML work as opposed to constituting an example of people doing "mathematical and foundational" work.

I can't say much about the Goodhart's Law comment but it seems at best unclear that its link to goal misgeneralization is an example of the kind you are looking for (i.e. in the absence of much more concrete examples, I have approximately no reason to believe it has anything to do with what one would call mathematical work).

I'm not really clear what you mean by not buying the example. You certainly seem to understand the distinction I'm drawing - mechanistic interpretability is definitely not what I mean by "mathematical AI safety," though I agree there is math involved.

And I think the work on goal misgeneralization was conceptualized in ways directly related to Goodhart, and this type of problem inspired a number of research projects, including quantilizers, which is certainly agent-foundations work. I'll point here for more places the agent foundations people think it is relevant.

Ah OK, I think I've worked out where some of my confusion is coming from: I don't really see any argument for why mathematical work may be useful, relative to other kinds of foundational conceptual work. e.g. you write (with my emphasis): "Current mathematical research could play a similar role in the coming years..." But why might it? Isn't that where you need to be arguing?

The examples seem to be of cases where people have done some kind of conceptual foundational work which has later gone on to influence/inspire ML work. But early work on deception or Goodhart was not mathematical work, which is why I don't understand how these are examples.

I think the dispute here is that you're interpreting mathematical too narrowly, and almost all of the work happening in agent foundations and similar is exactly what was being worked on by "mathematical AI research" 5-7 years ago. The argument was that those approaches have been fruitful, and we should expect them to continue to be so - if you want to call that "foundational conceptual research" instead of "Mathematical AI research," that's fine.

Strongly upvoted.

I roughly think that a few examples showing that this statement is true will 100% make OP's case. And that without such examples, it's very easy to remain skeptical.

Unrelatedly, why not make this a cross-post rather than a link-post?

Because I didn't know that was a thing and cannot find an option to do it...

Thanks for the article. For what it's worth, here's the defence I give of Agent Foundations and associated research, when I am asked about it (for background, I'm a mathematician, now working on mathematical aspects of AI safety different from Agent Foundations). I'd be interested if you disagree with this framing.

We can imagine the alignment problem coming in waves. Success in each wave merely buys you the chance to solve the next. The first wave is the problem we see in front of us right now, of getting LLMs to Not Say Naughty Things, and we can glimpse a couple of waves after that. We don't know how many waves there are, but it is reasonable to expect that beyond the early waves our intuitions probably aren't worth much.

That's not a surprise! As physics probed smaller scales, at some point our intuitions stopped being worth anything, and we switched to relying heavily on abstract mathematics (which became a source of different, more hard-won intuitions). Similarly, we can expect that as we scale up our learning machines, we will enter a regime where current intuitions fail to be useful. At the same time, the systems may be approaching more optimal agents, and theories like Agent Foundations start to provide a very useful framework for reasoning about the nature of the alignment problem.

So in short I think of Agent Foundations as like quantum mechanics: a bit strange perhaps, but when push comes to shove, one of the few sources of intuition we have about waves 4, 5, 6, ... of the alignment problem. It would be foolish to bet everything on solving waves 1, 2, 3 and then be empty handed when wave 4 arrives.

It's an interesting framing, Dan. Agent foundations for Quantum superintelligence. To me, motivation for Agent Foundations mostly comes from different considerations. Let me explain.

To my mind, agent foundations is not primarily about some mysterious future quantum superintelligences (though hopefully it will help us when they arrive!) - but about real agents in This world, TODAY.

That means humans and animals, but also many systems that are agentic to a degree, like markets, large organizations, large language models, etc. One could call these pseudo-agents or pre-agents or egregores, but at the moment there is no accepted terminology for not-quite-agents, which may contribute to the persistent confusion that agent foundations is only concerned with expected utility maximizers.

The reason that research in Agent Foundations has so far mostly restricted itself to highly ideal 'optimal' agents is primarily mathematical tractability. Focusing on highly ideal agents also makes sense from the point of view where we are focused on 'reflectively stable agents', i.e. we'd like to know what agents converge to upon reflection. But primarily the reason we don't much study more complicated, realistic models of real-life agents is that the mathematics simply isn't there yet.

A different perspective on agent foundations is primarily that of deconfusion: we are at present confused about many of the key concepts of aligning future superintelligent agents. We need to be less confused.

Another point of view on the importance of Agent Foundations: Ultimately, it is inevitable that humanity will delegate more and more power to AIs. Ensuring the continued survival and flourishing of the human species is then less about interpretability, more about engineering reflectively stable well-steered superintelligent systems. This is more about decision theory & (relatively) precise engineering, less about the online neuroscience of mechInterp. Perhaps this is what you meant by the waves of AI alignment.

there are strong arguments that control of strongly superhuman AI systems will not be amenable to prosaic alignment

In which section of the linked paper is the strong argument for this conclusion to be found? I had a quick read of it but could not see it - I skipped the long sections of quotes, as the few I read were claims rather than arguments.

I'm not going to try to summarize the arguments here, but it's been discussed on this site for a decade. And those quoted bits of the paper were citing the extensive discussions about this point - that's why there were several hundred citations, many of which were to LessWrong posts.
