Countering arguments against working on AI safety

Rauno Arike

This is a crosspost from my personal website.

One of the questions I’ve grappled with the most throughout the time I’ve been considering working on AI safety is whether we know enough about what a superintelligent system will look like to be able to do any useful research on aligning it. This concern has been expressed by several public intellectuals I respect. For instance, here’s an excerpt from Scott Aaronson’s superb essay „The Ghost in the Quantum Turing Machine“:

"For example, suppose we conclude—as many Singularitarians have—that the greatest problem facing humanity today is how to ensure that, when superhuman AIs are finally built, those AIs will be “friendly” to human concerns. The difficulty is: given our current ignorance about AI, how on earth should we act on that conclusion? Indeed, how could we have any confidence that whatever steps we did take wouldn’t backfire, and increase the probability of an unfriendly AI?"

And here’s Ben Garfinkel on the 80,000 Hours podcast:

"So just because something has some long run significance doesn’t necessarily mean that you can do much to affect the trajectory. So just a quick analogy; it’s quite clear that the industrial revolution was very important. /.../ At the same time though, if you’re living in 1750 or something and you’re trying to think, "How do I make the industrial revolution go well? How do I make the world better… let’s say in the year 2000, or even after that", knowing that industrialization is going to be very important… it’s not really clear what you do to make things go different in a foreseeably positive way."

Note that Aaronson probably doesn’t endorse his argument anymore, at least in the same form: he recently joined OpenAI and wrote in the post announcing his decision that he has become significantly more optimistic about our ability to attack this problem over time. However, as I’ve seen a similar argument thrown around many times elsewhere, I still feel the need to address it.

So, the argument that I’m trying to counter is "Okay, I buy that AI is possibly the biggest problem we’re facing, but so what? If I don’t have the slightest idea about how to tackle this problem, I might as well work on some other important thing – power-seeking AI is not the only existential risk in the world." I believe that even if one assumes we know very little about what a future AGI system will look like, there's still a strong case for working on AI safety. Without further ado, here’s a list of arguments that have convinced me to keep looking into the field of AI safety.

Counterargument 1: We don't know how long it will take to build AGI

Provided that there’s even a very small tail risk that a generally capable AI system which would kill us all unless aligned will be built in the next 20-30 years, we don’t have the option of waiting – it’s better to do something based on the limited information that we have than to sit on the couch and do nothing. Every bit of preparation, even if largely clueless, is sorely needed.

This rests on the assumption that AGI within the next 20-30 years is within the realm of possibility, but at this point, it seems almost impossible to argue otherwise. In 2014, the median prediction of the 100 top experts of the field was a 50% chance of AGI by 2050, and my feeling is that many experts have updated towards shorter timelines in the meantime. You can find a huge list of claims made about AGI timelines here. One might argue that AI researchers suffer from planning fallacy and that predictions made 30 years ago look awfully similar to predictions made today, but the experts' predictions seem reasonable when looking at biological anchors and extrapolating computing trends. Even if one is significantly more skeptical about AI timelines than the median researcher, it seems incredibly difficult to justify predicting virtually zero chance of AGI in the next 20-30 years.

We may perhaps be better able to do AI safety research at some point in the future, but exactly how close to AGI do we need to get before deciding that the problem is urgent enough? Stuart Russell has articulated this point well in Human Compatible: "[I]magine what would happen if we received notice from a superior alien civilization that they would arrive on Earth in thirty to fifty years. The word pandemonium doesn’t begin to describe it. Yet our response to the anticipated arrival of superintelligent AI has been . . . well, underwhelming begins to describe it."

I’d further note that if AGI is only 20-30 years away, which we cannot rule out, then the AI systems that we have now are much more likely to be at least somewhat similar to the system that eventually becomes our first AGI. In that case, we’re probably not that clueless.

Counterargument 2: On the margin, AI safety needs additional people more than many other fields

Say for the sake of the argument that we’re indeed sure that AGI is definitely more than 50 years away. We’ll never be, but assume we will for the sake of the argument. Even in this case, I think that we need significantly more AI safety researchers on the margin.

Although we’re not in 2014 anymore when there probably weren’t even 10 full-time researchers focusing on AI safety, it still seems that there are up to a hundred similarly qualified AI capabilities researchers for every AI safety researcher. The field remains in a state of searching for a common approach to attack the problem and there’s still a lot of disagreement about which paradigms are worth pursuing. Contrast this to a field like particle physics. Although that too is a field still in search of its ultimate flawless theory, there's a broad agreement that the Standard Model gives us predictions that are good enough for nearly every practical problem, and there's no looming deadline when we either have the Theory of Everything or go extinct.

Even 20 years is a fairly long time, so I don't think it's time to give up all hope about us succeeding with alignment. Rather, there’s an enormous opportunity to give a huge contribution to solving one of the most impactful problems of our time for anyone who can contribute original ways of tackling the problem and good arguments for preferring one paradigm over the others.

Counterargument 3: There’s a gap between capabilities and safety that must be bridged

Continuing on a similar note to the previous argument, there’s currently a huge gap between our ability to build AI systems and our ability to understand them. Nevertheless, just about 2% of the papers at NeurIPS 2021 were safety-related, and as argued before, just a small fraction of the number of capabilities researchers is working on safety research.

If these proportions don’t change, it's difficult to imagine safety research catching up with our ability to build sophisticated systems by the right time. Of course, it’s possible that sufficiently good safety research will be possible without ever being funded as much as capabilities research, but even then, the current proportions still look way too much tilted in favor of capabilities. After all, safety has thus far looked anything but an easy problem solvable over the summer on a small budget by 10 graduate students having a reliable supply of Red Bull.

Counterargument 4: Research on less capable systems might generalize to more capable systems, even if the systems are not very similar

Even if the safety tools we’re building right now are useless for aligning a generally capable agent, we’ll have a huge stack of less useful tools and past mistakes to learn from once we get to aligning a true AGI. There is at least some evidence that tools built for a certain safety-related task help us solve other safety problems quicker. For example, my impression is that having already done similar research on computer vision models, Chris Olah’s team at Anthropic was able to make progress on interpreting transformer language models faster than they would have otherwise (I'm not affiliated with Anthropic in any way, though, so please correct me if I'm wrong on this). In the same way, it seems possible that ability to interpret less capable systems really well could help us build interpretability tools for more capable ones.

I would also expect highly theoretical work such as MIRI’s research on the foundations of agency and John Wentworth’s theory of natural abstractions to be useful for aligning an arbitrarily powerful system, provided that these theories have become sufficiently refined by the time we need them. In general, if one buys the argument of Rich Sutton's "The Bitter Lesson", it seems preferable to do safety research on simpler AI models and methods that look like they might generalize better (again, conditional on that it's possible to make progress on the problem quickly enough). Here, Dan Hendrycks and Thomas Woodside give a related overview of the dynamic of creative destruction of old methods in machine learning.

Counterargument 5: AI safety must have become an established field by the time AGI is closer

Hendrycks and Woodside write in their Pragmatic AI Safety sequence:

"Imagine you’re in the late 2000s and care about AI safety. It is very difficult to imagine that you could have developed any techniques or algorithms which would transfer to the present day. However, it might have been possible to develop datasets that would be used far into the future or amass safety researchers which could enable more safety research in the future. For instance, if more people had been focused on safety in 2009, we would likely have many more professors working on it in 2022, which would allow more students to be recruited to work on safety. In general, research ecosystems, safety culture, and datasets survive tsunamis."

Even if we’re pretty clueless about the best directions of research right now, it’s essential that we have a fully functioning field of AI safety and an industry prioritizing a safety-first culture ready by the time we get dangerously close to building an AGI. Since we don’t know when we’ll hit that point of extreme danger, it’s better to start getting ready now. Currently, we don't have enough people working on AI safety, as argued in counterargument 2, and, even though commercial incentives might make it impossible to make safety the principal concern in the field, safety culture can also certainly be improved.

Counterargument 6: Some less capable systems also require safety work

A system doesn’t need to have human-level intelligence to require alignment work. Even if building an AGI system takes a lot longer than expected, safety research can still be useful for improving narrower models. For example, the more we give important decisions concerning policy or our everyday lives to narrow AI systems, the bigger our need to interpret the way they make decisions. This can both help us spot algorithmic bias and get a better overview of the AI system’s considerations behind making some decision. In this way, interpretability research can both help with AGI alignment and with making narrower systems safer and more useful.

I also think that there’s a lot of policy work required to make narrow AIs safe. For example, given the way autonomous weapons can lower the costs of going into wars, good policies around their use are probably vital for keeping the trend of yearly decrease in the number of wars going. And there are probably many other ways alignment work can contribute to making narrow AI systems better that don’t come to my mind at the moment. Sure, none of the problems mentioned under this argument are existential risks, but preventing a single war through successful policies would already save thousands of lives.

[-]Steven Byrnes2y33

You set the article up with an argument along the lines of: “There is no technical work that we can do right now to increase the probability that AGI goes well.”

It seems to me that the only possible counterargument is “Actually, there is technical work that we can do right now to increase the probability that AGI goes well”.

I say that this is only possible counterargument, because how else can you possibly support doing technical AGI safety work? Having a bunch of people doing knowably-useless work—nobody wants that! It won't help anything!

Anyway, I think the evidence for “Actually, there is technical work that we can do right now to increase the probability that AGI goes well” is strong.

For example, we have at least three ways to reason about what AGI will look like—ML systems of today (“Prosaic AGI”), human brains [yay!], and “things that are fundamentally required for a system to qualify as AGI” (e.g. what is the fundamental nature of rational agency, what does “planning” mean, etc.). We also have various more specific existing lines of technical research, and we can try to evaluate whether they might be useful. We also have the possibility of finding new directions for technical research, even if all the existing ones are lousy.

Many of your counterarguments seem to be slightly missing the point to me—not directly addressing the argument at the top. Or maybe I'm misunderstanding.

[-]Rauno Arike2y10

I guess the main reason my arguments are not addressing the argument at the top is that I interpreted Aaronson's and Garfinkel's arguments as "It's highly uncertain whether any of the technical work we can do today will be useful" rather than as "There is no technical work that we can do right now to increase the probability that AGI goes well." I think that it's possible to respond to the former with "Even if it is so and this work really does have a high chance of being useless, there are many good reasons to nevertheless do it," while assuming the latter inevitably leads to the conclusion that one should do something else instead of this knowably-useless work.

My aim with this post was to take an agnostic standpoint towards whether that former argument is true and to argue that even if it is, there are still good reasons to work on AI safety. I chose this framing because I believe that for people new to the field who don't yet know enough about the field to make good guesses about how likely it is that AGI will be similar to ML systems of today or to human brains, it's useful to think about whether it's worth working on AI safety even if the chance that we'll build prosaic or brain-like AGI turns out to be low.

That being said, I could have definitely done a better job writing the post - for example by laying out the claim I'm arguing against more clearly at the start and by connecting argument 4 more directly to the argument that there's a significant chance we'll build a prosaic or brain-like AGI. It might also be that the quotes by Aaronson and Garfinkel convey the argument you thought I'm arguing against rather than what I interpreted them to convey. Thank you for the feedback and for helping me realize the post might have these problems!