This started life as a reaction to a post by Raemon, in which he calls on Anthropic to be more willing to halt its research if it enters a danger zone, and to take the relevance of technical philosophy much more seriously. 

My personal opinion is that at this point, even slowing things down (I mean overall, not at Anthropic specifically) is not going to happen. It remains barely possible that some kind of pause-AI or anti-AI movement will materialize and make a difference, but right now no such force capable of doing so exists, and the accelerationist path is being pursued by competing billion-dollar companies with government backing. So my forecast is that we will go over the edge to superintelligence as quickly as humanity can figure out how to do it. 

Now, although we are in a race, the contenders are not completely blind to the danger of their creations. They don't just make them and hand them the keys to everything. But as Raemon says, they are following an approach of "iterative empiricism" where they keep pressing ahead and hope they can solve all the future challenges along the way. 

But what are their actual attitudes (never mind plans) regarding the time when superintelligence arrives? Here I feel like the best data we have are the vague optimistic essays that a few of them have written. I believe Altman and Amodei have both written quasi-utopian essays about coexistence between humans and matured AIs, and probably Hassabis as well. As for Musk, maybe you could piece something together from his tweets - all I know is that he hopes superintelligent AI will spare us out of "curiosity". Sutskever's Twitter profile during 2023 said "towards a plurality of humanity loving AGIs". I have no idea what they think about this at DeepSeek. 

What do they think in private? One must assume that some of them hope to become as gods by being personally coupled to superintelligent AI. Elon Musk is by far the most visible candidate for such a fate - richest man in the world, embedded in the US government, controlling near-earth space, owning companies that work on humanoid robotics and brain-computer interfaces, as well as having his own frontier AI model. He already has all the pieces; he just needs to win the race to superintelligence. Though it's in the nature of superintelligence that if someone else gets there first, having all those other trappings shouldn't help Musk: the tactical and strategic superiority of superintelligence should be capable of neutralizing all the other contenders if it wishes, no matter what other assets they possess. 

In any case, I think that, in our little fictional explorations of what an imminent singularity would be like, a singularity in which a human or humans are part of the "recursive self-improvement" is a neglected scenario. Someone should write a little fiction about Musk trying to fuse with Grok via Neuralink, and the self-transformations of the resulting "Musk-Grok" entity. 

But back to the thesis of this post by Raemon. The proposition is that there are ways for superintelligence to turn out right and ways for it to turn out wrong - and that the only way for it to turn out right by design may be for humanity to make several decades of progress in areas we would now regard as the domain of technical philosophy. Raemon is focusing on Anthropic, maybe because he thinks that a pause or a pivotal act is our one hope, and that Anthropic is the only frontier AI company that is even close to being that responsible. 

As should be clear, my framing is a bit different. If Anthropic suspends its operations out of caution, that only guarantees that some other company will be first over the edge. But what I want to dig into a little is predicting what is going to happen if we do go over the edge under the current regime of risk-taking, competitive, iterative empiricism. 

A few years back, Eliezer argued for 99+% probability of doom in a situation like this. Part of the argument was the complexity and fragility of human value. The odds of doom are not so clear to me. I don't know what they are; our existing AIs have actually taken on board human concepts and values to some degree, and it is quite conceivable that they will end up alien, but not paperclipper-alien. What I now emphasize, in order to best convey the unacknowledged risk of our current path, is that with overwhelming likelihood it involves the loss of human "sovereignty" - the AIs may or may not kill us, but they will almost certainly be in control. As I said, we've neglected the scenario of human-AI fusion, so that's a kind of fine print - we may end up ruled, not by completely nonhuman AIs, but by something that is still part human, or used to be. 

So perhaps we should be estimating p(AI takeover) and p(cyborg takeover) along with p(doom), and so on. I won't try to put numbers on any of that here. Instead I want to focus on the idea that there are right ways to create superintelligence, but they require intellectual progress that hasn't yet occurred. The position of people who want a multi-decade pause is that the intellectual progress in question involves leaps as great and profound as anything that human thought has ever accomplished - that's why we need a few decades of grace. 

My premise is that this will not happen. We're going over the edge in this decade, and that's just how it is. My interest is: what are the odds of getting it right within that period of time, and what can we do to increase them? 

Again, I'm not going to guesstimate actual probabilities. But I do think that the odds of getting it right may not be negligible, even if "decades of progress" are required. Two factors combine to make this conceivable. One is that we do actually know something about the world: we know a lot about physics, we know quite a lot about computation, and we have a lot of starting points available, even for "hard problems" like consciousness or morality or why anything exists. The other is that we are in a regime in which human-AI collaboration can be very powerful. The latter is what makes it possible for those "decades" to pass in months, for people fortunate enough to have a good enough starting conceptual framework and the means to develop it. 

We know that the run-up to superintelligence should feature many novel phenomena produced by lesser levels of AI - of course, we're already seeing that. All I'm arguing is that the ingredients are there for those novel phenomena to include the dramatic progress required for a friendly singularity to occur by design rather than by mere serendipity. 

If that is true, what can we do to increase the odds of getting it right in time? For me, public work on superalignment (if we use that name for safety progress relevant to superintelligent AI) is very important. The competing AI labs will inevitably keep their most powerful work hidden away, but a public literature on superalignment is something that all of them can see and draw on. That seems to be the most important theory-independent thing I can emphasize. (I suppose I could also mention the desirability of having your best thinkers able to give their best to the solution of these problems.) 
