If there were a textbook on building safe ASI, with instructions straightforward enough to execute, people would tend to build safe ASI rather than an ASI that causes extinction or disempowerment. Some AI safety efforts could be thought of as contributions to this hypothetical textbook, making it marginally more real.
unsafe ASI is vastly easier to build than controlled ASI, and is on the same tech path
The point of an ASI ban/pause is to create the time to reduce this gap, until it's sufficiently narrow that competent people can walk across without falling through. If an unsafe ASI is artificially delayed despite technological feasibility, there is time to write the textbook, to make safe ASI about as easy to implement. And similarly for some AI safety efforts that don't involve an ASI ban/pause, which attack the gap from the other end.
(The scalable oversight agenda hopes AIs can write the textbook on their own, quickly enough to win the race against the technical feasibility of unsafe ASI. I would feel a lot better about this plan if alignment of Mythos-level systems were pursued for 30 years before going further, and somehow there were a guarantee that Mythos instances won't be founding a country and declaring sovereignty in the meantime. This guarantee gets more believable if there are no Mythos-level systems at all yet.)
I think this comment is making too many simplifying assumptions that will shatter on contact with the real world.
From the point of view of any entity pursuing an ASI project in a world with no global ban, you will always be incentivized to deploy too early and risk destroying the universe.
If you're allowed to run an ASI project, so are others. What does this mean for you?
For one thing...
The point of an ASI ban/pause is to create the time to reduce this gap
You can never know how big this gap is. Perhaps, in a world with much more advanced epistemology, you can get a usable estimate ahead of deployment; perhaps, in a world with much better coordination and strategy, you can gather enough information about it from smaller experiments and deployments without destroying the world.
But these worlds are very different from ours. We can write about them for fun, or as an intellectual exercise, but we should never forget that they are fantasy worlds. Any conclusion that starts from assuming we're in one of these worlds simply does not apply to ours, and we should not mistake this fanfiction for predictions.
From your point of view, you never know how far away you are from building safe ASI, and you should place a substantial amount of probability (an unreasonable one, in terms of risk) on the outcome that, if someone else builds ASI using your state-of-the-art approach, everyone dies.
Do you place absolute trust in all other entities capable of developing ASI to not try? Of course not. So you're going to cut corners.
And secondly...
Building ASI that is safe from your point of view is not just a technical problem. Other entities will have other views. In most cases, if a small group of people (compared to all 8 billion people on earth) gets an ASI that they are satisfied with, most of the world will not endorse the result.
You can see this concretely when US AI people talk about the need to defeat China, or people from one lab talk about the need to defeat another lab. So again, you will naturally cut corners, and then everyone dies.
Assume you have a research agenda that, if executed, results in an ASI-tier powerful software system that you can “control”.
You are making a "logical jump" here, equating "friendly ASI" with "ASI one can control".
But this assumption is a point of well-known contention. There is no consensus that the "control agenda" is the right way to approach this. Many people think that an approach aimed at achieving sustainable control of superintelligent systems by ordinary humans is exactly the path to almost certain ruin, for a number of fairly strong reasons.
(I very much doubt that if we burden the already very difficult problem of creating "a friendly world with ASIs" with the additional requirement of this kind of control, it will still be possible to find a solution. So I'd like to see more studies of approaches not based on this particular kind of control.)
I interpreted "control" here as referring to control over the shape of the ASI, rather than corrigibility (ongoing ability to direct the ASI) or the "control agenda" (bribing or coercing an unaligned ASI not to kill you, somehow).
(Whether or not that's the intended meaning, it is consistent with the argument outlined in OP. Suppose you want to create an ASI that hums with genuine lovingkindness for all sentient beings and wants to, say, make the universe objectively maximally good within the constraint of also making the universe much better from the perspective of human CEV. Or whatever you prefer to specify. Or whichever one of several potential notkilleveryoneism-compatible ASIs is most feasible to build. All of those are strictly narrower targets than just building ASI in general.)
One world where this model doesn't apply, I think, is one where there is a very large moral realist attractor basin that ASIs converge on by default, unless human actors put a lot of effort into making them corrigible to Xi Jinping or Mark Zuckerberg or whomever, which in turn leads to the almost certain ruin you've noted. Here we can be pretty stupid about intelligence aside from how to make more of it, and would have to actively try to fuck ourselves over.
Uncertainty about what the attractor basin is like, though, gets us somewhere similar: we'd rather not create ASI prior to either understanding intelligence well enough to know how to artificially align it to our values, or understanding it well enough to know that it would naturally align.
(I put some nontrivial credence into the idea that there is a large moral realist attractor basin that RL pushes most agents away from, but that's a discussion for another day, I think.)
Yes, this makes a lot of sense.
To me, the main dichotomy is whether we expect a unipolar world controlled by a singleton or whether we expect a multi-polar world with a lot of agents of varying nature and varying capabilities.
I think a lot of considerations point towards a likely multi-polar, diverse world, where the interests of various entities (that have radically different natures and radically different levels of capability) need to be taken into account and protected. And so one needs a system of collective control which does that, and protects various entities from being steamrolled.
The technical aspects of enabling "the world" to constrain a single system from radical misbehavior are somewhat easier in that scenario (since "the world" is collectively very smart). But this is a small subtask of a much more complicated one: figuring out what kinds of invariant properties a self-modifying world of this kind should achieve and reliably maintain, and how to approach the collective task of identifying those invariants and reaching the situation where they are achieved and reliably maintained.
Maybe footnote 1 means that this post is not for me, but I believe that the world can survive the existence of misaligned/unsafe ASI as long as it is dominated (in terms of compute/intelligence) by aligned and safe ASI. See item 6 here: https://windowsontheory.org/2025/01/24/six-thoughts-on-ai-safety/
I think the point of the post is, you can’t actually get to a world that has a dominant safe ASI without having first satisfied one of the listed conditions (absolute secrecy and control, complete technical orthogonality, or a global ban). Otherwise, you get an unsafe ASI first, and then it dominates the world, and since there is no safe ASI to be a defender, we lose.
I suppose you could argue that the early unsafe ASI takes long enough to establish dominance that we could invent safe ASI before the unsafe one finishes dominating. Then, if the safe ASI is stronger, it could complete its domination before the unsafe one does. This requires that:
- there be a significant time lag in domination in the first place,
- and the unsafe ASI is not able to sabotage safe ASI projects before establishing dominance,
- and an ASI built in that time window, under pressure, could be made safe,
- and it can be made stronger than the unsafe ASI is after it does its own self-improvement.
If it can’t be made stronger, then humanity plus a weaker but safe ASI needs to be able to beat an unsafe ASI plus whatever resources it marshaled during the time window (including humans it swayed/blackmailed/etc). Or, at least, to be able to put up enough of a fight to bargain for a significant chunk of the lightcone (supposing that’s an acceptable outcome).
These are all specific requirements, though, which need to be debated on their own merits, if I got the framing right.
My rough heuristic is that intelligence scales with compute, so the crucial condition is that the vast majority of FLOPs are deployed for safe intelligence. A lot of the arguments in the post seem to be about how unsafe or misaligned ASI may leak in one way or another, but this doesn't mean that ASI will have lots of compute at its disposal.
I don't think the post requires the unsafe ASI to breach containment. The frontier AI being run on the labs' own servers could itself be unsafe, successfully scheming without detection. Or it could be corrigible and still be an x/s-risk via humans enacting biorisk, human takeover using the AI, or gradual disempowerment.
Even if a leaked unsafe ASI has extremely constrained compute, it can still mount asymmetric attacks that work within those constraints. How much compute would it need to manipulate someone into (or just help them to) obtain and release smallpox? How much would it need to effectively sabotage safety research or the creation of a safe ASI?
And even if you are actually succeeding at building a safe ASI, it is still very likely to be harder and slower to build than an unsafe one. In the time window between when the leading lab could have built an unsafe ASI and when it finishes building a safe one, its competitors may just go ahead and build the unsafe ASI themselves, letting it run on their servers. And then the unsafe ASI has first-mover advantage, with plenty of compute to spare, and probably wins.
The aligned and safe ASI would have to be actually trying hard to patch vulnerabilities of all the kinds the malicious/misaligned/unsafe AI is trying to attack through. Laudably, this is currently being attempted in cybersecurity, but the jury still seems to be out on the institutional, social, biological, manipulation, epistemic, and market fronts.
Yes, as I wrote in my post, aligned ASIs would need to spend some fraction of their resources improving the defender's side of the offense/defense balance. My main point was that the balance is not infinite, so if aligned resources vastly outnumber misaligned resources, that should be enough.
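To make "not infinite" concrete, here is a minimal sketch of that sufficiency condition; the notation is mine, not from the post or the linked essay. Let $C_a$ and $C_m$ be the compute deployed by aligned and misaligned systems, let $f$ be the fraction of aligned compute actually spent on defense, and let $r$ be a finite upper bound on how much more effective a FLOP of offense is than a FLOP of defense. Then defense holds whenever

$$f \cdot C_a \;\geq\; r \cdot C_m.$$

Since $r$ is assumed finite, any sufficiently large ratio $C_a / C_m$ satisfies the inequality, which is exactly the "aligned resources vastly outnumber misaligned resources" condition.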
Sometimes people make various suggestions that we should simply build "safe" artificial superintelligence (ASI), rather than the presumably "unsafe" kind.[1]
There are various flavors of "safe" that people suggest.
Now, I could argue at length about why this is astronomically harder than people think it is, why their various proposals are almost universally unworkable, and why even attempting this is insanely immoral[2], but that's not the main point I want to make.
Instead, I want to make a simpler point:
Assume you have a research agenda that, if executed, results in an ASI-tier powerful software system that you can “control”.[3]
Punchline: On your way to figuring out how to build controllable ASI, you will have figured out how to build unsafe ASI, because unsafe ASI is vastly easier to build than controlled ASI, and is on the same tech path.
You can’t build a controlled ASI without knowing many, MANY things about intelligence and how to build it.
So this bottlenecks the dual technical problems of "how to find an agenda that results in controllable ASI" and "how to execute on such an agenda" on a third question: even if you had such an agenda, how do you execute it without building unsafe ASI along the way, whether accidentally or because some asshole leaves the project or reads your papers?
No one I know pursuing agendas of this type has answers to these questions. And let's be crystal clear: this is the fundamental question any sensible "safe ASI" project needs to answer before even being worth considering.
You would need to either have:
- absolute secrecy and control over the project and everyone who learns about its results,
- complete technical orthogonality, so that the path to controllable ASI never passes through the capability to build unsafe ASI,
- or a global ASI ban with powerful enforcement already in place.
Since the first two are not realistically achievable, this means that the primary prerequisite to even considering starting to work on a safe ASI plan is to have a global ASI ban and powerful enforcement already in place.[4]
I’m assuming you already accept that “unsafe” ASI would be really, really bad. If not, this is not the post for you to read.
In short: if you unilaterally try to build ASI, you are directly and openly threatening the world with violent conquest. This is sometimes called a "pivotal action", which is code for "(insanely violent) unilateral action that forces the world into a state I think is good."
For some hopefully meaningful definition of the word “control”
This is the rationale behind proposals such as MAGIC.