I don't see how you get default failure without a model. In fact, I don’t see how you get there without the standard model, where an accident means you get a superintelligence with a random goal from an unfriendly prior - but that’s precisely the model that is being contested!
I can kiiinda see default 50-50 as "model free", though I'm not sure if I buy it.
It's unclear to me what it would even mean to get a prediction without a "model". Not sure if you meant to imply that, but I'm not claiming that it makes sense to view AI safety as default-failure in absence of a model (ie in absence of details & reasons to think AI risk is default failure).
If I can make my point a bit more carefully: I don’t think this post successfully surfaces the bits of your model that hypothetical Bob doubts. The claim that “historical accidents are a good reference class for existential catastrophe” is the primary claim at issue. If they were a good reference class, very high risk would obviously be justified, in my view.
Given that your post misses this, I don’t think it succeeds as a defence of high P(doom).
I think a defence of high P(doom) that addresses the issue above would be quite valuable.
Also, for what it’s worth, I treat “I’ve gamed this out a lot and it seems likely to me” as very weak evidence except in domains where I have a track record of successful predictions or proving theorems that match my intuitions. Before I have learned to do either of these things, my intuitions are indeed pretty unreliable!
Yeah, I don't think the arguments in this post on their own should convince you that P(doom) is high if you're skeptical. There's lots to say here that doesn't fit into the post, e.g. an object-level argument for why AI alignment is "default-failure" / "disjunctive".
Here's what I think the "doomers vs accelerationists" crux collapses to.
On real computers built by humans, using real noisy data accessible to humans:
(1) how powerful, in utility terms, will an ASI be?
(2) what advantage will that ASI have over the carefully constrained, stateless ASIs that humans have on their side, ones that cannot tell whether their inputs come from the training set or whether they are currently operating in the real world?
The crux in (1) comes from current empirical observations of power laws, and from just thinking about what intelligence is. It's not magic: for an agent in the real world, intelligence is just a policy between inputs and outputs, with policy updates as part of the cycle.
Obviously the policy cannot operate on more bits of precision than its inputs provide. Obviously it can't emit more bits of precision than the actuator output resolution. This has real-world consequences; see https://www.lesswrong.com/posts/qpgkttrxkvGrH9BRr/superintelligence-is-not-omniscience . And possibly policy quality improves only with the log of compute, and on an increasing number of problems there is zero benefit to a smarter policy.
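One way to make the first of these claims precise (this framing is mine, not the comment's) is the data-processing inequality: if the policy's action A depends on the world state W only through its sensor readings S, so that W → S → A forms a Markov chain, then

$$I(W; A) \;\le\; I(W; S),$$

i.e. no amount of intelligence inside the policy can recover information about the world that the sensors have already discarded.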
For example, on many medical questions, current human knowledge is so noisy and unreliable that the best known policy is a decision tree. The game 'tic-tac-toe' can be solved by a trivial policy, and an ASI will have no advantage at it. Above a certain base level, intelligence gives no benefit on a growing set of problems, and that set grows with the amount of intelligence an agent has.
This is the same principle as Amdahl's law, "the overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used".
So if "improved part" means "above human intelligence", Amdahl's law applies.
The crux in (2) follows from (1). If intelligence has diminishing returns, then you can gain a large fraction of the benefits of increased intelligence with a system substantially stupider than the smartest one you could possibly build.
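To make the Amdahl's-law point concrete, here is a minimal sketch in Python (my illustration; the fraction f and the per-task advantage s are hypothetical parameters, not numbers from the comment):

```python
# Amdahl's law applied to intelligence: if superhuman capability only helps
# on a fraction f of tasks, the overall advantage is capped at 1 / (1 - f),
# no matter how large the per-task advantage s becomes.

def overall_advantage(f: float, s: float) -> float:
    """Overall gain when a fraction f of the workload is improved by a
    factor s and the remaining (1 - f) sees no benefit."""
    return 1.0 / ((1.0 - f) + f / s)

if __name__ == "__main__":
    f = 0.5  # hypothetical: extra intelligence helps on half of all tasks
    for s in (2, 10, 1_000, 1_000_000):
        print(f"per-task advantage {s:>9}x -> overall {overall_advantage(f, s):.2f}x")
    # Even a million-fold per-task advantage yields less than a 2x overall
    # advantage here, because the other half of the tasks are unaffected.
```

The point of the sketch is just the cap 1 / (1 - f): if it holds, a much dumber but constrained system can already capture most of that bounded benefit, which is the claim in (2).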
More empirical data can answer who's right, and if the accelerationists are correct, they will know they were correct for years to come. If the doomers are correct, well.
Your model assumes a lot about the nature of AGI. Sure, if you jump directly to “we’ve created coherent, agential, strategic strong AGI, what happens now?” you end up with a lot of default failure modes. But the cruxes of disagreement are what AGI actually looks like in practice and what the circumstances around its creation are.
Is it Agential? Does it have strategic planning capabilities that it tries to act on in the real world? Current systems don’t look like this.
Is it coherent? Even if it has the capability to strategically plan, is it able to coherently pursue those goals over time? Current systems don’t even have a concept of time, and there is some reason to believe that coherence and intelligence may be inversely correlated.
Do we get successive chances to work on aligning a system? If “AGI” is derived from scaling LLMs and adding cognitive scaffolding, doesn’t it seem highly likely it will be both interpretable and steerable, given its use of natural language and our ability to iterate on failures?
Is “kindness” truly completely orthogonal to intelligence? If there is even a slight positive correlation, the future could look very different. Paul Christiano made an argument about this in a thread recently.
I think part of the challenge is that AGI is a very nebulous term, and presupposing an agential, strategic, coherent AGI involves assuming a lot of steps in between. A lot of the disagreements turn on what the properties of the AGI are, rather than on specific claims about the likelihood of successful alignment. And there seems to be a lot of uncertainty about how this technology actually ends up developing that isn’t accounted for in many of the standard AI x-risk models.
One of the take-home lessons from ChaosGPT and AutoGPT is that there'll likely end up being agential AIs, even if the original AI wasn't particularly agentic.
AutoGPT is an excellent demonstration of the point. Ask someone on this forum 5 years ago whether they thought AGI might be a series of next-token predictors strung together, with modular cognition occurring in English, and they would have called you insane.
Yet if that is how we get something close to AGI, it seems like a best-case scenario, since interpretability is solved by default and you can measure alignment progress very easily.
Reality is weird in very unexpected ways.
To restate what other people have said: the uncertainty is with the assumptions, not with the nature of the world that would result if the assumptions were true.
To analogize: it's like we're imagining that a massive, complex bomb could exist in the future, made out of a hypothesized highly reactive chemical.
The uncertainty that influences p(DOOM) isn't 'maybe the bomb will actually be very easy to defuse,' or 'maybe nobody will touch the bomb and we can just leave it there,' it's 'maybe the chemical isn't manufacturable,' 'maybe the chemical couldn't be stored in the first place,' or 'maybe the chemical just wouldn't be reactive at all.'
So to transfer back from the analogy, you are saying the uncertainty is in "maybe it's not possible to create a God-like AI" and "maybe people won't create a God-like AI" and "maybe a God-like AI won't do anything"?
Another one, corresponding to the chemical in the analogy not being reactive at all, is the possibility that even very strong AIs are fundamentally very easy to align by default, for any number of reasons.
Subtitle: A partial defense of high-confidence AGI doom predictions.
Introduction
Consider these two kinds of accident scenarios: default-success scenarios, in which things go well unless something specific goes wrong, and default-failure scenarios, in which things go wrong unless everything important goes right.
See also: conjunctive vs disjunctive risk scenarios.
Default-success scenarios include most engineering tasks that we have lots of experience with and know how to do well: building bridges, building skyscrapers, etc. Default-failure scenarios, as far as I can tell, come in two kinds: scenarios in which we’re trying to do something for the first time (rocket test launches, prototypes, new technologies) and scenarios in which there is a competent adversary that is trying to break the system, as in computer security.[1]
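To see why uncertainty cuts in opposite directions in the two cases, here is a minimal formalization (the notation and numbers are illustrative assumptions, not from the post): take n independent conditions, the i-th going wrong with probability p_i. In a conjunctive scenario, catastrophe requires all of them to go wrong; in a disjunctive (default-failure) scenario, any one of them going wrong is enough:

$$P(\text{catastrophe}_{\text{conjunctive}}) = \prod_{i=1}^{n} p_i, \qquad P(\text{catastrophe}_{\text{disjunctive}}) = 1 - \prod_{i=1}^{n} (1 - p_i).$$

With n = 5 and each p_i = 0.3, the conjunctive case gives about 0.002 while the disjunctive case gives about 0.83, so spreading your uncertainty over many conditions pushes the two estimates in opposite directions.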
Predictions on AGI risk
In the following, I use P(doom) to refer to the probability of an AGI takeover and / or human extinction due to the development of AGI.
I often encounter the following argument against predictions of AGI catastrophes:
Alice: We seem to be on track to build an AGI smarter than humans. We don’t know how to solve the technical problem of building an AGI we can control, or the political problem of convincing people to not build AGI. Every plausible scenario I’ve ever thought or heard of leads to AGI takeover. In my estimate, P(doom) is [high number].
Bob: I disagree. It’s overconfident to estimate high P(doom). Humans are usually bad at predicting the future, especially when it comes to novel technologies like AGI. When you account for how uncertain your predictions are, your estimate should be at most [low number].
I'm being vague about the numbers because I've seen Bob's argument made in many different situations. In one recent conversation I witnessed, the Bob-Alice split was P(doom) 0.5% vs. ~10%, and in another discussion it was 10% vs. 90%.
My main claim is that Alice and Bob don’t actually disagree about how uncertain or hard to predict the future is. Instead, they disagree about the degree to which AGI risk is default-success vs. default-failure. If AGI risk is (mostly) default-failure, then uncertainty is a reason for pessimism rather than optimism, and Alice is right to predict failure.
In this sense I think Bob is missing the point. Bob claims that Alice is not sufficiently uncertain about her AI predictions, or has not integrated her uncertainty into her estimate well enough. This is not necessarily true; it may just be that Alice’s uncertainty about her reasoning doesn't make her much more optimistic.
Instead of trying to refute Alice from general principles, I think Bob should instead point to concrete reasons for optimism (for example, Bob could say “for reasons A, B, and C it is likely that we can coordinate on not building AGI for the next 40 years and solve alignment in the meantime”).
Uncertainty does not (necessarily) mean you should be more optimistic
Many people are skeptical of the ‘default-failure’ frame, so I'll give a bit more color here by listing some reasons why I think Bob's argument is wrong / unproductive. I won’t go into detail about why AGI risk specifically might be a default-failure scenario; you can find a summary of those arguments in Nate Soares’ post on why AGI ruin is likely.
Some things I’m not saying
This part is me hedging my claims. Feel free to skip if that seems like a boring thing to read.
I don’t personally estimate P(doom) above 90%.
I’m also not saying there are no reasons to be optimistic. I’m claiming that reasons for optimism should usually be concrete arguments about possible ways to avoid doom. For example, Paul Christiano argues for a somewhat lower than 90% P(doom) here, and I think the general shape of his argument makes sense, in contrast to Bob’s above.
I do think there is a correct version of the argument that, if your model says P(outcome) = 0.99, model uncertainty will generally be a reason to update downwards. But I think people already take that into account when stating high P(doom) estimates. Here’s a sketch of a plausible line of reasoning (summarized, and not my numbers, but I reason similarly and I don’t think the numbers are crazy):
I don’t mean to say that this reasoning is the only reasonable version out there. It depends a lot on how likely you think various definitely-useful surprises are, like long timelines to AGI or slow progress after proto-AGI. But I do think it is wrong to call high P(doom) estimates overconfident without offering further, more detailed criticism.
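To make the general shape of such reasoning concrete, here is one minimal mixture calculation (my illustration, with hypothetical numbers): split P(doom) by whether your model of AGI risk is basically right (M) or badly wrong:

$$P(\text{doom}) = P(M)\,P(\text{doom}\mid M) + P(\neg M)\,P(\text{doom}\mid \neg M) = 0.8 \cdot 0.99 + 0.2 \cdot 0.3 \approx 0.85.$$

Even granting a 20% chance that the model is badly wrong, the estimate only falls from 0.99 to roughly 0.85; model uncertainty pulls the number down only as far as the "model is wrong" worlds are actually optimistic.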
Finally, I haven’t given an explicit argument for AGI risk; there’s a lot of that elsewhere.
[1] Note how AGI somehow manages to satisfy both of these criteria at once.