By 2070, it will become possible and financially feasible to build APS (Advanced capability, agentic Planning, Strategic awareness) systems (65%)
For what it's worth, my estimate for APS by 2035 is ~95%. I think there are many gradually accelerating processes underway in AI R&D that have positive feedback loops. Trying to estimate how fast AI R&D will reach a given capability threshold without taking those feedback loops into account will almost certainly produce a huge underestimate of the speed.
I am reviewing the work of AI experts on what they think will happen with AI. This is a summary of Joe Carlsmith’s thoughts from his paper. AI risk scares me, but often I feel pretty disconnected from it. This has helped me think about it.
Here are Carlsmith’s thoughts in brief (he no longer fully endorses these):
Here is an interactive version of his probability tree:
You can see all graphs I’ve done here: https://estimaker.app/ai.
You can watch a video he did here (unrelated to me, I haven't talked to Joe about this):
https://www.youtube.com/watch?v=UbruBnv3pZU
Or 80k have done a similar write-up here:
https://80000hours.org/problem-profiles/artificial-intelligence/#how-likely-is-an-AI-related-catastrophe
Longer explanations
What kind of AI is Carlsmith forecasting? What is APS?
Advanced capability: they outperform the best humans on some set of tasks which when performed at advanced levels grant significant power in today’s world (tasks like scientific research, business/military/political strategy, engineering, and persuasion/manipulation).
Agentic planning: they make and execute plans, in pursuit of objectives, on the basis of models of the world.
Strategic awareness: the models they use in making plans represent with reasonable accuracy the causal upshot of gaining and maintaining power over humans and the real-world environment.
Carlsmith talks about outperforming the best humans in a range of tasks: scientific research, business/military/political strategy, and persuasion/manipulation.
Metaculus requires a 2-hour adversarial Turing test (current AIs may already pass this), 90% mean accuracy across some expert task set, 90% on an interview dataset, and general robotic capabilities.
It seems to me that Carlsmith is just pointing at a much more competent thing, so it’s unsurprising that his confidence on that timescale is lower. Metaculus is currently at 2032 for its version.
I also give some weight to the possibility that available techniques will push towards the development of systems with these properties; and/or that they will emerge as byproducts of making our systems increasingly sophisticated (whether we want them to or not).
The 20% on false, here, comes centrally from the possibility that the combination of agentic planning and strategic awareness isn’t actually that useful or necessary for many tasks—including tasks that intuitively seem like they would require it…
For example, perhaps such tasks will mostly be performed using collections of modular/highly specialized systems that don’t together constitute an APS system.
Practically PS-aligned: Carlsmith calls a system “practically PS-aligned” if it doesn’t engage in misaligned power-seeking on any of the inputs it will in fact receive, as opposed to fully PS-aligned, which requires this on any possible input.
Full text: It will be much harder to develop APS systems that would be practically PS-aligned if deployed, than to develop APS systems that would be practically PS-misaligned if deployed (even if relevant decision-makers don’t know this), but which are at least superficially attractive to deploy anyway (Conditional on previous questions)
However, I find it much harder to think about the difficulty of creating a system that would be practically PS-aligned, if deployed, relative to the difficulty of creating a system that would be practically PS-misaligned, if deployed, but which is still superficially attractive to deploy...
[Carlsmith discusses some areas of the problem]
I expect problems with proxies and search to make controlling objectives harder; and I expect barriers to understanding (along with adversarial dynamics, if they arise pre-deployment) to exacerbate difficulties more generally; but even so, it also seems possible to me that it won’t be “that hard” (by the time we can build APS systems at all) to eliminate many tendencies towards misaligned power-seeking (for example, it seems plausible to me that selecting very strongly against (observable) misaligned power-seeking during training goes a long way), conditional on retaining realistic levels of control over a system’s post-deployment capabilities and circumstances (though how often one can retain this control is a further question).
Beyond this, though, I’m also unsure about the relative difficulty of creating practically PS-aligned systems, vs. creating systems that would be practically PS-misaligned, if deployed, but which are still superficially attractive to deploy. One commonly cited route to this is via a system actively pretending to be more aligned than it is. This seems possible, and predictable in some cases; but it’s also a fairly specific behavior, limited to systems with a particular pattern of incentives (for example, they need to be sufficiently non-myopic to care about getting deployed, there need to be sufficient benefits to deployment, and so on), and whose deception goes undetected. It’s not clear to me how common to expect this to be, especially given that we’ll likely be on the lookout for it. More generally, I expect decision-makers to face various incentives (economic/social backlash, regulation, liability, the threat of personal harm, and so forth) that reduce the attraction of deploying systems whose practical PS-alignment remains significantly uncertain. And absent active/successful deception, I expect default forms of testing to reveal many PS-alignment problems ahead of time.
That said, even absent active/successful deception, there are a variety of other ways that systems that would be practically PS-misaligned if deployed can end up superficially attractive to deploy anyway: for example, because they demonstrate extremely useful/profitable capabilities, and decision-makers are wrong about how well they can predict/control/incentivize the systems in question; and/or because externalities, dysfunctional competitive dynamics, and variations in caution/social responsibility lead to problematic degrees of willingness to knowingly deploy possibly or actually practically PS-misaligned systems (especially with increasingly powerful capabilities, and in increasingly uncontrolled and/or rapidly changing circumstances).
I agree. I am not convinced by people who are really certain in either direction here.
(a) I expect us to have seen a good number of warning shots before we reach really high-impact practical PS-alignment failures, so this premise requires that we haven’t responded to those adequately,
(b) the time-horizons and capabilities of the relevant practically PS-misaligned systems might be limited in various ways, thereby reducing potential damage, and
(c) practical PS-alignment failures on the scale of trillions of dollars (in combination) are major mistakes, which relevant actors will have strong incentives, other things equal, to avoid/prevent (from market pressure, regulation, self-interested and altruistic concern, and so forth).
However, there are a lot of relevant actors in the world, with widely varying degrees of caution and social responsibility, and I currently feel pessimistic about prospects for international coordination (cf. climate change) or adequately internalizing externalities (especially since the biggest costs of PS-misalignment failures are to the long-term future). Conditional on 1-3 above, I expect the less responsible actors to start using APS systems even at the risk of PS-misalignment failure; and I expect there to be pressure on others to do the same, or get left behind.
To expand a bit: trying to cause a trillion dollars of damage might just pose too great a risk when we can instead be paid off in food, human connection, and computer games. Why risk it?
How this falls out in terms of possible worlds
We can’t build APS by 2070 (35%)
Many systems are possible in these worlds, including far more powerful ones than we have now, just not the ones that Carlsmith describes. Perhaps the last few percentage points of human capability are very hard to train, or perhaps LLMs don’t have world models capable of leading them to think strategically in new environments.
We can build APS but we choose not to (13%)
For some reason we choose not to build such systems, maybe because such powerful systems aren’t economically useful to us or because we ban their use.
It isn’t harder to build PS-aligned systems, so we probably deploy them (31%)
Here PS-aligned systems are as easy to build as those that aren’t. This is a great world. I hope we live in it. We build PS-aligned systems and they do what we want. We still have to deal with human misalignment, but at least misaligned humans can’t leverage recursively improving AI toward bad ends.
This means that probably any person could call on systems more powerful than the best humans, with strategic planning. I guess it depends on how much they cost to run, but that would turn any reader into one of the most productive people they currently know. I find it hard to imagine how quickly the world would change (or how much new regulation would be created that only AIs could navigate).
Misaligned systems don’t cause $1 trillion of damage before 2070 (7%)
It’s unclear in such worlds why misaligned systems don’t spin out of control. Perhaps there is some deeper alignment (i.e. the orthogonality thesis is false), smaller crises cause regulation, or APS systems counteract one another.
These worlds are hard for me to imagine.
Huge damage doesn’t lead to permanent disempowerment (8%)
A $1 trillion loss would be a worldwide crisis (a quick Google suggests the 2008 financial crisis caused a comparable loss of growth). I guess that here the crisis causes people to take the risk from APS systems seriously, and then we avoid permanent disempowerment. Alternatively, somehow such systems were never on track to permanently disempower us but just caused massive damage without extinction.
I guess these worlds look like the two sets above, just with an additional massive disaster at the front.
Permanent disempowerment somehow isn’t an existential catastrophe (0.2%)
I find this hard to imagine, but I guess these are worlds where AI keeps us around in a way that isn’t a catastrophe for us. I wonder if Carlsmith would include a sort of benevolent AI dictator in this bucket. Dogs have a pretty good life, right?
Existential catastrophe (5%)
Whelp. Again it’s worth noting that Joe no longer fully endorses this.
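As a sanity check on the numbers above, here is a minimal sketch of how the world probabilities fall out of Carlsmith’s six conditional estimates. The 65% and the 80% (the “20% on false”) appear earlier in this post; the other conditionals (40%, 65%, 40%, 95%) are taken from his report rather than stated here, so treat them as assumptions of the sketch. Small discrepancies with the figures above are rounding.

```python
# Each world corresponds to the first premise in Carlsmith's chain that fails;
# the final world is the one where every premise holds (existential catastrophe).
premises = [
    ("APS systems are possible and financially feasible by 2070", 0.65),
    ("there are strong incentives to build and deploy them", 0.80),
    ("practically PS-aligned systems are much harder to develop", 0.40),
    ("misaligned systems cause >$1 trillion of damage", 0.65),
    ("that damage leads to permanent human disempowerment", 0.40),
    ("permanent disempowerment is an existential catastrophe", 0.95),
]

remaining = 1.0  # probability mass where all premises so far have held
for name, p in premises:
    print(f"World where it's NOT the case that {name}: {remaining * (1 - p):.1%}")
    remaining *= p

print(f"World with an existential catastrophe: {remaining:.1%}")
```

Running this gives roughly 35%, 13%, 31%, 7%, 8%, 0.3%, and 5%, matching the branches above up to rounding.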
How could these overviews be better?
We are still in the early stages, so I appreciate a lot of nitpicky feedback.