I think there is something good about making a post that stands on its own like this, but I also think it's useful to directly link to a bunch of direct quotes from people who said the kinds of things this post is arguing against. So here are some I remember:
Paul Christiano, in “Where I agree and disagree with Eliezer”:
“Eliezer often equivocates between ‘you have to get alignment right on the first ‘critical’ try’ and ‘you can’t learn anything about alignment from experimentation and failures before the critical try.’”
[...]
“But we will be able to learn a lot about alignment from experiments and trial and error; I think we can get a lot of feedback about what works and deploy more traditional R&D methodology.”
Sam Marks, commenting on Paul’s post:
“Eliezer’s ‘first critical try’ framing downplays the importance of trial-and-error with non-critical tries.”
[...]
“Deceptive behavior may arise from AI systems before they are able to competently deceive us, giving us some chances to iterate.”
Joe Carlsmith, in “On first critical tries in AI alignment”:
“In AI alignment, you do still get to learn from non-existential failures.”
[...]
...“you might catch AIs attempting to take over, and learn fr
the things I link here do indeed all strawman the core thing in this post
I found this surprising. Why do you think this? All of these posts/comments seemed pretty reasonable to me. I don't see how they are strawmanning the point in this post? Edit: I think I understand what habryka meant, see here.
It seems like the view across all of these is "there is a first critical try, but we can learn from experimentation before" and/or "if misalignment of type X emerges late, then maybe we can use earlier AIs to get lots done (or possibly hand off successfully) while if misalignment of type X emerges early, we can study it (which might transfer through to the relevant regime or might not, but this is a quantitative question where details matter)". I don't really see how this post is even clearly arguing against these points? Like, it's got to be a quantitative disagreement about how much transfer we're talking about and I don't think this post makes arguments that could pin down the relevant quantitative details (about e.g. the level of transfer in the relevant regimes) for AI.
(I tend to think the situation is also messier to discuss because most of the hope routes through effectively handi...
I found this surprising. Why do you think this? All of these posts/comments seemed pretty reasonable to me. I don't see how they are strawmanning the point in this post?
So, take Paul's quote, where he suggests that Eliezer sometimes says that "you can't learn anything about alignment from experimentation and failures before the critical try." I think Eliezer doesn't say this? I think it's possible to read Eliezer as saying this, or for his previous framings to make it harder to rule out that interpretation. But like with the WWI -> WWII example, the question is not whether you learn anything about alignment but whether you learn enough about alignment, and I think Eliezer has always been focused on the question of "enough" and swapping that out for "anything" is a central example of strawmanning.
Sam's and Joe's examples seem to be in the same vein. If Alice asks "will we sell enough paintings to cover rent this month?" and Bob responds "Alice is downplaying the importance of how earning revenue allows us to pay rent", it is clear that Bob has made some mistake here. The question is how the numbers compare, not whether or not there's a mechanism by which learning will work.
I actu...
Ok, I think I understand the point now: Paul and Sam Marks are both talking about what Eliezer is saying in list of lethalities and the thing they say about his perspective/framing isn't faithful to the description he gives in this post about irretrievability. So, they'd be strawmanning this post if these comments were a response to this post.
I don't see how Joe and Buck are strawmanning. (Joe isn't really even talking about what Eliezer thinks and it sounds like you and others agree Buck isn't strawmanning.)
I'm less sure Paul and Sam Marks are strawmanning Eliezer in general or strawmanning List of Lethalities.
Paul says:
Eliezer often equivocates between ‘you have to get alignment right on the first ‘critical’ try’ and ‘you can’t learn anything about alignment from experimentation and failures before the critical try.’
IMO, the description in list of lethalities mostly doesn't equivocate between these (though it does it a bit), but my cached understanding is that Eliezer does often seem to equivocate between ‘you have to get alignment right on the first ‘critical’ try’ and ‘it's very hard to learn much about alignment from experimentation and failures before the critical try’ ...
Lest the exegesis of my old comment continue, I'm happy to clarify my object-level view. I think that:
My guess is that they would implicitly consider this post to be motte-and-bailey-ing, but do strawman the position in this post (if this post is in fact the best representation of Eliezer's position).
In my opinion, this post is not actually making many hard claims. I mostly view it as gesturing at the existence of really difficult problems and presenting historical analogies. It argues that it is possible for problems to be very hard, even if they have a bunch of other nice properties, including the nice properties people attribute to the AI problem. However, I feel like arguing that AI safety could be an incredibly hard problem is a way less extreme position than the one Eliezer seems to actually hold.
However, I feel like arguing that AI safety could be an incredibly hard problem is a way less extreme position than the one Eliezer seems to actually hold.
I mean, my take on this is that around two decades ago Eliezer thought AI safety could be an incredibly hard problem, and then spent a lot of time checking, and now has lots of reasons to believe that it is an incredibly hard problem, and those reasons are spelled out elsewhere, with this post just trying to point at the problem of irretrievability.
Sorry, yes, they (almost)[1] all say[2]:
"Eliezer, you said we couldn't learn from experimentation! But we, the enlightened few, in contrast to those people trying to derive guaranteed conclusions from logical principles, understand that empiricism is a thing, and your concept of 'critical first try' is only harmful and misleads people about our ability to iterate and learn from earlier failures".
That is, as I understand it, one of the core points of this post. The concept of a first critical try is not in contrast to empirical iteration. "Please, why do people keep bringing it up as if it conflicts with it. It's a different point. Can you please stop sliding off of this point and just acknowledge it instead of trying to respond to this weird other strawman every time?".
I am most confused about Buck's exchange, where it feels to me like Buck is kind of making a non-sequitur and Eliezer is also being weirdly dense about the point Buck is making. My guess is that something kind of similar is going on, but I wouldn't quite put it in the same bucket.
Of course, greatly exaggerated for rhetorical effect, to reduce ambiguity and introduce levity.
Look, if anyone is strawmanning and being condescending, it is obviously you and Eliezer. Which I don't think is that big a deal but it is frustrating that you are accusing people of being condescending in such a condescending manner.
Edit:
To expand on this more:
Eliezer believes that good theory would help a lot with aligning AI on the first critical try (despite not being sufficient or completely necessary) while believing that iteration without theory won't help that much (because the problem of aligning stupider less capable AIs just doesn't apply that well to aligning superintelligence).
It's annoying that I felt it necessary to put the parentheticals in there, because if I didn't I feel like I was going to be accused of strawmanning.
In any case, in contrast you can imagine someone who believes that theory will not help a lot, but that iteration will. I don't think putting forward such a view is strawmanning.
Eliezer believes that good theory would help a lot with aligning AI on the first critical try (despite not being sufficient or completely necessary) while believing that iteration without theory won't help that much (because the problem of aligning stupider less capable AIs just doesn't apply that well to aligning superintelligence).
Look, he might believe that, or he might not. I just don't think this post, or the general argument about "first critical try" is about that.
I am not saying everyone is strawmanning everything about Eliezer. People totally have valid arguments about the difficulty of alignment, and the value of empirical iteration, and of course hundreds of other aspects of the AI-risk situation, but on the specific narrow point of "you only get one critical try", people seem to repeatedly want to make it into a different strawmanned point, and then respond to that. Acknowledging this does not need to involve conceding any major kind of argument. It's really not a complicated point. We don't need to tie ourselves up in these knots.
You can then argue with Eliezer about whether this point is sufficient for high risk from AI (which is some of what this post is about), but...
I disagree because neither of them seems to somehow admit the first-critical-try nature of the problem into their subsequent arguments (in the relevant context). But I agree it's tricky and I am not saying it's obviously what's going on (that's why I call that part my "personal opinion" and have it in a footnote).
In any case, this post should be a welcome exposition to everyone involved since it makes it much harder for Eliezer to equivocate between the two. If Eliezer now says "getting it right on the first critical try means we can't learn anything about alignment from experimentation" then you get to link to this post and say "no, you said right here, yourself, that this is not what you mean, please cut it out". So even if you think Eliezer equivocated in the past, this post should help with that (this doesn't mean it doesn't make sense to litigate whether in the past equivocation happened, like, in as much as it did happen I think it would be good to hold Eliezer or others accountable for that, and if someone wants to provide receipts, I think that would be a reasonable thing to do).
When drawing these examples of alleged strawmen, we must remember that they are not responding to this 2026 post, but rather responding to, for example, List of Lethalities from June 2022. Of these four examples, Christiano, Marks, and Carlsmith are all directly responding to List of Lethalities. Buck is quoting Christiano's response to List of Lethalities. So let's go back to the source material.
List of Lethalities begins with this disclaimer:
Having failed to solve this problem in any good way, I now give up and solve it poorly with a poorly organized list of individual rants. I'm not particularly happy with this list; the alternative was publishing nothing, and publishing this seems marginally more dignified.
Publishing a poorly organized list of individual rants was better than publishing nothing, I agree, good move. But rants are made of straw, responding to rants is responding to straw, and that's a natural consequence of ranting in public.
The "first critical try" issue is covered in List of Lethalities point 3 (LL3). This reads in part:
...We can gather all sorts of information beforehand from less powerful systems that will not kill us if we screw up operating them; but on
I discussed this a bit with Oliver and I think I understand better what the objection is.
I think the substantive disagreement is whether the actual empirical iteration you get on the current trajectory is a big enough deal that you believe that alignment is difficult (say, p(doom on current trajectory) > 80%) just due to oneshotness.
However many of the quotes above are instead saying something to the effect of "Eliezer thinks that empirical iteration is unimportant / provides no alignment-relevant info". This is in fact a different thing; one can consistently believe both (1) alignment is very difficult and won't be solved given the amount and quality of empirical iteration we get by default, and (2) empirical iteration is incredibly valuable.
And of course Eliezer believes both (1) and (2); (1) is just a statement of his most prominently known view, and for (2) the value of empirical iteration is blatantly obvious and it would be shocking if Eliezer disagreed with it.
I do pretty strongly disagree with the psychologizing that Eliezer does in the post if that is supposed to apply to the authors of the quotes above (as opposed to e.g. randos on Twitter), e.g.
...The opposing faction is
Here's my brief off the cuff attempt to synthesize:
I agree with all of the concerns you've stated; my list would be substantially longer, but you've well-stated the concerns you've stated.
Nice. I'll probably rework this comment eventually into a top-level post or something similar; if you jot down some bullet points here of additional concerns to add to the list, I'll consider incorporating them!
Yes. Much of my remaining hope lies in various forms of interpretability including mechanistic. It can convert a critical failure into just a regular failure, by catching things going off the rails before it's too late.
Perfect as enemy of the good etc; if useful I'm happy to commit some 20 man hours by EA Serbia senior members who I would trust in this and who have experience in either writing or game design to do the clean up and then send to you for review.
The opposing faction is forced to that effort of hurried misinterpretation, because the actual impact of the concept of "we only get one shot at ASI" is so devastating to their position, in the presence of even a slight understanding of why there might be any noteworthy engineering difficulties whatsoever.
Honest question for anyone who agrees with this post: is there any extinction problem at all where you'd say we don't only have one shot to solve the problem? If so, why?
Consider a few examples:
1. A giant asteroid is hurtling toward the planet, and will arrive very soon. If we mess up and fail to deflect the asteroid, then we all die. This is presumably a classic one-shot scenario, and perhaps few people disagree with that assessment, but I'm not sure.
2. Global warming, if continued for a very, very long time, could heat up the planet to catastrophic levels and eliminate the viability of agriculture, killing everyone. Do we also only get one shot to avoid extinction here?
3. Genetically engineered humans, if made much smarter than ordinary humans, and if they are accidentally created as psychopaths, could conceivably coordinate a genocide against ordinary humans. Is this a one-shot...
I would describe a critical try as one where the act of trying is likely to prevent further attempts. Launching an ASI is a critical try because the ASI itself could likely stop you from launching more ASIs later on (e.g. by killing you).
If it's possible to send out missions to intercept the asteroid before it arrives, then it seems to me that the asteroid is better understood as a time limit than as a critical try. You could set the parameters of the asteroid scenario in such a way that you have time for exactly one try, but you could also set the parameters so that you have time to send up a mission to deflect the asteroid, observe its results, and then make a second try before the asteroid arrives. You could also set the parameters such that you have time for zero tries! The key consideration is how fast you can work vs how much time you have.
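To make the "time limit vs. critical try" distinction concrete, here is a minimal sketch (not from the comment itself). The function name and the month figures are hypothetical, and it assumes every attempt takes the same fixed time to launch and observe before the next one can start.

```python
def tries_before_deadline(time_available: int, time_per_try: int) -> int:
    """Count the complete launch-and-observe cycles that fit before the deadline.

    Both arguments are in the same (hypothetical) unit, e.g. months.
    """
    if time_per_try <= 0:
        raise ValueError("time_per_try must be positive")
    if time_available < 0:
        raise ValueError("time_available must be non-negative")
    # Integer division: a partially finished cycle does not count as a try.
    return time_available // time_per_try


# Hypothetical parameter settings for the same asteroid scenario.
print(tries_before_deadline(120, 48))   # time for a try, its results, and a second try
print(tries_before_deadline(120, 120))  # time for exactly one try
print(tries_before_deadline(6, 48))     # time for zero tries
```

With these made-up numbers, the same asteroid allows two tries, one try, or none, depending only on how fast you can work relative to how much time you have.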
Contrariwise, if you assume that you are stopping the asteroid with a shield that is close to the earth, such that no matter how fast you build the shield you have to wait for the asteroid to arrive before you can see how well it works, then I'd call that a critical try, because the part of the plan where you wait for the asteroid to arrive...
1 - Yep.
2 - Hard for literally all humanity to die of global warming, but runaway methane clathrate release turning the planet into Venus would be legit irretrievable. More generally, while not extinction risk per se, and while potentially reversible with geoengineering, global warming is generally nontrivial to reverse and so has the quality of "ongoing life problem with things happening and no save points, but for the whole planet" rather than "engineers getting to try slightly different things over and over with no consequences". This is why people with nothing even worse to worry about will sometimes worry about global warming!
3 - I think this class of problems is significantly easier than AI problems; but it can have the oneshot quality for all humanity, just as much as any real-war is oneshot, if screwed up. Same with genetic engineering on any mass scale that will dissolve irretrievably into the general gene pool.
"I think you're distinguishing War from the ongoing struggle between police and criminals because you think that in War any pre-critical evidence we can gather to be almost inevitably sufficiently out of distribution that it's worth very little."
No! The thing that makes the Maginot Line different from police enforcement in a random city is that if the Maginot Line fails the country falls and you don't get to try again; not that War is changing much faster than criminal operations. War changes fast enough.
(Speaking for myself of course)
But what counts as a first try?
A given try is more firsty if it's less like all previous tries put together. A one-shot problem is one where you try a pretty firsty try, and that try is likely to kill humanity.
How many previous tries are in the same class (e.g. small asteroid / big asteroid, or GPT2 / GPTN) is relevant in that a priori more such tries might suggest that future tries are less firsty. But it's also perfectly plausible a priori to have lots of tries that you survive, and then a one-shot problem (lethal and very firsty).
You could even have a series of one-shot problems. Imagine for example that you have a lethal asteroid--but you saw it 10 years in advance, and it's small enough that you can stop it with nukes. It's one-shot (lethal and firsty), but maybe you survive. Then you have another similar asteroid, but you only saw it 6 months in advance--that might be another one-shot problem (do it all again but way faster). Then you have an asteroid that's so big, all the nukes in the world wouldn't stop it. One-shot again; there's totally new, crucial challenges to solve.
I think that with AI, you very likely get a one-shot problem in the ballpark of superhuman AGI. It's lethal, in that it would by default go on and extinct humanity, and very firsty, in that many core alignment difficulties first show up there.
I do generally find myself more concerned about genetic engineering for extreme intelligence than other people in this cluster seem to be.
Tangent of course, but happy to discuss, whether in private or on a podcast.
There was an actual theory of why the Chernobyl reactor was supposed to not explode, written down so that multiple people could read it, based on an understanding from first principles!
More specifically (and I don't think it's known outside of the Russian nuclear engineering-adjacent community), at least two people independently calculated and described in classified technical reports how RBMK could explode in the specific circumstances it actually exploded, and because the technical solution implemented after 1986 was at the time deemed too expensive for such a risk, the manuals strictly prohibited letting the reactor get close to these circumstances. However, the control system didn't display a key value, the so-called operative reactivity margin, that the operators needed to know to catch the moment when they might break the instructions: instead, it had to be calculated on a computer in a separate building (AFAIK, it's debated to this day what the exact value was at scram).
P. S.
An analogy I came up with after writing this comment is the following: imagine a BEV which might blow up if the driver hits a brake in a narrow, uncommon range of battery voltages, and the instruction specifica...
See my other comment in this thread for actual AI alignment thoughts, but as a former aerospace engineer myself (albeit not a very good one), I thought it would be fun to speculate on "Would such a cheerful innocent ever succeed at landing a space probe? It seems crazy to assign them a real-world success probability as high as 10%."
In every war, both attack and defense have “oneshotness”. But obviously, one of the sides of a war can and often does succeed. In the OP, Germany’s “oneshot” Maginot line plan wound up working great!
(I’m not sure exactly what OP means by “curse”. Wars have “oneshotness” but are not particularly “cursed” if there’s a 50% chance of success on priors.)
So, I think the relevant factors that make it hard are mainly
(Plus numerous other factors outside the scope of this post.)
Of these, (1) is discussed in the OP. (“Someone could, conceivably, argue that the change to "there being enough machine superintelligence around that ASI could kill humanity if they tried", from "AIs being experimented-upon that couldn't ki...
I enjoyed and agreed with much of this post. But there were 1-2 things that I eagerly anticipated reading about in the "Q&A" / explainer section, which unfortunately didn't appear in the actual post. Namely:
If someone wants to someday want to understand what you sometimes do with math besides... turning the math into exact code... ...prove AI mathematically safe; which again, to be clear, is not a kind of thing that math can do in principle...
I want to push back against this some. (I'm not sure whether I'm arguing with the actual Yudkowsky, or with a plausible misinterpretation of Yudkowsky, but it seems worth saying either way.)
Some things with which I agree:
However, it is also true that:
There might be a relatively innocuous reason for SOME of the misunderstandings?
I have found in the past that I cannot use phrases like "oneshot" or "you only get one try" around most so-called "AI safety" people outside of MIRI.
To be clear, a Congressional representative or staffer or national security professional will often immediately understand what is being said, if they haven't been previously contaminated by misinterpretations and straw positions. It's mainly AI companies, some AI professionals with fewer citations than Yoshua Bengio, OpenPhil-funded groups, etcetera, who manage to be unable to hear what is being said.
This struck me as being a case where the problem might be that "oneshot" is a word that means a lot of things to a lot of people in technical contexts?
For example, in Machine Learning, "oneshot learning for task X" occurs when a model that wasn't trained on task X is able to be shown ONE EXAMPLE of how to do task X, and then it gets task X right pretty much just from that. (If the model wasn't trained for task X but simply can do it from nothing but a request to do it, the model has "zeroshotted" the task, and "fewshot" is when you might need to give the model a ...
This is a control theory problem obscured by terminology like "oneshotness".
I interpret the phenomenon EY is gesturing at as a stability margin failure. That is, a system going off course at a rate that exceeds a controller's ability to correct. Most of the disagreement is not about this model at a high level, but about how the interaction dynamics play out and what levels of uncertainty to apply.
Controlling the Viking failed immediately upon losing the only correction channel. The control rate going to zero means game over.
The Mars Observer failed slowly as vapor accumulated over 11 months with no sensor detecting it as a problem. Zero control rate for a different reason. This time, the drift off course wasn't even observed until too late.
The Maginot Line failed because France was miscalibrated on both rates. They assumed the Germans would advance ("off course") more slowly and that their mobilization ("correction") would be faster.
ASI fits the pattern but has increased levels of cursedness affecting both rates. An AI can act faster than humans can observe and respond, interfere with corrective mechanisms, and obfuscate observability (e.g., sandbagging and playing the training gam...
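As a rough illustration of the stability-margin framing in this comment, here is a toy sketch. It is an assumption-laden simplification, not anything the comment specifies: error accumulates at a constant `drift_rate`, the controller removes up to `correction_rate` per step, and failure means the error exceeds a fixed `margin`. All names and numbers are hypothetical, and the toy does not model observability at all.

```python
from typing import Optional


def steps_until_failure(
    drift_rate: float,
    correction_rate: float,
    margin: float,
    max_steps: int = 1000,
) -> Optional[int]:
    """Return the first step at which error exceeds `margin`, or None if it never does.

    Toy model: each step the system drifts off course by `drift_rate`,
    then the controller removes up to `correction_rate` of the error.
    """
    error = 0.0
    for step in range(1, max_steps + 1):
        error += drift_rate
        error = max(0.0, error - correction_rate)  # the controller cannot overshoot below zero
        if error > margin:
            return step
    return None


# Correction channel lost entirely (control rate of zero): failure is only a matter of time.
print(steps_until_failure(drift_rate=1.0, correction_rate=0.0, margin=5.0))
# Correction outpaces drift: the error never leaves the margin.
print(steps_until_failure(drift_rate=1.0, correction_rate=2.0, margin=5.0))
# Miscalibrated on both rates: drift is faster and correction slower than planned.
print(steps_until_failure(drift_rate=3.0, correction_rate=1.0, margin=5.0))
```

In this sketch the outcome depends only on which rate is larger and how wide the margin is, which is where the comment locates most of the disagreement.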
The defeat of the Maginot Line is somewhat misunderstood in general (but not in ways that undermine this argument). German technical overmatch played a significant role. There were two plans for defeating it. The first is best detailed in Adm. McRaven's 1993 master's thesis on the theory of special operations: https://www.afsoc.af.mil/Portals/86/documents/history/AFD-051228-021.pdf
The fortress of Eben Emael in Belgium was the hardest part of the line. It had artillery, built into bunkers, pointed at a key bridge. The Germans invented a man-portable explosive that could destroy the bunkers, and trained glider-borne forces that could take the fortress by surprise. The Germans succeeded and drove across the bridge.
If the Germans had failed in their attempt to take the fortress, their backup plan was a direct assault on the Maginot line using shells filled with Chlorine Trifluoride to set the concrete on fire. https://www.chemeurope.com/en/encyclopedia/Chlorine_trifluoride.html
In terms of the overall thesis, I think it persuades in the opposite of the intended direction. A lot of political challenges are like this, whether it's the environment, certain construction projects, or pas...
I read this not knowing Eliezer had written it. I thought it was someone trying to imitate his style, and I kept thinking "Man, this style is off-putting" and "this could be edited to be half the length, if not less".
I have probably read everything Eliezer has written, including the amazing Mad Investor Chaos. Eliezer is the most important influence on my thinking. But the prose here is so unnecessarily condescending and at times somewhat precious, like the way things are named ("disaster monkeys", "Very Serious Engineer", "the great seriousness of a decent engineer", etc). So much of this post is loaded with judgment or pettiness.
The post feels rushed and closer to an unedited rant. Also, repetitive of other posts that made the point more succinctly (one-shotting is hard, people really don't grasp how hard).
This is the type of writing that will turn off most readers who are not already convinced.
Does the "curse of oneshotness" apply to the unaligned AGI/ASI attempting a takeover? If no, why? If yes, does that imply the first AI takeover attempt would probably fail, thus seemingly contradicting the applicability of "oneshotness" to humanity developing ASI?
As in my other comment, winning a war has “oneshotness” but is not especially hard or “cursed”, in the sense that you just need to botch it less than the other side which also has “oneshotness”.
(Actually, it’s worse than that, because I for one am very skeptical that a failed AI takeover attempt would in fact lead to some durable prevention of future AI takeover attempts.)
This post reflects a popular misunderstanding of the Maginot Line. I don't think that this fatally undermines the argument, but it still seems worth correcting.
Epistemic status: I am not a military historian, so I am deferring to military historians who write publicly, rather than looking at the academic literature or (even better) the original sources.
Here's Bret Devereaux (emphasis in original):
As an aside, the purpose of the Maginot Line was to channel any attack at France through the Low Countries where it could be met head on with the flanks of the French defense anchored on the line to the right and the channel to the left. At this purpose, it succeeded; the failure was that the French army proceeded to lose the battle in the field. It was not the fixed fortifications, but the maneuvering field army which failed in its mission. One can argue that the French under-invested in that field army (though I’d argue the problem was as much doctrine than investment), but you can’t argue that the Maginot Line didn’t accomplish its goals – the problem is that those goals didn’t lead to victory.
And here's r/askhistorians:
...The Maginot line was a series of defensive emplacement built by Fra
The issue the Allied forces encountered was that German forces also attacked through the Ardennes forest south of the main Allied forces and through rapid movement broke through weak parts of the line and encircled the Allied armies.
This is the issue that turned up in war games, was counterargued and disregarded by high command, and which was sufficient to lose France the war. No?
Curated. A lot could be said about this post and a lot is being said about this post (see the 111 comments on it), but a thing I find neat about it is that it continues the debate around an old idea that's still important and perhaps cruxy for many.
Oneshotness is not new (not a new concept, not a new argument, not a new debate); yet it is a critical argument for alignment difficulty, and is still contested, with implications for what people do. Having made many attempts to explain their side, I could imagine someone giving up and being like "the people who...
I'd like to offer my perspective on why this enlightening post, written in Eliezer's wonderful, super-clear style, can't possibly eliminate motivated thinking about this one-shot problem.
In my opinion, people can't see any realistic alternative solution. They see that the alternative is even worse, but maybe they can't articulate it clearly, even to themselves. Or they just refuse to express it out loud. The fact is that the proposed AI ban (or AI pause) is also a one-shot problem.
How can an AI ban be implemented technically? Not legislatively, but in prac...
try introducing it as the Irretrievability Problem rather than "oneshotness"
I have had success (talking with e.g. MPs and civil servants) discussing 'unpluggability' and its opposite, 'un-unpluggability'.
The battery-software update accidentally overwrote the antenna-pointing software.
With the lander's antenna no longer pointed at the orbiter, no further software updates beyond that point could be received.
Out of curiosity, do we know more about how this particular mistake happened? When issuing a software update, not overwriting any critical part of the code would seem fairly high on my list of concerns. It seems like this sort of mistake should have been caught by integration or unit tests, or by testing that all components worked on a replica on earth. Were they under a lot of time pressure or something?
the CEOs of AI companies have been filtered to not be people who get that
I'd like to go on a bit of a tangent since I don't recall seeing this line of thinking here on LW: are we sure that this is the case, that they think that what they are doing is reasonably safe? This seems to me to be stranger than the case where they know it's not safe, and perhaps not safe at all. What I mean is that they surely are very motivated and they have strong incentives in pursuing AGI, but they don't strike me as... stupid/incompetent?
Maybe they think that this is for the ...
Dario Amodei thinks that real alignment theory is about "monomaniacal" AI and is therefore refuted by LLMs, which want more than one thing. If Amodei was the sort of person to let himself hear or understand better than that, he wouldn't have his job.
I have no reason to believe any other AI company executive has even heard that much about any alignment theory which leans toward slight pessimism, with the possible exception of Shane Legg at Deepmind.
We don't have to defeat Germany by getting some ideal mathematical theory of war correct on the first try, based on zero experience, the way you advocate we should.
- Sheer naked strawmanning[3] of what is being said when somebody tries to warn you of Murphy's curses upon your project; the chattering of disaster monkeys.
It seems to me like you often accuse people of strawmanning you when in fact they are paraphrasing your position accurately if slightly uncharitably, in a way that makes discourse with you difficult.
In this case, I think you absolutely do be...
I share your confusion at the idea of "mathematically proving AI safe" haha. This convo has made me realize I've conflated alignment pessimists in general with the provably safe AI people in particular too much in my mind.
If failure of alignment -> schemer that wants to seize power, then ASI alignment is one shot.
But if failure of alignment -> non-schemer misalignment (eg reward hacking, or flailing misgeneralisation), then failure isn't existential
So I think p(scheming | alignment failure) is a crux here
maybe you should try introducing it as the Irretrievability Problem rather than “oneshotness”
I'd like to suggest "Ironman Mode" (or whatever its best-known synonym is) as possibly memetically useful here. It refers to a difficulty modifier in certain games that prevents the equivalent of saying "oops" and restoring to 1929. Mistakes become permanent, at least for that playthrough. The term isn't a perfect match, because you can try again, but only by starting from scratch, nethack-style.
("roguelike" was once a similar concept, but has been badly diluted...
This is why one shouldn't argue on the X-parrot, and why David Brin's "disputation arenas" proposal includes a step in which each party must, to the satisfaction of the other and/or the judges, explain the other side's argument in their own words to prove that they actually understand it.
Hey Eliezer, I am a big fan of your work. I did have one idea about this and would be interested in hearing your thoughts. I liked your example about the Maginot line, if one German division crosses into France, you do not have time to build a new better Maginot line before the next division arrives. This is both true and pretty funny. Here, the reason why you cannot rely on the "gradualness of the problem" as a source of hope is that the intervention technique (building the Maginot line) takes orders of magnitude more time to implement than the developing...
In a cosmic scope, everything is almost never oneshot in its nature - in the narrowing scope of a thing in question, it's almost always oneshot (our daily lives being somewhere in the middle of these macro and micro scales) - and the severity of the scope is a wobbly line drawn somewhere in its spectrum of worst case scenarios arranged by diminishing probability. For ASI, even the most lenient line is at an astoundingly high level of probability. It's only not oneshot in the fact that a future non-human civilization can try again after learning from our outcomes.
Personally, I blindly and desperately hope that the first ASI that we engineer considers humanity as part of its self, which I do recognize is hoping for the first launched solar sail to open and catch a solar flare with perfect timing...
I’m afraid to admit that I am one of the people who do not understand the point being made other than arguing against boundary experimentation.
there was an actual theory of why the Chernobyl reactor was supposed to not explode,
Is this, like, available online? What is it called?
I think the notion of "one-shot-ness" introduces a counterproductive dichotomy. A lot of people, even those aligned with the general AI-risk position of the writer, react to this framing as downplaying or dismissing the role of empirical research, trials with smaller AIs, etc. (Oliver's quotes as evidence). I think it doesn't necessitate strawmanning to reply in this way! And I found confusion around "Yudkowsky wants us to have perfect theoretical understanding before we try anything" reasonable, even if it misrepresents the writer's position.
Indeed, "one-shot-...
The opposing faction is forced to that effort of hurried misinterpretation, because the actual impact of the concept of “we only get one shot at ASI” is so devastating to their position, in the presence of even a slight understanding of why there might be any noteworthy engineering difficulties whatsoever.
Or it's the writer's fault and calling it "one shot" is just a bad choice of words, when it being correct depends on specific decomposition into shots and "irretrievability" is better. Like, people are forced to say "there’ll be multiple first critical...
Ah I guess a problem is a one shot problem in respect to a goal if it is possible that there will be an issue that arises from the attempt that makes the goal no longer possible? Would that kinda mean that in some way every problem is one shot? Material and time spent on failing to make a shoe is not recoverable, though the severity of the loss is probably low.
But also for the first probe example, the attempt at the software update seems like a second attempt where the first would be to have designed the craft to not have had the problem
I probably am one of the dumbs that don't grasp the concept
This seems like an excellent argument to pause somewhere at about the AGI level, where a mistake is likely to give us a competent sentient computer virus and/or a new criminal organization and/or a nation of rather inconvenient competent people in a data-center: problems large enough to act as warning shots and learning exercises, but not actually wipe us out or permanently disempower us.
However, that would of course require:
a) actual willingness and capability to pause, globally (including China), and also
b) correct judgement of whether the likely resulti...
Is there going to be a follow up giving the other side of the argument: that the development of ASI actually will be a one-shot enterprise?
ASI safety would have to be invented quite often after the first ASIs. Even the best man-made self-stabilizing systems are not very good at surviving random events. ASI ecosystems affect pretty much everything on this planet, so there is plenty of room for a random event. And they couple so many dimensions, including time and space, and across their scales, it seems this is a textbook recipe for instability.
Right, it is not that there is a first critical try. It is that even if you passed the first critical try, and got an aligned AI/ or a restrained AI system, there would be a second, and a third, and a fourth, and while you might have resources from the previous tries, you need the probability of each individual event to shrink faster than 1/n in possible events to never have one event if you cannot stop yourself from taking events, and probabilities of each event are independent, by the divergence of the harmonic series.
So for any outcome that could be a c...
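The harmonic-series point in the comment above can be written out as a short sketch, under that comment's own assumption that the events are independent. The symbols p_n (the chance that the n-th critical event goes catastrophically wrong) and c (a constant) are notation introduced here for illustration and do not come from the comment itself.

```latex
% Probability of getting through the first N critical events with no catastrophe,
% assuming independent events with failure probabilities p_n:
P(\text{no catastrophe in the first } N \text{ events}) = \prod_{n=1}^{N} (1 - p_n)

% If the per-event risk only shrinks like p_n = c/n with 0 < c \le 1,
% then, using 1 - x \le e^{-x}:
\prod_{n=1}^{N} \left(1 - \frac{c}{n}\right)
  \le \exp\!\left(-c \sum_{n=1}^{N} \frac{1}{n}\right)
  \longrightarrow 0 \quad \text{as } N \to \infty,
% because the harmonic series \sum 1/n diverges.

% More generally, for 0 \le p_n < 1:
\prod_{n=1}^{\infty} (1 - p_n) > 0
  \iff \sum_{n=1}^{\infty} p_n < \infty
```

Read this way, "shrinking faster than 1/n" is best understood as "shrinking fast enough that the risks sum to something finite": for example p_n = c/n^2 would qualify, while p_n proportional to 1/(n log n) shrinks faster than 1/n and still has a divergent sum.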
Example 1: The Viking 1 lander
In the 1970s, NASA sent a pair of probes to Mars, the Viking 1 and Viking 2 missions. Total cost of $1B (1970), equivalent to about $7B (2025). The Viking 1 probe operated on Mars's surface for six years, before its battery began to seriously degrade.
One might have thought a battery problem like that would spell the irrevocable end of the mission. The probe had already launched and was now on Mars, very far away and out of reach of any human technician's fixing fingers. Was it not inevitable, then, that if any kind of technical problem were to be discovered long after the space launch in August 1975, nothing could possibly be done?
But the foresightful engineers of the Viking 1 probe had devised a plan for just this class of eventuality, which they had foreseen in general, if not in exact specifics. They had built the Viking 1 probe to accept software updates by radio receiver, transmitted from Earth.
On November 11, 1982, Earth sent an update to the Viking 1 lander's software, intended to make sure the battery only discharged down to a minimum voltage level, rather than running for a fixed time after each charge.
The battery-software update accidentally overwrote the antenna-pointing software.
With the lander's antenna no longer pointed at the orbiter, no further software updates beyond that point could be received.
The error had destroyed the intended mechanism for recovering from errors.
All contact with the Viking 1 lander was permanently lost. Ground engineers tried some strategies for regaining contact, based on extrapolation of where the antenna could have ended up pointing, but none succeeded.
In this I observe a specific instance of a general idea: Murphy's Curse of Inaccessibility on space probes is a deep problem. A clever system designed in the hope of accepting later patches is a relatively shallower solution.
Putting wings on an airplane doesn't make it weightless and repeal the law of gravity. The weight of an airplane is an intrinsic property that goes on making it susceptible to falling out of the sky if the wings stop working. Your model of the airplane should include the ongoing weight and ongoing lift; not, argue that the curse of airplane-weight will be dispelled by wings.
The engineers' attempted strategy for mitigating the underlying oneshot quality of a space probe launch -- the engineers' intended mechanism for correcting mistakes afterwards -- did not actually transform the Viking 1 lander into an Earth-bound car that you could walk over to and fix. Any sort of problem that struck at the corrective machinery itself, would catapult you right back into the fundamental inaccessibility scenario, that you couldn't just walk over and fix a broken corrective mechanism. (And also, of course, large classes of possible error can't be addressed by a software update at all.) The underlying reality was that the probe stayed far and high away.
Rocket science wouldn't be so famously cursed by Murphy's Law, if the heightened susceptibility conditions for Murphy's Law to act upon their projects, were so easy to defeat with a little effort. The many Curses of Murphy upon aerospace engineering can be fought; but not vanquished, not dispelled.
Good aerospace engineers know that. So they put in the extreme levels of paranoia and preparation that are required to sometimes succeed.
One can only imagine what tiny fraction of space probe missions would succeed if the engineers or managers were the sort who went around bragging, "Our space probe won't be inaccessible after launch at all; we built in an antenna to upload software updates! Don't listen to those silly people who'll tell you that you 'can't walk over and fix' a space probe after launch; they lack our own experience to have had the brilliant idea of software updates!"
Would such a cheerful innocent ever succeed at landing a space probe? It seems crazy to assign them a real-world success probability as high as 10%. Aerospace engineers have to work much harder than that, be much more paranoid and cautious than that, to drive the success chance significantly above 0.
Example 2: The Mars Observer
The Mars Observer mission was approved in October of 1984 and launched in September of 1992, at a cost of $813 million ($2B 2025). It flew through space for 330 days, and then, three days before inserting into Mars orbit, communication with the probe was lost.
The best-guess postmortem analysis: After the earlier stresses of launch, and an 11-month flight through vacuum, a PTFE check valve had leaked fuel and oxidizer vapors that accumulated within feed lines in the zero gravity; and this produced an explosion when the engine was restarted (for course correction before orbital insertion).
That sort of thing happens, when you try to do something for the first time. One of the reasons why space probes are famously acutely susceptible to Murphy's Law, is that each new probe gets custom-built for a new mission. Each mission is a chance for something new to go wrong.
Now imagine some manager or space enthusiast saying -- in advance of the actual disaster, of course -- "The Mars Observer mission isn't novel! We can test the probe here on Earth in a vacuum chamber! We have experience from previous space probes! We have the whole mighty edifice of science to observe the laws of physics; and we can use those laws to extrapolate the system behavior of the Mars Observer probe on its way to Mars!"
To enumerate the object-level reasons this doesn't repeal Murphy's Curse of Novelty upon space-probe missions generally, or the Mars Observer specifically:
- Even if humanity did in fact learn something from previous space probes, previous probes weren't exactly like the Mars Observer. The ultimate novelty of the mission was not defeated, repealed, nor averted.
The Mars Observer might have failed even earlier, if it had been attempted with even less experience. It's not that all previous learning had no effect. But humanity had not learned enough, generalized correctly and with sufficient reliability, from those earlier nonidentical space probes, to get the new and different Mars Observer to Mars.
- Even if somebody had spun around the probe in a centrifuge to simulate a high-G space launch, and tested out all the systems in a vacuum chamber, that still would not have faithfully reproduced the exact conditions under which fuel vapor leaked in vacuum and then accumulated over eleven months in zero gravity. The conditions of validation here on Earth would not have been exactly like the deployment conditions.
This is why you can't get solid guarantees on space probes using mathematically valid statistics. The training distribution is not the deployment distribution, and that takes all mathematical guarantees and throws them out the window. Since aerospace engineers aren't wacky lunatics, they know this, and none of them have ever even tried to suggest that any kind of mathematical guarantee could apply.
Real life has many such cases. Much more mundanely than space probes, there's no way to use clever statistical guarantees to force an ordinary human conversation to go well, because no two conversations are the same and they're not sampled from time-invarying distributions.
- Humanity's grasp of chemistry and physics -- built by generalizing mathematically simple laws, over a genuinely vast body of observations -- and then applied to molecules and gases literally identical to the molecules and gases in the Mars Observer -- as put together in straightforwardly-mechanical processes themselves observed repeatedly, vastly simple by comparison with eg large computer programs or biochemistry -- was not actually adequate to predict and control the mission outcome.
Knowing all that physics did not negate the underlying surprisingness of a system of even that small complexity. It did not transform it into a mere repetition of previous operations on identical titanium alloys.
This, again, doesn't mean that humanity's knowledge of science and physics contributed zero help to the Mars Observer mission. That space mission would've failed much earlier and harder -- it is difficult even to imagine the counterfactual -- if humanity's grasp of the underlying science had been more akin to medieval alchemists pontificating about the philosophical significances of reagents, with every leading alchemist making up their own brilliant plan for a space mission where several steps involved metaphysical principles of great uplifting moral significance.
Humanity's understanding of the underlying mechanical processes was how the Mars Observer mission came close to succeeding -- in a way that medieval alchemy never came close to an immortality potion, or even to the far simpler goal of transforming lead into gold.
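The earlier remark that "the training distribution is not the deployment distribution" can be illustrated with a toy numerical sketch. Everything in it is hypothetical: `true_behaviour` is a made-up response curve, not a model of any real spacecraft component, and the two input ranges merely stand in for "conditions tested on the ground" and "conditions met in flight". The sketch only shows that a model which validates well inside the tested range can be badly wrong outside it.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b, in pure Python."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def max_abs_error(a, b, xs, true_fn):
    """Worst-case gap between the fitted line and the true behaviour on xs."""
    return max(abs((a * x + b) - true_fn(x)) for x in xs)


def true_behaviour(x):
    """Hypothetical system: almost linear for small x, strongly nonlinear later."""
    return x + 0.05 * x ** 3


# Hypothetical "ground test" conditions: inputs between 0 and 1.
ground_test_xs = [i / 100 for i in range(101)]
# Hypothetical "deployment" conditions: inputs between 10 and 11.
deployment_xs = [10 + i / 100 for i in range(101)]

# Fit (and validate) only on the ground-test range.
a, b = fit_line(ground_test_xs, [true_behaviour(x) for x in ground_test_xs])

ground_error = max_abs_error(a, b, ground_test_xs, true_behaviour)
deployment_error = max_abs_error(a, b, deployment_xs, true_behaviour)

print(f"worst error under test conditions:       {ground_error:.4f}")
print(f"worst error under deployment conditions: {deployment_error:.4f}")
```

The fitted line looks excellent on every input it was ever checked against, and nothing in those checks reveals how large the error becomes once the inputs move out of range.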
To sum up: Even (1) learning from previous space probes, (2) testing under controlled conditions attemptedly similar to conditions in space, (3) knowing all relevant fundamental physical laws exactly[1], (4) having an excellent quantitative grasp of relatively simple higher-level phenomena that governed fully, and (5) doing NASA-standard amounts of intensive thinking, gaming, and simulation about what could go wrong with a billion-dollar project, did not repeal Murphy's Curse of Novelty upon space probes. It was, in the end, still the very first Mars Observer mission.
NASA's efforts at understanding could challenge that Curse of Novelty, in a way that no alchemist's philosophizing could have challenged it, even if the alchemist had managed to grasp one or two rules-of-thumb. The people at NASA who put together the Mars Observer mission over many years of careful planning for that exact mission, had a level of professionalism, engineering caution, background scientific knowledge, specific preparation time, and general seriousness, vastly exceeding the professionalism of any alchemist or AGI company executive.
...Which didn't actually repeal all of the Murphian curses upon space probes. It wasn't enough for the Mars Observer to actually work.
The genuinely very serious people at NASA put up enough of a fight that the Mars Observer almost worked. Even the RBMK design for the Chernobyl nuclear reactor almost worked; it worked for many operation-years before one exploded! Despite the Soviet managers taking a few Disaster Stances that put a ceiling on the maximum socially allowed level of pessimism, the Soviet nuclear engineers knew vastly more and took their jobs vastly more seriously than medieval alchemists or modern AGI companies. There was an actual theory of why the Chernobyl reactor was supposed to not explode, written down so that multiple people could read it, based on an understanding from first principles! They had written handbooks in the 24/7 control room, and the written handbooks weren't just made up to look better!
It's just that to have a Murphy-cursed project actually really work in real life, rather than almost work, is very much harder.
(Though again to be clear, professionalism cannot magically make just any project almost-work. You could not give even the genuinely serious people at 1970s NASA the goal of building a contagious virus that conferred de-aging and indefinite biological healthspan, and have the resulting virus almost-work. The level of difficulty for "make an immortality virus" would be beyond what serious people could almost-do in 1970. Part of being serious is having some sense of a project's cursedness level, and not being a lunatic about what you try to do at high stakes.)
Example 3: The Maginot Line
In September 1939, Germany invaded Poland; this is usually the date given for the start of World War 2, though there were other preludes and signs before.
In May of 1940, Germany attacked France.
France thought they were ready.
France had foresightfully, starting in 1929 eleven years before the moment of crisis, already built the Maginot Line: a hugely expensive network of defensive fortifications along most of France's borders. Those fortifications would have trivially been defensively victorious in World War 1, if they'd been built before World War 1. The Maginot forts were supplied by underground railways, to make their supply lines harder to cut. They had the usual stockpiles of food and ammunition. The forts even had air conditioning -- a startling and expensive luxury for a military fort in 1940, but very much the sort of thing that soldiers had wished they'd had in World War 1.
France had learned the hard lessons of experience and prior battles in World War 1, and generalized them to the future!
Being so expensive, the Maginot Line did not cover literally all French borders. It did cover their borders with other countries that Germany might invade first to get at France, not just the border with Germany; the French military was trying to be thorough. But there were still some carefully-reasoned gaps. For example, France figured that the heavily-forested Ardennes wouldn't be easy to pass; France figured that any German invasion via the Ardennes would be slowed by dense forest terrain, and then further slowed by attacks from French aircraft. France figured it would take Germany at least 3 days, and more probably a week, to make it through the Ardennes; which, according to the calculations of the French military command, would give the French plenty of time to rush their own troops into position along that border, in the unlikely event Germany tried that doomed tactic.
The Maginot Line was there to stop sudden attacks leading to sudden victories; to prevent Germany from winning before France could move up its own troops in reply.
Germany invaded through the Ardennes. The Nazis put some careful work and organization into cutting through the terrain quickly. They put up enough of a Luftwaffe screen to prevent their troops from being bombed while that got done.
France fell.
After which France said "oops", and restored from a savepoint in 1929, at the start of when they'd begun to build the Maginot Line. On their second try, France extended their defenses to cover the Ardennes...
Just kidding! In real life, France had fallen, period; the Nazis took the country and held it through the major part of World War 2.
In a serious war -- war for the survival of your country, rather than war as the Sport of Kings -- you only get one try.
"Gambler's ruin" is the mathematical term for what happens to a betting strategy that bets everything; your bankroll can reach zero, and then you have nothing left to bet again. "Murphy's Curse of Ruin", I would say by analogy, is upon the sort of project where, if you fail sufficiently hard, you don't get to try again.
A lot of real life is like that, of course. There's no do-overs in most startups. There's no do-overs in ordinary high-stakes human conversations. We can only outline the curse of Ruin in our mental vision, by contrasting it to the stranger case of an engineer luxuriously getting to build another toaster if their first toaster design has a flaw; or the programmer's luxury of getting to rewrite a line of code and run the program again.
Engineers would be able to do a lot less, if they only got one try. Human programmers would do much much less, if they could only compile once.
How many tries you get, in practice, makes a HUGE difference in how tractable any project in life or engineering actually is.
It's harder, having only one try, in life or war.
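As a minimal sketch of the "bets everything" version of gambler's ruin described above: if each round stakes the whole bankroll and is won with probability `p_win`, a single loss ends the game, so the chance of still being in play after `rounds` rounds is `p_win ** rounds`. The function names, the seed, and the example probabilities below are assumptions chosen for illustration; they are not taken from the post.

```python
import random


def survival_probability(p_win, rounds):
    """Exact chance of never losing when every round stakes the whole bankroll."""
    return p_win ** rounds


def simulate_all_in(p_win, rounds, trials, seed=0):
    """Monte Carlo estimate of the same quantity, seeded for repeatability."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(trials):
        # One lost round means ruin: there is no bankroll left to bet again.
        if all(rng.random() < p_win for _ in range(rounds)):
            survived += 1
    return survived / trials


if __name__ == "__main__":
    for p_win, rounds in [(0.9, 10), (0.99, 100), (0.99, 500)]:
        exact = survival_probability(p_win, rounds)
        estimate = simulate_all_in(p_win, rounds, trials=20_000)
        print(f"p_win={p_win}, rounds={rounds}: "
              f"exact={exact:.4f}, simulated={estimate:.4f}")
```

Even a 99% per-round success rate leaves well under a 1% chance of surviving 500 all-in rounds, which is the contrast with the toaster designer or programmer who simply gets to try again after a failure.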
Other supposed refutations of oneshotness
Now imagine it's 1929, ten years before World War 2, around the time that construction of Maginot forts began. Imagine that somebody in a conversation in the high halls of French government says -- meaning it as a straightforward truism -- that they'll only get one chance to get this "Maginot Line" business right, because you only get one shot in War.
Try to imagine -- it will take a bit of a stretch, because even in 1929 France, the high military is made up of mostly sorta serious people -- try to imagine the higher French military officials shooting down this pessimistic nay-sayer, loftily proclaiming:
"What do you mean, we get 'only one try' at correctly conducting a war with Germany? What is all this nonsense talk of 'oneshotness'? There will be many cases where French soldiers clash with German soldiers, and our country doesn't get defeated as soon as one French soldier falls. We get lots of tries at killing German soldiers! We reject your theory that the war will be settled in a single oneshot battle after all of the German soldiers teleport directly into Paris. We have experience from fighting World War 1, as well as numerous smaller fights between police officers and criminals; when the next war starts, it won't be unprecedented or novel at all. And we'll get to learn more as Germany invades, which will be a continuous rather than instantaneous process where one battalion crosses the border, and then another battalion. You say that if we lose, we'll be conquered and won't get to try again? Maybe we can get Russia to invade Germany and have our enemies cancel each other out; in which case France would not be conquered, and we would get many tries at fighting the next war over and over. Maybe we can kidnap a German infant and raise him with a habit of answering questions, and get him to tell us how to defeat Germany, at some point when he's old enough to predict how Germany will behave later, but still too young to think of lying; if improved German capabilities can help us defeat Germany, there can be no sudden shift in any balance of military power. We can try lots of things, really! We don't have to defeat Germany by getting some ideal mathematical theory of war correct on the first try, based on zero experience, the way you advocate we should; the problem of War is not single-try at all!"
What you'd mainly say, of course, is that the speaker here is being motivatedly oblivious. They would be assuming a Disaster Stance that goes well beyond the level of motivated optimism in eg Chernobyl, or actual 1930s France when they ignored some war-game results suggesting that Germany could perhaps come through the Ardennes.
After hearing lines of dialogue like that, one should stop considering the speakers as serious people with a few horrendous flaws. It is a point past which I start using phrases like "disaster monkeys".
But to nonetheless dissect all of the fallacies above:
- A larger war can be oneshot even if, zooming into a small-enough scale, you can find some local instances of conflict on which the larger war does not fully depend.
-- The problem "successfully fund and launch a shoe-company startup that becomes a commercial success" is oneshot even though "make a good athletic shoe" is not. You get multiple tries at designing or putting together a good athletic shoe. You get one shot at the startup.[2]
- Formally it is a "fallacy of composition" to see a big strategic problem extended over time, and note that it is made up of some parts where errors are not locally fatal, and conclude that the bigger thing is therefore not oneshot.
-- The startup's oneshotness is a property of the entire big-deal project extended over time, not a property of every single interaction along the way being globally fatal. So to point to one local interaction where failure is not globally fatal, does not dispel the Curse of Oneshotness over the larger global problem.
- There were no doubt some errors in the Mars Observer probe that were recoverable, and successfully recovered, up before the point where the probe was lost. The larger project was still oneshot, from the perspective of a manager or scientist staking some portion of their career on it. (It was obviously not oneshot from the perspective of larger humanity; failure didn't kill your parents, so you're still here to hear about it.)
- Being able to imagine a version of War that would be even more oneshot, does not change the way that actual War is still pretty oneshot. In particular, the speaker here is imagining an even more Murphy-cursed version of War, more subject to Murphy's Curse of Rapidity, where France would get even less chance to learn and react. But that events did not happen infinitely fast, did not save France, because Germany still made it through the Ardennes fast enough. And that incident was fatal enough, that whatever lesson France learned from that, came too late to save the rest of their war.
- These problems were not drawn from an identical distribution to World War 2. What French generals fancied themselves to have Learned From Experience was part of the problem, indeed, because they acquired confident wrong beliefs.
- The Mars Observer probe didn't teleport to Mars, yet was still lost. Things can go wrong even when they're physically continuous.
- For battalions to cross the border one after another, in a physically continuous process, does not mean that France is blessed with adequate time to observe the first battalion emerging from the Ardennes, learn the real laws of World War 2, and then rebuild the Maginot Line correctly, before the next battalion emerges from the Ardennes.
- A project can be said to have an underlying Curse of Ruin, that contributes to the sum of its Murphian susceptibilities, whenever a sufficiently major disaster would be sufficiently fatal. Thinking you have a clever plan to not be ruined, is proposing to try to lift against this weight, not to cancel it; putting wings on an aircraft doesn't repeal the law of gravity.
- The fractal difficulties of this proposal would require their own post.
- Having a Clever Plan like this doesn't negate any Murphian curses, nor change the oneshotness of the larger war.
- So far as France knew, they'd tried several things including building the Maginot Line, reforming their military around the valuable lessons of World War 1, making advance plans to deal with probable German invasions, etcetera. So far as France knew, all those things were going to work great. But then those things didn't work, and then the war was over.
- All the many things France tried, collectively formed a single shot with respect to Murphy's Curse of Oneshotness. They did not get another shot after that.
- Sheer naked strawmanning[3] of what is being said when somebody tries to warn you of Murphy's curses upon your project; the chattering of disaster monkeys.
On the extraordinary efforts put forth to misinterpret the idea of oneshotness
Without a whole article like this one to hammer home exactly what I mean, I have found in the past that I cannot use phrases like "oneshot" or "you only get one try" around most so-called "AI safety" people outside of MIRI.
To be clear, a Congressional representative or staffer or national security professional will often immediately understand what is being said, if they haven't been previously contaminated by misinterpretations and straw positions. It's mainly AI companies, some AI professionals with fewer citations than Yoshua Bengio, OpenPhil-funded groups, etcetera, who manage to be unable to hear what is being said.
But the phrase "one shot" can be misunderstood with nearly probability 1, relative to the amount of effort that some people can, will, and have put forth to mishear it; and more importantly, misrepresent it in further debate.
So in conversations where there is a pre-poisoned fool hanging around, maybe you should try introducing it as the Irretrievability Problem rather than "oneshotness" and then the fool will have a harder time misinterpreting that in the very next sentence, because it will be harder for them to forcibly remap the word onto a strawman, maybe? I haven't tried out that new tack yet.
The opposing faction is forced to that effort of hurried misinterpretation, because the actual impact of the concept of "we only get one shot at ASI" is so devastating to their position, in the presence of even a slight understanding of why there might be any noteworthy engineering difficulties whatsoever. A oneshot extinction problem that nobody understands very well is horrifying in the presence of any familiarity with the actual human history of engineers trying to do things that are hard to predict exactly -- let alone pre-engineers trying to do things that are hardly understood at all.[4] Once you've grasped three or four of the fundamental obstacles to ASI alignment[5] and three or four of the Murphian curses upon the field[6], you realize that of course ASI alignment is not the sort of thing that has ever before in the history of the world been done correctly on the first outing --
"Aha," somebody now immediately interrupts me, "but we don't have to get it right on the first outing! We can build smaller AIs and observe how they--"
The first 'outing' in the same sense that France's macro-scale attempt to fight in World War 2 was their second 'outing' in fighting against Germany, and their first outing at fighting with WW2 technology; not in the sense that any particular battle of that war is an outing.
ASI alignment is the sort of matter where, historically speaking in the totally normal and usual course of science as it has always been previously observed, there's initially all sorts of wacky ideas for how to do the ill-understood thing, and the first dozen ideas prove to fail under load --
"Yes, which is why it will be important to test those ideas on smaller AIs!"
Here 'fail under load' is trying to point to the way that the Maginot Line failed when Germany actually invaded and the contest was run for real, without the Maginot Line having particularly admitted any invasions before then. 'Fail under load', as in how the Mars Observer mission failed when actually launched, whatever ground-level tests it had passed before then, and whatever NASA's earlier attempts at simulation or careful thinking had turned up and already fixed before the launch. Lots of things that appear to work under lighter loads will fail under a heavier load.
"But we don't have to get it all correct based on pure theory, like you say is possible and say we should do --"
Sheer motivated misrepresentation, of a separate argument that isn't even being made here in the first place; see footnote 3 if you want to expose yourself to a frustrated rant about this.
"-- because, contrary to anything you imagine to be possible, our experience of earlier AIs can inform our models of superintelligence --"
One of the several fundamental difficulties of ASI alignment is that your theory of how to survive AI that is smart enough to kill you -- your theory of how to survive when there is a quantity of machine capability around that can kill you, if it turns against you, if something goes wrong -- has to be successfully generalized only from experiments that don't kill everyone on Earth if they fail. Meaning that you are experimenting on less powerful and less capable AI; which AI, if it reasons correctly, will not estimate that it can kill you -- among many other changes of conditions, shifts of distributions, between the safe mode of survivable experiment and the potentially lethal test environment.
This giant historically unprecedented problem has many ordinary-world valid analogies. Like how you can't determine if someone is trustworthy to handle a billion dollars by seeing how they handle ten dollars, even if it's in fact the same person and they're not getting much smarter, because they can think intelligently about whether it's a good time to steal the money. Or like a Greek city-state whose philosophers are arguing that the city could appoint a trustworthy dictator by watching how some boy acts as a child, and seeing if they act virtuously (knowing they're being watched). Conditions change, because the boy's brain is not what it will be when the boy grows up; nor are the conditions of an appointed city-dictator who is in fact being trusted, the same as the conditions of being a boy (who is being watched, and getting whacked by currently-bigger entities when he misbehaves). These facets of the problem are not on the same horrifically unprecedented level as "use your own thoughts to anticipate an alien much smarter than you", but they are very normal cases of why you can't solve dangerous problems just by taking a bunch of safe samples from a different distribution. The distribution is inherently different not least because it is safe. Problems like this are, in a very mundane way, why it is also a big deal to figure out who you can trust with a billion dollars, and why we don't just do a few experiments on the trustee and then generalize from those.
Someone could, conceivably, argue that the change to "there being enough machine superintelligence around that ASIs could kill humanity if they tried", from "AIs being experimented-upon that couldn't kill us if they tried", will be less than the sort of change from "the sort of tests you can do on a Mars Observer probe on Earth", to "the actual conditions of the probe being launched and flying through space"; or the change from "ordinary operating conditions at Chernobyl" to "running a safety test of the backup cooling system at Chernobyl".
But that would be an argument so incredibly stupid that it might actually sound stupid when they thought about saying it. Of course the jump to actual empowered superintelligence is going to make a bunch of differences much much huger than the NASA difference between "actual space travel" and "artificial test chambers intended to simulate those clearly-understood conditions on Earth". Among other issues, AI-brewing alchemists understand the cognition inside AIs far more poorly than NASA physicists understood the conditions in space -- current AIs, never mind superintelligences! But also, in a very ordinary way, there just isn't a nonlethal way to test out lethal levels of superintelligence. Just like you can't test somebody's suitability to be city dictator by having philosophers follow around watching how they behave as a kid who knows he's being watched[7]; just like you can't make sure somebody can be trusted with a billion dollars by loaning them ten dollars.
You could argue that the jump to "enough superintelligence around to actually kill us" will change less from previously observed conditions with already-deployed or safely-lab-testable AIs, than the act of actually sending the Mars Observer into space changed from NASA lab tests and simulations.
But people might perhaps disagree with you, if you tried to argue that explicitly.
So instead, the warning, "You only get one real shot at real ASI, and if you screw up everyone is dead and you don't get to try again," gets outrageously strawmanned and misinterpreted as "ASI would win instantaneously because FOOM", or "humanity should attempt to learn zero things by looking at earlier AIs and do everything based on theory", because the actual argument the ASI-survivableists need to make is a less attractive PR battleground.
"Aha, but as you've clearly never considered, we can have more than one ASI; and then if one ASI goes rogue, the other ASIs will stop it for fear of disrupting the orderly law-abiding equilibrium that we started out the ASIs inside; and therefore everyone will not be dead, and we will get to try again!"
If that whole clever scheme goes wrong, everyone is dead and you don't get to try again. I am not even arguing right now all the reasons why the clever scheme is doomed.[8] I am trying to explain why it is not a rejoinder that refutes, "ASI alignment is under Murphy's Curse of Oneshotness."
"Aha, but I can imagine some possible mistakes with superintelligence that would not wipe out humanity!"
Cool! You would have fit right in with a much much less serious version of France's top generals in 1929, if someone had argued that military strategy wasn't a one-shot sort of life problem, because they could imagine a possible mistake they could make with the Maginot Line that would not lose the whole war.
The core idea here is frankly not that complicated. A lot of people get it correctly and immediately. The thing being said is simple and an obvious default expectation when dealing with something vastly smarter than humanity: that is a lethal level of danger if something should happen to maybe possibly go wrong -- YES A SUFFICIENTLY SEVERE THING, YES YOU CAN IMAGINE A NONSEVERE ERROR, NO THAT DOESN'T CHANGE THE CORE IDEA, JESUS CHRIST.
Someone could conceivably try to argue against that really quite simple warning. But it takes a great motivated psychology to be unable to hear which idea is being argued; and manage to misinterpret every historical example, every ordinary everyday-life analogy, and every abstract explanation. Not in the sense of disputing their relevance, but in the sense of inability to repeat back which idea is being argued.
If not for this incredible effort at mishearing and misrepresenting the ideas, I could've just said, "Humanity only gets one shot at getting machine superintelligence right," and anybody who understood the everyday idea of crashing and burning in a big important conversation with someone, and not getting a do-over because no time travel, would've been able to understand the very ordinary core of what was being communicated.
The secret sauce of competent engineers in Murphy-cursed fields: only trying projects so incredibly straightforward as to be actually possible.
Above all else, the reason why Very Serious Engineers sometimes succeed even at slightly cursed problems with no cheap do-overs, is that they have a sense from both theory and practice about which problems are so incredibly ludicrously easy as to actually be solvable.
Go to a nuclear engineer and say, "Build me a reactor that runs off 2% enriched uranium, but the only neutron-absorber you're allowed to use is plain water, no boron or cadmium or hafnium." The nuclear engineer will say back "No, because that is a dumb idea.[9]"
Go to an aerospace engineer and ask them to make an ultra-contagious virus that rewrites human genomes to confer de-aging and biological immortality -- but safely and reliably, using their same Very Serious Methodology that they use for launching space probes that succeed more often than not. The aerospace engineer will laugh, and then, if you seem actually serious, perhaps try to explain like you are five: "I can't do that because science doesn't have a good-enough theory of what a completed immortality virus would look like."
"I can't use the same process that builds space probes that sometimes work, to build you an immortality virus at all, let alone a safe one," says the aerospace engineer. "Because the base resource that a space probe project starts with, is an idea that science strongly implies would work for known straightforward reasons, if nothing surprising happened instead. The incredibly difficult job that takes all the very serious organizational process -- and still only works most of the time -- is having those nice ideas that ought to work in very straightforward ways for very well-understood reasons, actually work all the way to where a probe sends back data from Mars. We don't have that for an immortality virus, so we can't get past step zero of the very serious and safe methodology."
And we understand what goes wrong with the human body during aging, much MUCH better than we understand what goes on inside LLM cognition. We could get correspondingly closer to success, if we tried telling an aerospace engineer to use NASA's assurance processes to build an immortality virus, rather than telling them to build a safe superintelligence.
But mostly what that very serious process would tell you, is that what you have made is not a safe immortality virus, and you should not try to make it very contagious and infect the Earth's population with it.
And the great seriousness of a decent engineer would manifest in this way: that so vast would be their understanding of their own limitations, that they wouldn't need to infect most of humanity with a highly contagious virus that then surprisingly didn't work exactly as they'd hoped, in order to learn to their vast surprise and dismay that building an immortality virus was more than trivially difficult. They would know it even in advance of killing a dozen suicide volunteers or a hundred monkeys! They'd see that incredible surprising shocking unexpected plot twist coming in advance of it actually happening.
So nobody like that would start doing a biology project aimed at making a contagious de-aging virus to the great benefit of all human beings. They'd know that was an overly cursed project for anyone to actually be able to do.
The zeroth skill of a wise engineer in a Murphy-cursed discipline is that they know what is so ludicrously far beyond their skill and understanding that if they tried that then of course they would fail, in a matter where failure is hugely costly.
So nobody that wise would try to brew up a machine superintelligence with anything remotely like modern methods and modern levels of understanding; and the CEOs of AI companies have been filtered to not be people who get that.
Effectively exact for the low-energy domains in question.
The extremely motivated quibbler will imagine up exceptions to this rule, billionaire founders of infinite patience. An average and ordinary startup does operate under a curse of oneshotness in this sense; at some point the funders run out of money, or key employees run out of hope.
I have never said anything like this. If somebody told you otherwise, they were mistaken, and repeating the falsehoods of people very very heavily motivated to come up with insane straw misconstruals of what MIRI was trying to do back in 2015 when we would occasionally publish papers with math in them. Or if I can vent some frustration here:
This is a barbarian-populist's crude angry view of what it means to see papers with math in them that they didn't understand.
I am not going to try again to explain to the barbarians why we ever attempted to publish any papers with math at all. But the notion of getting everything correct on pure theory was not it, nor an attempt to build an AI out of pure math resembling the math in our papers, et cetera ad nauseam.
If someone wants to someday understand what you sometimes do with math besides declaring that something is logically absolutely predictable, or turning the math into exact code, that would be a longer conversation. But there are other reasons for sometimes trying to think mathematically! Or writing essays that have algebra in them! (There are of course ways to try to puff yourself up and look more important by writing fancier algebra, but I think MIRI actually did a decent job of not using any more algebra than was required.)
In a way it's a sad historical point that some of the people trying to warn why the Maginot Line is a oneshot sort of problem, once wrote essays with some algebraic formulae in them. Now the disaster monkeys can chatter to one another that all the warnings are coming from old fools upset that the Maginot Line doesn't look like their formulae; they can be utterly impervious to all arguments without bothering to counterargue their direct meanings, because they're certain that hidden premise must be in there somewhere; even as we repeatedly try to say that's not what this new separate conversation is about at all.
They have a reason for dismissal that feels sufficient to reassure themselves, and they'll stick with it no matter what sensory experiences they are otherwise exposed to, and feel happy and self-satisfied about their right and clever decision, until the moment they kill themselves and you; repeating among themselves, the while, that isn't it sad how MIRI never repented of their foolish old attempts to prove AI mathematically safe; which again, to be clear, is not a kind of thing that math can do in principle, and we knew that from the start.
"But we have these observations we've made! We have these recipes that work!" Medieval alchemists could say the same, and if you don't think they could, you don't respect medieval alchemists enough and you lack a historical sense of how many observations and known recipes they did have. But they did not really know what was going on inside, they could not predict in advance exactly what would be observed, they did not know which new recipes would work and why, etc.
If you wanted a more polite metaphor than alchemy, you could pick metallurgy. Pre-20th-century metallurgists worked by mixing components into alloys, raising temperatures, lowering temperatures, observing and recording which recipes worked, etc, without understanding or predicting the crystal structures of alloys in advance. A lot of later metallurgists too, really.
But also, you didn't usually see 19th-century metallurgists having great noble high-minded theories of how their metals would confer immortality based upon their invocation of deep moving metaphysical principles. So I think alchemy is the more correct historical analogue over 19th-century metallurgy. If you can pick out an AI lab worker who is content poking around their LLM recipes, and makes no claim about later models including the impossibility or controllability of superintelligence etcetera, I should think it fair enough to analogize them to a diligent 19th-century metallurgist.
...If they were trying to refine and pile up more and more bricks of uranium metal in an inhabited city in hopes of generating enough thermal energy to heat and power homes; and insisted that they hadn't observed any downsides of that, and weren't going to speculate unscientifically.
Eg: You don't get what you train for, cognitive uncontainability of superhuman planners, distribution shifts with higher capability, Goodhart's Curse as a function of widened option spaces, etc. See AGI Ruin: A List of Lethalities.
Eg: Novelty, fundamental engineering novelty, pre-paradigmatic fundamental scientific confusion about LLM thought processes, rapidity, narrow margins, etc. See AGI Ruin: A List of Lethalities.
Especially if the kid is a new inhuman species of alien. But I do not raise this in the main argument because adding this disjunctive point will invite a certain kind of psychology to leap on it and argue how their LLM isn't so alien and in particular it seems to understand a lot of human stuff, etcetera. (Understanding is not the problem; ASIs always understand things; their preferences are the problem.) The analogy to ordinary life goes through without the kid being an alien, even though in real life the kid is an alien.
If you do not know how to align any ASI, after their negotiations among themselves arrive at a near-Pareto equilibrium, its near-Pareto property means that it will not have all the agents going out of their way to spare the Earth and the Earth's sunlight out of some fear of otherwise being disorderly; they can do better collectively by not doing that. They are smart enough to negotiate detailed near-Pareto coordinated movements and fairly divide the gains from those, rather than flinch back in terror from a human's fear of violating a prior legal setup.
Also a successful space probe needs to not rely on clever-sounding schemes like this at all. This is alchemist-level arguing about how all your different poisons will surely neutralize each other.
It's a dumb idea (1) because water (or more precisely the hydrogen component of water) is both a neutron absorber and a neutron moderator, (2) because it's hard to put in much more or much less water very quickly compared to scramming a well-designed boron rod, (3) because changes in reactor heat levels will affect water behavior in a direct way by turning it into vapor or supercritical vapor, and (4) because changes in water flow affect how much heat is being removed from the reactor. The details of this do not pointwise map onto anything in ASI in particular; it's just an example of how the competent engineer is not so much "capable of doing anything however difficult", as "one who knows what is possible and nonstupid enough to be worth trying".