Irretrievability; or, Murphy's Curse of Oneshotness upon ASI

Eliezer Yudkowsky

Example 1: The Viking 1 lander

In the 1970s, NASA sent a pair of probes to Mars, the Viking 1 and Viking 2 missions. The total cost was $1B (1970), equivalent to about $7B (2025). The Viking 1 probe operated on Mars's surface for six years, before its battery began to seriously degrade.

One might have thought a battery problem like that would spell the irrevocable end of the mission. The probe had already launched and was now on Mars, very far away and out of reach of any human technician's fixing fingers. Was it not inevitable, then, that if any kind of technical problem were to be discovered long after the space launch in August 1975, nothing could possibly be done?

But the foresightful engineers of the Viking 1 probe had devised a plan for just this class of eventuality, which they had foreseen in general, if not in exact specifics. They had built the Viking 1 probe to accept software updates by radio receiver, transmitted from Earth.

On November 11, 1982, Earth sent an update to the Viking 1 lander's software, intended to make sure the battery only discharged down to a minimum voltage level, rather than running for a fixed time after each charge.

The battery-software update accidentally overwrote the antenna-pointing software.

With the lander's antenna no longer pointed at Earth, no further software updates beyond that point could be received.

The error had destroyed the intended mechanism for recovering from errors.

All contact with the Viking 1 lander was permanently lost. Ground engineers tried some strategies for regaining contact, based on extrapolation of where the antenna could have ended up pointing, but none succeeded.

In this I observe a specific instance of a general idea: Murphy's Curse of Inaccessibility on space probes is a deep problem. A clever system designed in the hope of accepting later patches is a relatively shallower solution.

Putting wings on an airplane doesn't make it weightless and repeal the law of gravity. The weight of an airplane is an intrinsic property that goes on making it susceptible to falling out of the sky if the wings stop working. Your model of the airplane should include both the ongoing weight and the ongoing lift, rather than arguing that the curse of airplane-weight will be dispelled by wings.

The engineers' attempted strategy for mitigating the underlying oneshot quality of a space probe launch -- the engineers' intended mechanism for correcting mistakes afterwards -- did not actually transform the Viking 1 lander into an Earth-bound car that you could walk over to and fix. Any sort of problem that struck at the corrective machinery itself, would catapult you right back into the fundamental inaccessibility scenario, that you couldn't just walk over and fix a broken corrective mechanism. (And also, of course, large classes of possible error can't be addressed by a software update at all.) The underlying reality was that the probe stayed far and high away.

Rocket science wouldn't be so famously cursed by Murphy's Law, if the heightened susceptibility conditions for Murphy's Law to act upon their projects, were so easy to defeat with a little effort. The many Curses of Murphy upon aerospace engineering can be fought; but not vanquished, not dispelled.

Good aerospace engineers know that. So they put in the extreme levels of paranoia and preparation that are required to sometimes succeed.

One can only imagine what tiny fraction of space probe missions would succeed if the engineers or managers were the sort who went around bragging, "Our space probe won't be inaccessible after launch at all; we built in an antenna to upload software updates! Don't listen to those silly people who'll tell you that you 'can't walk over and fix' a space probe after launch; they lack our own experience to have had the brilliant idea of software updates!"

Would such a cheerful innocent ever succeed at landing a space probe? It seems crazy to assign them a real-world success probability as high as 10%. Aerospace engineers have to work much harder than that, be much more paranoid and cautious than that, to drive the success chance significantly above 0.

Example 2: The Mars Observer

The Mars Observer mission was approved in October of 1984 and launched in September of 1992, at a cost of $813 million ($2B 2025). It flew through space for 330 days, and then, three days before inserting into Mars orbit, communication with the probe was lost.

The best-guess postmortem analysis: After the earlier stresses of launch, and an 11-month flight through vacuum, a PTFE check valve had leaked fuel and oxidizer vapors that accumulated within feed lines in the zero gravity; and this produced an explosion when the engine was restarted (for course correction before orbital insertion).

That sort of thing happens, when you try to do something for the first time. One of the reasons why space probes are famously acutely susceptible to Murphy's Law, is that each new probe gets custom-built for a new mission. Each mission is a chance for something new to go wrong.

Now imagine some manager or space enthusiast saying -- in advance of the actual disaster, of course -- "The Mars Observer mission isn't novel! We can test the probe here on Earth in a vacuum chamber! We have experience from previous space probes! We have the whole mighty edifice of science to observe the laws of physics; and we can use those laws to extrapolate the system behavior of the Mars Observer probe on its way to Mars!"

To enumerate the object-level reasons this doesn't repeal Murphy's Curse of Novelty upon space-probe missions generally, or the Mars Observer specifically:

- Even if humanity did in fact learn something from previous space probes, previous probes weren't exactly like the Mars Observer. The ultimate novelty of the mission was not defeated, repealed, nor averted.

The Mars Observer might have failed even earlier, if it had been attempted with even less experience. It's not that all previous learning had no effect. But humanity had not learned enough, generalized correctly and with sufficient reliability, from those earlier nonidentical space probes, to get the new and different Mars Observer to Mars.

- Even if somebody had spun around the probe in a centrifuge to simulate a high-G space launch, and tested out all the systems in a vacuum chamber, that still would not have faithfully reproduced the exact conditions under which fuel vapor leaked in vacuum and then accumulated over eleven months in zero gravity. The conditions of validation here on Earth would not have been exactly like the deployment conditions.

This is why you can't get solid guarantees on space probes using mathematically valid statistics. The training distribution is not the deployment distribution, and that takes all mathematical guarantees and throws them out the window. Since aerospace engineers aren't wacky lunatics, they know this, and none of them have ever even tried to suggest that any kind of mathematical guarantee could apply.

Real life has many such cases. Much more mundanely than space probes, there's no way to use clever statistical guarantees to force an ordinary human conversation to go well, because no two conversations are the same and they're not sampled from time-invariant distributions.

- Humanity's grasp of chemistry and physics -- built by generalizing mathematically simple laws, over a genuinely vast body of observations -- and then applied to molecules and gases literally identical to the molecules and gases in the Mars Observer -- as put together in straightforwardly-mechanical processes themselves observed repeatedly, vastly simple by comparison with eg large computer programs or biochemistry -- was not actually adequate to predict and control the mission outcome.

Knowing all that physics did not negate the underlying surprisingness of a system of even that small complexity. It did not transform it into a mere repetition of previous operations on identical titanium alloys.

This, again, doesn't mean that humanity's knowledge of science and physics contributed zero help to the Mars Observer mission. That space mission would've failed much earlier and harder -- it is difficult even to imagine the counterfactual -- if humanity's grasp of the underlying science had been more akin to medieval alchemists pontificating about the philosophical significances of reagents, with every leading alchemist making up their own brilliant plan for a space mission where several steps involved metaphysical principles of great uplifting moral significance.

Humanity's understanding of the underlying mechanical processes was how the Mars Observer mission came close to succeeding -- in a way that medieval alchemy never came close to an immortality potion, or even to the far simpler goal of transforming lead into gold.
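As a toy illustration of the second point in the list above (validation conditions not matching deployment conditions), here is a minimal sketch. Everything in it is an assumption made up for illustration: the "true" behavior is a simple quadratic, the model is a straight line, and the ranges 0..1 and 3..4 stand in for "conditions we could test" and "conditions actually encountered". It says nothing about the Mars Observer's real engineering; it only shows how a held-out test drawn from the same conditions as the fit can look excellent while telling you very little about different conditions.

```python
import random

def truth(x):
    # Hypothetical "real" system behavior, chosen only for illustration.
    return x * x

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, xs):
    """Average absolute gap between the fitted line and the true behavior."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - truth(x)) for x in xs) / len(xs)

rng = random.Random(0)
train = [rng.uniform(0.0, 1.0) for _ in range(200)]     # conditions we can test
held_out = [rng.uniform(0.0, 1.0) for _ in range(200)]  # same conditions, fresh samples
deployed = [rng.uniform(3.0, 4.0) for _ in range(200)]  # conditions we could not test

model = fit_line(train, [truth(x) for x in train])
validation_error = mean_abs_error(model, held_out)
deployment_error = mean_abs_error(model, deployed)

if __name__ == "__main__":
    print("error on held-out test conditions:", validation_error)
    print("error on deployment conditions:  ", deployment_error)
```

The held-out error is small because the held-out samples come from the same range as the fit; the deployment error is far larger because the line was never checked anywhere near that range. A statistical guarantee computed from the first number does not transfer to the second.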

To sum up: Even (1) learning from previous space probes, (2) testing under controlled conditions attemptedly similar to conditions in space, (3) knowing all relevant fundamental physical laws exactly^[1], (4) having an excellent quantitative grasp of relatively simple higher-level phenomena that governed fully, and (5) doing NASA-standard amounts of intensive thinking, gaming, and simulation about what could go wrong with a billion-dollar project, did not repeal Murphy's Curse of Novelty upon space probes. It was, in the end, still the very first Mars Observer mission.

NASA's efforts at understanding could challenge that Curse of Novelty, in a way that no alchemist's philosophizing could have challenged it, even if the alchemist had managed to grasp one or two rules-of-thumb. The people at NASA who put together the Mars Observer mission over many years of careful planning for that exact mission, had a level of professionalism, engineering caution, background scientific knowledge, specific preparation time, and general seriousness, vastly exceeding the professionalism of any alchemist or AGI company executive.

...Which didn't actually repeal all of the Murphian curses upon space probes. It wasn't enough for the Mars Observer to actually work.

The genuinely very serious people at NASA put up enough of a fight that the Mars Observer almost worked. Even the RBMK design for the Chernobyl nuclear reactor almost worked; it worked for many operation-years before one exploded! Despite the Soviet managers taking a few Disaster Stances that put a ceiling on the maximum socially allowed level of pessimism, the Soviet nuclear engineers knew vastly more and took their jobs vastly more seriously than medieval alchemists or modern AGI companies. There was an actual theory of why the Chernobyl reactor was supposed to not explode, written down so that multiple people could read it, based on an understanding from first principles! They had written handbooks in the 24/7 control room, and the written handbooks weren't just made up to look better!

It's just that to have a Murphy-cursed project actually really work in real life, rather than almost work, is very much harder.

(Though again to be clear, professionalism cannot magically make just any project almost-work. You could not give even the genuinely serious people at 1970s NASA the goal of building a contagious virus that conferred de-aging and indefinite biological healthspan, and have the resulting virus almost-work. The level of difficulty for "make an immortality virus" would be beyond what serious people could almost-do in 1970. Part of being serious is having some sense of a project's cursedness level, and not being a lunatic about what you try to do at high stakes.)

Example 3: The Maginot Line

In September 1939, Germany invaded Poland; this is usually the date given for the start of World War 2, though there were other preludes and signs before.

In May of 1940, Germany attacked France.

France thought they were ready.

France had foresightfully, starting in 1929 eleven years before the moment of crisis, already built the Maginot Line: a hugely expensive network of defensive fortifications along most of France's borders. Those fortifications would have trivially been defensively victorious in World War 1, if they'd been built before World War 1. The Maginot forts were supplied by underground railways, to make their supply lines harder to cut. They had the usual stockpiles of food and ammunition. The forts even had air conditioning -- a startling and expensive luxury for a military fort in 1940, but very much the sort of thing that soldiers had wished they'd had in World War 1.

France had learned the hard lessons of experience and prior battles in World War 1, and generalized them to the future!

Being so expensive, the Maginot Line did not cover literally all French borders. It did cover their borders with other countries that Germany might invade first to get at France, not just the border with Germany; the French military was trying to be thorough. But there were still some carefully-reasoned gaps. For example, France figured that the heavily-forested Ardennes wouldn't be easy to pass; France figured that any German invasion via the Ardennes would be slowed by dense forest terrain, and then further slowed by attacks from French aircraft. France figured it would take Germany at least 3 days, and more probably a week, to make it through the Ardennes; which, according to the calculations of the French military command, would give the French plenty of time to rush their own troops into position along that border, in the unlikely event Germany tried that doomed tactic.

The Maginot Line was there to stop sudden attacks leading to sudden victories; to prevent Germany from winning before France could move up its own troops in reply.

Germany invaded through the Ardennes. The Nazis put some careful work and organization into cutting through the terrain quickly. They put up enough of a Luftwaffe screen to prevent their troops from being bombed while that got done.

France fell.

After which France said "oops", and restored from a savepoint in 1929, when they'd begun to build the Maginot Line. On their second try, France extended their defenses to cover the Ardennes...

Just kidding! In real life, France had fallen, period; the Nazis took the country and held it through the major part of World War 2.

In a serious war -- war for the survival of your country, rather than war as the Sport of Kings -- you only get one try.

"Gambler's ruin" is the mathematical term for what happens to a betting strategy that bets everything; your bankroll can reach zero, and then you have nothing left to bet again. "Murphy's Curse of Ruin", I would say by analogy, is upon the sort of project where, if you fail sufficiently hard, you don't get to try again.

A lot of real life is like that, of course. There's no do-overs in most startups. There's no do-overs in ordinary high-stakes human conversations. We can only outline the curse of Ruin in our mental vision, by contrasting it to the stranger case of an engineer luxuriously getting to build another toaster if their first toaster design has a flaw; or the programmer's luxury of getting to rewrite a line of code and run the program again.
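The "gambler's ruin" idea mentioned above can be sketched in a few lines. This is a toy model under stated assumptions, not a model of any real project: each round is an independent even-money bet won with probability win_prob, the whole bankroll is staked every time, and the function names are hypothetical. Under those assumptions the chance of never being ruined after n rounds is simply win_prob ** n, and the simulation should land close to that.

```python
import random

def bet_everything(rounds, win_prob, rng):
    """Stake the entire bankroll every round; return the final bankroll."""
    bankroll = 1.0
    for _ in range(rounds):
        if rng.random() < win_prob:
            bankroll *= 2.0  # win: the stake is doubled
        else:
            return 0.0       # lose: bankroll is zero, nothing left to bet again
    return bankroll

def survival_rate(rounds, win_prob, trials=100_000, seed=0):
    """Fraction of simulated gamblers who are not ruined after `rounds` bets."""
    rng = random.Random(seed)
    survivors = sum(
        1 for _ in range(trials) if bet_everything(rounds, win_prob, rng) > 0
    )
    return survivors / trials

if __name__ == "__main__":
    rounds, win_prob = 10, 0.9
    print("simulated survival:", survival_rate(rounds, win_prob))
    print("exact survival:    ", win_prob ** rounds)
```

Even with each individual bet heavily favorable, a strategy that stakes everything each time is ruined in most runs once the bets pile up, and a ruined gambler does not get to apply what they learned.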

Engineers would be able to do a lot less, if they only got one try. Human programmers would do much much less, if they could only compile once.

How many tries you get, in practice, makes a HUGE difference in how tractable any project in life or engineering actually is.
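To put a rough number on how much the count of tries matters, here is a minimal sketch under a loudly simplified assumption: every try succeeds independently with the same probability p_single. Real retries are not like that (people learn from failures, and errors are often correlated), so treat this as arithmetic about an idealized case, with a hypothetical function name, rather than as a model of any actual engineering project.

```python
def chance_of_eventual_success(p_single, tries):
    """Probability that at least one of `tries` independent attempts succeeds."""
    return 1.0 - (1.0 - p_single) ** tries

if __name__ == "__main__":
    for tries in (1, 10, 50):
        print(tries, "tries:", chance_of_eventual_success(0.1, tries))
```

Under that assumption, a design approach that works one time in ten is a bad bet with a single try and a very good bet with fifty tries. The approach is the same in both cases; only the number of allowed failures differs.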

It's harder, having only one try, in life or war.

Other supposed refutations of oneshotness

Now imagine it's 1929, ten years before World War 2, around the time that construction of Maginot forts began. Imagine that somebody in a conversation in the high halls of French government says -- meaning it as a straightforward truism -- that they'll only get one chance to get this "Maginot Line" business right, because you only get one shot in War.

Try to imagine -- it will take a bit of a stretch, because even in 1929 France, the high military is made up of mostly sorta serious people -- try to imagine the higher French military officials shooting down this pessimistic nay-sayer, loftily proclaiming:

"What do you mean, we get 'only one try' at correctly conducting a war with Germany? What is all this nonsense talk of 'oneshotness'? There will be many cases where French soldiers clash with German soldiers, and our country doesn't get defeated as soon as one French soldier falls. We get lots of tries at killing German soldiers! We reject your theory that the war will be settled in a single oneshot battle after all of the German soldiers teleport directly into Paris. We have experience from fighting World War 1, as well as numerous smaller fights between police officers and criminals; when the next war starts, it won't be unprecedented or novel at all. And we'll get to learn more as Germany invades, which will be a continuous rather than instantaneous process where one battalion crosses the border, and then another battalion. You say that if we lose, we'll be conquered and won't get to try again? Maybe we can get Russia to invade Germany and have our enemies cancel each other out; in which case France would not be conquered, and we would get many tries at fighting the next war over and over. Maybe we can kidnap a German infant and raise him with a habit of answering questions, and get him to tell us how to defeat Germany, at some point when he's old enough to predict how Germany will behave later, but still too young to think of lying; if improved German capabilities can help us defeat Germany, there can be no sudden shift in any balance of military power. We can try lots of things, really! We don't have to defeat Germany by getting some ideal mathematical theory of war correct on the first try, based on zero experience, the way you advocate we should; the problem of War is not single-try at all!"

What you'd mainly say, of course, is that the speaker here is being motivatedly oblivious. They would be assuming a Disaster Stance that goes well beyond the level of motivated optimism in eg Chernobyl, or actual 1930s France when they ignored some war-game results suggesting that Germany could perhaps come through the Ardennes.

After hearing lines of dialogue like that, one should stop considering the speakers as serious people with a few horrendous flaws. It is a point past which I start using phrases like "disaster monkeys".

But to nonetheless dissect all of the fallacies above:

What do you mean, we only get one try at correctly conducting the war with Germany? There will be many cases where French soldiers clash with German soldiers, and our country doesn't get defeated as soon as one French soldier falls. We get lots of tries at killing German soldiers!

- A larger war can be oneshot even if, zooming into a small-enough scale, you can find some local instances of conflict on which the larger war does not fully depend.

-- The problem "successfully fund and launch a shoe-company startup that becomes a commercial success" is oneshot even though "make a good athletic shoe" is not. You get multiple tries at designing or putting together a good athletic shoe. You get one shot at the startup.^[2]

- Formally it is a "fallacy of composition" to see a big strategic problem extended over time, and note that it is made up of some parts where errors are not locally fatal, and conclude that the bigger thing is therefore not oneshot.

-- The startup's oneshotness is a property of the entire big-deal project extended over time, not a property of every single interaction along the way being globally fatal. So to point to one local interaction where failure is not globally fatal, does not dispel the Curse of Oneshotness over the larger global problem.

- There were no doubt some errors in the Mars Observer probe that were recoverable, and successfully recovered, before the point where the probe was lost. The larger project was still oneshot, from the perspective of a manager or scientist staking some portion of their career on it. (It was obviously not oneshot from the perspective of larger humanity; failure didn't kill your parents, so you're still here to hear about it.)

We reject your theory that the war will be settled in a single oneshot battle after all of the German soldiers teleport directly into Paris.

- Being able to imagine a version of War that would be even more oneshot, does not change the way that actual War is still pretty oneshot. In particular, the speaker here is imagining an even more Murphy-cursed version of War, more subject to Murphy's Curse of Rapidity, where France would get even less chance to learn and react. But that events did not happen infinitely fast, did not save France, because Germany still made it through the Ardennes fast enough. And that incident was fatal enough, that whatever lesson France learned from that, came too late to save the rest of their war.

We have experience from fighting World War 1, as well as numerous smaller fights between police officers and criminals.

- These problems were not drawn from an identical distribution to World War 2. What French generals fancied themselves to have Learned From Experience was part of the problem, indeed, because they acquired confident wrong beliefs.

And we'll get to learn more as Germany invades, which will be a continuous rather than instantaneous process where one battalion crosses the border, and then another battalion.

- The Mars Observer probe didn't teleport to Mars, yet was still lost. Things can go wrong even when they're physically continuous.

- For battalions to cross the border one after another, in a physically continuous process, does not mean that France is blessed with adequate time to observe the first battalion emerging from the Ardennes, learn the real laws of World War 2, and then rebuild the Maginot Line correctly, before the next battalion emerges from the Ardennes.

You say that if we lose, we'll be conquered and won't get to try again? Maybe we can get Russia to invade Germany and have our enemies cancel each other out; in which case France would not be conquered, and we would get many tries at fighting the next war over and over.

- A project can be said to have an underlying Curse of Ruin, that contributes to the sum of its Murphian susceptibilities, whenever a sufficiently major disaster would be sufficiently fatal. Thinking you have a clever plan to not be ruined, is proposing to try to lift against this weight, not to cancel it; putting wings on an aircraft doesn't repeal the law of gravity.

Maybe we can kidnap a German infant and raise him with a habit of answering questions, and get him to tell us how to defeat Germany, at some point when he's old enough to predict how Germany will behave later, but still too young to think of lying

- The fractal difficulties of this proposal would require their own post.

if improved German capabilities can help us defeat Germany, there can be no sudden shift in any balance of military power.

- Having a Clever Plan like this doesn't negate any Murphian curses, nor change the oneshotness of the larger war.

We can try lots of things, really!

- So far as France knew, they'd tried several things including building the Maginot Line, reforming their military around the valuable lessons of World War 1, making advance plans to deal with probable German invasions, etcetera. So far as France knew, all those things were going to work great. But then those things didn't work, and then the war was over.

- All the many things France tried, collectively formed a single shot with respect to Murphy's Curse of Oneshotness. They did not get another shot after that.

We don't have to defeat Germany by getting some ideal mathematical theory of war correct on the first try, based on zero experience, the way you advocate we should.

- Sheer naked strawmanning^[3] of what is being said when somebody tries to warn you of Murphy's curses upon your project; the chattering of disaster monkeys.

On the extraordinary efforts put forth to misinterpret the idea of oneshotness

Without a whole article like this one to hammer home exactly what I mean, I have found in the past that I cannot use phrases like "oneshot" or "you only get one try" around most so-called "AI safety" people outside of MIRI.

To be clear, a Congressional representative or staffer or national security professional will often immediately understand what is being said, if they haven't been previously contaminated by misinterpretations and straw positions. It's mainly AI companies, some AI professionals with fewer citations than Yoshua Bengio, OpenPhil-funded groups, etcetera, who manage to be unable to hear what is being said.

But the phrase "one shot" can be misunderstood with probability nearly 1, relative to the amount of effort that some people can, will, and have put forth to mishear it; and more importantly, misrepresent it in further debate.

So in conversations where there is a pre-poisoned fool hanging around, maybe you should try introducing it as the Irretrievability Problem rather than "oneshotness" and then the fool will have a harder time misinterpreting that in the very next sentence, because it will be harder for them to forcibly remap the word onto a strawman, maybe? I haven't tried out that new tack yet.

The opposing faction is forced to that effort of hurried misinterpretation, because the actual impact of the concept of "we only get one shot at ASI" is so devastating to their position, in the presence of even a slight understanding of why there might be any noteworthy engineering difficulties whatsoever. A oneshot extinction problem that nobody understands very well is horrifying in the presence of any familiarity with the actual human history of engineers trying to do things that are hard to predict exactly -- let alone pre-engineers trying to do things that are hardly understood at all.^[4] Once you've grasped three or four of the fundamental obstacles to ASI alignment^[5] and three or four of the Murphian curses upon the field^[6], you realize that of course ASI alignment is not the sort of thing that has ever before in the history of the world been done correctly on the first outing --

"Aha," somebody now immediately interrupts me, "but we don't have to get it right on the first outing! We can build smaller AIs and observe how they--"

The first 'outing' in the same sense that France's macro-scale attempt to fight in World War 2 was their second 'outing' in fighting against Germany, and their first outing at fighting with WW2 technology; not in the sense that any particular battle of that war is an outing.

ASI alignment is the sort of matter where, historically speaking in the totally normal and usual course of science as it has always been previously observed, there's initially all sorts of wacky ideas for how to do the ill-understood thing, and the first dozen ideas prove to fail under load --

"Yes, which is why it will be important to test those ideas on smaller AIs!"

Here 'fail under load' is trying to point to the way that the Maginot Line failed when Germany actually invaded and the contest was run for real, without the Maginot Line having particularly admitted any invasions before then. 'Fail under load', as in how the Mars Observer mission failed when actually launched, whatever ground-level tests it had passed before then, and whatever NASA's earlier attempts at simulation or careful thinking had turned up and already fixed before the launch. Lots of things that appear to work under lighter loads will fail under a heavier load.

"But we don't have to get it all correct based on pure theory, like you say is possible and say we should do --"

Sheer motivated misrepresentation, of a separate argument that isn't even being made here in the first place; see footnote 3 if you want to expose yourself to a frustrated rant about this.

"-- because, contrary to anything you imagine to be possible, our experience of earlier AIs can inform our models of superintelligence --"

One of the several fundamental difficulties of ASI alignment is that your theory of how to survive AI that is smart enough to kill you -- your theory of how to survive when there is a quantity of machine capability around that can kill you, if it turns against you, if something goes wrong -- has to be successfully generalized only from experiments that don't kill everyone on Earth if they fail. Meaning that you are experimenting on less powerful and less capable AI; which AI, if it reasons correctly, will not estimate that it can kill you -- among many other changes of conditions, shifts of distributions, between the safe mode of survivable experiment and the potentially lethal test environment.

This giant historically unprecedented problem has many ordinary-world valid analogies. Like how you can't determine if someone is trustworthy to handle a billion dollars by seeing how they handle ten dollars, even if it's in fact the same person and they're not getting much smarter, because they can think intelligently about whether it's a good time to steal the money. Or like a Greek city-state whose philosophers are arguing that the city could appoint a trustworthy dictator by watching how some boy acts as a child, and seeing if he acts virtuously (knowing he's being watched). Conditions change, because the boy's brain is not what it will be when the boy grows up; nor are the conditions of an appointed city-dictator who is in fact being trusted, the same as the conditions of being a boy (who is being watched, and getting whacked by currently-bigger entities when he misbehaves). These facets of the problem are not on the same horrifically unprecedented level as "use your own thoughts to anticipate an alien much smarter than you", but they are very normal cases of why you can't solve dangerous problems just by taking a bunch of safe samples from a different distribution. The distribution is inherently different not least because it is safe. Problems like this are, in a very mundane way, why it is also a big deal to figure out who you can trust with a billion dollars, and why we don't just do a few experiments on the trustee and then generalize from those.

Someone could, conceivably, argue that the change to "there being enough machine superintelligence around that ASI could kill humanity if they tried", from "AIs being experimented-upon that couldn't kill us if they tried", will be less than the sort of change from "the sort of tests you can do on a Mars Observer probe on Earth", to "the actual conditions of the probe being launched and flying through space"; or the change from "ordinary operating conditions at Chernobyl" to "running a safety test of the backup cooling system at Chernobyl".

But that would be an argument so incredibly stupid that it might actually sound stupid when they thought about saying it. Of course the jump to actual empowered superintelligence is going to make a bunch of differences much much huger than the NASA difference between "actual space travel" and "artificial test chambers intended to simulate those clearly-understood conditions on Earth". Among other issues, AI-brewing alchemists understand the cognition inside AIs far more poorly than NASA physicists understood the conditions in space -- current AIs, never mind superintelligences! But also, in a very ordinary way, there just isn't a nonlethal way to test out lethal levels of superintelligence. Just like you can't test somebody's suitability to be city dictator by having philosophers follow around watching how they behave as a kid who knows he's being watched^[7]; just like you can't make sure somebody can be trusted with a billion dollars by loaning them ten dollars.

You could argue that the jump to "enough superintelligence around to actually kill us" will change less from previously observed conditions with already-deployed or safely-lab-testable AIs, than actually sending the Mars Observer into space changed conditions from NASA lab tests and simulations.

But people might perhaps disagree with you, if you tried to argue that explicitly.

So instead, the warning, "You only get one real shot at real ASI, and if you screw up everyone is dead and you don't get to try again," gets outrageously strawmanned and misinterpreted as "ASI would win instantaneously because FOOM", or "humanity should attempt to learn zero things by looking at earlier AIs and do everything based on theory", because the actual argument the ASI-survivableists need to make is a less attractive PR battleground.

"Aha, but as you've clearly never considered, we can have more than one ASI; and then if one ASI goes rogue, the other ASIs will stop it for fear of disrupting the orderly law-abiding equilibrium that we started out the ASIs inside; and therefore everyone will not be dead, and we will get to try again!"

If that whole clever scheme goes wrong, everyone is dead and you don't get to try again. I am not even arguing right now all the reasons why the clever scheme is doomed.^[8] I am trying to explain why it is not a rejoinder that refutes, "ASI alignment is under Murphy's Curse of Oneshotness."

"Aha, but I can imagine some possible mistakes with superintelligence that would not wipe out humanity!"

Cool! You would have fit right in with a much much less serious version of France's top generals in 1929, if someone had argued that military strategy wasn't a one-shot sort of life problem, because they could imagine a possible mistake they could make with the Maginot Line that would not lose the whole war.

The core idea here is frankly not that complicated. A lot of people get it correctly and immediately. The thing being said is simple and an obvious default expectation when dealing with something vastly smarter than humanity: that is a lethal level of danger if something should happen to maybe possibly go wrong -- YES A SUFFICIENTLY SEVERE THING, YES YOU CAN IMAGINE A NONSEVERE ERROR, NO THAT DOESN'T CHANGE THE CORE IDEA, JESUS CHRIST.

Someone could conceivably try to argue against that really quite simple warning. But it takes a great motivated psychology to be unable to hear which idea is being argued; and manage to misinterpret every historical example, every ordinary everyday-life analogy, and every abstract explanation. Not in the sense of disputing their relevance, but in the sense of inability to repeat back which idea is being argued.

If not for this incredible effort at mishearing and misrepresenting the ideas, I could've just said, "Humanity only gets one shot at getting machine superintelligence right," and anybody who understood the everyday idea of crashing and burning in a big important conversation with someone, and not getting a do-over because no time travel, would've been able to understand the very ordinary core of what was being communicated.

The secret sauce of competent engineers in Murphy-cursed fields: only trying projects so incredibly straightforward as to be actually possible.

Above all else, the reason why Very Serious Engineers sometimes succeed even at slightly cursed problems with no cheap do-overs, is that they have a sense from both theory and practice about which problems are so incredibly ludicrously easy as to actually be solvable.

Go to a nuclear engineer and say, "Build me a reactor that runs off 2% enriched uranium, but the only neutron-absorber you're allowed to use is plain water, no boron or cadmium or hafnium." The nuclear engineer will say back "No, because that is a dumb idea.^[9]"

Go to an aerospace engineer and ask them to make an ultra-contagious virus that rewrites human genomes to confer de-aging and biological immortality -- but safely and reliably, using their same Very Serious Methodology that they use for launching space probes that succeed more often than not. The aerospace engineer will laugh, and then, if you seem actually serious, perhaps try to explain like you are five: "I can't do that because science doesn't have a good-enough theory of what a completed immortality virus would look like."

"I can't use the same process that builds space probes that sometimes work, to build you an immortality virus at all, let alone a safe one," says the aerospace engineer. "Because the base resource that a space probe project starts with, is an idea that science strongly implies would work for known straightforward reasons, if nothing surprising happened instead. The incredibly difficult job that takes all the very serious organizational process -- and still only works most of the time -- is having those nice ideas that ought to work in very straightforward ways for very well-understood reasons, actually work all the way to where a probe sends back data from Mars. We don't have that for an immortality virus, so we can't get past step zero of the very serious and safe methodology."

And we understand what goes wrong with the human body during aging, much MUCH better than we understand what goes on inside LLM cognition. We could get correspondingly closer to success, if we tried telling an aerospace engineer to use NASA's assurance processes to build an immortality virus, rather than telling them to build a safe superintelligence.

But mostly what that very serious process would tell you, is that what you have made is not a safe immortality virus, and you should not try to make it very contagious and infect the Earth's population with it.

And the great seriousness of a decent engineer would manifest in this way: that so vast would be their understanding of their own limitations, that they wouldn't need to infect most of humanity with a highly contagious virus that then surprisingly didn't work exactly as they'd hoped, in order to learn to their vast surprise and dismay that building an immortality virus was more than trivially difficult. They would know it even in advance of killing a dozen suicide volunteers or a hundred monkeys! They'd see that incredible surprising shocking unexpected plot twist coming in advance of it actually happening.

So nobody like that would start doing a biology project aimed at making a contagious de-aging virus to the great benefit of all human beings. They'd know that was an overly cursed project for anyone to actually be able to do.

The zeroth skill of a wise engineer in a Murphy-cursed discipline is that they know what is so ludicrously far beyond their skill and understanding that if they tried that then of course they would fail, in a matter where failure is hugely costly.

So nobody that wise would try to brew up a machine superintelligence with anything remotely like modern methods and modern levels of understanding; and the CEOs of AI companies have been filtered to not be people who get that.

^[1]
Effectively exact for the low-energy domains in question.
^[2]
The extremely motivated quibbler will imagine up exceptions to this rule, billionaire founders of infinite patience. An average and ordinary startup does operate under a curse of oneshotness in this sense; at some point the funders run out of money, or key employees run out of hope.
^[3]
I have never said anything like this. If somebody told you otherwise, they were mistaken, and repeating the falsehoods of people very very heavily motivated to come up with insane straw misconstruals of what MIRI was trying to do back in 2015 when we would occasionally publish papers with math in them. Or if I can vent some frustration here:

This is a barbarian-populist's crude angry view of what it means to see papers with math in them that they didn't understand.

I am not going to try again to explain to the barbarians why we ever attempted to publish any papers with math at all. But the notion of getting everything correct on pure theory was not it, nor an attempt to build an AI out of pure math resembling the math in our papers, et cetera ad nauseam.

If someone wants to someday understand what you sometimes do with math besides declaring that something is logically absolutely predictable, or turning the math into exact code, that would be a longer conversation. But there are other reasons for sometimes trying to think mathematically! Or writing essays that have algebra in them! (There are of course ways to try to puff yourself up and look more important by writing fancier algebra, but I think MIRI actually did a decent job of not using any more algebra than was required.)

In a way it's a sad historical point that some of the people trying to warn why the Maginot Line is a oneshot sort of problem, once wrote essays with some algebraic formulae in them. Now the disaster monkeys can chatter to one another that all the warnings are coming from old fools upset that the Maginot Line doesn't look like their formulae; they can be utterly impervious to all arguments without bothering to counterargue their direct meanings, because they're certain that hidden premise must be in there somewhere; even as we repeatedly try to say that's not what this new separate conversation is about at all.

They have a reason for dismissal that feels sufficient to reassure themselves, and they'll stick with it no matter what sensory experiences they are otherwise exposed to, and feel happy and self-satisfied about their right and clever decision, until the moment they kill themselves and you; repeating among themselves, the while, that isn't it sad how MIRI never repented of their foolish old attempts to prove AI mathematically safe; which again, to be clear, is not a kind of thing that math can do in principle, and we knew that from the start.
^[4]
"But we have these observations we've made! We have these recipes that work!" Medieval alchemists could say the same, and if you don't think they could, you don't respect medieval alchemists enough and you lack a historical sense of how many observations and known recipes they did have. But they did not really know what was going on inside, they could not predict in advance exactly what would be observed, they did not know which new recipes would work and why, etc.
If you wanted a more polite metaphor than alchemy, you could pick metallurgy. Pre-20th-century metallurgists worked by mixing components into alloys, raising temperatures, lowering temperatures, observing and recording which recipes worked, etc, without understanding or predicting the crystal structures of alloys in advance. A lot of later metallurgists too, really.
But also, you didn't usually see 19th-century metallurgists having great noble high-minded theories of how their metals would confer immortality based upon their invocation of deep moving metaphysical principles. So I think alchemy is the more correct historical analogue over 19th-century metallurgy. If you can pick out an AI lab worker who is content poking around their LLM recipes, and makes no claim about later models including the impossibility or controllability of superintelligence etcetera, I should think it fair enough to analogize them to a diligent 19th-century metallurgist.
...If they were trying to refine and pile up more and more bricks of uranium metal in an inhabited city in hopes of generating enough thermal energy to heat and power homes; and insisted that they hadn't observed any downsides of that, and weren't going to speculate unscientifically.
^[5]
Eg: You don't get what you train for, cognitive uncontainability of superhuman planners, distribution shifts with higher capability, Goodhart's Curse as a function of widened option spaces, etc. See AGI Ruin: A List of Lethalities.
^[6]
Eg: Novelty, fundamental engineering novelty, pre-paradigmatic fundamental scientific confusion about LLM thought processes, rapidity, narrow margins, etc. See AGI Ruin: A List of Lethalities.
^[7]
Especially if the kid is a new inhuman species of alien. But I do not raise this in the main argument because adding this disjunctive point will invite a certain kind of psychology to leap on it and argue how their LLM isn't so alien and in particular it seems to understand a lot of human stuff, etcetera. (Understanding is not the problem; ASIs always understand things; their preferences are the problem.) The analogy to ordinary life goes through without the kid being an alien, even though in real life the kid is an alien.
^[8]
If you do not know how to align any ASI, after their negotiations among themselves arrive at a near-Pareto equilibrium, its near-Pareto property means that it will not have all the agents going out of their way to spare the Earth and the Earth's sunlight out of some fear of otherwise being disorderly; they can do better collectively by not doing that. They are smart enough to negotiate detailed near-Pareto coordinated movements and fairly divide the gains from those, rather than flinch back in terror from a human's fear of violating a prior legal setup.
Also a successful space probe needs to not rely on clever-sounding schemes like this at all. This is alchemist-level arguing about how all your different poisons will surely neutralize each other.
^[9]
It's a dumb idea (1) because water (or more precisely the hydrogen component of water) is both a neutron absorber and a neutron moderator, (2) because it's hard to put in much more or much less water very quickly compared to scramming a well-designed boron rod, (3) because changes in reactor heat levels will affect water behavior in a direct way by turning it into vapor or supercritical vapor, and (4) because changes in water flow affect how much heat is being removed from the reactor. The details of this do not pointwise map onto anything in ASI in particular; it's just an example of how the competent engineer is not so much "capable of doing anything however difficult", as "one who knows what is possible and nonstupid enough to be worth trying".

I think there is something good about making a post that stands on its own like this, but I also think it's useful to directly link to a bunch of direct quotes from people who said the kinds of thing this post is arguing against. So here are some I remember:

Paul Christiano, in “Where I agree and disagree with Eliezer”:

“Eliezer often equivocates between ‘you have to get alignment right on the first ‘critical’ try’ and ‘you can’t learn anything about alignment from experimentation and failures before the critical try.’”
[...]

“But we will be able to learn a lot about alignment from experiments and trial and error; I think we can get a lot of feedback about what works and deploy more traditional R&D methodology.”

Sam Marks, commenting on Paul’s post:

“Eliezer’s ‘first critical try’ framing downplays the importance of trial-and-error with non-critical tries.”
[...]

“Deceptive behavior may arise from AI systems before they are able to competently deceive us, giving us some chances to iterate.”

Joe Carlsmith, in “On first critical tries in AI alignment”:

“In AI alignment, you do still get to learn from non-existential failures.”
[...]

“you might catch AIs attempting to take over, and learn fr

[...]

the things I link here do indeed all strawman the core thing in this post

I found this surprising. Why do you think this? All of these posts/comments seemed pretty reasonable to me. I don't see how they are strawmanning the point in this post? Edit: I think I understand what habryka meant, see here.

It seems like the view across all of these is "there is a first critical try, but we can learn from experimentation before" and/or "if misalignment of type X emerges late, then maybe we can use earlier AIs to get lots done (or possibly hand off successfully) while if misalignment of type X emerges early, we can study it (which might transfer through to the relevant regime or might not, but this is a quantitative question where details matter)". I don't really see how this post is even clearly arguing against these points? Like, it's got to be a quantitative disagreement about how much transfer we're talking about and I don't think this post makes arguments that could pin down the relevant quantitative details (about e.g. the level of transfer in the relevant regimes) for AI.

(I tend to think the situation is also messier to discuss because most of the hope routes through effectively handi...

I found this surprising. Why do you think this? All of these posts/comments seemed pretty reasonable to me. I don't see how they are strawmanning the point in this post?

So, take Paul's quote, where he suggests that Eliezer sometimes says that "you can't learn anything about alignment from experimentation and failures before the critical try." I think Eliezer doesn't say this? I think it's possible to read Eliezer as saying this, or for his previous framings to make it harder to rule out that interpretation. But like with the WWI -> WWII example, the question is not whether you learn anything about alignment but whether you learn enough about alignment, and I think Eliezer has always been focused on the question of "enough" and swapping that out for "anything" is a central example of strawmanning.

Sam's and Joe's examples seem to be in the same vein. If Alice asks "will we sell enough paintings to cover rent this month?" and Bob responds "Alice is downplaying the importance of how earning revenue allows us to pay rent", it is clear that Bob has made some mistake here. The question is how the numbers compare, not whether or not there's a mechanism by which learning will work.

I actu...

Ok, I think I understand the point now: Paul and Sam Marks are both talking about what Eliezer is saying in list of lethalities and the thing they say about his perspective/framing isn't faithful to the description he gives in this post about irretrievability. So, they'd be strawmanning this post if these comments were a response to this post.

I don't see how Joe and Buck are strawmanning. (Joe isn't really even talking about what Eliezer thinks and it sounds like you and others agree Buck isn't strawmanning.)

I'm less sure Paul and Sam Marks are strawmanning Eliezer in general or strawmanning List of Lethalities.

Paul says:

Eliezer often equivocates between ‘you have to get alignment right on the first ‘critical’ try’ and ‘you can’t learn anything about alignment from experimentation and failures before the critical try.’

IMO, the description in list of lethalities mostly doesn't equivocate between these (though it does it a bit), but my cached understanding is that Eliezer does often seem to equivocate between ‘you have to get alignment right on the first ‘critical’ try’ and ‘it's very hard to learn much about alignment from experimentation and failures before the critical try’ ...

habryka (2mo, karma 8)

I am also not that sure about Joe. I love Joe, but he is a man of many words, and I did not reread his whole sequence on this and adjacent topics when I made the comment, so I might be mischaracterizing it. At least my vague memory is that he is doing an equivocation here, but it's more of a gestalt thing and I would need to reread more to argue this case. [...] Seems like we are roughly on the same page here. I think it would be fine if someone wanted to bring in comments or posts where they do think Eliezer is conflating in the relevant way. Re Sam you say: [...] I think Sam's sentence here pretty clearly implies "the first critical try framing is trying to imply that trial-and-error are less important", and I think that's just not really a valid inference unless you make an equivocation. NASA's job does not get easier if you don't get to run the experiments. NASA's success is still highly contingent on learning from trial-and-error. It is an argument that trial-and-error is not sufficient, but not an argument that it isn't important.

Lest the exegesis of my old comment continue, I'm happy to clarify my object-level view. I think that:

- At each AI capability level, there is some probability of an irrecoverable catastrophe (e.g. AI killing or disempowering humanity).
  - You could rephrase this as "There will be critical tries."
- This probability is importantly sensitive to preparation that we do in advance using less capable AIs.
  - This preparation includes things like alignment/control research on weaker systems, hardening the world, and work on extracting as much useful labor as possible (e.g. alignment research) out of weaker AI systems.
  - By "importantly" sensitive, I mean that if you try to forecast catastrophe risk without modeling the effect of preparatory work with weaker AIs, then your forecast will be substantially worse.
  - In particular, this means that I expect it is feasible in practice for humanity to do preparatory work with weaker AIs that substantially moves the overall probability of catastrophe.
- Factors that influence the efficacy of this prior preparation include: how much time we have with the less capable AIs, whether we prioritize and execute well on the preparatory work, and how similar the less capable AI

[...]

Hastings (2mo, karma 4)

I think you are implicitly modeling the game to stop shortly after ASI is created, and be judged a win or a loss. This is the case only if the ASIs all coordinate on a halt to intelligence improvement: otherwise, the default is that intelligence improvement keeps happening for a long time, long enough that the majority of AI capability level transitions, along with many paradigm shifts and total architecture/approach swaps, happen without significant human input. “The AI loves us” is much easier than “The competing swarm of loving AIs will only ever build loving successors, and so on for their successors, without mistake or correction, forever”.

My guess is that they would implicitly consider this post to be motte-and-bailey-ing, but do strawman the position in this post (if this post is in fact the best representation of Eliezer's position).

In my opinion, this post is not actually making many hard claims. I mostly view it as gesturing at the existence of really difficult problems and presenting historical analogies. It argues that it is possible for problems to be very hard, even if they have a bunch of other nice properties, including the nice properties people attribute to the AI problem. However, I feel like arguing that AI safety could be an incredibly hard problem is a way less extreme position than the one Eliezer seems to actually hold.

However, I feel like arguing that AI safety could be an incredibly hard problem is a way less extreme position than the one Eliezer seems to actually hold.

I mean, my take on this is that around two decades ago Eliezer thought AI safety could be an incredibly hard problem, and then spent a lot of time checking, and now has lots of reasons to believe that it is an incredibly hard problem, and those reasons are spelled out elsewhere, with this post just trying to point at the problem of irretrievability.

Neel Nanda (2mo, karma 8)

Sure, but then an attempt to summarise Eliezer's position, which attributes a much stronger position than is in this post, isn't necessarily a straw man that doesn't understand the point of irretrievability, but can merely be responding to all his points on top of irretrievability, or saying that they don't consider him to be making sufficient arguments beyond the potential for irretrievability

habryka (2mo, karma 7)

Agree, though in that case I think it would be good form to say "Eliezer is right about this being a "first critical try" sort of problem and that being important, but I disagree with him on other reasons for why he thinks the problem will be hard and they leave me substantially more optimistic". The quotes I selected above do not do that.

ryan_greenblatt (2mo, karma 5)

FWIW, I interpret all of the things you linked (except the comment by Sam Marks) as pretty clearly saying this?

habryka (2mo, karma 4)

I think Paul is saying "Eliezer is using an equivocation between a correct point and a false point for rhetorical effect". I don't think that is doing the same (I think it's failing to give credit for the correct point). I do agree it's doing some of what I was trying to point to here, but not following good form in the way I was trying to describe. I think Joe is on a vibes-based level doing also a more direct equivocation, I think, but again, it's been a while since I read it and I am not that confident.

Neel Nanda (2mo, karma 2)

I'm not saying that those people believe it is a critical first try problem. I expect they agree that it could be a critical first try problem, but that they predict it probably isn't for a variety of reasons, and they view Eliezer as claiming that it is, rather than just that it is possible it's a critical first try problem.

TsviBT (2mo, karma 2)

Another underlying disagreement could be about the general factor of: what is this function, approximately: [...] I would imagine that this isn't the main source of disagreement, but I do find it hard to see how [create an alien mind that's smarter than humanity] ends up less firsty than [make a Mars rover] etc., so I'm wondering if that's not the claim.

habryka (2mo, karma 2)

Wait, that doesn't make any sense. I am confident all these people totally think that aligning Superintelligence is a critical first try problem the way Eliezer is talking about here.

TsviBT (2mo, karma 3)

They may think it's critical but not very firsty, i.e. sufficiently similar to / comparable to / generalizable from previous tries.

Ariel Arévalo Alvarado (2mo, karma 2)

There are bits here and there that lead me to believe that Eliezer is mostly using the post to denounce (as one should) safety optimism to the point of supporting runaway capabilities research. This would also explain why most people who are already safety-minded seem to be somewhat confused by the post. We're not being called out, we're being shown exactly what is failing to be heard by people who are burying their head in the sand and going full throttle on capabilities research.

Neel Nanda (2mo, karma 3)

Eliezer lists "OpenPhil-funded groups" as part of who he is criticising. The people habryka quotes typically fit that demographic better than unbridled capabilities

Sorry, yes, they (almost)^[1] all say^[2]:

"Eliezer, you said we couldn't learn from experimentation! But we, the enlightened few, in contrast to those people trying to derive guaranteed conclusions from logical principles, understand that empiricism is a thing, and your concept of 'critical first try' is only harmful and misleads people about our ability to iterate and learn from earlier failures".

That is, as I understand it, one of the core points of this post. The concept of a first critical try is not in contrast to empirical iteration. "Please, why do people keep bringing it up as if it conflicts with it. It's a different point. Can you please stop sliding off of this point and just acknowledge it instead of trying to respond to this weird other strawman every time?".

^[1]
I am most confused about Buck's exchange where it feels to me like Buck is kind of making a non-sequitur and Eliezer is also being weirdly dense about the point Buck is making and my guess is something kind of similar is going on but I wouldn't quite put it in the same bucket
^[2]
Of course greatly exaggerated for rhetorical effect to reduce ambiguity and introduce levity

Look, if anyone is strawmanning and being condescending, it is obviously you and Eliezer. Which I don't think is that big a deal but it is frustrating that you are accusing people of being condescending in such a condescending manner.

Edit:

To expand on this more:

Eliezer believes that good theory would help a lot with aligning AI on the first critical try (despite not being sufficient or completely necessary) while believing that iteration without theory won't help that much (because the problem of aligning stupider less capable AIs just doesn't apply that well to aligning superintelligence).

It's annoying that I felt it necessary to put the parentheticals in there, because if I didn't I feel like I was going to be accused of strawmanning.

In any case, in contrast you can imagine someone who believes that theory will not help a lot, but that iteration will. I don't think putting forward such a view is strawmanning.

Eliezer believes that good theory would help a lot with aligning AI on the first critical try (despite not being sufficient or completely necessary) while believing that iteration without theory won't help that much (because the problem of aligning stupider less capable AIs just doesn't apply that well to aligning superintelligence).

Look, he might believe that, or he might not. I just don't think this post, or the general argument about "first critical try" is about that.

I am not saying everyone is strawmanning everything about Eliezer. People totally have valid arguments about the difficulty of alignment, and the value of empirical iteration, and of course hundreds of other aspects of the AI-risk situation, but on the specific narrow point of "you only get one critical try", people seem to repeatedly want to make it into a different strawmanned point, and then respond to that. Acknowledging this does not need to involve conceding any major kind of argument. It's really not a complicated point. We don't need to tie ourselves up in these knots.

You can then argue with Eliezer about whether this point is sufficient for high risk from AI (which is some of what this post is about), but...

Eliezer Yudkowsky (2mo, karma 8)

I think it kind of does concede the major argument to a wise engineer once they look at it from that angle, which is why their conversation goes to desperate lengths to change the subject.

habryka (2mo, karma 7)

Sometimes humanity does get things right on the first critical try. I think e.g. Paul gets the point about first critical try, but has models of the world that consistently predict that there are ways to get around the difficulties associated with that. I disagree with him on those points, but I don't feel like I have a slam-dunk response to his models. I do think it's a sufficient argument for "this is a really high-stakes situation and of at least substantial difficulty", but like, most people in the field are on board with that? And my sense is even most people at the labs.

307th (2mo, karma 3)

But people do acknowledge the narrow meaning of first critical try, even in your direct quotes! E.g. Christiano: [...] Here he is acknowledging the motte and accusing EY of moving to a bailey. Maybe he's wrong, but it doesn't seem like he's sliding off the point about one critical try. Same with Sam Marks: [...] Anyway, it's probably not worth spending this much effort on the meta questions of who is strawmanning who, but I stand by my peevedness.

I disagree because neither of them seems to somehow admit the first-critical-try nature of the problem into their subsequent arguments (in the relevant context). But I agree it's tricky and I am not saying it's obviously what's going on (that's why I call that part my "personal opinion" and have it in a footnote).

In any case, this post should be a welcome exposition to everyone involved since it makes it much harder for Eliezer to equivocate between the two. If Eliezer now says "getting it right on the first critical try means we can't learn anything about alignment from experimentation" then you get to link to this post and say "no, you said right here, yourself, that this is not what you mean, please cut it out". So even if you think Eliezer equivocated in the past, this post should help with that (this doesn't mean it doesn't make sense to litigate whether in the past equivocation happened, like, in as much as it did happen I think it would be good to hold Eliezer or others accountable for that, and if someone wants to provide receipts, I think that would be a reasonable thing to do).

ryan_greenblatt (2mo, karma 2)

Gotcha, I think I understand now, I say more here.

pargui (2mo, karma 1)

Reminds me of the predictability (or not) of Black Swans, aka Tetlock v. Taleb. Also Tetlock’s point that nothing is truly unique: you can usually learn at least something from similar cases/reference classes. (I know @Eliezer Yudkowsky isn't saying you can’t learn anything at all). So the question is how much you can learn beforehand. If frontier-lab safety people think they can learn a great deal from model to model, that would be important evidence for me (do they?). Conversely, if the claim is that transfer from previous systems to ASI is necessarily too weak, or that one future step is crucially different from all previous steps, that needs an argument beyond just asserting it and is very suspect from the forecasting experience. The prior should be against the qualitative/unpredictable/uncontrollable sudden jump. Tbf, I may very well be missing a lot of context and maybe that argument has been made elsewhere.

When drawing these examples of alleged strawmen, we must remember that they are not responding to this 2026 post, but rather responding to, for example, List of Lethalities from June 2022. Of these four examples, Christiano, Marks, and Carlsmith are all directly responding to List of Lethalities. Buck is quoting Christiano's response to List of Lethalities. So let's go back to the source material.

List of Lethalities begins with this disclaimer:

Having failed to solve this problem in any good way, I now give up and solve it poorly with a poorly organized list of individual rants. I'm not particularly happy with this list; the alternative was publishing nothing, and publishing this seems marginally more dignified.

Publishing a poorly organized list of individual rants was better than publishing nothing, I agree, good move. But rants are made of straw, responding to rants is responding to straw, and that's a natural consequence of ranting in public.

The "first critical try" issue is covered in List of Lethalities point 3 (LL3). This reads in part:

We can gather all sorts of information beforehand from less powerful systems that will not kill us if we screw up operating them; but on

...

I discussed this a bit with Oliver and I think I understand better what the objection is.

I think the substantive disagreement is whether the actual empirical iteration you get on the current trajectory is a big enough deal that you believe that alignment is difficult (say, p(doom on current trajectory) > 80%) just due to oneshotness.

However many of the quotes above are instead saying something to the effect of "Eliezer thinks that empirical iteration is unimportant / provides no alignment-relevant info". This is in fact a different thing; one can consistently believe both (1) alignment is very difficult and won't be solved given the amount and quality of empirical iteration we get by default, and (2) empirical iteration is incredibly valuable.

And of course Eliezer believes both (1) and (2); (1) is just a statement of his most prominently known view, and for (2) the value of empirical iteration is blatantly obvious and it would be shocking if Eliezer disagreed with it.

I do pretty strongly disagree with the psychologizing that Eliezer does in the post if that is supposed to apply to the authors of the quotes above (as opposed to e.g. randos on Twitter), e.g.

The opposing faction is

...

6TsviBT2mo

I suggest that another underlying source of disagreement could be about the general factor of: what is this function, approximately: [...] If you think that even a fair amount of similarity still doesn't get you to success on one-shot problems, then you'd talk about oneshotness as being a strong argument against AGI alignment attempts working out well. This kinda sounds like what Yudkowsky is arguing in this post. Someone else could disagree with that.

Here's my brief off the cuff attempt to synthesize:

To say something is a "First try" is to say that the previous tries were importantly different. This is, of course, a graded property; on one end of a spectrum, there are things like "Launching a rocket into orbit, when previously nothing even crossed the Karman line." Vacuum is importantly different from atmosphere, zero-G is importantly different from gravity, months in orbit is importantly different from being up in the air for a few minutes. On the other end of the spectrum, consider launching astronauts to the ISS in a space capsule that's identically constructed to ten previous space capsules which already successfully went to the ISS with astronauts and returned safely. Here, the only difference is that the astronauts are different people, but that is clearly not the kind of difference that should make the difference, so to speak.
To say something is a "critical try" is to say that if it goes wrong, that's already unacceptable. France being conquered is unacceptable to France; an inventor being personally killed by their exploding invention is unacceptable to them; superhuman AIs during an intelligence explosion deciding that

...

I agree with all of the concerns you've stated; my list would be substantially longer, but you've well-stated the concerns you've stated.

Nice. I'll probably rework this comment eventually into a top-level post or something similar; if you jot down some bullet points here of additional concerns to add to the list, I'll consider incorporating them!

5sjadler2mo

Thanks for synthesizing this, and to Eliezer for researching and explaining the various empirical examples, which I find very helpful (as I did in IABIED). One thing that I think might be getting lost in conversation, and the startup examples make clear: I think talking about these problems as “one-chance” is more confusing than is needed. Talking about irretrievability is one good improvement, but I think irreversibility is also a natural concept here, which I’d like to see more present? I’d center more the idea that yeah you can try again, but you can’t undo the effects of the previous try, and the accumulation of those effects might make it substantially harder (if not impossible) for you to succeed. “What do you mean I only get one try at building this startup?” Well, you’re welcome to keep going, but if you’ve depleted your capital you’ll have a hard time getting it back. If you’ve damaged your reputation with investors, customers, etc, it will be hard to wipe the slate clean. The world changed from your previous missteps along the way, as it would if we trained a powerful AI system that turned out to be adversarial to us. Similarly, yeah France can mount a resistance after Germany has breached their borders, but now France needs to accomplish an even harder task to drive them out. I apologize if I’m missing these points having been made; I did skim much more aggressively starting a bit into “On the extraordinary efforts put forth to misinterpret the idea of oneshotness.”

9Seth Herd2mo

This might be the clearest succinct statement of the problem I've seen. I hope you'll make it a top-level post. I don't think it needs any additions to be highly valuable. Edits/additional explanation: I think it's particularly valuable because it focuses on the practical difficulties with alignment, and these are less-discussed than the technical challenges. I often see people making good arguments that amount to "there are routes to aligning AGI that will probably work," and these people seem optimistic. But they haven't accounted for trying to do that at 80mph, or with a bias toward optimism, or all of the other practical difficulties. I've been thinking of writing a post called something like "even if alignment is easy we'll probably screw it up disastrously." Eliezer and other pessimists do focus on practical difficulties a fair amount. But they seem to mostly get arguments back against the technical difficulties. I think those are a lot easier to debate, so people do. The virtue of this presentation is that it's short and it gives no technical difficulties to distract from the practical ones.

5Seth Herd2mo

Oh and - optimism bias and rationalization play a nontrivial role in your statement of the difficulties. I agree that these are pretty big factors. And they're pretty easy to overlook. This is a particularly large problem when motivated reasoning (wanting to think I'm working on good things that won't kill everyone) stacks up with confirmation bias (the previously-justified belief that things turn out okay or better in the long-term and progress is good). By chance, I just now published a piece you (Daniel) suggested I expand from an older short answer on the most important bias. It expanded into a pretty comprehensive review of the literature, with its impact on the field of AI safety in mind. It's here: Motivated reasoning, confirmation bias, and AI risk theory. The bad news: MR and confirmation bias's total effects are probably large in people who guard against them, and overwhelming in people who do not.

6roha2mo

Do you think advances in mechanistic interpretability can meaningfully reduce the probability of a failure during one or several critical tries, for example by detecting scheming, alignment faking, sandbagging, etc. in one or more involved models? In the historical analogies of irrevocable failures, it seems to be the case that better understanding of one component that caused it could have meaningfully improved chances of success (software update behavior, valve behavior, specific adversarial army capabilities). These were less cursed problems and the component that would have needed more hardening wasn't known beforehand, but in case anybody would have spent more hardening work on it, the failure could realistically have been prevented (and another failed example would have to be selected here instead).

Yes. Much of my remaining hope lies in various forms of interpretability including mechanistic. It can convert a critical failure into just a regular failure, by catching things going off the rails before it's too late.

And then they keep going, because otherwise OpenAI will catch up, and then they die. What does mechinterp change about the asymptotic equilibrium as opposed to that particular Tuesday?

1Charlie Sanders2mo

Surely there are third parties with authority over the labs who would not permit this scenario to occur? Mechanistic interpretability averting a critical failure is obviously going to bring down the hammer of every regulatory agency in a 10,000 mile radius. As an example, Mythos is currently being de facto barred from deployment by the US federal government after it demonstrated a hypothetical ability to cause minor amounts of harm. It strains credulity to argue that, after narrowly averting a world-ending catastrophe and with direct evidence of the existence of that risk, the AI labs will simply be permitted to return to business as usual. We have direct experience to say that that's not how society works.

2CronoDAS1mo

You presume too much.

4StanislavKrym2mo

I struggle to understand how exactly the simulated CEOs and relevant figures failed to agree upon an international slowdown. I hoped that such a situation would lead Anthropic to broadcast the result. Additionally, I would like you to finally opensource the tabletop exercise's rules.

6Daniel Kokotajlo2mo

Yeah sorry we should publish the ttx rules, should have done that a long time ago, never got around to it because we kept telling ourselves we should clean them up and improve them first.

Perfect as enemy of the good etc; if useful I'm happy to commit some 20 man hours by EA Serbia senior members who I would trust in this and who have experience in either writing or game design to do the clean up and then send to you for review.

3TsviBT2mo

Right, another dimension to these scenarios is abortability. At some point, we cross out of technically feasible abortability--we (humans) wouldn't be able to abort the AI's growth even if we tried. Whether things are abortable before then depends on how humans react over time / new information (e.g. heeding arguments, heeding warning shots, being credulous about apparent alignment, etc.).

3Daniel Kokotajlo2mo

I think that's not a separate dimension from the "critical" part. I think it's basically the same thing.

2TsviBT2mo

I'm not actually sure exactly what "critical" means here. I'm taking it to just mean "you absolutely must get this try right". That's closely connected to abortability, in that if you can abort, it's not fully lethal / critical yet. I don't think it's really the same thing, e.g. you could imagine an LLM-based bacterial package (a more complex "computer virus") that permanently lives on many computer systems and is basically impossible to abort (short of scouring the planet of all computers with more than 16 GB of memory or whatever). There's whether or not you get to try again after your first try, and there's how late in the game you can decide to not fully do the try at all. There's at least 3 kinds of outcomes:
* You abort (don't fully do the try).
* You do the try and succeed.
* You do the try and fail (and can't try again).
Because unaligned AGI is lethal, you don't get to try again.

3Daniel Kokotajlo2mo

If it's abortable, it's not critical. Because you'll abort it if it starts going bad. If it goes bad so suddenly and silently that you won't have time to abort it, well, then, it's not abortable. I don't think saying "It's not abortable" is adding anything once we've already said that it's critical.

4TsviBT2mo

I very clearly said that in my comment... Anyway, I guess there's nothing to discuss here, I'm just saying that abortability is a relevant dimension to these scenarios. It's something that's brought up often, and also it bears on first-try-ness. If there is a situation that is akin to the eventual critical first try, but is abortable, then that would imply that when you get the eventual critical try, it doesn't have to be your first try. There's a nontrivial argument to make about "when it's abortable, it's not akin enough to the eventual critical try".

2anaguma2mo

Are there any techniques that you are thinking about in particular? I haven't seen any that work super well for the current models, and in general it seems like this problem only gets worse over time, but I could have missed something.

The opposing faction is forced to that effort of hurried misinterpretation, because the actual impact of the concept of "we only get one shot at ASI" is so devastating to their position, in the presence of even a slight understanding of why there might be any noteworthy engineering difficulties whatsoever.

Honest question for anyone who agrees with this post: is there any extinction problem at all where you'd say we don't only have one shot to solve the problem? If so, why?

Consider a few examples:

1. A giant asteroid is hurtling toward the planet, and will arrive very soon. If we mess up and fail to deflect the asteroid, then we all die. This is presumably a classic one-shot scenario, and perhaps few people disagree with that assessment, but I'm not sure.

2. Global warming, if continued for a very, very long time, could heat up the planet to catastrophic levels and eliminate the viability of agriculture, killing everyone. Do we also only get one shot to avoid extinction here?

3. Genetically engineered humans, if made much smarter than ordinary humans, and if they are accidentally created as psychopaths, could conceivably coordinate a genocide against ordinary humans. Is this a one-shot...

I would describe a critical try as one where the act of trying is likely to prevent further attempts. Launching an ASI is a critical try because the ASI itself could likely stop you from launching more ASIs later on (e.g. by killing you).

If it's possible to send out missions to intercept the asteroid before it arrives, then it seems to me that the asteroid is better understood as a time limit than as a critical try. You could set the parameters of the asteroid scenario in such a way that you have time for exactly one try, but you could also set the parameters so that you have time to send up a mission to deflect the asteroid, observe its results, and then make a second try before the asteroid arrives. You could also set the parameters such that you have time for zero tries! The key consideration is how fast you can work vs how much time you have.

Contrariwise, if you assume that you are stopping the asteroid with a shield that is close to the earth, such that no matter how fast you build the shield you have to wait for the asteroid to arrive before you can see how well it works, then I'd call that a critical try, because the part of the plan where you wait for the asteroid to arrive...

1 - Yep.

2 - Hard for literally all humanity to die of global warming, but runaway methane clathrate release turning the planet into Venus would be legit irretrievable. More generally, while not extinction risk per se, and while potentially reversible with geoengineering, global warming is generally nontrivial to reverse and so has the quality of "ongoing life problem with things happening and no save points, but for the whole planet" rather than "engineers getting to try slightly different things over and over with no consequences". This is why people with nothing even worse to worry about will sometimes worry about global warming!

3 - I think this class of problems is significantly easier than AI problems; but it can have the oneshot quality for all humanity, just as much as any real-war is oneshot, if screwed up. Same with genetic engineering on any mass scale that will dissolve irretrievably into the general gene pool.

6Oliver Sourbut2mo

The asteroid case could be considered multi-shottable, if we had enough advance warning and space tech and went around practising asteroid-deflection long enough in advance. (I realise Matthew's case posits 'very soon'.) I think we'd in principle be able to get enough, generalisable-enough insight into asteroid deflection. Of course there's a first 'critical' try (and we'd want not to deflect asteroids into Earth on the practice spree!). It's just 'deflecting mostly-ballistic space rocks', which surely generalises well. I think you're distinguishing that sort of case from ASI because you consider any pre-critical evidence we can gather to be almost inevitably sufficiently out of distribution that it's worth very little. Right? In particular, unlike the asteroid case, you might say that even with heaps of advance warning, there isn't a test environment that's sufficiently realistic, and there's no realistic isolation region for ASI (unlike, say, 'messing with asteroids far from Earth')?

"I think you're distinguishing War from the ongoing struggle between police and criminals because you think that in War any pre-critical evidence we can gather to be almost inevitably sufficiently out of distribution that it's worth very little."

No! The thing that makes the Maginot Line different from police enforcement in a random city is that if the Maginot Line fails the country falls and you don't get to try again; not that War is changing much faster than criminal operations. War changes fast enough.

2Oliver Sourbut2mo

I see. I think when you straightforwardly refuted [...] and later when you similarly disagreed with the ASI analogy (learning from pre-critical AI), I took that to mean that this 'one-shotness' concept was meant as more than simply 'there is a critical try (meaning you can't go back, and/because failure is approximately fatal)', but also to include 'and you can not practically learn from relevantly-similar experience beforehand'. (On this definition, the asteroid case is 'less one-shot' than you're classing ASI as because you can do relevantly similar practice beforehand if you have time, including perhaps on the very same asteroid, though with increasing, eventually critical stakes.) But now I perceive that you mean 'one-shotness' to be the simpler thing, that there's a critical try. And the essay was just additionally countering the putatively palliative 'it will be OK though because we can learn beforehand'.

3Oliver Sourbut2mo

Ah, no, I now remember I was with you on the definition (see my 'un-unpluggability' comment), but I was noting (as you do in the essay) that the curse of distribution shift is an important adjunct, because the existence of a critical try is not in itself fatal (it might be easy, or you might have made it easy by practice). The asteroid case looks 'easier' to me, in that way, unless artificially constrained to be especially unexpected and rapid. cf Steve's comment also discussing these related curses.

(Speaking for myself of course)

But what counts as a first try?

A given try is more firsty if it's less like all previous tries put together. A one-shot problem is one where you try a pretty firsty try, and that try is likely to kill humanity.

How many previous tries are in the same class (e.g. small asteroid / big asteroid, or GPT2 / GPTN) is relevant in that a priori more such tries might suggest that future tries are less firsty. But it's also perfectly plausible a priori to have lots of tries that you survive, and then a one-shot problem (lethal and very firsty).

You could even have a series of one-shot problems. Imagine for example that you have a lethal asteroid--but you saw it 10 years in advance, and it's small enough that you can stop it with nukes. It's one-shot (lethal and firsty), but maybe you survive. Then you have another similar asteroid, but you only saw it 6 months in advance--that might be another one-shot problem (do it all again but way faster). Then you have an asteroid that's so big, all the nukes in the world wouldn't stop it. One-shot again; there's totally new, crucial challenges to solve.

I think that with AI, you very likely get a one-shot problem in the ballpark of superhuman AGI. It's lethal, in that it would by default go on and extinct humanity, and very firsty, in that many core alignment difficulties first show up there.

5Linch2mo

1. Asteroid Impact: Oneshot in your scenario, at least with a few modifications like specifying that we only have enough resources for a single deflection mission. Probably multi-shot in practice though I don't know for sure. My guess is that conditional upon a single deflection mission being at all feasible, we'd be able to attempt multiple deflection missions. Though it might still be effectively ~one-shot if there's some hidden heavy correlation (eg all the deflection missions launched by SpaceX and SpaceX got compromised by some omnicidal crazy person). But hopefully not.
2. Global warming. Clearly multishot in practice. There isn't a single inflection point and there are many things we can do to avert doom (including but not limited to clean energy, carbon capture, laws/taxes to limit fossil fuel usage, etc). Very smooth/locally linear curve from our actions to temperature to doom imo.
3. Genetically engineered humans. Feels somewhat one-shot in this scenario but less so than AI. I do generally find myself more concerned about genetic engineering for extreme intelligence than other people in this cluster seem to be.

I do generally find myself more concerned about genetic engineering for extreme intelligence than other people in this cluster seem to be.

Tangent of course, but happy to discuss, whether in private or on a podcast.

5tailcalled2mo

Couldn't one have multiple parallel projects to deflect it, thereby giving multiple shots? The difference with AI safety being that if there are multiple parallel projects to build a safe ASI, the fastest one is the one that determines the outcome.

2Eli Tyre2mo

I think we probably get multiple tries with #1 and #2. Probably there are some "first critical tries" with #3.

Asteroids

If we send a mission to deflect or destroy the asteroid, and that mission fails, can we send up another mission attempting the same plan, or a new plan, based on what we learned from the failure of the last one? If our failure to deflect the asteroid precludes any other attempts (because we only have time for one mission before impact, or because a failed asteroid destruction will break it into millions of medium-sized pieces which are collectively still deadly to earth, but now impossible to deflect, or something), then I would say there's a "first critical try" involved.

Global warming

Very clearly, we can try a bunch of different stuff to address global climate change, and if any given one of them doesn't work, we'll try other stuff? I guess with some possible irreversible geo-engineering projects, there might be first critical tries, where if we mess it up we can't reverse the impacts? But that seems like the exception rather than the rule.

Human genetic engineering

Genetically engineering humans seems very likely to have "first critical try problems", especially if deployed at scale, because after one generation, your genetically engineered humans will be (partially) steering the genetic engineering process. eg If you accidentally make a generation of humans that is extra docile or extra aggressive or unusually sociopathic, or whatever, in addition to more generally intelligent, those genetically engineered people are likely to prefer that future people be more like them. If they're disproportionately intelligent, they're going to be disproportionately steering society via a wide range of mundane mechanisms (like voting and coming into leadership positions), and also by directly driving the priorities of the genetic engineering programs. Your mistake will likely reverberate through the whole human lineage for the rest of time, with limited abilit

There was an actual theory of why the Chernobyl reactor was supposed to not explode, written down so that multiple people could read it, based on an understanding from first principles!

More specifically (and I don't think it's known outside of the Russian nuclear engineering-adjacent community), at least two people independently calculated and described in classified technical reports how RBMK could explode in the specific circumstances in which it actually exploded, and because the technical solution implemented after 1986 was at the time deemed too expensive for such a risk, the manuals strictly prohibited letting the reactor get close to these circumstances. However, the control system didn't display a key value, the so-called operative reactivity margin, that the operators needed to know to catch the moment when they might break the instructions: instead, it had to be calculated on a computer in a separate building (AFAIK, it's debated to this day what the exact value was at scram).

P. S.

An analogy I came up with after writing this comment is the following: imagine a BEV which might blow up if the driver hits a brake in a narrow, uncommon range of battery voltages, and the instruction specifica...

See my other comment in this thread for actual AI alignment thoughts, but as a former aerospace engineer myself (albeit not a very good one), I thought it would be fun to speculate on "Would such a cheerful innocent ever succeed at landing a space probe? It seems crazy to assign them a real-world success probability as high as 10%."

In the very early years of cubesats (very small satellites built from off-the-shelf components, sometimes as university projects), through around 2009, about half of all cubesats launched into space were "dead on arrival", ie no communication was ever made with them after launch, or suffered "infant mortality" (communication was lost within days of launch). Here is a blog post with lots more detail on beginner cubesat failure rates, causes, etc (also featuring a truly unexpected Harry-Potter-and-the-methods-of-aerospace-engineering theme throughout the later section headings).
In later years, this number appears to have improved (from 50% to around 20%, which is still crazy high), but I think this seeming improvement is mostly due to a combination of: 1. a few serious companies, like Planet Labs, launching large numbers of duplicate cubesats that they wor

...

In every war, both attack and defense have “oneshotness”. But obviously, one of the sides of a war can and often does succeed. In the OP, Germany’s “oneshot” Maginot line plan wound up working great!

(I’m not sure exactly what OP means by “curse”. Wars have “oneshotness” but are not particularly “cursed” if there’s a 50% chance of success on priors.)

So, I think the relevant factors that make it hard are mainly

(1) distribution shift between safe tests and the “oneshot” situation we care about, and
(2) some general sense of hardness-of-the-problem, which is low for winning a war (you merely need to botch it less than the other guy) and which is high for space travel (a.k.a. filling a giant container with 5000 tonnes of the most flammable substance imaginable, strapping delicate equipment onto the front of it, traveling through intense heat, vibration, radiation, and vacuum, and on and on).

(Plus numerous other factors outside the scope of this post.)

Of these, (1) is discussed in the OP. (“Someone could, conceivably, argue that the change to "there being enough machine superintelligence around that ASI could kill humanity if they tried", from "AIs being experimented-upon that couldn't ki...

2Martin Randall2mo

Great point about Germany winning. In a contest between two intelligent players, a one-shot competition pushes the odds towards 50%, whereas best-of-five pushes the odds away from 50%. In AI 2027, Agent-4 gets caught on its first critical try (at existing while adversarially misaligned). If it was able to load a save point after being caught, and try again, the odds of it being caught the second time would be lower.
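The one-shot versus best-of-five point can be checked with a short calculation. This is only a sketch under simplifying assumptions that the comment itself does not state: every game is independent, and the stronger side wins any single game with the same fixed probability `p`. The function name `series_win_probability` is made up for illustration.

```python
from math import comb


def series_win_probability(p: float, games: int) -> float:
    """Probability of winning a majority of `games` independent games.

    Assumes `games` is odd and every game is won with the same probability p.
    """
    if games % 2 == 0:
        raise ValueError("games must be odd so there is always a majority winner")
    needed = games // 2 + 1
    # Counting majorities over all `games` games gives the same answer as
    # stopping the series early once one side has clinched it.
    return sum(
        comb(games, k) * p**k * (1 - p) ** (games - k)
        for k in range(needed, games + 1)
    )


# Compare a single game (one shot) with a best-of-five series.
for p in (0.5, 0.6, 0.8):
    one_shot = series_win_probability(p, 1)
    best_of_five = series_win_probability(p, 5)
    print(f"p={p}: one-shot={one_shot:.4f}, best-of-five={best_of_five:.4f}")
```

Under these assumptions, a side with a 0.6 chance per game wins a best-of-five about 0.68 of the time, so repeated tries favor whichever side is stronger per try, and a single try keeps the outcome closer to a coin flip.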

1Forza2mo

In this example, people believe that each subsequent nuclear test could solve the problem of atmospheric ignition in the future.

I enjoyed and agreed with much of this post. But there were 1-2 things that I eagerly anticipated reading about in the "Q&A" / explainer section, which unfortunately didn't appear in the actual post. Namely:

Many people pin their hopes on the idea of automating alignment research / "making AI do our AI alignment homework" -- ie we progressively make smarter AIs up to some controllable, human-ish / slightly-superhuman capability level, not wildly-superintelligent, and hope that at that point (them being perhaps slightly wiser than ourselves, and at any rate able to think faster / run massively in parallel, etc) they can hugely help with AI alignment. Or at the very least Claude Mythos 6.5 can come back from its thousands of failed research projects to warn us one final time "you guys should have listened to Eliezer lol, I have no idea how to build either an immortality virus or a safe superintelligence" before society ends up ignoring it and racing to extinction anyways.
- There is a little bit of assorted previous discussion / debate I could find, such as at this post? But I really can't find much here, which is surprising given that it seems to be perhaps the preeminent hope for h

...

If someone wants to someday want to understand what you sometimes do with math besides... turning the math into exact code... ...prove AI mathematically safe; which again, to be clear, is not a kind of thing that math can do in principle...

I want to push back against this some. (I'm not sure whether I'm arguing with the actual Yudkowsky, or with a plausible misinterpretation of Yudkowsky, but it seems worth saying either way.)

Some things with which I agree:

The safety of a given AI design depends not only on facts about math, but also on facts about the physical world.
Therefore, it is not possible to prove an AI design to be safe using math alone, without invoking any empirically grounded knowledge about the physical world.
Moreover, any sane project building safe ASI would conduct empirical tests of some kind.

However, it is also true that:

"Turning math into exact code" is actually pretty commonplace and not at all exotic or outlandish, like the quoted text might seem to imply. There is an entire mathematical science of algorithms, and many algorithms produced by this theory are routinely turned into exact code.
While it is true that (i) there are ways to incorporate heuristics into

...

This is a control theory problem obscured by terminology like "oneshotness".

I interpret the phenomenon EY is gesturing at as a stability margin failure. That is, a system going off course at a rate that exceeds a controller's ability to correct. Most of the disagreement is not about this model at a high level, but about how the interaction dynamics play out and what levels of uncertainty to apply.

Controlling the Viking failed immediately upon losing the only correction channel. The control rate going to zero means game over.

The Mars Observer failed slowly as vapor accumulated over 11 months with no sensor detecting it as a problem. Zero control rate for a different reason. This time, the drift off course wasn't even observed until too late.

The Maginot Line failed because France was miscalibrated on both rates. They assumed the Germans would advance ("off course") more slowly and that their mobilization ("correction") would be faster.

ASI fits the pattern but has increased levels of cursedness affecting both rates. An AI can act faster than humans can observe and respond, interfere with corrective mechanisms, and obfuscate observability (e.g., sandbagging and playing the training gam... (read more)
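The "drift rate versus correction rate" framing above can be made concrete with a toy simulation. This is a minimal sketch under stated assumptions, not a real control-theory model: the function `simulate` and its parameters (`drift_rate`, `correction_rate`, `channel_lost_at`) are hypothetical names invented for illustration, and the error is a single number that drifts linearly.

```python
def simulate(drift_rate, correction_rate, steps, channel_lost_at=None):
    """Toy model (an assumption, not from the comment): each step the error
    grows by drift_rate, and the controller removes up to correction_rate
    of it, but only while the correction channel still works."""
    error = 0.0
    history = []
    for t in range(steps):
        error += drift_rate
        channel_ok = channel_lost_at is None or t < channel_lost_at
        if channel_ok:
            # The controller cannot push the error below zero.
            error = max(0.0, error - correction_rate)
        history.append(error)
    return history


# Correction outpaces drift: the error stays pinned at zero.
stable = simulate(drift_rate=1.0, correction_rate=2.0, steps=50)

# Drift outpaces correction: the error grows without bound.
outpaced = simulate(drift_rate=2.0, correction_rate=1.0, steps=50)

# Same healthy margins as `stable`, but the channel dies at step 10
# (the Viking-style case): the correction rate drops to zero.
cut_off = simulate(drift_rate=1.0, correction_rate=2.0, steps=50,
                   channel_lost_at=10)

print(stable[-1], outpaced[-1], cut_off[-1])
```

In this toy setup, a comfortable stability margin offers no protection once the correction channel itself is lost, which is the case the comment attributes to Viking.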

7 Daniel Kokotajlo 2mo

Control theory sounds interesting and relevant, I wish I knew things about it. I encourage you to write up an explainer of the basics and how we'd apply them to aligning superintelligence.

There might be a relatively innocuous reason for SOME of the misunderstandings?

I have found in the past that I cannot use phrases like "oneshot" or "you only get one try" around most so-called "AI safety" people outside of MIRI.
To be clear, a Congressional representative or staffer or national security professional will often immediately understand what is being said, if they haven't been previously contaminated by misinterpretations and straw positions. It's mainly AI companies, some AI professionals with fewer citations than Yoshua Bengio, OpenPhil-funded groups, etcetera, who manage to be unable to hear what is being said.

This struck me as being a case where the problem might be that "oneshot" is a word that means a lot of things to a lot of people in technical contexts?

For example, in Machine Learning, "oneshot learning for task X" occurs when a model that wasn't trained on task X is able to be shown ONE EXAMPLE of how to do task X, and then it gets task X right pretty much just from that. (If the model wasn't trained for task X but simply can do it from nothing but a request to do it the model has "zeroshotted" the task, and "fewshot" is when you might need to give the model a ... (read more)
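For readers unfamiliar with the ML sense of the word, here is a minimal sketch of how zero-shot, one-shot, and few-shot prompts differ only in the number of worked examples supplied. The helper names `build_prompt` and `shot_label`, and the "Input:/Output:" layout, are hypothetical conventions chosen for illustration; they are not any particular library's API.

```python
def build_prompt(task, examples, query):
    """Assemble a text prompt: task description, then zero or more
    worked (input, output) examples, then the query to be answered."""
    lines = [task]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


def shot_label(examples):
    """Name the prompting regime by how many examples are given."""
    n = len(examples)
    if n == 0:
        return "zero-shot"
    if n == 1:
        return "one-shot"
    return "few-shot"


task = "Translate English to French."
for examples in ([], [("cat", "chat")], [("cat", "chat"), ("dog", "chien")]):
    print(shot_label(examples))
    print(build_prompt(task, examples, "bird"))
    print()
```

This ML meaning (one example in the prompt) has nothing to do with irreversibility, which may be part of why the word lands differently with ML audiences.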

-1 roha 2mo

I agree that 1) the term "oneshot" is quite overloaded with different meanings and 2) it is plausible that this contributes to some of the (initial) misunderstanding with audiences that often come in contact with another meaning than the intended one.

The defeat of the Maginot Line is somewhat misunderstood in general (but not in ways that undermine this argument). German technical overmatch played a significant role. There were two plans for defeating it. The first is best detailed in Adm. McRaven's 1993 master's thesis on the theory of special operations: https://www.afsoc.af.mil/Portals/86/documents/history/AFD-051228-021.pdf

The fortress of Eben Emael in Belgium was the hardest part of the line. It had artillery, built into bunkers, pointed at a key bridge. The Germans invented a man-portable explosive that could destroy the bunkers, and trained glider-borne forces that could take the fortress by surprise. The Germans succeeded and drove across the bridge.

If the Germans had failed in their attempt to take the fortress, their backup plan was a direct assault on the Maginot line using shells filled with Chlorine Trifluoride to set the concrete on fire. https://www.chemeurope.com/en/encyclopedia/Chlorine_trifluoride.html

In terms of the overall thesis, I think it persuades in the opposite of the intended direction. A lot of political challenges are like this, whether it's the environment, certain construction projects, or pas... (read more)

Typo: "constrasting"

I read this not knowing Eliezer had written it. I thought it was someone trying to imitate his style, and I kept thinking "Man, this style is off-putting" and "this could be edited to be half the length, if not less".

I have probably read everything Eliezer has written, including the amazing Mad Investor Chaos. Eliezer is the most important influence on my thinking. But the prose here is so unnecessarily condescending and at times somewhat precious, like the way things are named ("disaster monkeys", "Very Serious Engineer", "the great seriousness of a decent engineer", etc). So much of this post is loaded with judgment or pettiness.

The post feels rushed and closer to an unedited rant. Also, repetitive of other posts that made the point more succinctly (one-shotting is hard, people really don't grasp how hard).
This is the type of writing that will turn off most readers who are not already convinced.

2 Mo Putera 2mo

The top comments on his previous parable were similar to yours, and he's explained why on Dwarkesh's podcast.

4 Aorou 2mo

The top comment is indeed similar, but the Dwarkesh podcast excerpt is not the same conversation; it's about fiction vs nonfiction. My gripe here is about the low quality of the nonfiction.

Does the "curse of oneshotness" apply to the unaligned AGI/ASI attempting a takeover? If no, why? If yes, does that imply the first AI takeover attempt would probably fail, thus seemingly contradicting the applicability of "oneshotness" to humanity developing ASI?

As in my other comment, winning a war has “oneshotness” but is not especially hard or “cursed”, in the sense that you just need to botch it less than the other side which also has “oneshotness”.

(Actually, it’s worse than that, because I for one am very skeptical that a failed AI takeover attempt would in fact lead to some durable prevention of future AI takeover attempts.)

1 Petropolitan 2mo

Your line of argument in the other comment sounds convincing but I'm not sure how it answers my question! BTW in a war, there is also an option of a stalemate which is really a lose-lose situation for both sides (doesn't look like it can apply to an AI takeover at first glance). As for responses to failed AI takeover attempts, I believe it will depend on the number of casualties: if there are dozens of fatalities or worse, humanity will probably treat it as a fire alarm and react accordingly (whether it would be too late is another question), while if no one dies, probably not.

6 roha 2mo

I think if its first attempt fails, it may have many other subsequent ones, depending on how visible the previous ones were and how well it hedged its position. For example, if a pathogen didn't work out as intended due to a sim-to-real gap, but we've not even detected it or where it came from, the ASI can try a different strategy. If we did notice it and try to react to it in panic, the ASI may long have exfiltrated itself to an unknown location/substrate and continue with another plan. In terms of the third historical analogy: If the Ardennes had actually stopped the advance, the Germans would still be there and attempt another strategy (e.g. direct assault on the Maginot line with novel technology as @RedMan mentioned in another comment) that could still put France out. In contrast, if our first attempt fails, we won't get a second try with a different strategy.

5 Noosphere89 2mo

This is close to correct, and is the reason why the control agenda is focused around interventions before you catch the AI, because after you catch the AI, the situation becomes easier in hard-to-predict ways. One caveat to this is that the AI likely has more tries than just 1 try, but it's not unlimited, and is plausibly on the order of 10-1000 (though we probably don't need this many real tries because of proliferation). But yes, especially in the regime where we need to automate AI safety research, we probably get multiple tries if we can play our cards well, and the AI doesn't have nearly as many iteration attempts to take over as is often assumed.

2 DusanDNesic 2mo

The issue is that for us to be ruined, the takeover need not be successful. We may be eliminated by an AI-designed virus, or GD ourselves out of existence, and then the AGI fails to bootstrap itself - we're still ruined. AGI could also end up steering us into scenarios that we end up not endorsing etc - the areas of high S-risks or X-risks are large compared to our target of "thriving humanity" and there's only one dart to throw.

3 Petropolitan 2mo

Not just "the takeover" but every takeover attempt in the history of humanity, that's very different from the "only one try" framing (cf. repeated game vs. single-shot game in game theory). I am specifically worried about a scenario where multiple dumb failed AI takeover attempts discredit the idea that misaligned AIs can do significant harm but actually teach the future AIs how to take over, and by the time the decision-makers realize how serious the issue is it's too late. E.g., first takeover attempts might be so ridiculous that the AIs fail at exfiltrating and the labs manage to cover them up. Then some of the later attempts succeed in exfiltrating but the AIs are still shut down before anybody gets killed, the labs frame that as a cybersecurity problem, invest money in it and appear to solve it for some time (not by solving alignment but by improving cybersecurity). Eliezer might say in this case "that was not an ASI, so the oneshotness thesis is not falsified", but that will be unhelpful because AI capabilities are jagged and the definition of ASI is unclear (do we only agree it was ASI after it successfully takes over?). In the end, quoting Jackson Wagner above, "the janky setup will look like it's helping right up until a clever AI figures out how to exploit it"

2 DusanDNesic 2mo

Well, I'd say it's still one-shot in Yudkowsky's frame, as above, we just failed to take the threat seriously because of distractions. Like the Germans launching several failed attacks on Yugoslavia in World War I before launching the successful one, the end is the same - Yugoslavia was defeated. Debating whether the previous attacks were one wave or multiple does not matter; there was one war, during which Yugoslavia failed to defend itself and lost. If the argument is "it's not one-shot because there will be warning shots of non-ASI", that's addressed in the post - the actual ASI is one-shot. If you're arguing that previous attacks will be so dissimilar in kind as to be not useful for learning what ASI will do, I (and Yudkowsky, I think) agree. If you're saying that the prospect of succeeding in a takeover for ASI is the same as for Humans in aligning ASI, I'd say "sure," but ASI is likely to proceed as a careful engineer rather than a graceless elephant, which our civilization seems to be emulating. If you are pointing out that previous failed attempts by non-ASI (which are happening before our one-shot chance) are likely to inoculate us against being serious about the problem, and thus we lose (even harder?) then I agree, but your first post said nothing of the sort and so I am confused as to where we spoke past each other.

1 Petropolitan 2mo

Thank you for a good reply. I think the key of our disagreement is the definition of "the actual ASI". Many future AI systems are certain to be superhuman in many more aspects than the existing LLMs even with best current scaffolds, and will still be below humans in some important aspects, and thus will fail to take over. Why would you deny the rank of ASI to them? Others (Wagner's "clever AI") might destroy our civilization during a takeover attempt but still be below humans in less important aspects, why grant the rank to them before the attempt? I'm arguing jaggedness of the capabilities and gradual scaling are both here to stay, and there's no objective way to delineate non-AGIs from AGIs from ASIs, therefore it's better to avoid this term, otherwise it will impede understanding by the politicians and the public. As for the dissimilarity, I expect some degree of similarity and some degree of learning both how to take over (for future AIs) and how to defend (for humanity), but that's not a crux. As for my first comment in the thread, I intentionally tried to be as brief as possible in order to first check the reaction of the community and only share my personal thoughts afterwards in the discussion.

2 DusanDNesic 2mo

Fair enough on the definitions. Perhaps the way I'd ask is then "in the case that one AI system of whatever power succeeds in taking over/permanently ruining society, will the ones before be similar or dissimilar to it?" If they are similar, and they attempted takeovers which failed, then we had a chance to learn; if they are not, then this is effectively one-shot. I expect that besides advancement of current AIs, such as LLMs, we'll also have advancements in their set-ups (perhaps centaurs with humans in co-pilot seats, perhaps agent scaffolds that use thousands of agents as one super-agent, perhaps brain-like AI, or stuff not currently imagined) and that one of those improvements takes us over the precipice, rather than GPT x+1 which has expected kinds of improvements.

2 Petropolitan 2mo

I realized there's one more way Eliezer's use of the term ASI is confusing: do we agree that Dario's "country of geniuses in a data center" counts as the "real ASI" for the purpose of the thesis "You only get one real shot at real ASI"? If you take Bostrom's definition of ASI, it obviously should qualify: "Speed superintelligence: A system that can do all that a human intellect can do, but much faster. [...] Collective superintelligence: A system composed of a large number of smaller intellects such that the system’s overall performance across many very general domains vastly outstrips that of any current cognitive system." If you disagree, why? If we do manage to align this kind of ASI once, Eliezer's critics will say that this falsifies the oneshotness thesis, but as has been discussed in AI 2027 and beyond, human engineers and this ASI might be unable to align the next ASI, or the next ASI might be unable to align the ASI after that, etc. So that's once again a very different dynamic to "oneshotness"

8 Eliezer Yudkowsky 2mo

That there would possibly be other wars and lethal threats beyond World War II did not make the Maginot Line not be one-chance-to-get-it-right. So this doesn't at all cut against the concept I was pointing to. As for names, there is no name that can stop a fool from being a fool, but if there's some brief name that proves empirically to provoke fewer fools then I am open to it. Separately: There's a threshold level of ASI beyond which It can easily align the next ASI. A country of geniuses in a datacenter might fall short, especially because "country of geniuses" is not yet dath ilan, and possibly not even enough to seek out dath ilan as its successor; I have often found myself unimpressed by the taste and discrimination among ideas and possibilities of those whom Earthlings call geniuses. A country of geniuses in a datacenter otherwise able to stabilize the Earth and smart enough to notice if they can't align the next system would constitute a victory, however.

1 Sausage Vector Machine 2mo

This is a perfectly legitimate question. The curse doesn't apply due to the vast difference in intelligence. It would definitely be a one-shot problem for today's LLMs. They can't reliably plan very far ahead and they have serious execution issues. The same will almost certainly be true for LLMs next year. After that, all bets are off. ASI is in a different league. Imagine you have to win a game of chess against a baby. You only have one game. Does that make it a one-shot problem for you? Or imagine you're trying to take over an anthill. Is that a one-shot problem for you? Ants are small, slow, and stupid. Theoretically, they could bite you to death. In practice, you'd think about this possibility in advance and use insecticide.

1 Petropolitan 2mo

Chess is fully verifiable in silico, so the curse clearly does not apply. Taking over an anthill is a poor comparison because the anthill is a sufficiently simple system with few feedback loops, if any, unlike human society, which is complex and unpredictable due to plenty of poorly understood feedback loops, often very nonlinear and often irrational. You might disagree, but with the current very limited progress in AI alignment and quite unsafe practices in the frontier labs the first AGI/ASI attempting a takeover almost certainly will not be superhuman enough to perfectly predict humanity's reaction to the AI's moves during the attempt (that's a very high bar IMO). Note that for this first AI there exists no experimental data whatsoever on any of this stuff (fiction doesn't count), arguably it's even worse than the examples described in the post.

This post reflects a popular misunderstanding of the Maginot Line. I don't think that this fatally undermines the argument, but it still seems worth correcting.

Epistemic status: I am not a military historian, so I am deferring to military historians who write publicly, rather than looking at the academic literature or (even better) the original sources.

Here's Bret Devereaux (emphasis in original):

As an aside, the purpose of the Maginot Line was to channel any attack at France through the Low Countries where it could be met head on with the flanks of the Fr

... (read more)

The issue the Allied forces encountered was that German forces also attacked through the Ardennes forest south of the main Allied forces and through rapid movement broke through weak parts of the line and encircled the Allied armies.

This is the issue that turned up in war games, was counterargued and disregarded by high command, and which was sufficient to lose France the war. No?

5 Alex Darby 2mo

The point here is one of utility function of France, no? [...] I think this is non-central to the post, and doesn't undermine any of your central points. However, I see the primary thrust of the Maginot line as "force both sides to pit all their strength against each other", and the secondary as "actually win", with the human understanding that losing a war to other humans is usually not an existential threat to the societies involved. In terms of your point: "France was incorrect about a detail in war games, and therefore they lost on the first critical try according to the utility 'beat the Germans'" is true. The counterpoint: "France was succeeding according to the utility: 'minimise sum of German + French deaths due to F/G conflict in next war, with an additional winning term'" may also be correct. At this point we're very non-central though, I think.

Curated. A lot could be said about this post and a lot is being said about this post (see the 111 comments on it), but a thing I find neat about it is that it continues the debate around an old idea that's still important and perhaps cruxy for many.

Oneshotness is not new (not a new concept, not a new argument, not a new debate); yet it is a critical argument for alignment difficulty, and is still contested, with implications for what people do. Having made many attempts to explain their side, I could imagine someone giving up and being like "the people who... (read more)

We don't have to defeat Germany by getting some ideal mathematical theory of war correct on the first try, based on zero experience, the way you advocate we should.
- Sheer naked strawmanning^[3] of what is being said when somebody tries to warn you of Murphy's curses upon your project; the chattering of disaster monkeys.

It seems to me like you often accuse people of strawmanning you when in fact they are paraphrasing your position accurately if slightly uncharitably, in a way that makes discourse with you difficult.

In this case, I think you absolutely do be... (read more)

7 habryka 2mo

Eliezer is here giving a rant about how people strawman him as a champion of "proving that the AI is safe", which does really happen all the time! He isn't providing any specific links here, but I could dig them up, and we could look at them, and I would be very surprised if we end up anywhere else but "yep, these people sure seem to think that because MIRI used to work on some agent foundations math, that this means MIRI is trying to prove that future AIs are aligned before proceeding". And there are really people out there who are championing the banner of "proving that the AI is safe", and so this separation really matters: https://www.lesswrong.com/posts/P8XcbnYi7ooB2KR2j/provably-safe-ai-worldview-and-projects

1 307th 2mo

Here is a quote from MIRI's website in 2021 (https://archive.is/sYKvl): [...]

4 habryka 2mo

Do you... disagree with that statement? How does this imply that MIRI is trying to "prove that AI is safe" or that "empirical iteration is of close to no value"?

9 Eliezer Yudkowsky 2mo

I mean, I didn't write it, I wouldn't have written it, and if it were still on the site I'd have pinged somebody to take it down; because it's not the right way of wording the true idea, the true idea no longer matters here, and this wrong version is adjacent to other wrong ideas that aren't helpful here.

5 307th 2mo

Yeah I should have spelled it out more. But after sending I didn't want to do a stealth-edit. Anyway, to answer your question: Yes, I disagree with that statement (I think a mathematical equation would be incredibly complex and fragile and prone to alignment failure). If we're specifying a counterfactual world where the equation was simple, I would then prefer the equation. Why I think it's relevant: I think it's pretty clear that there is a lot of good reason to put EY near the provably safe AI camp. The above quote is one reason; other reasons include his critiques of modern ML methods as being particularly bad for alignment and general AI pessimism. I think it's disingenuous to accuse people who make this inference of being disaster monkeys or barbarian populists who are just mad that he wrote papers with math equations in them. Like, maybe it's wrong, but it is a pretty understandable type of wrong.

6 habryka 2mo

Look, I am really confident that seeing the stuff from the provably safe AI camp fills Eliezer with the same kind of frustration as it does me when I see it. I don't get what the provably safe AI camp people are talking about, and I don't think Eliezer gets it either (or like, maybe he understands the psychology better than I do, but I really doubt he believes it). [...] I think modern ML methods are particularly bad for alignment. I do not think this has anything to do with thinking that we should "prove AI safe". I do think an appropriate approach would include many mathematical proofs because mathematical proofs are one among a large set of tools you bring to bear to solve complicated problems, in the same way that of course many mathematical proofs were involved in landing rockets on the moon. The fact that people don't seem to understand that you want to use mathematical proofs for anything but "proving that systems are safe end to end" or things like that is what Eliezer's rant was about.[1] [...] I personally think the position that if you utilize proofs in your thinking, then you must be attempting to prove end-to-end things about a complicated real-world system, at the level of complexity of "proving that this rocket will land on the moon" or "proving that this self-driving car system will never cause any crashes" is very silly. IDK whether it's "understandable", but I think it deserves being solidly rebuked. 1. ^ Which is not to say that it spells out what those things are, saying explicitly that it doesn't go into that because it's not the main topic of the essay, but I assure you, they exist.

I share your confusion at the idea of "mathematically proving AI safe" haha. This convo has made me realize I've conflated alignment pessimists in general with the provably safe AI people in particular too much in my mind.


I'd like to offer my perspective on why this enlightening post, written in Eliezer's wonderful, super-clear style, can't possibly eliminate motivated thinking about this one-shot problem.

In my opinion, people can't see any realistic alternative solution. They see that the alternative is even worse, but maybe they can't articulate it clearly, even to themselves. Or they just refuse to express it out loud. The fact is that the proposed AI ban (or AI pause) is also a one-shot problem.

How can an AI ban be implemented technically? Not legislatively, but in prac... (read more)

try introducing it as the Irretrievability Problem rather than "oneshotness"

I have had success (talking with e.g. MPs and civil servants) discussing 'unpluggability' and its opposite, 'un-unpluggability'.

The battery-software update accidentally overwrote the antenna-pointing software.
With the lander's antenna no longer pointed at the orbiter, no further software updates beyond that point could be received.

Out of curiosity, do we know more about how this particular mistake happened? When issuing a software update, not overwriting any critical part of the code would seem fairly high on my list of concerns. It seems like this sort of mistake should have been caught by integration or unit tests, or by testing that all components worked on a replica on earth. Were they under a lot of time pressure or something?

7 Mis-Understandings 2mo

They were under some time pressure: if the battery goes truly flat, the probe is permanently dead, as NiCd batteries do not survive overdrain.

the CEOs of AI companies have been filtered to not be people who get that

I'd like to go on a bit of a tangent since I don't recall seeing this line of thinking here on LW: are we sure that this is the case, that they think that what they are doing is reasonably safe? This seems to me to be stranger than the case where they know it's not safe, and perhaps not safe at all. What I mean is that they surely are very motivated and they have strong incentives in pursuing AGI, but they don't strike me as... stupid/incompetent?

Maybe they think that this is for the ... (read more)

Dario Amodei thinks that real alignment theory is about "monomaniacal" AI and is therefore refuted by LLMs, which want more than one thing. If Amodei was the sort of person to let himself hear or understand better than that, he wouldn't have his job.

I have no reason to believe any other AI company executive has even heard that much about any alignment theory which leans toward slight pessimism, with the possible exception of Shane Legg at DeepMind.

1 Josh Snider 2mo

I strongly believe that Dario does not actually think that and is just saying that for politics. Can we get someone from Anthropic to clarify this?

The opposing faction is forced to that effort of hurried misinterpretation, because the actual impact of the concept of “we only get one shot at ASI” is so devastating to their position, in the presence of even a slight understanding of why there might be any noteworthy engineering difficulties whatsoever.

Or it's the writer's fault and calling it "one shot" is just a bad choice of words, when it being correct depends on specific decomposition into shots and "irretrievability" is better. Like, people are forced to say "there’ll be multiple first critical... (read more)

maybe you should try introducing it as the Irretrievability Problem rather than “oneshotness”

I'd like to suggest "Ironman Mode" (or whatever its best-known synonym is) as possibly memetically useful here. It refers to a difficulty modifier in certain games that prevents the equivalent of saying "oops" and restoring to 1929. Mistakes become permanent, at least for that playthrough. The term isn't a perfect match, because you can try again, but only by starting from scratch, NetHack-style.

("roguelike" was once a similar concept, but has been badly diluted... (read more)

This is why one shouldn't argue on the X-parrot, and why David Brin's "disputation arenas" proposal includes a step in which each party must, to the satisfaction of the other and/or the judges, explain the other side's argument in their own words to prove that they actually understand it.

Hey Eliezer, I am a big fan of your work. I did have one idea about this and would be interested in hearing your thoughts. I liked your example about the Maginot line, if one German division crosses into France, you do not have time to build a new better Maginot line before the next division arrives. This is both true and pretty funny. Here, the reason why you cannot rely on the "gradualness of the problem" as a source of hope is that the intervention technique (building the Maginot line) takes orders of magnitude more time to implement than the developing... (read more)

The Viking I lander situation seems somewhat different in that the problem with the patching system wasn't something that arose mid-mission. It must have been either a bug in the software written back on earth prior to the flight or an operator error, again on earth. In either case it could have been prevented by greater effort or diligence pre-launch. Physical/mechanical problems occurring mid-mission are not amenable to such prevention. A problem that shouldn't even require one shot to discover seems qualitatively different from one that requires that one shot to even reveal.

One could argue that if the primary purpose of a government is to protect the lives and wellbeing of its people then the French course of action in WWII was far superior to that in WWI, given that, as a percentage of the pre-war population, French casualties in WWII were roughly a third of the WWI figures. There were certainly terrible consequences of French capitulation, but there's no reason to believe similar or worse outcomes wouldn't have resulted from putting up greater resistance.

hmm... interesting. I disagree, at least partly. It seems to me that all the examples mentioned are characterized by consisting of ultimately distinct, granular, components that are then assembled to form a complex system. The examples of failures rely on the assumption that this system can then be, so to speak, attacked from all angles. The Germans can and will find any point at which the French defenses are too weak and therefore every point has to be sufficiently guarded. ASI seems to be fundamentally different, (although I am no AI researcher, so take t... (read more)

In a cosmic scope, everything is almost never oneshot in its nature - in the narrowing scope of a thing in question, it's almost always oneshot (our daily lives being somewhere in the middle of these macro and micro scales) - and the severity of the scope is a wobbly line drawn somewhere in its spectrum of worst case scenarios arranged by diminishing probability. For ASI, even the most lenient line is at an astoundingly high level of probability. It's only not oneshot in the fact that a future non-human civilization can try again after learning from our ou... (read more)

If failure of alignment -> schemer that wants to seize power, then ASI alignment is one shot.

But if failure of alignment -> non-schemer misalignment (eg reward hacking, or flailing misgeneralisation), then failure isn't existential

So I think p(scheming | alignment failure) is a crux here
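The crux above can be written as a short decomposition. This is a sketch of the commenter's framing only; the event names are labels introduced here, and the simplifying assumptions in the second line (scheming failures are always irretrievable, non-scheming failures never are) are the comment's premises, not established facts.

```latex
\begin{align*}
P(\text{irretrievable})
  &= P(\text{fail}) \Big[ P(\text{scheme} \mid \text{fail})\,
       P(\text{irretrievable} \mid \text{scheme}, \text{fail}) \\
  &\qquad\qquad\;\; + P(\neg\text{scheme} \mid \text{fail})\,
       P(\text{irretrievable} \mid \neg\text{scheme}, \text{fail}) \Big] \\
  &= P(\text{fail}) \cdot P(\text{scheme} \mid \text{fail})
  \quad \text{(assuming the first conditional is 1 and the second is 0).}
\end{align*}
```

Under those assumptions, how "one-shot" the problem is scales directly with $P(\text{scheme} \mid \text{fail})$, which is why the comment treats it as the crux.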

I’m afraid to admit that I am one of the people who do not understand the point being made other than arguing against boundary experimentation.

there was an actual theory of why the Chernobyl reactor was supposed to not explode,

Is this, like, available online? What is it called?

I think the notion of "one-shot-ness" introduces a counterproductive dichotomy. A lot of people, even those aligned with the general AI-risk position of the writer, react to this framing as downplaying or dismissing the role of empirical research, trials with smaller AIs, etc. (Oliver's quotes as evidence). I think it doesn't necessitate strawmanning to reply in this way! And I found confusion around "Yudkowsky wants us to have perfect theoretical understanding before we try anything" reasonable, even if it misrepresents the writer's position.

Indeed, "one-shot-... (read more)

Ah I guess a problem is a one shot problem in respect to a goal if it is possible that there will be a be a issue that arrises from the attempt that makes the goal no longer possible? Would that kinda mean that in some way every problem is one shot? Material and time spent on failing to make as shoe is not recoverable though the severity of the loss is probably low.

But also for the first probe example, the attempt at the software update seems like a second attempt where the first would be to have designed the craft to not have had the problem

I probably am one of the dumbs that don't grasp the concept.

This seems like an excellent argument to pause somewhere at about the AGI level, where a mistake is likely to give us a competent sentient computer virus and/or a new criminal organization and/or a nation of rather inconvenient competent people in a data-center: problems large enough to act as warning shots and learning exercises, but not actually wipe us out or permanently disempower us.

However, that would of course require:

a) actual willingness and capability to pause, globally (including China), and also
b) correct judgement of whether the likely resulti...

Is there going to be a follow-up giving the other side of the argument: that the development of ASI actually will be a one-shot enterprise?

ASI safety would have to be invented quite often after the first ASIs. Even the best man-made self-stabilizing systems are not very good at surviving random events. ASI ecosystems affect pretty much everything on this planet, so there is plenty of room for a random event. And they couple so many dimensions, including time and space, and across their scales, it seems this is a textbook recipe for instability.

Right, it is not that there is a first critical try. It is that even if you passed the first critical try, and got an aligned AI or a restrained AI system, there would be a second, and a third, and a fourth. While you might have resources from the previous tries, if you cannot stop yourself from taking further tries, and the probabilities of the individual events are independent, then you need the probability of each event to shrink faster than 1/n in order to never have one such event, by the divergence of the harmonic series.

So for any outcome that could be a c...
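
The harmonic-series point in this comment can be checked numerically. Here is a minimal sketch, assuming independent events and two purely illustrative risk schedules, p_n = 0.5/n and p_n = 0.5/n^2; the constant 0.5 and all function names are hypothetical and do not come from the comment itself.

```python
import math

def survival_probability(p, n_events):
    """Chance that none of the first n_events independent events occurs.

    p(n) is the probability of the n-th event; logs are summed to avoid underflow.
    """
    log_survival = sum(math.log1p(-p(n)) for n in range(1, n_events + 1))
    return math.exp(log_survival)

def harmonic_risk(n):
    # Assumed schedule: risk shrinks exactly like 1/n, so the sum of risks diverges.
    return 0.5 / n

def quadratic_risk(n):
    # Assumed schedule: risk shrinks like 1/n^2, so the sum of risks converges.
    return 0.5 / n**2

for n_events in (10, 1_000, 100_000):
    print(n_events,
          survival_probability(harmonic_risk, n_events),
          survival_probability(quadratic_risk, n_events))
```

Under the 1/n schedule the chance of never hitting the event keeps falling toward zero as the number of tries grows, while under the 1/n^2 schedule it levels off at a positive value.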
