The level of concern and seriousness I see from ML researchers discussing AGI on any social media platform or in any mainstream venue seems wildly out of step with "half of us think there's a 10+% chance of our work resulting in an existential catastrophe".
In fairness, this is not quite half of the researchers; it's half of those who agreed to take the survey.
I expect that worried researchers are more likely to agree to participate in the survey.
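This worry can be made concrete with a toy simulation. A minimal sketch, with all numbers hypothetical, showing how a modest difference in response rates inflates the observed fraction of worried researchers:

```python
import random

random.seed(0)

# Toy population of 10,000 researchers (all numbers hypothetical):
# 25% are "worried" (would report a 10+% chance of catastrophe).
population = ["worried"] * 2_500 + ["unworried"] * 7_500

# Suppose worried researchers agree to take the survey at a higher
# rate (60%) than unworried ones (30%) -- the hypothesized response bias.
respondents = [
    p for p in population
    if random.random() < (0.60 if p == "worried" else 0.30)
]

true_rate = population.count("worried") / len(population)
observed_rate = respondents.count("worried") / len(respondents)

print(f"true fraction worried:     {true_rate:.0%}")
print(f"observed fraction worried: {observed_rate:.0%}")
```

Under these made-up response rates, a 25% true rate shows up as roughly 40% among respondents, which is why "48% of survey respondents" and "48% of researchers" are very different claims.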
I recall that they tried to advertise / describe the survey in a way that would minimize response bias—like, they didn’t say “COME TAKE OUR SURVEY ABOUT AI DOOM”. That said, I am nevertheless still very concerned about response bias, and I strongly agree that the OP’s wording “48% of researchers” is a mistake that should be corrected.
I figured this would be obvious enough, and both surveys discuss this issue; but phrasing things in a way that encourages keeping selection bias in mind does seem like a good idea to me. I've tweaked the phrasing to say "In a survey, X".
I like this model, much of which I would encapsulate in the tendency to extrapolate from past evidence. It resonates with the image I have of the people who are reluctant to take existential risks seriously, and it is also more fertile for actionable advice than the simple explanation of "because they haven't sat down to think deeply about it". That latter explanation might hold some truth, but tackling it is unlikely to prompt more action toward reducing existential risk unless people are also made aware of the possible failure modes in their thinking (and how to fix them), and of the ways AGI is fundamentally different enough that extrapolating from past evidence is unhelpful.
I advocate shattering the Overton window and spreading arguments on the fundamental distinctions between AGI and our natural notions of intelligence, and these 4 points offer good, reasonable directions for addressing that. But the difficulty also lies in getting those arguments across to people outside specific or high-end communities like LW; in building a bridge between the ideas created at LessWrong, and the people who need to learn about them but are unlikely to come across LessWrong.
But at the decision-making level, you should be “conservative” in a very different sense, by not gambling the future on your technology being low-impact.
What's the technical (like, with numbers) explanation for "why?"? And to what degree? It's a common objection that being conservative to the extent of "what if AI invents nanotechnology" is like worrying that your bridge will accelerate your traffic a million times.
This is why I said in the post:
Some people do have confident beliefs that imply "things will go well"; I disagree there, but I expect some amount of disagreement like that.
... and focused on the many people who don't have a confident objection to nanotech.
I and others have given lots of clear arguments for why relatively early AGI systems will plausibly be vastly smarter than humans. Eric Drexler has given lots of clear arguments for why nanotechnology is probably fairly easy to build.
None of this constitutes a proof that early AGI systems will be able to solve the inverse protein folding problem, etc., but it should at least raise the scenario to consideration and cause it to be taken seriously, for people who don't have specific reasons to dismiss the scenario.
I'll emphasize again this point I made in the OP:
Note that I'm not arguing "an AGI-mediated extinction event is such a big deal that we should make it a top priority even if it's very unlikely".
And this one:
My own view is that extreme disaster scenarios are very likely, not just a tail risk to hedge against. I actually expect AGI systems to achieve Drexler-style nanotechnology within anywhere from a few months to a few years of reaching human-level-or-better ability to do science and engineering work. At this point, I'm looking for any hope of us surviving at all, not holding out hope for a "conservative" scheme (sane as that would be).
So I'm not actually calling for much "conservatism" here. "Conservative" would be hedging against 1-in-a-thousand risks (or more remote tail risks of the sort that we routinely take into account when designing bridges or automobiles). I'm calling for people to take seriously their own probabilities insofar as they assign middling-ish probabilities to scenarios (e.g., 1-in-10 rather than 1-in-1000).
Another example would be that in 2018, Paul Christiano said he assigned around 30% probability to hard takeoff. But when I have conversations with others who seem to be taking Paul's views and running with them, I neither generally see them seriously engaging with hard takeoff as though they think it has a medium-ish probability, nor do I see them say anything about why they disagree with 2018-Paul about the plausibility of hard takeoff.
I don't think it's weird that there's disagreement here, but I do think it's weird how people are eliding the distinction between "these sci-fi scenarios aren't that implausible, but they aren't my mainline prediction" and "these sci-fi scenarios are laughably unlikely and can be dismissed". I feel like I rarely see pushback that's concrete and explicit enough even to distinguish those two possibilities. (Which probably contributes to cascades of over-updating among people who reasonably expect more stuff to be said about nanotech if it's not obviously a silly sci-fi scenario.)
To be clear, I very much agree with being careful about technologies that have a 10% chance of causing existential catastrophe. But I don't see how the part of the OP about conservatism connects to that. I think it's more likely that being conservative about impact would generate probabilities much less than 10%. And if someone says their probability is 10%, maybe it's a case of people only having enough resolution for three kinds of probabilities, and they just mean "less than 50%". Or they are already trying not to be very certain and are explicitly widening their confidence intervals (maybe after getting a probability from someone more confident), but they actually believe in being conservative more than they believe their stated probability. So then the question becomes why the probability is at least 10%: why is being conservative in that direction wrong in general, or what are your clear arguments, and how are we supposed to weigh them against "it's hard to have a large impact"?
I think it's more likely that being conservative about impact would generate probabilities much less than 10%.
I don't know what you mean by "conservative about impact". The OP distinguishes three things:
It separately distinguishes these two things:
It sounds like you're saying "being rigorous and circumspect in your predictions will tend to yield probabilities much less than 10%"? I don't know why you think that, and I obviously disagree, as do 91+% of the survey respondents in https://www.lesswrong.com/posts/QvwSr5LsxyDeaPK5s/existential-risk-from-ai-survey-results. See e.g. AGI Ruin for a discussion of why the risk looks super high to me.
I don’t know what you mean by “conservative about impact”
I mean predicting modest impact, for the reasons a futurist maybe should predict modest impacts (like "existential catastrophes have never happened before" or "novel technologies always plateau", or a whole cluster of similar heuristics in opposition to "building a safety buffer").
It sounds like you’re saying “being rigorous and circumspect in your predictions will tend to yield probabilities much less than 10%”?
Not necessarily "rigorous"; I'm not saying such thinking is definitely correct. I just can't visualize a thought process that arrives at 50% before correction, then applies a conservative adjustment because it's all crazy, still gets 10%, and proceeds to "then it's fine". So if survey respondents have higher probabilities and no complicated plan, then I don't actually believe the opposite-of-engineering-conservatism mindset applies to them. Yes, maybe you mostly said things about not being the decision-maker, but then what's the point of the quote about bridges?
I'm not sure that a technical explanation is called for; "conservative" just means different things in different contexts. But how about this?
Thank you. Your explanation fits the "futurist/decision-maker" distinction, but I just don't feel that calling the decision-maker behavior "conservative" is appropriate. If your probability is already 10%, then treating it like 10% without adjustments is not worst-case thinking. It's certainly not the (only) kind of conservatism that Eliezer's quote talks about.
There is another perspective that can be called “conservative”, which observes that futurists’ predictions are commonly overdramatic and accordingly says that they should be moderated for the sake of accuracy.
This is the perspective I'm mostly interested in. And this is where I would like to see numbers that balance caution about being overdramatic against having a safety margin.
Those are not the same at all.
We have tons of data on how traffic develops over time for bridges, and besides, they are engineered to withstand being packed completely with vehicles (bumper to bumper).
And even if we didn't, we still know what vehicles look like and can do worst-case calculations that look nothing like sci-fi scenarios (heavy trucks bumper to bumper in all lanes).
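A back-of-the-envelope sketch of that kind of worst-case calculation, with all figures illustrative rather than taken from any real bridge standard:

```python
# Worst-case live load: heavy trucks bumper to bumper in every lane.
# All figures are illustrative, not from any real design code.

span_m = 300            # hypothetical bridge span
lanes = 4
truck_length_m = 18     # a tractor-trailer, roughly
truck_mass_kg = 36_000  # a fully loaded heavy truck, roughly

trucks_per_lane = span_m // truck_length_m
total_mass_kg = trucks_per_lane * lanes * truck_mass_kg

# The worst-case live load is bounded and computable in advance --
# nothing like an open-ended "what if traffic grows a million times".
print(f"trucks on span: {trucks_per_lane * lanes}")
print(f"worst-case live load: {total_mass_kg / 1000:.0f} tonnes")
```

The point is that the worst case here is closed-form and bounded by physics we already understand, which is exactly what we don't have for AGI.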
On the other hand:
What are we building? Ask 10 people and get 10 different answers.
What does the architecture look like? We haven't built it yet, and nobody knows (with certainty).
Name one thing it can do: <sci-fi-sounding thing goes here>, or ask 10 people and get 10 very different answers (number 5 will shock you).
I'll give you my personal take on those three:
I've previously written that good ideas and conversations about AGI seem to have propagated through ML weirdly slowly.
A different weird phenomenon I observe is that the field's relative inaction about AGI seems less based on a confident set of beliefs about AGI definitely going well (or definitely being far off), and more based on an implicit sense like "the default is everything going well, and we don't need to change anything until there's overwhelming evidence to the contrary".
Some people do have confident beliefs that imply "things will go well"; I disagree there, but I expect some amount of disagreement like that.
But that doesn't seem to be the crux for most people in ML.
In a sane world, it doesn't seem like "well, maybe AI will get stuck at human-ish levels for decades" or "well, maybe superintelligence couldn't invent any wild new tech" ought to be cruxes for "Should we pause AI development?" or "Is alignment research the world's top priority?"
Note that I'm not arguing "an AGI-mediated extinction event is such a big deal that we should make it a top priority even if it's very unlikely". There are enough other powerful technologies on the horizon, and enough other risks for civilizational collapse or value lock-in, that I don't in fact think AGI x-risk should get major attention if it's very unlikely.
But the most common view within ML seems to be less "it's super unlikely for reasons X Y Z", and more of an "I haven't thought about it much" and/or "I see some reasons to be very worried, but also some reasons things might be fine, so I end up with medium-ish levels of worry".
In a mid-2022 survey, 48% of researchers who had recently published in NeurIPS or ICML gave double-digit probabilities to advanced AI's long-term effect being “extremely bad (e.g., human extinction)”. A similar number gave double-digit probabilities to "human inability to control future advanced AI systems causing human extinction or similarly permanent and severe disempowerment of the human species".
In an early 2021 survey, 91% of researchers working on "long-term AI topics" at CHAI, DeepMind, MIRI, OpenAI, Open Philanthropy, and what would become Anthropic gave double-digit probabilities to "the overall value of the future will be drastically less than it could have been, as a result of AI systems not doing/optimizing what the people deploying them wanted/intended".
The level of concern and seriousness I see from ML researchers discussing AGI on any social media platform or in any mainstream venue seems wildly out of step with "half of us think there's a 10+% chance of our work resulting in an existential catastrophe".
I think the following four factors help partly (though not completely) explain what's going on. If I'm right, then I think there's some hope that the field can explicitly talk about these things and consciously course-correct.
1. "Conservative" predictions, versus conservative decision-making
If you're building toward a technology as novel and powerful as "automating every cognitive ability a human can do", then it may sound "conservative" to predict modest impacts. But at the decision-making level, you should be "conservative" in a very different sense, by not gambling the future on your technology being low-impact.
The first long-form discussion of AI alignment, Eliezer Yudkowsky's Creating Friendly AI 1.0, made this point in 2001:
People who think their role is only to be a "conservative predictor", and not a "conservative decision-maker", will skew the scholarly conversation toward taking more extreme risks, because acknowledging extreme things sounds too out-there to them.
I personally wouldn't even call the predictions here "conservative", since this conflates "sounds normal" with "robust to uncertainty". All consistent object-level views about AI and technological progress have at least one "wild" implication (as noted in Holden Karnofsky's The Most Important Century), so views that sound normal here generally have to use misdirection and vagueness to obscure the wild part.
The availability heuristic and absurdity bias cause us to neglect big changes until it's too late.
My own view is that extreme disaster scenarios are very likely, not just a tail risk to hedge against. I actually expect AGI systems to achieve Drexler-style nanotechnology within anywhere from a few months to a few years of reaching human-level-or-better ability to do science and engineering work. At this point, I'm looking for any hope of us surviving at all, not holding out hope for a "conservative" scheme (sane as that would be).
But the point stands that if you have more "medium-sized" probabilities on those capabilities being available (as opposed to very high or very low ones), then a sane response to AGI should explicitly grapple with that, not pretend the probability is negligible because it's scary.
I do think debates between the "risk is extremely high" camp and the "risk is medium-sized" camp are important. But the importance mostly stems from "this suggests we have different background models, and should try to draw those out so they can be discussed explicitly", not "we should only take action about extreme risks once we're 95+% sure of them".
2. Waiting for a fire alarm, versus intervening proactively
There's No Fire Alarm for Artificial General Intelligence (written in 2017) makes a few different claims:
Quoting Yudkowsky:
Claims 1 and 2 still seem correct to me. We can hope that 3 is maybe false, and that we're now seeing a shift in the field toward taking AGI seriously, even if this wasn't foreseeable in 2017 and doesn't come with a lot of clarity about timelines.
For now, however, it still seems to me that the basic dynamics described in the Fire Alarm post are inhibiting action. Things are murky now, and I think there's a common implicit expectation that they'll be less murky later, and that we can safely put off thinking about the problem until some unspecified future date.
The bystander effect still seems powerful here. People don't want to be the first in a given social context to express alarm, so they default to looking vaguely calm while waiting for someone else to speak up or spring into action first. But everyone else is doing the same thing, so no one ends up acting at all.
This is a case where unilaterally acting at all (in sane and actually-helpful ways), speaking up, blurting your actual thoughts, etc. can be particularly powerful and important.
In some cases it may only take one person shattering the Overton window in order to open the floodgates for other people who were quietly worried. And even where that's not true, I expect better results from people hashing out their disagreements in argument than from people timidly waiting for the right moment.
3. Anchoring to what's familiar, versus trying to account for potential novelties in AGI
The level and nature of the risk from AGI turns on the physical properties of AGI. "AlphaGo wasn't dangerous" is evidence for "AGI won't be dangerous" only insofar as you think AGI is similar to AlphaGo in the relevant ways.
But for some reason, a lot of people who wouldn't go out on a limb and claim that AlphaGo and AGI are actually particularly similar in the ways that matter do treat AGI like "just a normal ML system". Their policy suggests confidence that AGI is in the same reference class as systems like AlphaGo or DALL-E in all the ways that matter, even though they wouldn't ever actually state that as a belief.
The whole conversation is baked through with a tacit assumption that the difficulty, danger, and importance of AGI alignment needs to be "just more of the same", even though AGI itself is a very new sort of beast.
But "get a smarter-than-human AI system to produce good outcomes" is not in fact similar to a problem we've faced before! I think it's a solvable problem in principle, but the difficulty level does not need to be calibrated to business-as-usual efforts.
Quoting Beyond the Reach of God:
4. Modeling existential risks in far mode, versus near mode
Quoting Yudkowsky again, in "Cognitive Biases Potentially Affecting Judgment of Global Risks":
In terms of construal level theory, personal tragedies are "near", while human extinction is "far". We think of far-mode things in more abstract and detached terms, more like morality tales or symbols than like messy, concrete, mechanistic processes.
Rationally, we ought to take larger disasters proportionally more seriously than equally probable small-scale risks. In practice, we don't seem to do that at all.
"Well, maybe we aren't all going to die; it's not a sure thing!" is a lot weaker than the bar we usually require for doing anything.
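The proportionality point can be made concrete with a toy expected-value comparison (all numbers hypothetical):

```python
# Toy expected-cost comparison: a small disaster and a large one
# that are equally probable. All numbers are hypothetical.

p = 0.01                      # same probability for both scenarios
deaths_small = 10_000         # a regional disaster
deaths_large = 8_000_000_000  # human extinction: roughly everyone

ev_small = p * deaths_small
ev_large = p * deaths_large

# The expected losses differ by the same factor as the disaster sizes
# (here, 800,000x), so rational prioritization should scale in
# proportion -- even before counting lost future generations.
print(f"expected deaths (small): {ev_small:,.0f}")
print(f"expected deaths (large): {ev_large:,.0f}")
print(f"ratio: {ev_large / ev_small:,.0f}x")
```

In practice our intuitive concern doesn't come anywhere close to scaling by that factor, which is the scope-insensitivity failure described above.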
The above is my attempt at a partial explanation of what's going on.
What do you think of this picture? Do you have a different model of what's going on? And if this is what's going on, what should we do about it?