Advanced AI systems could lead to existential risks via several different pathways, some of which may not fit neatly into traditional risk forecasts. Many previous forecasts, such as the well-known report by Joe Carlsmith, decompose a failure story into a conjunction of distinct claims, and in doing so risk...
Written quickly rather than not at all: I was thinking about this a couple of days ago and decided to commit to writing something by today rather than adding the idea to my list of a million things to write up. This post describes a toy alignment problem that I'd...
Written quickly rather than not at all, as I've described this idea a few times and wanted to have something to point at when talking to people. 'Quickly' here means I was heavily aided by a language model while writing, which I want to be up-front about given recent discussion....
Epistemic status: trying to unpack a fuzzy mess of related concepts in my head into something a bit cleaner. A lot of my concern about risks from advanced AI stems from the possibility of deceptive alignment. Deceptive alignment has already been discussed in detail on this forum, so I...
I read through Eliezer’s “AGI Ruin: A List of Lethalities” post, and wrote down my reactions* as I went. I tried to track my internal responses *before* adjusting for my perception of the field’s overall views. In order to get this written and posted, I aimed to write...
This is a crosspost from the EA forum. It refers to EAs and the EA community a couple of times, but since it is essentially just about a nice norm and decision-making, it seemed worth having here too. There are a lot of things about this community that I...