This is an attempt at building a rough guess of the impact you might expect to have with your career in AI safety.
Your impact will depend on the choices you make, and the estimation of their consequences also depends on your beliefs about AI safety, so I built an online tool so that you can input your own parameters into the model.
The main steps of the estimation
Main simplifying assumptions
These assumptions are very strong, and wrong. I’m unsure in which direction the results would move if the assumptions were more realistic. I am open to suggestions for replacing parts of the model so that these assumptions can be swapped for less wrong ones. Please tell me if I introduced a strong assumption not mentioned above.
Mathematical model of the problem
Model of when a world-ending AGI might happen, if there has been no AI safety progress
Without your intervention, and if there is no AI safety progress, AGI will happen at time $X$, where $X$ follows a distribution of density $\hat{q}$ over $[t_0,+\infty[$, where $\int_{t_0}^{+\infty} \hat{q} = P(\text{AGI happens})$[1]. AGI kills humanity with probability $p_{kill}$. This is assumed to be independent of when AGI happens. Therefore, without you, AGI will kill humanity at time $Y$, where $Y$ follows a distribution of density $q(t) = p_{kill}\,\hat{q}(t)$ (for all $t \in [t_0,+\infty[$).
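To make the discretization concrete, here is a minimal Python sketch (not the website’s actual code) of how $\hat{q}$ and $q$ could be represented on a yearly grid; the start year, horizon, and the geometric-like shape of $\hat{q}$ are made-up placeholders.

```python
import numpy as np

t0, horizon = 2024, 2124            # placeholder start year and cutoff
years = np.arange(t0, horizon)      # yearly grid approximating [t0, +inf[

# Placeholder belief about AGI timelines: a geometric-like density whose
# total mass over the grid is P(AGI happens) < 1 (the rest is "AGI never happens").
p_agi_happens = 0.8
raw = 0.05 * (1 - 0.05) ** (years - t0)
q_hat = p_agi_happens * raw / raw.sum()   # discretized density of X

# AGI kills humanity with probability p_kill, independent of when it happens.
p_kill = 0.3
q = p_kill * q_hat                        # discretized density of Y
```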
Model of when the AI alignment problem might be solved
Without you, AI Alignment will be solved at time $Z$, where $Z$ follows a distribution of density $p$, where $\int_{t_0}^{+\infty} p = P(\text{AI alignment can be solved})$. Its cumulative distribution function is $F$.
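Continuing the sketch above (again with made-up numbers), the alignment-timeline density $p$ and its CDF $F$ can be discretized the same way:

```python
# Placeholder belief about alignment timelines, with total mass
# P(AI alignment can be solved) < 1 over the grid.
p_solvable = 0.7
raw_p = 0.03 * (1 - 0.03) ** (years - t0)
p_align = p_solvable * raw_p / raw_p.sum()   # discretized density of Z
F = np.cumsum(p_align)                       # cumulative distribution function of Z
```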
Modeling your impacts on timelines
With your intervention, AI alignment will be solved at time $\bar{Z}$, where $\bar{Z}$ follows a distribution of density $\bar{p}$. Its cumulative distribution function is $\bar{F}$.
Between times $t_1$ and $t_2$, you increase the speed at which AI Alignment research is done by $s$. That is modeled by saying that $\bar{F}(t) = F(u(t))$, where $u(t)$ is a continuous piecewise-linear function with slope $1$ on $[t_0,t_1]$, slope $1+s$ on $[t_1,t_2]$, and slope $1$ on $[t_2,+\infty[$: between $t_1$ and $t_2$, you make “time pass at the rate of $1+s$”.
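Here is what that reparameterization could look like in the sketch, with a made-up career window $[t_1, t_2]$ and speedup $s$:

```python
t1, t2 = 2026, 2056     # placeholder years during which you contribute
s = 0.005               # placeholder overall speedup you cause

def u(t):
    """Continuous piecewise-linear time: slope 1 outside [t1, t2], slope 1 + s inside."""
    if t < t1:
        return t
    if t < t2:
        return t1 + (1 + s) * (t - t1)
    return t2 + s * (t2 - t1) + (t - t2)

# With your intervention, alignment tends to be solved earlier: F_bar(t) = F(u(t)).
F_bar = np.interp([u(t) for t in years], years, F)
p_bar = np.diff(F_bar, prepend=0.0)   # density of Z_bar, recovered from its CDF
```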
$s$ can be broken down in several ways, one of which is $s = f\hat{s}$, where $f$ is the fraction of the progress in AI Alignment that the organization you work for is responsible for, and $\hat{s}$ is how much you speed up your organization’s progress[2]. This is only approximately true, and only relevant if you don’t speed up your organization’s progress too much. Otherwise, effects such as your organization depending on the work of others come into play, and $f$ becomes a complicated function of $\hat{s}$.
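As a made-up numerical example: if your organization were responsible for $f = 5\%$ of the progress in AI Alignment and you sped it up by $\hat{s} = 10\%$, the overall speedup would be $s = f\hat{s} = 0.05 \times 0.1 = 0.005$.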
Similar work can be done to compute $\bar{q}$: with you, AGI will happen at time $\bar{Y}$, where $\bar{Y}$ follows a distribution of density $\bar{q}$, obtained from $q$ in the same way as we obtained $\bar{p}$ from $p$.
Computing how your intervention affects the odds that humanity survives
The probability of doom without your intervention is $d = \int_{t_0}^{+\infty}\!\int_{t_0}^{+\infty} \mathbf{1}_{t<u}\, q(t)\, p(u)\, dt\, du + p_{sad}$, where $p_{sad} = P(\text{AGI that kills happens and AI Alignment cannot be solved}) = \left(\int_{t_0}^{+\infty} q(t)\, dt\right)\left(1 - \int_{t_0}^{+\infty} p(t)\, dt\right)$ (which does not depend on your intervention).
The probability of doom with your intervention is $\bar{d} = \int_{t_0}^{+\infty}\!\int_{t_0}^{+\infty} \mathbf{1}_{t<u}\, \bar{q}(t)\, \bar{p}(u)\, dt\, du + p_{sad}$.
Hence, you save the world with probability $\Delta = d - \bar{d}$. From there, you can also compute the expected number of lives saved.
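Putting the pieces of the sketch together, the discretized doom probabilities could be computed like this (assuming, for simplicity only, that your work leaves AGI timelines unchanged, i.e. $\bar{q} = q$; the factor of $8 \times 10^9$ lives is just one crude, illustrative way to turn $\Delta$ into lives saved):

```python
# Killer AGI happens & alignment can never be solved (independent of your intervention).
p_sad = q.sum() * (1 - p_align.sum())

q_bar = q   # simplifying assumption for this sketch: your work does not move AGI timelines

# ind[i, j] = 1 if killer AGI at years[i] arrives strictly before alignment is solved at years[j].
ind = (years[:, None] < years[None, :]).astype(float)

d     = q     @ ind @ p_align + p_sad   # probability of doom without you
d_bar = q_bar @ ind @ p_bar   + p_sad   # probability of doom with you

delta = d - d_bar                       # probability that you save the world
lives_saved = delta * 8e9               # crude: multiply by the current world population
```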
Results of this model
This Fermi estimation takes as input $\hat{q}$ (your belief about AGI timelines), $p_{kill}$ (your belief about how AGI would go by default), $p$ (your belief about AI alignment timelines), $f$ (your belief about your organization's role in AI Alignment), and $\hat{s}$ (your belief about how much you help your organization). The result is a quite concrete measure of impact.
You can see a crude guess of what the results might look like if you work as a researcher in a medium-sized AI safety organization for your entire life here. With my current beliefs about AGI and AI alignment, humanity is doomed with probability $0.13$, and you save the world[3] with probability $3 \times 10^{-5}$.
I don’t have a very informed opinion about the inputs I used in the estimation, and I would be curious to know what result you would get with better-informed estimates of them. The website also contains other ways of computing the speedup $s$, and it is easy for me to add more, so feel free to ask for modifications!
Note: the website is still a work in progress, and I’m not sure that what I implemented is a correct way of discretizing the model above. The code is available on GitHub (link on the website), and I would appreciate it if you double-checked what I did and added more tests. If you want to use this tool to make important decisions, please contact me so that I can increase its reliability.
[1] $P(X = +\infty) = 1 - \int_{t_0}^{+\infty} \hat{q} = P(\text{AGI is never achieved})$.
[2] $\hat{s}$ should take into account that your absence would probably not remove your job from your organization: in most cases, someone else would do it, but slightly worse.
[3] You “save humanity” in the same sense that you make your favorite candidate win if the election is perfectly balanced: the reasoning I used is in this 80,000 Hours article. In particular, the reasoning is done with causal decision theory and does not take into account the implications of your actions on your beliefs about other actors' actions.