If running a single copy of a given AI system (let's call it SketchyBot) for 1 month has a 5% chance of destroying the world ...
Even given entirely aleatoric risk, it's not clear to me that the compounding effect is necessary.
Suppose my model for AI risk is a very naive one - when the AI is first turned on, its values are either completely aligned (95% chance) or unaligned (5% chance). Under this model, one month after turning on the AI, I'll have a 5% chance of being dead and a 95% chance of being an immortal demigod. Run it for another month, year, or decade, and there's still a 5% chance that I'm dead and a 95% chance I'm an immortal demigod. Running other copies of the same AI in parallel doesn't change that either.
More generally, it seems that any model of AI risk where self.goingToDestroyTheWorld() is evaluated exactly once isn't subject to those sorts of multiplicative risks. In other words, "1 - .95**60 == we're all dead" only holds under fairly specific conditions, no epistemic arguments required.
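As an illustrative sketch of that distinction (my own toy code with made-up function names, not anything from the comment): in the first model doom is sampled exactly once at startup, in the second it is re-rolled independently every month, and only the second compounds with runtime.

```python
import random

def evaluated_once(p_unaligned=0.05):
    """Alignment is decided exactly once, at startup; extra runtime adds no risk."""
    return random.random() < p_unaligned  # same answer whether we run 1 month or 60

def reevaluated_each_month(p_doom_per_month=0.05, months=60):
    """Doom is re-rolled independently every month, so risk compounds."""
    return any(random.random() < p_doom_per_month for _ in range(months))

def estimate(model, trials=100_000):
    return sum(model() for _ in range(trials)) / trials

print("evaluated once:       ", estimate(evaluated_once))          # ~0.05
print("re-rolled every month:", estimate(reevaluated_each_month))  # ~1 - 0.95**60 ~= 0.95
```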
In fact, the epistemic uncertainty can actually increase the total risk if my baseline is the evaluated-once model. Adding other worlds where the AI decides each morning whether it wants to destroy the world, or is fundamentally incompatible with humans no matter what we try, just moves that integral over all possible models towards doom.
To make this model a little richer and share something of how I think of it, I tend to think of the risk of any particular powerful AI the way I think of risk in deploying software.
I work in site reliability/operations, so we tend to deal with things we model as having aleatory uncertainty, like a constant risk that any particular system will fail unexpectedly for some reason (hardware failure, cosmic rays, an unexpected code execution path, etc.). But I also know that most of the risk comes right at the beginning, when I first turn something on (turn on new hardware, deploy new code, etc.). A very simple model of this is one where most of the risk of failure happens right at the start and there's little to no risk of failure beyond that, so running for months doesn't represent a 95% risk; almost all of the 5% risk is eaten up right at the start, because the failure-time probability distribution has nearly all of its mass at the beginning.
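A minimal sketch of that shape, with hazard numbers I've made up for illustration: give the system almost all of its failure hazard in the first hour and only a tiny constant hazard afterwards, and the cumulative failure probability barely grows with runtime.

```python
def cumulative_failure_prob(hours, early_hazard=0.05, late_hazard=1e-7):
    """P(failed by `hours`) when the first hour carries almost all the hazard
    and each later hour carries only a tiny constant hazard."""
    p_survive = 1.0
    for hour in range(int(hours)):
        p_survive *= 1.0 - (early_hazard if hour == 0 else late_hazard)
    return 1.0 - p_survive

for months in (1, 12, 60):
    hours = months * 30 * 24
    print(f"{months:3d} months: P(failure) ~= {cumulative_failure_prob(hours):.4f}")
# Stays close to 0.05 in every case: almost all of the risk is paid up front.
```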
Agree, good point. I'd say aleatoric risk is necessary to produce compounding but not sufficient, though maybe I'm still looking at this the wrong way.
The mathematical property that you're looking for is independence. In particular, your computation of 1 - .95**60 would be valid if the probability of failure in one month is independent of the probability of failure in any other month.
I don't think aleatoric risk is necessary. Consider an ML system that was magically trained to maximize CEV (or whatever you think would make it aligned), but it is still vulnerable to adversarial examples. Suppose that adversarial example questions form 1% of the space of possible questions that I could ask. (This is far too high, but whatever.) It's likely roughly true that two different questions that I ask have independent probabilities of being adversarial examples, since I have no clue what the space of adversarial examples looks like. So the probability of failure compounds in the number of questions I ask.
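As a quick back-of-the-envelope version of this, using the 1% figure from the comment above:

```python
# If each question independently has a 1% chance of being an adversarial
# example, the chance of hitting at least one compounds with the number asked.
for n_questions in (1, 10, 100, 1000):
    p_at_least_one = 1 - 0.99 ** n_questions
    print(f"{n_questions:5d} questions -> {p_at_least_one:.1%} chance of >= 1 adversarial example")
# 1 -> 1.0%, 10 -> ~9.6%, 100 -> ~63.4%, 1000 -> ~100.0%
```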
Personally, I still put a lot of weight on models where the kind of advanced AI systems we're likely to build are not dangerous by default, but carry some ~constant risk of becoming dangerous for every second they are turned on (e.g. by breaking out of a box, having critical insights about the world, instantiating inner optimizers, etc.).
In this case I think you should estimate the probability of the AI system ever becoming dangerous (bearing in mind how long it will be operating), not the probability per second. I expect much better intuitions for the former.
1) Yep, independence.
2) Seems right as well.
3) I think it's important to consider "risk per second", because
(i) I think many AI systems could eventually become dangerous, just not on reasonable time-scales.
(ii) I think we might want to run AI systems which have the potential to become dangerous for limited periods of time.
(iii) If most of the risk is far in the future, we can hope to become more prepared in the meantime.
I really appreciate you sharing a word for this distinction. I remember being in a discussion about the possibility of indefinite lifespans way back on the extropians mailing list, and one person was making an argument that it was impossible due to accumulation of aleatory risk, using life insurance actuarial models as a starting point. Their argument was fine as far as it went, but it created a lot of confusion when it seemed there was disagreement on just where the uncertainty lay, and I recall that trying to disentangle that model confusion led to a lot of hurt feelings. I think having a term like this to help separate the uncertainty about the model, the uncertainty due to random effects, and the uncertainty about the model implying a certain level of uncertainty due to random effects would have helped tremendously.
(Optional) Background: what are epistemic/aleatory uncertainty?
Epistemic uncertainty is uncertainty about which model of a phenomenon is correct. It can be reduced by learning more about how things work. An example is distinguishing between a fair coin and a coin that lands heads 75% of the time; these correspond to two different models of reality, and you may have uncertainty over which of these models is correct.
Aleatory uncertainty can be thought of as "intrinsic" randomness, and as such is irreducible. An example is the randomness in the outcome of a fair coin flip.
In the context of machine learning, aleatoric randomness can be thought of as irreducible under the modelling assumptions you've made. It may be that there is no such thing as intrinsic randomness, and everything is deterministic, if you have the right model and enough information about the state of the world. But if you're restricting yourself to a simple class of models, there will still be many things that appear random (i.e. unpredictable) to your model.
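To make the coin example above concrete (my own illustration, not from the paper): a quick Bayesian update over the two coin models shows epistemic uncertainty shrinking as data comes in, while the aleatory uncertainty of each individual flip remains.

```python
def posterior_fair(flips, prior_fair=0.5):
    """P(coin is fair | observed flips), where flips is a string like 'HHTH'.
    Candidate models: fair coin (P(H) = 0.5) vs biased coin (P(H) = 0.75)."""
    like_fair = like_biased = 1.0
    for f in flips:
        like_fair *= 0.5
        like_biased *= 0.75 if f == "H" else 0.25
    return like_fair * prior_fair / (like_fair * prior_fair + like_biased * (1 - prior_fair))

print(posterior_fair("H" * 5))   # ~0.12: evidence is shifting us toward "biased"
print(posterior_fair("H" * 50))  # ~1.6e-09: epistemic uncertainty essentially resolved
# Even once we know which coin it is, the next flip itself stays random (aleatory).
```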
Here's the paper that introduced me to these terms: https://arxiv.org/abs/1703.04977
Relevance for modelling AI-Xrisk
I've previously claimed something like "If running a single copy of a given AI system (let's call it SketchyBot) for 1 month has a 5% chance of destroying the world, then running it for 5 years has a 1 - .95**60 ~= ~95% chance of destroying the world". A similar argument applied to running many copies of SketchyBot in parallel. But I'm suddenly surprised that nobody has called me out on this (that I recall), because this reasoning is valid only if this 5% risk is an expression of only aleatoric uncertainty.
In fact, this "5% chance" is best understood as combining epistemic and aleatory uncertainty (by integrating over all possible models, according to their subjective probability).
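In symbols (my notation, not the post's): if $M$ is the set of models you have epistemic uncertainty over, then

$$P(\text{doom by } T) = \sum_{m \in M} P(m)\, P(\text{doom by } T \mid m),$$

and only the per-model terms $P(\text{doom by } T \mid m)$ can compound with $T$; the mixture weights $P(m)$ do not.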
Significantly, epistemic uncertainty doesn't have this compounding effect! For example, you could have two models of how the world could work: one where we are lucky (L) and SketchyBot is completely safe, and another where we are unlucky (U) and running SketchyBot even for 1 day destroys the world (immediately). If you believe we have a 5% chance of being in world U and a 95% chance of being in world L, then you can roll the dice and run SketchyBot and not incur more than a 5% Xrisk.
Moreover, once you've actually run SketchyBot for 1 day, if you're still alive, you can conclude that you were lucky (i.e. we live in world L), and SketchyBot is in fact completely safe. To be clear, I don't think that absence of evidence of danger is strong evidence of safety in advanced AI systems (because of things like treacherous turns), but I'd say it's a nontrivial amount of evidence. And it seems clear that I was overestimating Xrisk by naively compounding my subjective Xrisk estimates.
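Here's a toy numeric version of the L/U example plus the update-on-survival step, reusing the 5%/95% numbers from above (my own sketch):

```python
# Two-world model: probability 0.05 we are in world U (SketchyBot destroys the
# world on day 1), probability 0.95 we are in world L (SketchyBot is safe).
P_U, P_L = 0.05, 0.95

def p_doom_by(days):
    """Subjective P(world destroyed within `days`) under the L/U mixture."""
    p_doom_given_U = 1.0 if days >= 1 else 0.0  # U: certain doom on day 1
    p_doom_given_L = 0.0                        # L: never
    return P_U * p_doom_given_U + P_L * p_doom_given_L

print(p_doom_by(1), p_doom_by(30), p_doom_by(5 * 365))  # 0.05 everywhere: no compounding

# Bayesian update after surviving day 1: world U predicted certain doom,
# so (in this toy model) surviving rules it out completely.
p_U_given_survived = (0.0 * P_U) / (0.0 * P_U + 1.0 * P_L)
print(p_U_given_survived)  # 0.0
```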
Overall, I think the main takeaway is that there are plausible models in which we basically just get lucky, and fairly naive approaches at alignment just work. I don't think we should bank on that, but I think it's worth asking yourself where your uncertainty about Xrisk is coming from. Personally, I still put a lot of weight on models where the kind of advanced AI systems we're likely to build are not dangerous by default, but carry some ~constant risk of becoming dangerous for every second they are turned on (e.g. by breaking out of a box, having critical insights about the world, instantiating inner optimizers, etc.). But I also put some weight on more FOOM-y things and at least a little bit of weight on us getting lucky.