I showed recently, predicated on a few assumptions, that a certain agent was asymptotically “benign” with probability 1. (That term may be replaced by something like “domesticated” in the next version, but I’ll use “benign” for now.)
This result leaves something to be desired: namely, a guarantee that the agent is safe for its entire lifetime. It seems very difficult to formally show such a strong result for any agent. Suppose we had a design for an agent which did value learning properly. That is, suppose we somehow figured out how to design an agent which understood what constituted observational evidence of humanity’s reflectively-endorsed utility function.
Presumably, such an agent could learn (just about) any utility function depending on what observations it encounters. Surely, there would be some set of observations that would cause it to believe that every human was better off dead.
In the presence of cosmic rays, then, one cannot say the agent is safe for its entire lifetime with probability 1 (edited for clarity). Any finite sequence of observations that would cause the agent to conclude that humanity was better off dead has strictly positive probability, since with positive probability, cosmic rays will flip every relevant bit in the computer’s memory.
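To spell out that step, here is a minimal sketch, assuming each of the N relevant bit-slots in memory flips independently with some probability θ ∈ (0, 1) (θ and N are notation introduced here for illustration, not quantities from the result above):

```latex
\Pr(\text{exact corruption pattern}) \;=\; \theta^{k}\,(1-\theta)^{\,N-k} \;>\; 0
```

where k of the N slots must flip and the remaining N − k must not; any particular finite observation sequence produced this way therefore has strictly positive probability, however small.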
This agent is presumably still asymptotically safe. This is a bit hard to justify without a concrete proposal for what this agent looks like, but at the very least, the cosmic ray argument doesn’t go through. With probability 1, the sample mean of a Bernoulli(θ) random variable (like the indicator of whether a bit was flipped) approaches θ, which is small enough that a competent value learner should be able to deal with it.
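As a quick illustration of that convergence, here is a small simulation sketch; the flip probability theta below is an arbitrary stand-in, not an estimate of actual cosmic-ray rates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-step flip probability, chosen only for illustration.
theta = 1e-3
n_steps = 1_000_000

# i.i.d. Bernoulli(theta) indicators of "a relevant bit was flipped this step".
flips = rng.random(n_steps) < theta

# By the strong law of large numbers, the sample mean converges to theta
# with probability 1, so the long-run corruption rate stays near theta.
print("long-run flip frequency:", flips.mean())   # ≈ 0.001
```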
This is not to suggest that the value learner is unsafe. Insanely inconvenient cosmic ray activity is a risk I’m willing to take. The takeaway here is that it complicates the question of what we as algorithm designers should aim for. We should definitely be writing down sets of assumptions from which we can derive formal results about the expected behavior of an agent, but is there anything to aim for that is stronger than asymptotic safety?
Not quite. The AI starts with some prior over (environment, advisor policy) pairs and updates it with incoming observations. It can take an action if, given its current belief state, it is sufficiently confident that it is an action the advisor could take. The confidence threshold is controlled by the parameter η, which has a certain optimal value for achieving the best regret bound (as γ→1, η→0; in other words, the more long-term the plan, the more cautious the AI becomes; obviously, catastrophes modify this trade-off).

That is, the AI generalizes from what it has already observed rather than requiring the exact same state to repeat itself. Indeed, if we required the exact same state to repeat itself, the regret bound would scale with the number of states. Instead, it scales with the number of hypotheses (of course, we can also derive a "structural" / "non-uniform" version for a countable number of hypotheses). Also, I am pretty sure that we can derive a regret bound that scales with the RVO and MB dimensions (I also think the MB dimension can be replaced by prior entropy, but so far I haven't been able to prove it), which can be bounded either in terms of the number of hypotheses or in terms of the number of states and actions, and can also remain small when both the number of hypotheses and the number of states are large.
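To make the confidence test above concrete, here is a toy sketch in code. This is not the actual delegative RL algorithm or its regret analysis; the Hypothesis class, the Bayesian update over a small finite hypothesis set, and the names eta and "delegate" are simplifying assumptions made purely for illustration.

```python
import numpy as np

class Hypothesis:
    """One (environment, advisor policy) pair; here only the advisor model matters."""
    def __init__(self, advisor_action_probs):
        # advisor_action_probs[state][action] = P(advisor takes action | state)
        self.advisor_action_probs = advisor_action_probs

def posterior_update(prior, hypotheses, state, observed_advisor_action):
    """Bayesian update of the belief over hypotheses after watching the advisor act."""
    likelihoods = np.array([
        h.advisor_action_probs[state][observed_advisor_action] for h in hypotheses
    ])
    unnormalized = prior * likelihoods
    return unnormalized / unnormalized.sum()

def choose(posterior, hypotheses, state, candidate_action, eta):
    """Take candidate_action only if, under the current belief, the probability
    that the advisor could take it is at least 1 - eta; otherwise delegate."""
    p_advisor_could_take = sum(
        w * (h.advisor_action_probs[state][candidate_action] > 0)
        for w, h in zip(posterior, hypotheses)
    )
    if p_advisor_could_take >= 1 - eta:
        return candidate_action   # act autonomously
    return "delegate"             # hand control back to the advisor

# Two toy hypotheses over a single state with two actions.
h0 = Hypothesis({0: {0: 0.9, 1: 0.1}})   # advisor might take either action
h1 = Hypothesis({0: {0: 1.0, 1: 0.0}})   # advisor never takes action 1
hypotheses = [h0, h1]
belief = np.array([0.5, 0.5])

belief = posterior_update(belief, hypotheses, state=0, observed_advisor_action=0)
print(choose(belief, hypotheses, state=0, candidate_action=1, eta=0.05))  # "delegate"
print(choose(belief, hypotheses, state=0, candidate_action=0, eta=0.05))  # 0
```

The point this sketch is meant to highlight is that the test asks whether the advisor could take the action under the current belief over hypotheses, rather than whether the advisor has already been observed taking it in this exact state.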