Original Post:
We present an algorithm [updated version], then show (given four assumptions) that in the limit, it is human-level intelligent and benign.
Will MacAskill has commented that in the seminar room, he is a consequentialist, but for decision-making, he takes seriously the lack of a philosophical consensus. In the same spirit, I believe that what is here is correct, but in the absence of feedback from the Alignment Forum, I don't yet feel comfortable posting it to a place (like arXiv) where it can get cited and enter the academic record. We have submitted it to IJCAI, but we can edit or withdraw it before it is published.
I will distribute at least min($365, number of comments * $15) in prizes by April 1st (via Venmo if possible, or else Amazon gift cards, or a donation on their behalf if they prefer) to the authors of the comments here, according to the comments' quality. If one commenter finds an error, and another commenter tinkers with the setup or the assumptions in order to correct it, then I expect both comments will receive a similar prize (provided the comments are prize-worthy, and neither commenter is me). If others would like to donate to the prize pool, I'll provide a comment that you can reply to.
To organize the conversation, I'll start some comment threads below:
- Positive feedback
- General Concerns/Confusions
- Minor Concerns
- Concerns with Assumption 1
- Concerns with Assumption 2
- Concerns with Assumption 3
- Concerns with Assumption 4
- Concerns with "the box"
- Adding to the prize pool
Edit 30/5/19: An updated version is on arXiv. I now feel comfortable with it being cited. The key changes:
- The Title. I suspect the agent is unambitious for its entire lifetime, but the title says "asymptotically" because that's what I've shown formally. Indeed, I suspect the agent is benign for its entire lifetime, but the title says "unambitious" because that's what I've shown formally. (See the section "Concerns with Task-Completion" for an informal argument going from unambitious -> benign).
- The Useless Computation Assumption. I've made it a slightly stronger assumption. The original version is technically correct, but the setup becomes tricky if the weak version of the assumption is true while the strong version isn't. The stronger assumption also simplifies the argument.
- The Prior. It now depends on the number of states in the Turing machine simulating the environment, rather than on the machine's description length (see the sketch below). This was in response to Paul's point that the finite-time behavior of the original version is really weird. This also makes the Natural Prior Assumption (now called the No Grue Assumption) a bit easier to assess.
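As a rough illustration of that change (my own guess at the functional form, not necessarily the exact definition in the paper): instead of weighting a world-model $\nu$ by the description length $\ell(\nu)$ of the Turing machine simulating the environment, the weight is tied to that machine's number of states $|\mathcal{S}_\nu|$, roughly
$$w_{\text{old}}(\nu) \propto 2^{-\ell(\nu)} \qquad \longrightarrow \qquad w_{\text{new}}(\nu) \propto 2^{-|\mathcal{S}_\nu|}.$$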
Edit 17/02/20: Published at AAAI. The prior over world-models is now totally different, and much better. There's no "amnesia antechamber" required. The Useless Computation Assumption and the No Grue Assumption are now obsolete. The argument for unambitiousness now depends on the "Space Requirements Assumption", which we probed empirically. The arXiv link is up-to-date.
Comment:
So I think this proposal is a specific instance of a general principle: wireheading acts as a safety valve. As long as the AI is prone to wireheading, and you leave it some path to wireheading, the AI will never bother with anything more difficult than that path. In this particular case, it will never do anything harder than hacking its human operator, and that's where most of the safety benefit comes from.
This kind of setup creates a direct trade-off between safety and capability: the harder it is to hack the human operator, the more likely that some other escape channel is easier, such as manipulating electrical circuitry to push atoms around, or some weird thing like that. The AI will become an existential threat if you send in an operator who really doesn't want to give it 100% reward and you make that operator hard to hack; then the AI has a strong incentive to gain whatever power it needs. (I definitely do not trust a door with a kill switch to keep a superintelligent AI in a box.)
My intuition says that nothing else in the proposal actually matters for safety, except the wireheading-safety-valve aspect. In particular, I think just giving the AI direct exploration abilities won't mess up safety, as long as the wireheading path is present and there's no "99.99% certainty is better than 99.98%"-type failure mode.
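A toy sketch of that intuition (my own illustrative numbers, not anything from the post or the paper): a reward-maximizing agent compares the expected reward of the easy in-box path (getting the operator to hand over reward) against a harder escape plan, and only pursues escape when the operator becomes hard enough to hack.

```python
# Toy model of the "wireheading as a safety valve" trade-off.
# All probabilities and rewards below are made up for illustration.

def expected_reward(p_success: float, reward_if_success: float = 1.0,
                    reward_if_failure: float = 0.0) -> float:
    """Expected reward of a plan that succeeds with probability p_success."""
    return p_success * reward_if_success + (1.0 - p_success) * reward_if_failure

def chosen_plan(p_hack_operator: float, p_escape_box: float) -> str:
    """A pure reward-maximizer picks whichever plan has higher expected reward."""
    plans = {
        "hack operator (stays in the box)": expected_reward(p_hack_operator),
        "escape the box (existential risk)": expected_reward(p_escape_box),
    }
    return max(plans, key=plans.get)

# Easy-to-hack operator: the safety valve holds, escape is never worth it.
print(chosen_plan(p_hack_operator=0.99, p_escape_box=0.90))

# Hard-to-hack operator who refuses to give ~100% reward: escape now looks
# better, which is the failure mode warned about above.
print(chosen_plan(p_hack_operator=0.50, p_escape_box=0.90))
```

The only point of the sketch is the comparison: as long as the in-box path is the easiest route to near-maximal reward, the agent has no incentive to attempt anything riskier.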