I wouldn't be surprised if others have thought about this before and decided it doesn't work, but I haven't seen it discussed. Is there a version of this proposal that's promising? If so, what would it look like? If not, why not?
What I'm imagining is a panel of people, and every n days the AGI receives a punishment signal (-1) by default, by design, unless that panel votes (with e.g. >90% consensus) to replace that punishment with 0 and instead give the AGI a reward signal of +1 (which otherwise defaults to 0). The panel should commit to rewarding if and only if the AGI hasn't done anything horrible. I think this could be implemented with (something like) cooperative inverse reinforcement learning.
The AGI should be told that the punishments are withheld manually and the rewards given manually, so that it knows that killing everyone or seizing control would likely cost it a lot, unless it can hack the signals first. Furthermore, if it finds a way to disable the punishment signal without also being able to hack the reward signal, it will simply receive no nonzero signal of either kind, indefinitely.
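To make the default-punishment rule concrete, here is a minimal sketch of the protocol. The -1/0/+1 values, the >90% threshold, and the withhold-everything-on-tampering rule come from the description above; the function names and structure are my own illustrative assumptions, not a spec.

```python
# Minimal sketch of the panel-gated signal protocol described above.
# Names and structure are illustrative assumptions, not a real system.

PUNISH, NEUTRAL, REWARD = -1, 0, 1
CONSENSUS = 0.9  # e.g. >90% of the panel must vote to reward

def period_signal(votes_for_reward: int, panel_size: int) -> int:
    """Signal the AGI receives at the end of each n-day period.

    Punishment (-1) is the built-in default; only a supermajority vote
    replaces it with a reward (+1). The panel's vote, not the machinery,
    is the only route to a non-punishment outcome.
    """
    if panel_size <= 0:
        return PUNISH  # no functioning panel: the default stands
    if votes_for_reward / panel_size > CONSENSUS:
        return REWARD
    return PUNISH

def delivered_signal(signal: int, punishment_disabled: bool) -> int:
    """If the AGI disables the punishment channel without also controlling
    the reward channel, the panel withholds rewards too, so the AGI
    receives no nonzero signal of either kind."""
    return NEUTRAL if punishment_disabled else signal
```

For example, a 10-person panel voting 10-0 to reward yields +1, while a 9-1 vote (exactly 90%, not above it) leaves the default -1 in place.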
Here are some important details that I expect need to be fleshed out:
- We make sure that these signals have greater weight than everything else the AGI might care about, but not so much that it leads to paralysis or fanaticism. I had one idea here.
- We make sure the signal is secure: too hard for the AGI to hack or otherwise manipulate without alerting us, in which case the AGI would be punished and/or stop receiving the panel's rewards.
- Panel succession must be decided, so that this arrangement can be maintained indefinitely. We'd want to do this in a way that makes it unlikely the AGI can get onto the panel itself or install someone there to represent it. Maybe the panelists choose their successors.
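On the second point, one standard ingredient for a hard-to-forge signal would be cryptographic authentication: the panel signs each period's signal with a key the AGI doesn't hold, so a forged reward fails verification and counts as detected tampering. This is a hypothetical sketch under that assumption, with all names invented; it doesn't address the harder problem of the AGI stealing the key or manipulating the panelists themselves.

```python
import hmac
import hashlib
import json

# Illustrative sketch, not a real protocol: the panel MACs each period's
# signal with a secret key held offline, so the AGI's reward module can
# verify authenticity and treat any forged or altered packet as tampering.

PANEL_KEY = b"panel-secret-key-held-offline"  # hypothetical placeholder

def sign_signal(period: int, signal: int, key: bytes = PANEL_KEY) -> dict:
    """Panel side: produce an authenticated packet for one period's signal."""
    msg = json.dumps({"period": period, "signal": signal}).encode()
    tag = hmac.new(key, msg, hashlib.sha256).hexdigest()
    return {"msg": msg, "tag": tag}

def verify_signal(packet: dict, key: bytes = PANEL_KEY) -> bool:
    """Receiver side: accept the signal only if the MAC checks out."""
    expected = hmac.new(key, packet["msg"], hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, packet["tag"])
```

A packet whose message or tag has been altered fails `verify_signal`, which would trigger the "alerting us" branch above.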
One of the main concerns about AGI, and especially superintelligent AGI, is that it might indeed "hack" the control mechanisms in various ways.
These may include:
- Many forms of interpersonal deception, which we should assume can be at least as effective as that of the most skilled human manipulators.
- Operations that appear good (and even are good) but also have the side effect of allowing or hiding future bad actions.
- Modifying itself directly to ignore these reward and punishment inputs, or inducing some human to do it.
- Suborning the panel by various means.
- Determining that a period of even severe disincentive is something it can get through with its desired self intact.
- Subverting the effect of the signals, such as prearranging its mental state so that the punishment signals alter it in a long-term desirable direction (no pain, no gain?).
I'm sure that an AGI with much more at stake, and far more capacity to think about these things, could come up with better ideas.
At least some of these seem possible to make unlikely. If we can force the AGI to have only a few routes to hacking or manipulating the panel, the signal, or its effects without incurring a dominating penalty, and strongly secure those routes, then we could avoid worst-case outcomes.
It can only get away with killing everyone if it manages to hack or manipulate the panel, the signal or its effects.