One of the main concerns about AGI, and especially superintelligent AGI, is that it might indeed "hack" the control mechanisms in various ways.
These may include many forms of interpersonal deception, which we should assume can be at least as effective as the most skilled human manipulators.

- It may conduct operations that appear good (and even are good) but also have a side effect of more easily allowing or hiding future bad actions.
- It may modify itself directly to ignore these reward and punishment inputs, or induce some human to do it.
- It may suborn the panel by various means.
- It may determine that a period of even severe disincentive is something that it can get through with its desired self intact.
- It may be able to subvert the effect of the signals, such as prearranging its mental state such that the punishment signals alter it in a long-term desirable direction (no pain, no gain?).
I'm sure that an AGI with a lot more at stake, and far more capacity to think about these things, could come up with better ideas.
At least some of these seem possible to make unlikely. If we can force the AGI to have only a few routes to hacking or manipulating the panel, the signal, or its effects without incurring a dominating penalty, and strongly reinforce those routes, then we could avoid the worst-case outcomes.
> It may conduct operations that appear good (and even are good) but also have a side effect of more easily allowing or hiding future bad actions.
It can only get away with killing everyone if it manages to hack or manipulate the panel, the signal, or its effects.
> It may modify itself directly to ignore these reward and punishment inputs, or induce some human to do it. (...) It may be able to subvert the effect of the signals, such as prearranging its mental state such that the punishment signals alter it in a long-term desirable direction (no pain no gain?)
This was my main worry. Some ideas:
Aside: Does your response here rule out CIRL as a solution in general?
> It may suborn the panel by various means.
We can choose the panelists carefully and make sure there are enough of them, since the AGI needs a minimum number of votes from a size-limited panel (if this is the route it's taking). By design, the AGI would need to win over almost all of the panel to succeed, so attempting this would be very risky for it.
If we assume the panelists can't be hidden from the AGI anyway, we can have them meet each other in person (pairwise?) and undergo interrogations to check that they're not compromised.
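As a rough, illustrative calculation (all numbers here are assumptions, not part of the proposal): if each attempt to suborn a panelist succeeds independently with some probability p, then requiring the AGI to compromise nearly all of a size-limited panel makes its overall chance of success collapse.

```python
from math import comb

def p_compromise(n: int, k: int, p: float) -> float:
    """Probability of suborning at least k of n panelists, assuming each
    attempt succeeds independently with probability p (a toy model)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# With an assumed panel of n=20 requiring k=18 compromised members, even a
# fairly high per-panelist success rate of p=0.5 leaves the AGI with only
# about a 0.02% chance of pulling it off undetected.
print(p_compromise(20, 18, 0.5))  # ~2e-4
```

This ignores correlated failures (e.g. one bribery strategy working on everyone), which is exactly why the in-person checks above matter.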
> It may determine that a period of even severe disincentive is something that it can get through with its desired self intact.
Hmm, if we successfully align its goals with maximizing the signal, enduring a period of punishment could still make sense for it if it expects to wirehead itself afterward. But if it does wirehead itself, would that make it much less dangerous? I suppose it might then focus on self-preservation (and spreading?) so it could wirehead indefinitely, and accumulating large amounts of resources would still be useful for that.
Maybe we can hardcode the AGI so that later rewards can't make up for earlier failures (e.g. something like described at the bottom of this comment), although this might make the AGI useless, in which case no one would use such a design.
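One hypothetical way to hardcode "later rewards can't make up for earlier failures" (a sketch with assumed numbers, not necessarily what the linked comment describes): score any trajectory containing a punishment strictly below every punishment-free trajectory, regardless of what rewards follow.

```python
def total_return(signals: list[int]) -> float:
    """Toy return rule over per-period signals in {-1, 0, +1}.

    Punishment-free trajectories get an ordinary discounted sum, which is
    bounded above by sum(0.99**t) = 100. Any trajectory containing a -1 is
    clamped to at most -100, i.e. below every punishment-free trajectory,
    so no stream of later +1s can compensate for an earlier failure.
    """
    discounted = sum(s * 0.99**t for t, s in enumerate(signals))
    if -1 in signals:
        return min(discounted, -100.0)
    return discounted
```

Under this rule an AGI that expects even one punishment it can't prevent has no incentive to "ride it out" for later reward, which is the failure mode quoted above; the cost is that a single early mistake leaves it with nothing to gain, which is the uselessness worry.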
I wouldn't be surprised if others have thought about this before and decided it doesn't work, but I haven't seen it discussed. Is there a version of this proposal that's promising? If so, what would it look like? If not, why not?
What I'm imagining is that there will be a panel of people, and every n days, the AGI will receive a punishment signal (-1) by default, by design, unless that panel votes (with e.g. >90% consensus) to replace that signal with 0 and instead give the AGI a reward signal (+1, default 0). The panelists should commit to doing this if and only if the AGI hasn't done anything horrible. I think this could be implemented with (something like) cooperative inverse reinforcement learning (CIRL).
The AGI should be told that the punishments are prevented manually and the rewards are given manually, so that it knows that killing everyone or taking control is likely to cost it a lot, unless it can hack the signals first. Furthermore, if it finds a way to disable the punishment signal without also being able to hack the reward signal, it will simply never receive a nonzero signal of either kind.
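A minimal sketch of this signal logic, with assumed numbers (a 20-person panel, an 18-vote threshold standing in for the >90% consensus); the function and parameter names are illustrative, not part of any existing system:

```python
PANEL_SIZE = 20     # assumed panel size
NEEDED_VOTES = 18   # e.g. ~90% of a 20-person panel (assumed threshold)

def periodic_signal(votes_for_reward: int,
                    punishment_channel_ok: bool = True,
                    reward_channel_ok: bool = True) -> int:
    """Combined signal the AGI receives each period (every n days).

    By default the AGI gets the punishment (-1). Only a near-unanimous
    panel vote replaces it with the reward (+1). If the AGI disables the
    punishment channel without also hacking the reward channel, it simply
    receives 0 from then on, indefinitely.
    """
    if votes_for_reward >= NEEDED_VOTES:
        return 1 if reward_channel_ok else 0
    return -1 if punishment_channel_ok else 0
```

For example, `periodic_signal(18)` yields +1, `periodic_signal(17)` yields the default -1, and `periodic_signal(17, punishment_channel_ok=False)` yields 0, matching the "neither signal nonzero" case above.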
Here are some important details that I expect need to be fleshed out: