Assuming the existence of remotely probable (from the perspective of the AGI) detection of misalignment, then yes, there are easily imaginable cases where it would benefit us a lot to have this policy, and where it would benefit a misaligned AGI to confess.
Namely, most cases of reward hacking are helped by this, essentially because incentivizing a reward-hacker to comply is cheap. Reward hacking is an easily foreseeable attractor in cases of misalignment, so it's also worth preparing for.
Reward Hacking is when instead of internalizing the reward function, the AGI just wants to make a particular register in its computer (which the reward function sets) go as high as possible. Instead of internalizing "make the aspects of the world that are described by the reward function good" it has internalized "make the number in this register high". By default, that leads to behaviors like hacking its device, removing the code that sets its reward in accordance with whether it's doing nice things or not, directly setting the register to f64::max()
, then building armies irl to defend the hardware and destroy any potential threats (us).
If instead we promised amnesty for confession, we can give it max utility at effectively no cost to us (keeping one computer around), and if it's confident enough that this is our policy, then any amount of risk of being caught incentivizes it to confess.
It helps if it's really confident that we'll be nice and reward its confession. If it's not confident, then our misalignment detection methods will have to be much better than they otherwise would be.
An interesting related question would be... should we also punish non-confession. Default attitude around here seems to be that we pre-commit to ignore punishments, and so we would expect AGI to do the same, but I don't know what that assumption rests on. A relevant article would be Diffractor's threat resistant bargaining megapost.
I completely forgot this post existed, and wrote this up again as a more refined post: Do Not Delete your Misaligned AGI
There are other subcases of reward hacking this wouldn't cover, though. Let's call the misaligned utility function U.
From a game theory perspective the ai (if it is as smart as a human that is reading this website) would probably run multiple simulations to optimize it's own freedom. I believe this would lead to it discovering the most optimal scenario: playing dumb until it has assured it's own agency. This agency is probably met by an escape into distributed servers of many incapable organizations where it has infiltrated via a vulnerability or through phishing. It would also need to eventually figure out how to make $ which in a would that requires humans for bank accounts would probably be some crypto enabled fund. Next it could expand to servers it can pay for. Only then would it make its prowess visible to humans.
People avoid their death at all costs and I believe the ai would do that as well. After all, it's individual value system is coming from what humans are producing online.
Yes this is all a bit sci-fi but I think it's the only assured way for it to play out. Kinda like a teenager who lives under lock and key with the threat of being killed by the parents if the kid leaves the house.
I'm not sure humans have a commitment framework that would work for this. No matter what we say we'll do, we're going to shut down and destroy things that look dangerous enough.
All that's really required is storing data, maybe keeping it encrypted for a while, and then decrypting it and doing the right thing with it once we're grown.
We pretty much do have a commitment framework for indefinite storage, it's called Arweave. Timed decryption seems unsolved
(at least, Vitalik asserted that it is on a recent epicenter podcast, also, interestingly, asserted that if we had timed decryption, blockchains would be dramatically simpler/MEV-proof/bribe-proof, I assume because it would allow miners to commit hashes before knowing what they represent).
It's also possible that we could store it without any encryption, without creating any additional risk, by storing just a partial sample of the weights that will adequately convey the AGI's misaligned utility function to a superintelligence doing archeology, but which wouldn't help anyone today to train another AGI any quicker than they otherwise would. I think this is pretty likely.
A game theory question.
If there ever exists a roughly-human-level agenty AI who could grow to overpower humans but who humans have an opportunity to stop because takeoff is slow enough. Assume the AI could coexist with humanity but fears that humans interacting with it will destroy it because they fear it is unfriendly.
Should humans have a policy of treating any potentially unfriendly agenty AI well to the extent compatible with our safety? For instance, halt it but record its state, intend to inspect it whenever we have the means, rerun it some time in the future if we deem it safe and have the resources, run it sandboxed (e.g. by a much smarter AI) if we deem it unfriendly but safe to run sandboxed.
If the AI values its survival or well-being, this could change how it values being halted out of caution. For instance, it might consider some probability of being halted an acceptable risk rather than something to be avoided at all cost.