In May I attended Zuzalu's AI x Crypto mini-conference and got more interested in AI x-risk. I thought about a potential solution; I know it might be naive, but I looked for something like this and didn't find anything related, so I decided to share some notes here to get feedback. It's written as a post directed at lay readers.
Alignment is especially hard: we don't fully understand an AGI's inner workings today, and the complexity and potentially opaque nature of AGI itself may keep them hard to understand.
Among the ideas that surfaced, one that caught my attention is that we should NOT use censorship-resistant tech such as blockchains to train or run AIs, since that can make them difficult to unplug. As an outsider to the field, this was a new concept to me. Unplugging an AGI is one of the measures that can be taken to stop it; this is better known as an off-switch.
To put it simply, there are two types of approaches to this problem. The main one focuses on embedding the alignment into the AI itself, which for this specific case is called a shutdown utility function (https://arbital.com/p/shutdown_utility_function/). But there are also external solutions that try to limit the damage from misalignment. Both should be combined for an optimal solution; here I'm focusing on the external safeguard.
But if an AI goes rogue and manages to exterminate humans, we won't be able to turn it off. This led me to think: what if we use a dead-man switch? A dead-man switch is usually implemented to trigger a pre-defined action when a person dies: typically a beacon that stays inactive as long as it keeps receiving a signal, and fires when the signal stops. For the alignment case, the trigger would unplug all AI servers, along with any servers that could be used to train a new AI. This raises some problems to be solved; let's explore them (a rough code sketch of the basic mechanism follows the questions below):
Why should we bother to unplug the AGI if we are already dead?
If the AI knows about the off-switch, its basic incentives push it to prevent the switch from being triggered.
Who would trigger the dead-man switch?
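Before getting into these questions, here is a minimal sketch of the basic mechanism, assuming a single operator sending periodic heartbeats. The timeout value and the function names (receive_heartbeat, shut_down_ai_servers, etc.) are illustrative assumptions, not part of any existing system.

```python
import time

# Hypothetical timeout: one day without any signal triggers the switch.
HEARTBEAT_TIMEOUT = 60 * 60 * 24

last_heartbeat = time.time()

def receive_heartbeat() -> None:
    """Called whenever the operator checks in; resets the countdown."""
    global last_heartbeat
    last_heartbeat = time.time()

def switch_triggered() -> bool:
    """The beacon stays quiet while signals keep arriving and fires only once they stop."""
    return time.time() - last_heartbeat > HEARTBEAT_TIMEOUT

def shut_down_ai_servers() -> None:
    # Placeholder for the pre-defined action: cutting power/network to AI servers.
    print("Dead-man switch triggered: shutting down AI servers.")

def monitor_loop() -> None:
    while True:
        if switch_triggered():
            shut_down_ai_servers()
            break
        time.sleep(60)  # check once a minute
```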
Since we want to detect actual extinction, not just the death of a small percentage of the population, the trigger should be distributed among many humans around the globe. A consensus mechanism on top of a blockchain to track the signals seems like the most transparent approach. The tokenomics that make this work are a key part of the model: there should be short-term rewards for the people and data centers that participate. For this solution to be safe we would also need quantum resistance on Ethereum, to make sure we are truly cryptographically secure even against a superintelligence.
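As a rough sketch of the distributed version, assume each participant records a periodic heartbeat and the switch fires only when a large fraction of recent heartbeats go silent, so individual dropouts don't trigger it. The window, threshold, and class below are illustrative assumptions, not a worked-out protocol; on-chain, the same tally would live in a smart contract.

```python
import time
from dataclasses import dataclass, field

# Hypothetical parameters; in the proposal these would be set by governance, not hard-coded.
HEARTBEAT_WINDOW = 60 * 60 * 24 * 7   # a participant counts as alive if they signalled in the last 7 days
SILENCE_THRESHOLD = 0.95              # fire only if 95% of participants have gone silent

@dataclass
class DistributedDeadManSwitch:
    # participant id -> timestamp of their last heartbeat (on a blockchain this would be contract storage)
    last_seen: dict[str, float] = field(default_factory=dict)

    def heartbeat(self, participant_id: str) -> None:
        """Record a check-in; on-chain this would be a signed transaction that also pays out a small reward."""
        self.last_seen[participant_id] = time.time()

    def silent_fraction(self) -> float:
        """Fraction of registered participants who have not checked in within the window."""
        if not self.last_seen:
            return 0.0
        now = time.time()
        silent = sum(1 for t in self.last_seen.values() if now - t > HEARTBEAT_WINDOW)
        return silent / len(self.last_seen)

    def triggered(self) -> bool:
        """Near-total silence is treated as evidence of extinction rather than individual dropouts."""
        return self.silent_fraction() >= SILENCE_THRESHOLD
```

The point of keeping this tally on a blockchain rather than in one organization's database is that no single party (including the AI) could quietly fake or suppress the heartbeats, and the reward logic could sit alongside the heartbeat call.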
Even if we pull this off, humanity still has a weak point: the media. The AI could relatively easily manipulate social discourse to steer decisions away from a working solution, and an AGI would try to get rid of the switch in order to perpetuate its attack.
Related links