In May I attended Zuzalu's AI x Crypto mini-conference and got more interested in AI x-risk. I thought about a potential solution; I know it might be naive, but I looked for something like this and didn't find anything related, so I decided to share some notes here to get feedback. It's written as a post directed at lay readers.
Alignment is especially hard: we don't fully understand an AGI's inner workings today, and the complexity and potentially opaque nature of AGI itself may keep them hard to understand.
Among the ideas that surfaced, one that caught my attention is that we should NOT use censorship-resistant tech such as blockchains to train or run AIs, since that can make them difficult to unplug. As an outsider to the field, this was a new concept to me. Unplugging an AGI is one of the measures that can be taken to stop it; this is better known as an off-switch.
To put it simply, there are two types of approaches to this problem. The main one focuses on embedding the alignment into the AI itself, which for this specific case is called a shutdown utility function (https://arbital.com/p/shutdown_utility_function/). But there are also external solutions that try to limit the damage from misalignment. Both should be combined for an optimal solution; here I'm focusing on the external safeguard.
But if an AI goes rogue and manages to exterminate humans, we won't be able to turn it off. This led me to think: what if we use a dead-man switch? A dead-man switch is usually implemented to trigger a pre-defined action when a person dies: typically a beacon that stays inactive as long as it keeps receiving a signal, and fires when the signal stops. For the alignment case, the trigger would unplug all AI servers, along with any servers that could be used to train a new AI. This raises some problems to be solved; let's explore them (a rough code sketch of the basic mechanism follows the questions below):
Why should we bother to unplug the AGI if we are already dead?
If the AI knows about the off-switch, its basic incentives push it to prevent the switch from being triggered.
Who would trigger the dead-man switch?
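Before getting into these questions, here is a minimal sketch of the basic mechanism, assuming a single operator sending periodic heartbeats. The timeout value and the function names (receive_heartbeat, shut_down_ai_servers, etc.) are illustrative assumptions, not part of any existing system.

```python
import time

# Hypothetical timeout: one day without any signal triggers the switch.
HEARTBEAT_TIMEOUT = 60 * 60 * 24

last_heartbeat = time.time()

def receive_heartbeat() -> None:
    """Called whenever the operator checks in; resets the countdown."""
    global last_heartbeat
    last_heartbeat = time.time()

def switch_triggered() -> bool:
    """The beacon stays quiet while signals keep arriving and fires only once they stop."""
    return time.time() - last_heartbeat > HEARTBEAT_TIMEOUT

def shut_down_ai_servers() -> None:
    # Placeholder for the pre-defined action: cutting power/network to AI servers.
    print("Dead-man switch triggered: shutting down AI servers.")

def monitor_loop() -> None:
    while True:
        if switch_triggered():
            shut_down_ai_servers()
            break
        time.sleep(60)  # check once a minute
```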
Since we want to detect actual extinction, not just the death of a small percentage of the population, the trigger should be distributed among many humans around the globe. A consensus mechanism on top of a blockchain to track the signals seems like the most transparent approach. The tokenomics that make this work are a key part of the model: there should be short-term rewards for the people and data centers that participate. For this solution to be safe we would also need quantum resistance on Ethereum, to make sure we are truly cryptographically secure even against a superintelligence.
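As a rough sketch of the distributed version, assume each participant records a periodic heartbeat and the switch fires only when a large fraction of recent heartbeats go silent, so individual dropouts don't trigger it. The window, threshold, and class below are illustrative assumptions, not a worked-out protocol; on-chain, the same tally would live in a smart contract.

```python
import time
from dataclasses import dataclass, field

# Hypothetical parameters; in the proposal these would be set by governance, not hard-coded.
HEARTBEAT_WINDOW = 60 * 60 * 24 * 7   # a participant counts as alive if they signalled in the last 7 days
SILENCE_THRESHOLD = 0.95              # fire only if 95% of participants have gone silent

@dataclass
class DistributedDeadManSwitch:
    # participant id -> timestamp of their last heartbeat (on a blockchain this would be contract storage)
    last_seen: dict[str, float] = field(default_factory=dict)

    def heartbeat(self, participant_id: str) -> None:
        """Record a check-in; on-chain this would be a signed transaction that also pays out a small reward."""
        self.last_seen[participant_id] = time.time()

    def silent_fraction(self) -> float:
        """Fraction of registered participants who have not checked in within the window."""
        if not self.last_seen:
            return 0.0
        now = time.time()
        silent = sum(1 for t in self.last_seen.values() if now - t > HEARTBEAT_WINDOW)
        return silent / len(self.last_seen)

    def triggered(self) -> bool:
        """Near-total silence is treated as evidence of extinction rather than individual dropouts."""
        return self.silent_fraction() >= SILENCE_THRESHOLD
```

The point of keeping this tally on a blockchain rather than in one organization's database is that no single party (including the AI) could quietly fake or suppress the heartbeats, and the reward logic could sit alongside the heartbeat call.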
Even if we pull this off, humanity still has a weak point: the media. The AI could relatively easily manipulate social discourse to steer decisions away from a working solution, and an AGI would try to get rid of the switch in order to perpetuate its attack.
Related links