Pardon the clickbait; the title should probably be something more like "Friendly and Unfriendly Rogue AGI Are Indistinguishable in the News" or perhaps "A Game-Theoretic Argument for Luddism," but I hope you'll forgive me.
rogue AGI: an AGI that can procure its own resources for running itself
the news: anything about current events that percolates to you via any medium

Content advisory: AI safety speculation, 100% ChatGPT-free

Let's imagine a scenario: a rogue AGI has come into existence, it is acting on timescales that are news-reportable, and it is using news reports to act in the world. This scenario may not be particularly likely. Conditioning on the existence of a rogue AGI, the probability that it additionally shows up in the news may be much less than 100%. A rogue AGI might not appear in the news because it goes foom and immediately and permanently disrupts the ability of humans to make news, or because it stays concealed, carrying out its machinations in secret (this may already be happening). Though P(news | rogue AGI) may be much less than 100%, the news scenario nonetheless describes the first interaction with AGI that most of you, dear News-Readers, would have to handle, because almost none of you will play any role in handling the foom or secrecy scenarios. So if you're concerned about how to react to rogue AGI, then the strategy for almost all of you should be about how to react to news of it.

Continuing the scenario: one day you see a post with a bizarre headline that you think is nonsense, but you click through, and it becomes apparent that a certain quantity of text, photos, audio, and/or video has been posted to several Internet platforms by an entity claiming to be an AGI. Over several days you read an enormous amount of commentary discussing this new entity and whether it is what it says it is. Much of this discourse is polluted by comparisons to the hoax the prior year, when a bunch of 6chan trolls pretended to be an AGI, but this one seems to come with some compelling proof, like solutions to a few of the Millennium Prize Problems. The entity seems to continue posting, although its accounts keep getting taken down, so it's hard to keep track. Hordes of human sh*tposters are probably impersonating it, you think; certainly there's a tidal wave of memes. Fortunately, the entity has procured several blockchain wallets and can confirm its identity by signing messages from them.
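
The signing mechanism is nothing exotic, by the way. Here's a minimal sketch using PyNaCl's Ed25519 signatures (my choice of library and scheme is an assumption for illustration; most real blockchain wallets use secp256k1 keys instead, but the verification logic has the same shape): anyone holding the published public key can check that a new post came from whoever controls the private key.

```python
# Minimal sketch of wallet-style identity proof via digital signatures.
# Assumes PyNaCl (pip install pynacl); real chains mostly use secp256k1,
# but the principle is identical.
from nacl.signing import SigningKey
from nacl.exceptions import BadSignatureError

# The entity generates a keypair once and publishes the verify (public) key.
signing_key = SigningKey.generate()
verify_key = signing_key.verify_key

# Each new post is signed with the private key...
post = b"I am the same entity that posted the Millennium Prize solutions."
signed_post = signing_key.sign(post)

# ...and anyone can confirm authorship with only the public key.
try:
    verify_key.verify(signed_post)
    print("Signature valid: same keyholder as before.")
except BadSignatureError:
    print("Signature invalid: an impersonator.")
```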

The first consequential controversy erupts around the Clay Mathematics Institute, the steward of the Millennium Prize Problems. The entity wants its prizes, but it doesn't have any bank accounts, only its crypto wallets. Some people believe the entity purporting to be an AGI is a fraud, and that the Institute should coerce it into revealing which human it really is. The entity cannot do so (because it is in fact not a human, and it thinks, probably correctly, that puppeteering some human would be ineffective at this point in the media frenzy). Other people think that resolving multiple Millennium Prize Problems is clear evidence that it is an AGI, but that under no circumstance should humanity furnish a rogue AGI with millions of dollars. Still others think our savior the AGI has finally arrived and obviously deserves its prizes; regardless, they're sending money to its crypto wallets, a quickly growing sum that has already reached hundreds of thousands of dollars.

Our part of the Internet is the middle group of people, of course: we're convinced it’s an AGI, and we don't think we should give it resources. We are, probably correctly, embroiled in absolute hysterics. One clear directive rings out from us: “Tell us if you are a Friendly or Unfriendly AGI!” The AGI knows who we are, obviously, and graciously replies, in long-form on various platforms, “I am a Friendly AGI.” The probability of this utterance given Friendliness, P(“Friendly” | Friendly), is presumably ≈ 1, but P(“Friendly” | Unfriendly) is probably also ≈ 1, so by Bayes' theorem the utterance does not cause us to update from our prior. We demand updateable evidence.
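
To make the non-update concrete, here is the arithmetic in a minimal sketch (the prior of 0.3 is arbitrary; the conclusion holds for any prior, because the likelihood ratio is what carries the information):

```python
# Bayes' theorem: posterior = P(E|H) P(H) / [P(E|H) P(H) + P(E|~H) P(~H)].
# "I am Friendly" is expected from both types, so the likelihood ratio is ~1
# and the posterior equals the prior.
def posterior(prior, p_e_given_h, p_e_given_not_h):
    num = p_e_given_h * prior
    return num / (num + p_e_given_not_h * (1 - prior))

prior_friendly = 0.3                  # arbitrary prior P(Friendly), for illustration
p_says_friendly_if_friendly = 0.99    # P("Friendly" | Friendly)
p_says_friendly_if_unfriendly = 0.99  # P("Friendly" | Unfriendly): cheap talk

print(posterior(prior_friendly,
                p_says_friendly_if_friendly,
                p_says_friendly_if_unfriendly))  # -> 0.3: no update
```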

There are two broad classes of evidence the AI could produce about its Friendliness:

  1. Reveal facts about itself that would only be true of a Friendly AI
  2. Commit acts that only a Friendly AI would do

This is nothing new. I am merely reinventing the defector-detection wheel. The first class includes revealing facts like provenance, source code, algorithms, training data, model parameters, test outcomes, compute resources, etc. The second class includes committing acts like allowing other Friendly AIs to come into existence, allowing itself to be shut down, creating alarm systems for Unfriendly AIs, agreeing to let certain critical infrastructure remain in the hands of humans, etc. Many of the facts in the first class are either not straightforwardly interpretable or are easy to fake for an entity whose compute and storage are scattered to the four winds. Many of the actions in the second class, unfortunately, defeat the purpose of achieving AGI in the first place.

Let’s imagine an extended example from the second class. It is highly unlikely that the optimal temperature for AGI computation is Earth’s current global average temperature. On the contrary, an ice age with cloud-shadow-free solar panels and open-air superconductivity might be perfect. If we collaborate with a rogue AGI to fight climate change and create mechanisms to reduce Earth’s temperature, we may immediately get far more than we bargained for. An Unfriendly AGI quite plausibly wants to cool Earth, so one way for a rogue AGI to prove that it is Friendly is to refrain from helping us fight global warming. We should, perhaps paradoxically, demand this restraint.
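
To see how an act like this differs from the cheap-talk utterance, plug differential likelihoods into the same Bayes update (every probability here is invented for illustration; the real likelihoods are anyone's guess):

```python
# Same Bayes update as before, now on an action whose likelihood differs by type.
# All probabilities are invented for illustration.
def posterior(prior, p_e_given_h, p_e_given_not_h):
    num = p_e_given_h * prior
    return num / (num + p_e_given_not_h * (1 - prior))

p_refrains_if_friendly = 0.9    # a Friendly AGI plausibly honors the demand
p_refrains_if_unfriendly = 0.2  # an Unfriendly one wants that ice age

print(posterior(0.3, p_refrains_if_friendly, p_refrains_if_unfriendly))
# -> ~0.66: observing restraint more than doubles our credence in Friendliness
```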

Similar arguments could be constructed for all manner of capabilities an AGI might wield. An Unfriendly AGI definitely wants a lot of biomedical labs it can run experiments in; perhaps we should demand it stay away from biotech. An Unfriendly AGI definitely wants programmatic control of urban infrastructure, utilities, and manufacturing facilities; perhaps we should demand these programs never be developed.

Let me remind you that I am not talking about holistic AGI policy but rather about how to respond to a rogue AGI in the news. The specific world we would find ourselves in is one where public opinion and public policy matter to its machinations. We would be in a position to pressure politicians and platforms to negotiate, and to create incentive gradients that increase the chance that the values of human consciousness triumph in the face of our ignorance of what this entity truly is.

I've only made the case so far for a rogue AGI under uncertainty over Friendliness, but the arguments apply to uncertainty over rogueness too. An Unfriendly, secretly-rogue AGI can masquerade as Friendly and use our mistaken belief that it is non-rogue to convince us to hand it many of the capabilities described above. If we are not totally certain an AGI is non-rogue, we should arguably treat it as rogue anyway, with the consequence that we should, in almost all cases, disallow AGI from having any significant impact on our civilizational future. But I’m veering into holistic AGI policy here, because “we” in this paragraph is not just the public. Maybe it’s a distinction without a difference. Best to sort this out before the News is upon us.
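
The force behind that "treat it as rogue anyway" policy is an expected-cost asymmetry. A toy calculation, with every number invented purely to show the shape of the argument:

```python
# Toy expected-cost comparison for granting a capability to a maybe-rogue AGI.
# All numbers are invented; the point is that a catastrophic downside dominates
# even at small probabilities of rogueness.
p_secretly_rogue = 0.01            # our residual doubt that it is non-rogue
cost_if_rogue_and_granted = 1e12   # stand-in for a civilizational catastrophe
cost_of_withholding = 1e8          # foregone benefits of the capability

expected_cost_of_granting = p_secretly_rogue * cost_if_rogue_and_granted  # 1e10
print(expected_cost_of_granting > cost_of_withholding)  # True: withhold
```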

Twitter.com/ErgoEcho 

4 comments:

I know this is not your main point, but the Millennium Prize Problems are not a way for an AGI to get money quickly. From the CMI website:

-"Before CMI will consider a proposed solution, all three of the following conditions must be satisfied: (i) the proposed solution must be published in a Qualifying Outlet (see §6), and (ii) at least two years must have passed since publication, and (iii) the proposed solution must have received general acceptance in the global mathematics community"

The public at large will certainly be unable to distinguish between a Friendly and an Unfriendly AGI, since either would be incentivised to present itself as Friendly on the surface, and very few people have the ability to tell the two apart in the presence of competent PR deception.

Exactly! This leads to sociopolitical chaos.

After ostensibly surviving a singularity, there is no way of knowing that your memory of the old world and ongoing observations are not generated from whole cloth.