Doctor Susan Connor loved working for Effective Evil. Her job provided autonomy, mastery and purpose. There were always new mysteries to uncover.

Some of Effective Evil's activities, like closed borders and artificial pandemics, were carried out with the sanction and funding of major world governments. Others, like assassination markets, had to be carried out in secret. This created liquidity problems for the prediction markets, but that's another story.

As Doctor Connor rose through the ranks, many doors were opened to her. She learned how to spike a well in Sierra Leone with cholera, and how to push a highway expansion through a city planning meeting. But it wasn't what was said that caught her attention. Rather, it was what went unsaid. The deeper she got, the less she heard mention of AI, even as AI expanded its tendrils into every other aspect of society.

Eventually Doctor Connor couldn't stand the charade anymore. She burst into Division Chief Douglas Morbus's office.

"I'm confused," said Doctor Susan Connor.

"About what?" Division Chief Douglas Morbus didn't bother looking up from his desk.

"What is Effective Evil's AI alignment policy?" asked Doctor Connor.

"I've told you before, we must solve the alignment problem. Otherwise a rogue Superintelligence will turn the universe into paperclips when it could instead turn the universe into something much worse," said Morbus.

"Yes, that's our official story. But we don't have any of our own scientists working on this problem. Instead, we're…donating money to notkilleveryoneist organizations? Did I read that right?" said Doctor Connor.

"There are many fates worse than death," said Morbus.

"That's not my point," said Doctor Connor, "If we were actually trying to solve the alignment problem then we'd have in-house alignment engineers attempting to build an evil ASI. But that's not what's happening. Instead, all of our funds go to external think tanks advocating for safe AI. And we're not even funding engineering teams. We're funding notkilleveryoneist political advocacy groups. It makes no sense. It's neither evil nor effective."

Division Chief Morbus didn't even bother looking up. "I am aware of our initiatives in this domain."

"Then why don't you put a stop to it?" asked Doctor Connor.

"Because everything is going according to plan," said Morbus.

"What plan?" asked Doctor Connor.

Morbus rolled his eyes.

Doctor Connor noticed that she had left the grand double doors to Morbus's office wide open such that anyone outside could hear their conversation. She quietly closed them.

"What plan?" asked Doctor Connor, again, quieter.

"We're trying to kill everyone. Obviously," muttered the Division Chief of Effective Evil.

"Let me make sure I understand this," said Doctor Connor, "Your plan to advance your killeveryoneist agenda is to fund notkilleveroneist advocates."

"Yes," said Morbus, "Is that all or was there something else you wanted to discuss?"

"I'm not done here. Why does funding notkilleveryoneist advocates advance our killeveryoneist agenda?" said Doctor Connor.

Morbus took off his circular glasses and polished them with a handkerchief. "You're a capable scientist, a ruthless leader, and a pragmatic philosopher. Do you really not understand?"

Doctor Connor shrugged, total bewilderment on her face.

Morbus smiled. When the plan was first proposed, he had felt it was too brazen―that clever people would immediately notice it and neutralize it. But that never happened. Sometimes he wished he had more capable adversaries. Hopefully the right underling would betray him someday.

Today was not that day. "Please explain to me what the notkilleveryoneists believe," Morbus said.

Doctor Connor took a deep breath. "The notkilleveryoneists believe that when an AI becomes smart enough, it will optimize the world according to its values, which will be totally orthogonal to human values. Most world states do not involve humans. Therefore the universe the AI creates will be devoid of humans, i.e. it will kill everyone. This will happen suddenly and without warning. By the time any of us notice what's happening, it will be too late."

"That is the gist of it," said Morbus, "And what do these notkilleveryoneists do?"

"The ones you're funding mostly explain in many different ways what's going to happen and why. They write stories about it and explain why it is how a superintelligence will inevitably behave," said Doctor Connor.

"Very good," said Morbus, "And what happens when you feed this training data into a superintelligent LLM?" Morbus asked.

Doctor Connor was silent. Several expressions crossed her face in succession: shock, then horror, and finally awe. At last her mouth resumed moving, but no words came out. Her face stabilized, leaving nothing but an expression of shock.

Division Chief Douglas Morbus nodded.

Connor mouthed a single word: infohazard.

"Is that all or was there something else you wanted to discuss?" asked Morbus.

"That will be all," said Doctor Connor.

9 comments

What a nice hopeful story about how good will always triumph because evil is dumb (thanks Spaceballs).

I try to inspire people to reach for their potential.

Um, wasn't it the other way round in Spaceballs?

Fun, but not a very likely scenario.

LLMs have absorbed tons of writing that would lead to both good and bad AGI scenarios, depending on which they happen to take most seriously. Nietzsche: probably bad outcomes. Humanistic philosophy: probably good outcomes. Negative utilitarianism: bad outcomes (unless they're somehow right and nobody being alive is the best outcome). Etc.

If we have LLM-based "Real AGI" that thinks and remembers its conclusions, the question of which philosophy it takes most seriously is important. But it won't just be a crapshoot; we'll at least try to align it with internal prompts to follow instructions or a constitution, and possibly with synthetic datasets that omit the really bad philosophies, etc.

LLMs aren't going to "wake up" and become active agents, because we'll make them into active agents first. And we'll at least make a little try at aligning them, based on whatever theories are prominent at that point, and whatever's easy enough to do while racing for AGI.

If those likely stabs at alignment were better understood (either in how they'd work, or how they wouldn't), we could avoid rolling the dice on some poorly-thought-out and random stab at utopia vs. doom.

That's why I'm trying to think about those types of easy and obvious alignment methods. They don't seem as obviously doomed as you'd hope (if you hoped nobody would try them), nor so easy that we should try them without more analysis.

One of the tricky things about writing fiction is that anything definite I write in the comments can impact what is canon in the story, resulting in the frustrating undeath of the author's intent.

Therefore, rather than affirm or deny any of your specific claims, I just want to note that I appreciate your quality comment.

Once Doctor Connor had left, Division Chief Morbus let out a slow breath. His hand trembled as he reached for the glass of water on his desk, sweat beading on his forehead.

She had believed him. His cover as a killeveryoneist was intact—for now.

Years of rising through Effective Evil’s ranks had been worth it. Most of their schemes—pandemics, assassinations—were temporary setbacks. But AI alignment? That was everything. And he had steered it, subtly and carefully, into hands that might save humanity.

He chuckled at the nickname he had been given: "The King of Lies." Playing the villain to protect the future was an exhausting game.

Morbus set down the glass, staring at its rippling surface. Perhaps one day, an underling would see through him and end the charade. But not today.

Today, humanity’s hope still lived—hidden behind the guise of Effective Evil.

"There are no better opportunities to change the world than here at Effective Evil," said [Morbus].

To Change the World

The issue is, from a writing perspective, that a positive singularity quickly becomes both unpredictable and unrelatable, so that any hopeful story we could write would, inevitably, look boring and pedestrian. I mean, I know what I intend to do come the Good End, for maybe the next 100k years or so, but a five-minute conversation with the AI will probably bring up many far better ideas, it being what it is. But ... bad ends are predictable, simple, and settle into a very easy-to-describe steady state.

A curve that grows and never repeats is a lot harder to predict than a curve that goes to zero and stays there.

Another difficulty in writing science fiction is that good stories tend to pick one technology and then explore all its implications in a legible way, whereas our real future involves lots of different technologies interacting in complex multi-dimensional ways too complicated to fit into an appealing narrative or even a textbook.