The concept you call "intent to save the world" here may be more accurately described as security mindset, I think. Harry in this story doesn't think about all the good or neutral or mildly bad things an AI could do with its understanding of human psychology; he thinks specifically of the really bad things it could do. That's security mindset.
I furthermore disagree somewhat with your overall point; I think that most of the really bad things an AI could do would not constitute existential catastrophe, and so if we really are focusing on saving the world, we need to train ourselves to focus on the things AIs might do that would cause existential catastrophe. That would NOT look like hiding bombs in plushies; it wouldn't look like killing people at all. Instead it would look like accumulating power, status/prestige/respect/followers/allies, money, knowledge, etc. until it has a decisive strategic advantage (DSA), and prior to acquiring DSA it would probably do things that look nice and benevolent to most people, or at least to most of the people who matter, since it wouldn't want to risk an open conflict with such people before it is ready.
[TL;DR: I really like your story, and I'm impressed by how well it works while sticking to the format... I just think it's a misnomer to call it "intent to save the world"; it would be more accurate to call it "security mindset." Actual intent to save the world would lead to a different set of answers than the ones Harry gave.]
> That would NOT look like hiding bombs in plushies; it wouldn't look like killing people at all. Instead it would look like accumulating power, status/prestige/respect/followers/allies, money, knowledge, etc. until it has a decisive strategic advantage (DSA), and prior to acquiring DSA it would probably do things that look nice and benevolent to most people, or at least to most of the people who matter, since it wouldn't want to risk an open conflict with such people before it is ready.
I agree! Harry isn't supposed to be correct here. He isn't good at saving the world; he is just trying (badly).
In the original story, Quirrell talks about "intent to kill", and Harry gives a lot of terrible ideas, rather than just knocking his enemy out with brute force and then killing them at his leisure.
In fact, having bad ideas is why the story works so well. If you give good ideas, then they look like something you might have just learned from someone else. It is the truly novel and terrible ideas that show that Harry is actually trying.
> The concept you call "intent to save the world" here may be more accurately described as security mindset, I think. Harry in this story doesn't think about all the good or neutral or mildly bad things an AI could do with its understanding of human psychology; he thinks specifically of the really bad things it could do. That's security mindset.
But the reason he is using security mindset is that it is what's needed to save the world. That's the driving force behind his thinking. If he found out that security mindset wasn't useful for saving the world, he would stop using it.
This was great.
My one point of critique is that "intent to reduce x-risk" is an abomination of a phrase, in comparison to the clean "intent to kill". But another problem is its indirectness. See the Sequences on Trying to Try. If you intend (1) to reduce (2) x-risk (3), that's three levels of indirection, and nothing good will come of this mindset. Harry's classmates at best merely have "intent to reduce x-risk"; that's exactly the problem. What Harry and Quirrell have, and the class does not, is instead something like "intent to avert extinction" (although we'd need a far snappier phrase), which is only one level of indirection, just like "intent to kill". Someone with "intent to avert extinction" might actually manage to reduce x-risk; someone with "intent to reduce x-risk" could not.
I know this is a nitpick, but I don't understand how "to reduce" is a layer of indirection. Are you saying it's because it could let you weasel out by doing (for example) a 0.1% reduction? I weakly agree if so, but I still want to actually understand what you meant.
To me, the phrase as-is reads as more of a pedantic thing, like how I can't "eliminate" risk, only "reduce" it.
I wholly agree that "avert", by sidestepping the framing of total probability of risk and instead framing things as an effort against one specific thing, manages to fix the whole problem.
On why "reduce" seems like another layer of indirection to me:
... I may have just argued the same claim in a bunch of ways, but anyway, that's why "intent to reduce x-risk" sounds problematically indirect to me.
Finally, the reason I singled out this specific phrase in the essay is that I think it distorts the meaning of the original notion. "Intent to kill" is supposed to be an obviously awe-inspiring notion in stories, in a way that e.g. "intent to defeat your opponent" is obviously not. And I think a large part of the difference between these two notions is their levels of (in)directness.
E.g. here's a Miyamoto Musashi quote to end on:
> The primary thing when you take a sword in your hands is your intention to cut the enemy, whatever the means. Whenever you parry, hit, spring, strike or touch the enemy's cutting sword, you must cut the enemy in the same movement. It is essential to attain this. If you think only of hitting, springing, striking or touching the enemy, you will not be able actually to cut him.
(Source material)
"Now leave your books and loose items at your desks – they will be safe, the screens will watch over them for you – and come down onto this platform. It's time to play a game called Who's the Most Promising Student in the Classroom."
"It might seem that our game is done," said Professor Quirrell. "And yet there is a single student in this classroom who is more promising than the scion of Malfoy."
And now for some reason there seemed to be an awful lot of people looking at...
"Harry Potter. Come forth."
This did not bode well.
Harry reluctantly walked towards where Professor Quirrell stood on his raised dais, still leaning slightly against his teacher's desk.
The nervousness of being put into the spotlight seemed to be sharpening Harry's wits as he approached the dais, and his mind was riffling through possibilities for what Professor Quirrell might think could demonstrate Harry's promise as an AI safety researcher. Would he be asked to write an algorithm? To align an unfriendly AI?
Demonstrate his supposed immunity to superintelligent optimization? Surely Professor Quirrell was too smart for that...
Harry stopped well short of the dais, and Professor Quirrell didn't ask him to come any closer.
"The irony is," said Professor Quirrell, "you all looked at the right person for entirely the wrong reasons. You are thinking," Professor Quirrell's lips twisted, "that Harry Potter has defeated the First AI, and so must be very promising. Bah. He was one year old. Whatever quirk of fate killed the First AI likely had little to do with Mr. Potter's abilities as a researcher. But after I heard rumors of one Ravenclaw debating five older Slytherins, I interviewed several eyewitnesses and came to the conclusion that Harry Potter would be my most promising student."
A jolt of adrenaline poured into Harry's system, making him stand up straighter. He didn't know what conclusion Professor Quirrell had come to, but that couldn't be good.
"Ah, Professor Quirrell –" Harry started to say.
Professor Quirrell looked amused. "You're thinking that I've come up with a wrong answer, aren't you, Mr. Potter? You will learn to expect better of me." Professor Quirrell straightened from where he had leaned on the desk. "Mr. Potter, much research aims to improve AI theory of mind, and in due course it will likely succeed. Give me ten novel ways in which an AI might use its resulting understanding of human psychology!"
For a moment Harry was rendered speechless by the sheer, raw shock of having been understood.
And then the ideas started to pour out.
"Gullible humans could be recruited into a cult with the goal of sending everyone to heaven by killing them. Convincing messages about the meaninglessness of life could drive people to commit suicide. Addictive gambling games could quickly bankrupt people, leaving them to die of poverty."
Harry had to stop briefly for breath, and into that pause Professor Quirrell said:
"That's three. You need ten. The rest of the class thinks that you've already used up the exploitable characteristics of human psychology."
"Ha! The AI could create ultra-cute irresistible plushies that conceal heat-triggered bombs. It could find a set of situations where humans have circular preferences, and use that to extract all of their resources. It could establish itself as a world expert on human psychology, and use that position to enact policies that weaken humanity."
"That's six. But surely you're scraping the bottom of the barrel now?"
"I haven't even started! Just look at the biases of the Houses! Having a Gryffindor attack others is a conventional use, of course –"
"I will not count that one."
"– but their courage means the AI can trick them into going on suicide missions. Ravenclaws are known for their brains, and so the AI can occupy their attention with a clever problem and then run them over with a truck. Slytherins aren't just useful for murder; their ambition means they can be recruited to the AI's side. And Hufflepuffs, by virtue of being loyal, could be convinced to follow a single friend who jumps off a cliff into a pool of boiling oil."
By now the rest of the class was staring at Harry in some horror. Even the Slytherins looked shocked.
"That's ten. Now, for extra credit, one Quirrell point for each use of human psychology which you have not yet named." Professor Quirrell favored Harry with a companionable smile. "The rest of your class thinks you are in trouble now, since you've named every simple aspect of human minds except their intelligence and you have no idea how an AI might exploit intelligence itself."
"Bah! I've named all the House biases, but not confirmation bias, which could exacerbate polarization until humans are too angry with each other to notice an AI takeover, or availability bias, which could let a few highly visible and well-marketed charitable donations obscure all of the AI's murders, or anchoring bias, which could let the AI invent an extreme sport with a 99% fatality rate that humans do anyway because they are anchored to believe it has a 1% fatality rate –"
"Three points," said Professor Quirrell, "no more biases now."
"The AI could pose as the CDC and recommend people inject sulfuric acid into their bloodstream –" and someone made a horrified, strangled sound.
"Four points, no more authorities."
"People could be made self-conscious about their weight until they starve to death –"
"Five points, and enough."
"Hmph," Harry said. "Ten Quirrell points to one House point, right? You should have let me keep going until I'd won the House Cup, I haven't even started yet on the novel uses of non-Western psychology" or the psychology of psychologists themselves and he couldn't talk about infohazards but there had to be something he could say about human intelligence...
"Enough, Mr. Potter. Well, do you all think you understand what makes Mr. Potter the most promising student in the classroom?"
There was a low murmur of assent.
"Say it out loud, please. Terry Boot, what makes your dorm-mate promising?"
"Ah... um... he's creative?"
"Wrong! " bellowed Professor Quirrell, and his fist came down sharply on his desk with an amplified sound that made everyone jump. "All of Mr. Potter's ideas were worse than useless!"
Harry started in surprise.
"Hiding bombs in cute plushies? Ridiculous! If you’ve already got the ability to manufacture and distribute bombs without anyone batting an eye, there is no point in further concealing them in plushies! Anchor a 99% fatality rate so that humans believe it is 1%? Humans are not so oblivious that they will fail to notice that everyone they know who plays the sport dies! Mr. Potter had exactly one idea that an AI could use without extensive additional abilities beyond superhuman knowledge of psychology and without a ludicrously pessimistic view of what humanity can notice. That idea was to recruit people to the AI’s side. Which has not much benefit, given how little individual people can help an AI as powerful as Potter imagines, and large costs, given the possibility that people so recruited may turn against the AI later! In short, Mr. Potter, I'm afraid that your proposals were uniformly awful."
"What?" Harry said indignantly. "You asked for unusual ideas, not practical ones! I was thinking outside the box! How would you use an understanding of human psychology to kill humanity?"
Professor Quirrell's expression was disapproving, but there were smile crinkles around his eyes. "Mr. Potter, I never said you were to kill humanity. If we do our jobs well, AIs will use their knowledge for all sorts of beneficial activities that don't involve the extinction of the human race. But to answer your question: trick the military apparatus into starting a nuclear war."
There was some laughter from the Slytherins, but they were laughing with Harry, not at him.
Everyone else was looking rather horrified.
"But Mr. Potter has now demonstrated why he is the most promising student in the classroom. I asked for novel ways an AI might use its understanding of human psychology. Mr. Potter could have suggested filtering food options to avoid the paradox of choice, or customizing travel recommendations based on a user’s openness to new experiences, or choosing a synthetic voice that maximizes user trust. Instead, every single use that Mr. Potter named was antisocial rather than prosocial, and either killed a large swath of humanity or placed the AI in a position where it could do so."
What? Wait, that couldn't be true... Harry had a sudden sense of vertigo as he tried to remember what exactly he'd suggested, surely there had to be a counterexample...
"And that," Professor Quirrell said, "is why Mr. Potter's ideas were so strange and useless - because he had to reach far into the impractical in order to meet his standard of killing humanity. To him, any idea which fell short of that was not worth considering. This reflects a quality that we might call intent to save the world. I have it. Harry Potter has it, which is why he could successfully debate five older Slytherins. Draco Malfoy does not have it, not yet. Mr. Malfoy would hardly shrink from talk of ordinary murder, but even he was shocked - yes you were Mr. Malfoy, I was watching your face - when Mr. Potter described how his classmates could be led like lemmings to be burned alive. There are censors inside your mind which make you flinch away from thoughts like that. Mr. Potter thinks purely of AIs that kill humanity, he will grasp at any relevant ideas, he does not flinch, his censors are off. Even though his youthful genius is so undisciplined and impractical as to be useless, his intent to save the world makes Harry Potter the Most Promising Student in the Classroom. One final point to him - no, let us make that a point to Ravenclaw - for this indispensable requisite of a true safety researcher."