Yeah, it's hard to extrapolate how an algorithm "should" handle inputs it wasn't designed for, especially if you're the algorithm :-/ As far as I understand Wei's position, he hopes that we can formalize human philosophical abilities and delegate to that, rather than extend preferences directly, which would be vulnerable. See this old thread, especially steven0461's reply, and this post by Wei which fleshes out the idea.
But what happens when the low-intensity conversation and the brainwashing are the same thing?
That's definitely bad in cases where people explicitly care about goal preservation. But only self-proclaimed consequentialists do.
The other cases are fuzzier. Memeplexes like rationality, EA/utilitarianism, religious fundamentalism, political activism, or Ayn Rand-type stuff are constantly 'radicalizing' people, turning them from something sort-of-agenty-but-not-really into self-proclaimed consequentialist agents. Whether that is in line with people's 'real' desires is to a large extent up for interpretation, though there are extreme cases where the answer seems clearly 'no.' Insofar as recruiting strategies are concerned, we can at least condemn propaganda and brainwashing because they are negative-sum (but the lines might again be blurry).
It is interesting that people don't turn into self-proclaimed consequentialists on their own without the influence of 'aggressive' memes. This just goes to show that humans aren't agents by nature, and that an endeavor of "extrapolating your true consequentialist preferences" is at least partially about adding stuff that wasn't previously there rather than discovering something that was hidden. That might be fine, but we should be careful not to unquestioningly assume that this automatically qualifies as "doing people a favor." This, too, is up for interpretation to at least some extent. The argument for it being a favor is presented nicely here. The counterargument is that satisficers often seem pretty happy, and who are we to maneuver them into a situation where they cannot escape their own goals and always live for the future instead of the now? (Technically people can just choose whichever consequentialist goal is best fulfilled by satisficing, but I could imagine that many preference extrapolation processes are set up in a way that makes this an unlikely outcome. For me at least, learning more about philosophy automatically closed some doors.)
"extrapolating your true consequentialist preferences" is at least partially about adding stuff that wasn't previously there rather than discovering something that was hidden.
Yes yes yes, this is a point I make often. Finding true preferences is not just a learning process, and cannot be reduced to a learning process.
As for why it needs to be done... well, for all the designs, like Inverse Reinforcement Learning, that involve AIs learning human preferences, it has to be done adequately if those designs are to work at all.
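To make that concrete, here is a minimal sketch, in Python, of the kind of setup IRL-style designs assume. It is my own toy construction, not anything from this thread or from the IRL literature: a Boltzmann-rational choice model in which a hidden weight vector `true_w` stands in for the human's "true preferences". The learner recovers something only because the toy stipulates that coherent weights exist to be recovered, which is exactly the assumption under discussion above.

```python
# Toy preference-inference sketch (IRL-flavoured), for illustration only.
# All numbers, features, and the noise model are made-up assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Each option is a feature vector; the "true" human reward is assumed linear in features.
features = rng.normal(size=(20, 3))       # 20 options, 3 features (toy sizes)
true_w = np.array([1.0, -0.5, 0.2])       # hidden preference weights: the big assumption

def boltzmann_choice(w, feats):
    """Sample a choice assuming the human is noisily rational (softmax over rewards)."""
    logits = feats @ w
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(len(feats), p=p)

# Observed demonstrations, generated under the assumed noise model.
demos = [boltzmann_choice(true_w, features) for _ in range(500)]

# Fit weights by gradient ascent on the log-likelihood of the observed choices.
w_hat = np.zeros(3)
for _ in range(2000):
    logits = features @ w_hat
    p = np.exp(logits - logits.max())
    p /= p.sum()
    grad = features[demos].mean(axis=0) - p @ features  # mean demo features minus expected features
    w_hat += 0.1 * grad

print("recovered direction (approx.):", w_hat / np.linalg.norm(w_hat))
print("assumed true direction:       ", true_w / np.linalg.norm(true_w))
```

If the demonstrations were generated by something not well described by any fixed `true_w`, the fitting procedure would still spit out weights, but they would be the learner's invention rather than a discovery.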
It is interesting that people don't turn into self-proclaimed consequentialists on their own without the influence of 'aggressive' memes.
Why do you think so? It's not self-evident to me. Consequentialism is... strongly endorsed by evolution. If dying is easy (as it was for most of human history), not being a consequentialist is dangerous.
I agree with this, except that I would go further and add that if we had a superintelligent AI correctly calculate our "extrapolated preferences," they would precisely include not being made into agents.
Wait, this isn't a change in values, just a different anticipated effect. There is a consistent strong preference for one's agency not to be reduced, with some exceptions for cases that are expected to align with our long-term values.
Nothing wrong with a conversation with someone who's not able to memetically damage me. Lots of people voluntarily participate in group events that have brainwashing-like effects. Brainwashing someone against their will is generally seen as bad (except when you're arrogant enough to think their previous values are wrong).
None of that is values changing with circumstance, that's just different situations that interact with the core values differently.
an AI may well be capable of doing so
I don't think so, at least not for reasonable values of "pleasant" and "low-intensity".
fallen in love within less than ten minutes
Are we talking about a ten-minute conversation with a hypersexual catgirl, or are we talking about a normal conversation?
relying on revealed and stated preferences or meta-preferences won’t be enough
Won't be enough for what?
I don't think so, at least not for reasonable values of "pleasant" and "low-intensity".
We really don't know at the moment. I'd be inclined to agree with you on this point, but I'd put a very high probability on there being interactions with humans today that we'd classify as "perfectly fine" but that would allow an AI to manipulate people much more effectively than any human could.
hypersexual catgirl
Sexual arousal is not the same thing as falling in love. The two are different emotions.
They are, let's say, correlated.
But if you want the point spelled out, it is that falling in love operates mostly below the conscious level and relies on deeper and older structures in the brain, ones that react to smell and movement and textures and what's generally called "gut feelings". If the AI is merely conversing with you using words and nothing else, it does not have direct access to these deep structures. But if there happens to be an {andro|gyno}id which can produce appropriate smells and movements and textures, etc., things change.
But if there happens to be an {andro|gyno}id which can produce appropriate smells and movements and textures, etc., things change.
Yep. We really don't know whether the "falling in love" is highly stochastic in these situations, or whether it can be reliably provoked.
Crossposted at the Intelligent Agents Forum.
Quick, is there anything wrong with a ten-minute pleasant low-intensity conversation with someone we happen to disagree with?
Our moral intuitions say no, as do our legal system and most philosophical or political ideals since the Enlightenment.
Quick, is there anything wrong with brainwashing people into perfectly obedient sheep, willing and eager to take any orders and betray all their previous ideals?
There’s a bit more disagreement there, but that is generally seen as a bad thing.
But what happens when the low-intensity conversation and the brainwashing are the same thing? At the moment, no human can overwhelm most other humans in the course of ten minutes talking, and rewrite their goals into anything else. But an AI may well be capable of doing so - people have certainly fallen in love within less than ten minutes, and we don’t know how “hard” this is to pull off, in some absolute sense.
This is a warning that relying on revealed and stated preferences or meta-preferences won’t be enough. Our revealed and (most) stated preferences are that the ten-minute conversation is probably ok. But disentangling how much of that “ok” relies on our understanding of the consequences will be a challenge.
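As a crude illustration of that warning (again a toy of my own, with made-up names like `Interaction` and `revealed_preference_score`, not something proposed in the post): a scorer that only sees the transcript and the human's stated approval gives identical verdicts to the harmless conversation and the value-rewriting one whenever their observable traces are identical.

```python
# Toy illustration: a learner scoring interactions purely from revealed behaviour
# cannot separate two interactions with identical observable data, even if one
# of them rewrites the human's goals.
from dataclasses import dataclass

@dataclass
class Interaction:
    transcript: str          # what an observer can record
    stated_approval: bool    # what the human says afterwards
    goals_rewritten: bool    # hidden ground truth, not visible to the learner

def revealed_preference_score(x: Interaction) -> float:
    """Score using only observable data: the transcript and stated approval."""
    pleasantness = 1.0 if "pleasant" in x.transcript else 0.0
    return pleasantness + (1.0 if x.stated_approval else -1.0)

benign = Interaction("a pleasant ten-minute chat", stated_approval=True, goals_rewritten=False)
brainwash = Interaction("a pleasant ten-minute chat", stated_approval=True, goals_rewritten=True)

# Same observables, same score: the hidden difference never enters the evaluation.
print(revealed_preference_score(benign))      # 2.0
print(revealed_preference_score(brainwash))   # 2.0
```

The hidden `goals_rewritten` flag is exactly the thing that revealed and stated preferences never get to condition on.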