Basically, the AI does the following:
1. Create a list of possible futures that it could cause.
2. For each person alive at the time of the AI's activation, and for each candidate future:
   1. Simulate convincing that person that the future is going to happen.
   2. If the person would try to help the AI, add 1 to the utility of that future; if the person would try to stop the AI, subtract 1 from the utility of that future.
3. Cause the future with the highest utility (sketched in code below).
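To make the selection rule concrete, here's a minimal Python sketch of that loop. It's only an illustration of the scoring and argmax step: `simulate_persuaded_reaction`, `possible_futures`, and `people_at_activation` are hypothetical placeholders for machinery the proposal leaves unspecified.

```python
# Minimal sketch of the proposed loop. `simulate_persuaded_reaction(person, future)`
# is a hypothetical helper that returns "help" or "oppose" after simulating
# convincing the person that `future` is what will happen.

def score_future(future, people):
    """+1 for each person who would help the AI, -1 for each who would oppose it."""
    utility = 0
    for person in people:
        reaction = simulate_persuaded_reaction(person, future)  # hypothetical
        if reaction == "help":
            utility += 1
        elif reaction == "oppose":
            utility -= 1
    return utility

def choose_future(possible_futures, people_at_activation):
    # Cause (here: return) the future with the highest approval-based utility.
    return max(possible_futures, key=lambda f: score_future(f, people_at_activation))
```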
One thing I'd be concerned about is that there are a lot of possible futures that sound really appealing, that a normal human would sign off on, but that are actually terrible (similar concept: siren worlds).
For example, in a world of Christians the AI would score highly on a future where they get to eternally rest and venerate God, which would get really boring after about five minutes. In a world of Rationalists the AI would score highly on a future where they get to live on a volcano island with catgirls, which would also get really boring after about five minutes.
There are potentially lots of futures like this (including ones that might work for a wider range of humans). Because the metric (inferred approval once the future has been explained) is different from the goal (whether the future is actually good), and the optimisation pressure increases with the number of futures considered, I would expect the metric to be Goodharted.
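To illustrate that worry with a toy model of my own (not part of the proposal): treat each future's approval score as its true value plus independent noise, and watch how much the approval-maximising choice overestimates its own value as the pool of candidate futures grows.

```python
# Toy optimiser's-curse illustration (my own assumption, not the proposal):
# each future has a true value and an "approval" proxy = true value + noise.
# Maximising approval increasingly selects for large noise rather than large
# true value, and the overestimate grows with the number of futures considered.

import random

def mean_overestimate(n_futures, noise=1.0, trials=500):
    total = 0.0
    for _ in range(trials):
        futures = [(random.gauss(0, 1), random.gauss(0, noise))
                   for _ in range(n_futures)]
        # Pick the future whose proxy (true value + error) is highest.
        true_value, error = max(futures, key=lambda f: f[0] + f[1])
        total += error  # how much the approval score overstated this future's value
    return total / trials

for n in (10, 100, 10_000):
    print(n, round(mean_overestimate(n), 2))
```

The printed overestimate grows with `n`, which is the sense in which more candidate futures means more Goodharting of the approval metric.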
Some possible questions this raises: