AlphaGo versus Lee Sedol

gjm

As far as I can tell, Paul's current proposal might still suffer from blackmail, like his earlier proposal, which I commented on.

What problem do you have in mind here?

I thought that the previous problem was mostly psychological, i.e. that if humans were rational agents then this AI would be roughly as vulnerable to blackmail as its designers. So I thought the issue was the psychological strangeness (and great length) of the weird hypothetical.

Here we have no such hypothetical, and the system's behavior only depends on the predicted behavior of humans in the real world. That seems to address the narrow version of your concern.

I can see two analogous problems:

1. The learning system may believe itself to be in a simulation, as discussed here. But that seems to cash out to a straightforward technical problem: we want to ensure that as long as there is some important scenario where the AI behaves strangely, we provide feedback on one such scenario. And this technical problem (1) looks like it should be soluble in principle, (2) looks pretty tractable right now, and (3) is needed to resolve a whole host of other problems (a bunch of exotic failures, but also a bunch of more prosaic difficulties).
2. The behavior of the system depends on what a human would do across many cases. We can sample from these cases at random, but maybe a small fraction of them are sufficiently surprising that they would lead a human to believe they are in a simulation. We would like to sample these situations with disproportionate probability as well, so that the human won't panic when they find themselves in one, but I haven't talked in any detail about how to do that and it's not obvious whether it is possible. (I do think it's possible.)

Did you have in mind 1, 2, or something else?
