
shminux comments on xkcd on the AI box experiment - Less Wrong Discussion

15 Post author: FiftyTwo 21 November 2014 08:26AM



Comment author: shminux 21 November 2014 09:22:38PM 3 points

A newbie question.

From one of Eliezer's replies:

As I presently understand the situation, there is literally nobody on Earth, including me, who has the knowledge needed to set themselves up to be blackmailed if they were deliberately trying to make that happen. Any potentially blackmailing AI would much prefer to have you believe that it is blackmailing you, without actually expending resources on following through with the blackmail, insofar as they think they can exert any control on you at all via an exotic decision theory. Just like in the oneshot Prisoner's Dilemma the "ideal" outcome is for the other player to believe you are modeling them and will cooperate if and only if they cooperate, and so they cooperate, but then actually you just defect anyway. For the other player to be confident this will not happen in the Prisoner's Dilemma, for them to expect you not to sneakily defect anyway, they must have some very strong knowledge about you.

Would this be a fair summary of why the Basilisk does not work: "We don't know of a way to detect a bluff by a smarter agent; therefore the agent would prefer bluffing (easy) over true blackmail (hard); knowing this, we would always call the bluff, and therefore the agent would not even try"?
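(As an aside, the one-shot Prisoner's Dilemma logic Eliezer appeals to can be sketched numerically. The payoff values below are the standard illustrative ones, not taken from the quote; the point is just that, absent strong knowledge of the other player, defection dominates.)

```python
# One-shot Prisoner's Dilemma, row player's payoffs (standard illustrative values).
# Defecting scores strictly higher no matter what the opponent does, which is why
# the other player needs very strong knowledge about you to trust a promise to cooperate.
PAYOFF = {
    ("cooperate", "cooperate"): 3,
    ("cooperate", "defect"): 0,
    ("defect", "cooperate"): 5,
    ("defect", "defect"): 1,
}

def best_response(opponent_move):
    """Return the move maximizing our payoff against a fixed opponent move."""
    return max(["cooperate", "defect"], key=lambda m: PAYOFF[(m, opponent_move)])

# Defection is the best response to either opponent move:
assert best_response("cooperate") == "defect"
assert best_response("defect") == "defect"
```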

Further on:

I have written the above with some reluctance, because even if I don't yet see a way to repair this obstacle myself, somebody else might see how to repair it now that I've said what it is.

Wouldn't a trivial "way to repair this obstacle" be for the agent to appear stupid enough to be credible? Or has this already been taken into account in the original quote?

Comment author: Vaniver 21 November 2014 10:33:04PM 0 points

Wouldn't a trivial "way to repair this obstacle" be for the agent to appear stupid enough to be credible?

What do you mean by 'appear' here? I know how to observe a real agent and think "hmm, this person will punish me without reflectively considering whether or not punishing me advances their interests," but I don't know how to get that impression about a hypothetical agent.

Comment author: shminux 21 November 2014 10:39:30PM 0 points

I don't understand your distinction between real and hypothetical here. Your first sentence was about a hypothetical "real" agent, right? What is the hypothetical "hypothetical" agent you describe in the second part?

Comment author: Vaniver 22 November 2014 01:02:00AM 0 points

I don't understand your distinction between real and hypothetical here.

Basically, my understanding of acausal trades is "ancestor does X because of expectation that it will make descendant do Y, descendant realizes the situation and decides to do Y because otherwise they wouldn't have been made, even though there's no direct causal effect."

If you exist simultaneously with another agent (the 'real agent' from the grandparent), you can sense how they behave, and they can trick you by manipulating what you sense. (The person might reflectively consider whether or not to punish you and decide that the causal link to their reputation is justification enough, even though there's no causal link to the actions you took, while trying to seem unthinking so that you will expect them to always follow through.)

If you're considering hypothetical descendants (the 'hypothetical agent' from the grandparent), though, it's not clear to me how to reason about their appearance to you now, and in particular any attempts they make to 'appear' stupid. But now that I think about it more, I think I was putting too much intentionality into 'appear': hypothetical agent A can't decide how I reason about it, but I can reason about it incorrectly or incompletely, and thus it appears to be something it isn't.

Comment author: shminux 22 November 2014 02:05:47AM 0 points

As far as I understand Eliezer's point, the "acausal" part is irrelevant; the same issue of trusting that another agent really means what it says and will not change its mind later comes up anyway. I could easily be wrong, though.