Review

I present here a helpful analogy for understanding the corrigibility problem and the challenge raised by MIRI in their proposal. This analogy greatly simplifies some of the challenges of corrigibility, but it keeps the main problem found in the proposal, which I call Big Gambles.


You are playing Mario Kart with two friends, Alice and Bob, in a series of matches to see who's the best player on average, as measured by each player's average finishing position (1st, 2nd, or 3rd).

Alice has a poor internet connection and so gets disconnected sometimes. To account for that in the competition, a rule was added: she scores 2nd place whenever she is disconnected, so that she is not penalized too much.

Bob takes the game a bit too seriously, and he absolutely wants to win. He will do anything to score first, like disconnecting on purpose when in 3rd, since that grants him more points. You quickly realize this and want to change the rule.

But what to do? If you give 3rd place in case of disconnection, you punish Alice unfairly for her internet connection; but if the disconnection score is any better than that, Bob has an incentive to cheat, and you don't want to have an argument with him. You also cannot make the rule for Alice only, since you and Bob sometimes get disconnected too, even if less often.

 

Disconnection indifference

Between classes, you talk about this problem with Michael, the math kid, and here is what he tells you:

  • “You should have a personalized score such that no one should be interested in faking the disconnection, but also such that if it happens it won’t penalize you. Each player should be indifferent to disconnecting.”
  • “Well, this sounds like a good idea, but how do we do that?”
  • “It's simple: just give everyone their expected rank over non-disconnected games.”
  • “Expected what?”
  • “Just play some games without disconnections, and record each player's position at each lap and at the end of the game. Now, for a new game, if Alice disconnects at lap 3 while in position 1, look up the average final place of players in that situation in the database of non-disconnected games, and add that to her score.”
  • “Hmm, yes right, so if Bob is last and disconnects, he won’t win much. And if he’s first and disconnects accidentally, then he’ll only lose a few places!”
  • “Exactly. But note that this method is not perfect: the player in last position still has an incentive to quit, since their expected rank cannot be worse than 3rd, which is already the minimal score. Conversely, the player in first place can lose points by getting disconnected.”
  • “Thank you for the advice, Michael! I think such minor incentives won't be a problem.”
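Michael's scheme can be sketched in a few lines of code. This is a minimal illustration, not part of the original proposal: the database entries and the (lap, position) values below are made up.

```python
import statistics

# Hypothetical database of fully played (non-disconnected) games.
# Each record: (lap, position_at_that_lap, final_rank) for one player.
history = [
    (3, 1, 1), (3, 1, 2), (3, 1, 1),  # leading at lap 3 usually ends 1st
    (3, 3, 3), (3, 3, 2), (3, 3, 3),  # trailing at lap 3 usually ends 3rd
]

def expected_rank(lap, position, database):
    """Average final rank of players who were at `position` on `lap`,
    over games that finished without any disconnection."""
    ranks = [final for (l, p, final) in database if l == lap and p == position]
    return statistics.mean(ranks)

# Alice disconnects at lap 3 while in 1st place: she is credited with
# roughly the rank she would have finished with.
print(expected_rank(3, 1, history))  # prints 1.3333333333333333
```

Note how the lookup is conditioned on games that finished normally — exactly the property Bob will later exploit.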

You then meet your two friends to explain the new plan to them; they are both really excited. You start recording the games' data to build your new tool, and soon it works pretty well: no one disconnects on purpose anymore. Well, until Bob starts disconnecting frequently again. At first, you think his internet connection is deteriorating, since Michael's solution should work, but then you start inquiring.

Looking at the data, you find that Bob often disconnects when he's in 3rd place. Michael told you that he would still have an incentive to cheat, but that he wouldn't win many points, right? But in fact, he does. His average score when 3rd on the last lap is 1st! How is this possible? He shouldn't be so lucky. You once again decide to talk to Michael.
 

A loophole in disconnection indifference

The next day, you go to meet Michael again and tell him about the situation. Since he is unable to help, you decide to talk directly to Bob, as he must have understood what was wrong with the plan.

  • “Okay, you got me, I’ve disconnected myself on purpose to win more points. I admit this was not fair.”
  • “It's okay, Bob, just stop doing it now. By the way, what was wrong with the way we distributed scores? From what Michael told me, you shouldn't have been able to win so many points by disconnecting.”
  • “Well, Michael’s idea worked just fine. Until I realized something. The score was computed only from the games I finished, not the other ones. So all I had to do was finish only the games where I came back from last to first!”
  • “Okay I see, that’s sneaky.”
  • “In practice, I was manipulating luck: if my luck was bad on the last lap, I disconnected. That way, the only games taken into account were the ones where I had good luck. So Michael’s algorithm thought that my average game was a lucky game, and rewarded me accordingly!”
  • “You should go see him and tell him this caveat. Who knows, maybe you will solve a great problem together.”
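Bob's exploit is a selection effect, and a tiny simulation makes it visible. The numbers here are invented for illustration: Bob is 3rd on the last lap, and a lucky comeback carries him to 1st about 10% of the time.

```python
import random
import statistics

random.seed(0)

finished_ranks = []  # only finished games enter the score database
for _ in range(1000):
    lucky = random.random() < 0.1  # 10% chance of a last-lap comeback
    if lucky:
        finished_ranks.append(1.0)  # Bob finishes and records a 1st place
    # otherwise Bob disconnects, so the game never enters the database

# Every recorded game from (last lap, 3rd place) ends in 1st place, so
# the algorithm concludes that Bob's "average" game from 3rd is a win.
print(statistics.mean(finished_ranks))  # prints 1.0
```

Although Bob's true expected rank from 3rd place is close to 3rd, the database only ever sees his lucky games, so it credits him with 1st whenever he disconnects.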

From this point on, you were able to enjoy your games of Mario Kart. Maybe you should have just asked Bob to stop cheating from the beginning, rather than designing a complicated method to make him stop.

 

Moral of the story

The exploit Bob found in Michael’s algorithm is what I call the Big Gambles problem: Bob has an incentive to disconnect in case of bad luck. The heart of the problem is that the points are calculated only over games where nobody disconnects. No matter what Bob does, he obtains points as if he were not disconnecting. So if Bob manages to disconnect if and only if he is not first, the algorithm will always reward him as if he were first! This lets Bob attempt risky shortcuts that require luck to pay off, since he gets the points either way.

With AIs, we might run into a similar problem. Imagine an AI that has a goal and wants to maximize it (the number of diamonds in the universe, say). Now introduce a button such that, if pressed, the AI changes its goal and wants to shut itself down. Then,

  • if the expected discounted reward of shutting itself down is greater than that of creating diamonds, the AI has an incentive to press the button. This transposes to “Bob has an incentive to disconnect”;
  • if creating diamonds is more advantageous instead, the AI will stop anyone from ever pressing the button. This transposes to “Alice has an incentive to purchase a better Wi-Fi connection so that she is never disconnected again”.

The technique introduced by MIRI and presented here consists of choosing the reward so that it is equal in the pressed and unpressed cases. But this motivates the AI to gamble and press the button in case of bad luck, since the reward is calculated over the unpressed worlds only. Note that this is not a problem of how well we approximate the true reward in the unpressed worlds (even though that would be a problem in practice, as Michael said).
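The same expected-value argument as in Bob's exploit applies here. This is a toy calculation with made-up numbers, not MIRI's actual formalism: reward in the unpressed worlds depends on luck, and the indifference compensation pays out the unconditional expectation on a button press.

```python
# Reward in "unpressed" worlds depends on luck (numbers invented):
reward_good, reward_bad = 10.0, 0.0
p_good = 0.5

# Indifference pays, upon a button press, the expected reward over
# unpressed worlds, so an agent that ignores the button is indifferent:
indifference_payoff = p_good * reward_good + (1 - p_good) * reward_bad
print(indifference_payoff)  # prints 5.0

# But an agent that observes its luck first and presses only on bad luck
# keeps the good outcomes and swaps the bad ones for the average:
gambler_value = p_good * reward_good + (1 - p_good) * indifference_payoff
print(gambler_value)  # prints 7.5 — gambling strictly beats honesty
```

The gambling policy dominates as long as the compensation is computed over unpressed worlds only, regardless of how accurately that expectation is estimated.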

To solve the Big Gambles problem while keeping the requirements of corrigible behavior, I propose [UPCOMING POST] two solutions: reward the AI so that it acts as if the button does not work (button inefficiency), or as if the button does not exist (button inexistence).
