Zvi:

The paper gives AZ's Elo as strangely low (~3000) versus top Stockfish Elo ratings of around 3400. And there's this quote about the way they configured Stockfish that I was pointed to elsewhere:

It is a nice step in a different direction, perhaps the start of a revolution, but AlphaZero is not yet better than Stockfish, and if you bear with me I will explain why. Most people are very excited and hoping for a sensation, so they don't really read the paper or think about what it says, which leads to uninformed opinions.
The testing conditions were terrible. 1 min/move is not really a suitable time control for any engine testing, but you could tolerate that. What is intolerable, though, is the hash table size: with the 64 cores Stockfish was given, you would expect around 32 GB or more, otherwise it fills up very quickly, leading to a marked reduction in strength. 1 GB was given, and that is far from an ideal value! Also, SF was not given any endgame tablebases, which are the current norm for any computer chess engine.
The computational power behind each entity was very different: while SF was given 64 CPU threads (really a lot, I've got to say), AlphaZero was given 4 TPUs. A TPU is a specialized chip for machine learning and neural network calculations. Its estimated power compared to a classical CPU is roughly 1 TPU ~ 30x E5-2699v3 (an 18-core machine), so AlphaZero had the equivalent of ~2000 Haswell cores behind it. That is nowhere near a fair match. And yet, even though the result was dominant, it was not what it would be if SF faced itself at 2000 cores vs. 64 cores; in that case the win percentage would be much more heavily in favor of the more powerful hardware.
From those observations we can draw a conclusion: AlphaZero is not as close in strength to SF as Google would like us to believe. The incorrect match settings suggest either a lack of knowledge about classical brute-force engines and how they are properly used, or an intention to create conditions where SF would be defeated.
With all that said, it is still an amazing achievement and definitely a breath of fresh air in computer chess, most welcome these days. But for a new computer chess champion we will have to wait a little bit longer.

So while some of those games are impossibly cool, it's likely that they are not as far along as they appear, although this is still obviously a major achievement.
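For concreteness, the settings the quoted critique asks for are all ordinary UCI options. Here is a minimal sketch using the python-chess library; the engine path, hash size, and tablebase path are illustrative assumptions, not the paper's actual setup:

```python
import chess
import chess.engine

# Illustrative paths and values only; they mirror the critique's suggestions,
# not DeepMind's actual match configuration.
engine = chess.engine.SimpleEngine.popen_uci("/usr/local/bin/stockfish")
engine.configure({
    "Threads": 64,                    # the 64 CPU threads the paper reports
    "Hash": 32768,                    # ~32 GB hash, versus the 1 GB actually given
    "SyzygyPath": "/path/to/syzygy",  # endgame tablebases, reportedly omitted
})

board = chess.Board()
# The paper's (criticized) fixed time control: one minute per move.
result = engine.play(board, chess.engine.Limit(time=60.0))
print(result.move)
engine.quit()
```

Per the quote, the actual match used a 1 GB hash and no tablebases, which is what makes the comparison look lopsided.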

In terms of power consumption, TPUs and CPUs are very similar. TPUs are of no use for anything besides matrix multiplications, and Stockfish uses many other types of computation. It's just not fair to say that Stockfish was pitted against 2000 cores.

Two interesting questions arise:

  • Could AlphaZero beat the best human-computer team?
  • Would a human-AZ team systematically beat AZ alone?

I think the answer to the first question is yes, but unfortunately I couldn't make much sense of the available raw data on freestyle chess, so my opinion is based on the Marginal Revolution blog post. A negative answer to the second question might make optimists about human-AI cooperation, like Kasparov, less optimistic.

I believe the answer to your second question is probably technically "yes"; if there's any way in which AZ mispredicts relative to a human, then there's some Ensemble Learning classifier that weights AZ move choices with human move choices and performs better than AZ alone. And because Go has so many possible states and moves at each state, humans would have to be much, much, much worse at play overall for us to conclude that humans were worse along every dimension.
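To make the "technically yes" concrete: the claim is only that some weighted mixture of the two move-choice policies does at least as well as AZ alone wherever they disagree. A toy sketch, with the function name, weight, and move probabilities all invented for illustration:

```python
from typing import Dict

def ensemble_move(az_probs: Dict[str, float],
                  human_probs: Dict[str, float],
                  az_weight: float = 0.95) -> str:
    """Pick the move with the highest weighted mixture of the two policies.

    az_probs and human_probs map candidate moves (UCI strings) to each
    policy's probability of choosing them. With az_weight near 1, the human
    signal only matters where the two policies disagree sharply.
    """
    scores = {
        m: az_weight * az_probs.get(m, 0.0) + (1.0 - az_weight) * human_probs.get(m, 0.0)
        for m in set(az_probs) | set(human_probs)
    }
    return max(scores, key=scores.get)

# Made-up probabilities for three candidate moves.
print(ensemble_move({"e2e4": 0.6, "d2d4": 0.4},
                    {"d2d4": 0.7, "c2c4": 0.3}))  # -> "e2e4"
```

The hard part, of course, is learning where the human term deserves any weight at all, which is exactly the data problem in the next paragraph.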

However, I'd bet the answer is practically "no". If AlphaZero vs. the top humans is now an Elo difference of 1200, that gives a predicted human victory rate of about 1/1000. We'd never be able to play enough games to gather the systematic data needed to identify a dimension along which humans chose better moves. And even if we did, it's entirely possible that the best response to that identification would be "give AlphaZero more training time on those cases", not "give AlphaZero a human partner in those cases". And even if we did decide to give AlphaZero a human partner, how often would the partner's input actually override the move AlphaZero alone would have chosen? Would the human even manage to stay attentive, uselessly, game after game, just so they could be ready to contribute the winning move in game 100?
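For reference, the ~1/1000 figure falls straight out of the standard Elo expectation formula (the 1200-point gap itself is my assumption above, and draws are ignored):

```python
def expected_score(elo_deficit: float) -> float:
    """Standard Elo expected score for the lower-rated side."""
    return 1.0 / (1.0 + 10.0 ** (elo_deficit / 400.0))

print(expected_score(1200))  # ~0.000999, i.e. roughly one game in a thousand
```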