I am an avid chess player, and was curious to read this paper after someone referred me to it. I think the claims from the paper and article are misguided: the premise is that Leela is a superhuman chess AI, but that has not actually been established at the node counts used in the paper (referring to Appendix section B.3). It's quite possible that Leela is not superhuman even at 1600 nodes searched - the upper limit referenced in the paper; at 1 node, which is also referenced in the paper's tables, I believe Leela has been estimated at around expert-level strength - well below superhuman. I appreciate the authors providing detailed implementation parameters for the purposes of reproduction. I did not attempt to reproduce the findings exactly from those parameters, though I have a similar version of Leela installed locally on a relatively modest consumer GPU.
To provide some context, Leela's evaluations are only correct/precise in the limit of nodes searched; in practice it does not have precise evaluations (which can only come from search) for moves it would not strongly consider playing. In the first mirrored example, Leela finds ...Rh4/Rh5 to be a clearly winning move in both cases within roughly 12k nodes of search, which takes only a few seconds on my set-up. The authors describe a hardware set-up on which this node count may be achievable in under 1 second. I found it curious that the choice was made to spend days of GPU time on the other problems, but not even a second's worth of compute on the individual chess positions, which leaves Leela playing somewhere in the range of strong amateur to roughly human grandmaster level.
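For anyone who wants to try this themselves, the kind of check I ran can be scripted along these lines. This is only a sketch using python-chess: it assumes an lc0 binary (with a network file already configured) is on your PATH, and the board is a stand-in since the paper's FEN isn't quoted here - you would substitute the actual mirrored rook-ending position.

```python
import chess
import chess.engine

# Sketch of a node-limited lc0 query via python-chess. Assumes "lc0" is on
# PATH with a network file configured; the board below is a placeholder.
engine = chess.engine.SimpleEngine.popen_uci("lc0")
board = chess.Board()  # substitute chess.Board("<FEN of the position>")

# Limit by nodes rather than wall-clock time so the result is comparable
# across different hardware.
lines = engine.analyse(board, chess.engine.Limit(nodes=12000), multipv=3)
for line in lines:
    print(line["score"], [move.uci() for move in line.get("pv", [])[:3]])

engine.quit()
```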
Regarding the Bf4 move from the article, I do see that Leela considers Bf4 a decent move, with around an 81% win probability in a shallow search. However, with even a bit more search, my local version does not rank Bf4 among the top few moves. It is important to note that the policy output is only part of the process; Leela is built to play chess well, which requires searching the position. In this case at least, my local version would not choose Bf4, so it does not search the move deeply.
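To look at the raw policy prior separately from search, something like the following should work, if I recall lc0's VerboseMoveStats option correctly (it reports per-move statistics, including the policy "P" value, as info strings). Again, the position is a placeholder to be replaced with the article's FEN.

```python
import chess
import chess.engine

# Sketch: inspect lc0's raw policy priors (essentially no search) by enabling
# VerboseMoveStats and searching a single node. The "P:" field in each printed
# line is, as I understand it, the policy head's prior for that move.
engine = chess.engine.SimpleEngine.popen_uci("lc0")
engine.configure({"VerboseMoveStats": True})

board = chess.Board()  # substitute the FEN of the Bf4 position from the article

with engine.analysis(board, chess.engine.Limit(nodes=1)) as analysis:
    for info in analysis:
        if "string" in info:
            print(info["string"])  # one line per candidate move, including P: %

engine.quit()
```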
Regarding Table 5 in Appendix B.3, I would be interested to know what percentage of the positions in each category are the synthetic pawnless positions. Since Leela is trained only on moves from games (almost all of them above human master-level play), some of the synthetic positions could be nuanced in ways that make them unlikely to arise from an actual game of chess.
As for the last position from the article, the one without pawns, my local version not only produces the highest policy output for Rg7+/Rb7+ but also assigns it a high win probability after very little search (at most 125 nodes). I am a bit intrigued that there is this much difference between the weights file used in the paper and a relatively recent version of Leela.
Leela does play chess at a superhuman level, as shown by the fact that it competes in computer chess tournaments at the level of the best engines in existence. However, constraining the nodes searched to the range used in the paper handicaps it to a sub-superhuman state, which makes the premise of the paper and some of the claims in the article and paper quite misleading.
EDIT: after conversing privately with one of the paper's authors, I think I underestimated Leela's strength at quite low node counts. Probably not at 1 node, but at hundreds of nodes, up to the 1.6k used in the paper, it is almost certainly playing at a superhuman level.