Straight out of the box, the new machine plays at the same level as the best conventional chess engines, many of which have been fine-tuned over many years. On a human level, it is equivalent to FIDE International Master status, placing it within the top 2.2 percent of tournament chess players. But even with this disadvantage, it is competitive. “Giraffe is able to play at the level of an FIDE International Master on a modern mainstream PC,” says Lai. By comparison, the top engines play at super-Grandmaster level.
That's a pretty hard contradiction right there. The latter claim is probably the correct one: modern chess engines beat any human player these days, even running on relatively modest hardware. That's for full-length games; at blitz time controls computers are much stronger still relative to humans, because humans are much more error-prone.
So if this neural net is playing at master level, it's still much, much weaker than the best computers. From master to grandmaster is a big leap, from grandmaster to the world's top players is another big leap, and the best computers are above even that.
Still interesting of course.
Yes, I noticed this too. The paper itself compares Giraffe (Elo ~2400) to 8 other chess engines (page 25 of the PDF) and concludes that
It is clear that Giraffe's evaluation function now has at least comparable positional understanding compared to evaluation functions of top engines in the world.
For comparison, there is a frequently updated list of chess engines and their (approximate) Elo ratings, which would put Giraffe at around a shared 165th place. It seems that it is the reporting, rather than the paper, that is making the exaggeration.
I think this is pretty cool and interesting, but I feel compelled to point out that all is not as it seems:
It's worth noting, though, that only the evaluation function is a neural network. The search, while no longer plain iterative deepening, is still recursive. Also, the evaluation function is not a pure neural network: it includes a static exchange evaluation.
It's also worth noting that doubling the amount of computing time usually increases a chess engine's rating by about 60 points. International Masters usually have a rating below 2500, and, though the figure is sketchy, the top chess engines are rated at around 3300. Thus, you could make a top-notch engine approximately 10,000 times slower and achieve the same performance.
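Spelled out, that back-of-the-envelope calculation looks like this (the 2500, 3300, and 60-points-per-doubling figures are the rough ones above, not precise measurements):

```python
# Rough estimate: how many times slower could a ~3300-rated engine run
# and still match a ~2500-rated International Master, assuming each
# doubling of compute is worth ~60 Elo points?
top_engine_elo = 3300   # sketchy figure quoted above
im_elo = 2500           # upper end of International Master ratings
elo_per_doubling = 60   # common rule of thumb

doublings = (top_engine_elo - im_elo) / elo_per_doubling  # ~13.3
slowdown = 2 ** doublings                                 # ~10,000x
print(f"{doublings:.1f} doublings -> ~{slowdown:,.0f}x slowdown")
```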
Now, that 3300 figure is probably fairly inaccurate. Also, it's quite possible that the developer could improve the engine by tweaking the recursive search algorithm. Thus the 10,000 figure I arrived at above is probably fairly inaccurate as well. Regardless, it is not clear to me that the neural network itself is proving terribly useful.
Although it's not better than existing solutions, it's a cool example of how good results can be achieved in a relatively automatic way - by contrast, the evaluation functions of the best chess engines have been carefully engineered and fine-tuned over many years, at least sometimes with assistance from people who are themselves master-level chess players. On the other hand this neural network approach took a relatively short time and could have been applied by someone with little chess skill.
edit: Reading the actual paper, it does sound like a certain amount of iteration, and expertise on the author's part, was still required.
edit2: BTW, the paper is very clear and well written. I'd recommend giving it a read if you're interested in the subject matter.
Thank you for recommending the paper; I don't think I would have read it otherwise, and I greatly enjoyed it!
But how much of Giraffe's performance comes from the neural network learning some non-trivial evaluation function, and how much comes from brute-forcing the game tree on a modern computer?
If the neural network were replaced by a trivial heuristic, say material balance, how would the engine perform?
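For concreteness, a material-balance heuristic is about as trivial as an evaluation function gets; a minimal sketch (the board representation here is purely illustrative, not from the paper):

```python
# Material-balance evaluation: sum standard piece values, counting
# White's pieces as positive and Black's as negative. The board
# representation (square -> piece letter, uppercase for White) is
# illustrative only.
PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9, "k": 0}

def material_balance(board: dict) -> int:
    score = 0
    for piece in board.values():
        value = PIECE_VALUES[piece.lower()]
        score += value if piece.isupper() else -value
    return score

# Example: White is up a knight.
print(material_balance({"e1": "K", "e8": "k", "g1": "N"}))  # 3
```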
In the paper they start with just material balance - then via the learning process, their score on the evaluation test goes from "worse than all hand-written chess engines" to "better than all except the very best one" (and the best one, while more hand-crafted, also uses some ML/statistical tuning of numeric params, and has had a lot more effort put into it).
The reason the NN solution currently doesn't do as well in real games is that it's slower to evaluate and therefore can't brute-force as far.
This is pitched as unsupervised learning from scratch, but on page 18 the paper talks about training the network to mimic another engine's board evaluation function.
I don't follow how the various networks and training regimes relate to each other. If the supervised training only produced the features listed on pages 18-19, that doesn't seem like a big deal, because those are very basic chess concepts that require very little expertise. (It did jump out at me that the last item, the lowest-valued piece among the attackers of a square, is much more complicated than any of the other features.) And the point is to make a much better evaluation function and trade it off against brute force. But I'm not sure what's going on.
The main training isn't entirely from scratch either, but makes use of random positions from expert games, implicitly labeled as balanced. It would be interesting to know how important that was, and what would happen if the system purely bootstrapped, generating typical positions by playing against itself.
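The purely bootstrapped variant might look something like the loop below; this is only a sketch of the idea, with `play_move` and `td_update` standing in for an engine's move selection and the paper's TD-style training step:

```python
# Hypothetical "pure bootstrap" training loop: generate positions by
# self-play instead of sampling them from expert games. play_move and
# td_update are stand-ins, not the paper's actual code.
import random

def self_play_positions(initial_position, play_move, num_games, moves_per_game):
    """Collect positions visited while the engine plays against itself."""
    positions = []
    for _ in range(num_games):
        pos = initial_position
        for _ in range(moves_per_game):
            pos = play_move(pos)   # engine chooses its own move
            positions.append(pos)
    return positions

def bootstrap_train(initial_position, play_move, td_update, num_iterations):
    for _ in range(num_iterations):
        # 1. Generate "typical" positions purely by self-play.
        positions = self_play_positions(initial_position, play_move,
                                        num_games=100, moves_per_game=40)
        # 2. Apply the same TD-style update, but on self-generated
        #    positions rather than expert-game positions.
        td_update(random.sample(positions, k=min(256, len(positions))))
```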
It looks to me as if they did the following:
1. First, train the network with supervised learning to mimic an existing engine's evaluation function, which gives it the basic features listed on pages 18-19.
2. Then, continue training via temporal-difference learning, starting from random positions drawn from expert games.
That sounds correct, but in the first step, I think what they are optimizing is not (just) features but representations of those features. The natural-language descriptions of the features are very simple ones that require no expertise, but some representations and ways of wiring them together are more conducive to learning and to building the higher-level features that Stockfish uses. But, again, it doesn't sound like they are applying much optimization power at this step.
Also, one thing that I think is just omitted is whether the network in the first step is the same network in the later steps, or whether the later steps introduce more layers to exploit the same features.
Features and representations: agreed. (I wasn't trying to be precise.)
I assumed the same network in the first step as later, but agree that it isn't made explicit in the paper.
I am confused by section 5 in the paper about probabilistic generation of the search tree - the paper states:
Testing showed that a naive implementation of probability-limited search is slightly (26 ± 12 rating points) stronger than a naive implementation of depth-limited search.
But the creators of the most popular engines literally spend hours a day trying to increase the rating of their engines, and 26 rating points is massive. Is this probabilistic search simply that unknown and that good? Or does the trick lie in the "stronger than a naive implementation of depth-limited search" qualifier, and is there some reason why we expect depth-limited search to have sophisticated implementations, but do not expect this for probabilistic search?
Or does the trick lie in the "stronger than a naive implementation of depth-limited search" qualifier, and is there some reason why we expect depth-limited search to have sophisticated implementations, but do not expect this for probabilistic search?
Something like that I think. The paper suggests that optimizations applied to depth-based search techniques in more sophisticated engines are already effectively like an approximation of probability-based search.
Shouldn't the probabilistic search in that case already be comparable in performance to non-naive depth-based search, if most of the sophistication in the latter just serves to approximate the former? Since the probabilistic search seems relatively simple, the argument above seems insufficient to explain why probabilistic search is not used more widely, right?
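For readers unfamiliar with the distinction being discussed, here is a toy sketch of the two naive approaches; the Node class, the cutoff threshold, and the uniform move probabilities are illustrative stand-ins (the paper uses a learned model to estimate move probabilities):

```python
# Toy contrast between naive depth-limited and naive
# probability-limited search, both written negamax-style.
from dataclasses import dataclass, field

@dataclass
class Node:
    value: float                      # static eval from side to move
    children: list = field(default_factory=list)

def move_probability(parent, child):
    # Stand-in: treat every move as equally likely. Giraffe instead
    # estimates this with a trained model.
    return 1.0 / len(parent.children)

def depth_limited(node, depth):
    # Stop at a fixed ply count, no matter how forced the line is.
    if depth == 0 or not node.children:
        return node.value
    return max(-depth_limited(c, depth - 1) for c in node.children)

def probability_limited(node, prob=1.0, threshold=0.05):
    # Stop once the estimated probability of reaching this node drops
    # below the threshold, so plausible and forcing lines go deeper.
    if prob < threshold or not node.children:
        return node.value
    return max(-probability_limited(c, prob * move_probability(node, c), threshold)
               for c in node.children)
```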
This and other recent work with deep learning raises a substantial question: how much of intelligence is simply raw processing power? Many of the ideas and algorithms used in these recent advances have been around for a while, but they are taking advantage of much more processing power than they would have had available 10 or 20 years ago. This suggests that we may already have most of the ingredients for intelligence but can't implement them, or can't recognize them, due to processing limitations.
http://www.technologyreview.com/view/541276/deep-learning-machine-teaches-itself-chess-in-72-hours-plays-at-international-master/
H/T http://lesswrong.com/user/Qiaochu_Yuan