Personal evaluation of LLMs, through chess
Lots of people seem to resonate with the idea that AI benchmarks are getting more and more meaningless – either because they're being run in a misleading way, or because they aren't tracking important things. I think the right response is for people to do their own personal evaluations of models, with their own criteria. As an example, I found Sarah Constantin's review of AI research tools super helpful, and I think more people should evaluate LLMs themselves in a way that is transparent and clear to others.

So this post shares the results of my personal evaluation of LLM progress, using chess as the focal exercise. I chose chess ability mainly because it seemed like the thing that should be most obviously moved by increases in reasoning ability – so it would be a good benchmark for whether reasoning ability has actually improved. Plus, it was a good excuse to spend all day playing chess. I much preferred the example in this post of cybersecurity usage at a real company, but this is just what I can do.

I played a chess game against each of the major LLMs and recorded the outcome of each game: whether the LLM hallucinated during the game, how many moves it lasted, and how well it played. Here were my results:

Full results in this spreadsheet.

My takeaways:

1. Claude 3.7 Sonnet and o3 were the only models able to complete a game without hallucinating. (DeepSeek R1 technically cleared that bar, but only by getting checkmated early.) They didn't play particularly well, and they still lost, but it checks out that these are the two models people seem to think are the best.

2. o4-mini played the best. Although it hallucinated after 15 moves, it played monstrously well up to that point. o4-mini's "average centipawn loss" – a measure of how much its play deviated from optimality, where perfect play scores 0 – was pretty close to zero, and far lower than the other models'. I am a pretty good player (2100 Elo on lichess.org, above the 90th percentile) and I was at a