I just bought a subscription to access GPT-4 and played the following chess game against it, with me playing white. (No particular agenda; I was just curious how good it is.)
At this point (move 31), GPT-4 suggested Kxc4, which is not legal, and when I asked it to correct itself, it suggested Kd5 and Kb6, which aren't legal either (the only legal move here is Kc6).
Stuff I noticed:
- As was pointed out before, it's much better than GPT-3.5, which started playing illegal moves much earlier. But it still started playing illegal moves eventually, so I'm not sure it makes sense to assign it a rating.
- It missed the early "removing the defender" tactic where I could exchange my bishop for its knight, which was defending its bishop; otherwise it played OK until the end.
- Moves 29 and 30 (the last two before it tried illegal moves) were just giving stuff away.
- It explained both my moves and its own every time; those explanations started going wrong earlier than its moves did. (After it recaptured my queen on move 17, it said it had maintained material balance; after move 20 it said it had pinned my knight to the rook on c1, but there was no rook on c1; from there on, most of the explanations were wrong.)
- I accidentally wrote 19. Rfd8 instead of 19. Rfd1, and it replied with "I assume you meant 19. Rfd1, placing your rook on the open d-file opposing my rook. I'll respond with 19...e5, attacking your knight on d4 and trying to grab some space in the center." Very helpful!
- After move 14 (the first move with the black rook), I asked it to evaluate the position, and it said that white has a small advantage. But it had already blundered a piece, so the position is completely winning for white (Stockfish says +5.2).
(PGN: 1. d4 Nf6 2. c4 e6 3. Nf3 d5 4. Nc3 Be7 5. Bf4 O-O 6. Nb5 $2 Na6 $9 7. e3 c6 $6 8. Nc3 Nc7 9. Rc1 $6 b6 10. Qb3 Ba6 11. Qa4 $6 Qd7 $4 12. Bxc7 $1 Qxc7 13. Qxa6 dxc4 14. Qxc4 Rac8 15. Bd3 c5 16. O-O cxd4 17. Qxc7 Rxc7 18. Nxd4 Rd8 19. Rfd1 e5 20. Nf5 Bb4 21. Ng3 Rcd7 22. Bb5 Rxd1+ 23. Rxd1 Rxd1+ 24. Nxd1 Kf8 25. Nc3 Ke7 26. a3 Bxc3 27. bxc3 Kd6 28. Kf1 Kc5 29. c4 a6 $6 30. Bxa6 Ne4 31. Nxe4+)
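(For what it's worth, the move-31 claim can be checked mechanically; this is just an illustrative snippet assuming the python-chess package, with the PGN above repeated minus the annotations, not something from the original exchange:)

```python
import io
import chess.pgn

# Moves from the PGN above, annotation glyphs stripped.
pgn = ("1. d4 Nf6 2. c4 e6 3. Nf3 d5 4. Nc3 Be7 5. Bf4 O-O 6. Nb5 Na6 "
       "7. e3 c6 8. Nc3 Nc7 9. Rc1 b6 10. Qb3 Ba6 11. Qa4 Qd7 12. Bxc7 Qxc7 "
       "13. Qxa6 dxc4 14. Qxc4 Rac8 15. Bd3 c5 16. O-O cxd4 17. Qxc7 Rxc7 "
       "18. Nxd4 Rd8 19. Rfd1 e5 20. Nf5 Bb4 21. Ng3 Rcd7 22. Bb5 Rxd1+ "
       "23. Rxd1 Rxd1+ 24. Nxd1 Kf8 25. Nc3 Ke7 26. a3 Bxc3 27. bxc3 Kd6 "
       "28. Kf1 Kc5 29. c4 a6 30. Bxa6 Ne4 31. Nxe4+")

game = chess.pgn.read_game(io.StringIO(pgn))
board = game.end().board()                        # position after 31. Nxe4+
print([board.san(m) for m in board.legal_moves])  # prints ['Kc6']
```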
That is odd. I certainly had a much, much higher completion rate than 1 in 40; in fact, with my prompt I had no games that I had to abandon. However, I played manually, and played well enough that it mostly did not survive beyond move 30 (although my collection has a blindfold game that went beyond move 50), and I checked at every turn that it reproduced the game history correctly, reprompting if that was not the case. Also, for GPT-3.5 I supplied it with the narrative fiction that it could access Stockfish. Mentioning Stockfish might push it towards more precise play.
Trying again today, ChatGPT 3.5 via the standard chat interface did, however, seem to have a propensity for listing only White's moves in its PGN output, which is not encouraging.
For exact reproducibility, I have added a game played via the API at temperature zero to my collection and given exact information on model, prompt and temperature in the PGN:
https://lichess.org/study/ymmMxzbj/SyefzR3j
If your scripts allow testing this prompt, I'd be interested in seeing what completion rate/approximate rating relative to some low Stockfish level is achieved by chatgpt-3.5-turbo.
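In case it helps, here is roughly what such a test could look like. This is only a sketch under my assumptions (the `openai` and `python-chess` packages, a local `stockfish` binary, the API model id `gpt-3.5-turbo`, and a generic prompt), not the prompt or harness referred to above:

```python
# Sketch of a completion-rate test: the model plays Black against a low
# Stockfish level, stopping at its first illegal or unparseable move.
# Assumes `openai`, `python-chess`, and a `stockfish` binary on PATH; the
# model id, prompt wording, and skill level are placeholders.
import re
import chess
import chess.engine
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM = ("You are playing Black in a chess game. Given the moves so far, "
          "reply with only your next move in SAN, nothing else.")

def record(sans):
    """Format the game so far as numbered SAN pairs, e.g. '1. d4 Nf6 2. c4'."""
    return " ".join(f"{i // 2 + 1}. {s}" if i % 2 == 0 else s
                    for i, s in enumerate(sans))

def model_move(board, sans):
    """Ask the model for Black's move; return None if no legal SAN is found."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": record(sans)}],
    )
    text = resp.choices[0].message.content
    for token in re.findall(r"[A-Za-z][\w=+#-]*", text):
        try:
            return board.parse_san(token)   # raises ValueError if illegal
        except ValueError:
            continue
    return None

engine = chess.engine.SimpleEngine.popen_uci("stockfish")
engine.configure({"Skill Level": 1})        # weak level as a rough rating anchor

board, sans = chess.Board(), []
while not board.is_game_over():
    if board.turn == chess.WHITE:
        move = engine.play(board, chess.engine.Limit(time=0.1)).move
    else:
        move = model_move(board, sans)
        if move is None:                    # abandoned: first illegal reply
            print(f"Abandoned after {len(sans)} plies: {record(sans)}")
            break
    sans.append(board.san(move))
    board.push(move)
else:
    print(f"Completed game {board.result()}: {record(sans)}")

engine.quit()
```

Run over a batch of games, the share that reach the `Completed game` branch gives a completion rate, and the results against a fixed skill level give a very rough rating anchor.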