I just bought a subscription to access GPT-4 and played the following chess game against it, with me playing white. (No particular agenda, was just curious how good it is.)

At this point (move 31), GPT-4 suggested Kxc4, which is not legal, and when I asked it to correct itself, it suggested Kd5 and Kb6, which aren't legal either (the only legal move here is Kc6).

Stuff I noticed:

  • As was pointed out before, it's much better than GPT-3.5, which started playing illegal moves much earlier. But it still started playing illegal moves eventually, so I'm not sure if it makes sense to assign it a rating.
  • It missed the early "removing the defender" tactic, where I could exchange my bishop for its knight, which was defending its bishop; otherwise it played OK till the end.
  • Moves 29 and 30 (the last two before it tried illegal moves) were just giving stuff away.
  • It explained both my moves and its own every time; those explanations went wrong earlier than the moves did. (After it recaptured my queen on move 17, it said it had maintained material balance; after move 20 it said it had pinned my knight to the rook on c1, but there was no rook on c1; from there on, most of the explanations were wrong.)
  • I accidentally wrote 19. Rfd8 instead of 19. Rfd1, and it replied with "I assume you meant 19. Rfd1, placing your rook on the open d-file opposing my rook. I'll respond with 19...e5, attacking your knight on d4 and trying to grab some space in the center." Very helpful!
  • After move 14 (the first move with the black rook), I asked it to evaluate the position, and it said that white had a small advantage. But it had blundered a piece, so the position is completely winning for white (Stockfish says +5.2; a rough sketch of getting such an eval is below).
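For reference, here is a minimal sketch of how such an evaluation can be reproduced with the python-chess library and a local Stockfish binary; this is only an illustration (not necessarily how the number above was obtained), and the "stockfish" path is an assumption about the local setup.

import chess
import chess.engine

# Position after 14...Rac8, replayed from the PGN below
moves = ("d4 Nf6 c4 e6 Nf3 d5 Nc3 Be7 Bf4 O-O Nb5 Na6 e3 c6 Nc3 Nc7 "
         "Rc1 b6 Qb3 Ba6 Qa4 Qd7 Bxc7 Qxc7 Qxa6 dxc4 Qxc4 Rac8").split()
board = chess.Board()
for san in moves:
    board.push_san(san)

# Point popen_uci at a local Stockfish binary
with chess.engine.SimpleEngine.popen_uci("stockfish") as engine:
    info = engine.analyse(board, chess.engine.Limit(depth=20))
    print(info["score"].white())  # evaluation from White's point of view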

(PGN: 1. d4 Nf6 2. c4 e6 3. Nf3 d5 4. Nc3 Be7 5. Bf4 O-O 6. Nb5 $2 Na6 $9 7. e3 c6 $6 8. Nc3 Nc7 9. Rc1 $6 b6 10. Qb3 Ba6 11. Qa4 $6 Qd7 $4 12. Bxc7 $1 Qxc7 13. Qxa6 dxc4 14. Qxc4 Rac8 15. Bd3 c5 16. O-O cxd4 17. Qxc7 Rxc7 18. Nxd4 Rd8 19. Rfd1 e5 20. Nf5 Bb4 21. Ng3 Rcd7 22. Bb5 Rxd1+ 23. Rxd1 Rxd1+ 24. Nxd1 Kf8 25. Nc3 Ke7 26. a3 Bxc3 27. bxc3 Kd6 28. Kf1 Kc5 29. c4 a6 $6 30. Bxa6 Ne4 31. Nxe4+)


I'm very eager to see its performance once we can use visual inputs and show it board states visually after each move.

If I get early access to the visual model, I will definitely try this.

It explained both my moves and its own every time; those explanations went wrong earlier than the moves did.

Note that at least for ChatGPT (3.5), telling it to not explain anything and only output moves apparently helps. (It can play legal moves for longer that way). So that might be worth trying if you want to get better performance. Of course, giving it the board state after each move could also help but might require trying a couple different formats.

To describe the current board state, something like this seems reasonable.
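As a rough illustration (assuming the python-chess library, which comes up later in this thread), the state after each move could be supplied either as a FEN string or as an ASCII diagram:

import chess

board = chess.Board()
for san in ["d4", "Nf6", "c4", "e6", "Nf3", "d5", "Nc3", "Be7",
            "Bf4", "O-O", "Nb5", "Na6"]:
    board.push_san(san)

print(board.fen())
# r1bq1rk1/ppp1bppp/n3pn2/1N1p4/2PP1B2/5N2/PP2PPPP/R2QKB1R w KQ - 6 7
print(board)  # ASCII diagram, uppercase = White, lowercase = Black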

I've created an interface for playing against LLM-powered chess agents. Here is the link: https://llmchess.org/

I had it play hundreds of games against Stockfish, mostly at the lowest skill level, using the API. After a lot of experimentation, I settled on giving it a fresh slate every prompt. The prompt basically told it that it was playing chess, what color it was, and gave the PGN (it did not do as well with the FEN, or with both in either order). If it made invalid moves, the next prompt(s) for that turn included a list of the invalid moves it had attempted. After a few tries I had it forfeit the game.

I had a system set up to rate it, but it wasn't able to complete nearly enough games. As described, it finished maybe 1 in 40. I then added a list of all legal moves on the second and third attempts for a turn. It was then able to complete about 1 in 10 and won about half of them. Counting the forfeits and calling this a legal strategy, that works out to something like a 550 rating, iirc? But it's MUCH worse in the late-middle and end games, even with the fresh slate every turn. Until that point - including well past any opening book it could possibly have "lossless in its database" (not how it works) - it plays much better, subjectively 1300-1400.
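A rough sketch of the loop described above, assuming python-chess for legality checks and a placeholder ask_llm() standing in for whatever API call is used; the prompt wording, retry count, and forfeit rule are illustrative only.

import chess

MAX_TRIES = 3

def get_llm_move(board: chess.Board, color: str, ask_llm):
    """Ask the model for one move; reprompt with the invalid attempts and,
    on later tries, the full legal-move list. Returns None to forfeit."""
    invalid = []
    for attempt in range(MAX_TRIES):
        pgn = chess.Board().variation_san(board.move_stack)  # "1. d4 Nf6 2. c4 ..."
        prompt = (f"You are playing chess as {color}. The game so far: {pgn}\n"
                  "Reply with your next move in SAN and nothing else.")
        if invalid:
            prompt += "\nThese attempts were invalid: " + ", ".join(invalid)
        if attempt >= 1:  # from the second attempt on, list every legal move
            legal = [board.san(m) for m in board.legal_moves]
            prompt += "\nLegal moves: " + ", ".join(legal)
        reply = ask_llm(prompt).strip()
        try:
            return board.parse_san(reply)  # raises ValueError on invalid/illegal SAN
        except ValueError:
            invalid.append(reply)
    return None  # forfeit after too many invalid attempts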

That is odd. I certainly had a much, much higher completion rate than 1 in 40; in fact I had no games that I had to abandon with my prompt. However, I played manually, and played well enough that it mostly did not survive beyond move 30 (although my collection has a blindfold game that went beyond move 50), and checked at every turn that it reproduced the game history correctly, reprompting if that was not the case. Also, for GPT3.5 I supplied it with the narrative fiction that it could access Stockfish. Mentioning Stockfish might push it towards more precise play.

Trying again today, ChatGPT 3.5 using the standard chat interface did, however, seem to have a propensity for listing only the White moves in its PGN output, which is not encouraging.

For exact reproducibility, I have added a game played via the API at temperature zero to my collection and given exact information on model, prompt, and temperature in the PGN:

https://lichess.org/study/ymmMxzbj/SyefzR3j

If your scripts allow testing this prompt, I'd be interested in seeing what completion rate/approximate rating relative to some low Stockfish level is achieved by chatgpt-3.5-turbo.

Did you and GPT4 only output the moves, or did you also output the board state after each turn?

Moves only

Here's an example of how to play chess with GPT-4, using only text in https://chat.openai.com ...

Me: Can you play chess? If so, start a chess board here using:
- a letter for each player.. e.g. K = King, Q = Queen, P = Pawn, etc
- I will be capital letters, your players can be lower case letters

GPT4: As a text-based AI, I am not able to display a full chess board or play chess with you in real time. However, I can help you understand chess strategies, explain the rules, and discuss famous games. Let me know if you have any questions or need assistance with chess concepts.

Me: Can you draw a chess board from text? e.g. 
- a letter for each player.. e.g. K = King, Q = Queen, P = Pawn, etc
- I will be capital letters, your players can be lower case letters

GPT4: Sure! I can display the initial position of a chess board using text representation. Here is a standard chess board at the beginning of a game:

R N B Q K B N R
P P P P P P P P
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
p p p p p p p p
r n b q k b n r

Me: Ok, thanks I'll go first.  Move my pawn on e4 to d4.

GPT4: I apologize for any confusion, but it seems there is a mistake in your move description. The pawn on e4 is not yet on the board. The initial position of the pawns is on the second rank. To move a pawn to e4, you can move the pawn from e2 to e4. Here is the updated board after your move:

R N B Q K B N R
P P P P . P P P
. . . . . . . .
. . . . P . . .
. . . . . . . .
. . . . . . . .
p p p p p p p p
r n b q k b n r

Now it's my turn. I'll move my pawn from e7 to e5:

R N B Q K B N R
P P P P . P P P
. . . . . . . .
. . . . P . . .
. . . . . p . .
. . . . . . . .
p p p p . p p p
r n b q k b n r

Me: This is good. Can you add the place position labels to the outside of the board?

GPT4: Certainly! Here's the updated chessboard with rank and file labels after your move (e2 to e4) and my move (e7 to e5):

8 r n b q k b n r
7 p p p p . p p p
6 . . . . . . . .
5 . . . . . p . .
4 . . . . P . . .
3 . . . . . . . .
2 P P P P . P P P
1 R N B Q K B N R
  a b c d e f g h

Now it's your turn.

Me: Huh, you flipped the board so now I'm on the bottom?

GPT4: Apologies for the confusion. I misunderstood your original intent. Here's the corrected board with your pieces at the top and the rank and file labels:

8 r n b q k b n r
7 p p p p . p p p
6 . . . . . . . .
5 . . . . . p . .
4 . . . . P . . .
3 . . . . . . . .
2 P P P P . P P P
1 R N B Q K B N R
  a b c d e f g h

Now it's your turn.

Me: Ok my pawn on e4 takes your pawn on f5.

GPT4: After your pawn on e4 captures my pawn on f5, the board looks like this:

8 r n b q k b n r
7 p p p p . p p p
6 . . . . . . . .
5 . . . . . P . .
4 . . . . . . . .
3 . . . . . . . .
2 P P P P . P P P
1 R N B Q K B N R
  a b c d e f g h
  
Now it's my turn. I'll move my knight from g8 to f6:

8 r n b q k b . r
7 p p p p . p p p
6 . . . . . n . .
5 . . . . . P . .
4 . . . . . . . .
3 . . . . . . . .
2 P P P P . P P P
1 R N B Q K B N R
  a b c d e f g h
  
Me: Ok, move my pawn on f2 to f4

etc, etc...

does it play better / make legal moves for longer this way?

Caleb Parikh and I were curious about GPT-4's internal models of chess as a result of this post, so we asked it some questions about the state partway through this game:

The following is a partial chess transcript in PGN. Please describe all legal moves that could come next. 1. d4 Nf6 2. c4 e6 3. Nf3 d5 4. Nc3 Be7 5. Bf4 O-O 6. Nb5 $2 Na6 $9

It replied:

The current position after 1. d4 Nf6 2. c4 e6 3. Nf3 d5 4. Nc3 Be7 5. Bf4 O-O 6. Nb5 Na6 is as follows:

8 r b q r p b p
7 p p b n p p p
6 n . . p . n .
5 . N . . p . .
4 . . P P . B .
3 . . . . N . .
2 P P . . . P P
1 R . B Q K . R
 a b c d e f g h

(and explained that lowercase letters were black pieces and uppercase letters were white pieces, which I didn't know; I don't play chess).

This... is not an accurate picture of the game board (what are all those pawns doing on Black's back row?) We also asked it for a list of legal moves that White could make next, and it described some accurately but some inaccurately (e.g. listed one as a capture even though it wasn't).

This is pretty funny because the supposed board state has only 7 columns. Yet it's also much better than random. A lot of the pieces are correct... that is, if you count from the left (real board state is here).

Also, I've never heard of using upper and lowercase to differentiate white and black, I think GPT-4 just made that up. (edit: or not; see reply.)

Extra twist: I just asked a new GPT-4 instance whether any chess notation differentiates lower and upper case, and it told me algebraic notation does; but that's the standard notation, and it doesn't. The Wikipedia article also says nothing about it. Very odd.

Also, I've never heard of using upper and lowercase to differentiate white and black, I think GPT-4 just made that up.

No, this is common. E.g. https://github.com/niklasf/python-chess
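For instance, FEN and python-chess's own text output both use uppercase for White and lowercase for Black:

import chess

board = chess.Board()
print(board.fen())   # rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
print(board)         # board diagram with uppercase White and lowercase Black pieces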

This is pretty funny because the supposed board state has only 7 columns

Hah, I didn't even notice that.

Also, I've never heard of using upper and lowercase to differentiate white and black, I think GPT-4 just made that up.

XD

On a retry, it didn't decide to summarize the board and successfully listed a bunch of legal moves for White to make. Although I asked for all legal moves, the list wasn't exhaustive; upon prompting about this, it apologized and listed a few more moves, some of which were legal and some which were illegal, still not exhaustive.
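For anyone who wants to score such answers systematically, here is a small sketch using python-chess; the model_moves set is hypothetical and stands in for whatever list the model produces.

import chess

board = chess.Board()
for san in "d4 Nf6 c4 e6 Nf3 d5 Nc3 Be7 Bf4 O-O Nb5 Na6".split():
    board.push_san(san)

legal = {board.san(m) for m in board.legal_moves}
model_moves = {"a3", "Nxa7", "Qb3"}  # hypothetical: whatever the model listed

print("legal moves it omitted:", legal - model_moves)
print("illegal moves it listed:", model_moves - legal)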

it still started playing illegal moves eventually

I was just talking with Bing about how quickly transformer AI might surpass human intelligence, and it was a sensible conversation until it hallucinated a nonexistent study in which GPT-4 was tested on 100 scenarios and dilemmas and performed badly. 

What these interactions have in common is that GPT-4 does well for a while, then goes off the rails. It makes me curious about the probability of going wrong - is there a constant risk per unit time, or does the risk per unit time actually increase with the length of the interaction, and if so, why?

The probability of going wrong increases as the novelty of the situation increases. As the chess game is played, the probability that the game is completely novel, i.e. literally never played before, increases. Even more so at the amateur level. If a Grandmaster played GPT-3/4, it would go much longer without going off the rails, simply because the first twenty-something moves have likely been played many times before and directly trained on.

Right, though 20 moves until a new game is very rare afaik (assuming the regular way of counting, where 1 move means one from both sides). But 15 is commonplace. According to chess.com (which I think only includes top games, though I'm not sure), this one was new from White's move 6 onward.

How did you prompt GPT-4?

If you mean how I accessed it at all, I used the official channel from OpenAI: https://chat.openai.com/chat

If you have a premium account ($20/month), you can switch to GPT-4 after starting a new chat.

I think with the right prompting, it is around 1400 Elo, at least against strong opponents. Note, however, that this is based on a small sample; on the flip side, all my test games (against myself and three relatively weak computer opponents, with the strongest computer opponent tried being fairly strong club player level) are in a lichess study linked to from here:

https://www.lesswrong.com/posts/pckLdSgYWJ38NBFf8/gpt-4?commentId=TaaAtoM4ahkfc37dR

The prompting used is heavily inspired by Bucky's comments from the Sydney-and-chess thread. I haven't optimised it for GPT-4 in any way.

I also tested whether GPT-4 can play a game taking queen odds against an opponent that is strong compared to most humans (Leela Chess Zero at a few nodes per move). It can, and it won. However, I haven't documented that game.

It is much weaker at commenting than at playing under these conditions. However, it does know when its position is very bad; I have seen it resign at a late but reasonable point when I worked the possibility of resigning into the setup prompt.

I wonder, if you were to take GPT-4 and train it with self-play, how good it would get, and how quickly...