There's a nice recent paper whose authors did the following:
- train a small GPT model on lists of moves from Othello games;
- verify that it seems to have learned (in some sense) to play Othello, at least to the extent of almost always making legal moves;
- use "probes" (regressors whose inputs are internal activations in the network, trained to output things you want to know whether the network "knows") to see that the board state is represented inside the network activations;
- use interventions to verify that this board state is actually being used to decide moves: take a position in which certain moves are legal, use gradient descent to find changes to the internal activations that make the probes' outputs look like a slightly different position, and then check that, when the network is run with its activations tweaked in that way, it predicts moves that are legal in the modified position (a rough code sketch of this step appears just below the list).
In other words, it seems that their token-predicting model has built itself what amounts to an internal model of the Othello board's state, which it is using to decide what moves to predict.
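I found the intervention step easiest to understand as a rough sketch in code. To be clear, this is my own illustration, not the authors' code: `probe` stands in for an already-trained probe mapping a single d_model-dimensional activation to per-square board-state logits, the Adam loop and the toy shapes are assumptions, and in the real experiment the edited activation would be spliced back into the transformer's forward pass (e.g. via a forward hook) rather than just returned.

```python
# Minimal sketch (not the authors' code) of the activation-editing idea, in PyTorch.
# Assumes `probe` is an already-trained probe mapping a d_model-dimensional activation
# to 64 x 3 per-square logits, and `target_board` is the board state we want the probe
# to read off after the edit.
import torch
import torch.nn.functional as F

def edit_activation(act, probe, target_board, steps=200, lr=1e-2):
    """Gradient-descend on one internal activation until the probe reports
    `target_board`, and return the edited activation."""
    for p in probe.parameters():
        p.requires_grad_(False)                       # freeze the probe; only the activation moves
    act = act.detach().clone().requires_grad_(True)
    target = torch.as_tensor(target_board, dtype=torch.long)  # 64 labels in {0, 1, 2}
    opt = torch.optim.Adam([act], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = probe(act).view(64, 3)               # per-square state logits
        F.cross_entropy(logits, target).backward()
        opt.step()
    return act.detach()

# Toy usage with a stand-in probe and a random activation.  In the real experiment the
# edited activation is inserted back into the forward pass and the model's next-move
# predictions are compared with the legal moves of the *modified* position.
d_model = 512
probe = torch.nn.Sequential(torch.nn.Linear(d_model, 256), torch.nn.ReLU(),
                            torch.nn.Linear(256, 64 * 3))
act = torch.randn(d_model)
target_board = torch.randint(0, 3, (64,))
edited = edit_activation(act, probe, target_board)
```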
The paper is "Emergent world representations: Exploring a sequence model trained on a synthetic task" by Kenneth Li, Aspen Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg; you can find it at https://arxiv.org/abs/2210.13382.
There is a nice expository blog post by Kenneth Li at https://thegradient.pub/othello/.
Some details that seem possibly-relevant:
- Their network has a 60-word input vocabulary (four of the 64 squares are filled when the game starts and can never be played in), 8 layers, an 8-head attention mechanism, and a 512-dimensional hidden space. (I don't know enough about transformers to know whether this in fact tells you everything important about the structure.)
- They tried training on two datasets, one of real high-level Othello games (about 140k games) and one of synthetic games in which every move is chosen uniformly at random from the legal moves (about 20M games). Their model trained on synthetic games predicted legal moves 99.99% of the time, but the one trained on real well-played games only predicted legal moves about 95% of the time. (This suggests that their network isn't really big enough to capture legality and good strategy at the same time, I guess?)
- They got some evidence that their network isn't just memorizing game transcripts by training it on a 20M-game synthetic dataset where one of the four possible initial moves is never played. It still predicted legal moves 99.98% of the time when tested on the full range of legal positions. (I don't know what fraction of legal positions are reachable with the first move not having been C4; it will be more than 3/4 since there are transpositions. I doubt it's close to 99.98%, though, so it seems like the model is doing pretty well at finding legal moves in positions it hasn't seen.)
- Using probes whose output is a linear function of the network activations doesn't do a good job of reconstructing the board state (error rate is ~25%, barely better than attempting the same thing from a randomly initialized network), but training 2-layer MLPs to do it gets the error rate down to ~5% for the network trained on synthetic games and ~12% for the one trained on championship games, whereas it doesn't help at all for the randomly initialized network (a sketch of the two probe types is below). (This suggests that whatever "world representation" the thing has learned isn't simply a matter of having an "E3 neuron" or whatever.)
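For concreteness, here is roughly what the two families of probe look like. This is a sketch of my own, not code from the paper; the MLP's hidden width and the three-states-per-square output encoding are guesses on my part.

```python
# Sketch (my guess at shapes, not the paper's code) of the two probe families.
# Each probe takes one d_model-dimensional activation vector and predicts, for
# each of the 64 squares, one of three states (empty / black / white).
import torch.nn as nn

d_model = 512                 # hidden size reported for their model
n_squares, n_states = 64, 3

# Linear probe: board state as a linear function of one activation vector.
linear_probe = nn.Linear(d_model, n_squares * n_states)

# 2-layer MLP probe: a single hidden nonlinearity, which is what brings the
# reported error rate down so much.
mlp_probe = nn.Sequential(
    nn.Linear(d_model, 256),  # hidden width is my assumption
    nn.ReLU(),
    nn.Linear(256, n_squares * n_states),
)

# Both would be trained with per-square cross-entropy on (activation, board-state)
# pairs collected from the trained network; the baseline uses activations from a
# randomly initialized copy of the same architecture.
```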
I am not at all an expert on neural network interpretability, and I don't know to what extent their findings really justify calling what they've found a "world model" and saying that it's used to make move predictions. In particular, I can't refute the following argument:
"In most positions, just knowing what moves are legal is enough to give you a good idea of most of the board state. Anything capable of determining which moves are legal will therefore have a state from which the board state is somewhat reconstructible. This work really doesn't tell us much beyond what the fact that the model could play legal moves already does. If the probes are doing something close to 'reconstruct board state from legal moves', then the interventions amount to 'change the legal moves in a way that matches those available in the modified position', which of course will make the model predict the moves that are available in the modified position."
(It would be interesting to know whether their probes are more effective at reconstructing the board state in positions where the board state is closer to being determined by the legal moves. Though that seems like it would be hard to distinguish from "the model just works better earlier in the game", which I suspect it does.)
I agree that the network trained on the large random-game dataset shows every sign of having learned the rules very well, and if I implied otherwise then that was an error. (I don't think I ever intended to imply otherwise.)
The thing I was more interested in was the difference between that and the network trained on the much smaller championship-game dataset, whose incorrect-move rate is far higher -- about 5%. I'm pretty sure that either (1) having a lot more games of that type would help a lot, or (2) having a bigger network would help a lot, or (3) both; my original speculation was that (2) was more important, but at that point I hadn't noticed just how big the disparity in game count was. I now think it's probably mostly (1), and I suspect that the difference between "random games" and "well-played games" is not a major factor; in particular, I don't think it's likely that seeing only good moves is leading the network to learn a wrong ruleset. (It's definitely not impossible! It just isn't how I'd bet.)
Vaniver's suggestion was that the championship-game-trained network had learned a wrong ruleset on account of some legal moves being very rare. It doesn't seem likely to me that this, as opposed to (1) not having learned very well because the number of games was too small and/or (2) not having learned very well because the positions in the championship games are unrepresentative, is the explanation for an illegal move being the top prediction 5% of the time.
It looked as if you were disagreeing with that, but the arguments you've made in support of that disagreement all seem like cogent arguments against things other than what I was intending to say, which is why I think at least one of us is misunderstanding the other.
In particular, at no point was I saying anything about the causes of the nonzero but very small error rate (~0.01%) of the network trained on the large random-game dataset, and at no point was I saying that that network had not done an excellent job of learning the rules.