There's a nice recent paper whose authors did the following:
- train a small GPT model on lists of moves from Othello games;
- verify that it seems to have learned (in some sense) to play Othello, at least to the extent of almost always making legal moves;
- use "probes" (regressors whose inputs are internal activations in the network, trained to output things you want to know whether the network "knows") to see that the board state is represented inside the network activations;
- use interventions to verify that this board state is being used to decide moves: take a position in which certain moves are legal, use gradient descent to find changes in internal activations that make the output of the probes look like a slightly different position, and then verify that, when you run the network with those tweaked activations, it predicts moves that are legal in the modified position (a rough sketch of what this might look like is below).
In other words, it seems that their token-predicting model has built itself what amounts to an internal model of the Othello board's state, which it is using to decide what moves to predict.
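To make the intervention step concrete, here is roughly what I understand it to involve, written as a PyTorch sketch. This is my own reconstruction, not the authors' code: the way the model is split into a list of `blocks` plus an `unembed` head, the (64, 3) probe output shape, and the optimizer settings are all assumptions on my part.

```python
import torch
import torch.nn.functional as F

def intervene(blocks, unembed, probe, x, target_board, layer_k,
              steps=100, lr=1e-2):
    """Sketch (not the authors' code): edit the residual stream at layer
    `layer_k` so the probe reads off a modified board state, then let the
    rest of the network run on the edited activations.

    blocks       -- list of transformer blocks, each (seq, d_model) -> (seq, d_model)
    unembed      -- final norm + output head, mapping (d_model,) -> move logits
    probe        -- pretrained probe: (d_model,) -> (64, 3) board-state logits
    x            -- embedded input sequence, shape (seq, d_model)
    target_board -- LongTensor of 64 entries in {0: empty, 1: black, 2: white}
                    describing the *modified* position
    """
    # Run the first k blocks as usual.
    h = x
    for blk in blocks[:layer_k]:
        h = blk(h)
    h = h.detach()

    # Find an additive edit to the activation at the current-move position
    # that makes the probe output the modified board state.
    delta = torch.zeros_like(h[-1], requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(probe(h[-1] + delta), target_board)
        loss.backward()
        opt.step()

    # Resume the forward pass with the edited activation; the test is whether
    # the top-scoring moves are now the ones legal in the modified position.
    h = h.clone()
    h[-1] = h[-1] + delta.detach()
    for blk in blocks[layer_k:]:
        h = blk(h)
    return unembed(h[-1])
```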
The paper is "Emergent world representations: Exploring a sequence model trained on a synthetic task" by Kenneth Li, Aspen Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg; you can find it at https://arxiv.org/abs/2210.13382.
There is a nice expository blog post by Kenneth Li at https://thegradient.pub/othello/.
Some details that seem possibly-relevant:
- Their network has a 60-word input vocabulary (four of the 64 squares are filled when the game starts and can never be played in), 8 layers, an 8-head attention mechanism, and a 512-dimensional hidden space. (I don't know enough about transformers to know whether this in fact tells you everything important about the structure; my guess at what this configuration looks like is sketched after this list.)
- They tried training on two datasets, one of real high-level Othello games (about 140k games) and one of synthetic games where all moves are random (about 20M games). Their model trained on synthetic games predicted legal moves 99.99% of the time, but the one trained on real well-played games only predicted legal moves about 95% of the time. (This suggests that their network isn't really big enough to capture legality and good strategy at the same time, I guess?)
- They got some evidence that their network isn't just memorizing game transcripts by training it on a 20M-game synthetic dataset where one of the four possible initial moves is never played. It still predicted legal moves 99.98% of the time when tested on the full range of legal positions. (I don't know what fraction of legal positions are reachable with the first move not having been C4; it will be more than 3/4 since there are transpositions. I doubt it's close to 99.98%, though, so it seems like the model is doing pretty well at finding legal moves in positions it hasn't seen.)
- Using probes whose output is a linear function of the network activations doesn't do a good job of reconstructing the board state (error rate ~25%, barely better than attempting the same thing from a randomly initialized network), but training 2-layer MLPs to do it gets the error rate down to ~5% for the network trained on synthetic games and ~12% for the one trained on championship games, whereas it doesn't help at all for the randomly initialized network. (This suggests that whatever "world representation" the thing has learned isn't simply a matter of having an "E3 neuron" or whatever.)
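For what it's worth, my guess is that the architecture in the first bullet corresponds to something like the following minGPT-style configuration. This is my own reading, not their code; in particular the block size and the absence of any extra padding/special tokens are assumptions on my part.

```python
from dataclasses import dataclass

@dataclass
class OthelloGPTConfig:
    # My reading of the architecture described in the paper; the exact
    # token/padding conventions are my assumption, not checked against code.
    vocab_size: int = 60   # one token per playable square (64 minus the 4 central squares)
    block_size: int = 59   # a game has at most 60 moves, so at most 59 tokens of context
    n_layer: int = 8
    n_head: int = 8
    n_embd: int = 512      # dimensionality of the residual stream / "hidden space"
```

If it follows the usual GPT block layout, that's on the order of 12 × 512² parameters per layer, i.e. very roughly 25M parameters in total.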
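And the probes in the last bullet are, as far as I can tell, along these lines: each one maps a 512-dimensional activation vector to a 3-way classification (empty / black / white) for each of the 64 squares. Again a sketch under my own assumptions, not their code; the hidden width of the MLP in particular is an arbitrary choice of mine.

```python
import torch.nn as nn
import torch.nn.functional as F

D_MODEL = 512                  # size of the activation vectors being probed
N_SQUARES, N_STATES = 64, 3    # each square is empty, black, or white

# Linear probe: a single affine map from activations to per-square logits.
linear_probe = nn.Linear(D_MODEL, N_SQUARES * N_STATES)

# 2-layer MLP probe: same inputs and outputs, plus one hidden nonlinearity.
mlp_probe = nn.Sequential(
    nn.Linear(D_MODEL, 256),   # hidden width is my choice, not a paper detail
    nn.ReLU(),
    nn.Linear(256, N_SQUARES * N_STATES),
)

def probe_loss(probe, acts, boards):
    # acts: (batch, D_MODEL) activations from the frozen Othello model;
    # boards: (batch, 64) true square states in {0, 1, 2}.
    logits = probe(acts).view(-1, N_SQUARES, N_STATES)
    return F.cross_entropy(logits.reshape(-1, N_STATES), boards.reshape(-1))

# Train either probe by minimizing probe_loss over (activation, board) pairs;
# the reported "error rate" is then the fraction of squares classified wrongly.
```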
I am not at all an expert on neural network interpretability, and I don't know to what extent their findings really justify calling what they've found a "world model" and saying that it's used to make move predictions. In particular, I can't refute the following argument:
"In most positions, just knowing what moves are legal is enough to give you a good idea of most of the board state. Anything capable of determining which moves are legal will therefore have a state from which the board state is somewhat reconstructible. This work really doesn't tell us much beyond what the fact that the model could play legal moves already does. If the probes are doing something close to 'reconstruct board state from legal moves', then the interventions amount to 'change the legal moves in a way that matches those available in the modified position', which of course will make the model predict the moves that are available in the modified position."
(It would be interesting to know whether their probes are more effective at reconstructing the board state in positions where the board state is closer to being determined by the legal moves. Though that seems like it would be hard to distinguish from "the model just works better earlier in the game", which I suspect it does.)
Sure, I'm not saying they should've done that instead; they could have done it in addition, but probably they didn't have the time/energy. My point is just that the illegal-move error rate is ambiguous if you (gjm) are interested in whether it has perfectly learned the rules (which is different from what the authors are going after), because there are sources of error beyond "it has failed to learn the rules", like errors reconstructing the board state leading to misapplication of potentially-perfectly-learned rules. To my eyes, a legal-move error rate as low as 0.01% in this setup, given the burden of reconstructing the state in an unnatural and difficult way, strongly suggests it's actually doing a great job of learning the rules. I predict that if you set it up in a way which more narrowly targeted rule learning (eg behavior cloning: just mapping full game state -> expert action, no history at all), you would find that its illegal-move rate would approach 0% much more closely, and you'd have to find some really strange edge-cases like my chess promotion examples to trip it up (at which point one would be satisfied, because how would one ever learn those unobserved things offline without priors?).
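Concretely, the sort of behavior-cloning setup I have in mind is nothing fancier than the sketch below; the board encoding and layer sizes are arbitrary choices of mine, just to illustrate the shape of the task, not anything from the paper.

```python
import torch.nn as nn

# Behavior cloning: current board in, expert move out, no move history.
# Input encoding: 3 one-hot planes (empty / current player / opponent), 8x8.
board_to_move = nn.Sequential(
    nn.Flatten(),                 # (batch, 3, 8, 8) -> (batch, 192)
    nn.Linear(3 * 8 * 8, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 60),           # logits over the 60 playable squares
)
# Train with cross-entropy on (board, expert move) pairs from the same game
# data; any "state reconstruction" burden disappears, so the residual
# illegal-move rate more directly measures how well the rules were learned.
```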