Could you explain how this differs from the standard Reinforcement Learning formulation? (See e.g. http://incompleteideas.net/book/first/ebook/node28.html for an introduction.)
This is indeed amusing. In reality, the action space can be taken to be of size 256 (the number of possible byte values), with the number of bytes in the solution as the episode length. Note also that 256 is an upper bound: not all byte values are valid at every point, and most of the time only the 128 ASCII values are used. Using a tokenizer, as is standard in language models, simply reduces the episode length by increasing the action space; it does not change the size of the overall state space.
This also means that, despite their claims, the search space for the example solutions shown on their website is similar to, or smaller than, that of board games such as Chess and Go :D
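Back-of-the-envelope, using my own guesses (a ~150-byte, mostly-ASCII example solution, plus the usual branching/length figures for Chess and Go), just to show how the comparison shakes out:

    import math

    # Illustrative assumptions only (mine, not from the paper or website):
    solution_len = 150   # assumed length of an example solution, in bytes
    alphabet = 128       # mostly ASCII in practice; 256 is the upper bound

    log10_code = solution_len * math.log10(alphabet)
    log10_chess = 80 * math.log10(35)     # ~35 legal moves, ~80 plies per game
    log10_go = 150 * math.log10(250)      # ~250 legal moves, ~150 moves per game

    print(f"code solutions ~10^{log10_code:.0f}")   # ~10^316
    print(f"chess          ~10^{log10_chess:.0f}")  # ~10^124
    print(f"go             ~10^{log10_go:.0f}")     # ~10^360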
Nice summary! I agree, this is an interesting paper :)
But learning to be predictive of such random future states seems like it falls subject to exactly the same problem as learning to be predictive of future observations: you have no guarantee that EfficientZero will be learning relevant information, which means it could be wasting network capacity on irrelevant information. There's a just-so story you could tell where adding this extra predictive loss results in worse end-to-end behavior because of this wasted capacity, just like there's a just-so story where adding this extra predictive loss results in better end-to-end behavior because of faster training. I'm not sure why one turned out to be true rather than the other.
This mostly depends on the size of your dataset. For very small datasets (100k frames here), the network is overparameterized and can easily overfit; adding the consistency loss provides regularisation that can prevent this.
For larger datasets (e.g. the standard 200 million frame setting in Atari) you'll see less overfitting, and I would expect the impact of the consistency loss to be much smaller, possibly negative. The paper doesn't include ablations for this, but I might test it if I have time.
To phrase it differently: the less data you have for your real objective, the more you can benefit from auxiliary losses and regularisation.
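For concreteness, the consistency loss in question is a SimSiam-style term between the latent predicted by the dynamics model and the encoding of the real next observation. A minimal sketch of the idea (my simplification, not the authors' code; I believe the actual implementation also passes both latents through projection/prediction heads):

    import jax
    import jax.numpy as jnp

    def consistency_loss(pred_next_latent, target_next_latent):
        """Negative cosine similarity between the dynamics model's predicted
        next latent and the encoder's latent of the real next observation.
        Stop-gradient on the target branch, as in SimSiam/BYOL."""
        target = jax.lax.stop_gradient(target_next_latent)
        pred = pred_next_latent / (jnp.linalg.norm(pred_next_latent, axis=-1, keepdims=True) + 1e-8)
        target = target / (jnp.linalg.norm(target, axis=-1, keepdims=True) + 1e-8)
        return -jnp.mean(jnp.sum(pred * target, axis=-1))

    # total_loss = rl_loss + lambda_consistency * consistency_loss(pred, target)
    # With only 100k frames, this extra term mostly acts as a regulariser.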
Does the fact that Omicron had already spread through community transmission in the Netherlands (and other European countries) before the reports from South Africa, yet still isn't as widespread in Europe, suggest that it's not that transmissible after all?
The difference in compute between AlexNet and AlphaZero is that for AlexNet you are only counting the FLOPs used during training, while for AlphaZero you are counting both the training and the self-play data generation (which does 800 forward passes per move × ~200 moves to generate each game).
If you were to compare supervised training numbers for both (e.g. training on human chess or Go games), the figures would be much closer.
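Rough arithmetic, using the 800 × ~200 numbers above and treating a backward pass as roughly two forwards (illustrative only):

    # Back-of-the-envelope, not the paper's exact accounting:
    forwards_per_move = 800        # MCTS simulations per move during self-play
    moves_per_game = 200           # roughly, per game
    positions_per_game = moves_per_game

    selfplay_forwards = forwards_per_move * moves_per_game   # 160,000 per game
    supervised_equivalents = 3 * positions_per_game          # 1 forward + 1 backward (~2 forwards) ~= 3 forward-equivalents per position

    print(selfplay_forwards / supervised_equivalents)        # ~270x more compute per training position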
The TOPS numbers from the wiki page seem wrong. TPUv1 had 92 TOPS (uint8); for TPUv3 the "90 TOPS" refers to a single chip, but I'm fairly sure that when the paper says "8 TPUv3s" they mean 8 cards, as that's how they are available on Google Cloud (1 card = 4 chips).
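If the cards-vs-chips reading is right, the arithmetic would be roughly:

    # Taking the ~90 TOPS per-chip figure at face value:
    cards = 8              # "8 TPUv3s" read as 8 cards, as sold on Google Cloud
    chips_per_card = 4
    tops_per_chip = 90

    print(cards * chips_per_card * tops_per_chip)   # 2880 TOPS, vs 720 if it meant 8 chips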
Only Anakin actually runs the environment on the TPU, and this only works for pretty simple environments (basically: can you implement it in JAX?). Sebulba runs the environments on the host, which is what would have been done for this paper too (no idea whether they used Sebulba or had a different setup).
This doesn't really matter though, because for these simulated environments it's fairly simple to fully utilize the TPUs by running more (remote) environments in parallel.
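To make the "can you implement it in JAX?" constraint concrete, here is a toy Anakin-style sketch (my own toy example, not the paper's setup): the environment step is a pure JAX function, so the whole batched rollout can be jitted and kept on the accelerator.

    import jax
    import jax.numpy as jnp

    def env_step(state, action):
        # Trivial 1-D "environment": the state drifts by the action, reward is -|state|.
        next_state = state + action
        reward = -jnp.abs(next_state)
        return next_state, reward

    def policy(rng, state):
        # Random policy standing in for a network forward pass.
        return jax.random.normal(rng, state.shape)

    @jax.jit
    def rollout(rng, states):
        # Environment and agent both live on the device; the loop is a lax.scan.
        def body(carry, _):
            rng, states = carry
            rng, key = jax.random.split(rng)
            actions = policy(key, states)
            states, rewards = env_step(states, actions)
            return (rng, states), rewards

        (_, states), rewards = jax.lax.scan(body, (rng, states), None, length=128)
        return states, rewards

    # A batch of 1024 environments stepped entirely on the accelerator:
    states, rewards = rollout(jax.random.PRNGKey(0), jnp.zeros(1024))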
An important distinction here is that the number of tokens a model was trained on should not be confused with the number of tokens in the dataset: if each token is seen exactly once during training, then the model has been trained for one "epoch".
In my experience, scaling continues for quite a few epochs over the same dataset; only when the model has more parameters than the dataset has tokens and you train for >10 epochs does overfitting kick in and scaling break down.
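Concretely, with made-up numbers:

    # Hypothetical numbers, just to illustrate tokens-trained vs dataset-tokens:
    dataset_tokens = 50e9          # tokens in the dataset
    steps = 200_000
    batch_size = 512               # sequences per step
    seq_len = 2048                 # tokens per sequence

    tokens_trained = steps * batch_size * seq_len   # ~2.1e11 tokens seen during training
    epochs = tokens_trained / dataset_tokens        # ~4.2 passes over the same data

    print(f"{tokens_trained:.2e} tokens trained, {epochs:.1f} epochs")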