PhD Student at Mila, Montreal
Any plans on evaluating RETRO (the retrieval augmented transformer from DeepMind) on TruthfulQA? I'm guessing it should perform similarly to WebGPT but would be nice to get a concrete number.
Great post! I think you might wanna emphasize just how crucial ReAnalyse is for data-efficiency (the default MuZero is quite sample in-efficient), and how the reanalyse-ratio can be tuned easily for any data budget using a log-linear scaling law. You can also interpret the off-policy correction thing as running ReAnalyse twice, so my TL;DR of EfficientZero would be "MuZero ReAnalyse + SPR".
Regarding contrastive vs SPR, I don't think you would find a performance boost using a contrastive loss compared to SPR on Atari at least. We did an ablation for this in the SPR paper (Table 6, appendix). I suspect the reason contrastive works (slightly) better on Procgen is because of the procedural diversity there making negative examples much more informative.
Definitely agree about moving to multi-task test beds as the next frontier in RL. I also suspect we would see more non tabula-rasa RL methods, ones that start off with general-purpose pre-trained models or representations and then only do a tiny amount of fine-tuning on the actual RL task.
Thanks, glad you liked it, I really like the recent RL directions from OpenAI too! It would be interesting to see the use of model-based RL for the "RL as fine-tuning paradigm": making large pre-trained models more aligned/goal-directed efficiently by simply searching over a reward function learned from humans.
The Q-Learning baseline is a model-free control of MuZero. So it shares implementation details of MuZero (network architecture, replay ratio, training details etc.) while removing the model-based components of MuZero (details in sec A.2) . Some key differences you'd find vs a typical Q-learning implementation:
I would guess larger networks + higher sample reuse have the biggest effect size compared to standard Q-learning implementations.
The ProcGen competition also might have used the easy difficulty mode compared to the hard difficulty mode used in our paper.
Worth noting that they already use BERT in Search. https://blog.google/products/search/search-language-understanding-bert/
My main takeaway from Gato: If we can build specialized AI agents for 100s/1000s of tasks, it's now pretty straightforward to make a general agent that can do it all in a single model. Just tokenize data from all the tasks and feed into a transformer.