I'm talking about these agents (LW thread here)
I'd love an answer either in operations (MIPS, FLOPS, whatever) or in dollars.
Follow-up question: How many parameters did their agents have?
I just read the paper (incl. appendix) but didn't see them list the answer anywhere. I suspect I could figure it out from information in the paper, e.g. by adding up how many neurons are in their LSTMs, their various other bits, etc. and then multiplying by how long they said they trained for, but I lack the ML knowledge to do this correctly.
Some tidbits from the paper:
For multi-agent analysis we took the final generation of the agent (generation 5) and created equally spaced checkpoints (copies of the neural network parameters) every 10 billion steps, creating a collection of 13 checkpoints.
This suggests 120 billion steps of training for the final agents (13 checkpoints = 12 intervals of 10 billion steps each). But elsewhere in the post they state each agent in the final generation experienced 200 billion training steps, so... huh?
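A quick sanity check of that arithmetic (assuming the first checkpoint is taken at step 0, which the paper doesn't say explicitly):

```python
# 13 evenly spaced checkpoints at 10-billion-step intervals.
# If the first one is at step 0 (my assumption), they span 12 intervals.
checkpoints = 13
interval = 10e9
print(f"{(checkpoints - 1) * interval:,.0f} steps spanned")  # 120,000,000,000
```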
Anyhow. Another tidbit:
In addition to the agent exhibiting zero-shot capabilities across a wide evaluation space, we show that finetuning on a new task for just 100 million steps (around 30 minutes of compute in our setup) can lead to drastic increases in performance relative to zero-shot, and relative to training from scratch which often fails completely.
So, if 100 million steps take 30 minutes in their setup, and they did 200 billion steps for the final generation, the final generation took 30 min × 2,000 = 60,000 minutes ≈ 42 days. Makes sense. So the whole project probably took something like 100-200 days, depending on whether generations 1-4 were quicker.
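Spelling that out (the per-generation guess at the end is mine, not the paper's):

```python
# Scale the paper's "100 million steps ~= 30 minutes" figure up to the final generation.
finetune_steps = 100e6
finetune_minutes = 30
final_gen_steps = 200e9

minutes = final_gen_steps / finetune_steps * finetune_minutes  # 60,000 minutes
days = minutes / 60 / 24
print(f"final generation: ~{days:.0f} days")                   # ~42 days

# Upper bound if generations 1-4 each took as long as the final one (pure guess):
print(f"whole project: at most ~{5 * days:.0f} days")          # ~208 days
```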
How much does that cost though??? In dollars or FLOPs? I have no idea.
EDIT: The paper says each agent was trained on 8 TPUv3s. But how many agents were there? I can't find anything about the population size. Maybe I'm not looking hard enough.
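For a dollar/FLOP guess, here's a very rough sketch; everything in it is an assumption of mine (reading "8 TPUv3s" as one 8-core TPUv3-8 device per agent, guessing utilization, and using the old roughly $8/hour on-demand TPUv3-8 price), not something the paper states:

```python
# Extremely rough per-agent FLOP and dollar estimate for the final generation.
# Assumptions (mine, not the paper's):
#   - "8 TPUv3s" = one Cloud TPU v3-8 device (8 cores, ~420 peak bf16 TFLOPS)
#   - utilization: RL rarely runs near peak; 10% is a guess
#   - price: on-demand TPUv3-8 was roughly $8/hour
#   - duration: ~42 days for the final generation (estimate above)
peak_flops = 420e12
utilization = 0.10
seconds = 42 * 24 * 3600
price_per_hour = 8.0

flops_per_agent = peak_flops * utilization * seconds
dollars_per_agent = price_per_hour * seconds / 3600
print(f"~{flops_per_agent:.1e} FLOPs, ~${dollars_per_agent:,.0f} per agent")
# ~1.5e20 FLOPs and ~$8,000 per agent; multiply by the (unknown) population size
# and by whatever generations 1-4 cost to get a whole-project figure.
```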
Yes, I see that they used Unity, so the TPUs themselves couldn't run the env, but the TPU CPU VM* could potentially run a lot of copies (with the roughly 300GB of RAM it has access to), and that'd be a lot nicer than running remote VMs. At least in Tensorfork, when we try to use TPU pods, a lot of time goes into figuring out correct use of the interconnect & traffic, because the on-TPU ops are so optimized by default.
(And regardless of which of those tricks this open-ended-learning paper uses, it's a point well worth knowing: research could potentially get way more performance out of a TPU pod than one would expect from the TPU usage of older work like AlphaStar.)
* advertisement: access to the VM was recently unlocked for non-Google TPU users. It really changes how you treat TPU use!