I'm impressed by the number of different training regimes stacked on top of each other.
-Train a model that detects whether a Minecraft video on Youtube is free of external artifacts like face cams.
-Then feed the good videos to a model that's been trained using data from contractors to guess what key is being pressed each frame.
-Then use the videos and input data to train a model that, in any game situation, does whatever inputs it guesses a human would be most likely to do, in an undirected shortsighted way.
-And then fine-tune that model on a specific subset of videos that feature the early game.
-And only then use some mostly-standard RL training to get good at some task.
It's impressive. So far we see capabilities like this only in domains with loads of data. The models seem to be able to do anything if scaled, but the data dictates the domains where this is possible.
It really doesn't seem that far away until there's pre-trained foundation models for most modalities... Google's "Pathways" project is definitely doing it as we speak IMO.
This generates a decent approximation of the distribution of human actions in an open world situation. Is it usable for empirical quantillizer experiments?
https://openai.com/blog/vpt/
And from the Conclusion: