This is a linkpost for https://openai.com/blog/dall-e/
My own take: Cool, but not super surprising given GPT-3 and Image GPT. I look forward to seeing what a bigger version of this would do, so we can get a sense of how much it improves with scale. I'm especially interested in its performance on Raven's Progressive Matrices.
What happens when OpenAI simply expands this method of token prediction to train on every kind of correlated multimedia on the internet: audio, video, text, images, semantic web ontologies, and scientific data? If they also increase the context window and token complexity, how close does this get us to AGI?
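As a rough illustration of what that kind of training setup could look like, here is a minimal sketch, in plain Python/NumPy, of packing text tokens and discrete image-codebook tokens into a single sequence for next-token prediction, in the spirit of Image GPT and DALL-E. The vocabulary sizes, token ids, and helper name are made up for illustration, not taken from any actual model:

```python
import numpy as np

# Hypothetical vocabulary sizes: BPE text tokens plus a discrete image codebook
# (e.g. from a dVAE/VQ-VAE). The numbers are illustrative, not DALL-E's.
TEXT_VOCAB = 16384
IMAGE_VOCAB = 8192

def pack_multimodal_sequence(text_tokens, image_tokens):
    """Concatenate text and image tokens into one stream over a shared vocabulary.

    Image token ids are offset by TEXT_VOCAB so the two modalities occupy
    disjoint id ranges; a decoder-only transformer can then be trained with
    ordinary next-token prediction on the combined sequence.
    """
    text = np.asarray(text_tokens, dtype=np.int64)
    image = np.asarray(image_tokens, dtype=np.int64) + TEXT_VOCAB
    return np.concatenate([text, image])

# Toy example: a short caption followed by a (tiny) grid of image codes.
caption_ids = [17, 934, 2048, 5]             # pretend BPE ids for a caption
image_code_ids = [101, 7, 4096, 88, 2, 300]  # pretend image codebook indices
sequence = pack_multimodal_sequence(caption_ids, image_code_ids)
print(sequence)  # one flat sequence the model would predict left to right
```

Other modalities (audio, video, structured data) would presumably need to be discretized into tokens in the same way before being appended to the shared stream, which is where most of the practical difficulty would lie.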
I expect getting a dataset an order of magnitude larger than The Pile without significantly compromising on quality will be hard, but not impractical. Two orders of magnitude (~100 TB) would be extremely difficult, if feasible at all. But it's not clear that this matters; per Scaling Laws, dataset requirements grow more slowly than model size, and a 10 TB dataset would already be past the compute-data intersection point they talk about.
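As a rough sanity check on "dataset requirements grow more slowly than model size," here is a small back-of-the-envelope script using the anti-overfitting data requirement reported in Kaplan et al.'s scaling-laws paper, D ≳ 5×10³ · N^0.74 tokens for N parameters. The constant, the exponent, and the bytes-per-token conversion are approximate, and this is a related but distinct quantity from their compute-data intersection point, so treat the outputs purely as order-of-magnitude illustrations:

```python
# Sub-linear data requirement from Kaplan et al. (2020):
#     D >= ~5e3 * N**0.74 tokens to train N parameters without major overfitting.
# All figures below are rough estimates, not exact values from the paper.

BYTES_PER_TOKEN = 4  # crude average for BPE-encoded English text

def tokens_needed(n_params: float) -> float:
    return 5e3 * n_params ** 0.74

for n_params in [1.5e9, 1.75e11, 1.75e13]:  # GPT-2-ish, GPT-3-ish, 100x GPT-3
    tokens = tokens_needed(n_params)
    terabytes = tokens * BYTES_PER_TOKEN / 1e12
    print(f"N = {n_params:.2e} params -> ~{tokens:.2e} tokens (~{terabytes:.1f} TB)")

# Growing N by 100x only grows the estimated data requirement by 100**0.74 ~ 30x,
# which is the sense in which data needs lag model size.
```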
Note also that 10 TB of text is an exorbitant amount. Even if there were a model that would hit AGI with, say, a PB of text,...