Very interesting!
The approach to images here is very different from Image GPT. (Though this is not the first time OpenAI has written about this approach -- see the "Image VQ" results from the multi-modal scaling paper.)
In Image GPT, an image is represented as a 1D sequence of pixel colors. The pixel colors are quantized to a palette of size 512, but still represent "raw colors" as opposed to anything more abstract. Each token in the sequence represents 1 pixel.
In DALL-E, an image is represented as a 2D array of tokens from a latent code. There are 8192 possible tokens. Each token in the sequence represents "what's going on" in a roughly 8x8 pixel region (because they use 32x32 codes for 256x256 images).
(Caveat: The mappings from pixels-->tokens and tokens-->pixels are contextual, so a token can influence pixels outside "its" 8x8 region.)
This latent code is analogous to the BPE code used to represent tokens (generally words) for text GPT. Like BPE, the code is defined before doing generative training, and is presumably fixed during generative training. Like BPE, it chunks the "raw" signal (pixels here, characters in BPE) into larger, more meaningful units.
This is like a vocabulary of 8192 "image words." DALL-E "writes" a 32x32 array of these image words, and then a separate network "decodes" this discrete array to a 256x256 array of pixel colors.
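To make the shape bookkeeping concrete, here's a minimal sketch in numpy. The encode/decode steps are random placeholders for the learned discrete autoencoder, not the real thing; only the array shapes are meant to match the setup described above.

```python
import numpy as np

# Shape sketch only: the "encode"/"decode" steps below are random placeholders
# for the learned discrete autoencoder, but the array shapes match the setup
# described above.

VOCAB_SIZE = 8192    # size of the "image word" vocabulary
GRID = 32            # the latent code is a 32x32 grid of tokens
IMAGE_SIZE = 256     # images are 256x256 RGB

image = np.random.rand(IMAGE_SIZE, IMAGE_SIZE, 3)             # raw pixels

# Encoder: pixels -> discrete codes (placeholder for the learned mapping).
codes = np.random.randint(0, VOCAB_SIZE, size=(GRID, GRID))   # 32x32 "image words"

# The transformer models these codes as a flat sequence of 32*32 = 1024 tokens,
# appended after up to 256 BPE text tokens.
sequence = codes.reshape(-1)
assert sequence.shape == (1024,)

# Decoder: codes -> pixels (placeholder). Naively each code covers an 8x8
# patch (256 / 32 = 8), though the real decoder is contextual.
reconstruction = np.random.rand(IMAGE_SIZE, IMAGE_SIZE, 3)
```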
Intuitively, this feels closer than Image GPT to mimicking what text GPT does with text. Pixels are way lower-level than words; 8x8 regions with contextual information feel closer to the level of words.
As with BPE, you get a head start over modeling the raw signal. As with BPE, the chunking may ultimately be a limiting factor. The chunking process here is differentiable (a neural auto-encoder), though, so it ought to be adaptable in a way BPE is not.
(Trivia: I'm amused that one of their visuals allows you to ask for images of triangular light bulbs -- the example Yudkowsky used in LOGI to illustrate the internal complexity of superficially atomic concepts.)
Here is the link to Yudkowsky's discussion of concept merging with the triangular lightbulb example: https://intelligence.org/files/LOGI.pdf#page=10
Generated lightbulb images: https://i.imgur.com/EHPwELf.png
This is pretty nuts to me. I've played around with GPT-3 and can't really use the copy it generates for much. But if DALL-E were to generate larger images, I could see this easily replacing all the stock photos and icons I use.
Given that the details in generated objects are often right, you could use neural super-resolution models to upscale the images to whatever size you need.
On prior work: they cited X-LXMERT (Sep 2020) and TReCS (Nov 2020) in the blog post. These seem to be the baselines.
https://arxiv.org/abs/2011.03775
https://arxiv.org/abs/2009.11278
The quality of objects and scenes there is far below the new model's; the outputs are often garbled and don't look quite right.
But more importantly, the best they could sometimes extract from the text was something like "a zebra is standing in the field", i.e. the object and the background; everything else was lost. With this model, you can actually use many more language features for visualization: specifying spatial relations between objects, specifying object attributes in a semantically precise way, camera view, time and place, rotation angle, printing text on objects, introducing text-controllable recolorations and reflections. I may be mistaken, but I don't think I've seen convincing demonstrations of any of these capabilities in open-domain image+text generation before.
One evaluation drawback I see is that they haven't included any generated human images in the blog post besides busts. Because of this, there's a chance that scenes with humans are of worse quality, but I think they would still be very impressive compared to prior work, given how photorealistic everything else looks.
I'm not sure what accounts for this performance, but it may well mostly be more parameters (2-3 orders of magnitude more than previous models?) plus more and better data (that new dataset of image-text pairs they used for CLIP?).
I wonder if something like this could be paired with AI Dungeon? If they do release an image generator model for public or private use, I think it would be fun to see an image accompany the last line(s) of the text output that has been generated for the story so far.
Then more complex AI generated games wouldn't be too far away either.
Taking a sentence output by AI Dungeon and feeding it into DALL-E is totally possible (if and when the DALL-E source code becomes available). I'm not sure how much money it would cost. DALL-E has about 7% of the parameters that the biggest model of GPT-3 has, though I doubt AI Dungeon uses the biggest model. Generating an entire image with DALL-E means predicting 1024 tokens/codewords, whereas predicting text is at most 1 token per letter. All in all, it seems financially plausible. I think it would be fun to see the results too.
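Rough numbers for that comparison (the parameter counts are the published figures for DALL-E and the largest GPT-3; the tokens-per-sentence figure is just a guess):

```python
# Back-of-the-envelope for the comparison above.
dalle_params = 12e9                 # DALL-E: ~12 billion parameters
gpt3_params = 175e9                 # largest GPT-3 model
print(dalle_params / gpt3_params)   # ~0.069, i.e. roughly 7%

image_tokens = 32 * 32              # one image = 1024 codewords to predict
text_tokens = 40                    # guess: a sentence or two of story text
print(image_tokens / text_tokens)   # ~26x more tokens per generated image
```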
What seems tricky to me is that a story can be much more complex than the 256-token text input that DALL-E accepts. Suppose the last sentence of the story is "He picks up the rock." This input fits into DALL-E easily, but is very ambiguous. "He" might be illustrated by DALL-E as any arbitrary male figure, even though in the story, "He" refers to a very specific character. ("The rock" is similarly ambiguous. And there are more ambiguities, such as the location and the time of day that the scene takes place in.) If you scan back a couple of lines, you may find that "He" refers to a character called Fredrick. His name is not immediately useful for determining what he looks like, but knowing his name, we can now look through the entire story to find descriptions of him. Perhaps Fredrick was introduced in the first chapter as a farmer, but became a royal knight in Chapter 3 after an act of heroism. Whether Fredrick is currently wearing his armor might depend on the last few hundred words, and what his armor looks like was probably described in Chapter 3. Whether his hair is red could depend on the first few hundred words. But maybe in the middle of the story, a curse turned his hair into a bunch of worms.
All this is to say that to correctly interpret a sentence in a story, you potentially have to read the entire story. Trying to summarize the story could help, but can only go so far. Every paragraph of the story could contain facts about the world that are relevant to the current scene. Instead, you might want to summarize only those details of the story that are currently relevant.
Or maybe you can somehow train an AI that builds up a world model from the story text, so that it can answer the questions necessary for illustrating the current scene. It's worth noting that GPT-3 has something akin to a world model that it can use to answer questions about Earth, as well as fictional worlds it's been exposed to during training. However, its ability to learn about new worlds outside of training (i.e. during inference) is limited, since it can only remember the last ~2000 tokens. To me it seems like you'd need to give the AI its own memory, so that it can store long-term facts about the text to help predict the next token. I wonder if something like that has been tried yet.
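A crude first pass at "summarize only the currently relevant details" could just be retrieval: score each paragraph of the story against the current sentence and pack the top few into DALL-E's 256-token prompt. A toy sketch (bag-of-words similarity; every name and detail here is made up for illustration):

```python
import re
from collections import Counter
from math import sqrt

def bow(text):
    """Bag-of-words vector: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_prompt(story_paragraphs, current_sentence, k=2):
    """Prepend the k paragraphs most similar to the current sentence,
    as a stand-in for a 'currently relevant' summary."""
    query = bow(current_sentence)
    ranked = sorted(story_paragraphs, key=lambda p: cosine(bow(p), query), reverse=True)
    return " ".join(ranked[:k] + [current_sentence])

story = [
    "Fredrick was a farmer with red hair.",
    "In chapter three Fredrick became a royal knight and was given silver armor.",
    "The kingdom's mines were rich in iron and rock.",
]
print(build_prompt(story, "He picks up the rock."))
```

Of course, word overlap alone wouldn't resolve "He" to Fredrick, which is exactly where something like a learned memory or world model would have to take over.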
One way you might be able to train such a model is to have it generate movie frames out of subtitles, since there's plenty of training data that way. Then you're pretty close to illustrating scenes from a story.
What happens when OpenAI simply expands this method of token prediction to train with every kind of correlated multi-media on the internet? Audio, video, text, images, semantic web ontologies, and scientific data. If they also increase the buffer size and token complexity, how close does this get us to AGI?
Audio, video, text, images
While other media would undoubtedly improve the model's understanding of concepts that are hard to express through text, I've never bought the idea that it would do much for AGI. Text has more than enough in it to capture intelligent thought; it is the relations and structure that matter, above all else. If this weren't true, one wouldn't expect to find competent deafblind people, but they exist. Their successes come in spite of an evolutionary history with practically no surviving deafblind ancestors! Clearly the modules that make humans intelligent, in a way that other animals and things are not, do not depend on multisensory data.
A few points. First, I've heard several AI researchers say that GPT-3 is already close to the limit of all high-quality human-generated text data. While the amount of text on the internet will continue to grow, it might not grow fast enough for major continued improvement. Thus additional media might be necessary as training input.
Second, deafblind people still have multiple senses that allow them to build 3D sensory-motor models of reality (touch, smell, taste, proprioception, vestibular sense, sound vibrations). Correlations among these senses give rise to an understanding of causality. Moreover, human brains might have evolved innate structures for things like causality, agency, objecthood, etc., which don't have to be learned.
Third, as DALL-E illustrates, intelligence is not just about learning knowledge; it is also about expressing that learning in a medium. It is hard to see how an AI trained only on text could paint a picture or sing a song.
I expect getting a dataset an order of magnitude larger than The Pile without significantly compromising on quality will be hard, but not impractical. Two orders of magnitude (~100 TB) would be extremely difficult, if feasible at all. But it's not clear that this matters; per the Scaling Laws paper, dataset requirements grow more slowly than model size, and a 10 TB dataset would already be past the compute-data intersection point they talk about.
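For reference, the overfitting bound I have in mind from Kaplan et al. is roughly

$$D \gtrsim (5 \times 10^3)\, N^{0.74}$$

(tokens as a function of parameter count), so a 100x bigger model only needs about $100^{0.74} \approx 30$x more data; plugging in a GPT-3-scale $N \approx 1.75 \times 10^{11}$ gives on the order of $10^{12}$ tokens, i.e. a few TB of text, if I'm remembering the exponent and constant right.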
Note also that 10 TB of text is an exorbitant amount. Even if there were a model that would hit AGI with, say, a PB of text, but not with 10 TB of text, it would probably also hit AGI with 10 TB of text plus some fairly natural adjustments to its training regime to inhibit overfitting. I wouldn't argue this all the way down to human levels of data, since the human brain has much more embedded structure than we assume for ANNs, but certainly huge models like GPT-3 start to learn new concepts in only a handful of updates, and I expect that trend of greater learning efficiency to continue.
I'm also skeptical that images, video, and such would substantially change the picture. Images are very information sparse. Consider the amount you can learn from 1MB of text, versus 1MB of pixels.
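Rough arithmetic for that comparison (all the per-word and per-pixel figures are just ballpark assumptions):

```python
# Illustrative comparison of 1 MB of text vs 1 MB of raw pixels.
text_bytes = 1_000_000
avg_chars_per_word = 6                 # including the space; rough for English
words = text_bytes / avg_chars_per_word
print(round(words))                    # ~167,000 words -- roughly two novels

pixel_bytes = 1_000_000
bytes_per_pixel = 3                    # uncompressed 8-bit RGB
pixels = pixel_bytes / bytes_per_pixel
print(int(pixels ** 0.5))              # ~577 -- one medium-resolution square image
```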
Correlations among these senses give rise to an understanding of causality. Moreover, human brains might have evolved innate structures for things like causality, agency, objecthood, etc., which don't have to be learned.
Correlation is not causation ;). I think it's plausible that agenthood would help progress towards some of those ideas, but that doesn't much argue for multiple distinct senses. You can find mere correlations just fine with only one.
It's true that even a deafblind person will have mental structures that evolved for sight and hearing, but that's not much of an argument that it's needed for intelligence, and given the evidence (lack of mental impairment in deafblind people), a strong argument seems necessary.
For sure I'll accept that you'll want to train multimodal agents anyway, to round out their capabilities. A deafblind person might still be intellectually capable, but it doesn't mean they can paint.
Fascinating stuff!
I looked through many combinations of the images, and found myself having a distinctive disgust reaction to quite a few, to the point of feeling slightly nauseous. Curious if anyone else has experienced this? I imagine it could be related to the uncanny valley in robotics... which is a little frightening.
Interesting! I didn't feel that at all; I thought things were pretty artsy/aesthetically pleasing on the whole. Any examples of things that felt nauseating?
I agree, many were quite pleasing as well, especially the adorable avocado armchairs and many of the macro photographs. Another personal favorite are the tetrahedra made of fire - they are exactly how I would picture Sauron, if Tolkien had described him as a tetrahedron.
The nauseating ones included:
On another note, an example I found really impressive was how every other country I looked at had only generic stadium images, but China's were instantly recognizable as the Bird's Nest from the Olympics.
(But I wonder if the architect of the Bird's Nest would look at those images and say, those beams would never support the weight of the structure! Look, that one's cracked! It's so wrong!)
I've learned to be resilient against AI distortions, but 'octagonal red stop sign' really got me. Which is ironic; you'd think that prompt would be particularly easy for the AI to handle. The other colours and shapes didn't have a strong effect, so I guess the level of familiarity makes a difference.
I think the level of nausea is a function of the amount of meaning that is being distorted, e.g. distorted words, faces, or food have a much stronger effect than warped clock faces or tables. (I would also argue there is more meaning to the shape of a golf club than a clock face.)
I also have this and have had it for a long time, starting with Google DeepDream. (Or perhaps that animation where you stare ahead while on the edges of your field of view a series of faces is shown, which then start to subjectively look nightmarish/like caricatures.) It lessens with exposure, and returns, somewhat weaker, with each new type of generated image. It feels like neurons burning out from overactivation, as though I was Dracula being shown a cross.
This is a linkpost for https://openai.com/blog/dall-e/
My own take: Cool, not super surprising given GPT-3 and Image GPT. I look forward to seeing what a bigger version of this would do, so that we can get a sense of how much it improves with scale. I'm especially interested in the Raven's Progressive Matrices performance.