I got access to DALL-E 2 earlier this week, and have spent the last few days (probably adding up to dozens of hours) playing with it, with the goal of mapping out its performance in various areas – and, of course, ending up with some epic art. 

Below, I've compiled a list of observations made about DALL-E, along with examples. If you want to request art of a particular scene, or to see what a particular prompt does, feel free to comment with your requests. 

DALL-E's strengths 

Stock photography content 

It's stunning at creating photorealistic content for anything that (this is my guess, at least) has a broad repertoire of online stock images – which is perhaps less interesting because if I wanted a stock photo of (rolls dice) a polar bear, Google Images already has me covered. DALL-E performs somewhat better at discrete objects and close-up photographs than at larger scenes, but it can do photographs of city skylines, or National Geographic-style nature scenes, tolerably well (just don't look too closely at the textures or detailing.) Some highlights: 

  • Clothing design: DALL-E has a reasonable if not perfect understanding of clothing styles, and especially for women's clothes and with the stylistic guidance of "displayed on a store mannequin" or "modeling photoshoot" etc, it can produce some gorgeous and creative outfits. It does especially plausible-looking wedding dresses – maybe because wedding dresses are especially consistent in aesthetic, and online photos of them are likely to be high quality? 
a "toga style wedding dress, displayed on a store mannequin"
  • Close-ups of cute animals. DALL-E can pull off scenes with several elements, and often produce something that I would buy was a real photo if I scrolled past it on Tumblr.
"kittens playing with yarn in a sunbeam"
  • Close-ups of food. These can be a little more uncanny valley – and I don't know what's up with the apparent boiled eggs in there – but DALL-E absolutely has the plating style for high-end restaurants down.
"dessert special, award-winning chef five star restaurant, close-up photograph"
  • Jewelry. DALL-E doesn't always follow the instructions of the prompt exactly (it seems to be randomizing whether the big pendant is amber or amethyst) but the details are generally convincing and the results are almost always really pretty. 
"silver statement necklace with amethysts and an amber pendant, close-up photograph"

 

Pop culture and media 

DALL-E "recognizes" a wide range of pop culture references, particularly for visual media (it's very solid on Disney princesses) or for literary works with film adaptations like Tolkien's LOTR. For almost all media that it recognizes at all, it can convert it into almost-arbitrary art styles. 

"art nouveau stained glass window depicting Marvel's Captain America"
"Elsa from Frozen, cross-stitched sampler"
"Sesame Street, screenshots from the miyazaki anime movie"

[Tip: I find I get more reliably high-quality images from the prompt "X, screenshots from the Miyazaki anime movie" than just "in the style of anime". I suspect this is because Miyazaki has a consistent style, whereas anime more broadly is probably pulling in a lot of poorer-quality anime art.]

Art style transfer

Some of the most impressively high-quality output involves specific artistic styles. DALL-E can do charcoal or pencil sketches, paintings in the style of various famous artists, and some weirder stuff like "medieval illuminated manuscripts". 

"a monk riding a snail, medieval illuminated manuscript"

IMO it performs especially well with art styles like "impressionist watercolor painting" or "pencil sketch", that are a little more forgiving around imperfections in the details.  

"A woman at a coffeeshop working on her laptop and wearing headphones, painting by Alphonse Mucha"
"a little girl and a puppy playing in a pile of autumn leaves, photorealistic charcoal sketch"

 

Creative digital art

DALL-E can (with the right prompts and some cherrypicking) pull off some absolutely gorgeous fantasy-esque art pieces. Some examples: 

"a mermaid swimming underwater, photorealistic digital art"
"a woman knitting the Milky Way galaxy into a scarf, photorealistic digital art"

The output when putting in more abstract prompts (I've run a lot of "[song lyric or poetry line], digital art" requests) is hit-or-miss, but with patience and some trial and error, it can pull out some absolutely stunning – or deeply hilarious – artistic depictions of poetry or abstract concepts. I kind of like using it in this way because of the sheer variety; I never know where it's going to go with a prompt. 

"an activist destroyed by facts and logic, digital art"
"if the lord won't send us water, well we'll get it from the devil, digital art"
"For you are made of nebulas and novas and night sky You're made of memories you bury or live by, digital art" (lyric from Never Look Away by Vienna Teng)

The future of commercials 

This might be just a me thing, but I love almost everything DALL-E does with the prompt "in the style of surrealism" – in particular, its surreal attempt at commercials or advertisements. If my online ads were 100% replaced by DALL-E art, I would probably click on at least 50% more of them. 

"an advertisement for sound-cancelling headphones, in the style of surrealism"

DALL-E's weaknesses

I had been really excited about using DALL-E to make fan art of fiction that I or other people have written, and so I was somewhat disappointed at how much it struggles to do complex scenes according to spec. In particular, it still has a long way to go with:

Scenes with two characters 

I'm not kidding. DALL-E does fine at giving one character a list of specific traits (though if you want pink hair, watch out, DALL-E might start spamming the entire image with pink objects). It can sometimes handle multiple generic people in a crowd scene, though it quickly forgets how faces work. However, it finds it very challenging to keep track of which traits ought to belong to a specific Character A versus a different specific Character B, beyond a very basic minimum like "a man and a woman." 

The above is one iteration of a scene I was very motivated to figure out how to depict, as a fan art of my Valdemar rationalfic. DALL-E can handle two people, check, and a room with a window and at least one of a bed or chair, but it's lost when it comes to remembering which combination of age/gender/hair color is in what location. 

"a young dark-haired boy resting in bed, and a grey-haired older woman sitting in a chair beside the bed underneath a window with sun streaming through, Pixar style digital art"

Even in cases where the two characters are pop culture references that I've already been able to confirm the model "knows" separately – for example, Captain America and Iron Man – it can't seem to help blending them together. It's as though the model has "two characters" and then separately "a list of traits" (user-specified or just implicit in the training data), and reassigns the traits mostly at random.

"Captain America and Iron Man standing side by side" which is which????

Foreground and background

A good example of this: someone on Twitter had commented that they couldn't get DALL-E to provide them with "Two dogs dressed like roman soldiers on a pirate ship looking at New York City through a spyglass". I took this as a CHALLENGE and spent half an hour trying; I, too, could not get DALL-E to output this, and ended up needing to choose between "NYC and a pirate ship" or "dogs in Roman soldier uniforms with spyglasses". 

DALL-E can do scenes with generic backgrounds (a city, bookshelves in a library, a landscape) but even then, if that's not the main focus of the image then the fine details tend to get pretty scrambled. 

Novel objects, or nonstandard usages 

Objects that are not something it already "recognizes." DALL-E knows what a chair is. It can give you something that is recognizably a chair in several dozen different art mediums. It could not with any amount of coaxing produce an "Otto bicycle", which my friend specifically wanted for her book cover. Its failed attempts were both hilarious and concerning. 

prompt was something like "a little girl with dark curly hair riding down a barren hill on a magical rickshaw with enormous bicycle wheels, in the style of Bill Watterson"
An actual Otto bicycle, per Google Images

Objects used in nonstandard ways. It seems to slide back toward some kind of ~prior; when I asked it for a dress made of Kermit plushies displayed on a store mannequin, it repeatedly gave me a Kermit plushie wearing a dress. 

"Dress made out of Kermit plushies, displayed on a store mannequin"

DALL-E generally seems to have extremely strong priors in a few areas, which end up being almost impossible to shift. I spent at least half an hour trying to convince it to give me digital art of a woman whose eyes were full of stars (no, not the rest of her, not the background scenery either, just her eyes...) and the closest DALL-E ever got was this.

I wanted: the Star-Eyed Goddess
I got: the goddess-eyed goddess of recursion

Spelling

DALL-E can't spell. It really really cannot spell. It will occasionally spell a word correctly by utter coincidence. (Okay, fine, it can consistently spell "STOP" as long as it's written on a stop sign.) 

It does mostly produce recognizable English letters (and recognizable attempts at Chinese calligraphy in other instances), and letter order that is closer to English spelling than to a random draw from a bag of Scrabble letters, so I would guess that even given the new model structure that makes DALL-E 2 worse at spelling than the first DALL-E, just scaling it up some would eventually let it crack spelling.

At least sometimes its inability to spell results in unintentionally hilarious memes? 

EmeRAGEencey!

Realistic human faces

My understanding is that the face model limitation may have been deliberate to avoid deepfakes of celebrities, etc. Interestingly, DALL-E can nonetheless at least sometimes do perfectly reasonable faces, either as photographs or in various art styles, if they're the central element of a scene. (And it keeps giving me photorealistic faces as a component of images where I wasn't even asking for that, meaning that per the terms and conditions I can't share those images publicly.)

Even more interestingly, it seems to specifically alter the appearance of actors even when it clearly "knows" a particular movie or TV show. I asked it for "screenshots from the second season of Firefly", and they were very recognizably screenshots from Firefly in terms of lighting, ambiance, scenery etc, with an actor who looked almost like Nathan Fillion – as though cast in a remake that was trying to get it fairly similar – and who looked consistently the same across all 10 images, but was definitely a different person. 

There are a couple of specific cases where DALL-E seems to "remember" how human hands work. The ones I've found so far mostly involve a character doing some standard activity using their hands, like "playing a musical instrument." Below, I was trying to depict a character from A Song For Two Voices who's a Bard; this round came out shockingly good in a number of ways, but the hands particularly surprised me. 

Limitations of the "edit" functionality 

DALL-E 2 offers an edit functionality – if you mostly like an image except for one detail, you can highlight an area of it with a cursor, and change the full description as applicable in order to tell it how to modify the selected region. 

It sometimes works - this gorgeous dress (didn't save the prompt, sorry) originally had no top, and the edit function successfully added one without changing the rest too much.

This is how people will dress in the glorious transhumanist future. 

It often appears to do nothing. It occasionally full-on panics and does... whatever this is. 

I was just trying to give the figure short hair!

There's also a "variations" functionality that lets you select the best image given by a prompt and generate near neighbors of it, but my experience so far is that the variations are almost invariably less of a good fit for the original prompt, and very rarely better on specific details (like faces) that I might want to fix.

Some art style observations 

DALL-E doesn't seem to hold a sharp delineation between style and content; in other words, adding stylistic prompts actively changes some of what I would consider to be content.

For example, asking for a coffeeshop scene as painted by Alphonse Mucha puts the woman in a long flowing period-style dress, like in this reference painting, and gives us a "coffeeshop" that looks a lot to me like a lady's parlor; in comparison, the Miyazaki anime version mostly has the character in a casual sweatshirt. This makes sense given the way the model was trained; background details are going to be systematically different between Art Nouveau paintings and anime movies. 

"A woman at a coffeeshop working on her laptop and wearing headphones, painting by Alphonse Mucha"
"A woman at a coffeeshop working on her laptop and wearing headphones, screenshots from the miyazaki anime movie"

DALL-E is often sensitive to exact wording, and in particular it's fascinating how "in the style of x" often gets very different results from "screenshot from an x movie". I'm guessing that in the Pixar case, generic "Pixar style" might capture training data from Pixar shorts or illustrations that aren't in their standard recognizable movie style. (Also, sometimes if asked for "anime" it gives me content that either looks like 3D rendered video game cutscenes, or occasionally what I assume is meant to be people at an anime con in cosplay.) 

"A woman at a coffeeshop working on her laptop and wearing headphones, screenshots from the Pixar movie"
"A woman at a coffeeshop working on her laptop and wearing headphones, in the style of Pixar"
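
For anyone who wants to run the same comparison on other scenes or studios, here is a minimal sketch that just builds the two phrasings side by side. The helper name `prompt_variants` is my own invention for illustration; it only produces prompt strings and does not call DALL-E or any API.

```python
def prompt_variants(scene: str, source: str) -> dict[str, str]:
    """Build the two phrasings compared above for one scene and one style source.

    `scene` is the content description; `source` is the studio or style name.
    Both the function and its key names are hypothetical, for illustration only.
    """
    return {
        "screenshots": f"{scene}, screenshots from the {source} movie",
        "style": f"{scene}, in the style of {source}",
    }


scene = "A woman at a coffeeshop working on her laptop and wearing headphones"
for label, prompt in prompt_variants(scene, "Pixar").items():
    print(f"{label}: {prompt}")
```

Pasting both outputs into DALL-E with the same scene makes it easier to see how much of the difference comes from the wording alone.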

Conclusions

How smart is DALL-E? 

I would give it an excellent grade in recognizing objects, and most of the time it has a pretty good sense of their purpose and expected context. If I give it just the prompt "a box, a chair, a computer, a ceiling fan, a lamp, a rug, a window, a desk" with no other specification, it consistently includes at least 7 of the 8 requested objects, and places them in reasonable relation to each other – and in a room with walls and a floor, which I did not explicitly ask for. This "understanding" of objects is a lot of what makes DALL-E so easy to work with, and in some sense seems more impressive than a perfect art style. 

The biggest thing I've noticed that looks like a ~conceptual limitation in the model is its inability to consistently track two different characters, unless they differ on exactly one trait (male and female, adult and child, red hair and blue hair, etc) – in which case the model could be getting this right if all it's doing is randomizing the traits in its bucket between the characters. It seems to have a similar issue with two non-person objects of the same type, like chairs, though I've explored this less. 

It often applies color and texture styling to parts of the image other than the ones specified in the prompt; if you ask for a girl with pink hair, it's likely to make the walls or her clothes pink, and it's given me several Rapunzels wearing a gown apparently made of hair. (Not to mention the time it was confused about whether, in "Goldilocks and the three bears", Goldilocks was also supposed to be a bear.) 

The deficits with the "edit" mode and "variations" mode also seem to me like they reflect the model failing to neatly track a set of objects-with-assigned-traits. It reliably holds the non-highlighted areas of the image constant and only modifies the selected part, but the modifications often seem like they're pulling in context from the entire prompt – for example, when I took one of my room-with-objects images and tried to select the computer and change it to "a computer levitating in midair", DALL-E gave me a levitating fan and a levitating box instead. 

Working with DALL-E definitely still feels like attempting to communicate with some kind of alien entity that doesn't quite reason in the same ontology as humans, even if it theoretically understands the English language. There are concepts it appears to "understand" in natural language without difficulty – including prompts like "advertising poster for the new Marvel's Avengers movie, as a Miyazaki anime, in the style of an Instagram inspirational moodboard", which would take so long to explain to aliens, or even just to a human from 1900. And yet, you try to explain what an Otto bicycle is – something which I'm pretty sure a human six-year-old could draw if given a verbal description – and the conceptual gulf is impossible to cross. 

"advertising poster for the new Marvel's Avengers movie, as a Miyazaki anime, in the style of an Instagram inspirational moodboard"
What DALL-E 2 can and cannot do
[-]gwern1100

Swimmer963 highlights DALL-E 2 struggling with anime, realistic faces, text in images, multiple characters/objects arranged in complex ways, and editing. (Of course, many of these are still extremely good by the standards of just months ago, and the glass is definitely more than half full.) itsnotatumor asks:

How many of these "cannot do's" will be solved by throwing more compute and training data at the problem? Anyone know if we've started hitting diminishing returns with this stuff yet?

In general, we have not topped out on pretty much any scaling curve. Whether it's language modeling, image generation, DRL, or whathaveyou, AFAIK, not a single modality can be truly said to have been 'solved' with the scaling curve broken. Either the scaling curve is flat, or we're still far away. (There are some sound-related ones which seem to be close, but nothing all that important.) Diffusion models' only scaling law I know of is an older one which bends a little but probably reflects poor hyperparameters, and no one has tried eg. Chinchilla on them yet.

So yes, we definitely can just make all the compute-budgets 10x larger without wasting it.

To go through the specific issues (caveat: we do... (read more)

Google Brain just announced Imagen (Twitter), which on skimming appears to be not just as good as DALL-E 2 but convincingly better. The main change appears to be reducing the CLIP reliance in favor of a much larger and more powerful text encoder before doing the image diffusion stuff. They make a point of noting superiority on "compositionality, cardinality, spatial relations, long-form text, rare words, and challenging prompts." The samples also show text rendering fine inside the images as well.

I take this as strong support (already) for my claims 2-3: the problems with DALL-E 2 were not major or deep ones, do not require any paradigm shift to fix, or even any fix, really, beyond just scaling the components almost as-is. (In Kuhnian terms, the differences between DALL-E 2 and Imagen or Make-A-Scene are so far down in the weeds of normal science/engineering that even people working on image generation will forget many of the details and have to double-check the papers.)

EDIT: Google also has a more traditional autoregressive DALL-E-1-style 1024px model, "Parti", competing with diffusion Imagen; it is slightly better in COCO FID than Imagen. It likewise does well on all those issues, with again no special fancy engineering aimed specifically at those issues, mostly just scaling up to 20b.

Will future generative models choke to death on their own excreta? No.

Now that goalposts have moved from "these neural nets will never work and that's why they're bad" to "they are working and that's why they're bad", a repeated criticism of DALL·E 2 etc is that their deployment will 'pollute the Internet' by democratizing high-quality media, which may (given all the advantages of machine intelligence) quickly come to exceed 'regular' (artisanally-crafted?) media, and that ironically this will make it difficult or impossible to train better models. I don't find this plausible at all but lots of people seem to and no one is correcting all these wrong people on the Internet, so here's a quick rundown why:

  1. It Hasn't Happened Yet: there is no such thing as 'natural' media on the Internet, and never has been. Even a smartphone photograph is heavily massaged by a pipeline of algorithms (increasingly DL-based) before it is encoded with a codec such as JPEG, which is designed to throw away human-perceptually-unimportant data. We are saturated in all sorts of Photoshopped, CGIed, video-game-rendered, Instagram-filtered, airbrushed, posed, lighted, (extremely heavily) curated media. If these models

... (read more)
2lc
Can I get a link to someone who actually believes this? I'm honestly a little skeptical this is a common opinion, but wouldn't put it past people I guess.

I've seen it several times on Twitter, Reddit, and HN, and that's excluding people like Jack Clark, who has pondered it repeatedly in his Import.ai newsletter & used it as a theme in some of his short stories (but much more playfully & thoughtfully in his case so he's not the target here). I think probably the one that annoyed me enough to write this was when Imagen hit HN and the second lengthy thread was all about 'poisoning the well' with most of them accepting the premise. It has also been asked here on LW at least twice in different places. (I've also since linked this writeup at least 4 times to various people asking this exact question about generative models choking on their own exhaust, and the rise of ChatGPT has led to it coming up even more often.)

Wow, this is going to explode picture books and book covers.

Hiring an illustrator for a picture book costs a lot, as it should given it's bespoke art.

Now publishers will have an editor type in page descriptions, curate the best and off they go. I can easily imagine a model improvement to remember the boy drawn or steampunk bear etc.

Book cover designers are in trouble too. A wizard with lightning in his hands while a mountain explodes behind him - this can generate multiple options.

It's going to get really wild when A/B split testing is involved. As you mention regarding ads you'd give the system the power to make whatever images it wanted and then split test. Letting it write headlines would work too.

Perhaps a full animated movie down the line. There are already programs that fill in gaps for animation poses. Boy running across field chased by robot penguins - animated, eight seconds. And so on. At that point it's like Pixar in a box. We'll see an explosion of directors who work alone, typing descriptions, testing camera angles, altering scenes on the fly. Do that again but more violent. Do that again but with more blood splatter.

Animation in the style of Family Guy seems a natural first ... (read more)

Perhaps a full animated movie down the line. There are already programs that fill in gaps for animation poses. Boy running across field chased by robot penguins - animated, eight seconds.

Video is on the horizon (video generation bibliography eg. FDM), in the 1-3 year range. I would say that video is solved conceptually in the sense that if you had 100x the compute budget, you could do DALL-E-2-but-for-video right now already. After all, if you can do a single image which is sensible and logical, then a video is simply doing that repeatedly. Nor is there any shortage of video footage to work with. The problem there is that a video is a lot of images: at least 24 images per second, so you could have 192 different samples, or 1 8s clip. Most people will prefer the former: decorating, say, a hundred blog posts with illustrations is more useful than a single OK short video clip of someone dancing.
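
The frame-budget arithmetic in that comparison is easy to check. The sketch below assumes, as a deliberate simplification, that generating one video frame costs the same as generating one standalone image; the function name is illustrative only and is not taken from any paper or library.

```python
FRAMES_PER_SECOND = 24  # the "at least 24 images per second" figure quoted above


def frames_for_clip(seconds: float, fps: int = FRAMES_PER_SECOND) -> int:
    """Number of individual frames needed for a clip of the given length."""
    return round(seconds * fps)


# An 8-second clip uses the same frame budget as this many standalone images.
clip_frames = frames_for_clip(8)
print(clip_frames)  # 192
```

Under that assumption, the choice really is between 192 separate illustrations and one 8-second clip, which is the tradeoff described above.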

So video's game is mostly about whether you can come up with an approach which can somehow economize on that, like clever tricks in reusing frames to update only a little while updating a latent vector, as a way to take a shortcut to that point in the future where you had so much compute tha... (read more)

2Sable
Following up on your logic here, the one thing that DALLE-2 hasn't done, to my knowledge, is generate entirely new styles of art, the way that art deco or pointillism were truly different from their predecessors. Perhaps that'll be the new role of human illustrators? Artists, instead of producing their own works to sell, would instead create their own styles, generating libraries of content for future DALLEs to be trained against. They then make a percentage on whatever DALLE makes from image sales if the style used was their own.

Can DALL·E Create New Styles?

Most DALL·E questions can be answered by just reading the paper of it or its competitors, or are dumb. This is probably the most interesting question that can't be, and also one of the most common: can DALL·E (which we'll use just as a generic representative of image generative models, since no one argues that one arch or model can and the others cannot AFAIK) invent a new style? DALL·E is, like GPT-3 in text, admittedly an incredible mimic of many styles, and appears to have gone well beyond any mere 'memorization' of the images depicting styles because it can so seamlessly insert random objects into arbitrary styles (hence all the "Kermit Through The Ages" or "Mughal space rocket" variants); but simply being a gifted epigone of most existing styles is no guarantee you can create a new one.

If we asked a Martian what 'style' was, it would probably conclude that "'style' is what you call it when some especially mentally-ill humans output the same mistakes for so long that other humans wearing nooses try to hide the defective output by throwing small pieces of green paper at the outputs, and a third group of humans wearing dresses try to exchange large ... (read more)

6gwern
An interesting example of what might be a 'name-less style' in a generative image model, Stable Diffusion in this case (DALL-E 2 doesn't give you the necessary access so users can't experiment with this sort of thing): what the discoverer calls the "Loab" (mirror) image (for lack of a better name - what text prompt, if any, this image corresponds to is unknown, as it's found by negation of a text prompt & search). 'Loab' is an image of a creepy old desaturated woman with ruddy cheeks in a wide face, which when hybridized with other images, reliably induces more images of her, or recognizably in the 'Loab style' (extreme levels of horror, gore, and old women). This is a little reminiscent of the discovered 'Crungus' monster, but 'Loab style' can happen, they say, even several generations of image breeding later when any obvious part of Loab is gone - which suggests to me there may be some subtle global property of descendant images which pulls them back to Loab-space and makes it 'viral', if you will. (Some sort of high-frequency non-robust or adversarial or steganographic phenomenon?) Very SCP. Apropos of my other comments on weird self-fulfilling prophecies and QAnon and stand-alone-complexes, it's also worth noting that since Loab is going viral right now, Loab may be a name-less style now, but in future image generator models feeding on the updating corpus, because of all the discussion & sharing, it (like Crungus) may come to have a name - 'Loab'.
1alexlyzhov
I wonder what happens when you ask it to generate > "in the style of a popular modern artist <unknown name>" or > "in the style of <random word stem>ism". You could generate both types of prompts with GPT-3 if you wanted so it would be a complete pipeline. "Generate conditioned on the new style description" may be ready to be used even if "generate conditioned on an instruction to generate something new" is not. This is why a decomposition into new style description + image conditioned on it seems useful. If this is successful, then more of the high-level idea generation involved can be shifted onto a language model by letting it output a style description. Leave blanks in it and run it for each blank, while ensuring generations form a coherent story. >"<new style name>, sometimes referred to as <shortened version>, is a style of design, visual arts, <another area>, <another area> that first appeared in <country> after <event>. It influenced the design of <objects>, <objects>, <more objects>. <new style name> combined <combinatorial style characteristic> and <another style characteristic>. During its heyday, it represented <area of human life>, <emotion>, <emotion> and <attitude> towards <event>." ---------------------------------------- DALL-E can already model the distribution of possible contexts (image backgrounds, other objects, states of the object) + possible prompt meanings. And go from the description 1) to high-level concepts, 2) to ideas for implementing these concepts (relative placement of objects, ideas for how to merge concepts), 3) to low-level details. All within 1 forward pass, for all prompts! This is what astonished me most about DALL-E 1. Importantly, placing, implementing, and combining concepts in a picture is done in a novel way without a provided specification. For style generation, it would need to model a distribution over all possible styles and use each style, all without a style specification. This doesn't seem much harder to me an
6Swimmer963 (Miranda Dixon-Luinenburg)
...Hmm now I'm wondering if feeding DALL-E an "in the style of [ ]" request with random keywords in the blank might cause it to do replicable weird styles, or if it would just get confused and do something different every time. 
3Sable
I'd love to see it tried.  Maybe even ask for "in the style of DALLE-2"?
8Swimmer963 (Miranda Dixon-Luinenburg)
"A woman riding a horse, in the style of DALLE-2"
1Sable
I have no idea how to interpret this.  Any ideas? It seems like we got a variety of different styles, with red, blue, black, and white as the dominant colors. Can we say that DALLE-2 has a style of its own?

I think DALL-E has been nerfed (as a sort of low-grade "alignment" effort) and some of what you're talking about as "limitations" are actually bugs that were explicitly introduced with the goal of avoiding bad press.

OpenAI has made efforts to implement model-level technical mitigations that ensure that DALL·E 2 Preview cannot be used to directly generate exact matches for any of the images in its training data. However, the models may still be able to compose aspects of real images and identifiable details of people, such as clothing and backgrounds. (sauce)

It wouldn't surprise me if they just used intelligibility tools to find the part of the vectorspace that represents "the face of any famous real person" and then applied some sort of noise blur to the model itself, as deployed?

Except! Maybe not a "blur" but some sort of rotation of a subspace or something? This hint is weirdly evocative:

they were very recognizably screenshots from Firefly in terms of lighting, ambiance, scenery etc, with an actor who looked almost like Nathan Fillion – as though cast in a remake that was trying to get it fairly similar – and who looked consistently the same across all 10 images, but was definite

... (read more)
7gwern
Yes, I thought their 'horse in ketchup' example made the point well that it's an 'artificial stupidity' Harrison-Bergeron sort of approach rather than a genuine solution. (And then, like BPEs, there seems to be unpredictable fallout which would be hard to benchmark and which no one apparently even thought to benchmark - despite whatever they did on May 1st to upgrade quality, the anime examples still struggle to portray specific characters like Kyuubey, where Swimmer's examples are all very Kyuubey-esque but never actually Kyuubey. I am told the CLIP used is less degraded, and so we're probably seeing the output of 'CLIP models which know about characters like Kyuubey combined with other models which have no idea'.)

Thread of all known anime examples.

whereas anime more broadly is probably pulling in a lot of poorer-quality anime art...(Also, sometimes if asked for “anime” it gives me content that either looks like 3D rendered video game cutscenes, or occasionally what I assume is meant to be people at an anime con in cosplay.)

That's how you know it's not a problem of pulling in lots of poorer-quality anime art. First, poorer-quality doesn't impede learning that much; remember, you just prompt for high-quality. Compute allowing, more n is always better. And second, if it was a master of poorer-quality anime drawings, it wouldn't be desperately 'sliding away', if you will, like squeezing a balloon, from rendering true anime, as opposed to CGI of anime or Western fanart of anime or photographs of physical objects related to anime. It would just do it (perhaps generating poorer-quality anime), not generate high-quality samples of everything but anime. (See my comment there for more examples.)

The problem is it's somehow not trained on anime. Everything it knows about anime seems to come primarily from adjacent images and the CLIP guidance (which does know plenty about anime, but we also know that pixel generation from CLIP guidance never works as well).

A prompt I'd love to see: "Anomalocaris Canadensis flying through space." I'm really curious how well it does with an extinct species that has very few existing artistic depictions. No text->image model I've played with so far has managed to create a convincing anomalocaris, but one interestingly did know it was an aquatic creature and kept outputting lobsters.

Going by the Wikipedia page reference, I think it got it somewhat closer than "lobsters" at least? 

I'd rate these highly. There are many forms of anomalocarids (https://en.m.wikipedia.org/wiki/Radiodonta#/media/File%3A20191201_Radiodonta_Amplectobelua_Anomalocaris_Aegirocassis_Lyrarapax_Peytoia_Laggania_Hurdia.png), and it looks to have picked a wide variety aside from just canadensis, but I'm thoroughly impressed that it got the form right in nearly all 10.

Challenging prompt ideas to try:

  • A row of five squares, in which the rightmost four squares each have twice the area of the square to their immediate left.
  • Screenshots from a novel game comparable in complexity to tic-tac-toe sufficient to demonstrate the rules of the game.
  • Elon Musk signing his own name in ASL.
  • The hands of a pianist as they play the first chord from Chopin's Polonaise in Ab major, Op. 53
  • Pages from a flip book of a water glass spilling.
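
For reference on the first prompt, here is a rough sketch of what a correct image would have to satisfy. Doubling the area of a square multiplies its side by √2, so the fifth square ends up four times as wide as the first. The function name and the starting side of 1.0 are just illustrative choices, not anything from DALL-E.

```python
import math

def doubling_square_sides(first_side: float, count: int = 5) -> list[float]:
    """Side lengths for a row of squares, each with twice the area of its left neighbour."""
    # Doubling the area multiplies the side length by sqrt(2).
    return [first_side * math.sqrt(2) ** i for i in range(count)]

sides = doubling_square_sides(1.0)
areas = [s * s for s in sides]

for side, area in zip(sides, areas):
    print(f"side = {side:.3f}, area = {area:.3f}")
```

So a passing image needs both the right count and a fairly specific size progression, which is a lot to ask of a model that struggles with the count alone.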

First one: ....yeah no, DALL-E 2 can't count to five, it definitely doesn't have the abstract reasoning to double areas. Image below is literally just "a horizontal row of five squares". 

6DirectedEvolution
Very interesting that it can't manage to count to five. That to me is strong evidence that DALL-E's not "constructing" the scenes it depicts. I guess it has more of a sense of relationships among scene element components? Like, "coffee shop" means there's a window-like element, and if there's a window element, then there's some sort of scene through the window, and that's probably some sort of rectangular building shape. Plausible guesses all the way down to the texture and color of skin or fur. Filling in the blanks on steroids, but with a complete lack of design or forethought. 

Yeah, this matches with my sense. It has a really extensive knowledge of the expected relationships between elements, extending over a huge number of kinds of objects, and so it can (in one of the areas that are easy for it) successfully fill in the blanks in a way that looks very believable, but the extent to which it has a gears-y model of the scene seems very minimal. I think this also explains its difficulty with non-stereotypical scenes that don't have a single focal element – if it's filling in the blanks for both "pirate ship scene" and "dogs in Roman uniforms scene" it gets more confused. 

4DirectedEvolution
You're making my dreams come true. I really want to see the Elon Musk one :) Edit: or the waterglass spilling. That's the one with my most uncertainty about its performance.

The Elon Musk one has realistic faces so I can't share it; I have, however, confirmed that DALL-E does not speak ASL with "The ASL word for "thank you"":

6DirectedEvolution
We've got some funky fingers here. Six fingers, a sort of double-tipped finger, an extra joint on the index finger in picture (1, 4). Fascinating.
2Measure
It seems to be mostly trying to go for the "I love you" sign, perhaps because that's one of the most commonly represented ones.
1jasperdale
I'm curious why this prompt resulted in overwhelmingly black looking hands. Especially considering that all the other prompts I see result in white subjects being represented. Any theories?
5gwern
It's unnatural, yes: ASL is predominantly white, and people involved in ASL are even more so (I went to NTID and the national convention, so can speak first-hand, but you can also check Google Image for that query and it'll look like what you expect, which is amusing because 'Deaf' culture is so university & liberal-centric). So it's not that ASL diagrams or photographs in the wild really do look like that - they don't. Overrepresentation of DEI material in the supersekrit licensed databases would be my guess. Stock photography sources are rapidly updated for fashions, particularly recent ones, and you can see this occasionally surfacing in weird queries. (An example going around Twitter which you can check for yourself: "happy white woman" in Google will turn up a lot of strange photos for what seems like a very easy straightforward query.) Which parts are causing it is a better question: I wouldn't expect there to be much Deaf stock photo material which had been updated, or much ASL material at all, so maybe there's bleedthrough from all of the hand-centric (eg 'Black Power salute', upraised Marxist fists, protests) iconography? There being so much of the latter and so little of the former that the latter becomes the default kind of hand imagery.
3jasperdale
It must be something like that, but it still feels like there's a hole there.  The query is for "ASL", not "Hands", and these images don't look like something from a protest. The top left might be vaguely similar to some kind of street gesture.  I'm curious what the role of the query writer is. Can you ask DALL-E for "this scene, but with black skin colour"? I got a sense that updating areas was possible but inconsistent. Could DALL-E learn to return more of X to a given person by receiving feedback? I really don't know how complicated the process gets.
2gwern
ASL will always be depicted by a model like DALL-E as hands; I am sure that there are non-pictorial ways to write down ASL but I can't recall them, and I actually took ASL classes. So that query should always produce hands in it. Then because actual ASL diagrams will be rare and overwhelmed by leakage from more popular classes (keep in mind that deafness is well under 1% of the US population, even including people like me who are otherwise completely uninvolved and invisible, and basically any political fad whatsoever will rapidly produce vastly more material than even core deaf topics), and maybe some more unCLIP looseness...
7gwern
OA announced its new 'reducing bias' DALL-E 2 today. Interestingly, it appears to do so by secretly editing your prompt to inject words like 'black' or 'female'.

"Pages from a flip book of a water glass spilling" I...think DALL-E 2 does not know what a flip book is. 

9Swimmer963 (Miranda Dixon-Luinenburg)
I...think it just does not understand the physics of water spilling, period. 
7Swimmer963 (Miranda Dixon-Luinenburg)
Relatedly, DALL-E is a little confused about how Olympic swimming is supposed to work.
5DirectedEvolution
This is interesting, because you'd think it would at least understand that the cup should be tipping over. Makes me think it is considering the cup and the water as two distinct objects, and doesn't really understand that the cup tipping over would be what causes the water to spill. But it does understand that the water should be located "inside" the cup, but probably purely in a "it looks like the water is inside the cup" sense. I don't think DALL-E seems to understand the idea of "inside" as an actual location.
1Nazarii
I wonder if its understanding of the world is just 2D or semi-3D. Perhaps training it on photogrammetry datasets (photos of the same objects but from multiple points of view) would improve that?

Slightly reworded to "a game as complex tic-tac-toe, screenshots showing the rules of the game", I am pretty sure DALL-E is not able to generate and model consistent game rules though. 

3DirectedEvolution
At least it seems to have figured out we wanted a game that was not tic-tac-toe.
6Charlie Steiner
Depends on if it generates stuff like this if you ask it for tic-tac-toe :P
1kjz
What about the combo: a tic-tac-toe board position, a tic-tac-toe board position with X winning, and a tic-tac-toe board position with O winning. Would it give realistic positions matching the descriptions?
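
If anyone wants to score the results of this by hand, a small checker makes "realistic position" concrete. This is only a sketch under my own assumptions: the board is transcribed manually from the image as a 9-character string (rows left to right, top to bottom, "." for empty), and X moves first. The names `winners` and `is_reachable` are made up for illustration.

```python
# Index triples for the eight winning lines on a 3x3 board.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]

def winners(board: str) -> set[str]:
    """Return the players ('X', 'O') that have three in a row on the board."""
    return {
        board[a]
        for a, b, c in LINES
        if board[a] in "XO" and board[a] == board[b] == board[c]
    }

def is_reachable(board: str) -> bool:
    """Check that a position could arise in a legal game, assuming X moves first."""
    x, o = board.count("X"), board.count("O")
    if len(board) != 9 or x - o not in (0, 1):
        return False
    won = winners(board)
    if won == {"X"}:
        return x == o + 1  # X must have made the last move
    if won == {"O"}:
        return x == o      # O must have made the last move
    return not won         # nobody has won; both winning at once is impossible

print(is_reachable("XXXOO...."), winners("XXXOO...."))
```

A generated board would count as matching "X winning" if `is_reachable` is true and `winners` returns only X.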
2Swimmer963 (Miranda Dixon-Luinenburg)
I really doubt it but I'll give it a try once I'm caught up on all the requested prompts here! 

Thanks for this thorough account. The bit where you tried to shorten the hair really made me laugh.

DALL-E is often sensitive to exact wording, and in particular it’s fascinating how “in the style of x” often gets very different results from “screenshot from an x movie”. I’m guessing that in the Pixar case, generic “Pixar style” might capture training data from Pixar shorts or illustrations that aren’t in their standard recognizable movie style.

I've seen this prompt programming bug noted on Twitter by DALL-E 2 users as well. With earlier models, there didn't seem to be that much difference between 'by X' vs 'in the style of X', but with the new high-e...

Thanks for that awesome summary.
I tried to generate characters (Dark Elf / Drow), magic items, and scenes in a Dungeons & Dragons or Magic: The Gathering style, like so many cool images on Pinterest:
https://www.pinterest.fr/rbarlow177/dd-character-art/

It was very, very difficult!
- The character style is very crappy, like old Google Search clipart
- Some "technical terms" like Dark Elf or Drow match nothing

The idea was to generate medieval fantasy art for a card game like Magic, but it's very hard to get something good. I failed after 30+ attempts.

This is great! I'm generally most interested to see people finding weaknesses of new DL tools, which in and of itself is a sign of how far the technology has progressed.

I'm having real trouble finding out about Dall E and copyright infringement.  There are several comments about how Dall E can "copy a style" without it being a violation to the artist, but seriously, I'm appalled.  I'm even having trouble looking at some of the images without feeling "the death of artists."  It satisfies the envy of anyone who ever wanted to do art without making the effort, but on whose backs?  Back in the day, we thought that open source would be good advertising, but there is NO reference to any sources.  I'm a...

2Daphne_W
Sorry that automation is taking your craft. You're neither the first nor the last this will happen to. Orators, book illuminators, weavers, portrait artists, puppeteers, cartoon animators, etc. Even just in the artistic world, you're in fine company. Generally speaking, it's been good for society to free up labor for different pursuits while preserving production. The art can even be elevated as people incorporate the automata into their craft. It's a shame the original skill is lost, but if that kept us from innovating, there would be no way to get common people multiple books or multiple pictures of themselves or CGI movies. It seems fair to demand society have a way to support people whose jobs have been automated, at least until they can find something new to do. But don't get mad at the engine of progress and try to stop it - people will just cheer as it runs you over.
4abramdemski
It's not just a question of automation eliminating skilled work. Deep learning uses the work of artists in a significant sense. There is a patchwork of law and social norms in place to protect artists, EG, the practice of explicitly naming major inspirations for a work. This has worked OK up to now, because all creative re-working of other art has either gone through relatively simple manipulation like copy/paste/caption/filter, or thru the specific route of the human mind taking media in and then producing new media output which takes greater or smaller amounts of inspiration from media consumed.  AI which learns from large amounts of human-generated content, is legitimately a new category here. It's not obvious what should be legal vs illegal, or accepted vs frowned upon by the artistic community.  Is it more like applying a filter to someone else's artwork and calling it your own? Or is it more like taking artistic inspiration from someone else's work? What kinds of credit are due?
3gbear605
It seems to me that the only thing that seems possible is to treat it like a human that took inspiration from many sources. In the vast majority of cases, the sources of the artwork are not obvious to any viewer (and the algorithm cannot tell you one). Moreover, any given created piece is really the combination of the millions of pieces of the art that the AI has seen, just like how a human takes inspiration from all of the pieces that it has seen. So it seems most similar to the human category, not the simple manipulations (because it isn’t a simple manipulation of any given image or set of images). I believe that you can get the AI to output an image that is similar to an existing one, but a human being can also create artwork that is similar to existing art. Ultimately, I think the only solution to rights protection must be handling it at that same individual level. Another element that needs to be considered is that AI generated art will likely be entirely anonymous before long. Right now, anyone can go to http://notarealhuman.com/ and share the generated face to Reddit. Once that’s freely available with DALL-E 2 level art and better (and I don’t think that’s avoidable at this point), I don’t think any social norms can hinder it. The other option to social norms is to outlaw it. I don’t think that a limited regulation would be possible, so the only possibility would be a complete ban. However, I don’t think all the relevant governments will have the willpower to do that. Even if the USA bans creating image generation AIs like this (and they’d need to do so in the next year or two to stop it from already being widely spread), people in China and Russia will surely develop them within a decade. Determining that the provenance of an artwork is a human rather than an AI seems impossible. Even if we added tracing to all digital art tools, it would still be possible to create an image with an AI, print and scan it, and then claim that you made it yourself. In some
2abramdemski
I agree that this is a plausible outcome, but I don't think society should treat it as a settled question right now. It seems to me like the sort of technology question which a society should sit down and think about.  It is most similar to the human category, yes absolutely, but it enables different things than the human category. The consequences are dramatically different. So it's not obvious a priori that it should be treated legally the same.  You argue against a complete ban by pointing out that not all relevant governments would cooperate. I don't think all governments have to come to the same decision here. Copyright enforcement is already not equal across countries. I'm not saying I think there should be a complete ban, but again, I don't think it's totally obvious either, and I think artists calling for a ban should have a voice in the decision process.  But I also don't agree with your argument that the only two options are a complete ban or treating it exactly like human-generated art. I don't agree with your argument that a requirement to display the closest images from the training data would be useless. I agree that it is easily circumvented, but it does make it much easier to avoid accidental infringement by putting in prompts which happen to be good at pulling out exact duplicates of some datapoint, unbeknownst to you.  I also think it would be within the realm of reasonable possibility to simply apply different legal standards for infringement in the two cases. Perhaps it's fine for human artists to copy a style, but because it's so easy to do it with an AI, it is considered a form of infringement to copy a style that way. IDK, possibly that is a terrible idea, but my point is that it's not clear to me that there are no options at all.

I wonder if you could get it to generate Minecraft screenshots, such as:

  • A log cabin in a clearing in a dark forest, as a screenshot from Minecraft

It would also be interesting to see how “as a screenshot from Minecraft” combines with other styles:

  • A wagon caravan approaches a ruined city in the desert, as a Miyazaki anime, as a screenshot from Minecraft

You could also append “as a screenshot from Minecraft” to more abstract prompts, for example:

  • A machine that harvests luck from four leaf clovers, as a screenshot from Minecraft

Finally, some other miscellaneo...

The "one character" limitation makes it look like DALL-E was spawned from ongoing, massive programs to develop object recognizing systems, not any sort of general generative system. 

Would it be accurate to characterize DALL-E as "basically inverted object recognition"?

My understanding is that the face model limitation may have been deliberate to avoid deepfakes of celebrities, etc. Interestingly, DALL-E can nonetheless at least sometimes do perfectly reasonable faces, either as photographs or in various art styles, if they're the central element of a scene. (And it keeps giving me photorealistic faces as a component of images where I wasn't even asking for that, meaning that per the terms and conditions I can't share those images publicly.)

FWIW, OpenAI just changed the requirements on face samples, loosening it consi...

Prompt from my brother:

What people from 1920 thought 2020 would look like. 1920's Artist's depiction of 2020

4Swimmer963 (Miranda Dixon-Luinenburg)
"What people from 1920 thought 2020 would look like. 1920's Artist's depiction of 2020"

When they released the first Dall-E, didn't OpenAI mention that prompts which repeated the same description several times with slight re-phrasing produced improved results?

I wonder how a prompt like:

"A post-singularity tribesman with a pet steampunk panther robot. Illustration by James Gurney."

-would compare with something like:

"A post-singularity tribesman with a pet steampunk panther robot. Illustration by James Gurney.  A painting of an ornate robotic feline made of brass and a man wearing futuristic tribal clothing.  A steampunk scene by James Gurney featuring a robot shaped like a panther and a high-tech shaman."

5Swimmer963 (Miranda Dixon-Luinenburg)
"A post-singularity tribesman with a pet steampunk panther robot. Illustration by James Gurney."  Vs "A post-singularity tribesman with a pet steampunk panther robot. Illustration by James Gurney.  A painting of an ornate robotic feline made of brass and a man wearing futuristic tribal clothing.  A steampunk scene by James Gurney featuring a robot shaped like a panther and a high-tech shaman." Huh! Yeah, the second one definitely does seem to incorporate more detail.
1artifex0
Thanks! I'm not sure how much the repetitions helped with accuracy for this prompt- it's still sort of randomizing traits between the two subjects.  Though with a prompt this complex, the token limit may be an issue- it might be interesting to test at some point whether very simple prompts get more accurate with repetitions. That said, the second set are pretty awesome- asking for a scene may have helped encourage some more interesting compositions.  One benefit of repetition may just be that you're more likely to include phrases that more accurately describe what you're looking for.
4Shai Noy
Good point. I've also noticed good results for adding multiple details by mentioning each individually. E.g. instead of "tribesman with a blue robe, holding a club, looking angry, with a pet robot tiger" try "A tribesman with a pet tiger. The tribesman wears a blue robe. The tribesman is angry. The tribesman is holding a club. The tiger is a cyberpunk robot."
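
For anyone trying the one-sentence-per-detail approach across many prompts, here is a minimal sketch of a helper that builds such a prompt. The function name and arguments are hypothetical, it only assembles the text, and it makes no claim about how DALL-E actually weighs the sentences.

```python
def one_sentence_per_detail(subject: str, details: list[str]) -> str:
    """Restate the subject in its own short sentence for every detail."""
    sentences = [
        f"The {subject} {detail.strip().rstrip('.')}."
        for detail in details
    ]
    return " ".join(sentences)

# Rebuild the tribesman part of the example prompt from this comment.
prompt = "A tribesman with a pet tiger. " + one_sentence_per_detail(
    "tribesman", ["wears a blue robe", "is angry", "is holding a club"]
)
print(prompt)
```

Keeping the details in a list also makes it easy to drop or reorder one at a time and see which ones the model ignores.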

even if it theoretically understands the English language.

If you mix up a prompt into random words so that it's no longer grammatically correct English, does it give worse results? That is, I wonder how much it's basically just going off keywords.
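
A quick way to set up that test, as a sketch: shuffle the words of a prompt with a fixed seed so the scrambled version is reproducible, then submit both versions by hand and compare. The helper name and the seed are my own choices, and the example prompt is one from the post.

```python
import random

def shuffle_words(prompt: str, seed: int = 0) -> str:
    """Return the prompt with its words in random order (same keywords, no grammar)."""
    words = prompt.split()
    rng = random.Random(seed)  # fixed seed keeps the scramble reproducible
    rng.shuffle(words)
    return " ".join(words)

original = "kittens playing with yarn in a sunbeam"
scrambled = shuffle_words(original)
print(scrambled)
```

If the scrambled prompt gives roughly the same images, that would point toward keyword matching; prompts that depend on word order (like "A on top of B") would be the more telling cases to scramble.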

1Dirichlet-to-Neumann
That's an interesting question! Although it clearly understands things like spatial positioning, so it must understand some grammar.

Prompt suggestion: "A drawing of an animal which has no resemblance to a cat"

5Swimmer963 (Miranda Dixon-Luinenburg)
Yeahhhh, as I expected DALL-E cannot super follow the negation here. (We also tried to ask it for "a stop sign, spelled incorrectly" and it just gave us stop signs.) 
2Vanessa Kosoy
Hmm, theoretically, DALL-E might be assuming the prompt is irony. What about this: "Apparently, this is a cat???"
4Swimmer963 (Miranda Dixon-Luinenburg)
Yeah, no, it just gives me...cats.
1Garrett Baker
Perhaps there is no operation of negation on cats in its model. I'd predict it'd have an easier time just taking things out of pictures, so the prompt "a picture of my bed with no sheets" should produce a bed with no sheets. If you wrote "This picture has no cats in it. The title is 'the opposite of a cat'", I am uncertain about the output.

Curated. I think this post is a great demonstration of what our last curation choice suggested

Interpretability research is sometimes described as neuroscience for ML models. Neuroscience is one approach to understanding how human brains work. But empirical psychology research is another approach. I think more people should engage in the analogous activity for language models: trying to figure out how they work just by looking at their behavior, rather than trying to understand their internals. 

I'm not yet convinced this will be especially fruitful, but t...

I gather we're allowed to suggest prompts we wish to see? Here's a prompt trying to create fanart for my favourite web serial, Pale by Wildbow:

"A girl with red-blonde hair, in a forest. The girl is wearing a deer-mask with short antlers, a cape over a jersey and shorts, and a witch's hat. The girl is holding a hockey stick. Every branch of every tree has a bright ribbon tied to it. The cape rests atop her shoulders and falls over one arm like a musketeer's cape. The witch's hat and the cape are both navy-blue."

3Swimmer963 (Miranda Dixon-Luinenburg)
The AI is sort of trying to make these photographs, but I am judging that none of them are in danger of being photorealistic faces... 
3ArisKatsaris
Lol. Thank you. To make it look like fanart I should have probably specified something about it, because these currently look more like photos of LARPing. Many thanks and I appreciate the effort. If it's not overtaxing your generosity can we then also attempt the following tweaks? "Highly detailed and beautiful digital art of a fantasy character: A 13-year-old girl with red-blonde hair, in a forest. The girl is wearing a deer-mask with short antlers, a cape over a jersey, and a witch's hat. The girl is holding a hockey stick. Every branch of every tree has a bright ribbon tied to it. The cape rests atop her shoulders and falls over one arm like a musketeer's cape." Many thanks in advance!
7Swimmer963 (Miranda Dixon-Luinenburg)
Ooooh! Yeah, definitely much more fantasy fan art style. 

Thanks Swimmer963! This was very interesting.

I have a general question for the community. Does anyone know of any similar such descriptions of model limitations with so many examples performed for any language models such as GPT-3?

My personal experience is that visual output is inherently more intuitive, but I'd love to explore my intuition around language models with an equivalent article for GPT-3 or PaLM for example.

I'd predict such articles exist with high confidence but finding the appropriate article with sufficient quality might be trickier. I'm curious which articles commenters here would select in particular.

Some prompts:

The Last Supper by Leonardo Da Vinci, but painted from behind.

The Last Supper by Leonardo Da Vinci, but painted from above, looking straight downwards.

The Last Supper by Leonardo Da Vinci, as an X-ray image.

Relativity by Escher, as a high-resolution photograph.

Boris Johnson dressed as a clown and riding a unicycle along a tightrope, spray-painted onto a wall, in the style of Banksy.

"The Last Supper by Leonardo Da Vinci, as an X-ray image" It's trying! 

I especially like this one (close-up): https://labs.openai.com/s/QsWCxHvbwRaIJEB7xbTCnvwx

1bfinn
Thanks very much - yes, that one is pretty remarkable, as are several of them. On the close-up I see loaves, some kind of gadget left of centre, and is that the baby Jesus (with beard?) they're about to tuck into?! (I assume DALLE-2 is not always sure how to show people from this perspective.)
6Swimmer963 (Miranda Dixon-Luinenburg)
"The Last Supper by Leonardo Da Vinci, but painted from behind". (Based on previous playing around, I think that DALL-E does not have a super strong conception of "The Last Supper" in general, and sort of defaults to a generic supper table.) 
2bfinn
Thanks. Interesting that it gets the general idea of 'from behind' but the specifics garbled - eg bottom left the people should be sitting on the bench, not the other side of the table!

This is great! Thanks.

A nitpick:

adding stylistic prompts actively changes some of what I would consider to be content

Your examples here are not good since e.g. "...painting by Alphonse Mucha" is not just a rewording of "...in the style of Alphonse Mucha": the former isn't a purely stylistic prompt. For a [painting by x], x gets to decide what is in the painting - so it should be expected that this will change the content.
Similarly for "screenshots from the miyazaki anime movie".

Of course it's still a limitation if you can only get really good style results by using such not-purely-stylistic prompts.

3Swimmer963 (Miranda Dixon-Luinenburg)
That's a reasonable point. I have definitely found that saying "a painting by X" or "a movie by X" gets results that a) I personally like much better, and b) are much more consistently and recognizably in the requested style!  I'm not sure whether "in the style of X" just ends up being less of a strong hint for DALL-E, or whether it's pulling on a much bigger set of training data. Maybe there are all sorts of images online labeled as "in the style of Alphonse Mucha" by people who don't actually know how to assess styles? Anyway, this is "A woman at a coffeeshop working on her laptop and wearing headphones, in the style of Alphonse Mucha" and it's fine but it's much less what I ordered! 

Some prompts I’d love to see: “Infinite Jest” “Bedroom with impossible geometry” “Coffeeshop in non-Euclidian hyperbolic space” “Screenshot of Wikipedia front page” “The shadow in the corner of the room stared at me”

"A screenshot of the Wikipedia home page" this is one of the results that makes me feel ~anthropomorphized fondness for DALL-E. It's trying so hard! 

8cwillu
It's basically what text looks like when I dream.
2philip_b
"A screenshot of the Wikipedia home page, Halloween version" please.
4Swimmer963 (Miranda Dixon-Luinenburg)
This came out super cute! Thanks for the prompt idea :) 
1PoignardAzur
Fascinating. Dall-E seems to have a pretty good understanding of "things that should be straight lines", at least in this case.
4Swimmer963 (Miranda Dixon-Luinenburg)
I ran "Bedroom with impossible geometry by MC Escher" to give DALL-E more hints, because the first run was really not very impossible-geometry, I'm not sure if DALL-E was managing to parse that as a clause or just hearing 'geometry.' 
4Swimmer963 (Miranda Dixon-Luinenburg)
"Infinite Jest"
2Swimmer963 (Miranda Dixon-Luinenburg)
"The shadow in the corner of the room stared at me, digital art"

Dall-E knows locations. We put in a watercolor painting I did of our cabin on a lake and asked Dall-E to create a "variation". The watercolor image Dall-E created was literally my next-door neighbor's cabin, which is a few hundred feet away from ours. Blew my mind how Dall-E even knew the location just based on the image I put in. 

8Ben Pace
I currently roll to disbelieve, and suspect that it just thought a cabin should be there. What I'm saying is, pics or it didn't happen ;)
4gwern
I agree. What sort of images would it even be trained on in the first place which would allow that? It can't train on a big montage or landscape shot because the dimensions are wrong and the core model is trained on very small samples to boot, with upscalers handling most of the pixel generation. I would check Google & Yandex image search to see if there are any photographs online with the two cabins in the same photograph which could hypothetically enable that. I would also try using the closest street addresses to see if one can prompt it directly, since that is likely what would be in the text caption of hypothetical images. Also, testing photograph rather than watercolor is an obvious change. A more stringent test would be to do inpainting/uncropping of photographs of both: if it really does 'know', it should be highly likely to fill in the other cabin in the right location and surroundings when you 'pan left' or whatever. Otherwise, 'cabins' are a fairly stereotypical kind of architecture and it just got lucky. OA says DALL-E 2 is well into the low millions of images generated and climbing as fast as overloaded GPUs can spit them out (<=50 completions per day per >30k invited people thus far...), so we're not even appealing that hard to chance here.
1Swimmer963 (Miranda Dixon-Luinenburg)
W h a t  that's wild, wow, I would absolutely not have predicted DALL-E could do that! (I'm curious whether it replicates in other instances.) 

I'd love to see:

>A group of happy people does Circling and Authentic Relating in a park

2Swimmer963 (Miranda Dixon-Luinenburg)
"A group of happy people does Circling and Authentic Relating in a park"
2ChristianKl
Thank you very much. It's interesting how Dalle got the idea that people are either holding hands or doing hula hoops.

Big black furry monsters with tall heads are wearing construction outfits and are dripping with water and seaweed. They are using a dolly to pick up a dumpster in an alley and pointing at where to bring it. Realistic HD photo.

9Swimmer963 (Miranda Dixon-Luinenburg)

I am so confused by two completions of a human girl here. How is this possibly close in image-space to all the other images, especially given this prompt?

5gwern
That's an unusually realistic face, and a distinct hairstyle. I suspect that's a real person and if it is, knowing who might shed some light on how the prompt could possibly be tapping into her - that she shows up twice (and it's obviously the same girl twice given the hair style and clothing are the same) suggests there is some sort of real connection, like she's an animator famous for cartoon monsters or something.
7habryka
This also replicated when I asked someone else to generate new images for the same prompt (one image out of the 10 was again in this very different style and displaying approximately the same person).
2gwern
Very strange. I did some searching in Google Images & Yandex for the cropped face and for 'furry black monsters', and aside from being impressed by just how many more women Yandex turns up who do in fact look a lot like the sample, didn't find anything obviously relevant.
1AttentionResearcher
Interesting. Both images of her have a white house wall to the left with the same lighting, the same hair color with colored ends, the same shirt color, and the same skin color. Maybe the words 'wearing' and 'outfits' and 'black' and 'alley' and even 'dolly' and 'photo' triggered it to give us an alley - but one that has a clothing fashionista in it lol. It still seems to be mostly choosing a single source though.
3PoignardAzur
This seems like a major case study for interpretability. What you'd really want is to be able to ask the network "In what ways is this woman similar to the prompt?" and have it output a causality chain or something.
1AttentionResearcher
1. Glossy black crystal temples with silver gates smoking and huge spiked metal worms drilling through the temples. A layer of smoke sits on the glossy black floor and there is chains everywhere. A huge bridge made of metal spikes connects to this world. Realistic HD photo. 2. Up close shot of tall pikachus that have short white peach fuzz fur are wearing full furry white robes and are placing large gold keys into a white furry chest in utopia heaven. It's shining bright morning hour and everything has gold plated and crystal rimmed features. Realistic HD photo.

What if the prompt literally doesn't make sense? Like having a coherent prompt structure, but the content isn't logically valid.

For example, "A painting of a woman drawing herself, in the style of clocks"

4Swimmer963 (Miranda Dixon-Luinenburg)
It tries! 

Thank you for sharing all of these DALL-E tests!

I wonder whether it can reproduce three objects that reliably appear together in images.  How about one of these prompts:

A bronze statue of three wise monkeys.

See no evil, hear no evil, speak no evil, statue of monkeys.

3Swimmer963 (Miranda Dixon-Luinenburg)
"A bronze statue of three wise monkeys." Pretty solid!  "See no evil, hear no evil, speak no evil, statue of monkeys."
1PoignardAzur
Interesting. It seems to understand that the pattern should be "Three monkeys with hands on their heads somehow", but it doesn't seem to get that each monkey should have hands in a different position. I wonder if that means gwern is wrong when he says DALL-E 2's problem is that the text model compresses information, and the underlying "representation" model genuinely struggles with composition and "there must be three X with only a single Y among them" type of constraints.
1gturk1
Thank you so much for this!  It did do quite well. I have been trying to think of another set of three items that are reliably found together, but this is all I could come up with.  Pairs of items are much easier to come up with.
1TibuAI
This is so good.

The Bill Watterson one requires me to request black bears attacking a black forest campground at midnight.

Optionally: "...as pixel art".

I have to ask, how does one get hold of any of the programs in this vein? I've seen Gwern's TWDNE, and now your experiments with DALL-E, and I'd love to mess with them myself but have no idea where to go. A bit of googling suggests one can buy GPT3 time from OpenAI, but I gather that's for text generation, which I can do just fine already.

2Swimmer963 (Miranda Dixon-Luinenburg)
OpenAI has a waitlist you can sign up for to get early access to DALL-E. 
2Error
Ah, that put me on the right track. I've been asking google the wrong questions; I was looking for a downloadable program that I could run, but it looks like some (all?) of the interesting things in this space are server-side-only. Which I guess makes sense; presumably gargantuan hardware is required.
2MikkW
In the case of OpenAI, the server-side-only constraint, IIRC, is intentional, to prevent people from modifying the model, for AI safety reasons. My understanding is that usually running a model isn't as compute-intensive as training it in the first place, so I'd expect a user-side application to be viable; just not in line with OpenAI's modus operandi.
2ChristianKl
I asked a while ago: https://www.lesswrong.com/posts/HnD8pqLKGn2bCbXJr/what-s-the-easiest-way-to-currently-generate-images-with There are a few Google Colab notebooks that you can run online, and you could also run the code offline if you desire.

It'd be interesting to see (e.g.):

Full body x-ray scan of a {X}. Detailed, medical professional scan.

Medical illustration of {X} skeleton, with labels. High quality, detailed, professional medical illustration.

Where X is some fictional creature, such as: mermaid, Pikachu, dragon.
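For anyone who wants to generate the full list of prompts from the two templates above, here is a minimal sketch. It only builds the prompt strings; it does not call DALL-E. The names `TEMPLATES`, `CREATURES`, and `expand_prompts` are made up for illustration.

```python
from itertools import product

# The two templates are copied from the comment above; {X} is used as a str.format field.
TEMPLATES = [
    "Full body x-ray scan of a {X}. Detailed, medical professional scan.",
    "Medical illustration of {X} skeleton, with labels. High quality, detailed, professional medical illustration.",
]

# The example creatures suggested in the comment.
CREATURES = ["mermaid", "Pikachu", "dragon"]


def expand_prompts(templates, creatures):
    """Return every template filled with every creature, in template-major order."""
    return [t.format(X=c) for t, c in product(templates, creatures)]


if __name__ == "__main__":
    for prompt in expand_prompts(TEMPLATES, CREATURES):
        print(prompt)
```

With two templates and three creatures this yields six prompts, which can then be pasted into the DALL-E interface one at a time.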

6Swimmer963 (Miranda Dixon-Luinenburg)
"Medical illustration of a gryphon's skeleton, with labels. High quality, detailed, professional medical illustration." The labels are cute! 
5Swimmer963 (Miranda Dixon-Luinenburg)
I had to fiddle with the prompt some, but "Detailed high quality full-body x-ray scan of a mermaid with fins and tail, medical records" gets at least a few results that are what I asked for. 
3Shai Noy
Wow, those and the gryphon above are both awesome! Thanks! Would you be kind enough to share high-res versions of your picks from both? With your permission, I'd love to share those on the DALL-E subreddit.
3Swimmer963 (Miranda Dixon-Luinenburg)
Pic from the mermaid one: https://labs.openai.com/s/fSTlhqXtpZee9Vedy9xMfsZD And from the gryphon one: https://labs.openai.com/s/JydvuNEv6TCozRECbE4WygQB
1Shai Noy
🙏
1Zachary MacLeod
Oh dang! Would it be too much to ask to see what some of those might look like if they were uncropped by AI?

Could you please return 10 images for each of these prompts? These are my best ones, which should draw some interesting vividness out of it:

1) Bright macro shot of a plush toy robot pikachu eating a hamburger in a nurse outfit against a white brick wall with mud splashed on pikachu from a tire on the road. 8K HD incredibly detailed.

2) Macro shot of the cool pikachu wearing black chains and laughing as seen in a truck selfie in the desert next to a sand castle with piranha plants seen through the heat. 8K incredibly detailed.

3) Future 2377 hospital with beds in glass co...

3Swimmer963 (Miranda Dixon-Luinenburg)
"Video game case rated M, dark red rimmed, macro shot, a glossy black world that endlessly goes back into the distance with many black temples, gates, and chests. HD photo."
3Swimmer963 (Miranda Dixon-Luinenburg)
"A glossy black temple surrounded by lava and thunder with silver spiked chests on the ground next to the gate. HD, detailed." It's not super coping with all the details – could maybe do better with more repetition in the prompt? – but it's got the vibe. 
3Swimmer963 (Miranda Dixon-Luinenburg)
I am going to register an advance prediction that many of these contain way too many details (both in terms of number of objects requested, and in terms of specific relationships between said objects) and are going to overwhelm the poor image model. I'll run them as-is, but I might also try modified/simplified versions if I think I can get something more in the spirit of your requests that way. 
2Swimmer963 (Miranda Dixon-Luinenburg)
"A plate with fries, nuggets, steak and pikachu-shaped cake covered in ketchup and salt topped with ice cream. Close-up photograph."
2Swimmer963 (Miranda Dixon-Luinenburg)
"Video game case rated E, grey rimmed, macro shot, metal temples along a concrete river with silver gates, chests, and floating gold keys. HD photo."
2Swimmer963 (Miranda Dixon-Luinenburg)
"A motor connects to a hydraulics pump, which connects to blue energy rods soaking in pink liquid. It's smoking. Macro, detailed."
2Swimmer963 (Miranda Dixon-Luinenburg)
"Big black furry monsters with tall heads and white patches dripping with water and seaweed are picking up a dumpster in an alley. Realistic HD photo."
2Swimmer963 (Miranda Dixon-Luinenburg)
"Microscopic water bear meets a bacteria that looks like pikachu next to a bacteria hospital pouring out different colored creatures with spikes, furr, etc. HD detailed."
2Swimmer963 (Miranda Dixon-Luinenburg)
"Future 2377 hospital with beds in glass containers, white spheres that hold tools, robot maids, cameras everywhere, and blue scrubbing systems moving around the walls. Lots of detail and systems."  I think this is my favourite: https://labs.openai.com/s/KQfiNLLHurkhwSW7Cj38GWA8
2Swimmer963 (Miranda Dixon-Luinenburg)
With some changes to the prompt, "A cool goth pikachu wearing black chains and laughing, sitting in a truck in the desert, next to a heat-shimmery sand castle with piranha plants. 8K incredibly detailed."
1localdeity
It tends to depict Pichu rather than Pikachu.  But I note that Pichu's electric attacks damage itself, at least in Super Smash Bros (and I find a quote from the Bulbapedia article saying "it cannot discharge without being shocked itself"), which caused a friend to refer to Pichu as "emo Pikachu".  Perhaps "goth Pikachu" ended up referring to the same thing...
2Swimmer963 (Miranda Dixon-Luinenburg)
"Bright macro shot of a plush toy robot pikachu eating a hamburger in a nurse outfit against a white brick wall with mud splashed on pikachu from a tire on the road. 8K HD incredibly detailed." Yeah - DALL-E seems to be landing at best a handful of the details you wanted, and in some of these it seems to be returning something almost random!
2Swimmer963 (Miranda Dixon-Luinenburg)
"A plush toy robot pikachu wearing a mud-splashed nurse outfit and eating a hamburger, against a white brick backdrop. Detailed HD footage." It's done much better here! I'm not sure any of the images managed the "mud-splashed" bit, but they've all got a reasonable Pikachu-robot, plus the hamburger and the white brick wall, and some of them are managing the nurse outfit. 
1AttentionResearcher
Could you do the other prompts in my post? I want to push the model; maybe you missed them due to comment collapse. Or if you want me to pick only a few, let me know. This is so cool.
2Swimmer963 (Miranda Dixon-Luinenburg)
I'll come back to them! There's just a whole lot of comments on this post to process.
1AttentionResearcher
Wow! The glossy black temple one, wow! This is beyond belief, impossible! It not only came close to what I had imagined but forget the lava, it's better! Just what I want. Looks like a hard game. The others are also very impressive, I Love the dumpster one it came very good, and the hospital one, and the food one is just grand. The 2 video game case ones, good but not good haha I meant not those cases, how did your brain interpret the outputs - you saw it was wrong right (lol) ?. Here is a few more and let's try to get one of those games made right this time. Also I'm adding onto the food one something interesting and attempting to elongate that good one: 1) A video game sitting against a wall, rated E, grey rimmed, metal temples along a concrete river with silver gates, chests, and floating gold keys. HD photo, macro. 2) A plate with fries, nuggets, gold fork and knife, steak and pikachu-shaped cake covered in ketchup and salt topped with ice cream. Close-up photograph. Pikachu is leaning into the plate eating the food. 3) A gold room full of red rubies, gold coins, white crystals, silver spiked chests, and gold toilets lined up. 4) A floating liquid metal blob in a laboratory is 3D printing cameras, memory, and sensors. Scientists are trying to guide it. HD, detailed. 5) Arial view over a world consisting of glossy black temples, thunder, round purple chambers, spiked lava rivers, and flat paths that maze around and monsters guarding gates. HD, detailed. 6) Inside the bright gold temple restaurant is gold tables, crystal walls, waterfalls coming out the walls, robot maids, and lots of fries and red ruby decorations. HD, detailed. 7) Glossy black dragon statue shooting red laser beams from its eyes into a glossy black wall, making it crack open exposing a gold vault. In the rain at night, HD, detailed. 8) A glossy black temple surrounded by lava and thunder with silver spiked chests on the ground next to the gate. Big black bosses wearing gold chains and cr
3Swimmer963 (Miranda Dixon-Luinenburg)
Aaaaaaand final one. (I would kind of prefer if you keep any future requests to one or two prompts.)  "A glossy black temple surrounded by lava and thunder with silver spiked chests on the ground next to the gate. Big black bosses wearing gold chains and crowns are walking into the temple. Raining, HD, detailed."
1AttentionResearcher
Ok! One last one to document its limits further: Black robots wearing gold chains and red robes sitting in thrones made of white crystal with gold spikes lined up. The robots are holding plates with fries and ice cream over white sinks in front of their thrones facing a mirror, in a red luxury bathroom full of gold coins and doors, and white and red ruby pots.   Also can you do two Variations below showing all 10 results? (Note: I super-resolutioned one, so if you have the full version saved, check which is more detailed truly): https://ibb.co/jVFpQP8 https://ibb.co/Tb36ZQw (uploaded using imgBB)
2Swimmer963 (Miranda Dixon-Luinenburg)
Plus your other request, "Black robots wearing gold chains and red robes sitting in thrones made of white crystal with gold spikes lined up. The robots are holding plates with fries and ice cream over white sinks in front of their thrones facing a mirror, in a red luxury bathroom full of gold coins and doors, and white and red ruby pots." Honestly pretty impressed with the level of detail in the image! 
2Swimmer963 (Miranda Dixon-Luinenburg)
For the second request, I'm not sure I follow - are these results from previous prompt rounds that I ran? 
1AttentionResearcher
The gold room one, yes please, and the other is a mario game that would be interesting to see if it can make Variations of too. (show all 10)
2Swimmer963 (Miranda Dixon-Luinenburg)
Gotcha! Gold room variations here: And the Mario game variations: 
1AttentionResearcher
Was a text prompt used alongside the image input to make these Variations? Or just image input? Very interesting results BTW.
2Swimmer963 (Miranda Dixon-Luinenburg)
just the image - I had uploaded them as new images bc it cleared my session and I didn't have the originals anymore. 
1AttentionResearcher
Ok. It's good that no text prompt was used. I wonder what would happen if you now tried the gold room image again with its text prompt below; maybe it would guide the 10 Variations better? Though it seems as if you already have: the Variations show toilets even though there are none in the input image. Why is that? Here was the prompt, please try it (or without if you think you included text): 'A gold room full of red rubies, gold coins, white crystals, silver spiked chests, and gold toilets lined up.'
2Swimmer963 (Miranda Dixon-Luinenburg)
"Black robots wearing gold chains and red robes sitting in thrones made of white crystal with gold spikes lined up. The robots are holding plates with fries and ice cream over white sinks in front of their thrones facing a mirror, in a red luxury bathroom full of gold coins and doors, and white and red ruby pots."
3Swimmer963 (Miranda Dixon-Luinenburg)
"Inside the bright gold temple restaurant is gold tables, crystal walls, waterfalls coming out the walls, robot maids, and lots of fries and red ruby decorations. HD, detailed."
3Swimmer963 (Miranda Dixon-Luinenburg)
"A gold room full of red rubies, gold coins, white crystals, silver spiked chests, and gold toilets lined up." I think it's confused on the color scheme - the room itself doesn't appear to be gold in any of these. 
3Swimmer963 (Miranda Dixon-Luinenburg)
"A plate with fries, nuggets, gold fork and knife, steak and pikachu-shaped cake covered in ketchup and salt topped with ice cream. Close-up photograph. Pikachu is leaning into the plate eating the food." I think this is closer to what you were envisioning? Though, uh, mildly horrifying in a few, and also one of them made Pikachu a rubber duck? 
2Swimmer963 (Miranda Dixon-Luinenburg)
Minor edit because 'shooting' appears to be a banned keyword: "Glossy black dragon statue flinging red laser beams from its eyes into a glossy black wall, making it crack open exposing a gold vault. In the rain at night, HD, detailed."
2Swimmer963 (Miranda Dixon-Luinenburg)
"Arial view over a world consisting of glossy black temples, thunder, round purple chambers, spiked lava rivers, and flat paths that maze around and monsters guarding gates. HD, detailed."
2Swimmer963 (Miranda Dixon-Luinenburg)
I modified #4 a bit to try to hint harder, since the initial round mostly gave me only the liquid blobs. It's still struggling with the details, especially at including any scientists, so I think there are too many weird/not-usually-combined elements here for it to manage without much more skilled and careful prompting.  "There is a floating liquid metal blob in a laboratory. The floating liquid metal blob is is 3D printing cameras, memory, and sensors. There are scientists in the laboratory trying to guide the metal blob. HD, detailed."
2Swimmer963 (Miranda Dixon-Luinenburg)
This is a lot of requests and I'm at work, so I'll run them over the next few hours. (Honestly I'm not a video games person and had no idea that "case" was the same thing as...rating? and also I have no idea what an E rating is, I don't recognize that one from movies.) "A video game sitting against a wall, rated E, grey rimmed, metal temples along a concrete river with silver gates, chests, and floating gold keys. HD photo, macro." I don't think it super knows what you want here... 
2Measure
A "case" in this context is the plastic clamshell that holds/protects the disc when not in use (DALL-E thinks this instead means some sort of container found within a video game environment). The E rating (for "Everyone") is similar to the G rating for movies.
1AttentionResearcher
What it should be creating is this below (a video game case) ... XD lol: https://ibb.co/9TtJbqJ
1Andrew Currall
I hypothesise that the more details a prompt contains, the more likely DALL-E is to throw a wobbly and produce something almost totally random. But honestly, I'm very impressed with the outcome of most of these prompts. The Pikachu eating a hamburger is the only one of the above that really "failed", and even there a couple of the outputs picked up about half the requested details. 
a_l10

Could you try this?

"A DJ stands mightily on a festival stage with thousands of people cheering and dancing. The DJ's T-Shirt reads "CHUNTED". Drawn in the style of Bruno Mangyoku."

2Swimmer963 (Miranda Dixon-Luinenburg)
Tragically DALL-E still cannot spell, but here you go:

Could you run,

"A graphical sketch of the Pythagorean theorem"

?

3Swimmer963 (Miranda Dixon-Luinenburg)

You say it performs well when two characters have a single trait which is different between them. I wonder how much better it performs when you give character A many masculine traits, and B many feminine traits, without directly stating A is male and B is female, compared to if you randomize those traits for A or B. 

In general, assigning traits which correlate highly with each other should give better results. Perhaps a problem is that the more characters and traits you assign, the less correlated those traits are with all the other traits, and so far lower performance is seen.

Some prompt requests for my daughter:

"A wild boar and an angel walking side by side along the beach - beautiful hyperrealistic art"

"A piggo-saurus - a pig-like dinosaur - hyper realistic art"

"A piggo-saurus - an illustration of a pig-like dinosaur"

"A little forest gnome leaving through his magic book - beautiful and detailed illustration"

4Swimmer963 (Miranda Dixon-Luinenburg)
"A little forest gnome leaving through his magic book - beautiful and detailed illustration"
1p.b.
Thanks a lot! 
2Swimmer963 (Miranda Dixon-Luinenburg)
"A piggo-saurus - an illustration of a pig-like dinosaur"
2Swimmer963 (Miranda Dixon-Luinenburg)
"A piggo-saurus - a pig-like dinosaur - hyper realistic art"
2Swimmer963 (Miranda Dixon-Luinenburg)
"A wild boar and an angel walking side by side along the beach - beautiful hyperrealistic art"

Can it in some way describe itself? Something like "picture of DALL-E 2".

I wanted: the Star-Eyed Goddess

Maybe DALL-E thought you meant Movie-Star-Eyed Goddess? 'Cause that's what the picture looks like to me :)

Regarding text, if the problem comes from encoding, does that mean the model does better with individual letters and digits? Eg

"The letter A"

"The letters X Y and Z"

"Number 8"

"A 3D rendering of the number 5"

2Swimmer963 (Miranda Dixon-Luinenburg)
"A 3D rendering of the number 5"
2Swimmer963 (Miranda Dixon-Luinenburg)
"Number 8". Huh I think these are almost all street numbers on houses/buildings? 
2Swimmer963 (Miranda Dixon-Luinenburg)
"The letters X Y and Z" ok it's starting to get confused here.... (My prediction is that it'll manage the number 8 and number 5 in the next prompts, but if I try a 3-digit number it might flail).
2Swimmer963 (Miranda Dixon-Luinenburg)
Let's see!  "The letter A"

Awesome writeup!

To further explore the interplay between style and content, how about trying something not very specific that could gain specificity from the style context?

For example "Aliens are conducting experiments on human subjects":

  • as a screenshot from South Park (will these mostly feature the anal probe?)
  • as a medieval painting (will these be mostly dissection?)
  • as a screenshot from the movie Prometheus (will these be too scary to look at?)
3Swimmer963 (Miranda Dixon-Luinenburg)
"Aliens are conducting experiments on human subjects, as a screenshot from the movie Prometheus" came out weirdly video-game-esque? 
3Swimmer963 (Miranda Dixon-Luinenburg)
"Aliens are conducting experiments on human subjects, as a medieval painting" And this didn't come out all that medieval-style, so I tried again with "Aliens are conducting experiments on human subjects, as a medieval illuminated manuscript"
3Swimmer963 (Miranda Dixon-Luinenburg)
"Aliens are conducting experiments on human subjects, as a screenshot from South Park"

Prompt: A cartoon honey badger wearing a Brazilian Jiu Jitsu GI with a black belt, shooting in for a wrestling takedown

2Swimmer963 (Miranda Dixon-Luinenburg)
Slightly modified because 'shooting' is a banned keyword: "A cartoon honey badger wearing a Brazilian Jiu Jitsu GI with a black belt, jumping in for a wrestling takedown"

Can you try this one:

Glossy black crystal temples with silver barred gates releasing smoke along a metal path with spikes along it next to a red river, and a layer of smoke. Chains everywhere. A black portal is at the end with heavy glossy techno bosses guarding it. Realistic HD photo.

Zz

[This comment is no longer endorsed by its author]
6Swimmer963 (Miranda Dixon-Luinenburg)
Tweaked the prompt multiple times and this is the best I got re: tights and not stockings, I think DALL-E just has very strong priors on "stockings" going with this art style. "Girl wearing a beautiful white dress over white leggings. She is beside another happy girl with black hair wearing a dress over black leggings. The sun is behind the two, dramatic lighting, Anime fanart, safebooru, deviantart, advanced digital art settings, behance 8k super-quality beautiful"
1Zachary MacLeod
Have you considered using Dall-E 2's inpainting to "uncrop" the image? Take the picture, scale it down to leave some empty space outside the frame, then place it back in?
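A rough sketch of the geometry behind this "uncrop" trick, for the curious. It only computes where the shrunken picture would sit on the canvas; the actual resizing and inpainting would still be done in an image editor and in DALL-E's own interface. The function name `uncrop_layout`, the default `scale`, and the 1024x1024 size in the usage line are assumptions for illustration, not anything taken from DALL-E itself.

```python
def uncrop_layout(width, height, scale=0.5):
    """Place a shrunken copy of a width x height image at the centre of a same-sized canvas.

    Returns (new_w, new_h, left, top): the size of the shrunken image and the
    top-left corner where it should be pasted. Everything outside that box is
    the empty border that an inpainting tool would be asked to fill.
    """
    if not 0 < scale < 1:
        raise ValueError("scale must be between 0 and 1 (exclusive)")
    new_w = round(width * scale)
    new_h = round(height * scale)
    left = (width - new_w) // 2
    top = (height - new_h) // 2
    return new_w, new_h, left, top


if __name__ == "__main__":
    # Hypothetical square image; the size is an assumption for the example.
    print(uncrop_layout(1024, 1024, scale=0.5))
```

A smaller `scale` leaves more border for the model to invent, at the cost of giving it less of the original picture to anchor on.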
1Evidential
Dall-e 2 is so mean to me lol. I like the dresses though, especially on the bottom far-left. If you can send me that and the fourth one on the top I will be happy, thank you (going to try to edit it on photoshop or something)

Here is an idea that I hope will give some interesting results:

A complex Rube Goldberg machine.

Some possible variations:

A Rube Goldberg machine made out of candy.

A photograph of a steampunk Rube Goldberg machine.

4Swimmer963 (Miranda Dixon-Luinenburg)
I've been experimenting with some style prompts suggested on Twitter, so have "A complex Rube Goldberg machine, Sigma 85mm f/1.4 high quality photograph"
2Swimmer963 (Miranda Dixon-Luinenburg)
"A Rube Goldberg machine made out of candy, Sigma 85mm f/1.4 high quality photograph"
1gturk1
Thank you so much for running these prompts!  The extra prompt details that you included about photography really adds to the results.  Great depth of field effects. It is interesting to see the difference between the regular machines and the ones made out of candy. What might have been a tube or a coil in the regular machine gets replaced by a necklace of round candies or a candy spiral.

"White haired girl wearing white tights with a girl with black hair wearing opaque black tights and blushing, Anime fanart, danbooru, deviantart, advanced digital art settings"

 

(since there are 2 girls, it doesn't qualify as "explicit" and is more just anime fanart)

6Swimmer963 (Miranda Dixon-Luinenburg)
"A white haired girl wearing white tights. She is beside another girl with black hair wearing opaque black tights and blushing. Anime fanart, danbooru, deviantart, advanced digital art settings"
1Evidential
I tried fixing the prompt; you can see if it works this time.

As a cinematographer, I'm now curious how well it understands more advanced photography techniques. For example, can it do something like "Double exposure photo of the silhouette of a man with fireworks in the background"? I made a similar photo two years ago and I'll leave it here as a reference to see how similar it can get: https://i.gyazo.com/ace7c2bd76a8f2710859362314a1f8c0.jpg

3Swimmer963 (Miranda Dixon-Luinenburg)
Well, this is the DALL-E attempt! not quite the same but definitely intriguing. 
1TibuAI
That's cool! It understands the silhouette request and the fact that a double exposure will overlap the subjects, but it doesn't work within the physical rules of the technique. Makes complete sense and creates very interesting results. The 3rd one is the closest to what a physical double exposure would look like. Very nice.

This is so incredible. I'm a cinematographer and I'm looking forward to having access because I'm curious how it'll perform at making references for projects. I'm curious if it can take any specific film (not franchise) and adopt that style. An example of this would be something like "A man with a blue shirt walking through a dark hallway, in the style of Blade Runner 2049". If this works it would also explain why it is a bit loose when you mention Pixar the production company instead of a specific film with a more consistent style. A lot like the...

3Swimmer963 (Miranda Dixon-Luinenburg)
"A man with a blue shirt walking through a dark hallway, in the style of Blade Runner 2049" Well, it apparently thinks I just want the hallway lighting to be blue, which is a pretty common sort of thing for it. Otherwise seems at least kind of Blade Runner-esque? 
2gbear605
It seems like the atmosphere is right, and technically the shirt could be blue, we just can't tell.
1TibuAI
Wow, this is really interesting. I agree with gbear605 that the atmosphere is right, with the backlit silhouette style of a lot of the film. The 10th one is really really good. It's doing the usual thing of taking the properties of one element and applying them to other things, like the color of the lighting here. I'm still curious about the inpainting approach to do images piece by piece, similar to what I mentioned for the 2-character problem. Maybe using inpainting you could go element by element in other instances of this problem so it doesn't get so confused. Seeing these results is very satisfying and insightful, thank you!

Heyyyy I got a prompt request:

Illustrated artwork by Hirohiko Araki depicting Shrek and Donkey in the style of Jojo's Bizarre Adventure.

7Swimmer963 (Miranda Dixon-Luinenburg)
Here you go! 
2Zachary MacLeod
Oh my god that worked well :O

If you want specific words spelled correctly, try putting quotation marks around those words in the prompt.

2Swimmer963 (Miranda Dixon-Luinenburg)
I have tried that! As far as I can tell it doesn't make much of a difference.

Prompt:

Axis and Allies board game 2022 setup. Digital image official concept

(Remove some words if it doesn't work)

3Swimmer963 (Miranda Dixon-Luinenburg)
"Axis and Allies board game 2022 setup. Digital image official concept." (I'll maybe play around a bit with the wording to see if I can get something more dramatic.) 
1Evidential
Yes, it looks like it has some concept of the game. Tell me how it goes with changing the wording

Small white cat wearing a red collar with a bell on it hugging a shadow person. Cute digital art, enhanced digital image

2Swimmer963 (Miranda Dixon-Luinenburg)
It's having some trouble with the shadow person, but definitely a cute cat! 

Cute White Cat Plushie On A Bed, 4K resolution, amateur photography

2Swimmer963 (Miranda Dixon-Luinenburg)
"Cute White Cat Plushie On A Bed, 4K resolution, amateur photography"

Prompt request!

  • "Dystopian hellscape" and/or "Dystopian hellscape, painted by William Blake" (Someone had to ask.  If the resulting images are too gross/disturbing, feel free to skip.)
  • "She made broken look beautiful and strong look invincible. She walked with the Universe on her shoulders and made it look like a pair of wings." (Quote from Ariana Dancu)
  • “But the stars that marked our starting fall away. We must go deeper into greater pain, for it is not permitted that we stay.” (Quote from Dante Alighieri, Inferno)
  • "How can a man die better, than facing...
9Swimmer963 (Miranda Dixon-Luinenburg)
"But the stars that marked our starting fall away. We must go deeper into greater pain, for it is not permitted that we stay. Hyperrealistic digital art." Some of these are gorgeous! Let me know if you want full-size versions for any! (Not sure how well they capture Dante, but still.) 
1Bezzi
In order to better capture Dante, I would suggest trying with "Engraving by Gustave Doré" instead of "Hyperrealistic digital art".
2Swimmer963 (Miranda Dixon-Luinenburg)
Here we are! 
1Sable
...um, all of them? :) Holy crap I did not expect this.  I think my favorites are the top middle three and the second from the right on the bottom.  Which were yours?
2Swimmer963 (Miranda Dixon-Luinenburg)
(Oops, really sorry, it closes out my session every so often and I don't have the originals for this anymore.)
5Swimmer963 (Miranda Dixon-Luinenburg)
"Good versus evil in a climactic battle, epic matte painting" 
1Sable
You had discussed how DALLE-2 seems to struggle with assigning traits to more than a single person.  It seems to have done well here, with "good" getting more knight-like appearances and "evil" being more consistently demonic. I wonder how much further we could push with anthropomorphized concepts?
5Swimmer963 (Miranda Dixon-Luinenburg)
"She made broken look beautiful and strong look invincible. She walked with the Universe on her shoulders and made it look like a pair of wings." Tried with both just 'digital art' and 'hyperrealistic digital art', I find that works best for poetic-quote-prompts.   
1Sable
These are gorgeous!
3Swimmer963 (Miranda Dixon-Luinenburg)
"How can a man die better, than facing fearful odds, for the ashes of his fathers, and the temples of his gods? Hyperrealistic digital art."
1Sable
It looks like DALLE-2 is pulling from several different genres?  The top left two are very man-of-tomorrow, whereas the three on the top right are more fantastical.  And the bottom five are all very distinct.
2Swimmer963 (Miranda Dixon-Luinenburg)
"Dystopian hellscape, painted by William Blake" Honestly not very disturbing? 
1Sable
I'll admit, I'm pleasantly surprised.  DALLE-2 seems to be pulling from Dante's Inferno cover art, honestly. Especially because it seems to have spit out a number of book titles?

Prompt:

"Chi in Chi's Sweet Home japanese animation. Streaming service Crunchyroll. Screenshot of episode with Chi, who is a cute tabby-white mixed cat. 2D, Google Search Screenshot, Pinterest"

4Swimmer963 (Miranda Dixon-Luinenburg)
Not sure what the deal is with top right... 

Very insightful post. May I use your images in my PhD dissertation to illustrate limitations of current image generation methods? Thanks!

Gabriel Huang

2Swimmer963 (Miranda Dixon-Luinenburg)
You may! Just make sure to keep the DALL-E signature block (bottom right) and attribute it.  Also feel free to request a couple of prompts if you want. 

Reference Picture of Kyubey. Drawn By Puella Magi Madoka Magica. Digital Art Clip Studio Paint Anime, Pretty and Shining. Advanced Image of Kyubey. The character is Kyubey from Puella Magi Madoka Magica

2Swimmer963 (Miranda Dixon-Luinenburg)
1Evidential
Looking at the boobs on the first picture, I feel like the AI can do it but since it is an anime, it is mixed in with hentai pictures and animal-humans. The AI must get anime animals and humans confused. The sad thing is that the AI knows what Kyubey is but it adds a bunch of random anime context. Maybe it needs words like "pokemon cat" just to understand it's not some sort of catgirl body mixed with kyubey

A Cute Cat Creature Character: Kyubey, Anime Show: Puella Magi Madoka Magica, Style: Screenshot From Anime Show. Exact screenshot, no variations from original artwork

3Swimmer963 (Miranda Dixon-Luinenburg)
Here you go! 
1Evidential
The third on the top is very cute
2Swimmer963 (Miranda Dixon-Luinenburg)
This one? https://labs.openai.com/s/lFZ3rLh0ozneh0m5BHJ8q58G
1Evidential
Yes! Thank you. I also think I will give up trying to get kyubey to generate but maybe whenever I get access I will try more idk

Prompt Idea:

Exact Picture of Kyubey, 2 Cat Ears, 2 Bunny Ears. Red Eyed Cat Antagonist From Puella Magi Madoka Magica. Specific Puella Magi Madoka Magica Anime Screenshot, No Variations

3Swimmer963 (Miranda Dixon-Luinenburg)
1Evidential
Maybe the words "specific" and "exact picture" throw it off. This actually makes this type of prompt very helpful for product / character design.

Prompt idea: "a model of a human cell with all the organelles as a snow globe".

6Swimmer963 (Miranda Dixon-Luinenburg)
Wow this came out pretty cute! 

Thanks to Benjamin Hilton on Twitter, I've been able to run some prompts despite not having access to DALLE 2 personally, and we noticed some interesting edge cases with DALLE's facial filter. Obviously in general DALLE is fine with animal faces and not fine with human faces, but there was one prompt I suggested, "a painting of a penguin jazz band, in the style of Edward Hopper's 'Nighthawks,'" that gave a bunch of penguins with eldritch abominations of faces. Another prompt, "a painting of a penguin in a suit, in ukiyo-e style," had no issues with generat...

2Swimmer963 (Miranda Dixon-Luinenburg)
Plain "penguins playing poker": And "penguins playing poker, in the style of Edward Hopper's 'Nighthawks'": It doesn't seem like it's especially face-abomination-y in either case? The second one is slightly iffier/weirder on close-up details generally, which fits with my observation that DALL-E gets worse at this if there are more things going on in a scene.
1Ryan Talvola
If I had to guess, the earlier prompt had it going for a painting specifically rather than the broader style, and the texturing got messed up. That probably implies that it's better to simply prompt with the style of painting you want instead of asking specifically for a painting, if you want coherent results. I also think it's interesting to note that with the second prompt, DALLE struggles far more than with the first to figure out what belongs on a table when playing poker, supporting your assertion that the more complicated scene causes some details to collapse. If you're still taking suggestions for prompts, I think these turned out so well I'd be curious to explore more variations on the theme. Could you try "penguins playing poker, in the style of Salvador Dali's 'The Persistence of Memory'" and "penguins playing poker, in the style of Grant Wood's 'American Gothic'"? These should be styles it can handle well that purposely aren't suited to this subject matter. 
4Swimmer963 (Miranda Dixon-Luinenburg)
"penguins playing poker, in the style of Grant Wood's 'American Gothic'"
4Swimmer963 (Miranda Dixon-Luinenburg)
"penguins playing poker, in the style of Salvador Dali's 'The Persistence of Memory'" honestly I really like this one! This in particular came out as just a pretty cool art piece: https://labs.openai.com/s/YeoG5VGOv8tJ3QOLOhB3lRFq
1Ryan Talvola
These are too good. I like how for all of these different styles so far, it's at least making an honest attempt to match them, and that painting you specifically highlighted is excellent (as much as I don't think they're quite playing poker as I know it). If you haven't hit your tolerance of poker-playing penguins, how about "penguins playing poker, in the style of Rene Magritte's 'The Son of Man'" (my friend's suggestion) and "penguins playing poker, in the style of The Simpsons"?  My original rationale with penguins as a subject is that they're black-and-white bipedal creatures, so hopefully not too hard to draw doing human-like things, that also aren't likely to have much existing artwork of them out there. The drawings I could find of penguins playing poker online were far worse IMO than any of these. 
2Swimmer963 (Miranda Dixon-Luinenburg)
"penguins playing poker, in the style of The Simpsons". The art style is definitely more ~cartoon, but otherwise seems pretty generic and not especially Simpsons-y?  I also ran "penguins playing poker, screenshot from The Simpsons TV show" for comparison, and it seems iffier/less consistent on details, but maybe more Simpsons-flavored?
2Swimmer963 (Miranda Dixon-Luinenburg)
"penguins playing poker, in the style of Rene Magritte's 'The Son of Man'" okay I have no idea what's up with bottom left, and bottom right has some face-monstrosities going on, but otherwise these are pretty well executed (though I am not sure how well they match the art style requested.) 
1Ryan Talvola
I've never seen that degree of screw-up in any DALLE generation before. Wonder what could have happened there. So I think that's the extent of "penguins playing poker" as an artistic subject for now (although it was very nice seeing the contrasts in style, and if I ever get access to DALLE myself there are some other variations I might try), so I'm curious now to see what exactly the limits of penguin generation can be (and perhaps if anything trips the content filters). There's this lovely Claymation sketch on YouTube that remakes The Thing with Pingu, so I'd be curious to see if DALLE can handle "penguins in John Carpenter's 'The Thing'" or "penguins in the chestburster scene from Alien". I suspect these might be too complex/specific for it to handle, but if either of them were to work... a third one that could be worth a try, too, is "penguins performing an exorcism".

I would be interested in two kinds of prompts:

First, can it reproduce something really popular like:
"V-J Day in Times Square - Alfred Eisenstaedt, 1945"
I know, that original has some faces, so it would be impossible to share, but still interesting to know the result.

Second, does it know some of the not-so-mainstream video game "styles"? Screenshots from any of the following would be perfect: "Don't Starve", "Heroes of Might & Magic III", "Sid Meier's Civilization III", and "StarCraft".

3Swimmer963 (Miranda Dixon-Luinenburg)
"V-J Day in Times Square - Alfred Eisenstaedt, 1945" 
1Mikhail Doroshenko
Interesting. It's actually much worse than I expected it to be. Maybe there was some sort of cleaning to remove duplicate images from the dataset. A few more requests, I would really like to see if you decide to do them. "Simple red dice showing six on top" This is to see whether other dice sides would be coherent with what's on top. "Very cool car" This one is tongue in cheek to see whether it would generate a frozen supercar to maximize both meanings of "cool".
3Swimmer963 (Miranda Dixon-Luinenburg)
"Very cool car" Nope, not frozen! 
3Swimmer963 (Miranda Dixon-Luinenburg)
"Simple red dice showing six on top" Hmmmmmmm. I don't think DALL-E can count to six. 
1Mikhail Doroshenko
Does it fail if asked for "one on top" as well? If yes, then can you also try "Domino with 2 spots and 1 spot" or "Domino 2 and 1"?
4Swimmer963 (Miranda Dixon-Luinenburg)
Pffft it's really flailing here! "Simple red dice showing a one on top". 1/10! also one of them has nine on top, oops. 
1Mikhail Doroshenko
Huh, it really can't do the math. I wonder if Flamingo is any better at it.

Suggestion: Can it do Kyubey from Madoka Magica?

  1. Kyubey from Madoka Magica, photorealistic, high quality anime, 4K, pixiv, digital picture

  2. Kyubey from Madoka Magica swimming in a pool of soul-gems, 4K anime, digital art, pixiv, hyperrealistic beautiful

  3. Kyubey from Puella Magi Madoka Magica in the style of Chi's Sweet Home Anime, 4K digital art anime, pixiv

(Feel free to change these around)

7Swimmer963 (Miranda Dixon-Luinenburg)
"Kyubey from Madoka Magica, white creature with four ears, 4k high quality anime, screenshot from Puella Magi Madoka Magica" (I fiddled with the prompt because I don't think it knows Madoka quiiiiite well enough, and was giving me vaguely Kyubey-themed anime girls.) 
1Evidential
Since OpenAI optimized its output for things as you suggested in your article (dresses, animals), I believe that, hidden in the depths of the AI, it can pull pictures such as Kyubey but requires an un-optimized input (as in broken English, or maybe Japanese for this specific one). So basically telling an alien what to generate in their own language... And the problem is that we don't know what this is with the information currently available (with tests from DALL-E 1).
1Evidential
So it knows the color-scheme and then tries to make some sort of Pokémon off of it. I think maybe the AI believes it is creating a fictional screenshot concept art type thing. Even when you give it a show to go off of, it doesn't understand what to pull from. I think whenever I get access to dall-e 2, I will try figuring out key buzz words to give it. I also think that since I added "pixiv", and most of the pictures there are anthropomorphic, it kinda just tagged that in. I believe the prompting is much more complicated than we think and requires further evaluation. There have to be certain phrases in specific orders that the AI can use better and that the community doesn't know yet.
4Swimmer963 (Miranda Dixon-Luinenburg)
"Kyubey from Madoka Magica swimming in a pool of soul-gems, 4K anime, digital art, pixiv, hyperrealistic beautiful". It's definitely confused on some of these about whether they're anime girls, but it gets the vibe!
1Evidential
Ohhhh so maybe next prompt we can specify that they are not "anime girls". Also, these are very cute lol. The first one is the most accurate, and the fact that the AI understands what Kyubey looks like means that it is probably looking for very specific wording to get it accurate.
3Swimmer963 (Miranda Dixon-Luinenburg)
"Kyubey from Puella Magi Madoka Magica in the style of Chi's Sweet Home Anime, 4K digital art anime, pixiv"
1Evidential
It looks like all the other modifiers override the "in the style of chis sweet home". I think the AI needs: "A Cute Cat Creature Character: Kyubey, Anime Show: Puella Magi Madoka Magica, Style: Screenshot From Anime Show. Exact screenshot, no variations from original artwork"

I used nightcafe.studio, a VQGAN+CLIP webservice, a bunch in March for the worldbuilding.ai entry I was working on. I found it.. okay for generating images that I could then edit in photoshop, but it took many many tries to get something decent. I'd be particularly interested in seeing what DALL-E 2 does with these prompts:

"Beautiful giant sunset over the saltwater marsh with tiny abandoned buildings in the distance" "Glass greenhouse with a beautiful forest inside, with people and drones flying" "People dropping into a beautiful marsh from flying drones on a sunny day" "Happy children hanging from flying drones on a sunny day beautiful storybook illustration"

4Swimmer963 (Miranda Dixon-Luinenburg)
"People falling from robotic flying drones into a beautiful marsh, on a sunny day, matte painting" I think some of the "people" are also robotic? DALL-E is trying though! 
4Swimmer963 (Miranda Dixon-Luinenburg)
"People and drones flying around inside a giant glass greenhouse with a beautiful forest inside, 3D rendering".  I swapped the order because when entered verbatim, the prompt you gave had DALL-E forgetting to include any people or drones. I find it's more likely to actually include smaller or foreground features of a scene if I put them at the front and describe the larger backdrop after.  "3D rendering" is the best I got out of several style prompts (I tried "digital art" and "screenshots from a scifi blockbuster movie" as well.) 
1Randomized, Controlled
Oooooh, these are much better than the ones I got from nightcafe (I just checked, I was actually using "CLIP guided diffusion".) DALL-E 2's marshes and sunset marshes are slightly better than what I was getting.
3Swimmer963 (Miranda Dixon-Luinenburg)
"Beautiful giant sunset over the saltwater marsh with tiny abandoned buildings in the distance, matte painting" (came out better IMO than the original prompt with no style guidance, which sort of forgot about the buildings.) 
2Swimmer963 (Miranda Dixon-Luinenburg)
"Happy children hanging from flying quadcopter drones on a sunny day, beautiful storybook illustration".  Adding "quadcopter" made the drones much easier to recognize! 

And it keeps giving me photorealistic faces as a component of images where I wasn't even asking for that, meaning that per the terms and conditions I can't share those images publicly.

Could you just blur out the faces? Or is that still not allowed?

2Swimmer963 (Miranda Dixon-Luinenburg)
I assume that would be allowed, but then it misses a lot of the point of sharing how impressive DALL-E's art is!
2cwillu
But… Firefly!  Season 2!  It's not all about the lantern jaw…

Amazing write up. Thanks so much. Can you share with us more about the terms and conditions? If you get early access are you allowed to use images for commercial purposes that involve resale of the images? What kind of license is offered for the images? Do you have to credit openai, etc?

Also, you explored your (on point) inferences about openai's AI ethics framework based on aspects of the T&C's (i.e. deep fakes); I'd love to hear more about this. Are there terms that imply other beliefs that openai has about the ethics of AI and DallE2 in particular?

2Swimmer963 (Miranda Dixon-Luinenburg)
Their terms and conditions and content policy/sharing policy are public online: https://labs.openai.com/policies/terms https://labs.openai.com/policies/content-policy https://openai.com/api/policies/sharing-publication/
1Muskwalker
You mention a prohibition on photorealistic faces, but none of these terms appear to say anything about this. There is the prohibition "Do not upload images of people without their consent", but this appears to be bound to the matter of actually-existing humans whose consent could be involved (and notably isn't bound to what style actually-existing humans are depicted in, whether that's photorealistic or otherwise). DALL-E 2's main page does confirm that measures were taken to prevent the AI from making "photorealistic generations of real individuals’ faces"—but this again seems to be specifically about actually-existing humans. Is this guidance given anywhere specifically?
3Swimmer963 (Miranda Dixon-Luinenburg)
The guidance was in a google document they sent me in the email approving my access, which I think used to be the same as the document linked to in the "sharing publication" guidelines, but apparently now isn't?

Close-ups of cute animals. DALL-E can pull off scenes with several elements, and often produce something that I would buy was a real photo if I scrolled past it on Tumblr.

This is not surprising.

I was more puzzled by its inability to draw two characters consistently, the Iron Man + Captain America example was quite weird. I suppose that it basically calculates a score of "Iron Man-ness" and "Captain America-ness" on the whole image and tries to maximize those (the round shield of Captain America seems to be sort of an atomic trait, it was drawn almost perf...

Have you tried generating images with prompts that only describe the general vibe of a picture, without hinting at the content? Something like: "The best painting in history", "A very scary drawing", "A joyous photo".

2Swimmer963 (Miranda Dixon-Luinenburg)
Anyway, I ran "The best painting in history" and there sure is...a variety here...  I think I like #2 best, but #4 is funniest. 
2Swimmer963 (Miranda Dixon-Luinenburg)
At some point I ran "stunningly impressive digital art that is exactly what I ordered" and got the following:

Prompt I'd like to see: "Screenshot from 2020 Star trek the next generation reboot", maybe variations on the decade.  What does futuristic gritty wholesomeness look like?

3Swimmer963 (Miranda Dixon-Luinenburg)
 
3Swimmer963 (Miranda Dixon-Luinenburg)
Sorry, apparently you cannot post images in comments; I will put them at the bottom of the main post. (Also, I ended up asking for the Miyazaki anime because the prompt as-is gave me a bunch of photorealistic faces.)
2Raemon
I'm confused about this. If you copy an image, you should be able to paste it straightforwardly into a comment – what did you end up experiencing? (I just tested this by copying something from your post into a comment and it worked)
6Swimmer963 (Miranda Dixon-Luinenburg)
I will try again, I guess! (I had clicked and dragged it before, and it appeared in the edit window but not the published comment.) 
1cwillu
I was confused, seeing how much it favoured an anime interpretation. Then I read the prompt :p I suppose that was to avoid a public realistic human face term of service violation?
3Swimmer963 (Miranda Dixon-Luinenburg)
Yeah - I feel like it always gives me monstrous blob faces when I want faces, and perfectly normal realistic faces when I'm not even asking for that! (Though this one is more predictable, since "movie screenshots"; for the prompt "coordination" it kept giving me a bunch of guys in a business meeting.) 
1cwillu
I suppose I could be satisfied with an enterprise-d from the 2020 remake of sttng :D
7Swimmer963 (Miranda Dixon-Luinenburg)
"The Enterprise-D in space, screenshot from 2020 Star trek the next generation reboot" here you go.