Dall-E 3

p.b.

This is a linkpost for https://openai.com/dall-e-3

It seems like we do not yet have a post about Dall-E 3. It was announced a few days ago and can now be tried via Bing Image Creator (long wait times).

I think it is worth mentioning, because it seems like a big jump.

I spent quite a lot of time playing around with Dalle mini, Dall-E 2 and Stable Diffusion in its many flavours, and it was generally very hard to get specifically what I wanted, to the point where I was disillusioned and few of my projects materialised.

A teddy bear and a panda bear holding hands - it took until SDXL for that to work out at least occasionally (the wedding was long over).

Maritime paintings of awesome sailing ships - the rigging was generally completely messed up (finetuning on Montague Dawson and Co didn't help).

Comic books aka Prince Valiant fan-fic - impossible to reliably get specific poses or situations involving several characters.

Mammoths in space - don't ask, but I had to do the space hangar in SD and the mammoths in Dall-E and bash it all together.

The Stable Diffusion ecosystem was built around these shortcomings, with ControlNets, finetuned models and increasingly complicated workflows - but now it seems that with Dall-E 3 many more things just work right out of the box.

From my few experiments and the examples I have seen elsewhere this might be the point where image generation just works.

Maybe I am just terrible at prompting - but so far, while very impressed with the tech in principle, I have found image generators useless for the kind of images I was interested in, and that problem persists with this edition. When I am looking to generate art, I am looking to make something new that I have not seen, and over and over I have had the impression that a human would have understood my desire and created something novel that matched, while the AI just would not. :(

I struggle to get the AI to produce attractive and functional non-binary/androgynous/queer characters in the setting I want, and suspect that this is an issue for any depiction of beauty and exploration beyond the mainstream. Attractive is seen as synonymous with marked sexual dimorphism, e.g. hyperfeminine characters - unless you specify against it, it keeps drifting there. Even if your starting characters are famous for their androgyny (like Ruby Rose), it will add long hair, huge breasts and blush, and remove muscle definition and prominent facial bones, as well as clothing; I can ask for a picture of her in literal armour, and still get cleavage so low it looks like her tits will fall out and she is just asking to be stabbed there. And if I negative prompt those out, I just get an amorphous blob, or, at best, Justin bloody Bieber. If I manage something somewhat acceptable, the moment I add further unusual constraints on style and context it collapses again. This is also why these AI generators that insert your image into fantasy settings were very popular with people fitting the binary, and very alienating for those who did not; they drift any input into mainstream attractiveness. There are images of real people that capture what I am looking for, yet this AI can't - it reproduces the very mainstream I want to make art against.

The fact that the AI cannot count also drives me nuts. For a DnD campaign, I play a druid who retains eight spider-like eyes through all wildshapes, and loved the idea of bringing sample images for the campaign. Eight, not ten. So I basically want eyes like this https://i.pinimg.com/originals/bb/e6/4d/bbe64ddd3308cb4c2409bf103b023768.jpg incorporated into various animal designs, on their heads. But it keeps miscounting eyes, giving me too many or too few, plonking them in various parts of the picture rather than incorporating them into the head design, and the designs keep retaining two human eyes on top of everything else.

And then especially when I want to combine this with a challenging animal design - say, a Deinonychus which is scientifically accurate (feathers, outturned wrists) but does not look dorky - it goes completely off the rails. If you are interested in dinosaurs, the depiction of their wrists as broken in most movies is bothersome, but having the AI correct it requires it to understand wrist position, when it can barely get the number of fingers or possible turns of joints in regular humans right.

I also struggle with getting contradictory themes to work; e.g. having a figure incorporate horror and scary elements, but in a setting where it is acting protective and loving, albeit with dark undertones (say a character like Schaffa Guardian Warrant in the Fifth Season).

The more complex it gets, the less one gets out of it. E.g. for a different problem, I was very intrigued by German rye demon ("Roggenmuhme") mythology - demons that were invented to scare children from playing in the rye fields and ruining the crops. They are strange mixes - they retain characteristics of earlier fertility Gods in that they are overtly sexual, but to trick the children into mistaking farming machinery for the demons, they also incorporate a lot of metal parts, and finally, in acknowledgement of how old the stories are, the demons are elderly women, and of course the whole imagery is tweaked towards horror. This comes together into something terrifying in a unique way - long, scythe-like arms that end in iron claws dragging children's heads and limbs, breasts sagging almost to the ground and leaking poisonous tar, and this whole cannibalistic creature hiding in a rye field, so at first glance, you think you are seeing machinery discarded on a peaceful field, and on the second, you spot that the shadows connect and that something is lurking in there, about to catch you. I wanted to see that in a picture. But when given the term, the AI keeps giving me images of rye mills (Roggenmühle); and when I describe the thing itself, the closest I get is an overt, creepy-looking old woman in a field with longer than usual fingers - nothing lurking, nothing at the border between human and machine, nothing elderly yet overtly sexual, let alone all those things at once.

I am also super bothered by the fact that the physics and ecology of the images are off, because they are generated by someone who has not experienced reality. You ask moss to grow over part of your picture - but it doesn't grow where moisture would collect, and has no relation to the location of the sun or direction of wind. Or you ask for "moonrise at midnight" - and if the moon rises at midnight, it should be a half moon because of the moon phases. But it is a full moon.
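To spell out the moonrise point, here is a rough sketch under idealised assumptions: the sun rises at 06:00 local solar time, and moonrise slips evenly around the clock over one lunar cycle (new moon rising with the sun, full moon rising at sunset). The function names are made up for illustration, and real moonrise times vary with latitude and season.

```python
import math

def phase_fraction_from_moonrise(moonrise_hour):
    """Fraction of the lunar cycle (0 = new, 0.5 = full), assuming sunrise at 06:00."""
    return ((moonrise_hour - 6.0) % 24.0) / 24.0

def illuminated_fraction(phase_fraction):
    """Share of the lunar disc that is lit, from the cycle fraction."""
    return (1.0 - math.cos(2.0 * math.pi * phase_fraction)) / 2.0

def phase_name(phase_fraction):
    """Nearest of the four principal phases."""
    names = ["new moon", "first quarter", "full moon", "last quarter"]
    return names[round(phase_fraction * 4) % 4]

midnight = phase_fraction_from_moonrise(0.0)   # moonrise at 00:00
sunset = phase_fraction_from_moonrise(18.0)    # moonrise at 18:00

print(phase_name(midnight), illuminated_fraction(midnight))
print(phase_name(sunset), illuminated_fraction(sunset))
```

Under these assumptions a midnight moonrise lands on the last quarter, with about half the disc lit, while a full moon rises around sunset.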

All in all, image generation gives me far more of a "The AI really does not understand what it is doing" vibe than text generation currently does, which I find surprising - I would have expected the opposite, with text being more off.

I think the hyperfeminine traits are due to finetuning - you should get a lot less of that with the Stable Diffusion base model.

Eight eyes - yeah, counting is hard, but it's also hard to put eight eyes into a face built for two. If I were trying to get that, I would probably try ControlNet, where you can add a ton of eyes to a line drawing and use that as a starting point for the image creation. (Maybe create an image without the eyes first. Apply a Canny edge detector or similar, multiply the eyes, and then use the Canny edge ControlNet.)
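The "multiply the eyes" step could look roughly like the toy sketch below. It only covers that middle step and assumes the edge map is already available as a grid of 0/1 values; the actual edge detection and the ControlNet generation would be done with real tools and are not shown. The helper names, the tiny eye outline and the positions are all made up for illustration.

```python
def paste_patch(edge_map, patch, top, left):
    """Return a copy of edge_map with patch merged in at (top, left)."""
    out = [row[:] for row in edge_map]
    for dy, patch_row in enumerate(patch):
        for dx, value in enumerate(patch_row):
            y, x = top + dy, left + dx
            # Skip anything that would fall outside the image.
            if 0 <= y < len(out) and 0 <= x < len(out[0]):
                out[y][x] = max(out[y][x], value)
    return out

def multiply_eyes(edge_map, eye_patch, positions):
    """Stamp the same eye outline at every requested position."""
    for top, left in positions:
        edge_map = paste_patch(edge_map, eye_patch, top, left)
    return edge_map

# Toy data: a 3x3 "eye" outline and an empty 8x16 edge map.
EYE = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
blank = [[0] * 16 for _ in range(8)]

# Two rows of four eyes: eight in total, placed by hand.
positions = [(1, 1 + 4 * i) for i in range(4)] + [(4, 1 + 4 * i) for i in range(4)]
result = multiply_eyes(blank, EYE, positions)

for row in result:
    print("".join("#" if v else "." for v in row))
```

Since the eyes are placed by hand in the conditioning image, their count and position no longer depend on the model's ability to count.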

Your Roggenmuhme should also be within the realm of the possible, I think, but I am not going to dwell on that, because I want to sleep at night.

For correct moon phases and Deinonychus's wrist position you'll have to wait for AGI.

These models are still less for making you what you already have in mind than for trying out creative and outlandish prompts, varying and combining them, being occasionally surprised by a picture looking really great, and getting a feel for the "talents" of the model.

Indeed, but an image generator is supposed to be useful for something other than generating an endless scroll of generic awesome pictures with wonky details; this kind of thing becomes boring really quickly. What most people actually need from an image generator is a sufficiently good replacement for the drawing skill they don't have.

To be clear, I share Portia's frustration here. I've been trying to get image generators to generate DnD portraits for months, and if the character is something more complicated than a Generic Tolkienian Elf or similar, you have to play increasingly complex shenanigans to obtain passable results. For example, I really really couldn't convince the AI to generate an elf literally made of green metal rather than just dressed in green (this was supposed to represent the effect of a particular prestige class turning the character into a construct).

SDXL gives me something like this. But I don't know, not what you had in mind?

I used this hugging face space: https://huggingface.co/spaces/google/sdxl

And a prompt roughly: An elven face made out of green metal - dungeons and dragons, fantasy, awesome lighting

Huh, actually I never tried just the face; I needed at least the upper torso and preferably the full figure.

Anyway, I spent a few hours today toying with that generator (I previously used mostly this). A very simple prompt like "An elf made out of green metal" can produce a somewhat okay result, but the elf will be either naked or dressed head to toe in green. You can try to add more bits to the prompt in a controlled manner: hair color/hairstyle, outfit/dress color, and the like, but the more details you add, the more the model is prone to forget some of them, and the first to be forgotten is often the most important (being made of green metal).

To be clear, the success rate is not 0%. I was eventually able to obtain an image kinda resembling what I wanted, but I had to sit through >200 bad images and it definitely wasn't an easy task. For these kinds of things, we are totally not at the point where image generation "just works" (if you instead need a generic fantasy elf, sure, then it just works on the first try).

Yeah, though Dall-E 3 generally has better language understanding than other text-to-image models. (See e.g. here) I still think the "experimental" approach is more interesting for me personally than the deliberate one you describe. For example, with the previous Bing Image Creator (Dall-E 2.5), I "explored" photographs of fictional places, like Tlön, Uqbar, and strange art in an abandoned museum in Atlantis. It is a process of discovery rather than targeted creation. It's probably personal preference. I'm not very creative, so I wouldn't know what to draw if I could draw.

Thanks for sharing the link

I think the link to the OpenAI site won't get you to the actual image creator yet; it is still listed as coming soon.

They were referencing the Bing Image Creator, which states it is powered by DALL-E, but afaik not which version: https://www.bing.com/images/create. Similarly, for a while they did not state which GPT version they were using for Bing Chat. But there, they ended up using version four for two of the modes.