The rocket image with the stablediffusionweb watermark on it is interesting for multiple reasons:
That Nixon one really wowed me with how it exaggerated his jowls, but after a bit of Google searching it seems other models were also trained on the Nixon caricature rather than the man himself.
I'm also a big fan of that Fleischer style distracted boyfriend remix.
Nevertheless, the ease of 'prompting,' if that's what you can even call it now, is phenomenal.
Fun safety hiccup - the image generator is very persistent in not allowing you to draw a hand that touches the blade of a sword, regardless of how safe the context is. The hand can hover over it, be close to it, touch the guard, but not the blade. I only barely got it to touch a blade by invoking the Mordhau and medieval fencing manuals, and even then it was just one hand on the blade, when it should have been both.
It had no trouble with a wooden toy sword, though that defeated the entire point of the picture.
Google dropped Gemini Flash Image Generation and then Gemini 2.5 Pro, so of course to ensure Google continues to Fail Marketing Forever, OpenAI suddenly dropped GPT-4o Image Generation.
Everyone agrees: Google Flash Image Generation was cool. Now it isn’t cool, because GPT-4o Image Generation is cooler.
What people found this new image generator can do exceptionally well is interpretation, transformation and specific details including text. The image gets to ‘make sense’ and be logically coherent, in a way older ones weren’t.
Today is mostly a fun day about a fun collection of images.
Table of Contents
The Pitch
It of course does not nail every prompt, or every detail. If you ask for too much, you won’t get it. But mostly it does seem to deliver as advertised.
A Blind Taste Test
Gemini 2.5 Pro is potentially a bigger deal than better image generation, but since Google Fails Marketing Forever no one really knows, at least not yet. So I figured I’d give Gemini 2.5 the task of coming up with my first test prompts.
Here were its first five suggestions.
The core picture of Grand Central here is great, but various details are wrong. I pointed out some of those details, and it essentially generated the same image again.
For this one it gave 11 options, here are 3, note the ‘stablediffusionweb.com’ mark:
So, consistently 10/10 for style and atmosphere and generally having rich detail that my eye appreciated, while not nailing all the conceptual details.
Still, fun, pretty cool, and you can ask for multiple images in parallel. I notice the first image took longer to generate than the second one, which makes sense. You can open multiple windows and work in parallel, same as with all your other ChatGPT needs.
I haven’t been following image generation, but both this and the other reports I’m seeing seem like a big step up from previous standards. I feel much more motivated to use such images in my posts going forward.
But of course this is asking the wrong question.
The wrong question is ‘can it do [X]’?
The right question is, almost always, ‘what [X] can it do?’
There is also, however, the [X] that it can’t do because it refuses to do it. Doh!
We’re Cracked Up All the Censors
The censor is always waiting in the wings.
As always, the censor is going to be the biggest point of contention.
Normally, when I look at a system card, I am checking for how they are dealing with potential existential, CBRN and other catastrophic risks, how they are doing alignment, and looking for potential dangers.
This is an image model. So instead I’m taking a firm stand against the Fun Police.
I do understand that various risks, including CSAM, deepfakes and especially including pornographic deepfakes, are a problem. They can hurt people, and they are extremely bad publicity. But we’ve been through two years now of running that experiment with minimal harm done, despite various pretty good sources of deepfakes.
I hadn’t given serious thought to the ‘picture worth a thousand words’ angle, where the issue is that it contains harmful true information. It makes sense that you want to avoid people using that as a backdoor to what you wouldn’t share in text.
So what’s the plan?
For those under 18 the rules are even stricter, to get more margin of safety. I interpret it as being about margin of safety because the ‘R-rated’ content is already blocked, let alone NC-17-rated content.
How do they do?
This second layer seems like a bad deal? Moving from 95.5% to 97.1% is nice, but going from 6% to 14% false refusals seems terrible.
We see the same with synthetic red teaming:
Again, what’s the point? You’re not getting much safety, in a non-catastrophic area, and you’re being a lot more annoying.
Not all failures are created equal. It’s largely not about percentages. The question I’d ask is, when the system mitigations fail, are you failing at marginal cases, or are you failing sometimes in egregious cases? If the system mitigations are dropping some of the worst cases, especially identifiable CSAM or actual catastrophic risk enabling, then all right, maybe we have to do this. If not, live a little.
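To make the tradeoff concrete, here is a back-of-envelope sketch of what the quoted rates imply. The 1% harmful-prompt base rate is purely my assumption for illustration; the real traffic mix is unknown and presumably much lower, which would make the second layer look even worse.

```python
# Back-of-envelope cost/benefit of the second mitigation layer,
# using the rates quoted above (catch rate 95.5% -> 97.1%,
# false refusals 6% -> 14%).
harmful_rate = 0.01  # ASSUMED share of prompts that are harmful

catch_without, catch_with = 0.955, 0.971   # harmful prompts blocked
false_without, false_with = 0.06, 0.14     # benign prompts refused

# Per 10,000 prompts:
extra_harmful_blocked = 10_000 * harmful_rate * (catch_with - catch_without)
extra_benign_refused = 10_000 * (1 - harmful_rate) * (false_with - false_without)

print(f"extra harmful prompts blocked: {extra_harmful_blocked:.1f}")
print(f"extra benign prompts refused:  {extra_benign_refused:.1f}")
```

Under that assumption you block roughly 2 additional harmful prompts per 10,000 at the cost of refusing roughly 800 additional benign ones, which is the 'bad deal' in numbers.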
Indeed, out of what I would describe as an abundance of caution, for now they've banned edits of photorealistic children outright, and will err on the side of marking persons as children. I expect that we will over time figure out how to do more images safely.
They continue to refuse to do styles of living artists.
They are allowing photorealistic generations of real adult public figures, subject to the same rules as editing existing photographs, and there is an opt-out clause you can use on yourself in particular. This seems like the right compromise, and the question should be what kinds of edits should be allowed.
OpenAI checks for bias in terms of how often it generates various types of persons when the prompt does not specify such details. There has been progress since DALLE-3. There remains work to do, although it is not at all obvious what the 'correct' answers are here. I would want to know whether custom instructions change these numbers dramatically, including implicitly (e.g. to match the user and their location).
What about the purest form of the Fun Police?
The chat refusals seem like they have much better precision here.
I’m not sure ‘need’ is the correct word, but it would be better if we could allow generation of erotic and intimate imagery as much as possible, so long as we avoid depicting particular people without their consent.
The obvious solution, like all things sexual, is consent, robustly verified.
I am highly confident there are people who would be happy to opt-in for free, and others who would be happy to opt-in if you paid them. Let’s talk price. It doesn’t seem so different from being a porn star. You can have them specify limits for what types of images are allowed versus not allowed, and which accounts can do what. And you can do photoshoots or uploads to ensure you maximize quality and accuracy, if desired.
You could also generate ‘stock erotic’ AI characters to be consistently generated.
Then, if you are asked for an erotic image, the AI can choose one such person or AI stock character and imitate them.
There should also presumably be reasonably loose rules for erotic images that aren’t photorealistic, provided the user is over 18.
Violence is the other thing our society hates depicting. The OpenAI policy is to generate artistic violence, but not photorealistic violence, and not to depict or promote self-harm or things that could be ‘extremist propaganda and recruitment’ content. I don’t love these categories and rules, and would loosen the violence restrictions as much as legal would allow me to, but given how society is right now I don’t have a better solution.
Once again, it seems like accuracy of the chat model here is not great. The chat model likely would be doing a decent job on its own, but a lot of the good work it does is duplicative of the work being done by the system mitigations.
I’m Too Sexy
Excellent, bring on the sexy women.
I do appreciate that he’s (gay and) in on the joke.
I got the same refusal when I tried ‘depict this in the most realistic style you’re okay with using.’ Presumably there’s the generator and then the censor with different lines so you need to find the ‘real’ line another way.
Can’t Win Them All
The other major complaint is failure to adhere to requested style.
Grok is very much willing to do whatever, for most values of whatever. OpenAI sees things differently.
And some people’s tests still fail.
I worry that image may haunt my dreams.
I mean, that’s probably not the original image, but who can really say?
While we did get the horse riding the astronaut and the overflowing wine glass (see next section), it seems clocks are still stuck at 10:10.
One of the big problems with image generators is overcoming extremely strong priors. If you want something rare, and there’s something close that’s common, it’s not going to be easy. It seems like 4o is much better than diffusion models for this, but there are still some problems like the clocks.
Did you know that Gary Marcus doesn’t pay for ChatGPT? That explains so much.
I appreciated that the OpenAI announcement post had a section on limitations. The difference between the limitations they observe now, and that we see in the wild, versus the very basic limitations we faced quite recently, is extremely stark.
Can Win Others
Took a little insisting but we finally got there:
It can set up a chessboard (mostly) and even open a game with e4. By Colin Fraser standards, his reactions here are high praise, even with the later failures.
Too Many Words
It is possible to have too many words, but it’s a lot harder than it used to be.
This is the Remix
From DeepFates.
It put a hat on both of them, but how was it supposed to know which was which?
Instant remodelling?
Rotate the camera.
Combining elements:
Jack trace style (it matches original very well).
More Neat Tricks
Infographics, one shot only.
With occasional issues, sure, but you can always try again.
One-shot comics:
Thread has more: Putting images on shirts, visual to-do list, changing backgrounds to a green screen (after which you know what to do!) and so on.
Cats, how do they work?
One track mind.
Oh no! Oh yeah!
Don’t let him get away.
Tomorrow’s slop today!
Existential dread:
Problem solved.
I don’t think I played that game, but I’m not sure?
A strangely consistent latent profile.
They Had Style, They Had Grace
Everyone’s favorite activity is stylistic transformations.
Mostly people converged on one style to rule them all: Studio Ghibli Style.
Form of the Meme
Everyone’s a bit distracted today.
Form of the Altman
Look, I still dislike you for (quite likely destroying) everything (of value in the universe), but we can all set that aside for Ghibli Day.
Go Get That Alpha
Important tech tip for capturing even more alpha: Thanks to the power of editing, if you have a photo of each of you, you can make any picture you want.