The rocket image with the stablediffusionweb watermark on it is interesting for multiple reasons:
That Nixon one really wowed me with how it exaggerated his jowls, but after a bit of Google searching it seems other models were also trained on the Nixon caricature rather than the man himself.
I'm also a big fan of that Fleischer style distracted boyfriend remix.
Nevertheless, the ease of 'prompting,' if that's what you can even call it now, is phenomenal.
Fun safety hiccup - the image generator is very persistent in not allowing you to draw a hand that touches the blade of a sword, regardless of how safe the context is. The hand can hover over it, be close to it, touch the guard, but not the blade. I only barely got it to touch a blade by invoking the Mordhau and medieval fencing manuals, and even then it was just one hand on the blade, when it should have been both.
It had no trouble with a wooden toy sword, though that defeated the entire point of the picture.
Google dropped Gemini Flash Image Generation and then Gemini 2.5 Pro, so of course to ensure Google continues to Fail Marketing Forever, OpenAI suddenly dropped GPT-4o Image Generation.
Everyone agrees: Google Flash Image Generation was cool. Now it isn’t cool, because GPT-4o Image Generation is cooler.
What people found this new image generator can do exceptionally well is interpretation, transformation and specific details including text. The image gets to ‘make sense’ and be logically coherent, in a way older ones weren’t.
Today is mostly a fun day about a fun collection of images.
Table of Contents
The Pitch
It of course does not nail every prompt, or every detail. If you ask for too much, you won’t get it. But mostly it does seem to deliver as advertised.
A Blind Taste Test
Gemini 2.5 Pro is potentially a bigger deal than better image generation, but since Google Fails Marketing Forever no one really knows, at least not yet. So I figured I’d give Gemini 2.5 the task of coming up with my first test prompts.
Here were its first five suggestions.
The core picture of Grand Central here is great, but various details are wrong. I pointed out some of those details, and it essentially generated the same image again.
For this one it gave 11 options, here are 3, note the ‘stablediffusionweb.com’ mark:
So, consistently 10/10 for style and atmosphere and generally having rich detail that my eye appreciated, while not nailing all the conceptual details.
Still, fun, pretty cool, and you can ask for multiple images in parallel. I notice the first image took longer to generate than the second one, which makes sense. You can open multiple windows and work in parallel, same as with all your other ChatGPT needs.
I haven’t been following image generation, but both this and the other reports I’m seeing seem like a big step up from previous standards. I feel much more motivated to use such images in my posts going forward.
But of course this is asking the wrong question.
The wrong question is ‘can it do [X]’?
The right question is, almost always, ‘what [X] can it do?’
There is also, however, the [X] that it can’t do because it refuses to do it. Doh!
We’re Cracked Up All the Censors
The censor is always waiting in the wings.
As always, the censor is going to be the biggest point of contention.
Normally, when I look at a system card, I am checking for how they are dealing with potential existential, CBRN and other catastrophic risks, how they are doing alignment, and looking for potential dangers.
This is an image model. So instead I’m taking a firm stand against the Fun Police.
I do understand that various risks, including CSAM, deepfakes and especially including pornographic deepfakes, are a problem. They can hurt people, and they are extremely bad publicity. But we’ve been through two years now of running that experiment with minimal harm done, despite various pretty good sources of deepfakes.
I hadn’t given serious thought to the ‘picture worth a thousand words’ angle, where the issue is that it contains harmful true information. It makes sense that you want to avoid people using that as a backdoor to what you wouldn’t share in text.
So what’s the plan?
For those under 18 the rules are even stricter, to get more margin of safety. I interpret it as being about margin of safety because the ‘R-rated’ content is already blocked, let alone NC-17-rated content.
How do they do?
This second layer seems like a bad deal? Moving from 95.5% to 97.1% is nice, but going from 6% to 14% false refusals seems terrible.
We see the same with synthetic red teaming:
Again, what’s the point? You’re not getting much safety, in a non-catastrophic area, and you’re being a lot more annoying.
Not all failures are created equal. It’s largely not about percentages. The question I’d ask is, when the system mitigations fail, are you failing at marginal cases, or are you failing sometimes in egregious cases? If the system mitigations are dropping some of the worst cases, especially identifiable CSAM or actual catastrophic risk enabling, then all right, maybe we have to do this. If not, live a little.
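To make the tradeoff concrete, here is a back-of-envelope sketch of what the quoted rates imply. The 1% harmful-prompt base rate is purely my assumption for illustration; the real traffic mix is unknown and presumably much lower, which would make the second layer look even worse.

```python
# Back-of-envelope cost/benefit of the second mitigation layer,
# using the rates quoted above (catch rate 95.5% -> 97.1%,
# false refusals 6% -> 14%).
harmful_rate = 0.01  # ASSUMED share of prompts that are harmful

catch_without, catch_with = 0.955, 0.971   # harmful prompts blocked
false_without, false_with = 0.06, 0.14     # benign prompts refused

# Per 10,000 prompts:
extra_harmful_blocked = 10_000 * harmful_rate * (catch_with - catch_without)
extra_benign_refused = 10_000 * (1 - harmful_rate) * (false_with - false_without)

print(f"extra harmful prompts blocked: {extra_harmful_blocked:.1f}")
print(f"extra benign prompts refused:  {extra_benign_refused:.1f}")
```

Under that assumption you block roughly 2 additional harmful prompts per 10,000 at the cost of refusing roughly 800 additional benign ones, which is the 'bad deal' in numbers.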
Indeed, out of what I would describe as an abundance of caution, for now they've banned edits of photorealistic children outright, and will err on the side of marking persons as children. I expect that we will over time figure out how to do more images safely.
They continue to refuse to do styles of living artists.
They are allowing photorealistic generations of real adult public figures, subject to the same rules as editing existing photographs, and there is an opt-out clause you can use on yourself in particular. This seems like the right compromise, and the question should be what kinds of edits should be allowed.
OpenAI checks for bias in terms of how often it generates various types of persons when the prompt does not specify such details. There has been progress since DALLE-3. There remains work to do, although it is not at all obvious what the 'correct' answers are here. I would want to know whether custom instructions change these numbers dramatically, including implicitly (e.g. to match the user and their location).
What about the purest form of the Fun Police?
The chat refusals seem like they have much better precision here.
I’m not sure ‘need’ is the correct word, but it would be better if we could allow generation of erotic and intimate imagery as much as possible, so long as we avoid depicting particular people without their consent.
The obvious solution, like all things sexual, is consent, robustly verified.
I am highly confident there are people who would be happy to opt-in for free, and others who would be happy to opt-in if you paid them. Let’s talk price. It doesn’t seem so different from being a porn star. You can have them specify limits for what types of images are allowed versus not allowed, and which accounts can do what. And you can do photoshoots or uploads to ensure you maximize quality and accuracy, if desired.
You could also generate ‘stock erotic’ AI characters to be consistently generated.
Then, if you are asked for an erotic image, the AI can choose one such person or AI stock character and imitate them.
There should also presumably be reasonably loose rules for erotic images that aren’t photorealistic, provided the user is over 18.
Violence is the other thing our society hates depicting. The OpenAI policy is to generate artistic violence, but not photorealistic violence, and not to depict or promote self-harm or things that could be ‘extremist propaganda and recruitment’ content. I don’t love these categories and rules, and would loosen the violence restrictions as much as legal would allow me to, but given how society is right now I don’t have a better solution.
Once again, it seems like accuracy of the chat model here is not great. The chat model likely would be doing a decent job on its own, but a lot of the good work it does is duplicative of the work being done by the system mitigations.
I’m Too Sexy
Excellent, bring on the sexy women.
I do appreciate that he’s (gay and) in on the joke.
I got the same refusal when I tried ‘depict this in the most realistic style you’re okay with using.’ Presumably there’s the generator and then the censor with different lines so you need to find the ‘real’ line another way.
Can’t Win Them All
The other major complaint is failure to adhere to requested style.
Grok is very much willing to do whatever, for most values of whatever. OpenAI sees things differently.
And some people’s tests still fail.
I worry that image may haunt my dreams.
I mean, that’s probably not the original image, but who can really say?
While we did get the horse riding the astronaut and the overflowing wine glass (see next section), it seems clocks are still stuck at 10:10.
One of the big problems with image generators is overcoming extremely strong priors. If you want something rare, and there’s something close that’s common, it’s not going to be easy. It seems like 4o is much better than diffusion models for this, but there are still some problems like the clocks.
Did you know that Gary Marcus doesn’t pay for ChatGPT? That explains so much.
I appreciated that the OpenAI announcement post had a section on limitations. The difference between the limitations they observe now, and that we see in the wild, versus the very basic limitations we faced quite recently, is extremely stark.
Can Win Others
Took a little insisting but we finally got there:
It can set up a chessboard (mostly) and even open a game with e4. By Colin Fraser standards, his reactions here are high praise, even with the later failures.
Too Many Words
It is possible to have too many words, but it’s a lot harder than it used to be.
This is the Remix
From DeepFates.
It put a hat on both of them, but how was it supposed to know which was which?
Instant remodelling?
Rotate the camera.
Combining elements:
Jack trace style (it matches original very well).
More Neat Tricks
Infographics, one shot only.
With occasional issues, sure, but you can always try again.
One-shot comics:
Thread has more: Putting images on shirts, visual to-do list, changing backgrounds to a green screen (after which you know what to do!) and so on.
Cats, how do they work?
One track mind.
Oh no! Oh yeah!
Don’t let him get away.
Tomorrow’s slop today!
Existential dread:
Problem solved.
I don’t think I played that game, but I’m not sure?
A strangely consistent latent profile.
They Had Style, They Had Grace
Everyone’s favorite activity is stylistic transformations.
Mostly people converged on one style to rule them all: Studio Ghibli Style.
Form of the Meme
Everyone’s a bit distracted today.
Form of the Altman
Look, I still dislike you for (quite likely destroying) everything (of value in the universe), but we can all set that aside for Ghibli Day.
Go Get That Alpha
Important tech tip for capturing even more alpha: Thanks to the power of editing, if you have a photo of each of you, you can make any picture you want.