I happen to be a doctor with an interest in LW and associated concerns, who discovered a love for ML far too late for me to reskill and embrace it.
My younger cousin is a mathematician currently doing an integrated Masters and PhD. About a year back, I'd been trying to demonstrate to him the ever-increasing capability of SOTA LLMs at maths, and asked him to raise questions that they couldn't trivially answer.
He chose "is the one-point compactification of a Hausdorff space itself Hausdorff?".
At the time, all the models invariably insisted that the answer was no. I ran the prompt multiple times on the best models available then. My cousin said it was incorrect, and proceeded to sketch out a proof (which was quite simple once I understood that much of the jargon represented rather simple ideas at their core).
I ran into him again when we were both visiting home, and I decided to run the same question through the latest models to gauge their improvements.
I tried Gemini 1206, Gemini Flash Thinking Experimental, Claude 3.5 Sonnet (New) and GPT-4o.
Other than reinforcing the fact that AI companies have abysmal naming schemes, to my surprise almost all of them gave the correct answer. The exception was Claude, though it was hampered by Anthropic being cheapskates and turning on the concise-responses mode.
I showed him how the extended reasoning worked for Gemini Flash (it doesn't hide its thinking tokens, unlike o1), and I could tell he was shocked and impressed; he couldn't fault the reasoning process it and the other models went through.
To further shake him up, I had him find some recent homework problems he'd been assigned in his course (he's in a top-3 maths program in India) and used the multimodality inherent in Gemini to just take a picture of an extended question and ask it to solve it.* It did so, again, flawlessly.
*So I wouldn't have to go through the headache of reproducing it in LaTeX or Markdown.
He then demanded we try another, and this time he expressed doubts that the model could handle a problem that was compact, yet vague in the absence of context it hadn't been given. No surprises again.
He admitted that this was the first time he took my concerns seriously, though he got a rib in by saying doctors would be off the job market before mathematicians. I conjectured that was unlikely, given that maths and CS performance are more immediately valuable to AI companies: they're easier to drop in and automate, and they have direct benefits for ML itself, the goal being to replace human programmers and have the models recursively self-improve. Not to mention that performance in those domains is easier to make superhuman with RL and automated theorem provers for ground truth. Oh well, I reassured him, we're probably all screwed, and in short order, to the point where there's not much benefit in quibbling about whose layoffs come a few months later.
I similarly felt in the past that by the time computers were Pareto-better than me at math, there would already be mass layoffs. I no longer believe this to be the case at all, and have been thinking about how I should orient myself in the future. I was very fortunate to land an offer for an applied-math research job in the next few months, but my plan is to devote a lot more energy to networking + building people skills while I'm there, instead of just hyperfocusing on learning the relevant fields.
o1 (standard, not pro) is still not the best at math reasoning, though. I occasionally give it linear algebra lemmas that I suspect it should be able to help with, but it always makes major errors. Here are some examples:
- I have a finite-dimensional real vector space $V$ equipped with a symmetric bilinear form $B$ which is not necessarily non-degenerate. Let $n$ be the dimension of $V$, let $K$ be the subspace of $V$ with $K = \{v \in V : B(v, w) = 0 \text{ for all } w \in V\}$, and let $k$ be the dimension of $K$. Let $W_1$ and $W_2$ be $(n+k)$-dimensional real vector spaces that contain $V$ and are equipped with symmetric non-degenerate bilinear forms that extend $B$. Show that there exists an isometry from $W_1$ to $W_2$ that restricts to the identity on $V$. To its credit, it gave me some references that helped me prove this, but its argument was completely bogus.
- Let $V$ be a real finite-dimensional vector space equipped with a symmetric non-degenerate bilinear form $B$, and let $g$ be an isometry of $V$. Prove or disprove that the restriction of $B$ to the fixed-point subspace of $g$ on $V$ is non-degenerate. (Here it sort of had the right idea, but its counter-examples were never right.)
- Does there exist a symmetric irreducible square matrix with diagonal entries $2$ and non-positive integer off-diagonal entries such that the corank is more than $1$? Here it gave a completely wrong proof of "no" and kept gaslighting me into believing that the general idea must work and that it's a standard result in the field, following from a book that I happened to have actually read. It kept insisting on this, no matter how many times I corrected its errors, until I presented it with an example of a corank-1 matrix that made it clear its idea was unfixable.
I have a strong suspicion that o3 will be much better than o1 though.
Update: R1 found a matrix for bullet point 3 after I prompted it to try 16×16. It's $2I$ minus the adjacency matrix of the tesseract graph.
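For anyone who wants to sanity-check that claim without working out the spectrum by hand, here's a quick numpy sketch (mine, not R1's output; the bit-flip construction of the tesseract graph $Q_4$ is just the standard one) confirming that the matrix satisfies the conditions and has corank 4:

```python
# Build M = 2I - A for the tesseract graph Q4 on 16 vertices and verify:
# symmetric, diagonal entries 2, non-positive integer off-diagonal entries,
# and corank greater than 1.
import numpy as np

n = 16
# Vertices of Q4 are 4-bit strings; two vertices are adjacent iff they differ in exactly one bit.
A = np.array([[1 if bin(i ^ j).count("1") == 1 else 0 for j in range(n)] for i in range(n)])

M = 2 * np.eye(n, dtype=int) - A

print("symmetric:", np.array_equal(M, M.T))                       # True
print("diagonal entries:", set(np.diag(M)))                       # {2}
print("off-diagonal entries:", set(M[~np.eye(n, dtype=bool)]))    # {0, -1}, non-positive
print("corank:", n - np.linalg.matrix_rank(M))                    # 4 > 1
```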
Thank you for your insight. Out of idle curiosity, I tried putting your last query into Gemini 2 Flash Thinking Experimental and it told me yes first-shot.
Here's the final output; it's absolutely beyond my ability to evaluate, so I'm curious whether you think it went about it correctly. I can also share the full CoT if you'd like, but it's lengthy:
(Image since even copying the markdown renders it ugly here)
corank has to be more than 1, not equal to 1. I'm not sure if such a matrix exists; the reason I was able to change its mind by supplying a corank-1 matrix was that its kernel behaved in a way that significantly violated its intuition.
> I decided to run the same question through the latest models to gauge their improvements.
Not exactly sure if there is much advantage at all in your having done this, but I feel inclined to say thank you for persisting in persuading your cousin to at least consider concerns regarding AI, even if he filters those concerns mostly through job automation rather than others, such as a global catastrophe.
In my own life, over the last several years, I have found it difficult to persuade those close to me to really consider concerns from AI.
I thought that watching capabilities advance before their eyes might prompt them to think more about their own future, and about how they might behave or live differently conditional on different AI capabilities, but this has been of little avail.
Expanding capabilities seem to dissolve skepticism best, but conversations seem not to have had as large an effect as I would have expected. I've not thought or acted as much as I want to on how to coordinate more of humanity around decision-making regarding AI (or the consequences of AI), partially because I do not have a concrete notion of where to steer humanity, or a justification for where to steer (even if I knew it was highly likely I was actually contributing to the steering through my actions).
I also think that giving medical advice, being a doctor, and so on requires you to have a specific degree, whereas anyone can do math. Math also seems somewhat easier to verify. I would say he is more likely to lose his job sooner.
Who knows how long regulatory inertia might last? I agree it'll probably add at least a few years to my employability, past the date where an AI can diagnose, plan and prescribe better than I can. It might not be something to rely on, if you end up with a regime where a single doctor rubberstamps hundreds of decisions, in place of what a dozen doctors did before. There's not that much difference between 90% and 100% unemployment!
Moderately interesting news in AI image gen:
It's been a good while since we've had AI chat assistants able to generate images on user request. Unfortunately, for about as long, we've had people being peeved at the disconnect between what they asked for, and what they actually got. Particularly annoying was the tendency for the assistants to often claim to have generated what you desired, or that they edited an image to change it, without *actually* doing that.
This was an unfortunate consequence of the LLM (the assistant persona you speak to) and the *actual* image generator that spits out images from prompts being two entirely separate entities. The LLM doesn't have any more control over the image model than you do when running something like Midjourney or Stable Diffusion: it sends a prompt through a function call, gets an image in response, and then tries to modify its prompts to meet user needs. Depending on how lazy the devs are, it might not even be 'looking' at the final output at all.
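To make that separation concrete, here's a toy sketch in Python (purely illustrative; the class and function names are made up, not any vendor's real API):

```python
# Toy illustration of the old "two separate models" setup: the chat LLM never
# touches pixels, it just forwards a rewritten prompt to a separate image model
# and hands the result back.
from dataclasses import dataclass

@dataclass
class Image:
    prompt_used: str  # stand-in for actual pixel data

def diffusion_model(prompt: str) -> Image:
    """Placeholder for a separate text-to-image model (e.g. a diffusion model)."""
    return Image(prompt_used=prompt)

def chat_assistant(user_request: str) -> Image:
    # The "LLM" can only rewrite the prompt; it cannot edit the returned image,
    # and it may never even inspect it.
    rewritten_prompt = f"high quality photo, {user_request}"
    return diffusion_model(rewritten_prompt)

print(chat_assistant("a cat wearing a top hat").prompt_used)
```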
The image models, on the other hand, are a fundamentally different architecture, usually being diffusion-based (Google a better explanation, but the gist of it is that they hallucinate iteratively from a sample of random noise till it resembles the desired image) whereas LLMs use the Transformer architecture. The image models do have some understanding of semantics, but they're far stupider than LLMs when it comes to understanding finer meaning in prompts.
This has now changed.
Almost half a year back, OpenAI [teased](https://x.com/gdb/status/1790869434174746805) the ability of their then unreleased GPT-4o to generate images *natively*. It was the LLM (more of a misnomer now than ever) actually making the image, in the same manner it could output text or audio.
The LLM doesn’t just “talk” to the image generator - it *is* the image generator, processing everything as tokens, much like it handles text or audio.
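If it helps to picture it, here's a deliberately crude toy (my own simplification with made-up token names, not OpenAI's or Google's actual design) of what "everything as tokens" means: a single decode loop, where some of the emitted tokens happen to be image-patch codes that a decoder turns back into pixels.

```python
# One autoregressive model emits a single stream of tokens; there is no second
# generative model being prompted behind the scenes.
from typing import List

def multimodal_lm_next_token(context: List[str]) -> str:
    """Stand-in for one decoding step of a single multimodal transformer."""
    scripted_output = ["Sure!", "<image>", "<patch_017>", "<patch_512>", "<patch_203>", "</image>", "<eos>"]
    return scripted_output[len(context)]

def image_decoder(patch_tokens: List[str]) -> str:
    """Stand-in for a learned decoder mapping discrete image codes back to pixels."""
    return f"<{len(patch_tokens)}-patch image>"

tokens: List[str] = []
while not tokens or tokens[-1] != "<eos>":
    tokens.append(multimodal_lm_next_token(tokens))

patches = [t for t in tokens if t.startswith("<patch_")]
print(" ".join(t for t in tokens if not t.startswith("<patch_")), image_decoder(patches))
```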
Unfortunately, we'd had nothing but radio silence since then, barring a few leaks of front-end code suggesting OAI would finally switch image generation from DALL-E 3 to GPT-4o, as well as Altman's assurances that they hadn't canned the project on safety grounds.
Unfortunately for him, [Google has beaten them to the punch](https://developers.googleblog.com/en/experiment-with-gemini-20-flash-native-image-generation/). Gemini 2.0 Flash Experimental (don't ask) has now been blessed with the ability to directly generate images. I'm not sure if this has rolled out to the consumer Gemini app, but it's readily accessible on their developer preview.
First impressions: [It's good.](https://x.com/robertriachi/status/1899854394751070573)
You can generate an image, and then ask it to edit a feature. It will then edit the *original* image and present the version modified to your taste, unlike all other competitors, who would basically just re-prompt and hope for better luck on the second roll.
Image generation just got way better, at least in the realm of semantic understanding. Most of the usual giveaways of AI-generated imagery, such as butchered text, are largely solved. It isn't perfect, but you're looking at a failure rate of 5-10% as opposed to >80% when using DALL-E or Flux. It doesn't beat Midjourney on aesthetics, but we'll get there.
You can imagine the scope for chicanery, especially if you're looking to generate images with large amounts of verbiage or numbers involved. I'd expect the usual censoring in consumer applications, especially since the LLM has finer control over things. But it certainly massively expands the mundane utility of image generation, and is something I've been looking forward to ever since I saw the capabilities demoed.
Flash 2.0 Experimental is also a model that's dirt cheap on the API, and while image gen definitely burns more tokens, it's a trivial expense. I'd strongly expect Google to make this free just to steal OAI's thunder.