This graph reports relative performance of models of different sizes. We know the sizes of Nano 1 and Nano 2, so given how scaling laws work, this is a massive hint about the sizes of Pro and Ultra.
The weird thing is, I looked at this graph, I did that (because of course), and I got insane results. Applying that approach, Pro has somewhere around 13B parameters, and Ultra somewhere around 26B. That's from just eyeballing the step sizes and doing mental arithmetic, but the results are crazy enough that I'm not going to bother breaking out a ruler and calculator to try to do a more detailed estimate. It does not take a significant proportion of all of Google's latest TPUs months to train an ~26B parameter model, and if Google can really get narrowly-beating-GPT-4-level performance out of a model only the size of a mid-sized Llama, or 3–4 times the size of Mistral, then I am deeply impressed by their mad skillz. Also, you don't call something "Nano" if it's 1/8 or 1/16 the size of your biggest model: the parameter count ratio between a data-center model and a phone model has to be a lot larger than that. I don't believe those sizes for a moment, so my conclusion is that this graph was released precisely because it's on some sort of distinctly screwy scale that makes it impossible to meaningfully estimate Pro and Ultra's parameter counts from it. I think it's marketing shlock, probably aimed at Pixel phone buyers.
My first guess would be that Nano 1 and Nano 2 were not originally pretrained at those parameter counts, but have been heavily distilled down (or some newer, whizzier equivalent of distillation) in certain fairly narrow skillsets or even specific use-cases valuable for their on-Pixel-phone role (because they'd be dumb not to, and DeepMind + the Brain aren't dumb), and that those tests, while their one-word descriptions sound general (well, apart from "summarization"), were carefully arranged to be inside those fairly narrow skillsets. (It also wouldn't surprise me if Nano 1 & 2 came in language-specific versions, each separately distilled, and that graph is from the EN versions. Though one of the test names was "multilingual", even those versions wouldn't be entirely monolingual; they'd still have some basic translation ability to/from useful languages.) So I suspect the tests are designed to heavily favor the Nano models. Or the scoring is arranged in some way such that the scale is distinctly non-linear. Or indeed quite possibly some cunning combination of both. I'm fairly sure that if Jeff Dean (or one of his senior PMs) had sent me a sketch of what that graph needed to look like, say, a few months ago and assigned me a team of researchers, we could have come up with a set of measurements that looked like that.
It also seems likely that the Nano models are extremely overtrained relative to the scaling laws. The scaling laws are for compute-optimal training, but here they want to minimize inference cost, so it would make sense to train for significantly longer.
Agreed (well, except for a nitpick that post-Chinchilla versions of scaling laws also make predictions for scaling data and parameter count separately, including in overtraining regions): overtraining during distillation seems like the obvious approach, using a lot of data (possibly much of it synthetic, which would let you avoid issues like memorization of PII and copyright) rather than many epochs, in order to minimize memorization. Using distillation also effectively increases the size of your distillation training set for scaling laws, since the trainee model now gets more data per example: not just the tokens in the correct answer, but their logits and those of all the top alternative tokens according to the larger trainer model. So each document in the distillation training set becomes worth several times as much.
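To make the 'more data per example' point concrete, here is a minimal sketch of garden-variety soft-target distillation in PyTorch. This illustrates the general technique rather than anything Google has described; the tensor shapes, temperature and mixing weight are all assumptions.

```python
# Minimal soft-target distillation sketch: the student sees not just the correct
# next token (hard loss) but the teacher's whole distribution over alternatives
# (soft loss), which is the extra per-example signal discussed above.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target_ids,
                      temperature=2.0, alpha=0.5):
    # student_logits, teacher_logits: (batch, seq_len, vocab); target_ids: (batch, seq_len)
    t = temperature

    # Hard loss: ordinary next-token cross-entropy against the correct answer tokens.
    hard = F.cross_entropy(student_logits.flatten(0, 1), target_ids.flatten())

    # Soft loss: KL divergence from the teacher's softened distribution to the
    # student's, averaged per token. The t*t factor is the usual gradient rescaling.
    soft = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1).flatten(0, 1),
        F.log_softmax(teacher_logits / t, dim=-1).flatten(0, 1),
        reduction="batchmean",
        log_target=True,
    ) * (t * t)

    return alpha * hard + (1 - alpha) * soft
```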
What assumptions is this making about scaling laws for these benchmarks? I wouldn't know how to convert laws for losses into these kinds of fuzzy benchmarks.
I was simply looking at the improvement step size, averaged across the bar charts, between Nano 1 at 1.8B parameters and Nano 2 at 3.25B, a factor of ~2 in parameter count. The step size up to Pro is about twice that, then from Pro up to Ultra about the same as Nano 1 to Nano 2, each again averaged across the bar charts. Assuming only that the scaling law is a power law, i.e. a straight line on a log-linear graph, as they always are (well, unless you use a screwy nonlinear scoring scheme), that would mean that Pro's parameter count was around four times that of Nano 2, which is ~13B, and Ultra was another doubling again at ~26B. Those numbers are obviously wrong, so this is not a simple log-linear chart of something that just scales with a power law from parameter count. And I'm sure they wouldn't have published it if it was.
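For concreteness, here is that eyeball arithmetic written out. The step sizes are eyeballed from the bars rather than measured, and the only assumption is the log-linear power-law relationship described above.

```python
# Back-of-envelope version of the eyeballing above: assume benchmark score is
# roughly linear in log(parameter count), so equal score steps correspond to
# equal multiplicative jumps in parameters.
nano1, nano2 = 1.8e9, 3.25e9
factor_per_step = nano2 / nano1          # ~1.8x per "one step" of score gain

pro   = nano2 * factor_per_step ** 2     # Pro sits roughly two steps above Nano 2
ultra = pro * factor_per_step            # Ultra roughly one more step above Pro

print(f"Pro   ~ {pro / 1e9:.1f}B parameters")    # ~10.6B (~13B if you round the step to 2x)
print(f"Ultra ~ {ultra / 1e9:.1f}B parameters")  # ~19.1B (~26B with a 2x step)
```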
Liv Boeree: This is pretty nuts, looks like they’ve surpassed GPT4 on basically every benchmark… so this is most powerful model in the world?! Woweee what a time to be alive.
Link doesn't work. Maybe she changed her mind?
It’s happening. Here is CEO Pichai’s Twitter announcement. Here is Demis Hassabis announcing. Here is the DeepMind Twitter announcement. Here is the blog announcement. Here is Gemini co-lead Oriol Vinyals, promising more to come. Here is Google’s Chief Scientist Jeff Dean bringing his best hype.
EDIT: This post has been updated for the fact that I did not fully appreciate how fake Google’s video demonstration was.
Technical Specifications
Let’s check out the specs.
Trained context length was 32k tokens; they report 98% accuracy on information retrieval for Ultra across the full context length. So a bit low, both lower than GPT-4 and Claude and lower than their methods can handle. Presumably we should expect that context length to grow rapidly with future versions.
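The report does not spell out the retrieval harness, but the standard way to measure this kind of thing is a needle-in-a-haystack probe: bury one fact at a random depth in filler text and ask for it back. A minimal sketch, with `ask_model` as a hypothetical stand-in for whatever API is being tested:

```python
# Synthetic long-context retrieval probe (illustrative only; not Google's harness).
import random

FILLER = "The quick brown fox jumps over the lazy dog. "
QUESTION = "\n\nWhat is the magic number? Answer with the number only."

def retrieval_accuracy(ask_model, n_filler=3000, trials=20):
    # ~3000 filler sentences is very roughly a 32k-token prompt with typical tokenizers.
    hits = 0
    for _ in range(trials):
        secret = str(random.randint(10_000, 99_999))
        chunks = [FILLER] * n_filler
        # Bury the "needle" fact at a random depth in the filler context.
        chunks.insert(random.randint(0, n_filler), f"The magic number is {secret}. ")
        if secret in ask_model("".join(chunks) + QUESTION):
            hits += 1
    return hits / trials
```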
There are three versions of Gemini 1.0.
This makes sense. I do think there are, mostly, exactly these three types of tasks. Nano tasks are completely different from non-Nano tasks.
Gemini is natively multimodal, which they represent as being able to seamlessly integrate various inputs and outputs.
They say their benchmarking on text beats the existing state of the art.
I love that ‘above 90%’ turns out to be exactly 90.04%, whereas human expert is 89.8% and prior SOTA was 86.4%. Chef’s kiss, 10/10, no notes. I mean, what a coincidence, that is not suspicious at all and no one was benchmark gaming that, no way.
I wonder when such approaches will be natively integrated into the UI for such models. Ideally, I should be able to, after presumably giving them my credit card information, switch my (Bard?) to ‘Gemini k-sample Chain of Thought’ and then have it take care of itself.
Here’s their table of benchmark results.
So the catch with MMLU is that Gemini Ultra gets more improvement from CoT@32, whereas GPT-4 did not improve much, but Ultra’s baseline 5-shot performance is worse than GPT-4’s.
Except the other catch is that GPT-4, with creative prompting, can get to 89%?
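For reference, my understanding of CoT@32 from the report is an ‘uncertainty-routed’ chain of thought: sample 32 reasoned answers, take the majority answer only when the consensus is strong enough, and otherwise fall back to the greedy answer. A minimal sketch of that routing logic, where the sampling functions and the threshold are placeholders rather than their implementation:

```python
# Uncertainty-routed chain-of-thought, sketched: majority vote over k sampled
# answers, falling back to greedy decoding when the samples disagree too much.
from collections import Counter

def cot_at_k(question, sample_answer, greedy_answer, k=32, threshold=0.7):
    samples = [sample_answer(question) for _ in range(k)]   # temperature > 0 sampling
    answer, count = Counter(samples).most_common(1)[0]
    if count / k >= threshold:       # confident consensus: trust the vote
        return answer
    return greedy_answer(question)   # otherwise fall back to the greedy answer
```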
GPT-4 is pretty excited about this potential ‘Gemini Ultra’ scoring 90%+ on the MMLU, citing a variety of potential applications and calling it a substantial advancement in AI capabilities.
They strongly imply that GPT-4 got 95.3% on HellaSwag due to data contamination, noting that including ‘specific website extracts’ improved Gemini’s performance to a 1-shot 96%. Even if true, Gemini’s performance here is disappointing.
What does this suggest about Gemini Ultra? One obvious thing to do would be to average all the scores together for GPT-4, GPT-3.5 and Gemini, to place Gemini on the GPT scale. Using only benchmarks where GPT-3.5 has a score, we get an average of 61 for GPT-3.5, 79.05 for GPT-4 and 80.1 for Gemini Ultra.
By that basic logic, we would award Gemini a benchmark of 4.03 GPTs. If you take into account that improvements matter more as scores go higher, and otherwise look at the context, and assume these benchmarks were not selected for results, I would increase that to 4.1 GPTs.
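For the record, the 4.03 presumably comes from linearly interpolating that average between GPT-3.5 (call it 3.5) and GPT-4 (call it 4.0):

```python
# Place an average benchmark score on the "GPT scale" by linear interpolation.
def gpt_scale(avg_score, gpt35_avg=61.0, gpt4_avg=79.05):
    return 3.5 + 0.5 * (avg_score - gpt35_avg) / (gpt4_avg - gpt35_avg)

print(round(gpt_scale(80.1), 2))  # 4.03
```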
On practical text-only performance, I still expect GPT-4-turbo to be atop the leaderboards.
Gemini Pro clearly beat out PaLM-2 head-to-head on human comparisons, but not overwhelmingly so. It is kind of weird that we don’t have a win rate here for GPT-4 versus Gemini Ultra.
Image understanding benchmarks seem similar. Some small improvements, some big enough to potentially be interesting if this turns out to be representative.
Similarly they claim improved SOTA for video, where they also have themselves as the prior SOTA in many cases.
For image generation, they boast that text and images are seamlessly integrated, such as providing both text and images for a blog, but provide no examples of Gemini doing such an integration. Instead, all we get are some bizarrely tiny images.
One place we do see impressive claimed improvement is speech recognition. Note that this is only Gemini Pro, not Gemini Ultra, which should do better.
Those are error rate declines you would absolutely notice. Nano can run on-device and it is doing importantly better on YouTube than Whisper. Very cool.
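For reference on what is being measured, word error rate is just word-level edit distance (substitutions, insertions and deletions) normalized by the length of the reference transcript; a minimal implementation:

```python
# Word error rate: minimum word-level edits to turn the hypothesis into the
# reference, divided by the number of reference words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard edit-distance dynamic program over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))  # ~0.17
```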
Here’s another form of benchmarking.
I read the training notes mostly as ‘we used all the TPUs, no really there were a lot of TPUs’ with the most interesting note being this speed-up. Does this mean they now have far fewer checkpoints saved, and if so does this matter?
Their section on training data drops a few technical hints but wisely says little. They deliberately sculpted their mix of training data, in ways they are keeping private.
In section 6 they get into responsible deployment. I appreciated them being clear they are focusing explicitly on questions of deployment.
They focus (correctly) exclusively on the usual forms of mundane harm, given Gemini is not yet breaking any scary new ground.
Their instruction tuning used supervised fine tuning and RLHF.
A particular focus was on attribution, which makes sense for Google.
Another was to avoid reasoning from a false premise and to otherwise refuse to answer ‘unanswerable’ questions. We need to see the resulting behavior but it sounds like the fun police are out in force.
It doesn’t sound like their mitigations for factuality were all that successful? Unless I am confusing what the numbers mean.
Looking over the appendix and its examples, it is remarkable how unimpressive all of the given examples were.
I notice that I watch how honestly DeepMind approaches reporting capabilities and attacking benchmarks as an important sign of their commitment to safety. There are some worrying signs that they are willing to twist things quite a ways. Whereas the actual safety precautions do not bother me too much one way or the other?
The biggest safety precaution is one Google is not even calling a safety precaution. They are releasing Gemini Pro, and holding back Gemini Ultra. That means they have a gigantic beta test with Pro, whose capabilities are such that it is harmless. They can use that to evaluate and tune Ultra so it will be ready.
The official announcement offers some highlights.
Demis Hassabis talked to Wired about Gemini. Didn’t seem to add anything.
Level Two Bard
Gemini Pro, even without Gemini Ultra, should be a substantial upgrade to Bard. The question is, will that be enough to make it useful when we have Claude and ChatGPT available? I will be trying it to find out, same as everyone else. Bard does have some other advantages, so it seems likely there will be some purposes, when you mostly want information, where Bard will be the play.
This video represents some useful prompt engineering and reasoning abilities, used to help plan a child’s birthday party, largely by brainstorming possibilities and asking clarifying questions. If they have indeed integrated this functionality in directly, that’s pretty cool.
Pete says Bard is finally at a point where he feels comfortable recommending it. The prompts are not first rate, but he says it is greatly improved since September and the integrations with Gmail, YouTube and Maps are useful. It definitely is not a full substitute at this time; the question is if it is a good complement.
Even before Gemini, Bard did a very good job helping my son with his homework assignments, such that I was sending him there rather than to ChatGPT.
Returning a clean JSON continues to require extreme motivation.
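For what it's worth, the usual workaround is a parse-and-retry loop rather than trusting the model; a minimal sketch, with `generate` standing in for whichever model API you are calling (a hypothetical callable, not a real Bard or Gemini client):

```python
# Ask for JSON only, try to parse it, and re-prompt with the parse error a few times.
import json

def get_json(generate, task: str, retries: int = 3) -> dict:
    prompt = (f"{task}\n\nRespond with a single valid JSON object only. "
              "No prose, no markdown code fences.")
    for _ in range(retries):
        reply = generate(prompt).strip()
        # Models love to wrap JSON in markdown code fences anyway; strip them first.
        if reply.startswith("`"):
            reply = reply.strip("`").removeprefix("json").strip()
        try:
            return json.loads(reply)
        except json.JSONDecodeError as err:
            prompt += f"\n\nYour last reply was not valid JSON ({err}). Try again."
    raise ValueError("Model never returned valid JSON")
```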
When will Bard Advanced (with Gemini Ultra) be launched? Here’s a market on whether it happens in January.
Gemini Reactions
Some were impressed. Others, not so much.
The first unimpressive thing is that all we are getting for now is Gemini Pro. Pro is very clearly not so impressive, and clearly behind GPT-4.
Simeon? Not impressed.
Simeon: Gemini is here. Tbh it feels like it’s GPT-4 + a bit more multimodality + epsilon capabilities. So my guess is that it’s not a big deal on capabilities, although it might be a big deal from a product standpoint which seems to be what Google is looking for.
As always, one must note that everything involved was chosen to be what we saw, and potentially engineered or edited. The more production value, the more one must unwind.
For the big multimodal video, this issue is a big deal.
Was this faked? EDIT: Yes. Just yes. Shame on Google on several levels.
Set aside the integrity issues, wow are we all jaded at this point, but when I watched that video, even when I assumed it was real, the biggest impression I got was… big lame dad energy?
I do get that this was supposedly happening in real time, but none of this is surprising me. Google put out its big new release, and I’m not scared. If anything, I’m kind of bored? This is the best you could do?
Whereas when watching the exact same video, others react differently.
Does it impress even if real? I mean, I guess, if you didn’t already assume all of it, and it was this smooth for regular users? I can think of instances in which a camera feed hooked up to Gemini with audio discussions could be a big game changer. To me this is a strange combination of the impressive parts already having been ‘priced into’ my world model, and the new parts not seeming impressive.
So I’m probably selling it short somewhat to be bored by it as a potential thing that could have happened. If this was representative of a smooth general multimodal experience, there is a lot to explore.
Arthur thinks Gemini did its job, but that this is unsurprising and it is weird people thought Google couldn’t do it.
Liv Boeree? Impressed.
Gary Marcus? Impressed in some ways, not in others.
I love that this is saying that OpenAI isn’t valuable both because Gemini is so good and also because Gemini is not good enough.
Roon offers precise praise.
Joey Krug is super unimpressed by the fudging on the benchmarks, and says they did it across the board, not only on MMLU.
Google’s central problem is not wokeness, it is that they are a giant company with lots of internal processes and powers that prevent or slow or derail innovation, and prevent moving fast or having focus. There are especially problems with making practical products, integrating the work of various teams, and making incentives line up. There is lots of potential, tons of talent, plenty of resources, but can they turn that into a product?
Too soon to tell. Certainly they are a long way from ‘beat OpenAI’ but this is the first and only case where someone might be in the game. The closest anyone else has come is Claude’s longer context window.