Introduction

My goal is to register my expectations for the relative performance of Gemini vs. GPT-4, and to hear others' expectations.

My expectations

The jump from GPT-4 to Gemini will likely not be as big a jump in capabilities as the one from GPT-3 to GPT-4.

Gemini could bring surprises by being more agentic than GPT-4: better at planning and longer-horizon tasks. But this is likely difficult to achieve; otherwise, strong LLM agents would already be making a buzz.

Comparison

From GPT-3 to GPT-4

  • Scaling Factor: x100 more compute than GPT-3.
  • Optimization: Chinchilla scaling laws (adapted for MoE) rather than the older OpenAI/Kaplan scaling laws (see the sketch after this list).
  • MoE Over Dense: Uses Mixture of Experts (MoE) instead of dense layers.
  • Data Quality: Likely higher-quality data, though this is uncertain.
  • Image Generation: Not publicly released, possibly due to subpar performance or safety risks.
  • Tools: Added during finetuning.
  • Algorithmic Gains: 3 years of algorithmic progress between GPT-3 and GPT-4.
  • Process-Based Feedback: GPT-4 may already employ it.
  • Training Efficiency: GPT-4 aimed for training compute efficiency. GPT-4 was not designed to be commercially deployed at scale.
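
To make the Chinchilla point concrete, here is a minimal sketch of compute-optimal allocation, assuming the common approximations C ≈ 6·N·D training FLOPs and D ≈ 20·N tokens; the budget below is illustrative, not an actual GPT-4 figure.

```python
# Minimal Chinchilla-style allocation sketch.
# Assumptions (not confirmed numbers): C = 6 * N * D and D = 20 * N.

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a FLOP budget into (params, tokens) under D = k * N."""
    # C = 6 * N * D with D = k * N  =>  N = sqrt(C / (6 * k))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Hypothetical 2e25 FLOP budget (illustrative only):
params, tokens = chinchilla_optimal(2e25)
print(f"~{params:.2e} params, ~{tokens:.2e} tokens")  # ~4.1e11, ~8.2e12
```

The Kaplan-era recipe would put far more of the same budget into parameters and far fewer into tokens; that shift in allocation is the "Chinchilla over Kaplan" point above.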

From GPT-4 to Gemini

  • Scaling Factor: ~x5 (up to x20) more compute than GPT-4.
  • Supercomputer Constraint: No existing supercomputer could feasibly provide x100 the compute used for GPT-4 (uncertain, but likely; a back-of-the-envelope check follows this list).
  • Multimodal: Possibly image, audio, and speech.
  • Data Efficiency: Possibly higher-quality data (e.g., Google Books) and fewer epochs.
  • Tools: Could be added during either finetuning or pretraining.
  • Algorithmic Gains: ~1 year of algorithmic progress between GPT-4 and Gemini.
  • Inference Efficiency: Gemini more likely aims for inference efficiency, given Google's intended large-scale deployment, perhaps at the cost of training efficiency.
  • Agency: Gemini may be trained to be more agentic and better at planning ("GPT-4 + AlphaGo").
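
A back-of-the-envelope check on the supercomputer constraint; every number here is an assumption (chip count, per-chip throughput, utilization, and run length are all hypothetical):

```python
# Training FLOPs = chips * peak FLOP/s per chip * utilization * seconds.
# All figures below are assumptions for illustration, not known Gemini numbers.
chips = 30_000        # hypothetical accelerator count
peak_flops = 1e15     # assumed ~1 PFLOP/s low-precision peak per chip
utilization = 0.4     # assumed model FLOPs utilization (MFU)
days = 100            # assumed run length

total_flops = chips * peak_flops * utilization * days * 86_400
print(f"~{total_flops:.1e} training FLOPs")  # ~1.0e26 with these assumptions
```

With these made-up numbers the run lands near 1e26 FLOPs, i.e. roughly x5 a hypothetical ~2e25-FLOP GPT-4 run; reaching x100 would require a far larger cluster or a much longer run, which is why x5 to x20 seems like the plausible range.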



Note: I drafted this before the news of Gemini's release and capabilities, but failed to finish writing it at the time. Since then, there have been some reports of Gemini being roughly at the level of GPT-4.

Comments

My guess is that it will be a scaled-up Gato - https://www.lesswrong.com/posts/7kBah8YQXfx6yfpuT/what-will-the-scaled-up-gato-look-like-updated-with. I think there might be some interesting features when the models are fully multimodal - e.g. being able to play games, perform simple actions on a computer, etc. Based on the announcement from Google, I would expect full multimodal training - image, audio, video, and text in/out. Based on DeepMind's hiring needs, I would expect they also want it to generate audio/video and to extend the model to robotics (the brain of something similar to a Tesla Bot) in the near future. Elon claims that training just from video input/output can result in full self-driving, so I'm very curious what training on YouTube videos can achieve. If they've made solid progress in long-term planning/reasoning and can deploy the model with sufficiently low latency, it could be quite a significant release, one that could simplify many office jobs.

My current assumption is that extracting "intelligence" from images, and even more so from videos, is much less efficient than from text. Text is just extremely information-dense.

So I wouldn't expect Gemini to initially feel more intelligent than GPT-4, even if it used five times the compute.

I mostly wonder about qualitative differences, perhaps induced by algorithmic improvements such as actually using RL or search components for a kind of self-supervised finetuning; that's one area where I can easily see DeepMind outcompeting OpenAI.

"GPT-4 was not designed to be commercially deployed at scale."

What makes you say that?

This comes from OpenAI saying they didn't expect ChatGPT to be a big commercial success. It was not a top-priority project. 

ChatGPT was not GPT-4. It was a relatively minor fixup of GPT-3 (GPT-3.5) with an improved RLHF variant, released while they were working on GPT-4's evaluations and productization; GPT-4 was the one that was supposed to be the big commercial success.