I agree that Gemini will give us an update on timelines. But even if it's not particularly impressive, there's another route to LLM improvements that should be mentioned in any discussion on LLM timelines.
The capabilities of LLMs can be easily and dramatically improved, at least in some domains, by using scaffolding scripts that prompt the LLM to do internal reasoning and call external tools, as in HuggingGPT. These include creating sensory simulations with generative networks, then interpreting those simulations to access modality-specific knowledge. SmartGPT and Tree of Thoughts show massive improvements in logical reasoning using simple prompt arrangements. Whether or not these expand into full language-model-based cognitive architectures (LMCAs), LLMs don't need to have sensory knowledge embedded in order to use it. Given the ease of fine-tuning, adding this knowledge in an automated way seems within reach as well.
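To make the scaffolding idea concrete, here is a minimal sketch of such a script: a loop that lets the model request tools and feeds the results back into its context. `call_llm` and the toy tool set are placeholders, not any particular system's API; HuggingGPT, SmartGPT, and Tree of Thoughts each use more elaborate prompt structures than this.

```python
# Minimal scaffolding loop: prompt the LLM, execute any tool it requests,
# append the result to the transcript, and repeat until it answers.

def call_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to whatever chat-completion API you use and return the reply."""
    raise NotImplementedError

TOOLS = {
    # Toy tools for illustration only; the restricted eval is still not production-safe.
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    "search": lambda query: f"(top search snippets for {query!r} would go here)",
}

def run_scaffold(task: str, max_steps: int = 5) -> str:
    transcript = (
        "You can use a tool by replying with a single line 'TOOL: <name> <input>'.\n"
        f"Available tools: {', '.join(TOOLS)}.\n"
        "When you are done, reply with 'ANSWER: <final answer>'.\n\n"
        f"Task: {task}\n"
    )
    for _ in range(max_steps):
        reply = call_llm(transcript).strip()
        transcript += reply + "\n"
        if reply.startswith("ANSWER:"):
            return reply[len("ANSWER:"):].strip()
        if reply.startswith("TOOL:"):
            _, name, tool_input = reply.split(maxsplit=2)
            result = TOOLS.get(name, lambda _: "unknown tool")(tool_input)
            transcript += f"RESULT: {result}\n"
    return "No answer within the step budget."
```

The point is that all of the extra capability comes from the prompt structure and the loop around the model, not from any change to its weights.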
It seems like multi-modality will also result in AIs that are much less interpretable than pure LLMs.
This is not obvious to me. It seems somewhat likely that multimodality actually induces more explicit representations and uses of human-level abstract concepts; e.g., the Jennifer Aniston neuron in a human brain is multimodal.
Relevant: Goh et al. finding multimodal neurons (ones responding to the same subject in photographs, drawings, and images of their written name) in the image side of the CLIP model, including ones for Spiderman, USA, Donald Trump, Catholicism, teenage, anime, birthdays, Minecraft, Nike, and others.
To caption images on the Internet, humans rely on cultural knowledge. If you try captioning the popular images of a foreign place, you’ll quickly find your object and scene recognition skills aren't enough. You can't caption photos at a stadium without recognizing the sport, and you may even need to know specific players to get the caption right. Pictures of politicians and celebrities speaking are even more difficult to caption if you don’t know who’s talking and what they talk about, and these are some of the most popular pictures on the Internet. Some public figures elicit strong reactions, which may influence online discussion and captions regardless of other content.
With this in mind, perhaps it’s unsurprising that the model invests significant capacity in representing specific public and historical figures — especially those that are emotional or inflammatory. A Jesus Christ neuron detects Christian symbols like crosses and crowns of thorns, paintings of Jesus, his written name, and feature visualization shows him as a baby in the arms of the Virgin Mary. A Spiderman neuron recognizes the masked hero and knows his secret identity, Peter Parker. It also responds to images, text, and drawings of heroes and villains from Spiderman movies and comics over the last half-century. A Hitler neuron learns to detect his face and body, symbols of the Nazi party, relevant historical documents, and other loosely related concepts like German food. Feature visualization shows swastikas and Hitler seemingly doing a Nazi salute.
Which people the model develops dedicated neurons for is stochastic, but seems correlated with the person's prevalence across the dataset and the intensity with which people respond to them. The one person we’ve found in every CLIP model is Donald Trump. His neuron strongly responds to images of him across a wide variety of settings, including effigies and caricatures in many artistic mediums, as well as more weakly activating for people he’s worked closely with like Mike Pence and Steve Bannon. It also responds to his political symbols and messaging (e.g. “The Wall” and “Make America Great Again” hats). On the other hand, it most *negatively* activates to musicians like Nicki Minaj and Eminem, video games like Fortnite, civil rights activists like Martin Luther King Jr., and LGBT symbols like rainbow flags.
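For anyone who wants to poke at this themselves: Goh et al. probed individual neurons inside CLIP's image encoder, which additionally requires forward hooks on intermediate layers, but a rough sense of the cross-modal behaviour is already visible at the embedding level. The sketch below assumes OpenAI's CLIP repo; the image file names are placeholders for your own photo, drawing, and rendered-name images of the same subject.

```python
# Rough embedding-level check: do a photo, a drawing, and an image of the
# written name of the same subject all land closest to the same text prompt?
# Much cruder than the neuron-level analysis in Goh et al., but it shows
# the cross-modal behaviour the excerpt describes.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder file names: a photograph, a drawing, and rendered text of the subject.
image_paths = ["spiderman_photo.jpg", "spiderman_drawing.jpg", "spiderman_name.png"]
images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
texts = clip.tokenize(["Spider-Man", "a dog", "a mountain landscape"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(images)
    text_features = model.encode_text(texts)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    similarity = image_features @ text_features.T  # cosine similarity matrix

print(similarity)  # all three rows should peak on the "Spider-Man" column
```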
There is a genre of LLM critique that criticises LLMs for being, well, LLMs.
Yann LeCun, for example, points to GPT-4's inability to visually imagine the rotation of interlocking gears as a fact that shows how far away AGI is, rather than a fact that shows that GPT-4 simply has not been trained on video data yet.
There are many models now that "understand" images or videos or even more modalities. However, they are not trained end-to-end on these multiple modalities. Instead, they use an intermediary model like CLIP that translates into the language domain. This is a rather big limitation, because CLIP can only represent concepts in images that are commonly described in image captions.
Why do I consider this a big limitation? Currently it looks like intelligence emerges from learning to solve a huge number of tiny problems. Language seems to contain a lot of useful tiny problems. Additionally, it is the interface to our kind of intelligence, which allows us to assess and use the intelligence extracted from huge amounts of text.
This means that adding a modality with a CLIP-like embedding and then doing some fine-tuning does not add any intelligence to the system. It only adds eyes or ears or gears.
Training end-to-end on multi-modal data should allow the model to extract new problem solving circuits from the new modalities. The resulting model would not just have eyes, but visual understanding.
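To make the distinction concrete, here is a minimal sketch of the "grafted" approach (in the spirit of adapter-style systems such as LLaVA or BLIP-2, not a description of any specific one): a frozen image encoder whose output is projected into the token-embedding space of a frozen language model, with only the small projection layer trained. The class and dimension names are illustrative.

```python
import torch
import torch.nn as nn

class GraftedVisionLM(nn.Module):
    """Sketch of the adapter approach: frozen eyes bolted onto a frozen LLM."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int, lm_embed_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder    # e.g. a frozen CLIP image tower
        self.language_model = language_model    # a frozen pretrained LLM
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.language_model.parameters():
            p.requires_grad = False
        # The only trainable piece: map image features into "soft tokens"
        # that the language model can attend to.
        self.projection = nn.Linear(vision_dim, lm_embed_dim)

    def forward(self, images: torch.Tensor, text_embeddings: torch.Tensor):
        image_features = self.vision_encoder(images)                 # (batch, vision_dim)
        image_tokens = self.projection(image_features).unsqueeze(1)  # (batch, 1, lm_embed_dim)
        inputs = torch.cat([image_tokens, text_embeddings], dim=1)
        # Assumes a HuggingFace-style LM that accepts precomputed embeddings.
        return self.language_model(inputs_embeds=inputs)
```

Only `projection` ever sees gradients here; the language model's circuits are untouched by the pixels, which is the sense in which this adds eyes rather than visual understanding. End-to-end training, by contrast, would backpropagate through both towers on the multimodal data, letting new problem-solving circuits form.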
DeepMind did a mildly convincing proof of concept with Gato last year, a small transformer trained on text, images, computer games, and robotics tasks. Now it seems they will try to scale Gato to Gemini, leapfrogging GPT-4 in the process.
GPT-4 itself has image-processing capabilities that are not yet available to the general public. But whether these are an add-on or the result of integrated image modelling, we don't know yet.
To me it seems very likely that a world where the current AI boom fizzles is a world where multi-modality does not bring much benefit, or where we cannot figure out how to do it right, or where the compute requirements of doing it right are still prohibitive.
I think Gemini will give us a good chunk of information about whether that is the world we are living in.