Crossposted from my personal blog.
Epistemic status: This is a short post meant to highlight something I do not yet understand and therefore a potential issue with my models. I would also be interested to hear if anybody else has a good model of this.
Why do vision (and audio) models work so well despite being so small? State-of-the-art models like Stable Diffusion and Midjourney work exceptionally well, generating near-photorealistic art and images and giving users a fair degree of control over their generations. I would estimate with a fair degree of confidence that the capabilities of these models surpass the mental imagery abilities of almost all humans (they definitely surpass mine and those of a number of people I have talked to). However, these models are also tiny in terms of parameters: the original Stable Diffusion is only about 890M parameters.
In terms of dataset size, image models are roughly at parity with humans. The Stable Diffusion training set is about 2 billion images. Assuming that you see 10 images per second every second you are awake and that you are awake 18 hours a day, you observe roughly 230 million images per year, and so would match Stable Diffusion's data input after about 10 years. Of course, the images you see are much more redundant, and these assumptions are highly aggressive, but landing within the same OOM as a SOTA image model over a human lifetime is not insane. On the other hand, the hundreds of billions to trillions of tokens fed to LLMs are orders of magnitude beyond what any human could ever experience.
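As a quick back-of-the-envelope check (using the same aggressive assumptions as above), the arithmetic works out as follows:

```python
# Back-of-the-envelope: how long would a human need to "see" as many
# images as the Stable Diffusion training set? (Assumptions from above.)
images_per_second = 10                   # aggressive assumption
waking_hours_per_day = 18
seconds_per_year = waking_hours_per_day * 3600 * 365

images_per_year = images_per_second * seconds_per_year
# ~2.4e8, i.e. roughly 230 million images per year

dataset_size = 2e9                       # ~2 billion training images
years_to_match = dataset_size / images_per_year
print(f"{images_per_year:.2e} images/year, ~{years_to_match:.1f} years to match")
# ~8.5 years: the same order of magnitude as a human childhood
```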
A similar surprising smallness occurs in audio models. OpenAI's Whisper can do almost flawless audio transcription (including multilingual translation!) with just 1.6B parameters.
Let's contrast this to the brain. Previously, I estimated that we should expect the visual cortex to have on the order of 100B parameters, if not more. The auditory cortex should be of roughly the same order of magnitude, but slightly smaller than the visual cortex. That is two orders of magnitude larger than state-of-the-art DL models in these modalities.
This contrasts with state-of-the-art language models, which appear to be approximately equal to the brain in both parameter count and ability. Small (1-10B) language models are clearly inferior to the brain at producing valid text and completions, as well as at standard question-answering and factual recall tasks. Human parity in factual knowledge is reached somewhere between GPT-2 and GPT-3. Human language abilities are still not entirely surpassed by GPT-3 (175B parameters) or GPT-4 (presumably significantly larger). This puts large language models within approximately the same order of magnitude as the human linguistic cortex.
What could be the reasons for this discrepancy? Off the top of my head I can think of a number which are below (and ranked by rough intuitive plausibility), and it would be interesting to try to investigate these further. Also, if anybody has ideas or evidence either way please send me a message.
1.) The visual cortex vs image models is not a fair comparison. The brain does lots of things image generation models can't: it parses and renders very complex visual scenes, deals with saccades and input from two eyes, and, crucially, handles video data and moving stimuli. We haven't fully cracked video yet, and it is plausible that doing so will require an OOM or two more of scale from existing vision models.
2.) There are specific inefficiencies in the brain's processing of images that image models skip, and which do not apply to language models. One very obvious example is convolutions. While CNNs have convolutional filters that are applied to every tile of the image, the brain cannot do this and so must laboriously encode each filter with separate neurons and synapses for each image patch. Indeed, much of the processing in the retina, lateral geniculate nucleus, and even V1 appears to be taken up with extremely simple filters (Gabors, edge detectors, line detectors, etc.) copied over and over again for different image patches. This 'artificially' inflates the parameter count of the visual cortex relative to ML models, so that the visual cortex's 'effective parameter count' is much smaller than it appears. However, I doubt this can be the whole story, as recent image models such as Stable Diffusion use increasingly transformer-like architectures (residual stream + attention) rather than convolutions for most of the image processing pipeline. Similarly, Whisper has only a small convolutional stem at the beginning before transitioning to an attention-based architecture.
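To make the weight-sharing point concrete, here is a minimal sketch (with illustrative layer sizes, not taken from any particular model or brain region) comparing a shared-weight convolution against a locally connected layer that, like cortex, must store a separate copy of each filter for every image patch:

```python
# Parameter count: shared-weight convolution vs. locally connected layer.
# Sizes are illustrative only.
H = W = 64            # spatial size of the feature map
C_in, C_out = 3, 64   # input / output channels
k = 3                 # kernel size

# Standard CNN: one set of filters reused at every spatial position.
conv_params = C_out * C_in * k * k            # ~1.7e3

# "Brain-like" locally connected layer: a separate filter per position.
local_params = H * W * C_out * C_in * k * k   # ~7.1e6

print(conv_params, local_params, local_params / conv_params)
# The unshared version is H*W = 4096x larger for the same function class,
# which is one way raw synapse counts can overstate the cortex's
# 'effective' parameter count relative to a CNN.
```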
3.) Parameter count is the wrong way to assess diffusion models. Unlike feedforward NNs such as transformers, or earlier vision models such as GANs/VAEs, diffusion models generate (and are trained) using a reasonably large number of diffusion steps to iteratively 'decode' an image. This process is very similar to the iterative inference via recurrence that occurs in the brain. However, unlike diffusion models, the brain also supports a single feedforward amortized sweep to achieve core object recognition (otherwise your vision would be too slow to detect important things such as predators in time). It is possible that the iterative inference performed by diffusion models is more parameter efficient than a direct amortized net would be, and thus gets a saving over the brain in this way. While there are very good VAEs/GANs in existence and at scale, it may be that these need an OOM or more additional parameters to be competitive with diffusion models. Note that, in terms of computational cost, a forward pass through an amortized net is much cheaper than a generation with a diffusion network (a diffusion generation is effectively N amortized forward passes, where N is the number of diffusion steps), so comparable VAEs/GANs may actually be cheaper to run even if they are much larger.
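As a crude illustration of this compute-vs-parameters trade-off (purely illustrative numbers, assuming per-pass compute scales roughly with parameter count):

```python
# Crude compute comparison: iterative diffusion vs. a one-shot amortized net.
# Numbers are illustrative; assume cost per forward pass ~ parameter count.
diffusion_params = 0.9e9      # a Stable-Diffusion-sized denoiser
diffusion_steps = 50          # typical number of sampling steps

amortized_params = 20e9       # a hypothetical much larger one-shot GAN/VAE

diffusion_cost = diffusion_params * diffusion_steps   # ~4.5e10 "units" per image
amortized_cost = amortized_params * 1                 # ~2.0e10 "units" per image

print(diffusion_cost, amortized_cost)
# Even at ~20x the parameters, the one-shot model can be cheaper per image,
# which is the sense in which VAEs/GANs might be cheaper to run even if they
# need an OOM more parameters to match diffusion quality.
```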
4.) Our assessment of LLM abilities is wrong: existing LLMs are vastly superhuman and GPT-2-style models are actually at human parity. This seems highly unlikely from actually interacting with these models, but on the other hand, even GPT-2-scale models possess a lot of arcane knowledge which is superhuman, and it may be that the very powerful cognition of these small models is smeared across such a wide range of weird internet data that it appears much weaker than us in any specific facet. Intuitively, the claim would be that a human and GPT-2 possess the same 'cognitive/linguistic power', but since GPT-2's cognition is spread over a much wider data range than a human's, its 'linguistic power density' is lower and it therefore appears much less intelligent in the much smaller human-relevant domain in which we test it. I am very unsure whether these concepts are correct or a useful frame through which to view things.
5.) Language models are highly inefficient and can be made much smaller without sacrificing much performance. For whatever reason, we may just be training language models badly or doing something else wrong, and it may in fact be possible to get 1 or 2 OOMs of parameter efficiency out of current language models. If this were true, it would be massive, since it would shrink a GPT-4-level model into a trivially open-sourceable and highly hackable 'small' LLM. For instance, GPT-4 is unlikely to be more than 1 trillion dense parameters. Two orders of magnitude would shrink it to a 10B model, approximately the same size as the smaller LLaMA models and smaller than GPT-NeoX-20B, which would be straightforward to run inference on with even consumer-grade cards. There is some evidence for this in the reasonably large amounts of pruning that are possible, but an actual 2-OOM shrinking seems unlikely to me.
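For a rough sense of what a 2-OOM shrink would mean for inference (a sketch assuming the speculative 1T figure above and counting weight memory only, not activations or KV cache):

```python
# What a 2-OOM parameter reduction would mean for inference memory.
# Assumes 1T dense parameters as the (speculative) starting point.
large_params = 1e12
shrunk_params = large_params / 100        # 10B after a 2-OOM reduction

bytes_fp16 = shrunk_params * 2            # 16-bit weights
bytes_int4 = shrunk_params * 0.5          # 4-bit quantized weights

print(bytes_fp16 / 1e9, bytes_int4 / 1e9)
# ~20 GB at fp16 (fits a 24 GB consumer GPU), ~5 GB at 4-bit.
```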
I'm not sure that "even higher level of output quality" is actually true, but I recognize that it can be difficult to judge when an image generation model has succeeded. In particular, I think current image models are fairly bad at specifics in much the same way as early language models.
But I think the real problem is that we seem to still be stuck on "words". When I ask GPT-4 a logic question, and it produces a grammatically correct sentence that answers the logic puzzle correctly, only part of that is related to "words" -- the other part is a nebulous blob of reasoning.
I went all the way back to GPT-1 (117 million parameters) and tested next-word prediction -- specifically, I gave it a bunch of prompts and looked only at whether the very next word was what I would have expected. I think it's incredibly good at that! Probably better than most humans.
No, because this is already how image generators work. That's what I said in my first post when I noted the architectural differences between image generators and language models. An image generator, as a system, consists of multiple models. There is a text -> image-space model, and then an image-space -> image model. The text -> image-space encoder is generally trained first, then it is normally frozen during the training of the image decoder.[1] Meanwhile, the image decoder is trained on a straightforward task: "given this image, predict the noise that was added". In the actual system, that decoder is put into a loop to generate the final result. I'm requoting the relevant section of my first post below:
Refer to figure 2 in https://cdn.openai.com/papers/dall-e-2.pdf. Or read this:
This is the idea that I'm saying could be applied to language models, or rather, to a thing that we want to demonstrate "general intelligence" in the form of reasoning / problem solving / Q&A / planning / etc. First train an LLM, then train a larger system with the LLM as a component within it.
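A minimal sketch of the kind of two-stage system described above (hypothetical function and model names; the denoising loop is reduced to its bare structure rather than any particular sampler):

```python
import torch

# Hypothetical components of a two-stage text-to-image system, following the
# structure described above: a frozen text -> image-space encoder and an
# iteratively applied denoising decoder. Names and shapes are illustrative.

def generate_image(prompt, text_encoder, denoiser, num_steps=50, image_shape=(3, 64, 64)):
    # Stage 1: the frozen encoder maps the prompt into conditioning space.
    with torch.no_grad():
        cond = text_encoder(prompt)

    # Stage 2: the decoder is "put into a loop" -- start from noise and
    # repeatedly predict and remove noise, conditioned on the text embedding.
    x = torch.randn(1, *image_shape)
    for t in reversed(range(num_steps)):
        with torch.no_grad():
            predicted_noise = denoiser(x, t, cond)
        x = x - predicted_noise / num_steps   # schematic update; real samplers
                                              # follow a proper noise schedule
    return x

# The analogous proposal for language: freeze a trained LLM and train a larger
# system around it, the way the text encoder is frozen while the image decoder
# is trained.
```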