Crossposted from my personal blog.
Epistemic status: This is a short post meant to highlight something I do not yet understand and therefore a potential issue with my models. I would also be interested to hear if anybody else has a good model of this.
Why do vision (and audio) models work so well despite being so small? State-of-the-art models like Stable Diffusion and Midjourney work exceptionally well, generating near-photorealistic art and images and giving users a fair degree of controllability over their generations. I would estimate, with a fair degree of confidence, that the capabilities of these models surpass the mental imagery abilities of almost all humans (they definitely surpass mine and those of a number of people I have talked to). However, these models are also super small in terms of parameters. The original Stable Diffusion is only 890M parameters.
In terms of dataset size, image models are at rough parity with humans. The Stable Diffusion dataset is 2 billion images. Assuming that you see 10 images per second every second you are awake, and that you are awake 18 hours a day, you observe roughly 230 million images per year and so would match Stable Diffusion's data input after about 10 years. Of course, the images you see are far more redundant, and these assumptions are highly aggressive, but landing within the same OOM as a SOTA image model over a human lifetime is not insane. On the other hand, the hundreds of billions to trillions of tokens fed to LLMs are orders of magnitude beyond what any human could ever experience.
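As a back-of-the-envelope sketch of the arithmetic above (using the same deliberately aggressive assumptions of 10 images per second and 18 waking hours per day):

```python
# Rough comparison of a human's visual "dataset" against Stable Diffusion's
# training set, using the deliberately aggressive assumptions from the text.
IMAGES_PER_SECOND = 10
AWAKE_HOURS_PER_DAY = 18
SD_TRAINING_IMAGES = 2e9  # ~2 billion images

images_per_year = IMAGES_PER_SECOND * AWAKE_HOURS_PER_DAY * 3600 * 365
years_to_match = SD_TRAINING_IMAGES / images_per_year

print(f"images seen per year: {images_per_year:.2e}")          # ~2.4e8 (≈230M)
print(f"years to match the SD dataset: {years_to_match:.1f}")  # ~8.5 years
```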
A similar surprising smallness occurs in audio models. OpenAI's Whisper can do almost flawless audio transcription (including multilingual translation!) with just 1.6B parameters.
Let's contrast this to the brain. Previously, I estimated that we should expect the visual cortex to have on the order of 100B parameters, if not more. The auditory cortex should be of roughly the same order of magnitude, but slightly smaller than the visual cortex. That is two orders of magnitude larger than state of the art DL models in these modalities.
This contrasts with state-of-the-art language models, which appear to be approximately equal to the brain in parameter count and abilities. Small (1-10B) language models are clearly inferior to the brain at producing valid text and completions, as well as at standard question answering and factual recall tasks. Human parity in factual knowledge is reached somewhere between GPT-2 and GPT-3. Human language abilities are still not entirely surpassed by GPT-3 (175B parameters) or GPT-4 (presumably significantly larger). This puts large language models within approximately the same order of magnitude as the human linguistic cortex.
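To make the orders-of-magnitude gap explicit, here is a minimal sketch using the parameter figures quoted in this post; the cortical numbers are my rough estimates rather than measurements, and the ~100B figure used for the linguistic cortex is an assumption for illustration:

```python
import math

def oom_gap(brain_params: float, model_params: float) -> float:
    """Orders of magnitude separating a brain-region estimate from a model."""
    return math.log10(brain_params / model_params)

# Figures quoted in the post; cortical numbers are rough estimates, not measurements.
print(oom_gap(100e9, 890e6))   # visual cortex vs. original Stable Diffusion: ~2.1 OOMs
print(oom_gap(100e9, 1.6e9))   # a similar-scale auditory estimate vs. Whisper: ~1.8 OOMs
print(oom_gap(100e9, 175e9))   # an assumed ~100B linguistic cortex vs. GPT-3: ~-0.2 OOMs
```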
What could be the reasons for this discrepancy? Off the top of my head I can think of a number, listed below and ranked by rough intuitive plausibility, and it would be interesting to investigate them further. Also, if anybody has ideas or evidence either way, please send me a message.
1.) The visual cortex vs image models is not a fair comparison. The brain does lots of things image generation models can't do, such as parsing and rendering very complex visual scenes, dealing with saccades and binocular vision, and, crucially, handling video data and moving stimuli. We haven't fully cracked video yet, and it is plausible that doing so would require existing vision models to scale up by an OOM or two.
2.) There are specific inefficiencies in the brain's processing of images that image models skip and which do not apply to language models. One very obvious example is convolutions. While CNNs have convolutional filters which are applied to every tile of the image, the brain cannot do this and so must laboriously encode each filter with separate neurons and synapses at every location. Indeed, much of the processing in the retina, lateral geniculate nucleus, and even V1 appears to be taken up with extremely simple filters (such as Gabors, edge detectors, line detectors, etc.) copied over and over again for different image patches. This 'artificially' inflates the parameter count of the visual cortex relative to ML models, such that the visual cortex's 'effective parameter count' is much smaller than it appears. However, I doubt this can be the whole story, as recent image models such as Stable Diffusion use increasingly transformer-like architectures (residual stream + attention) rather than convolutions for most of the image processing pipeline. Similarly, Whisper has only a small convolutional stem at the beginning before transitioning to an attention-based architecture.
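To see how much the lack of weight sharing could inflate the count, here is a toy calculation with illustrative layer sizes (not taken from any real model or brain region): a shared convolution pays for each filter once, whereas a 'locally connected' scheme without sharing must pay for it at every image location.

```python
# Toy illustration of hypothesis 2: weight sharing vs. per-location filters.
# All layer sizes are illustrative, not taken from any real model or brain region.
H = W = 64           # spatial size of the feature map
C_IN = C_OUT = 64    # input / output channels
K = 3                # kernel size

# A standard convolution reuses one set of filters at every spatial location.
shared_conv_params = K * K * C_IN * C_OUT                  # 36,864

# Without weight sharing (closer to how cortex must implement the same filters),
# a separate kernel is needed at every one of the H*W locations.
unshared_params = H * W * K * K * C_IN * C_OUT             # ~151 million

print(f"inflation factor: {unshared_params // shared_conv_params}x")  # = H*W = 4096x
```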
3.) Parameter count is the wrong way to assess diffusion models. Unlike feedforward NNs such as transformers, or earlier generative vision models such as GANs/VAEs, diffusion models generate (and are trained) using a reasonably large number of diffusion steps to iteratively 'decode' an image. This process is very similar to the iterative inference via recurrence that occurs in the brain. However, unlike diffusion models, the brain supports a single feedforward amortized sweep to achieve core object recognition (otherwise your vision would be too slow to detect important things such as predators in time). It is possible that the iterative inference performed by diffusion models is more parameter efficient than a direct amortized net would be, and thus gains a saving over the brain in this way. While very good VAEs/GANs exist at scale, it may be that they would need an OOM or more additional parameters to be competitive with diffusion models. Note that in terms of computational cost, since a forward pass through an amortized net is so much cheaper than a generation with a diffusion network (a diffusion generation is effectively N amortized forward passes, where N is the number of diffusion steps), comparable VAEs/GANs may actually be cheaper to run even if much larger.
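A rough sketch of the compute trade-off mentioned at the end of hypothesis 3, with entirely hypothetical numbers: if generation cost scales roughly with parameters times forward passes, a one-shot amortized generator can carry many more parameters than a diffusion model before it becomes more expensive per sample.

```python
# Hypothetical numbers to illustrate the cost comparison in hypothesis 3.
# Per-sample generation cost is approximated as parameters x forward passes.
diffusion_params = 1e9     # ~1B-parameter diffusion model (illustrative)
diffusion_steps = 50       # typical number of denoising steps

amortized_params = 20e9    # a much larger one-shot VAE/GAN-style generator (illustrative)
amortized_steps = 1        # single feedforward sweep

print(f"diffusion cost per sample: {diffusion_params * diffusion_steps:.1e}")   # 5.0e10
print(f"amortized cost per sample: {amortized_params * amortized_steps:.1e}")   # 2.0e10
# Even at 20x the parameters, the amortized model is cheaper per generation here.
```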
4.) Our assessment of LLM abilities is wrong: existing LLMs are actually vastly superhuman, and GPT-2-style models are already at human parity. This seems strongly unlikely from actually interacting with these models, but on the other hand, even GPT-2 models possess a lot of arcane knowledge which is superhuman, and it may be that the very powerful cognition of these small models is simply smeared across such a wide range of weird internet data that it appears much weaker than us in any specific facet. Intuitively, the idea would be that a human and GPT-2 possess the same 'cognitive/linguistic power', but since GPT-2's cognition is spread over a much wider data range than a human's, its 'linguistic power density' is lower and it therefore appears much less intelligent in the much smaller human-relevant domain in which we test it. I am highly unclear on whether these concepts are actually correct or a useful frame through which to view things.
5.) Language models are highly inefficient and can be made much smaller without sacrificing much performance. For whatever reason, we may just be training language models badly or doing something else wrong, and it is in fact possible to get 1 or 2 OOMs of parameter efficiency out of current language models. If this were true, it would be massive, since it would shrink a GPT-4-level model into a trivially open-sourceable and highly hackable 'small' LLM. For instance, GPT-4 is unlikely to be more than 1 trillion dense parameters. Two orders of magnitude would shrink it to a 10B model, approximately the same size as the smaller LLaMA models and smaller than GPT-NeoX-20B, which would be straightforward to run inference on with even consumer-grade cards. There is some evidence for this in the reasonably large amounts of pruning that are possible, but to me an actual 2-OOM shrinking seems unlikely.
First, when we say "language model" and then talk about that model's capabilities for "standard question answering and factual recall tasks", I worry that we've accidentally moved the goalposts on what a "language model" is.
Originally, language models were stochastic parrots. They were developed to answer questions like "given these words, what comes next?", "given this sentence, with this unreadable word, what is the most likely candidate?", or "what are the most common words?"[1] It was not a problem that required deep learning.
Then, we applied deep learning to it, because the path of history so far has been to take straightforward algorithms, replace them with a neural network, and see what happens. From that, we got ... stochastic parrots! Randomizing the data makes perfect sense for that.
Then, we scaled it. And we scaled it more. And we scaled it more.
And now we've arrived at a thing we keep calling a "language model" due to history, but it isn't a stochastic parrot anymore.
Second, I'm not saying "don't randomize data"; I'm saying "use a tiered approach to training". We would use all of the same techniques: randomization, masking, adversarial splits, etc. What we would not do is throw all of our data and all of our parameters into a single, monolithic model and expect that to be efficient.[2] Instead, we'd first train a "minimal" LLM, then use that LLM as a component within a larger NN, and train that combined system (LLM + NN) on all of the test cases we care about for abstract reasoning / problem solving / planning / etc. It's that combined system that I think would end up being vastly more efficient than current language models, because I suspect the majority of language-model parameters are being used to embed trivia that doesn't contribute to the core capabilities we recognize as "general intelligence".
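As a concrete (and entirely hypothetical) sketch of what that tiered setup could look like, assuming a PyTorch-style workflow: a small pretrained LM is frozen and reused as a component inside a larger network, and only the new parameters are trained on the downstream tasks. The module names and sizes below are made up for illustration.

```python
import torch
import torch.nn as nn

class MinimalLM(nn.Module):
    """Stand-in for a small, separately pretrained language model (tier 1)."""
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, tokens):
        return self.encoder(self.embed(tokens))

class ReasoningSystem(nn.Module):
    """Tier 2: wraps the frozen LM with extra trainable capacity for downstream tasks."""
    def __init__(self, lm, d_model=512, n_tasks=10):
        super().__init__()
        self.lm = lm
        for p in self.lm.parameters():   # freeze the tier-1 LM
            p.requires_grad = False
        self.reasoner = nn.GRU(d_model, d_model, batch_first=True)
        self.head = nn.Linear(d_model, n_tasks)

    def forward(self, tokens):
        features = self.lm(tokens)            # frozen language features
        states, _ = self.reasoner(features)   # trainable "reasoning" layer
        return self.head(states[:, -1])       # task prediction from the final state

lm = MinimalLM()                              # pretend this was pretrained separately
system = ReasoningSystem(lm)
optimizer = torch.optim.Adam(
    [p for p in system.parameters() if p.requires_grad], lr=1e-4
)
logits = system(torch.randint(0, 32000, (2, 16)))   # dummy batch of token ids
print(logits.shape)                                  # torch.Size([2, 10])
```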
[1] This wasn't for auto-complete; it was generally for things like automatic text transcription from images, audio, or videos. Spam detection was another use-case.
[2] Recall that I'm trying to offer a hypothesis for why a system like GPT-3.5 takes so much training and has so many parameters and still isn't "competent" in all of the ways that a human is competent. I think "it is being trained in an inefficient way" is a reasonable answer to that question.