Decoder is a reputable source, and 6x as large is about the size most of us were expecting. Keep in mind that's roughly 1/100th of what the Twitter memes started by newsletter writers had been hyping.
Semafor's sources are to be trusted, unlike some people.
My estimate is about 400 billion parameters (somewhere between 100 billion and 1 trillion), based on EpochAI's estimate of GPT-4's training compute and on scaling laws, which let you calculate the optimal number of parameters and training tokens for a language model given a fixed compute budget.
Although 1 trillion sounds impressive, and bigger models do tend to achieve a lower loss on a fixed amount of data, more parameters are not necessarily desirable: a bigger model uses more compute per token and therefore can't be trained on as much data within the same budget.
If the model is made too big, the penalty from training on fewer tokens actually outweighs the benefit of the extra parameters, leading to worse performance.
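Here's the back-of-envelope version of that calculation in Python, using my reading of EpochAI's compute estimate (roughly 2e25 FLOP, which is an assumption on my part) and the standard Chinchilla approximations C ≈ 6·N·D with compute-optimal D ≈ 20·N:

```python
# Back-of-envelope Chinchilla-style estimate (assumed figures, not OpenAI's):
# training compute C ≈ 6 * N * D, and compute-optimal tokens D ≈ 20 * N,
# so the optimal parameter count is N ≈ sqrt(C / 120).

C = 2e25                      # EpochAI-style GPT-4 compute estimate in FLOPs (my assumption)

n_opt = (C / 120) ** 0.5      # compute-optimal parameter count
d_opt = 20 * n_opt            # corresponding training tokens

print(f"optimal parameters ~ {n_opt:.2e}")   # ~4.1e11, i.e. roughly 400 billion
print(f"optimal tokens     ~ {d_opt:.2e}")   # ~8.2e12, i.e. roughly 8 trillion
```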
Extract from the Training Compute-Optimal Large Language Models paper (Hoffmann et al., 2022):
"our analysis clearly suggests that given the training compute budget for many current LLMs, smaller models should have been trained on more tokens to achieve the most performant model."
Another quote from the paper:
"Unless one has a compute budget of FLOPs (over 250× the compute used to train Gopher), a 1 trillion parameter model is unlikely to be the optimal model to train."
So unless the EpochAI estimate is too low by about an order of magnitude [1] or OpenAI has discovered new and better scaling laws, the number of parameters in GPT-4 is probably lower than 1 trillion.
My Twitter thread estimating the number of parameters in GPT-4.
I don't think it is, but it could be.
GPT-4 was trained on OpenAI's new supercomputer which is composed of [edit] NVIDIA DGX A100 nodes.
I'm assuming each individual instance of GPT-4 runs on one DGX A100 node.
Each DGX node has 8x A100 GPUs, and each A100 can have either 40 or 80 GB of VRAM, so a single DGX node running GPT-4 has either 320 or 640 GB in total. That lets us calculate an upper limit on the number of parameters in a single GPT-4 instance.
Assuming GPT-4 uses float16 to represent parameters (same as GPT-3), and assuming they're using the 80GB A100s, that gives us an upper limit of 343 billion parameters in one GPT-4 instance.
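For reference, the arithmetic behind that 343 billion figure (weights only, counting the node's 8 × 80 GB as 640 GiB; activations and other overhead would lower the real ceiling):

```python
# Weights-only ceiling for one DGX A100 node (8x 80 GB), assuming float16 parameters.
vram_bytes = 8 * 80 * 1024**3   # 640 GiB of total VRAM across the node
bytes_per_param = 2             # float16

max_params = vram_bytes / bytes_per_param
print(f"{max_params:.3e} parameters")   # 3.436e+11, i.e. ~343 billion
```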
GPT-3 had 175 billion parameters. I've seen a few references online to an interview where Sam Altman said GPT-4 actually has fewer parameters than GPT-3 but a different architecture and more training. I can't find the original source, so I can't verify the quote, but if it's true, the lower end of the range would sit somewhat below GPT-3's 175 billion parameters.
Looking at the compute architecture that it's running on, an upper bound of 343 billion parameters seems reasonable.
[edited: removed incorrect estimate of number of DGX nodes as 800. Figure wasn't used in parameter estimate anyways.]
GPT-4 was trained on OpenAI's new supercomputer which is composed of 800 NVIDIA DGX A100 nodes...Each DGX node has 8x A100 GPUs
Where are you getting that? 8×800 = 6,400 A100s sounds off by a factor of three.
(Also, it does not follow that they are storing parameters in the same precision rather than mixed precision, nor solely in GPU memory; most of the serious scaling frameworks support various model-offload approaches.)
I don't remember; I probably misremembered the number of DGX nodes. Will edit the comment to remove that figure.
Assuming that GPT-4 is able to run on a single DGX node with 640 GB VRAM, what would be a reasonable upper bound on the parameter count assuming that they're using mixed precision and model offload approaches?
[Edit] I've been researching various offloading approaches. It seems unlikely that they are using anything like weight offloading, as loading layers to and from VRAM would be far too time-consuming to build a usable API on top of. If the model is too large to fit on a single DGX node, it's more likely that they're splitting the layers across multiple nodes rather than offloading weights.
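As a rough sanity check on the latency (assumed numbers: the ~343B float16 weight budget from above, and something like 32 GB/s of host-to-device bandwidth per GPU, ballpark for PCIe 4.0 x16):

```python
# If each forward pass had to stream every weight from host RAM into VRAM,
# the transfer time alone would dominate (all numbers are rough assumptions).

weights_bytes = 343e9 * 2          # ~343B parameters at 2 bytes each (float16)
pcie_bw_per_gpu = 32e9             # ~32 GB/s host-to-device, ballpark for PCIe 4.0 x16
num_gpus = 8                       # all GPUs streaming their shard in parallel

seconds_per_pass = weights_bytes / (pcie_bw_per_gpu * num_gpus)
print(f"{seconds_per_pass:.1f} s of pure weight transfer per forward pass")   # ~2.7 s
```

A couple of seconds of pure weight transfer per forward pass (i.e. per generated token) is nowhere near interactive, which is why splitting layers across nodes over fast interconnect seems far more plausible than offloading.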
I already assumed they're using float16, like GPT-3, when calculating the total number of parameters that could fit in one DGX node's VRAM. Unless they're using something even smaller like float8, mixed precision with float32 or float64 would only increase the VRAM requirements.
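To put numbers on that, same 640 GiB weights-only assumption as above, just varying the bytes per parameter:

```python
# Weights-only ceiling for the same 640 GiB node at different parameter widths.
vram_bytes = 8 * 80 * 1024**3   # 640 GiB

for name, nbytes in [("float32", 4), ("float16", 2), ("int8/fp8", 1)]:
    print(f"{name}: ~{vram_bytes / nbytes / 1e9:.1f} billion parameters")
# float32: ~171.8, float16: ~343.6, int8/fp8: ~687.2
```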
You're missing the possibility that the model used during training was larger than the models used for inference. It's common practice now to train large, then distill into a series of smaller models that can be selected based on the task.
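For anyone who hasn't seen the recipe, distillation just trains the smaller model to match the bigger model's output distribution. A generic sketch of Hinton-style soft-label distillation in PyTorch (toy shapes, nothing specific to OpenAI):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation: push the student's softened distribution
    toward the teacher's via KL divergence (Hinton et al., 2015)."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # batchmean + t^2 scaling keeps gradients comparable across temperatures
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t

# Toy usage: a batch of 4 positions over a 10-token vocabulary.
teacher_logits = torch.randn(4, 10)                        # frozen big model's outputs
student_logits = torch.randn(4, 10, requires_grad=True)    # small model's outputs
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()                                            # gradients flow into the student only
```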
This source (a Chinese news website), "Not 175 billion! OpenAI CEO's announcement: GPT-4 parameters do not increase but decrease" - iMedia (min.news),
cites the Sam Altman quote about GPT-4 having fewer parameters as coming from the AC10 online meetup, but I can't seem to find any transcript or video of that meetup to verify it.
I know that OpenAI has tried to keep GPT-4's specs under wraps, but I've also heard reports that anonymous sources are converging on a trillion parameters for the model. Can anyone confirm, deny, or qualify that? https://the-decoder.com/gpt-4-has-a-trillion-parameters/