GödelPilled
GödelPilled has not written any posts yet.

My understanding is that GPT-style transformer architectures already incorporate random seeds at various points. If so, adding this functionality at the random-seed level wouldn't impose any significant "cost" in terms of competing with other implementations.
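To make "random seeds at various points" concrete: in a GPT-style model, the main run-time randomness is in token sampling (training-time dropout is another site). Here's a toy, pure-Python sketch of seed-controlled temperature sampling; the function and numbers are illustrative, not any real model's API:

```python
import math
import random

def sample_token(logits, temperature, seed):
    """Sample a token index from logits; the seed fully determines the draw."""
    rng = random.Random(seed)
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # The seeded rng makes this draw deterministic and reproducible.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5]
# Same seed -> same token. Swapping the seed *source* (e.g. for a QRNG)
# changes only this draw, not the weights or the compute cost.
assert sample_token(logits, 1.0, seed=42) == sample_token(logits, 1.0, seed=42)
```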
This source (a Chinese news site), "Not 175 billion! OpenAI CEO's announcement: GPT-4 parameters do not increase but decrease" - iMedia (min.news), cites the Sam Altman quote about GPT-4 having fewer parameters as coming from the AC10 online meetup, but I can't find any transcript or video of that meetup to verify it.
GPT-4 was trained on OpenAI's new supercomputer which is composed of [edit] NVIDIA DGX A100 nodes.
I'm assuming each individual instance of GPT-4 runs on one DGX A100 node.
Each DGX node has 8x A100 GPUs, and each A100 comes with either 40 or 80 GB of VRAM, so a single DGX node running GPT-4 has either 320 or 640 GB in total. That lets us calculate an upper limit on the number of parameters in a single GPT-4 instance.
Assuming GPT-4 uses float16 parameters (same as GPT-3), and assuming they're using the 80 GB A100s, that gives an upper limit of roughly 343 billion parameters in one GPT-4 instance.
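The arithmetic behind that 343 billion figure, for anyone who wants to check it (the assumptions are mine: 8x 80 GB A100s, float16 weights at 2 bytes each, and all VRAM available for weights with no room reserved for activations or overhead):

```python
# Back-of-the-envelope upper bound on parameters fitting in one DGX node's VRAM.
GIB = 1024**3
gpus = 8
vram_per_gpu_gib = 80
bytes_per_param = 2  # float16

total_vram_bytes = gpus * vram_per_gpu_gib * GIB
max_params = total_vram_bytes / bytes_per_param
print(f"{max_params / 1e9:.1f} billion parameters")  # 343.6 billion parameters
```

In practice activations, KV caches, and framework overhead also live in VRAM, so the realistic ceiling is lower.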
GPT-3 had 175 Billion parameters. I've seen a... (read more)
Also since it's my first post here, nice to meet everyone! I've been lurking on LessWrong since ~2016 and figured now is as good a time as any to actively start participating.
Everett Insurance as a Misalignment Backup
(Posting this here because I've been lurking a long time and just decided to create an account. Not sure if this idea is well-structured enough to warrant a top-level post, and I need to accrue karma before I can post anyway.)
Premise:
There exist some actions whose cost is so low, and whose payoff in the case that the Everett interpretation happens to be true is so high, that their expected value makes them worth doing even if P(everett=true) is very low. (Personal prior: P(everett=true) = 5%.)
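The expected-value argument can be made concrete with a toy calculation (the cost and payoff figures below are hypothetical illustrations, not estimates from the post; only the 5% prior comes from the premise above):

```python
# Toy expected-value check for a low-cost, high-conditional-payoff action.
p_everett = 0.05         # personal prior that the Everett interpretation is true
cost = 1.0               # small fixed cost of the action (arbitrary units)
payoff_if_true = 1000.0  # payoff conditional on Everett being true (illustrative)

expected_value = p_everett * payoff_if_true - cost
print(f"EV = {expected_value:.1f}")  # EV = 49.0
```

Even at a low prior, the action is positive-EV whenever the conditional payoff exceeds cost / P(everett=true).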
Proposal:
The proposal behind Everett Insurance is straightforward: use Quantum Random Number Generators (QRNGs) to introduce quantum random seeds into large AI systems.... (read 1130 more words →)
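A minimal sketch of the seeding idea in the proposal above. The entropy source here is a placeholder: `os.urandom` stands in for a hardware QRNG, since I don't want to assume any particular QRNG product's API; swapping in genuine quantum output is the whole point of the proposal.

```python
import os

def quantum_seed(num_bytes=8):
    """Derive an integer seed from an entropy source.

    Placeholder: os.urandom stands in for a hardware quantum RNG here.
    A real deployment would read from a QRNG device or service instead.
    """
    return int.from_bytes(os.urandom(num_bytes), "big")

# Hypothetical usage: draw a fresh seed per request, so sampling outcomes
# differ across Everett branches while the model itself is unchanged.
seed = quantum_seed()
print(f"sampling seed: {seed}")
```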
I don't remember, I probably misremembered on the number of DGX nodes. Will edit the comment to remove that figure.
Assuming that GPT-4 is able to run on a single DGX node with 640 GB VRAM, what would be a reasonable upper bound on the parameter count assuming that they're using mixed precision and model offload approaches?
[Edit] I've been researching various offloading approaches. It seems unlikely that they are using anything like weight offloading, as moving layers to and from VRAM would be far too time-consuming for them to build a usable API on top of it. If it is too large to fit on a single DGX node then... (read more)
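A rough feasibility check on why weight offloading looks too slow (all assumptions are mine, not anything known about OpenAI's setup): streaming a ~343B-parameter float16 model from host memory over a single PCIe 4.0 x16 link at its ~32 GB/s theoretical bandwidth.

```python
# Time to stream the full weight set over one PCIe 4.0 x16 link (theoretical).
params = 343e9
bytes_per_param = 2                    # float16
model_bytes = params * bytes_per_param # ~6.86e11 bytes
pcie_bw = 32e9                         # bytes/sec, PCIe 4.0 x16 theoretical

seconds_per_full_pass = model_bytes / pcie_bw
print(f"{seconds_per_full_pass:.1f} s per full weight stream")  # 21.4 s
```

Even split across 8 GPUs streaming in parallel, that's seconds per forward pass, and autoregressive generation needs one pass per token, which is hard to square with a responsive API.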