Comments

gwern · 19h · 91

> For example, a 70B model trained on next-token prediction only on the entire 20TB GenBank dataset will have better performance at next-nucleotide prediction than a 70B model that has been trained both on the 20TB GenBank dataset and on all 14TB of code on Github.

I don't believe that's obvious, and to the extent that it's true, I think it's largely irrelevant (and part of the general prejudice against scaling & Bitter Lesson thinking, where everyone is desperate to find an excuse for small specialist models with complicated structures & fancy inductive biases because that feels right).

> Once you have a bunch of specialized models, "the weights are identical" and "a fine tune can be applied to all members" no longer holds.

Nor do I see how this is relevant to your original claim. If you have lots of task-specialist models, how does this refute the claim that those will be able to coordinate? Of course they will. They will just share weight updates in exactly the way I just outlined, which works so well in practice. You may not be able to share parameter-updates across your protein-only and your Python-only LLMs, but they will be able to share updates within that model family and the original claim ("AGIs derived from the same model are likely to collaborate more effectively than humans because their weights are identical. Any fine-tune can be applied to all members, and text produced by one can be understood by all members.") remains true, no matter how you swap out your definition of 'model'.

DL models are fantastically good at collaborating and updating each other, in many ways completely impossible for humans, whether you are talking about AGI models or narrow specialist models.

gwern · 1d · 53

You might find my notes of interest.

gwern · 1d · 127

> I think this only holds if fine tunes are composable, which as far as I can tell they aren't

You know 'finetunes are composable', because a finetune is just a gradient descent step on a batch of data and a parameter update, and if you train on more than one GPU and share updates, DL training still works {{citation needed}}.

If you can train asynchronously on a thousand, or 20,000, or 100,000 GPUs, that is what you are doing; this is especially true in DRL, where you might be, say, training across 170,000 CPU-cores. This works because you don't insist on everything being up to date every moment and you accept that there will be degrees of inconsistency/outdatedness. (You are certainly not accumulating the gradient across the entire cluster by waiting for every single node, pausing everything, calculating a single global step, and pushing it out, and only then resuming, as if it were a single GPU! Really, you don't even want to do that on a single GPU for DRL if you gotta go fast.) This works so well that people will casually talk about training "an" AlphaZero, even though they actually mean something more like "the 512 separate instances of AlphaZero we are composing finetunes of" (or more).*
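To make "finetunes are composable" concrete, here is a toy sketch (illustrative PyTorch only; the model, data, and step counts are stand-ins, not anyone's production setup): each worker's finetune is just a parameter delta off a shared base, and applying several of them is ordinary addition, the same arithmetic data-parallel training does with gradients on every step.

```python
# Toy illustration (stand-in model and data): a "finetune" is just a parameter
# delta off a shared base, and applying several deltas is ordinary addition.
import copy
import torch
import torch.nn as nn

def finetune_delta(model: nn.Module, xs: torch.Tensor, ys: torch.Tensor,
                   steps: int = 100) -> dict:
    """Run a few SGD steps on one shard of data; return the parameter delta."""
    base = {k: v.clone() for k, v in model.state_dict().items()}
    opt = torch.optim.SGD(model.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(xs), ys).backward()
        opt.step()
    return {k: v - base[k] for k, v in model.state_dict().items()}

base_model = nn.Linear(16, 1)

# Two workers finetune independent copies on different data shards.
delta_a = finetune_delta(copy.deepcopy(base_model), torch.randn(64, 16), torch.randn(64, 1))
delta_b = finetune_delta(copy.deepcopy(base_model), torch.randn(64, 16), torch.randn(64, 1))

# "Composing" the finetunes: add both deltas back onto the shared base weights.
merged = copy.deepcopy(base_model)
with torch.no_grad():
    for k, v in merged.state_dict().items():
        v += delta_a[k] + delta_b[k]   # state_dict tensors share storage with the params
```

In practice you would average or otherwise reconcile conflicting deltas rather than blindly sum them, which is exactly the haggling over efficiency described next.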

You do have issues with stale gradients and off-policyness of updates and how to best optimize throughput of all of the actors vs training nodes and push out model updates efficiently so nodes stop executing outdated parameters as quickly as possible, and DeepMind & OpenAI etc have done a lot of work on that - but at that point, as in the joke, you have conceded that finetunes are composable and you can keep a very large number of replicas in sync, and it is merely a matter of haggling over how much efficiency you lose.

Also note that it takes a lot less compute to keep a model up to date doing simple online learning on new data than it does to train it from scratch on all historical data summed together (obviously), so what devrandom is talking about is actually a lot easier than creating the model in the first place.

A better model to imagine is not "somehow finetunes from millions of independent models magically compose" (although actually they would compose pretty well), but more like, "millions of independent actors do their ordinary business, while spending their spare bandwidth downloading the latest binary delta from peer nodes (which due to sparsity & not falling too far out of sync, is always on the order of megabytes, not terabytes), and once every tens of thousands of forward passes, discover a novel or hard piece of data, and mail back a few kilobytes of text to the central training node of a few thousand GPUs, who are continually learning on the hard samples being passed back to them by the main fleet, and who keep pushing out an immediately updated model to all of the actor models, and so 'the model' is always up to date and no instance is more than hours out of date with 'the model' (aside from the usual long tail of stragglers or unhealthy nodes which will get reaped)".
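As a rough illustration of why those deltas stay at megabyte scale (a hypothetical sketch; the threshold, sizes, and encoding are made up, not any actual sync protocol): if an actor only receives the parameters that changed meaningfully since its last sync, the payload scales with the sparsity of the update rather than with the size of the model.

```python
# Hypothetical sketch of a sparse weight-delta payload: ship only the entries
# that changed meaningfully since the actor's last sync, not the full weights.
import numpy as np

def make_sparse_delta(old: np.ndarray, new: np.ndarray, threshold: float = 1e-4):
    """Return (indices, values) for entries whose change exceeds the threshold."""
    diff = (new - old).ravel()
    idx = np.nonzero(np.abs(diff) > threshold)[0].astype(np.uint32)
    return idx, diff[idx].astype(np.float16)

def apply_sparse_delta(weights: np.ndarray, idx: np.ndarray, vals: np.ndarray) -> None:
    flat = weights.reshape(-1)                 # view, so the update lands in place
    flat[idx] += vals.astype(weights.dtype)

# Stand-in numbers: a 10M-parameter "model" where ~50k weights changed since the last sync.
old = np.random.randn(10_000_000).astype(np.float32)
new = old.copy()
new[np.random.choice(old.size, 50_000, replace=False)] += 0.01

idx, vals = make_sparse_delta(old, new)
print(f"sparse delta: {(idx.nbytes + vals.nbytes) / 1e6:.1f} MB "
      f"vs full weights: {new.nbytes / 1e6:.1f} MB")
```

With these made-up numbers the delta is a few hundred kilobytes against tens of megabytes of full weights; the same ratio is what keeps real sparse/low-rank update streams small relative to the model.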

* I fear this is one of those cases where our casual reification of entities leads to poor intuitions, akin to asking 'how many computers are in your computer you are using right now?'; usually, the answer is just '1', because really, who cares how exactly your 'smartphone' or 'laptop' or 'desktop' or 'server' is made up of a bunch of different pieces of silicon - unless you're discussing something like device performance or security, in which case it may matter quite a lot and you'd better not think of yourself as owning 'a' smartphone.

gwern · 1d · 74

> (likely conditional on some aspects of the training setup, idk, self-supervised predictive loss function?)

Pretraining, specifically: https://gwern.net/doc/reinforcement-learning/meta-learning/continual-learning/index#scialom-et-al-2022-section

The intuition is that after pretraining, models can map new data into very efficient low-dimensional latents and have tons of free space / unused parameters. So you can easily prune them, but also easily specialize them with LoRA (because the sparsity is automatic, just learned) or just regular online SGD.
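For concreteness, a minimal LoRA-style adapter sketch (illustrative only; the rank, scaling, and layer sizes are arbitrary): the pretrained weight stays frozen and specialization trains only a low-rank correction, a fraction of a percent of the parameters.

```python
# Minimal LoRA-style adapter sketch (not any particular library's API):
# freeze the pretrained weight and learn only a low-rank update B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.3%}")   # well under 1%
```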

But yeah, it's not a real problem anymore, and the continual learning research community is still in denial about this and confining itself to artificially tiny networks to keep the game going.

gwern · 2d · 71

Altman made a Twitter-edit joke about 'gpt-2 i mean gpt2', so at this point I think it's just a funny troll-name riffing on the 'v2 personality', making it a successor to the, presumably, 'v1' ChatGPT personality. See, it's 'gpt v2', geddit, not 'gpt-2'? Very funny, everyone lol at troll.

gwern · 2d · 42

Sure, the poem prompt I mentioned using is like 3500 characters all on its own, and before I used up my quota yesterday it had no issues repeatedly revising and printing out 4 new iterations of the poem without apparently forgetting anything, so that convo must've been several thousand BPEs.
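(For anyone wanting to check that estimate, a rough approach is to run the transcript through a BPE tokenizer; which tokenizer the arena model actually uses is unknown, so cl100k_base below is just an assumption, and the transcript filename is hypothetical.)

```python
# Rough sanity check on the BPE count (assumes an OpenAI-style tokenizer,
# cl100k_base; the tokenizer actually behind the arena model is unknown).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_bpes(text: str) -> int:
    return len(enc.encode(text))

# English prose averages roughly 4 characters per BPE, so a ~3500-character
# prompt is on the order of 900 tokens by itself, and four full poem revisions
# on top of it put the whole conversation well past a 1024-token context.
transcript = open("conversation_transcript.txt").read()  # hypothetical saved transcript
print(count_bpes(transcript))
```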

gwern · 3d · 77

It definitely exceeds a 1024-BPE context (we wouldn't be discussing it if it didn't; I don't think people even know how to write prompts that, combined with the system prompt etc., even fit in 1024 BPEs anymore), and it is almost certainly not GPT-2, come on.

gwern · 3d · 95

> And they already have a Sora clone called Vidu, for heaven's sake.

No, they don't. They have a video generation model, which is one of a great many published over the past few years as image generation increasingly became solved, such as Imagen Video or Phenaki from Google years ago, and the Vidu samples are clearly inferior to Sora (despite heavy emphasis on the 'pan over static scene' easy niche): https://www.youtube.com/watch?v=u1R-jxDPC70

Here we are in 2024, and we're still being told how Real Soon Now Chinese DL will crush Westerners. I've been hearing this for almost a decade now, and I've stopped being impressed by the likes of Hsu talking about how "China graduates a million engineers a year!" or whatever. Somehow, the Next Big Thing never comes out of Chinese DL, no matter how many papers or citations or patents they have each year. Something to think about.

(I also have an ongoing Twitter series where every half year or so, I tweet a few of the frontier-pushing Western DL achievements, and I ask for merely 3 Chinese things as good - not better, just plausibly as good, including in retrospect from previous years. You know how many actual legitimate answers I've gotten? Like 1. Somehow, all the e/accs and China hawks like Alexandr Wang can't seem to think of even a single one which was at or past the frontier, as opposed to the latest shiny 'catches up to GPT-4!* * [on narrow benchmarks, YMMV]' clone model.)

gwern · 3d · 75

Nah, it's just a PR stunt. Remember when DeepMind released AlphaGo Master by simply running a 'Magister' Go player online which went undefeated?* Everyone knew it was DeepMind simply because who else could it be? And IIRC, didn't OA also pilot OA5 'anonymously' on DoTA2 ladders? Or how about when Mistral released torrents? (If they had really wanted a blind test, they wouldn't've called it "gpt2", or they could've just rolled it out to a subset of ChatGPT users, who would have no way of knowing the model underneath the interface had been swapped out.)

* One downside of that covert testing: DM AFAIK never released a paper on AG Master, or all the complicated & interesting things they were trying before they hit upon the AlphaZero approach.

gwern · 3d · 105

https://rentry.org/GPT2

I ran out of tokens quickly trying out poetry but I didn't get the impression that this is a big leap over GPT-4 like GPT-5 presumably is designed to be. (It could, I suppose, be a half-baked GPT-5 similar to 'Prometheus' for GPT-4.) My overall impression from poetry was that it was a GPT-4 which isn't as RLHF-damaged as usual, and more like Claude in having a RLAIF-y creative style. So I could believe it's a better GPT-4 where they are experimenting with new tuning/personality to reduce the ChatGPT-bureaucratese.

HN: https://news.ycombinator.com/item?id=40199715
