It's helpful to include a summary with linkposts.
So here's a super quick one. I didn't listen to the video closely, so I could've missed something.
It's about the paper No "Zero-Shot" Without Exponential Data.
Here's the key line from the abstract:
We consistently find that, far from exhibiting "zero-shot" generalization, multimodal models require exponentially more data to achieve linear improvements in downstream "zero-shot" performance, following a sample inefficient log-linear scaling trend.
So we might stop getting better performance, if each small linear improvement requires an exponentially larger dataset. This seems quite plausible, unless somebody comes up with some sort of clever bootstrapping in which automatic labeling of images and videos, with a little human feedback, creates useful datasets of effectively unlimited size.
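To make the arithmetic behind that worry explicit (my own notation, not the paper's): suppose downstream zero-shot performance follows a log-linear trend in the amount of relevant pretraining data,

$$P(N) \approx a + b \log N.$$

Then each additional fixed gain $\Delta P$ requires multiplying the data by a constant factor,

$$N_{\text{new}} = N \cdot e^{\Delta P / b},$$

so $k$ equal performance increments cost $e^{k \Delta P / b}$ times as much data.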
I don't think this will cause much of a slowdown in AGI progress, because we don't need much more progress on foundation models to build scaffolded agentic cognitive architectures that use System 2-type cognition to gauge their accuracy and the importance of each judgment, and that use multiple tries on multiple models for important cognitive acts. That's how humans are as effective as we are: we monitor and double-check our own cognition when appropriate.
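As a rough sketch of the kind of scaffolding I mean (hypothetical function and model names, not any existing framework's API), the idea is just to gate effort by estimated importance and cross-check high-stakes answers across models:

```python
import statistics
from typing import Callable

# Hypothetical stand-ins: each "model" is just a function from a prompt to an
# answer string. Nothing here is a real library's API.
ModelFn = Callable[[str], str]

def scaffolded_answer(
    question: str,
    models: list[ModelFn],
    estimate_importance: Callable[[str], float],
    n_tries: int = 3,
) -> str:
    """Answer cheaply for low-stakes questions; sample several models and
    cross-check (majority vote) when the judgment matters more."""
    importance = estimate_importance(question)  # System 2-style gating
    if importance < 0.5:
        # Low stakes: one try on one model is good enough.
        return models[0](question)

    # High stakes: multiple tries on multiple models, then keep the answer
    # the ensemble produces most often.
    answers = [model(question) for model in models for _ in range(n_tries)]
    return statistics.mode(answers)
```

With deterministic models the repeated tries are redundant; the point is just the shape of the loop: estimate the stakes, then spend more compute and cross-model checking where it matters.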
Key graph: https://arxiv.org/pdf/2404.04125#page=6
I don't think they 'consistently find' anything, except possibly about CLIP (which we've known to have severe flaws ever since it came out, as expected from a contrastive loss, despite of course its many extremely valuable uses). Unless I'm missing something in the transcript, Computerphile doesn't mention this at all.