Dwarkesh Patel recently released a podcast episode with Francois Chollet (creator of Keras).

 

He seems fairly skeptical that we are anywhere near AGI with LLMs. He bases this mostly on his intuition that LLMs fail on out-of-distribution (OOD) tasks and are not good at solving the simple abstract reasoning problems in what he calls the ARC challenge. He seems to think system 2 thinking will be a much harder unlock than people expect and that scaling LLMs will go nowhere. In fact, he goes so far as to say the scaling maximalists have set back AGI progress by 5-10 years. To him, current LLMs are simply information-retrieval databases.

He and a co-founder of Zapier have launched a $1 million prize for beating the ARC benchmarks, which are apparently hard for LLMs. I didn't believe it at first, given how easy the tasks seem, but barely any progress has been made on the ARC benchmarks in the last 4 years. In retrospect, it's odd that so many existing benchmarks rely heavily on memorized knowledge, and the ARC results are consistent with LLMs being bad at playing sudoku (so maybe they are not that surprising).

This seems to contradict what people on this site generally think. Is the disagreement mainly that people here expect system 2 thinking to be a relatively fast unlock (this is my take at least[1]), whereas Francois thinks it will take a long time?

Or does it go deeper? 

 

[1] Personally, my intuition is that LLMs are world modelers and that system 2 thinking will be a relatively simple unlock as they get better at modeling the world.


Seth Herd

83

I think he's totally right that there's a missing ability in LLMs. He doesn't claim this will be a big blocker. I think we'd be fools to assume this gives us much more time.

My previous comment on this podcast and pretty much this question says more. 

Briefly: There might be a variety of fairly easy ways to add more System 2 thinking, problem-solving, and reasoning in genuinely new domains. Here's one for system 2 thinking, and here's one for better reasoning and knowledge discovery. There might easily be six more that people are busy implementing, half of which will work pretty quickly.

This could be a bottleneck that gives us extra years, but assuming that seems like a very bad idea. We should step lively on the whole alignment project in case this intelligence stuff happens to be a lot easier than we've been thinking prior to getting enough compute and deep nets that really work.

WRT the consensus you mention: there's no consensus, here or elsewhere. Nobody knows. Taking an average would be a bad idea. The distribution among people who've got the right expertise (or as close as we get now) and spend time on prediction is still very broad. This says pretty clearly that nobody knows. That includes this question as well as all other timeline questions. We can't be sure until it's built and working. I've got lots of reasoning behind my guess that this won't take long to solve, but I wouldn't place heavy odds on being right. 

That broad distribution is why the smart bet is to have an alignment solution ready for the shorter projected timelines. 

O O (44)

"He doesn't claim this will be a big blocker."

He does; otherwise the claim that OpenAI pushed back AGI timelines by 5-10 years doesn’t make sense.

Seth Herd (4)
I stand corrected. I didn't remember that his estimate of the delay was that large. That seems like a very high estimate to me, for the reasons above.
sliqz (3)
However, part of that was about going from open research to closed.

"There might be a variety of fairly easy ways to add more System 2 thinking and problem-solving and reasoning in genuinely new domains. Here's one for system 2 thinking"

 

This approach seemed more plausible to me a year ago than it does now. It seemed feasible enough that I sketched out takeover scenarios along those lines (e.g. here). But I think we should update on the fact that there doesn't seem to have been much progress in this direction since then, despite e.g. Auto-GPT getting $12M in funding in November, and lots of other startups racing for commerc...

eggsyntax (4)
Note: one possible reason that scaffolded agents haven't succeeded better yet is the argument that Sholto Douglas & Trenton Bricken make on a recent Dwarkesh Patel podcast: that you just need another couple of orders of magnitude of reliability before you can string together substantial chains of subgoals and get good results. 
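As a rough illustration of the reliability argument above (a sketch under simplifying assumptions, not something taken from the podcast): if each subgoal succeeds independently with probability p, a chain of n subgoals succeeds with probability p^n. The function name, the per-step reliabilities, and the chain lengths below are all hypothetical, and real agent failures are unlikely to be fully independent.

```python
def chain_success(per_step_reliability: float, num_steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes each step succeeds independently with the same probability,
    which is a simplification made only for illustration.
    """
    if not 0.0 <= per_step_reliability <= 1.0:
        raise ValueError("per_step_reliability must be between 0 and 1")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_reliability ** num_steps


if __name__ == "__main__":
    # Hypothetical per-step reliabilities and chain lengths.
    for p in (0.9, 0.99, 0.999):
        for n in (10, 100):
            print(f"p={p}, steps={n}: success={chain_success(p, n):.4f}")
```

Under these assumptions, cutting the per-step failure rate by two orders of magnitude (from 10% to 0.1%) takes a 100-step chain from almost always failing to succeeding roughly nine times in ten.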
Seth Herd (2)
This is a good point. I keep expecting to see useful agents released, or to hear the open-source community get excited about their success with non-commercial projects. Neither has happened. So we should update at least a bit. This is harder or less useful than it first seemed. But I think there's still a good chance that this is the fastest and most obvious route to AGI.

In the article I linked, about a year old now, I didn't predict that GPT-4 could be turned into AGI, just that LLMs could, and I noted that it would more likely be GPT-5 or GPT-6 combined with scaffolding that becomes very useful and very dangerous, relatively easily. The dumber the model, the better and more elaborate the scaffolding needs to be to get it past the point of autonomous reasoning and usefulness.

There are two factors you don't mention. One is that the biggest blocker to commercial usefulness wasn't reasoning ability; it was the ability to correctly interpret a webpage or other software. Multimodal models largely solved that. So most of the commercial dev effort probably went there until the availability of natively multimodal LLM APIs. That was about six months ago, still a long time. And that doesn't account for less-commercial efforts.

The second is the possibility that GPT-4 and the current generation just aren't quite smart enough to make scaffolded System 2 work. The article Large Language Models Cannot Self-Correct Reasoning Yet, from DeepMind and academic authors in October 2023, draws this conclusion. (The many reports of useful self-correction were based on terrible methodology that miscalculated base rates when you allow multiple guesses. Computer scientists are even worse at methodology than social scientists, apparently.) The "Yet" in their title is important. They think a little more native reasoning ability would get them over the hump to doing useful self-correction. That's one huge application of System 2 reasoning, but not all of it.

My current fear is not that this system 2 scaf
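The base-rate problem with multiple guesses mentioned above can be made concrete with a toy calculation. This is a minimal sketch of a hypothetical setup, not the analysis from the cited paper: a multiple-choice question with a fixed number of options, and a guesser with no skill who tries distinct options at random and stops once told an answer is correct. The function name and the numbers are assumptions for illustration.

```python
from fractions import Fraction


def chance_accuracy(num_options: int, guesses_allowed: int) -> Fraction:
    """Chance that a purely random guesser finds the right answer.

    Assumes the guesser tries distinct options uniformly at random and
    stops as soon as it is told an answer is correct (hypothetical setup).
    """
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    if guesses_allowed < 0:
        raise ValueError("guesses_allowed must be non-negative")
    # Each extra distinct guess covers one more of the equally likely options.
    return Fraction(min(guesses_allowed, num_options), num_options)


if __name__ == "__main__":
    for guesses in (1, 2, 3):
        rate = chance_accuracy(num_options=4, guesses_allowed=guesses)
        print(f"{guesses} guess(es) on a 4-option question: chance = {rate}")
```

In this toy setup, a second guess doubles the chance rate on a four-option question, so a score obtained with retries has to be compared against the multi-guess baseline rather than the single-guess one.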
eggsyntax (1)
Agreed that it's quite plausible that LLMs with scaffolding basically scale to AGI. Mostly I'm just arguing that it's an open question with important implications for safety and, in particular, timelines.

I'm very skeptical that interpreting web pages was the biggest blocker. Some pages include images (e.g. charts) that are necessary to understand the page content, but for many or most pages, the important content is text in an HTML file, and we know LLMs handle HTML just fine (since they can easily create it on demand).

Agreed that the current generation not being quite smart enough seems like a totally live possibility.

Personally, I'd have to be a lot more confident that alignment of such systems just works to favor alignment researchers advancing capabilities; to me, having additional time before AGI seems much more clearly valuable.
Seth Herd (3)
I realized I didn't really reply to your first point, and that it's a really important one. We're in agreement that scaffolded LLMs are a possible first route to AGI, but not a guaranteed one. If that's the path, timelines are relatively short. If that's a possibility, we'd better have alignment solutions for that possible path, ASAP. That's why I continue to focus on aligning language model agents (LMAs).

If other paths to AGI turn out to be the first routes, timelines are probably a little longer, so we've got a little longer to work on alignment for those types of systems. And there are more people working on RL-based alignment schemes (I think?).
eggsyntax (1)
I'm glad you're working on it! I think your arguments are plausible and your approach is potentially promising. 
Seth Herd (3)
I was also surprised that interpreting webpages was a major blocker. They're in text and HTML, as you say. I don't remember who said this, but I remember believing them since they'd actually tried to make useful agents. They said that actual modern webpages are such a flaming mess of complex HTML that the LLMs get confused easily.

Your last point, whether it is preferable to steer toward easier-to-align AGI or to have more time to work on alignment, is a very complex issue. I don't have a strong opinion since I haven't worked through it all. But I think there are very strong reasons to think LLM-based AGI is far easier to align than other forms, particularly if the successful approach doesn't heavily rely on RL. So I think your opinion is in the majority, but nobody has worked it through carefully enough to have a really good guess. That's a project I'd like to embark on by writing a post making the controversial suggestion that maybe we should be actively building LMA AGI as the safest of a bad set of options.
eggsyntax (1)
I think that'd be a really valuable post!

I also think we'll get substantial info about the feasibility of LMA in the next six months. Progress on ARC-AGI will tell us a lot about LLMs as general reasoners, I think (and Redwood's excellent new work on ARC-AGI has already updated me somewhat toward this not being a fundamental blocker). And I think GPT-5 will tell us a lot. 'GPT-4 comes just short of being capable and reliable enough to work well for agentic scaffolding' is a pretty plausible view. If that's true, then we should see such scaffolding working a lot better with GPT-5; if it's false, then we should see continued failures to make it really work.

RogerDearnaley

51

Chain-of-thought (CoT) prompting and agentic behavior are basically supplying System 2 thinking. Currently LLMs tend to use and benefit from them for a little while, then sooner or later go off the rails, get caught in a loop, or get confused, and are seldom able to get unstuck when they do. What we need is for them to be able to carry out, much more reliably, abilities that they have already demonstrated, which is bread-and-butter for scaling. So I don't see System 2 thinking as a blocker, just work-in-progress. It might take a few years.

As for the ARC challenge, it clearly requires a visual LLM, so systems capable of attempting it have only really existed for about 18 months. My guess is that it will fall soon: progress on math and programming benchmarks has been rapid, so visual logic puzzles don't seem like they would be that hard. I'd guess the main problem is the shortage of visual puzzle training material for tasks like this in most training sets.

O O (40)

"My guess is that it will fall soon: progress on math and programming benchmarks has been rapid, so visual logic puzzles don't seem like they would be that hard."

His argument is that with millions of examples of these puzzles, you can train an LLM to be good at this particular task, but that doesn’t amount to reasoning if the model then fails at a similar task it hasn’t seen. He thinks you should be able to get an LLM to do this without ever training on tasks like these.

I can buy this argument, but still have some doubts. It may be this reasoning is just derived from vi...

lc (20)

Francois seems almost to assume that just because an algorithm takes millions or billions of datapoints to train, that means its output is just "memorization". In fact it seems to me that the learning algorithms just work pretty slowly, and that the thing that's learned after those millions or billions of tries is the actual generative concepts.

Redwood's new work on ARC-AGI provides some useful evidence on this question IMO.

My hypothesis is that poor performance on ARC is largely due to lack of training data. If there were billions of diverse input/output examples to train on, I would guess standard techniques would work.

Efficiently learning from just a few examples is something that humans are still relatively good at, especially in simple cases where system 1 and system 2 synergize well. I’m not aware of many cases where AI approaches human level without orders of magnitude more training data than a human ever sees in a lifetime.

I think the ARC challenge can be solved within a year or two, but doing so won’t be super interesting to me unless it breaks new ground in sample efficiency (not trained on billions of synthetic examples) or generalization (e.g. solved using existing LLMs rather than a specialized net).