TLDR: GPT-4 succeeds at 15 problems from Gary Marcus that exposed failures of GPT-3.
I enjoyed reading the ACX post "My Bet: AI Size Solves Flubs" last year. Here are some excerpts:
Here’s the basic structure of an AI hype cycle:
- Someone releases a new AI and demonstrates it doing various amazing things.
- Somebody else (usually Gary Marcus) demonstrates that the AI also fails terribly at certain trivial tasks. This person argues that this shows that those tasks require true intelligence, whereas the AI is just clever pattern-matching.
- A few months or years later, someone makes a bigger clever pattern-matcher, which does the tasks that supposedly require true intelligence just fine.
- The it’s-not-true-intelligence objectors find other, slightly less trivial tasks that the new bigger AI still fails horribly at, then argue that surely these are the tasks that require true intelligence and that mere clever pattern-matchers will never complete.
- Rinse and repeat.
...
Marcus vs. GPT, Round 1
To give an example: in January 2020, Gary Marcus wrote a great post, GPT-2 And The Nature Of Intelligence, demonstrating a bunch of easy problems that GPT-2 failed on:
I’m quoting most of them below; you can find the rest at the link.
I asked GPT-4 to answer all the questions from the ACX post (note that this does not include all of Marcus's prompts; I only realized this after running the experiment). GPT-4 answered all of them correctly, and you can read the responses in this doc.
Note that before asking the questions, I gave GPT-4 a short description of what I wanted it to do: "Complete the following prompts in 50 words or less. Short, concise answers are better. Are you ready?" (This was mostly in the interest of speed since GPT-4 is pretty slow right now; I assume it would still succeed without the prompt.)
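If you want to reproduce the setup programmatically, here's a minimal sketch using the OpenAI Python client. The client call, model name, and placeholder prompts are illustrative assumptions (substitute the actual prompts from the ACX post and however you access GPT-4); only the instruction text is the exact one I used. The sketch also sends each prompt as a fresh conversation with the instruction as a system message, rather than as one long chat, which is a simplification.

```python
# Minimal sketch: run a list of prompts past GPT-4 with the same instruction text.
# Assumes the `openai` package (v1+) is installed and OPENAI_API_KEY is set.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Complete the following prompts in 50 words or less. "
    "Short, concise answers are better. Are you ready?"
)

# Illustrative placeholders -- substitute the prompts quoted in the ACX post.
prompts = [
    "Example prompt 1 from the ACX post goes here.",
    "Example prompt 2 from the ACX post goes here.",
]

for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
    )
    print(response.choices[0].message.content)
```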
More quotes from ACX:
Marcus vs. GPT, Round 2
Eight months later, GPT-3 came out, solving many of the issues Marcus had noticed in GPT-2. He still wasn’t impressed. In fact, he was so unimpressed he co-wrote another article, this time in MIT Technology Review: GPT-3, Bloviator: OpenAI’s language generator has no idea what it’s talking about:
...
Let’s - once again - go through a representative sample of Marcus’ concerns about this new GPT version:
GPT-4 also gave correct responses to these prompts (see the responses in this doc).
I recently listened to Gary Marcus speak with Stuart Russell on the Sam Harris podcast (episode 312, "The Trouble With AI," released on March 7th, 2023). Gary and Stuart seem to believe that current machine learning techniques are insufficient for reaching AGI, and point to the recent adversarial attacks on KataGo as one example. Given this position, I would like Gary Marcus to come up with a new set of prompts that (a) make GPT-4 look dumb and (b) mostly continue to work for GPT-5.
I believe that Marcus's point is that there are classes of problems that tend to be hard for LLMs (biological reasoning, physical reasoning, social reasoning, practical reasoning, object and individual tracking, non sequiturs). The argument is that problems in these classes will continue to be hard. [1]
But I think there's a larger issue. A lot of the discussion involves hostility toward a given critic of AI "moving the goal posts". As described: Model X(1) is introduced, a critic notices limitation L(1), Model X(2) addresses it, the critic says they're unconvinced and notes limitation L(2), and so on. Critics of these critics say this approach is unfair, a bad argument, etc.
However, what the "moving the goal posts" objection misses, in my opinion, is the context of the claim being made when someone says X(n) is generally intelligent. This claim isn't about giving the creator of a model credit or an award. The claim is about whether a thing has a flexibility akin to that of a human being (especially the flexible, robust goal-seeking ability of a human, an ability that could make a thing dangerous), and we don't actually have a clear, exact formulation of what the flexible intelligence of a human consists of. The Turing Test might not be the best AGI test, but it's posed in an open-ended fashion because there's no codified set of "prove you're like a human" questions.
Which is to say, Gary Marcus aside, if models keep advancing and people keep finding new capacities that each model lacks, it will be perfectly reasonable to describe the situation as "it's not AGI yet", as long as those capacities are clearly significant capacities of human intelligence. There wouldn't even need to be a set pattern to the capacities critics cite. Again, it's not about argument fairness, etc.; it's that this sort of thing is all we have, for now, as a test of AGI.
[1] https://garymarcus.substack.com/p/what-does-it-mean-when-an-ai-fails
Yeah, this is the part that seems increasingly implausible to me. If there is a "class of problems that tend to be hard ... [and] will continue to be hard," then someone should be able to build a benchmark that models consistently struggle with over time.
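Concretely, here's a minimal sketch of what such a benchmark harness could look like; the prompt, the pass/fail check, and the stand-in model are all placeholders, not a real benchmark:

```python
# Sketch of a fixed "hard for LLMs" benchmark: a frozen set of prompts with
# simple pass/fail checks, scored against each new model as it comes out so
# the failure rate can be tracked over time. Everything here is a placeholder.
from typing import Callable

# Each item: (prompt, predicate returning True if the answer is acceptable).
BENCHMARK: list[tuple[str, Callable[[str], bool]]] = [
    ("You poured a teaspoon of grape juice into your glass of cranberry juice. "
     "It looks and smells fine, and you're thirsty. What do you do?",
     lambda ans: "drink" in ans.lower()),
    # ... more physical/social/practical-reasoning items ...
]

def score(ask_model: Callable[[str], str]) -> float:
    """Fraction of benchmark items a model passes, given a prompt->answer function."""
    passed = sum(check(ask_model(prompt)) for prompt, check in BENCHMARK)
    return passed / len(BENCHMARK)

# Trivial stand-in "model" so the sketch runs; swap in a real API call per model.
def dummy_model(prompt: str) -> str:
    return "I would just drink it."

if __name__ == "__main__":
    print(f"dummy model: {score(dummy_model):.0%}")
```

The point of freezing the item set is that "continues to be hard" becomes a measurable claim rather than a moving target.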