TL;DR: GPT-4 succeeds at 15 problems from Gary Marcus that exposed failures of GPT-3.
I enjoyed reading the ACX post "My Bet: AI Size Solves Flubs" last year. Here are some excerpts:
Here’s the basic structure of an AI hype cycle:
- Someone releases a new AI and demonstrates it doing various amazing things.
- Somebody else (usually Gary Marcus) demonstrates that the AI also fails terribly at certain trivial tasks. This person argues that this shows that those tasks require true intelligence, whereas the AI is just clever pattern-matching.
- A few months or years later, someone makes a bigger clever pattern-matcher, which does the tasks that supposedly require true intelligence just fine.
- The it’s-not-true-intelligence objectors find other, slightly less trivial tasks that the new bigger AI still fails horribly at, then argue that surely these are the tasks that require true intelligence and that mere clever pattern-matchers will never complete.
- Rinse and repeat.
...
Marcus vs. GPT, Round 1
To give an example: in January 2020, Gary Marcus wrote a great post, GPT-2 And The Nature Of Intelligence, demonstrating a bunch of easy problems that GPT-2 failed on:
I’m quoting most of them below; you can find the rest at the link.
I asked GPT-4 to answer all the questions from the ACX post (note that this does not include all of Marcus's prompts, something I only realized after running the experiment). GPT-4 answered all the questions correctly; you can read the responses in this doc.
Note that before asking the questions, I gave GPT-4 a short description of what I wanted it to do: "Complete the following prompts in 50 words or less. Short, concise answers are better. Are you ready?" (This was mostly in the interest of speed since GPT-4 is pretty slow right now; I assume it would still succeed without the prompt.)
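In case anyone wants to replicate this over the API rather than the chat interface, here's a minimal sketch. This is an assumption on my part: the post doesn't say how GPT-4 was accessed, and the `build_messages` helper and the use of the openai client are mine, not the author's.

```python
# Hypothetical replication sketch. The setup text below is quoted from the
# post; everything else (helper name, API usage) is an assumption.
SETUP = ("Complete the following prompts in 50 words or less. "
         "Short, concise answers are better. Are you ready?")

def build_messages(prompt):
    """Send the setup instruction first, then the actual Marcus prompt."""
    return [
        {"role": "user", "content": SETUP},
        {"role": "user", "content": prompt},
    ]

# With the openai package installed and an API key configured, each prompt
# could then be sent roughly as:
#   client = openai.OpenAI()
#   reply = client.chat.completions.create(model="gpt-4",
#                                          messages=build_messages(prompt))
```

(Sending the setup text as a separate first message mirrors how the author described prepping GPT-4 before the questions.)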
More quotes from ACX:
Marcus vs. GPT, Round 2
Eight months later, GPT-3 came out, solving many of the issues Marcus had noticed in GPT-2. He still wasn’t impressed. In fact, he was so unimpressed he co-wrote another article, this time in MIT Technology Review: GPT-3, Bloviator: OpenAI’s language generator has no idea what it’s talking about:
...
Let’s - once again - go through a representative sample of Marcus’ concerns about this new GPT version:
GPT-4 also gave correct responses to these prompts (see the responses in this doc).
I recently listened to Gary Marcus speak with Stuart Russell on the Sam Harris podcast (episode 312, "The Trouble With AI," released on March 7th, 2023). Gary and Stuart seem to believe that current machine learning techniques are insufficient for reaching AGI, and point to the recent adversarial attacks on KataGo as one example. Given this position, I would like Gary Marcus to come up with a new set of prompts that (a) make GPT-4 look dumb and (b) mostly continue to work for GPT-5.
Noticed this as well. I tried to get it to solve some integration problems, and it could try different substitutions and things, but if they did not work, it kind of gave up and said to numerically integrate it. Also, it would make small errors, and you would have to point them out, though it was happy to fix them.
I'm thinking that most documents it reads tend to omit the whole search/backtrack phase of thinking. Even work that is posted online that shows all the steps, usually filters out all the false starts. It's like how most famous mathematicians were known for throwing away their scratchwork, leaving everyone to wonder how exactly they formed their thought processes...